Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Resolves issues with UTF-8 #906

Merged
merged 5 commits into from
Oct 29, 2017
Merged

Resolves issues with UTF-8 #906

merged 5 commits into from
Oct 29, 2017

Conversation

consideRatio
Copy link
Contributor

Issue #902 is resolved by this PR, but this PR isn't very thorough yet, the fix of clarifying what encoding to use when reading a file on windows (10) might need to be applied at more places in the nbgrader project.

@jhamrick correctly guessed that this change would resolve the issue, I just tried it.

@jhamrick
Copy link
Member

Oh good, I'm really glad this resolved the issue!

might need to be applied at more places in the nbgrader project

I just did a quick grep through the source code there aren't really any other places in the main code base, but there are a few places in the tests that could probably be fixed. Here is a patch that fixes all the places I think might be susceptible to this issue (it also modifies the header notebook used in the tests to include special characters, so hopefully we won't get a recurrence of this bug again):
changes.txt

You can apply it with git apply changes.txt and then commit the result.'

Thanks for working on this!

@jhamrick jhamrick added this to the 0.6.0 milestone Oct 29, 2017
@consideRatio
Copy link
Contributor Author

consideRatio commented Oct 29, 2017

I'll look into this test failure:

    def test_collect_timestamp_skip_older(self, course_dir, archive_dir):
        extracted_dir = join(archive_dir, "..", "extracted")
        submitted_dir = join(course_dir, "submitted")
    
        # submissions are sorted so a before b
        os.makedirs(join(archive_dir, 'ps1_hacker_a_2017-01-30-15-30-10'))
        with io.open(join(archive_dir, 'ps1_hacker_a_2017-01-30-15-30-10', 'problem1.ipynb'), mode='w', encoding='utf-8') as fh:
            fh.write('')

E TypeError: write() argument 1 must be unicode, not str

@jhamrick
Copy link
Member

Oops, sorry. I tested my patch on python 3 but not python 2, which is where it seems the test failures are coming from... I should have known better! Thanks for looking into it regardless.

@consideRatio
Copy link
Contributor Author

It seems Python 2's string literals are not Unicode literals, while Python 3's string literals are by default Unicode encoded with utf-8.

This test was actually passing on Windows 10. So, if nbgrader is run on Linux, with Python 27, then this is an issue it seems.

The fix is probably to include the following in one of the first two lines of some files, (perhaps only the files where string literals is written to files?). Perhaps someone with previous know-how of encoding issues can suggest a clear line of action?

# -*- coding: utf-8 -*-

Reference

Python 2: https://docs.python.org/2/howto/unicode.html#unicode-literals-in-python-source-code
Python 3: https://docs.python.org/3/howto/unicode.html#python-s-unicode-support

@consideRatio
Copy link
Contributor Author

consideRatio commented Oct 29, 2017

What I've learned

To analyze the error...

Use Linux and Python 2 and run the following test.

py.test nbgrader/tests/apps/test_nbgrader_zip_collect.py


Difference between Unicode and utf-8

"Unicode only define code points, that is, a number which represents a character. How you store these code points in memory depends of the encoding that you are using. UTF-8 is one way of encoding Unicode characters, among many others." - https://stackoverflow.com/a/643719/2220152


What # coding: utf-8 and # -*- coding: utf-8 -*- is about

Ideally, you’d want to be able to write literals in your language’s natural encoding. You could then edit Python source code with your favorite editor which would display the accented characters naturally, and have the right characters used at runtime.

Python supports writing Unicode literals in any encoding, but you have to declare the encoding being used. This is done by including a special comment as either the first or second line of the source file:

# coding: utf-8

"The -*- symbols indicate to Emacs that the comment is special; they have no significance to Python but are a convention." - https://docs.python.org/2/howto/unicode.html#unicode-literals-in-python-source-code


nbconvert project

https://github.com/jupyter/nbconvert/blob/68a3925efef5c304c2c8682ad05cac9560c4173a/nbconvert/writers/files.py

nbconvert seem to have run into similar issues. They have made a stdout- / stdin-helper they made. They declare two functions that returns a writer or reader stream that will always read and write Unicode using utf-8 encoding.

They have also made a {FilesWriter class](https://github.com/jupyter/nbconvert/blob/68a3925efef5c304c2c8682ad05cac9560c4173a/nbconvert/writers/files.py). It is realizing what kind of string is about to be written, and adjusts. Look for example at the following parts:

from ipython_genutils.py3compat import unicode_type

# ...

# Write conversion results.
self.log.info("Writing %i bytes to %s", len(output), dest)
if isinstance(output, unicode_type):
    with io.open(dest, 'w', encoding='utf-8') as f:
        f.write(output)
else:
    with io.open(dest, 'wb') as f:
        f.write(output)

What needs to be done?

Without a full grasp of either nbgrader, nbconvert or unicode details in various environments, I'm reluctant to adjust the code-base in nbgrader further without some input from someone. Should we let pretty much all read and write operations be made through a nbconvert io-helper?

I noticed I could quickfix test failures if I...

# did this
io.write(u'')
# instead of
#io.write('')

... but that didn't feel like the way to go, I don't fully understand the consequences either. So, perhaps we should have a FilesWriter solution and use it to read and write data systematically throughout the project?

@jhamrick
Copy link
Member

Ugh, yeah, dealing with unicode support between python 2 and 3 can be a real pain---thanks for looking into this so thoroughly!

I think using the u prefix like you've suggested is the way to go here. I found http://six.readthedocs.io/#six.u (six is a library for python 2/3 compatibility), but the documentation there says that function shouldn't be necessary as the u prefix exists in python 3.3+, which makes me think it is actually probably the correct solution. But it might not always be the solution if there are other issues with other tests.

I am currently working on finishing up another PR but I'll look into this some more when I'm done with that.

@consideRatio
Copy link
Contributor Author

@jhamrick okay! I went with it (using the u prefix) and saw where it got me.

  • I searched for all unicode string literals (u'...') and added # coding: utf-8 too all those files-
  • Where you updated open to io.open(..., encoding="utf-8") statements, I updated writing of normal literals to unicode literals.
  • I found situations where unicode string literals had been used but not systematically, and made the use of them systematic.
  • I ran all tests successfully on my linux with python2 (578 passed, 3 skipped, 0 failed)

Pweh! Closer... perhaps... at least :)

Copy link
Member

@jhamrick jhamrick left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, this looks great! I will merge once the tests pass.

@@ -1,3 +1,5 @@
# coding: utf-8
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think these lines aren't strictly necessary since this will only make a difference if the special characters are in the python source code itself, and I don't think this is the case anywhere in nbgrader. But I don't think it hurts to have these lines, so it's fine for them to be there.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, I was uncertain about this myself, thanks for the feedback!

@jhamrick jhamrick changed the title Resolves utf-8 issue (#902) Resolves issues with UTF-8 Oct 29, 2017
@jhamrick
Copy link
Member

Tests are passing! Thanks for putting in so much work for this 🎉

@jhamrick jhamrick merged commit 379ff9e into jupyter:master Oct 29, 2017
@consideRatio
Copy link
Contributor Author

@jhamrick Thanks for the encouragement and guidance!

I really appreciate Jupyter and the tools surrounding it such as the nbgrader project - so thanks for your work as well :D

@jhamrick
Copy link
Member

Thanks! 😄

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants