Resolves issues with UTF-8 #906

consideRatio · 2017-10-28T22:55:04Z

Issue #902 is resolved by this PR, but this PR isn't very thorough yet, the fix of clarifying what encoding to use when reading a file on windows (10) might need to be applied at more places in the nbgrader project.

@jhamrick correctly guessed that this change would resolve the issue, I just tried it.

jhamrick · 2017-10-29T11:33:09Z

Oh good, I'm really glad this resolved the issue!

might need to be applied at more places in the nbgrader project

I just did a quick grep through the source code there aren't really any other places in the main code base, but there are a few places in the tests that could probably be fixed. Here is a patch that fixes all the places I think might be susceptible to this issue (it also modifies the header notebook used in the tests to include special characters, so hopefully we won't get a recurrence of this bug again):
changes.txt

You can apply it with git apply changes.txt and then commit the result.'

Thanks for working on this!

consideRatio · 2017-10-29T12:11:28Z

I'll look into this test failure:

    def test_collect_timestamp_skip_older(self, course_dir, archive_dir):
        extracted_dir = join(archive_dir, "..", "extracted")
        submitted_dir = join(course_dir, "submitted")
    
        # submissions are sorted so a before b
        os.makedirs(join(archive_dir, 'ps1_hacker_a_2017-01-30-15-30-10'))
        with io.open(join(archive_dir, 'ps1_hacker_a_2017-01-30-15-30-10', 'problem1.ipynb'), mode='w', encoding='utf-8') as fh:
            fh.write('')

E TypeError: write() argument 1 must be unicode, not str

jhamrick · 2017-10-29T12:19:01Z

Oops, sorry. I tested my patch on python 3 but not python 2, which is where it seems the test failures are coming from... I should have known better! Thanks for looking into it regardless.

consideRatio · 2017-10-29T12:51:39Z

It seems Python 2's string literals are not Unicode literals, while Python 3's string literals are by default Unicode encoded with utf-8.

This test was actually passing on Windows 10. So, if nbgrader is run on Linux, with Python 27, then this is an issue it seems.

The fix is probably to include the following in one of the first two lines of some files, (perhaps only the files where string literals is written to files?). Perhaps someone with previous know-how of encoding issues can suggest a clear line of action?

# -*- coding: utf-8 -*-

Reference

Python 2: https://docs.python.org/2/howto/unicode.html#unicode-literals-in-python-source-code
Python 3: https://docs.python.org/3/howto/unicode.html#python-s-unicode-support

consideRatio · 2017-10-29T14:36:52Z

What I've learned

To analyze the error...

Use Linux and Python 2 and run the following test.

py.test nbgrader/tests/apps/test_nbgrader_zip_collect.py

Difference between Unicode and utf-8

"Unicode only define code points, that is, a number which represents a character. How you store these code points in memory depends of the encoding that you are using. UTF-8 is one way of encoding Unicode characters, among many others." - https://stackoverflow.com/a/643719/2220152

What `# coding: utf-8` and `# -- coding: utf-8 --` is about

Ideally, you’d want to be able to write literals in your language’s natural encoding. You could then edit Python source code with your favorite editor which would display the accented characters naturally, and have the right characters used at runtime.

Python supports writing Unicode literals in any encoding, but you have to declare the encoding being used. This is done by including a special comment as either the first or second line of the source file:

# coding: utf-8

"The -*- symbols indicate to Emacs that the comment is special; they have no significance to Python but are a convention." - https://docs.python.org/2/howto/unicode.html#unicode-literals-in-python-source-code

nbconvert project

https://github.com/jupyter/nbconvert/blob/68a3925efef5c304c2c8682ad05cac9560c4173a/nbconvert/writers/files.py

nbconvert seem to have run into similar issues. They have made a stdout- / stdin-helper they made. They declare two functions that returns a writer or reader stream that will always read and write Unicode using utf-8 encoding.

They have also made a {FilesWriter class](https://github.com/jupyter/nbconvert/blob/68a3925efef5c304c2c8682ad05cac9560c4173a/nbconvert/writers/files.py). It is realizing what kind of string is about to be written, and adjusts. Look for example at the following parts:

from ipython_genutils.py3compat import unicode_type

# ...

# Write conversion results.
self.log.info("Writing %i bytes to %s", len(output), dest)
if isinstance(output, unicode_type):
    with io.open(dest, 'w', encoding='utf-8') as f:
        f.write(output)
else:
    with io.open(dest, 'wb') as f:
        f.write(output)

What needs to be done?

Without a full grasp of either nbgrader, nbconvert or unicode details in various environments, I'm reluctant to adjust the code-base in nbgrader further without some input from someone. Should we let pretty much all read and write operations be made through a nbconvert io-helper?

I noticed I could quickfix test failures if I...

# did this
io.write(u'')
# instead of
#io.write('')

... but that didn't feel like the way to go, I don't fully understand the consequences either. So, perhaps we should have a FilesWriter solution and use it to read and write data systematically throughout the project?

jhamrick · 2017-10-29T16:58:15Z

Ugh, yeah, dealing with unicode support between python 2 and 3 can be a real pain---thanks for looking into this so thoroughly!

I think using the u prefix like you've suggested is the way to go here. I found http://six.readthedocs.io/#six.u (six is a library for python 2/3 compatibility), but the documentation there says that function shouldn't be necessary as the u prefix exists in python 3.3+, which makes me think it is actually probably the correct solution. But it might not always be the solution if there are other issues with other tests.

I am currently working on finishing up another PR but I'll look into this some more when I'm done with that.

consideRatio · 2017-10-29T18:13:28Z

@jhamrick okay! I went with it (using the u prefix) and saw where it got me.

I searched for all unicode string literals (u'...') and added # coding: utf-8 too all those files-
Where you updated open to io.open(..., encoding="utf-8") statements, I updated writing of normal literals to unicode literals.
I found situations where unicode string literals had been used but not systematically, and made the use of them systematic.
I ran all tests successfully on my linux with python2 (578 passed, 3 skipped, 0 failed)

Pweh! Closer... perhaps... at least :)

jhamrick

Thanks, this looks great! I will merge once the tests pass.

jhamrick · 2017-10-29T19:09:03Z

nbgrader/apps/assignapp.py

@@ -1,3 +1,5 @@
+# coding: utf-8


I think these lines aren't strictly necessary since this will only make a difference if the special characters are in the python source code itself, and I don't think this is the case anywhere in nbgrader. But I don't think it hurts to have these lines, so it's fine for them to be there.

Ah, I was uncertain about this myself, thanks for the feedback!

jhamrick · 2017-10-29T19:18:12Z

Tests are passing! Thanks for putting in so much work for this 🎉

consideRatio · 2017-10-29T19:47:05Z

@jhamrick Thanks for the encouragement and guidance!

I really appreciate Jupyter and the tools surrounding it such as the nbgrader project - so thanks for your work as well :D

jhamrick · 2017-10-29T19:58:25Z

Thanks! 😄

resolved utf-8 issue (jupyter#902)

a364b1b

jhamrick added the bugfix label Oct 29, 2017

jhamrick added this to the 0.6.0 milestone Oct 29, 2017

consideRatio added 2 commits October 29, 2017 12:36

proper resolution of windows utf-8 issue (jupyter#902)

1c95739

forgot import statement

2534b8d

linux+py27 utf-8 encoding issues locally fixed?

966ff7b

unicode rework where u'' literals were found

59adf7b

jhamrick approved these changes Oct 29, 2017

View reviewed changes

jhamrick changed the title ~~Resolves utf-8 issue (#902)~~ Resolves issues with UTF-8 Oct 29, 2017

jhamrick merged commit 379ff9e into jupyter:master Oct 29, 2017

jhamrick mentioned this pull request Oct 29, 2017

The header file is not åäö-friendly #902

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Resolves issues with UTF-8 #906

Resolves issues with UTF-8 #906

consideRatio commented Oct 28, 2017

jhamrick commented Oct 29, 2017

consideRatio commented Oct 29, 2017 •

edited

Loading

jhamrick commented Oct 29, 2017

consideRatio commented Oct 29, 2017

consideRatio commented Oct 29, 2017 •

edited

Loading

jhamrick commented Oct 29, 2017

consideRatio commented Oct 29, 2017

jhamrick left a comment

jhamrick Oct 29, 2017

consideRatio Oct 29, 2017

jhamrick commented Oct 29, 2017

consideRatio commented Oct 29, 2017

jhamrick commented Oct 29, 2017

Resolves issues with UTF-8 #906

Resolves issues with UTF-8 #906

Conversation

consideRatio commented Oct 28, 2017

jhamrick commented Oct 29, 2017

consideRatio commented Oct 29, 2017 • edited Loading

jhamrick commented Oct 29, 2017

consideRatio commented Oct 29, 2017

Reference

consideRatio commented Oct 29, 2017 • edited Loading

What I've learned

To analyze the error...

Difference between Unicode and utf-8

What # coding: utf-8 and # -*- coding: utf-8 -*- is about

nbconvert project

What needs to be done?

jhamrick commented Oct 29, 2017

consideRatio commented Oct 29, 2017

jhamrick left a comment

Choose a reason for hiding this comment

jhamrick Oct 29, 2017

Choose a reason for hiding this comment

consideRatio Oct 29, 2017

Choose a reason for hiding this comment

jhamrick commented Oct 29, 2017

consideRatio commented Oct 29, 2017

jhamrick commented Oct 29, 2017

consideRatio commented Oct 29, 2017 •

edited

Loading

consideRatio commented Oct 29, 2017 •

edited

Loading

What `# coding: utf-8` and `# -- coding: utf-8 --` is about