Adding Support for different CSV Encodings in Import_Scripts/Populate_Metadata.py #198
base: develop
Conversation
…quires omero-py with support for different file encodings in omero.utils.populate_roi.DownloadingOriginalFileProvider
Thanks for this contribution. Does this sound feasible?
Hey @will-moore, there are certainly options to do that by importing additional libraries. The problem with trying encodings is that, with the exception of .decode("utf-8"), I can't get the other encodings to reliably throw errors when the wrong encoding is used. They will just return nonsensical strings.
I wasn't aware of that. One option for users to choose an "auto-detect" encoding would be to use such a library.
We could also check in the script whether the installed omero-py supports the new encoding argument. So it would be something like: check whether the argument is supported, and only offer the encoding option if it is.
That sounds like a good solution to me. But best to hear any vetoes from Josh or Chris before committing (and maybe Seb, who is away till Thursday).
@@ -120,7 +120,7 @@ def populate_metadata(client, conn, script_params):
     original_file = get_original_file(
         conn, data_type, object_id, file_ann_id)
     provider = DownloadingOriginalFileProvider(conn)
-    data_for_preprocessing = provider.get_original_file_data(original_file)
+    data_for_preprocessing = provider.get_original_file_data(original_file, encoding=encoding)
Presumably this line fails if you use the wrong encoding? Or only if you use utf-8? A try/except that returns a useful message (and/or prints it to stdout) could be helpful?
Unfortunately no. I discovered this issue because one of our clients was using German umlauts in their iso-8859-1 encoded CSV, which indeed raises a UnicodeDecodeError at this position. Unfortunately the same does not hold the other way around, i.e. specifying 'iso-8859-1' for a UTF-8 encoded CSV. That will not raise an error, but the resulting string is nonsensical. I presume this can lead to failures further down the script (e.g. when an image name is mis-read and no longer matches), but it could also just lead to nonsensical annotations.
I'm not sure how one would go about catching this behaviour in general.
You can see the effect for yourself with this test code:
test_str = 'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZäöüÄÖÜ'
# Decoding latin-1 bytes as utf-8 fails loudly:
test_str.encode('latin-1').decode('utf-8')
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
> UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe4 in position 52: invalid continuation byte
# Decoding utf-8 bytes as latin-1 silently produces mojibake instead of raising:
test_str.encode('utf-8').decode('latin-1')
> 'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZÃ¤Ã¶Ã¼Ã\x84Ã\x96Ã\x9c'
OK, so it's not guaranteed to fail, but if it does raise a UnicodeDecodeError, that could be worth catching?
Sure. I will try to find ways to break the other encodings as well and check what kind of errors they give. It might not be 100%, but catching some of the errors is better than none.
Detecting encodings is notoriously error prone and time consuming. I'm not averse to adding it as an option, though.
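For illustration only (not what this PR implements), a minimal sketch of such sparing, opt-in detection, assuming the third-party chardet package; the sample_size limit and confidence threshold here are arbitrary placeholders:

import chardet

def guess_encoding(raw_bytes, sample_size=64 * 1024):
    # Only look at the first sample_size bytes to keep detection cheap
    result = chardet.detect(raw_bytes[:sample_size])
    # result looks like {'encoding': 'ISO-8859-1', 'confidence': 0.73, 'language': ''}
    if result["encoding"] is None or result["confidence"] < 0.9:
        raise ValueError("Could not detect the CSV encoding with sufficient confidence")
    return result["encoding"]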
Hey @chris-allan, I agree with using it sparingly and probably even making it optional in the first place, as well as selecting a sensible number of rows. The problem with "Selecting an incorrect encoding and inserting subtly corrupt data into OMERO is worse than failing spectacularly early" is that there is, AFAICS, no way to reliably make wrong-encoding imports fail spectacularly.
Definitely not disputing that. Guessing encodings and having any assumed default is fraught with all sorts of problems. My thinking has always been that defaulting to UTF-8 is going to fail spectacularly in the most incorrect scenarios. Whether those scenarios overlap with your use cases and make sense to you is a dice roll. We've seen so many crazy use cases dealing with delimited text input containing no authoritative statement of encoding, and they don't just differ because of locale. Excel on Windows and Excel on macOS can behave very, very differently in how they handle export to CSV/TSV, and don't get me started on byte order marks, line endings or quotation. What I very much do not want to do here is create a scenario where we normalize encoding autodetection for users, especially if it has a high likelihood of a "fallback" to some guessed encoding.
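As a hedged side note on the byte order mark point (an illustration, not from the thread): Excel's "CSV UTF-8" export on Windows typically prepends a UTF-8 BOM, which plain utf-8 decoding leaves in the data while utf-8-sig strips it:

# b'\xef\xbb\xbf' is the UTF-8 byte order mark that Excel may prepend
raw = b'\xef\xbb\xbfName,Value\nimage1,42\n'
raw.decode('utf-8')      # '\ufeffName,Value\n...'  (BOM leaks into the first column header)
raw.decode('utf-8-sig')  # 'Name,Value\n...'        (BOM stripped)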
60 second attempt testing
That's IMHO just simply an unacceptable outcome. Could certainly be my naïve use, or maybe I'm missing something.
Ugh ... I agree. That is really unacceptable. Okay, so to summarize my understanding of your points and the testing results: we require the user to explicitly specify the encoding and do not rely on any detection mechanism. If we can agree to this, I would just add some error catching for the UnicodeDecodeError, as well as at least checking whether the installed omero-py supports the encoding argument.
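A rough sketch of what that error catching around the provider call could look like (assuming the script returns a message string on early exit; the exact wording and exit mechanism are placeholders, not the final implementation):

try:
    data_for_preprocessing = provider.get_original_file_data(
        original_file, encoding=encoding)
except UnicodeDecodeError as exc:
    # The chosen encoding cannot decode the file: report it clearly and exit early
    return ("Error: could not decode the CSV with encoding '%s' (%s). "
            "Please re-run the script with the correct encoding." % (encoding, exc))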
👍
…ovide clear error message and exit the function early
…chine. All test cases should either import without error or raise the correct error message
…fering from utf-8. Refactored the creation of the scripts.client to allow for dynamic display of the encoding field only if support is available. Also: Switched free string input for encodings over to a list input based on the encodings available on the server.
checks if get_original_file_data has the encoding argument. A check via import of DecodingError is not possible anymore, since the custom class was removed, and it is also not direct proof that get_original_file_data supports the encoding argument. Also changed the respective error catch and refactored the encoding detection code. Changed the error catch to UnicodeDecodeError. Fixed variable
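For context, checking for the encoding argument instead of importing the removed exception class could look roughly like this (a sketch of the idea, not necessarily the code in the commit; DownloadingOriginalFileProvider is assumed to be imported as in the script):

import inspect

def encoding_supported(provider_cls):
    # True if the installed omero-py's get_original_file_data accepts an 'encoding' argument
    params = inspect.signature(provider_cls.get_original_file_data).parameters
    return "encoding" in params

encoding_supported(DownloadingOriginalFileProvider) could then gate whether the encoding input is offered in the script UI at all.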
So ... as this got kind of buried in my to-do list and we needed to figure out the details of open-source contribution licensing at the institute, here, after a long delay, is finally the script with support for different CSV encodings. As discussed, we require the user to specify the encoding and do not rely on any detection mechanisms. Furthermore, the tests should now check all encodings and expect either a successful import or at least a clean exit. I cannot figure out how to set up a correct testing environment for the integration tests, so this code has only been tested manually, not using pytest. Accompanying changes have also been made in ome/omero-py#325. // Julian
…statements that would make the code harder to read.
… by writing strings to csv.
…py to check for clean exit
involving encodings that don't support the test strings
I'm afraid I cannot figure out from the GitHub Actions log why the checks are failing. Any ideas what needs fixing?
@JulianHn I'm afraid the failing checks are flake8 warnings. I think they are all fixed in #195, which is awaiting testing/merge. I could fix them again in a separate PR: it would just mean subsequently fixing all the merge conflicts with that branch!
@will-moore Ah ... That explains it, thanks for the heads-up. No worries about fixing them separately. I just did not expect that the action would outright fail because of flake8 warnings and was confused about what caused the failure.
Now that #195 is merged, if you can merge the origin/develop branch into this, that should help get the build green.
This draft will add support for CSV encodings differing from utf-8 to populate_metadata.py.
This is relevant, e.g., when using the Populate_Metadata.py script distributed with OMERO with a CSV file exported from Excel with default settings, which are cp1252 for US and iso-8859-1 for EU system locales.
It requires merging of ome/omero-py#325 in omero-py, to add support for this in the imported omero.utils.populate_roi module. If that gets merged, support for different encodings can happen by simply adding a new string input field to the script_params that will contain the file encoding and default to utf-8 (i.e. legacy behaviour). Tests for cp1252 and iso-8859-1, the default encodings for US and EU Excel CSV exports, have been added as well.
// Julian
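A rough sketch of what such an input field could look like in the scripts.client definition (the parameter name, grouping and description here are illustrative; per a later commit, the final version uses a list of server-supported encodings rather than a free string):

import omero.scripts as scripts

client = scripts.client(
    'Populate_Metadata.py',
    "Populates annotations from a CSV file.",
    # ... existing Data_Type / IDs / File_Annotation parameters ...
    scripts.String(
        "File_Encoding", optional=True, grouping="4", default="utf-8",
        description="Encoding of the CSV file, e.g. utf-8, cp1252 or iso-8859-1"),
)

script_params = client.getInputs(unwrap=True)
encoding = script_params.get("File_Encoding", "utf-8")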