Adding Support for different CSV Encodings in Import_Scripts/Populate_Metadata.py #198

Open
wants to merge 16 commits into base: develop
Conversation

JulianHn

This draft adds support for CSV encodings other than utf-8 to populate_metadata.py.
This is relevant e.g. when the Populate_Metadata.py script distributed with OMERO is used with a CSV file exported from Excel with default settings, which are cp1252 for US and iso-8859-1 for EU system locales.

It requires ome/omero-py#325 to be merged in omero-py, which adds support for this in the imported omero.util.populate_roi module. Once that is merged, support for different encodings can be added by simply introducing a new string input field in the script_params that contains the file encoding and defaults to utf-8 (i.e. legacy behaviour).
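
For illustration, such a field could look roughly like this (the parameter name, grouping number and variable name are placeholders, not necessarily what this PR uses):

import omero.scripts as scripts

# Hypothetical encoding parameter; defaulting to "utf-8" preserves the
# legacy behaviour when the user does not change it.
encoding_param = scripts.String(
    "File_Encoding", optional=False, grouping="5",
    description="Character encoding of the attached CSV file",
    default="utf-8")

The parameter would then be passed to scripts.client() alongside the existing script parameters.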

Tests for cp1252 and iso-8859-1, the default encodings for US and EU Excel CSV exports, have been added as well.

// Julian

…quires omero-py with support for different file encodings in omero.utils.populate_roi.DownloadingOriginalFileProvider
@will-moore
Member

Thanks for this contribution.
I'm just wondering if there's a reliable way to make this easier for users, so that they don't have to know which encoding to choose (and maybe to give them a nice error message if they need to choose a different one).
I've not tried this yet, so I haven't looked at where it fails if you use the wrong encoding.
But if the script could check whether the correct encoding was used immediately after reading the data, and then give the user a message (instead of failing later), that would be nice.
The default behaviour (if a user doesn't pick an encoding) could even be to try a handful of the most common encodings in order of prevalence (e.g. utf-8, latin-1, cp1252...) and use the first one that "works"?

Does this sound feasible?
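
For what it's worth, a rough sketch of that fallback idea (the helper name and the candidate list/order are only assumptions, not part of this PR):

def decode_with_fallback(raw_bytes, candidates=("utf-8", "latin-1", "cp1252")):
    """Return (text, encoding) for the first candidate that decodes cleanly."""
    for enc in candidates:
        try:
            return raw_bytes.decode(enc), enc
        except UnicodeDecodeError:
            continue  # try the next candidate
    raise ValueError("None of the candidate encodings could decode the data")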

@JulianHn
Author

Hey @will-moore,

There are certainly options to do that, e.g. by importing chardet, which has a function for guessing the encoding. This is obviously not 100% guaranteed, but in my experience it works well for the most common encodings.

The problem with trying encodings is that, with the exception of .decode("utf-8"), I can't get the other encodings to reliably throw errors when the wrong encoding is used. They will just return nonsensical strings.
If this sounds reasonable, I can certainly add this to the modified DownloadingOriginalFileProvider and amend #325 over in omero-py.

@will-moore
Member

I wasn't aware of chardet - looks handy. I'm not sure we'd want to make it a dependency of omero-py and force all users to install it (although it doesn't seem to have any dependencies itself)? cc @joshmoore @chris-allan
If we don't want to make it required, then we could put the import in a try/except.

One option for users to choose "auto-detect" encoding would be to use encoding=None in:
def get_original_file_data(self, original_file, encoding="utf-8"):
and then, if chardet hasn't been imported, raise an Exception?
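
Something roughly along these lines, perhaps (a sketch only, not the actual omero-py code; _download_to_bytes is a made-up stand-in for the existing download logic):

try:
    import chardet
except ImportError:
    chardet = None  # optional dependency

def get_original_file_data(self, original_file, encoding="utf-8"):
    raw = self._download_to_bytes(original_file)  # made-up helper returning bytes
    if encoding is None:  # caller asked for auto-detection
        if chardet is None:
            raise Exception(
                "encoding=None requires the optional chardet package")
        encoding = chardet.detect(raw)["encoding"]
    return raw.decode(encoding)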

@JulianHn
Author

JulianHn commented Mar 28, 2022

We could also check in populate_metadata.py whether the module is available by try/except importing chardet and modifying the description displayed when opening the script in OMERO.web/Insight, as is already done for the outdated metadata plugin detection. I guess that would make it even easier for users who are not familiar with reading the stderr/stdout outputs?

So it would be something like:

Check if chardet is available. If yes, set the encoding field to "None"/"Detect" etc.
If not, set the encoding field to "utf-8" (to ensure that the current behaviour is replicated) and display a warning in the description that, for non-utf-8 encoded files, an encoding has to be explicitly specified or chardet needs to be installed?
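
Roughly along these lines (the names here are just placeholders):

try:
    import chardet
    ENCODING_DEFAULT = "Detect"  # auto-detection offered as the default choice
    ENCODING_WARNING = ""
except ImportError:
    chardet = None
    ENCODING_DEFAULT = "utf-8"   # replicate the current behaviour
    ENCODING_WARNING = (
        "\nWARNING: chardet is not installed. Non-utf-8 encoded CSV files "
        "require an explicitly specified encoding.")

# ENCODING_WARNING would then be appended to the script description, as is
# already done for the outdated omero-metadata plugin check.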

@will-moore
Member

That sounds like a good solution to me. But best to hear any vetoes from Josh or Chris before committing (and maybe Seb, who is away till Thursday).

@@ -120,7 +120,7 @@ def populate_metadata(client, conn, script_params):
     original_file = get_original_file(
         conn, data_type, object_id, file_ann_id)
     provider = DownloadingOriginalFileProvider(conn)
-    data_for_preprocessing = provider.get_original_file_data(original_file)
+    data_for_preprocessing = provider.get_original_file_data(original_file, encoding=encoding)
Member
Presumably this line fails if you use the wrong encoding? Or only if you use utf-8?
A try/except that returns a useful message (and/or prints it to stdout) could be helpful?
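
Something like this, maybe (just a sketch; the message text and the early return are placeholders):

from omero.rtypes import rstring

try:
    data_for_preprocessing = provider.get_original_file_data(
        original_file, encoding=encoding)
except UnicodeDecodeError as exc:
    message = ("Could not decode the CSV file with encoding '%s': %s. "
               "Please re-run the script with the correct encoding."
               % (encoding, exc))
    client.setOutput("Message", rstring(message))
    return message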

Author
Unfortunately no. I discovered this issue because one of our clients was using German umlauts in their iso-8859-1 encoded CSV, which indeed raises a UnicodeDecodeError at this position. Unfortunately the same does not hold the other way around, i.e. when specifying 'iso-8859-1' for a UTF-8 encoded CSV. This will not raise an error, but the string is nonsensical. I presume this can lead to failures further down in the script (e.g. when the image name is not correct due to the mis-read string), but it could also just lead to nonsensical annotations.
I'm not sure how one would go about catching this behaviour in general.

You can see the effect for yourself with this test code:

test_str = 'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZäöüÄÖÜ'

test_str.encode('latin-1').decode('utf-8')
> Traceback (most recent call last):
>  File "<stdin>", line 1, in <module>
> UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe4 in position 52: invalid continuation byte

test_str.encode('utf-8').decode('latin-1')
> 'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZäöüÃ\x84Ã\x96Ã\x9c'

Member
OK, so it's not guaranteed to fail, but if it does raise a UnicodeDecodeError, that could be worth catching?

Author
Sure. I will try to find ways to break the other encodings as well and check what kinds of errors they give. It might not be 100%, but better to catch some of the errors than none.

@chris-allan
Member

Detecting encodings is notoriously error prone and time consuming.[1] I'm not averse to adding chardet as a dependency, but it has to be used intelligently and sparingly in order not to cause more problems than doing no detection at all. Selecting an incorrect encoding and inserting subtly corrupt data into OMERO is worse than failing spectacularly early. At a minimum I think an implementation should not default to using chardet, but rather add it as an option and select a reasonable number of rows, possibly configurable, to apply it to rather than the whole file.

Footnotes

  1. https://chardet.readthedocs.io/en/latest/faq.html

@JulianHn
Author

Hey @chris-allan ,

I agree with using it sparingly, and probably even making it optional in the first place, as well as selecting a sensible number of rows.

The problem with "Selecting an incorrect encoding and inserting subtly corrupt data into OMERO is worse than failing spectacularly early" is that there is, AFAICS, no way to make imports with a wrong encoding fail spectacularly and reliably. UTF-8 decoding might fail if by chance a byte combination not in the codepage is used; decoding with latin-1 or cp1252 will always "work" (but might be nonsense), since they do not throw errors.
So the problem of importing subtly corrupt data into OMERO already exists anyway, and guessing the encoding might at least be better than the current situation.
An option for letting those imports fail at least in some cases might be to look for bytes that are not in the codepage (e.g. 7F-9F in latin-1) or to check for the occurrence of some of the control bytes (00 to 1F) that should not appear in "normal" CSV files. But of course this also won't catch all cases.
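
For illustration, such a check could look something like this (the function name and the exact byte ranges treated as suspicious are just an assumption):

ALLOWED_CONTROL_BYTES = {0x09, 0x0A, 0x0D}  # tab, LF and CR are fine in a CSV

def has_suspicious_latin1_bytes(raw_bytes):
    """Flag C0 control bytes (other than tab/LF/CR) and the 7F-9F range,
    which should not appear in a "normal" latin-1 encoded CSV."""
    return any(
        (b < 0x20 and b not in ALLOWED_CONTROL_BYTES) or 0x7F <= b <= 0x9F
        for b in raw_bytes)

Note that cp1252 maps most of 80-9F to printable characters (e.g. the euro sign and curly quotes), so the same check would give false positives for cp1252 content.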

@chris-allan
Member

So the problem of importing subtly corrupt data into OMERO already exists anyway, and guessing the encoding might at least be better than the current situation.

Definitely not disputing that. Guessing encodings and having any assumed default is fraught with all sorts of problems. My thinking has always been that defaulting to UTF-8 is going to fail spectacularly in the most incorrect scenarios. Whether those scenarios overlap with your use cases and make sense to you is a dice roll.

There are so many crazy use cases we've seen dealing with delimited text input containing no authoritative statement of encoding, and they don't just differ because of locale. Excel on Windows and Excel on macOS can behave very, very differently in how they handle export to CSV/TSV, and don't get me started on byte order marks,[1] line endings or quotation.

What I very much do not want to do here is create a scenario where we normalize encoding autodetection for users. Especially if it has a high likelihood of "fallback" to latin-1 / ISO-8859-1 or CP-1252 because the first n rows of the CSV happen not to contain a μ but a following row does. That would be a disaster for us. I'm not experienced enough with chardet to comment on the likelihood of such a scenario. It also cannot create performance problems for analytical use cases; we regularly deal with CSV input containing millions of rows, as well as rows that contain several kilobytes of text.

Footnotes

  1. https://docs.microsoft.com/en-us/windows/win32/intl/using-byte-order-marks

@chris-allan
Member

A 60-second attempt at testing chardet with an example of the type of scientific input we all regularly see:

>>> import chardet
>>> a = 'nucleus is 54μm in diameter'.encode('utf-8')
>>> chardet.detect(a)
{'encoding': 'ISO-8859-1', 'language': '', 'confidence': 0.73}
>>> a.decode('iso-8859-1')
'nucleus is 54Î¼m in diameter'
>>> a.decode('utf-8')
'nucleus is 54μm in diameter'

That's IMHO just simply an unacceptable outcome. Could certainly be my naïve use or maybe chardet is just not the right solution for scientific data.

@JulianHn
Author

That's IMHO just simply an unacceptable outcome. Could certainly be my naïve use or maybe chardet is just not the right solution for scientific data.

Ugh ... I agree. That is really unacceptable. Usually chardet gets better the more data it gets (since it's based on some language heuristics), but especially for more "technical" metadata that does not contain significant "natural language" parts, it is probably not guaranteed to find much to work with.

Okay, so to summarize my understanding of your points and the testing results:
Supporting multiple encodings is good, but it is probably better to force the user to specify the encoding (while keeping utf-8 as the default). Otherwise we either run into performance problems by having to scan millions of rows (which indeed takes quite a while with chardet in my experience) and probably still can't ensure correct guessing in >95% of the cases, or we risk even more problems with "corrupted" data.

If we can agree on this, I would just add some error handling for the UnicodeDecodeError, as well as at least checking latin-1 for bytes not in the codepage, to let the script fail with some nice errors.

@chris-allan
Member

Supporting multiple encodings is good, but it is probably better to force the user to specify the encoding (while keeping utf-8 as the default). Otherwise we either run into performance problems by having to scan millions of rows (which indeed takes quite a while with chardet in my experience) and probably still can't ensure correct guessing in >95% of the cases, or we risk even more problems with "corrupted" data.

👍

JulianHn added 5 commits June 2, 2022 14:07
…ovide clear error message and exit the function early
…chine. All test cases should either import without error or raise the correct error message
…fering from utf-8.

Refactored the creation of the scripts.client to allow for dynamic display of the encoding
field only if support is available.
Also: Switched free string input for encodings over to a list input based on the encodings available on the server.
checks if get_original_file_data has the encoding argument.
Checking via import of DecodingError is no longer possible since the custom
class was removed, and it was in any case not direct proof that get_original_file_data
supports the encoding argument. Also changed the respective error catch and refactored
the encoding detection code.

Change Error Catch to UnicodeDecodeError

Fixed variable
@JulianHn
Author

JulianHn commented Jun 3, 2022

So ... this got kind of buried in my to-do list, and we needed to figure out the details of open-source contribution licensing at the institute, but after a long delay here, finally, is the script with support for different CSV encodings.

As discussed, we require the user to specify the encoding and do not rely on any detection mechanisms.
The script automatically queries the omero.util.populate_roi.DownloadingOriginalFileProvider.get_original_file_data() function for encoding support. If there is no support, the script will look as before, with a message added as is already done for the omero-metadata plugin check.
If support is available, an additional field is added to the script that allows the user to select an encoding, while still defaulting to utf-8. The list of encodings is automatically queried from the encodings installed on the server.
I've also added a try/except around the respective part of the code, which gives the user a hopefully clearer error message that might help them find the problem more quickly. This will still not catch all problems (e.g. latin-1 is designed in a way that it will always decode without an error, even if there are symbols not in the codepage), but this is a design choice and AFAICS not solvable on our side.
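
In outline, the support check can be done by inspecting the provider's signature, and the encoding list can be taken from the codecs shipped with the server's Python (a simplified sketch, not the exact code in this PR):

import encodings.aliases
import inspect

from omero.util.populate_roi import DownloadingOriginalFileProvider

def encoding_supported():
    """True if get_original_file_data() accepts an 'encoding' argument."""
    sig = inspect.signature(
        DownloadingOriginalFileProvider.get_original_file_data)
    return "encoding" in sig.parameters

# One way to enumerate the codecs available on the server:
available_encodings = sorted(set(encodings.aliases.aliases.values()))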

Furthermore, the tests now check all encodings and expect either a successful import or at least a clean exit. I could not figure out how to set up a correct testing environment for the integration tests, so this code has only been tested manually, not with pytest.

Accompanying changes have also been made in ome/omero-py#325.

// Julian

@JulianHn JulianHn marked this pull request as ready for review June 3, 2022 14:15
@JulianHn
Author

I'm afraid I cannot figure out from the GitHub Actions log why the checks are failing. Any ideas what needs fixing?

@will-moore
Member

@JulianHn I'm afraid that the flake8 checks somehow got missed when #193 was merged (adding a bunch of new scripts) and they are now failing in other PRs.

I think they are all fixed in #195 which is awaiting testing/merge. I could fix them again in a separate PR: it would just mean subsequently fixing all the merge conflicts with that branch!

@JulianHn
Author

@will-moore Ah ... that explains it, thanks for the heads-up. No worries about fixing them separately; I just did not assume the action would outright fail because of flake8 warnings and was confused about what was causing the failure.

@will-moore
Member

Now that #195 is merged, if you can merge the origin/develop branch into this one, that should help get the build green.
