Adding Support for different CSV Encodings in Import_Scripts/Populate_Metadata.py #198

Open
wants to merge 16 commits into base: develop
Conversation

JulianHn

This draft adds support for CSV encodings other than utf-8 to populate_metadata.py.
This is relevant e.g. when the Populate_Metadata.py script distributed with OMERO is used with a CSV file exported from Excel with default settings, which are cp1252 for US and iso-8859-1 for EU system locales.

It requires ome/omero-py#325 to be merged in omero-py, which adds support for this in the imported omero.util.populate_roi module. Once that is merged, support for different encodings can be added by simply introducing a new string input field in the script_params that contains the file encoding and defaults to utf-8 (i.e. legacy behaviour).
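
For illustration, such a field could look roughly like this (the parameter name, grouping number and variable name are placeholders, not necessarily what this PR uses):

import omero.scripts as scripts

# Hypothetical encoding parameter; defaulting to "utf-8" preserves the
# legacy behaviour when the user does not change it.
encoding_param = scripts.String(
    "File_Encoding", optional=False, grouping="5",
    description="Character encoding of the attached CSV file",
    default="utf-8")

The parameter would then be passed to scripts.client() alongside the existing script parameters.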

Tests for cp1252 and iso-8859-1, the default encodings for US and EU Excel CSV exports, have been added as well.

// Julian

…quires omero-py with support for different file encodings in omero.utils.populate_roi.DownloadingOriginalFileProvider
@will-moore
Member

Thanks for this contribution.
I'm just wondering if there's a reliable way to make this easier for users, so that they don't have to know which encoding to choose (and maybe to give them a nice error message if they need to choose a different one).
I've not tried this yet, so I haven't looked at where it fails if you use the wrong encoding.
But if the script could check whether the correct encoding was used immediately after reading the data, and then give the user a message (instead of failing later), that would be nice.
The default behaviour (if a user doesn't pick an encoding) could even be to try a handful of the most common encodings in order of prevalence (e.g. utf-8, latin-1, cp1252...) and use the first one that "works"?

Does this sound feasible?
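
For what it's worth, a rough sketch of that fallback idea (the helper name and the candidate list/order are only assumptions, not part of this PR):

def decode_with_fallback(raw_bytes, candidates=("utf-8", "latin-1", "cp1252")):
    """Return (text, encoding) for the first candidate that decodes cleanly."""
    for enc in candidates:
        try:
            return raw_bytes.decode(enc), enc
        except UnicodeDecodeError:
            continue  # try the next candidate
    raise ValueError("None of the candidate encodings could decode the data")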

@JulianHn
Author

Hey @will-moore,

There are certainly options to do that, e.g. by importing chardet, which has a function for guessing the encoding. This is obviously not 100% guaranteed, but in my experience it works well for the most common encodings.

The problem with trying encodings is that, with the exception of .decode("utf-8"), I can't get the other encodings to reliably throw errors when the wrong encoding is used. They will just return nonsensical strings.
If this sounds reasonable, I can certainly add this to the modified DownloadingOriginalFileProvider and amend #325 over in omero-py.

@will-moore
Member

I wasn't aware of chardet - looks handy. I'm not sure we'd want to make it a dependency of omero-py and force all users to install it (although it doesn't seem to have any dependencies itself)? cc @joshmoore @chris-allan
If we don't want to make it required, then we could put the import in a try/except.

One option for users to choose "auto-detect" encoding would be to use encoding=None in:
def get_original_file_data(self, original_file, encoding="utf-8"):
and then, if chardet hasn't been imported, raise an Exception?
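
Something roughly along these lines, perhaps (a sketch only, not the actual omero-py code; _download_to_bytes is a made-up stand-in for the existing download logic):

try:
    import chardet
except ImportError:
    chardet = None  # optional dependency

def get_original_file_data(self, original_file, encoding="utf-8"):
    raw = self._download_to_bytes(original_file)  # made-up helper returning bytes
    if encoding is None:  # caller asked for auto-detection
        if chardet is None:
            raise Exception(
                "encoding=None requires the optional chardet package")
        encoding = chardet.detect(raw)["encoding"]
    return raw.decode(encoding)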

@JulianHn
Author

JulianHn commented Mar 28, 2022

We could also check in populate_metadata.py whether the module is available by try/except importing chardet and modifying the description displayed when opening the script in OMERO.web/Insight, as is already done for the outdated metadata plugin detection. I guess that would make it even easier for users who are not familiar with reading the stderr/stdout outputs?

So it would be something like:

Check if chardet is available. If yes, set the encoding field to "None"/"Detect" etc.
If not, set the encoding field to "utf-8" (to ensure that the current behaviour is replicated) and display a warning in the description that, for non-utf-8 encoded files, an encoding has to be explicitly specified or chardet needs to be installed?
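
Roughly along these lines (the names here are just placeholders):

try:
    import chardet
    ENCODING_DEFAULT = "Detect"  # auto-detection offered as the default choice
    ENCODING_WARNING = ""
except ImportError:
    chardet = None
    ENCODING_DEFAULT = "utf-8"   # replicate the current behaviour
    ENCODING_WARNING = (
        "\nWARNING: chardet is not installed. Non-utf-8 encoded CSV files "
        "require an explicitly specified encoding.")

# ENCODING_WARNING would then be appended to the script description, as is
# already done for the outdated omero-metadata plugin check.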

@will-moore
Member

That sounds like a good solution to me. But best to hear any vetoes from Josh or Chris before committing (and maybe Seb, who is away till Thursday).

@@ -120,7 +120,7 @@ def populate_metadata(client, conn, script_params):
     original_file = get_original_file(
         conn, data_type, object_id, file_ann_id)
     provider = DownloadingOriginalFileProvider(conn)
-    data_for_preprocessing = provider.get_original_file_data(original_file)
+    data_for_preprocessing = provider.get_original_file_data(original_file, encoding=encoding)
Member
Presumably this line fails if you use the wrong encoding? Or only if you use utf-8?
A try/except that returns a useful message (and/or prints it to stdout) could be helpful?
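
Something like this, maybe (just a sketch; the message text and the early return are placeholders):

from omero.rtypes import rstring

try:
    data_for_preprocessing = provider.get_original_file_data(
        original_file, encoding=encoding)
except UnicodeDecodeError as exc:
    message = ("Could not decode the CSV file with encoding '%s': %s. "
               "Please re-run the script with the correct encoding."
               % (encoding, exc))
    client.setOutput("Message", rstring(message))
    return message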

Author
Unfortunately no. I discovered this issue because one of our clients was using German umlauts in their iso-8859-1 encoded CSV, which indeed raises a UnicodeDecodeError at this position. Unfortunately the same does not hold the other way around, i.e. when specifying 'iso-8859-1' for a UTF-8 encoded CSV. This will not raise an error, but the string is nonsensical. I presume this can lead to failures further down in the script (e.g. when the image name is not correct due to the mis-read string), but it could also just lead to nonsensical annotations.
I'm not sure how one would go about catching this behaviour in general.

You can see the effect for yourself with this test code:

test_str = 'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZäöüÄÖÜ'

test_str.encode('latin-1').decode('utf-8')
> Traceback (most recent call last):
>  File "<stdin>", line 1, in <module>
> UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe4 in position 52: invalid continuation byte

test_str.encode('utf-8').decode('latin-1')
> 'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZäöüÃ\x84Ã\x96Ã\x9c'

Member
OK, so it's not guaranteed to fail, but if it does raise a UnicodeDecodeError, that could be worth catching?

Author
Sure. I will try to find ways to break the other encodings as well and check what kinds of errors they give. It might not be 100%, but better to catch some of the errors than none.

@chris-allan
Member

Detecting encodings is notoriously error prone and time consuming.[1] I'm not averse to adding chardet as a dependency, but it has to be used intelligently and sparingly in order not to cause more problems than doing no detection at all. Selecting an incorrect encoding and inserting subtly corrupt data into OMERO is worse than failing spectacularly early. At a minimum I think an implementation should not default to using chardet, but rather add it as an option and select a reasonable number of rows, possibly configurable, to apply it to rather than the whole file.

Footnotes

  1. https://chardet.readthedocs.io/en/latest/faq.html

@JulianHn
Author

Hey @chris-allan ,

I agree with using it sparingly, and probably even making it optional in the first place, as well as selecting a sensible number of rows.

The problem with "Selecting an incorrect encoding and inserting subtly corrupt data into OMERO is worse than failing spectacularly early" is that there is, AFAICS, no way to make imports with a wrong encoding fail spectacularly and reliably. UTF-8 decoding might fail if by chance a byte combination not in the codepage is used; decoding with latin-1 or cp1252 will always "work" (but might be nonsense), since they do not throw errors.
So the problem of importing subtly corrupt data into OMERO already exists anyway, and guessing the encoding might at least be better than the current situation.
An option for letting those imports fail at least in some cases might be to look for bytes that are not in the codepage (e.g. 7F-9F in latin-1) or to check for the occurrence of some of the control bytes (00 to 1F) that should not appear in "normal" CSV files. But of course this also won't catch all cases.
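
For illustration, such a check could look something like this (the function name and the exact byte ranges treated as suspicious are just an assumption):

ALLOWED_CONTROL_BYTES = {0x09, 0x0A, 0x0D}  # tab, LF and CR are fine in a CSV

def has_suspicious_latin1_bytes(raw_bytes):
    """Flag C0 control bytes (other than tab/LF/CR) and the 7F-9F range,
    which should not appear in a "normal" latin-1 encoded CSV."""
    return any(
        (b < 0x20 and b not in ALLOWED_CONTROL_BYTES) or 0x7F <= b <= 0x9F
        for b in raw_bytes)

Note that cp1252 maps most of 80-9F to printable characters (e.g. the euro sign and curly quotes), so the same check would give false positives for cp1252 content.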

@chris-allan
Member

So the problem of importing subtly corrupt data into OMERO already exists anyway, and guessing the encoding might at least be better than the current situation.

Definitely not disputing that. Guessing encodings and having any assumed default is fraught with all sorts of problems. My thinking has always been that defaulting to UTF-8 is going to fail spectacularly in the most incorrect scenarios. Whether those scenarios overlap with your use cases and make sense to you is a dice roll.

There are so many crazy use cases we've seen dealing with delimited text input containing no authoritative statement of encoding, and they don't just differ because of locale. Excel on Windows and Excel on macOS can behave very, very differently in how they handle export to CSV/TSV, and don't get me started on byte order marks,[1] line endings or quotation.

What I very much do not want to do here is create a scenario where we normalize encoding autodetection for users. Especially if it has a high likelihood of "fallback" to latin-1 / ISO-8859-1 or CP-1252 because the first n rows of the CSV happen not to contain a μ but a following row does. That would be a disaster for us. I'm not experienced enough with chardet to comment on the likelihood of such a scenario. It also cannot create performance problems for analytical use cases; we regularly deal with CSV input containing millions of rows, as well as rows that contain several kilobytes of text.

Footnotes

  1. https://docs.microsoft.com/en-us/windows/win32/intl/using-byte-order-marks

@chris-allan
Member

A 60-second attempt at testing chardet with an example of the type of scientific input we all regularly see:

>>> import chardet
>>> a = 'nucleus is 54μm in diameter'.encode('utf-8')
>>> chardet.detect(a)
{'encoding': 'ISO-8859-1', 'language': '', 'confidence': 0.73}
>>> a.decode('iso-8859-1')
'nucleus is 54Î¼m in diameter'
>>> a.decode('utf-8')
'nucleus is 54μm in diameter'

That's IMHO just simply an unacceptable outcome. Could certainly be my naïve use or maybe chardet is just not the right solution for scientific data.

@JulianHn
Author

That's IMHO just simply an unacceptable outcome. Could certainly be my naïve use or maybe chardet is just not the right solution for scientific data.

Ugh ... I agree. That is really unacceptable. Usually chardet gets better the more data it gets (since it's based on some language heuristics), but especially for more "technical" metadata that does not contain significant "natural language" parts, it is probably not guaranteed to find much to work with.

Okay, so to summarize my understanding of your points and the testing results:
Supporting multiple encodings is good, but it is probably better to force the user to specify the encoding (while keeping utf-8 as the default). Otherwise we either run into performance problems by having to scan millions of rows (which indeed takes quite a while with chardet in my experience) and probably still can't ensure correct guessing in >95% of the cases, or we risk even more problems with "corrupted" data.

If we can agree on this, I would just add some error handling for the UnicodeDecodeError, as well as at least checking latin-1 for bytes not in the codepage, to let the script fail with some nice errors.

@chris-allan
Member

Supporting multiple encodings is good, but it is probably better to force the user to specify the encoding (while keeping utf-8 as the default). Otherwise we either run into performance problems by having to scan millions of rows (which indeed takes quite a while with chardet in my experience) and probably still can't ensure correct guessing in >95% of the cases, or we risk even more problems with "corrupted" data.

👍

JulianHn added 5 commits June 2, 2022 14:07
…ovide clear error message and exit the function early
…chine. All test cases should either import without error or raise the correct error message
…fering from utf-8.

Refactored the creation of the scripts.client to allow for dynamic display of the encoding
field only if support is available.
Also: Switched free string input for encodings over to a list input based on the encodings available on the server.
checks if get_original_file_data has the encoding argument.
Checking via import of DecodingError is no longer possible since the custom
class was removed, and it was in any case not direct proof that get_original_file_data
supports the encoding argument. Also changed the respective error catch and refactored
the encoding detection code.

Change Error Catch to UnicodeDecodeError

Fixed variable
@JulianHn
Author

JulianHn commented Jun 3, 2022

So ... this got kind of buried in my to-do list, and we needed to figure out the details of open-source contribution licensing at the institute, but after a long delay here, finally, is the script with support for different CSV encodings.

As discussed, we require the user to specify the encoding and do not rely on any detection mechanisms.
The script automatically queries the omero.util.populate_roi.DownloadingOriginalFileProvider.get_original_file_data() function for encoding support. If there is no support, the script will look as before, with a message added as is already done for the omero-metadata plugin check.
If support is available, an additional field is added to the script that allows the user to select an encoding, while still defaulting to utf-8. The list of encodings is automatically queried from the encodings installed on the server.
I've also added a try/except around the respective part of the code, which gives the user a hopefully clearer error message that might help them find the problem more quickly. This will still not catch all problems (e.g. latin-1 is designed in a way that it will always decode without an error, even if there are symbols not in the codepage), but this is a design choice and AFAICS not solvable on our side.
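
In outline, the support check can be done by inspecting the provider's signature, and the encoding list can be taken from the codecs shipped with the server's Python (a simplified sketch, not the exact code in this PR):

import encodings.aliases
import inspect

from omero.util.populate_roi import DownloadingOriginalFileProvider

def encoding_supported():
    """True if get_original_file_data() accepts an 'encoding' argument."""
    sig = inspect.signature(
        DownloadingOriginalFileProvider.get_original_file_data)
    return "encoding" in sig.parameters

# One way to enumerate the codecs available on the server:
available_encodings = sorted(set(encodings.aliases.aliases.values()))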

Furthermore, the tests now check all encodings and expect either a successful import or at least a clean exit. I could not figure out how to set up a correct testing environment for the integration tests, so this code has only been tested manually, not with pytest.

Accompanying changes have also been made in ome/omero-py#325.

// Julian

@JulianHn JulianHn marked this pull request as ready for review June 3, 2022 14:15
@JulianHn
Author

I'm afraid I cannot figure out from the GitHub Actions log why the checks are failing. Any ideas what needs fixing?

@will-moore
Member

@JulianHn I'm afraid that the flake8 checks somehow got missed when #193 was merged (adding a bunch of new scripts) and they are now failing in other PRs.

I think they are all fixed in #195 which is awaiting testing/merge. I could fix them again in a separate PR: it would just mean subsequently fixing all the merge conflicts with that branch!

@JulianHn
Author

@will-moore Ah ... that explains it, thanks for the heads-up. No worries about fixing them separately; I just did not assume the action would outright fail because of flake8 warnings and was confused about what was causing the failure.

@will-moore
Member

Now that #195 is merged, if you can merge the origin/develop branch into this one, that should help get the build green.
