Switch to NFC normalisation by default #257
base: master
Conversation
Switch from NFKC normalisation to NFC normalisation by default. NFC normalisation is more appropriate for OCR: characters that Unicode treats as semantically similar are nevertheless often distinct glyphs that are useful to capture and output in their original form.
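For concreteness, a minimal sketch of the kind of change being proposed, assuming the normalisation goes through Python's standard `unicodedata` module (the `normalize_text` name below is illustrative, not necessarily ocropy's actual call site):

```python
import unicodedata

def normalize_text(s, form="NFC"):  # previous default: "NFKC"
    """Normalise ground-truth and codec text before training/recognition."""
    return unicodedata.normalize(form, s)
```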
Can you elaborate on why NFC is better suited for OCR than NFKC? Can you give example data where NFC is superior, and ideally test data for CI? How will this influence recognition with the widely used en-default and fraktur models? I'm reluctant to merge this until I fully understand the consequences.
Thanks for commenting @kba. My use case for NFC is to recognise and differentiate long s (ſ) from short s (s) in old Latin documents. NFKC treats the two as semantically interchangeable and folds the long s into a short s, meaning that I can't differentiate them in the OCR output. I realised this when I found that the long s in my codec input text wasn't included in the codec debug output, but it also meant that the long s characters in my ground truth were silently changed to short s characters.

It makes sense to me to only normalise together characters that are genuinely identical, not ones that Unicode merely deems "equivalent", so that such characters can still be differentiated. To use the example of long and short s: with this patch one could still transcribe all long s characters in the ground truth as short s if the distinction wasn't important; the difference is that when it is relevant, the distinct glyphs can be preserved and correctly represented with an appropriate model.

I can't think of any cases where this could cause a regression in other training qualities, unless a set of ground truth files depended on different Unicode characters being normalised to the same character during training. I am no expert in non-Latin or Greek scripts, so perhaps that could be an issue, but it would surprise me. I hope this all makes sense; do ping me for clarification if I have been too verbose or unclear!

(Edited, as I initially got NFKC and NFC the wrong way around in this comment. Sorry!)
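The long-s behaviour described above can be reproduced directly with Python's `unicodedata` module; a quick illustration (not part of the patch itself):

```python
import unicodedata

s = u"di\u017ftingui\u017fh"  # "diſtinguiſh": ground truth containing long s (U+017F)

print(unicodedata.normalize("NFKC", s))  # 'distinguish' -- long s silently folded to short s
print(unicodedata.normalize("NFC", s))   # 'diſtinguiſh' -- long s preserved
```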
I just tested all the ground truth linked from the wiki (https://github.com/tmbdev/ocropy/wiki/Models), comparing differences between NFC and NFKC:

https://github.com/ChillarAnand/likitham (Telugu): no difference
https://github.com/zuphilip/ocropy-french-models (French): only several instances of a single difference
https://github.com/jze/ocropus-model_cyrillic (Cyrillic): one instance of a difference
https://github.com/isaomatsunami/clstm-Japanese (Japanese): many differences, but all small variants

I also checked the text in ocropy's tests/ directory, and there was no difference between NFC and NFKC.

From looking at all of these, it still seems to me that NFC is the best option, as it follows the principle of least surprise: it ensures that whatever glyph is encoded in the ground truth will be used for the model. I suspect that the people creating these models didn't expect the OCR to alter their characters from the ground truth the way NFKC does.

I couldn't see the ground truth for the English and Fraktur models you mention, but I'd be happy to compare them too if they're available.
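A rough sketch of this kind of comparison, assuming line-based ground-truth text files (the `gt/*.gt.txt` glob is a placeholder, not the actual layout of those repositories):

```python
# Report ground-truth lines whose NFC and NFKC normalisations differ.
import glob
import io
import unicodedata

for path in glob.glob("gt/*.gt.txt"):
    with io.open(path, encoding="utf-8") as f:
        for lineno, line in enumerate(f, 1):
            if unicodedata.normalize("NFC", line) != unicodedata.normalize("NFKC", line):
                print("%s:%d: %s" % (path, lineno, line.rstrip()))
```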
Tom is proposing to change "short s" to "long s" after the OCR; while this may work relatively easily on more recent Fraktur texts (e.g. 19th century), the older the texts get, the more difficult this becomes. For example, incunabula (1450-1500) follow no uniform grammar or spelling, so it is crucial that the OCR reflects the print as closely as possible.
@Beckenb is correct; moreover, places where long s is used in a way that is not "correct" can themselves be useful information to capture in some cases. There will also be other characters for which it is important to recognise the particular glyph, even if Unicode considers it "equivalent" to a different one. Long s is just a well-known, obvious example.
Maybe you want to explore what Tesseract 4.00 is doing. |
I want to throw in: the only drawback is that polytonic Greek output will look worse on displays, as the glyphs in most fonts are only defined for the combined code points.
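To make the remark about combined code points concrete, a small illustrative snippet (not from the thread) using Python's `unicodedata`: polytonic Greek can be stored either as a base letter plus combining marks or as a single precomposed character, and NFC produces the precomposed form where one exists.

```python
import unicodedata

# 'ἄ' as base letter plus combining marks (three code points)...
decomposed = u"\u03b1\u0313\u0301"  # α + combining comma above + combining acute
# ...versus the single precomposed code point that NFC produces.
composed = unicodedata.normalize("NFC", decomposed)

print([hex(ord(c)) for c in decomposed])  # ['0x3b1', '0x313', '0x301']
print([hex(ord(c)) for c in composed])    # ['0x1f04']
print(unicodedata.normalize("NFD", composed) == decomposed)  # True
```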