Encoding issue - invalid byte sequence in US-ASCII (ArgumentError) #121
Comments
First question: are you OCRing English or non-English docs? If the latter, you can set the If you're OCRing an English-language doc, we'd be interested in seeing a sample doc (as our TextCleaner isn't doing the right thing if that's the case).
Thanks for the quick reply. Yes, we are OCRing the docs, regardless of language. This doc failed with the same error as above, although it is in English with some handwriting in it. And I am using
Alrighty, mind letting us know what Tesseract version you're using? We're up on docsplit
Here are the full environment details:
Let me know if you want any more information. Thanks.
The TextCleaner strips out character sequences that look like garbage in English (lots of consonants in a row, for example). So if your input is clean-ish, turning it off won't do much.
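To illustrate the kind of heuristic described above, here is a minimal sketch of a consonant-run filter. It is not Docsplit's actual TextCleaner: the names `CONSONANT_RUN`, `garbage_word?`, and `clean_line`, and the threshold of six consonants, are all assumptions chosen for the example.

```ruby
# Hypothetical heuristic, NOT Docsplit's real TextCleaner implementation.
# A word is treated as OCR garbage if it contains six or more consonants
# in a row (the threshold is an arbitrary assumption for this sketch).
CONSONANT_RUN = /[bcdfghjklmnpqrstvwxz]{6,}/i

# True when the word contains a long run of consonants.
def garbage_word?(word)
  word.match?(CONSONANT_RUN)
end

# Drop garbage-looking words from a line and rejoin the rest with spaces.
def clean_line(line)
  line.split.reject { |word| garbage_word?(word) }.join(" ")
end

puts clean_line("the qxzvtrk document")
```

A rule like this only makes sense for English-like text. Languages with different spelling patterns, or non-Latin scripts, would need different rules, which is consistent with the cleaner being English-only.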
So text extraction only works on English? Is there any handy tool you can recommend that can easily extract plain text from non-English PDFs?
Text cleaning only works in English. Docsplit will OCR in non-English languages if you specify the input language.
@intellisense: My environment is pretty close to yours and I'm able to extract your documents successfully. Can you tell me what docsplit command you are running? I ran: Can you also provide the Ruby version from
@nathanstitt I am using this command: I just ran the command with
Hm. Since our commands and Ruby versions are the same, I'm thinking that the culprit may be Tesseract. Perhaps your version is generating some sequence of UTF characters that Docsplit/Ruby doesn't like. My Do your versions differ? I should also note that docsplit/tesseract didn't do a very good job on the second document you linked above. Since the scans were blurry, the text is pretty garbled. The text scanner attempted to clean it, but the difference between using
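For anyone hitting the same error, here is a minimal Ruby sketch of how text containing non-ASCII bytes becomes invalid when Ruby treats it as US-ASCII, plus one way to sanitize it. `normalize_ocr_text` is a hypothetical helper, not part of Docsplit, and it assumes the OCR output is meant to be UTF-8.

```ruby
# The bytes for "café" in UTF-8, tagged (incorrectly) as US-ASCII. This mimics
# OCR output being read while Ruby's external encoding is US-ASCII.
raw = "caf\xC3\xA9".b.force_encoding(Encoding::US_ASCII)
puts raw.valid_encoding?  # false: bytes above 0x7F are not valid US-ASCII

# Hypothetical helper (not a Docsplit API): re-tag the bytes as UTF-8, then
# replace any byte sequences that are still invalid with "?".
def normalize_ocr_text(bytes)
  text = bytes.dup.force_encoding(Encoding::UTF_8)
  text.scrub("?")
end

clean = normalize_ocr_text(raw)
puts clean.valid_encoding?  # true: the same bytes are valid UTF-8
```

If Ruby's default external encoding is US-ASCII (commonly the case when the locale is `C` or unset), setting a UTF-8 locale such as `LANG=en_US.UTF-8` before running may also be worth trying. Whether that resolves this particular report is an assumption.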
The Tesseract version is exactly the same as yours, along with all the image libraries you mentioned; no difference whatsoever. I think I should go with the
Hey @intellisense. Sorry for the confusion, but you absolutely can extract text from non-English documents with or without using the All the option does is disable running the TextCleaner (which removes non-valid characters) on the OCR'ed text. Since the TextCleaner only knows how to recognize garbage in English text, that's the only language it's effective on.
I am getting several errors like these. Any workaround? Thanks!