Tess4j OCR results worse than CLI #268
Comments
Using Polish is not actually necessary to demonstrate the problem; with English it's similar: the result is a bit mangled in CLI, more mangled with tess4j psm6, and empty with tess4j psm7.
Please see #264 (comment)
Thanks for the pointer, I missed that comment, but it doesn't seem to solve my problem. Am I missing something? What's the best way forward?
Also, I just noticed a scary sentence in #264 (comment)
Do you still think that's true? That should be a big red warning label on the front page. I don't think many users are aware that they likely get worse OCR results than with CLI, unless they perform additional research and reimplement their own preprocessing.
The https://tesseract-ocr.github.io/tessdoc/Command-Line-Usage.html So if you want to use
The Tesseract engine performs some minimal, basic image processing on input images, such as thresholding, before the recognition stage. Tess4j inherits the same benefits when it invokes the Tesseract API. For some images, this may be sufficient; but for more complicated ones, it may require the user to carry out additional preprocessing on the images -- such as deskewing, denoising, binarization, etc. -- to improve recognition.
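To illustrate the kind of user-side preprocessing mentioned above, here is a minimal sketch of grayscale conversion followed by Otsu binarization, using only `java.awt`. The class and method names (`Binarizer`, `luminance`, `otsuThreshold`, `binarize`) are hypothetical and not part of tess4j, and whether this step helps for a particular image is an assumption that needs to be checked against that image.

```java
import java.awt.image.BufferedImage;

/** Hypothetical helper, not part of tess4j: grayscale conversion + Otsu binarization. */
final class Binarizer {

    /** Integer luminance (0..255) of a packed RGB pixel, using 0.299/0.587/0.114 weights. */
    static int luminance(int rgb) {
        int r = (rgb >> 16) & 0xFF, g = (rgb >> 8) & 0xFF, b = rgb & 0xFF;
        return (299 * r + 587 * g + 114 * b + 500) / 1000;
    }

    /** Otsu's method: pick the threshold that maximizes between-class variance. */
    static int otsuThreshold(int[] hist) {
        long total = 0, sumAll = 0;
        for (int i = 0; i < 256; i++) {
            total += hist[i];
            sumAll += (long) i * hist[i];
        }
        long weightDark = 0, sumDark = 0;
        double bestVariance = -1.0;
        int best = 0;
        for (int t = 0; t < 256; t++) {
            weightDark += hist[t];
            sumDark += (long) t * hist[t];
            if (weightDark == 0) continue;          // no dark pixels yet
            long weightBright = total - weightDark;
            if (weightBright == 0) break;           // all pixels already counted
            double meanDark = (double) sumDark / weightDark;
            double meanBright = (double) (sumAll - sumDark) / weightBright;
            double diff = meanDark - meanBright;
            double variance = (double) weightDark * weightBright * diff * diff;
            if (variance > bestVariance) {
                bestVariance = variance;
                best = t;
            }
        }
        return best; // pixels with luminance <= best are treated as dark
    }

    /** Returns a 1-bit black/white copy of src, thresholded with Otsu's method. */
    static BufferedImage binarize(BufferedImage src) {
        int w = src.getWidth(), h = src.getHeight();
        int[] hist = new int[256];
        for (int y = 0; y < h; y++)
            for (int x = 0; x < w; x++)
                hist[luminance(src.getRGB(x, y))]++;
        int threshold = otsuThreshold(hist);
        BufferedImage out = new BufferedImage(w, h, BufferedImage.TYPE_BYTE_BINARY);
        for (int y = 0; y < h; y++)
            for (int x = 0; x < w; x++)
                out.setRGB(x, y, luminance(src.getRGB(x, y)) > threshold ? 0xFFFFFF : 0x000000);
        return out;
    }

    public static void main(String[] args) {
        // Synthetic 4x2 image: left half dark gray, right half light gray.
        BufferedImage img = new BufferedImage(4, 2, BufferedImage.TYPE_INT_RGB);
        for (int y = 0; y < 2; y++)
            for (int x = 0; x < 4; x++)
                img.setRGB(x, y, x < 2 ? 0x1E1E1E : 0xDCDCDC);
        BufferedImage bw = binarize(img);
        System.out.printf("left=%06X right=%06X%n",
                bw.getRGB(0, 0) & 0xFFFFFF, bw.getRGB(3, 0) & 0xFFFFFF);
    }
}
```

The resulting `BufferedImage` can then be handed to tess4j through `ITesseract.doOCR(BufferedImage)`. A single global threshold is only a starting point; unevenly lit or skewed scans may need adaptive thresholding or deskewing instead.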
OK, I'll try to sum up what you said together with what I see in the tess4j code.
And it would be great to enable preprocessing in
With a lot of help from AI I was able to set up a simple C++ project to test the API directly. Turns out that Here's the C++ code I used:
One step further, I've got the good result in Java, using
Variant 1 should be the equivalent of @nguyenq What do you think? Would you consider converting input to
@mmatela Can you attach a sample image for our investigation? Thanks.
@nguyenq it's in the first post here.
Using tesseract 5.4.1 and tess4j-5.13.0 (but also seen the same behavior with tess4j-5.4.0)
Sample image:
When using the command line, the results are perfect:
I'm trying to invoke the same through tess4j with the following Java code:
The result is an empty string!
I tried different PSMs and it prints something for 6 (PSM_SINGLE_BLOCK), but not a fully correct result: "| kogokolwiek, geziekolwiek".
Anyway, it looks like psm 7 (PSM_SINGLE_LINE) should work best, since the image contains a single line.
As advised in #264 and related issues, I tried VietOCR and saw the same result (nothing recognized by default, imperfect result with psm6).