
Tess4j OCR results worse than CLI #268

Open
mmatela opened this issue Sep 11, 2024 · 11 comments


mmatela commented Sep 11, 2024

Using tesseract 5.4.1 and tess4j-5.13.0 (but also seen the same behavior with tess4j-5.4.0)

Sample image: ocrtest.png (attached image of a single line of text)

When using command line, the results are perfect:

$ tesseract ocrtest.png stdout --oem 1 --psm 7 -l pol --tessdata-dir /usr/share/tesseract-ocr/5/tessdata/
kogokolwiek, gdziekolwiek

I'm trying to invoke the same through tess4j with the following java code:

import java.awt.image.BufferedImage;
import java.io.File;
import net.sourceforge.tess4j.Tesseract;
import net.sourceforge.tess4j.util.ImageIOHelper;

Tesseract tesseract = new Tesseract();
tesseract.setDatapath("/usr/share/tesseract-ocr/5/tessdata/");
tesseract.setLanguage("pol");
tesseract.setOcrEngineMode(1);  // OEM_LSTM_ONLY
tesseract.setPageSegMode(7);    // PSM_SINGLE_LINE
BufferedImage testImg = ImageIOHelper.getImageList(new File("ocrtest.png")).get(0);
String result = tesseract.doOCR(testImg);
System.out.println(result);

The result is an empty string!
I tried different PSMs, and it prints something for 6 (PSM_SINGLE_BLOCK), but not a fully correct result: | kogokolwiek, geziekolwiek.
Anyway, PSM 7 (PSM_SINGLE_LINE) seems like it should work best, since the image contains a single line.

As advised in #264 and related issues, I tried VietOCR and see the same result (nothing recognized by default, imperfect result with psm6).


mmatela commented Sep 11, 2024

Using the Polish language isn't actually necessary to demonstrate the problem; with English it's similar: the result is slightly mangled in the CLI, more mangled with tess4j and PSM 6, and empty with tess4j and PSM 7.

mmatela changed the title from "Tess4j results worse than CLI" to "Tess4j OCR results worse than CLI" on Sep 11, 2024

nguyenq commented Sep 12, 2024

Please see #264 (comment)


mmatela commented Sep 12, 2024

Thanks for the pointer, I had missed that comment, but it doesn't seem to solve my problem.
If I understand correctly, the TextRenderer is only available when calling Tesseract.createDocuments, not in Tesseract.doOCR. But createDocuments doesn't let me define rectangles to process only parts of the input image, which is the main advantage of tess4j for me (otherwise I could just use ProcessBuilder to invoke the CLI).

Am I missing something? What's the best way forward?
Would it be possible to add renderer selection to the doOCR API?
Or are there any tricks to process only parts of an image with createDocuments or with the CLI? Otherwise I guess I'd have to save those parts as separate temporary files...


mmatela commented Sep 12, 2024

Also, I just noticed a scary sentence in #264 (comment):

> It's possible or likely that Tesseract CLI performs some basic image preprocessing before OCR stage. You may have to perform similar preprocessing yourself when using tess4j.

Do you still think that's true? That deserves a big red warning label on the front page. I don't think many users are aware that they likely get worse OCR results than with the CLI unless they perform additional research and implement their own preprocessing.


nguyenq commented Sep 13, 2024

> Thanks for the pointer, I had missed that comment, but it doesn't seem to solve my problem. If I understand correctly, the TextRenderer is only available when calling Tesseract.createDocuments, not in Tesseract.doOCR. But createDocuments doesn't let me define rectangles to process only parts of the input image, which is the main advantage of tess4j for me (otherwise I could just use ProcessBuilder to invoke the CLI).
>
> Am I missing something? What's the best way forward? Would it be possible to add renderer selection to the doOCR API? Or are there any tricks to process only parts of an image with createDocuments or with the CLI? Otherwise I guess I'd have to save those parts as separate temporary files...

The TextRenderer API expects a path to an image file as input and outputs to a file on the local filesystem. It does not accept specified ROIs. The CLI does not seem to support ROIs either.

https://tesseract-ocr.github.io/tessdoc/Command-Line-Usage.html

So if you want to use createDocuments on part of an image, you would need to crop it first and save the subimage to the local filesystem before invoking createDocuments. doOCR, which calls Tesseract's GetUTF8Text function behind the scenes, supports ROIs, but the GetUTF8Text API, as opposed to the TextRenderer API, follows a different execution path inside the Tesseract engine and hence can produce a different result.
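For what it's worth, the crop-and-save approach could be sketched roughly like this, using only the standard library (the class and method names here are mine, purely for illustration; the returned file's path would then be handed to createDocuments):

```java
import java.awt.image.BufferedImage;
import java.io.File;
import javax.imageio.ImageIO;

public class CropForOcr {

    // Crop a region of interest out of an image and save it as a temporary
    // PNG, whose path could then be passed to Tesseract.createDocuments.
    public static File cropToTempFile(BufferedImage src, int x, int y, int w, int h)
            throws Exception {
        // getSubimage shares the backing raster with src, so copy into a
        // fresh image before writing it out.
        BufferedImage roi = src.getSubimage(x, y, w, h);
        BufferedImage copy = new BufferedImage(w, h, BufferedImage.TYPE_INT_RGB);
        copy.getGraphics().drawImage(roi, 0, 0, null);
        File tmp = File.createTempFile("ocr-roi-", ".png");
        ImageIO.write(copy, "png", tmp);
        return tmp;
    }
}
```

It's extra filesystem traffic per region, but it keeps the TextRenderer execution path (and its preprocessing) intact.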


nguyenq commented Sep 13, 2024

> Also, I just noticed a scary sentence in #264 (comment):
>
> > It's possible or likely that Tesseract CLI performs some basic image preprocessing before OCR stage. You may have to perform similar preprocessing yourself when using tess4j.
>
> Do you still think that's true? That deserves a big red warning label on the front page. I don't think many users are aware that they likely get worse OCR results than with the CLI unless they perform additional research and implement their own preprocessing.

The Tesseract engine performs some minimal, basic image processing on input images, such as thresholding, before the recognition stage. Tess4j inherits the same benefits when it invokes the Tesseract API. For some images this may be sufficient, but more complicated ones may require the user to carry out additional preprocessing -- such as deskewing, denoising, or binarization -- to improve recognition.

https://tesseract-ocr.github.io/tessdoc/ImproveQuality.html
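As an illustration of the kind of extra preprocessing meant here, a naive global-threshold binarization of a BufferedImage might look like this (a sketch only; real pipelines usually prefer adaptive thresholding such as Otsu's method, e.g. via Leptonica or OpenCV, and the class name is mine):

```java
import java.awt.image.BufferedImage;

public class Binarize {

    // Naive global-threshold binarization: pixels brighter than the
    // threshold become white, everything else black. Illustrative only;
    // adaptive methods handle uneven lighting far better.
    public static BufferedImage binarize(BufferedImage src, int threshold) {
        BufferedImage out = new BufferedImage(
                src.getWidth(), src.getHeight(), BufferedImage.TYPE_BYTE_BINARY);
        for (int y = 0; y < src.getHeight(); y++) {
            for (int x = 0; x < src.getWidth(); x++) {
                int rgb = src.getRGB(x, y);
                int r = (rgb >> 16) & 0xff, g = (rgb >> 8) & 0xff, b = rgb & 0xff;
                int lum = (r + g + b) / 3;  // simple average luminance
                out.setRGB(x, y, lum > threshold ? 0xFFFFFF : 0x000000);
            }
        }
        return out;
    }
}
```

The resulting TYPE_BYTE_BINARY image can then be fed to doOCR like any other BufferedImage.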


mmatela commented Sep 13, 2024

OK, I'll try to sum up what you said together with what I see in the tess4j code.

doOCR, depending on the selected output format, uses the TessBaseAPIGet[...]Text API calls, which work on images already loaded into memory and support regions of interest (ROIs), but don't do preprocessing, so OCR quality is likely worse.

createDocuments uses the Tess[...]RendererCreate API calls, which go through TessBaseAPIProcessPages; that only takes paths to image files and doesn't support ROIs, but performs preprocessing.

It would be great to enable preprocessing in doOCR, but that's impossible due to API limitations.
Would it make sense to ask the Tesseract team to enhance the API in that regard?


mmatela commented Sep 17, 2024

With a lot of help from AI I was able to set up a simple C++ project to test the API directly. It turns out that GetUTF8Text recognizes my example perfectly! So there must be something else going on, but I have no idea what to check next.

Here's the C++ code I used:

#include <tesseract/baseapi.h>
#include <leptonica/allheaders.h>
#include <iostream>

int main() {
    char *outText;

    tesseract::TessBaseAPI *api = new tesseract::TessBaseAPI();
    if (api->Init("/usr/share/tesseract-ocr/5/tessdata/", "pol", tesseract::OEM_LSTM_ONLY)) {
        fprintf(stderr, "Could not initialize tesseract.\n");
        exit(1);
    }
    api->SetPageSegMode(tesseract::PSM_SINGLE_LINE);

    Pix *image = pixRead("ocrtest.png");
    api->SetImage(image);
    outText = api->GetUTF8Text();
    printf("OCR output:\n%s", outText);

    api->End();
    delete api;
    delete [] outText;
    pixDestroy(&image);

    return 0;
}


mmatela commented Sep 17, 2024

One step further: I've got the correct result in Java, using TessAPI directly instead of the Tesseract wrapper:

TessAPI api = TessAPI.INSTANCE;
TessBaseAPI handle = api.TessBaseAPICreate();
api.TessBaseAPIInit2(handle, "/usr/share/tesseract-ocr/5/tessdata/", "pol", 1);
api.TessBaseAPISetPageSegMode(handle, 7);

BufferedImage bufImg = ImageIOHelper.getImageList(new File("/home/vagrant/tesseract-test/ocrtest.png")).get(0);

// variant 1
ByteBuffer buff = ImageIOHelper.convertImageData(bufImg);
api.TessBaseAPISetImage(handle, buff, bufImg.getWidth(), bufImg.getHeight(), 1, bufImg.getWidth());

// variant 2
// Pix pix = LeptUtils.convertImageToPix(bufImg);
// api.TessBaseAPISetImage2(handle, pix);

Pointer textPtr = api.TessBaseAPIGetUTF8Text(handle);
String str = textPtr.getString(0);
api.TessDeleteText(textPtr);
System.out.println(str);
// TODO more cleanup

Variant 1 should be the equivalent of Tesseract.doOCR(): it uses a ByteBuffer and prints nothing (or a mangled result with PSM=6), while variant 2, which uses Leptonica's Pix, prints the correct result.
So could it be a problem with converting a BufferedImage into a ByteBuffer? I tried to copy the implementation of getImageByteBuffer used in LeptUtils, and it led to similar effects, but strangely not the same: the PSM=6 result was even more mangled.

@nguyenq What do you think? Would you consider converting input to Pix in doOCR? Looking at https://github.com/tesseract-ocr/tesseract/blob/4f435363354a4c06730ee1b9a2b5facacf353d6b/src/api/baseapi.cpp#L521 it seems to be highly recommended.


nguyenq commented Sep 23, 2024

@mmatela Can you attach a sample image for our investigation? Thanks.


mmatela commented Sep 23, 2024

@nguyenq it's in the first post here.
