
Tess4j OCR results worse than CLI #268

Open
mmatela opened this issue Sep 11, 2024 · 11 comments


mmatela commented Sep 11, 2024

Using tesseract 5.4.1 and tess4j-5.13.0 (but also seen the same behavior with tess4j-5.4.0)

Sample image: ocrtest.png (attached image of a single line of text)

When using command line, the results are perfect:

$ tesseract ocrtest.png stdout --oem 1 --psm 7 -l pol --tessdata-dir /usr/share/tesseract-ocr/5/tessdata/
kogokolwiek, gdziekolwiek

I'm trying to invoke the same through tess4j with the following java code:

import java.awt.image.BufferedImage;
import java.io.File;
import net.sourceforge.tess4j.Tesseract;
import net.sourceforge.tess4j.util.ImageIOHelper;

Tesseract tesseract = new Tesseract();
tesseract.setDatapath("/usr/share/tesseract-ocr/5/tessdata/");
tesseract.setLanguage("pol");
tesseract.setOcrEngineMode(1);  // OEM_LSTM_ONLY
tesseract.setPageSegMode(7);    // PSM_SINGLE_LINE
BufferedImage testImg = ImageIOHelper.getImageList(new File("ocrtest.png")).get(0);
String result = tesseract.doOCR(testImg);
System.out.println(result);

The result is an empty string!
I tried different PSMs, and it prints something for 6 (PSM_SINGLE_BLOCK), but not a fully correct result: | kogokolwiek, geziekolwiek.
Anyway, PSM 7 (PSM_SINGLE_LINE) seems like it should work best, since the image contains a single line.

As advised in #264 and related issues, I tried VietOCR and see the same result (nothing recognized by default, imperfect result with psm6).


mmatela commented Sep 11, 2024

Using the Polish language isn't actually necessary to demonstrate the problem; with English it's similar: the result is slightly mangled in the CLI, more mangled with tess4j and PSM 6, and empty with tess4j and PSM 7.

mmatela changed the title from "Tess4j results worse than CLI" to "Tess4j OCR results worse than CLI" on Sep 11, 2024

nguyenq commented Sep 12, 2024

Please see #264 (comment)


mmatela commented Sep 12, 2024

Thanks for the pointer, I had missed that comment, but it doesn't seem to solve my problem.
If I understand correctly, the TextRenderer is only available when calling Tesseract.createDocuments, not in Tesseract.doOCR. But createDocuments doesn't let me define rectangles to process only parts of the input image, which is the main advantage of tess4j for me (otherwise I could just use ProcessBuilder to invoke the CLI).

Am I missing something? What's the best way forward?
Would it be possible to add renderer selection to the doOCR API?
Or are there any tricks to process only parts of an image with createDocuments or with the CLI? Otherwise I guess I'd have to save those parts as separate temporary files...


mmatela commented Sep 12, 2024

Also, I just noticed a scary sentence in #264 (comment):

> It's possible or likely that Tesseract CLI performs some basic image preprocessing before OCR stage. You may have to perform similar preprocessing yourself when using tess4j.

Do you still think that's true? That deserves a big red warning label on the front page. I don't think many users are aware that they likely get worse OCR results than with the CLI unless they perform additional research and implement their own preprocessing.


nguyenq commented Sep 13, 2024

> Thanks for the pointer, I had missed that comment, but it doesn't seem to solve my problem. If I understand correctly, the TextRenderer is only available when calling Tesseract.createDocuments, not in Tesseract.doOCR. But createDocuments doesn't let me define rectangles to process only parts of the input image, which is the main advantage of tess4j for me (otherwise I could just use ProcessBuilder to invoke the CLI).
>
> Am I missing something? What's the best way forward? Would it be possible to add renderer selection to the doOCR API? Or are there any tricks to process only parts of an image with createDocuments or with the CLI? Otherwise I guess I'd have to save those parts as separate temporary files...

The TextRenderer API expects a path to an image file as input and outputs to a file on the local filesystem. It does not accept specified ROIs. The CLI does not seem to support ROIs either.

https://tesseract-ocr.github.io/tessdoc/Command-Line-Usage.html

So if you want to use createDocuments on part of an image, you would need to crop it first and save the subimage to the local filesystem before invoking createDocuments. doOCR, which calls Tesseract's GetUTF8Text function behind the scenes, supports ROIs, but the GetUTF8Text API, as opposed to the TextRenderer API, follows a different execution path inside the Tesseract engine and hence can produce a different result.
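For what it's worth, the crop-and-save approach could be sketched roughly like this, using only the standard library (the class and method names here are mine, purely for illustration; the returned file's path would then be handed to createDocuments):

```java
import java.awt.image.BufferedImage;
import java.io.File;
import javax.imageio.ImageIO;

public class CropForOcr {

    // Crop a region of interest out of an image and save it as a temporary
    // PNG, whose path could then be passed to Tesseract.createDocuments.
    public static File cropToTempFile(BufferedImage src, int x, int y, int w, int h)
            throws Exception {
        // getSubimage shares the backing raster with src, so copy into a
        // fresh image before writing it out.
        BufferedImage roi = src.getSubimage(x, y, w, h);
        BufferedImage copy = new BufferedImage(w, h, BufferedImage.TYPE_INT_RGB);
        copy.getGraphics().drawImage(roi, 0, 0, null);
        File tmp = File.createTempFile("ocr-roi-", ".png");
        ImageIO.write(copy, "png", tmp);
        return tmp;
    }
}
```

It's extra filesystem traffic per region, but it keeps the TextRenderer execution path (and its preprocessing) intact.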


nguyenq commented Sep 13, 2024

> Also, I just noticed a scary sentence in #264 (comment):
>
> > It's possible or likely that Tesseract CLI performs some basic image preprocessing before OCR stage. You may have to perform similar preprocessing yourself when using tess4j.
>
> Do you still think that's true? That deserves a big red warning label on the front page. I don't think many users are aware that they likely get worse OCR results than with the CLI unless they perform additional research and implement their own preprocessing.

The Tesseract engine performs some minimal, basic image processing on input images, such as thresholding, before the recognition stage. Tess4j inherits the same benefits when it invokes the Tesseract API. For some images this may be sufficient, but more complicated ones may require the user to carry out additional preprocessing -- such as deskewing, denoising, or binarization -- to improve recognition.

https://tesseract-ocr.github.io/tessdoc/ImproveQuality.html
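As an illustration of the kind of extra preprocessing meant here, a naive global-threshold binarization of a BufferedImage might look like this (a sketch only; real pipelines usually prefer adaptive thresholding such as Otsu's method, e.g. via Leptonica or OpenCV, and the class name is mine):

```java
import java.awt.image.BufferedImage;

public class Binarize {

    // Naive global-threshold binarization: pixels brighter than the
    // threshold become white, everything else black. Illustrative only;
    // adaptive methods handle uneven lighting far better.
    public static BufferedImage binarize(BufferedImage src, int threshold) {
        BufferedImage out = new BufferedImage(
                src.getWidth(), src.getHeight(), BufferedImage.TYPE_BYTE_BINARY);
        for (int y = 0; y < src.getHeight(); y++) {
            for (int x = 0; x < src.getWidth(); x++) {
                int rgb = src.getRGB(x, y);
                int r = (rgb >> 16) & 0xff, g = (rgb >> 8) & 0xff, b = rgb & 0xff;
                int lum = (r + g + b) / 3;  // simple average luminance
                out.setRGB(x, y, lum > threshold ? 0xFFFFFF : 0x000000);
            }
        }
        return out;
    }
}
```

The resulting TYPE_BYTE_BINARY image can then be fed to doOCR like any other BufferedImage.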


mmatela commented Sep 13, 2024

OK, I'll try to sum up what you said together with what I see in the tess4j code.

doOCR, depending on the selected output format, uses the TessBaseAPIGet[...]Text API calls, which work on images already loaded into memory and support regions of interest (ROIs), but don't do preprocessing, so OCR quality is likely worse.

createDocuments uses the Tess[...]RendererCreate API calls, which go through TessBaseAPIProcessPages; that only takes paths to image files and doesn't support ROIs, but performs preprocessing.

It would be great to enable preprocessing in doOCR, but that's impossible due to API limitations.
Would it make sense to ask the Tesseract team to enhance the API in that regard?


mmatela commented Sep 17, 2024

With a lot of help from AI I was able to set up a simple C++ project to test the API directly. It turns out that GetUTF8Text recognizes my example perfectly! So there must be something else going on, but I have no idea what to check next.

Here's the C++ code I used:

#include <tesseract/baseapi.h>
#include <leptonica/allheaders.h>
#include <iostream>

int main() {
    char *outText;

    tesseract::TessBaseAPI *api = new tesseract::TessBaseAPI();
    if (api->Init("/usr/share/tesseract-ocr/5/tessdata/", "pol", tesseract::OEM_LSTM_ONLY)) {
        fprintf(stderr, "Could not initialize tesseract.\n");
        exit(1);
    }
    api->SetPageSegMode(tesseract::PSM_SINGLE_LINE);

    Pix *image = pixRead("ocrtest.png");
    api->SetImage(image);
    outText = api->GetUTF8Text();
    printf("OCR output:\n%s", outText);

    api->End();
    delete api;
    delete [] outText;
    pixDestroy(&image);

    return 0;
}


mmatela commented Sep 17, 2024

One step further: I've got the correct result in Java, using TessAPI directly instead of the Tesseract wrapper:

TessAPI api = TessAPI.INSTANCE;
TessBaseAPI handle = api.TessBaseAPICreate();
api.TessBaseAPIInit2(handle, "/usr/share/tesseract-ocr/5/tessdata/", "pol", 1);
api.TessBaseAPISetPageSegMode(handle, 7);

BufferedImage bufImg = ImageIOHelper.getImageList(new File("/home/vagrant/tesseract-test/ocrtest.png")).get(0);

// variant 1
ByteBuffer buff = ImageIOHelper.convertImageData(bufImg);
api.TessBaseAPISetImage(handle, buff, bufImg.getWidth(), bufImg.getHeight(), 1, bufImg.getWidth());

// variant 2
// Pix pix = LeptUtils.convertImageToPix(bufImg);
// api.TessBaseAPISetImage2(handle, pix);

Pointer textPtr = api.TessBaseAPIGetUTF8Text(handle);
String str = textPtr.getString(0);
api.TessDeleteText(textPtr);
System.out.println(str);
// TODO more cleanup

Variant 1 should be the equivalent of Tesseract.doOCR(): it uses a ByteBuffer and prints nothing (or a mangled result with PSM=6), while variant 2, which uses Leptonica's Pix, prints the correct result.
So could it be a problem with converting a BufferedImage into a ByteBuffer? I tried to copy the implementation of getImageByteBuffer used in LeptUtils, and it led to similar effects, but strangely not the same: the PSM=6 result was even more mangled.

@nguyenq What do you think? Would you consider converting input to Pix in doOCR? Looking at https://github.com/tesseract-ocr/tesseract/blob/4f435363354a4c06730ee1b9a2b5facacf353d6b/src/api/baseapi.cpp#L521 it seems to be highly recommended.


nguyenq commented Sep 23, 2024

@mmatela Can you attach a sample image for our investigation? Thanks.


mmatela commented Sep 23, 2024

@nguyenq it's in the first post here.
