Get words and confidences #213

Open
peterkronenberg opened this issue May 4, 2021 · 7 comments

Comments

@peterkronenberg

I found this repo, at https://github.com/nguyenq/tess4j/tree/master/src/test/java/net/sourceforge/tess4j, which is different from the tess4j 4.5.4 distribution. How is this code different?

The code in TessApiTest has some good examples of getting the confidence values, but I can't figure out how the ProgressMonitor is used. Since I didn't need a monitor, I tried to eliminate its usage, but then I get errors about memory leaks. It apparently needs to be passed to the call to TessBaseAPIRecognize. This ProgressMonitor class doesn't exist at all in the tess4j distribution, although the call to TessBaseAPIRecognize does require an ETEXT_DESC argument. Can you explain more?

@nguyenq
Owner

nguyenq commented May 5, 2021

The distribution is based on the https://github.com/nguyenq/tess4j/tree/tess4j-4 branch. The master branch contains development code for the latest Tesseract 5.x version.

ProgressMonitor is a client class designed to poll the engine for progress status; however, it seems to no longer work. If you want the feature, you'd need to use the recently added TessMonitor API methods. For examples, please consult the Tesseract documentation or its unit tests.

Below is an example of calling the TessBaseAPIAllWordConfidences method. The caller must delete the returned array after use, which I have not been able to do.

/**
    * Test of TessBaseAPIAllWordConfidences method, of class TessAPI.
    *
    * @throws java.lang.Exception
    */
   @Test
   public void testTessBaseAPIAllWordConfidences() throws Exception {
       logger.info("TessBaseAPIAllWordConfidences");
       File tiff = new File(this.testResourcesDataPath, "eurotext.tif");
       Pix pix = Leptonica1.pixRead(tiff.getPath());
       TessAPI1.TessBaseAPIInit3(handle, datapath, language);
       TessAPI1.TessBaseAPISetImage2(handle, pix);
       IntByReference wordConfidences = TessAPI1.TessBaseAPIAllWordConfidences(handle);
       Pointer confs = wordConfidences.getPointer();
       int i = 0;
       int word = 0;
       while (true) {
           int conf = confs.getInt((long) i * Integer.BYTES); // JNA Pointer offsets are in bytes, so step 4 bytes per int
           if (conf == -1) {
               break; // array terminated by -1
           }
           i++;
           if (conf < 0 || conf > 100) {
               continue; // skip invalid confidence value
           }
           word++;
           logger.info("Word Confidence " + word + ": " + conf);
       }

//        IntBuffer ib = IntBuffer.wrap(confs.getIntArray(0, i));
//        TessAPI1.TessDeleteIntArray(ib);

       //release Pix resource
       PointerByReference pRef = new PointerByReference();
       pRef.setValue(pix.getPointer());
       Leptonica1.pixDestroy(pRef);

       assertTrue(i > 0);
   }
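
A note on walking the returned array: JNA's Pointer.getInt takes a byte offset, so reading element n of a native int array means reading at n * 4 bytes. The following is a minimal sketch of that stride logic, using java.nio.ByteBuffer as a stand-in for native memory so it runs without Tesseract. The class and method names (ConfidenceArrayReader, readConfidences, fakeNativeArray) are made up for illustration and are not part of tess4j.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of tess4j: demonstrates reading a
// -1-terminated int array using byte offsets, as JNA's Pointer.getInt does.
class ConfidenceArrayReader {

    /** Reads ints until the -1 terminator, advancing Integer.BYTES per element. */
    static List<Integer> readConfidences(ByteBuffer nativeMemory) {
        List<Integer> confidences = new ArrayList<>();
        for (int index = 0; ; index++) {
            // Absolute read at a byte offset, not an element index.
            int conf = nativeMemory.getInt(index * Integer.BYTES);
            if (conf == -1) {
                break; // array terminated by -1
            }
            if (conf >= 0 && conf <= 100) {
                confidences.add(conf); // keep only valid confidence values
            }
        }
        return confidences;
    }

    /** Builds a fake -1-terminated array; real data would come from Tesseract. */
    static ByteBuffer fakeNativeArray(int... values) {
        ByteBuffer buf = ByteBuffer
                .allocate((values.length + 1) * Integer.BYTES)
                .order(ByteOrder.nativeOrder());
        for (int v : values) {
            buf.putInt(v);
        }
        buf.putInt(-1);
        return buf;
    }
}
```

With a real JNA Pointer, the equivalent read would be confs.getInt((long) index * Integer.BYTES).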

@peterkronenberg
Author

Thank you.

What do you mean when you say you haven't been able to free the array? I see the code you have commented out. Do you mean it's not working?

What would be the best way to have a reusable instance that I can pass multiple files to in succession?
If I initialize TessAPI1 just once with
TessAPI1.TessBaseAPIInit3(handle, datapath, language);

could I then re-use that instance to process multiple files like this

// Process 1st file
File tiff = new File(this.testResourcesDataPath, "file1.tif");
Pix pix = Leptonica1.pixRead(tiff.getPath());
TessAPI1.TessBaseAPISetImage2(handle, pix);
.
.
.
// close resource
PointerByReference pRef = new PointerByReference();
pRef.setValue(pix.getPointer());
Leptonica1.pixDestroy(pRef);

// Process 2nd file
File tiff = new File(this.testResourcesDataPath, "file2.tif");
Pix pix = Leptonica1.pixRead(tiff.getPath());
TessAPI1.TessBaseAPISetImage2(handle, pix);
.
.
.
// close resource
PointerByReference pRef = new PointerByReference();
pRef.setValue(pix.getPointer());
Leptonica1.pixDestroy(pRef);

// when I'm all done, are there any other resources that need to be closed/released?

@peterkronenberg
Author

peterkronenberg commented May 5, 2021

I just realized this only returns the confidences and not the words. How do I get the words? TessBaseAPIGetUTF8Text only returns a single word. Is there a way to get the words and confidences at once?

@nguyenq
Owner

nguyenq commented May 6, 2021

Right, I haven't been able to free the array.

Your approach for multiple images looks alright, but beware of memory leaks from the Tesseract library. You may want to start a new instance after a certain number of images.

You need to read the documentation better. There's a Tesseract.getWords method that can get both the text and its confidence value.
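
The advice above about starting a new instance after a number of images could be sketched roughly as follows. This is only an illustration of the recycling idea: OcrEngine is a hypothetical stand-in for whatever wraps the initialized handle (it is not a tess4j type), and RecyclingOcr and maxImagesPerEngine are invented names.

```java
import java.util.function.Supplier;

// Hypothetical stand-in for an initialized OCR handle; not a tess4j type.
interface OcrEngine extends AutoCloseable {
    String recognize(String imagePath);

    @Override
    void close(); // releases native resources, no checked exception
}

// Reuses one engine for several images, then replaces it with a fresh one
// to bound any memory the native library may leak over time.
class RecyclingOcr implements AutoCloseable {
    private final Supplier<OcrEngine> factory;
    private final int maxImagesPerEngine;
    private OcrEngine engine;
    private int processed;

    RecyclingOcr(Supplier<OcrEngine> factory, int maxImagesPerEngine) {
        if (maxImagesPerEngine < 1) {
            throw new IllegalArgumentException("maxImagesPerEngine must be >= 1");
        }
        this.factory = factory;
        this.maxImagesPerEngine = maxImagesPerEngine;
    }

    String recognize(String imagePath) {
        if (engine == null || processed >= maxImagesPerEngine) {
            close();                 // dispose of the old engine, if any
            engine = factory.get();  // pay the initialization cost again
            processed = 0;
        }
        processed++;
        return engine.recognize(imagePath);
    }

    @Override
    public void close() {
        if (engine != null) {
            engine.close();
            engine = null;
        }
    }
}
```

In real use, the factory would perform the one-time initialization (for example the TessBaseAPIInit3 call shown earlier in the thread), and the right value for the recycling threshold would have to be found by measuring memory use.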

@peterkronenberg
Author

ok, I see now. It wasn't immediately obvious that getWords() also includes the confidence. The Javadoc is fine to some extent. It documents the methods, but it's not always clear how to use them. For example, I'm not sure what the pageIteratorLevel is. Do I just start off with 0? In my particular use case, each file just has a few words, on a single page. And there is no real documentation about how all the other structures work, such as the various iterators.

Also, getWords() initializes TessAPI and disposes of it on every call. It seems it would be more performant to write my own version that copies the code in init(), calls it once, and then copies the code in getWords(). It would be nice if the code already contained the building blocks I need, so that I didn't have to replicate the code myself. I would have liked to be able to create an instance of Tesseract, call tess.init(), and then call another version of tess.getWords() that doesn't do the setup and teardown, leaving that aspect to the caller.

@peterkronenberg peterkronenberg changed the title How is this repo different from the Tess4j distribution How is this repo different from the Tess4j distribution -- Get words and confidences May 6, 2021
@peterkronenberg peterkronenberg changed the title How is this repo different from the Tess4j distribution -- Get words and confidences Get words and confidences May 6, 2021
@nguyenq
Owner

nguyenq commented May 6, 2021

The building blocks are in the TessAPI class, which mirrors the C API of the Tesseract native library. The provided unit tests and the Tesseract class already show typical usages of the API. It's unrealistic to expect all possible use cases to be documented. If you want to go into more depth, you need to consult Tesseract's native code and documentation.

You can either implement your custom class or extend the existing ones.

Good luck.

@peterkronenberg
Author

I guess I'm talking about higher-level building blocks. I can extend Tesseract and call init() since it's protected, but getWords() is doing too much, so I'd have to implement my own version to separate the init and destroy from the core functionality of getting the words. This is true for most of the methods that call init().
It would be more useful if getWords() were implemented like this:

init()
_getWords()
destroy()

Here _getWords() has the core functionality. This would allow someone to call getWords() and get the same behavior as today, but it would also allow someone to call _getWords() directly and handle the init and destroy steps themselves.
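
To make the proposal concrete, here is a rough sketch of that split. It is not the actual Tesseract class from tess4j: SketchOcr, getWordsCore (playing the role of _getWords()), getWordsForAll, and the calls log are all hypothetical, and the "words" returned are placeholders rather than OCR output.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of the proposed structure; not tess4j code.
class SketchOcr {
    // Records lifecycle calls so the demo can show the ordering.
    final List<String> calls = new ArrayList<>();

    protected void init() {
        calls.add("init");
    }

    protected void destroy() {
        calls.add("destroy");
    }

    /** Core work only (the _getWords() of the proposal); assumes init() was called. */
    protected List<String> getWordsCore(String imagePath) {
        calls.add("words:" + imagePath);
        return Arrays.asList("placeholder"); // real code would return recognized words
    }

    /** Convenience wrapper with the same behavior as today: init, work, destroy. */
    public List<String> getWords(String imagePath) {
        init();
        try {
            return getWordsCore(imagePath);
        } finally {
            destroy(); // always release, even if the core step throws
        }
    }

    /** Batch variant: pays the init/destroy cost once for many images. */
    public List<List<String>> getWordsForAll(List<String> imagePaths) {
        init();
        try {
            List<List<String>> results = new ArrayList<>();
            for (String path : imagePaths) {
                results.add(getWordsCore(path));
            }
            return results;
        } finally {
            destroy();
        }
    }
}
```

A subclass or caller that wants full control could call init(), getWordsCore(...) repeatedly, and destroy() itself, while existing callers of getWords() would see no change.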

I appreciate your help and all the work you have put into this.
