Get words and confidences #213
The distribution is based on the https://github.com/nguyenq/tess4j/tree/tess4j-4 branch. The master branch contains development code for the latest Tesseract 5.x version.
Below is an example of calling
Thank you. What do you mean when you say you haven't been able to free the array? I see the code you have commented out. Do you mean it's not working? What would be the best way to get a reusable instance that I can pass multiple files to in succession? Could I then reuse that instance to process multiple files like this
I just realized this only returns the confidences and not the words. How do I get the words?
Right, I haven't been able to free the array. Your approach for multiple images looks alright, but beware of memory leaks from the Tesseract library. You may want to start a new instance after a certain number of images. You need to read the documentation more closely. There's a
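The advice about starting a new instance after a number of images can be sketched as a small wrapper. This is only an illustration under stated assumptions: `OcrEngine`, `RecyclingOcr`, and `RecyclingOcrDemo` are hypothetical names, not Tess4J classes, and the engine stands in for whatever native-backed object is initialized once and disposed of later.

```java
import java.util.function.Supplier;

/** Hypothetical stand-in for a native-backed OCR engine; not a Tess4J type. */
interface OcrEngine extends AutoCloseable {
    String recognize(String imagePath);

    @Override
    void close();
}

/** Reuses one engine across calls and replaces it after maxUses images. */
final class RecyclingOcr implements AutoCloseable {
    private final Supplier<OcrEngine> factory;
    private final int maxUses;
    private OcrEngine engine; // created lazily on first use
    private int uses;         // images processed by the current engine

    RecyclingOcr(Supplier<OcrEngine> factory, int maxUses) {
        if (maxUses < 1) {
            throw new IllegalArgumentException("maxUses must be >= 1");
        }
        this.factory = factory;
        this.maxUses = maxUses;
    }

    String recognize(String imagePath) {
        // Dispose of the current engine once it has served maxUses images.
        if (engine != null && uses >= maxUses) {
            close();
        }
        // Set up a fresh engine only when none is active.
        if (engine == null) {
            engine = factory.get();
            uses = 0;
        }
        uses++;
        return engine.recognize(imagePath);
    }

    @Override
    public void close() {
        if (engine != null) {
            engine.close();
            engine = null;
        }
    }
}

final class RecyclingOcrDemo {
    public static void main(String[] args) {
        // A fake engine so the sketch runs without any native library.
        Supplier<OcrEngine> fake = () -> new OcrEngine() {
            @Override
            public String recognize(String imagePath) {
                return "text from " + imagePath;
            }

            @Override
            public void close() {
                System.out.println("engine disposed");
            }
        };
        try (RecyclingOcr ocr = new RecyclingOcr(fake, 2)) {
            for (String path : new String[] {"a.png", "b.png", "c.png"}) {
                System.out.println(ocr.recognize(path));
            }
        }
    }
}
```

The caller controls the lifetime: setup happens once per batch of `maxUses` images rather than once per image, and `close()` releases the last engine. A suitable value for `maxUses` depends on how quickly memory grows in practice.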
OK, I see now. It wasn't immediately obvious that getWords() also includes the confidence. The Javadoc is fine to some extent: it documents the methods, but it's not always clear how to use them. For example, I'm not sure what the pageIteratorLevel is. Do I just start off with 0? In my particular use case, each file has just a few words on a single page. There is also no real documentation about how the other structures work, such as the various iterators. Also, getWords() initializes TessAPI and disposes of it on every call. It seems it would be more performant to write my own version that copies the code in init(), calls it once, and then copies the code in getWords(). It would be nice if the code already contained the building blocks I need, so I wouldn't have to replicate it myself. I would have liked to be able to create an instance of Tesseract, call tess.init() once, and then call another version of tess.getWords() that doesn't do the setup and teardown, leaving that to the caller.
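On the pageIteratorLevel question: Tesseract's API defines its PageIteratorLevel values in the order block, paragraph, text line, word, symbol, so 0 means block level and 3 means word level. The sketch below mirrors that order and shows one way to work with word/confidence pairs. `PageLevel`, `OcrWord`, and `ConfidenceFilter` are hypothetical names for illustration, not Tess4J's own classes, and the 0 to 100 confidence scale is an assumption about how the values are reported.

```java
import java.util.ArrayList;
import java.util.List;

/** Mirrors the order of Tesseract's PageIteratorLevel values (0 to 4). */
enum PageLevel {
    BLOCK, PARA, TEXTLINE, WORD, SYMBOL;

    /** Integer code in the same order as the native enum. */
    int code() {
        return ordinal();
    }
}

/** Hypothetical stand-in for one recognized unit and its confidence. */
final class OcrWord {
    final String text;
    final float confidence; // assumed to be on a 0 to 100 scale

    OcrWord(String text, float confidence) {
        this.text = text;
        this.confidence = confidence;
    }
}

final class ConfidenceFilter {
    private ConfidenceFilter() {
    }

    /** Keeps only the words whose confidence is at least minConfidence. */
    static List<OcrWord> atLeast(List<OcrWord> words, float minConfidence) {
        List<OcrWord> kept = new ArrayList<>();
        for (OcrWord w : words) {
            if (w.confidence >= minConfidence) {
                kept.add(w);
            }
        }
        return kept;
    }
}
```

For files that contain only a few words, the word level (`PageLevel.WORD.code()`, that is 3) is the natural choice; passing 0 would return whole blocks instead.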
The building blocks are in You can either implement your custom class or extend the existing ones. Good luck.
I guess I'm talking about higher-level building blocks. I can extend Tesseract and call
Where I appreciate your help and all the work you have put into this.
I found this repo, at https://github.com/nguyenq/tess4j/tree/master/src/test/java/net/sourceforge/tess4j, which is different from the tess4j 4.5.4 distribution. How is this code different?
The code in TessApiTest has some good examples of getting the confidence values, but I can't figure out how the ProgressMonitor is used. Since I didn't need a monitor, I tried to remove it, but then I got errors about memory leaks. It needs to be passed to the call to TessBaseAPIRecognize. The ProgressMonitor class doesn't exist at all in the tess4j distribution, although the call to TessBaseAPIRecognize does require an ETEXT_DESC argument. Can you explain more?