After upgrading my container's base image from Ubuntu 22.04 to Ubuntu 24.04, I started experiencing minor but consistent issues with the OCR output generated by ocrmypdf.
I have a suite of unit tests that exercise the OCR and had been stable for some time, but some of them started failing after the upgrade. Each test takes a PDF file as input and compares the result with an expected output.
I did some experiments. The main difference is that Ubuntu 24.04 ships Ghostscript 10.x, while 22.04 ships 9.55. PDF handling was substantially rewritten between Ghostscript 9.x and 10.x, and from an OCR perspective the new version is significantly lower in quality: 10.x produces output that most PDF viewers render with extra word breaks in the middle of words.
In a challenging test document, Ghostscript 10 produces "Al l f i xt ur es and har dwar e wi l l be pr oper l y and s ecur el y i ns t al l ed." (3 words identified correctly)
while Ghostscript 9 produces "All fi xtures and h ardware wi ll be properly and securely i nstalled." (6 words identified correctly, still not great)
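The word counts above can be reproduced with a rough metric. The sketch below is only an illustration: the `EXPECTED` sentence is an assumed ground truth inferred from the two outputs, and the helper names (`words`, `correct_word_count`, `squash`) are hypothetical. It also shows that both outputs contain the right characters once whitespace is ignored, which suggests the regression here is in word segmentation rather than character recognition.

```python
import string
from collections import Counter

# Assumed ground truth, inferred from the two outputs quoted above.
EXPECTED = "All fixtures and hardware will be properly and securely installed."

GS10 = ("Al l f i xt ur es and har dwar e wi l l be pr oper l y "
        "and s ecur el y i ns t al l ed.")
GS9 = "All fi xtures and h ardware wi ll be properly and securely i nstalled."


def words(text):
    """Split on whitespace, lowercase, and strip surrounding punctuation."""
    tokens = (t.strip(string.punctuation).lower() for t in text.split())
    return [t for t in tokens if t]


def correct_word_count(ocr_text, expected_text):
    """Count OCR tokens that also occur in the expected text.

    This is an order-insensitive multiset overlap, so it is a rough
    metric, not a proper alignment.
    """
    overlap = Counter(words(ocr_text)) & Counter(words(expected_text))
    return sum(overlap.values())


def squash(text):
    """Remove all whitespace so only the character sequence is compared."""
    return "".join(text.split())


print(correct_word_count(GS10, EXPECTED))  # 3
print(correct_word_count(GS9, EXPECTED))   # 6
print(squash(GS10) == squash(EXPECTED))    # True
```

A whitespace-insensitive comparison like `squash` could keep a test suite stable across Ghostscript versions, at the cost of no longer detecting spacing regressions such as this one.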
What were you trying to do?
My requirements.txt file looks like this:
And I'm running ocrmypdf with the following params:
ocrmypdf {inputpdf} {outputpdf} --force-ocr --pages 1,2 --optimize 0 --tesseract-pagesegmode 6 --pdf-renderer 'hocr' --sidecar {outputtxt}
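For reproducing this from a test, the command above could be wrapped roughly as follows. This is a sketch, not the reporter's actual harness: `build_ocrmypdf_args` and `run_ocr` are hypothetical names, and it assumes `ocrmypdf` is on `PATH`. Passing the arguments as a list avoids shell quoting, so `hocr` needs no quotes.

```python
import subprocess


def build_ocrmypdf_args(inputpdf, outputpdf, outputtxt):
    """Return the argument list equivalent to the command line above."""
    return [
        "ocrmypdf",
        str(inputpdf),
        str(outputpdf),
        "--force-ocr",
        "--pages", "1,2",
        "--optimize", "0",
        "--tesseract-pagesegmode", "6",
        "--pdf-renderer", "hocr",
        "--sidecar", str(outputtxt),
    ]


def run_ocr(inputpdf, outputpdf, outputtxt):
    """Run ocrmypdf and return the sidecar text (raises on a non-zero exit)."""
    subprocess.run(build_ocrmypdf_args(inputpdf, outputpdf, outputtxt), check=True)
    with open(outputtxt, encoding="utf-8") as f:
        return f.read()
```

Running the same wrapper in a 22.04 container and a 24.04 container and diffing the returned sidecar text would isolate the difference to the system packages.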
Environment Details:
Expected Behavior:
OCR output should match the behavior observed when using Ubuntu 22.04, producing accurate and consistent text output.
Observed Behavior:
Misrecognized characters (e.g., "p" becomes "o").
Additional spaces introduced in the OCR output.
Additional Information:
The same setup works perfectly when using Ubuntu 22.04.
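Since the two Ubuntu releases differ in their Ghostscript major version, a test suite could record which one is installed and choose its expectations accordingly. The following is a minimal sketch under that assumption; `parse_major` and `ghostscript_major` are hypothetical helpers, and the executable is assumed to be named `gs`.

```python
import shutil
import subprocess


def parse_major(version_text):
    """Extract the major version from text such as '10.02.1' or '9.55.0'."""
    return int(version_text.strip().split(".")[0])


def ghostscript_major():
    """Return the installed Ghostscript major version, or None if gs is missing."""
    gs = shutil.which("gs")
    if gs is None:
        return None
    result = subprocess.run(
        [gs, "--version"], capture_output=True, text=True, check=True
    )
    return parse_major(result.stdout)
```

A test could then, for example, skip or relax its exact-text comparison when `ghostscript_major()` returns 10 or higher.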
Where are you installing/running from?
PyPI (pip, poetry, pipx, etc.)
OCRmyPDF version
16.6.2
What operating system are you working on?
Linux
Operating system details and version
Ubuntu 24.04
Simple sanity checks
Relevant log output
No response