[Bug]: OCR Output Quality Regression on Ubuntu 24.04 #1439

guilhermebferreira · 2024-12-02T23:33:33Z

What were you trying to do?

After upgrading my container's base image from Ubuntu 22.04 to Ubuntu 24.04, I started experiencing minor but consistent issues with the OCR output generated by ocrmypdf.

I have a suite of unit tests that exercise the OCR step. They had been stable for some time, but some of them started failing after the upgrade. Each test takes a PDF file as input and compares the result with an expected output.
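For anyone debugging a similar suite, a similarity score can be more informative than an exact string comparison, because it shows how far the output drifted instead of only pass or fail. The following is a minimal sketch, not the reporter's actual test code; the function names `normalize_whitespace` and `ocr_similarity` are hypothetical, and any pass threshold built on top of them is a judgment call.

```python
import difflib


def normalize_whitespace(text: str) -> str:
    """Collapse every run of whitespace into one space and trim both ends."""
    return " ".join(text.split())


def ocr_similarity(expected: str, actual: str) -> float:
    """Return a ratio in [0, 1]; 1.0 means equal after whitespace normalization."""
    a = normalize_whitespace(expected)
    b = normalize_whitespace(actual)
    # autojunk=False keeps the matcher from discarding frequent characters
    # on longer texts, which would skew the score.
    return difflib.SequenceMatcher(None, a, b, autojunk=False).ratio()
```

Normalization only absorbs differences in the amount of whitespace. A space inserted in the middle of a word still lowers the score, which is the kind of drift described in this report.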

My requirements.txt file looks like this:


cffi==1.17.1
charset-normalizer==3.4.0
cryptography==44.0.0
deprecated==1.2.15
deprecation==2.1.0
grpcio==1.68.1
img2pdf==0.5.1
lxml==5.3.0
markdown-it-py==3.0.0
mdurl==0.1.2
ocrmypdf==16.6.2
packaging==24.2
pdfminer-six==20240706
pi-heif==0.21.0
pikepdf==9.4.2
pillow==11.0.0
pluggy==1.5.0
protobuf==3.20.3
pycparser==2.22
pygments==2.18.0
rich==13.9.4
typing-extensions==4.12.2
wrapt==1.17.0

And I'm running ocrmypdf with the following params:

ocrmypdf {inputpdf} {outputpdf} --force-ocr --pages 1,2 --optimize 0 --tesseract-pagesegmode 6 --pdf-renderer 'hocr' --sidecar {outputtxt}
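When a test harness drives this command, building the argument list in one place keeps the flags identical across both base images. This is a sketch under that assumption; `build_ocrmypdf_args` is a hypothetical helper, and only the flags quoted in the command above are used.

```python
def build_ocrmypdf_args(input_pdf, output_pdf, sidecar_txt):
    """Return the argv list for the ocrmypdf invocation quoted in this report."""
    return [
        "ocrmypdf",
        str(input_pdf),
        str(output_pdf),
        "--force-ocr",
        "--pages", "1,2",
        "--optimize", "0",
        "--tesseract-pagesegmode", "6",
        "--pdf-renderer", "hocr",
        "--sidecar", str(sidecar_txt),
    ]
```

The list can be passed to `subprocess.run(args, check=True)`, which avoids shell quoting issues with the file paths.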

Environment Details:

OS: Ubuntu 24.04
System Packages:
- Ghostscript: 10.02.1
- Tesseract: 5.3.4
- pngquant: 2.18.0
- Unpaper: 7.0.0

Expected Behavior:

OCR output should match the behavior observed when using Ubuntu 22.04, producing accurate and consistent text output.

Observed Behavior:

- Misrecognized characters (e.g., "p" becomes "o").
- Additional spaces introduced in the OCR output.

Additional Information:

The same setup works perfectly when using Ubuntu 22.04.

Where are you installing/running from?

PyPI (pip, poetry, pipx, etc.)

OCRmyPDF version

16.6.2

What operating system are you working on?

Linux

Operating system details and version

Ubuntu 24.04

Simple sanity checks

- Operating system is currently supported by its vendor (not end of life)
- Python version is compatible with OCRmyPDF
- This issue is not about a specific input file

Relevant log output

No response


jbarlow83 · 2024-12-05T10:16:48Z

I did some experiments. The main difference is that Ubuntu 24.04 provides a different version of Ghostscript, 10.x, while 22.04 provides 9.55. There was a major rewrite of PDF handling between 9.x and 10.x in Ghostscript, and the new version is significantly lower in quality from an OCR perspective -- 10.x produces output that most PDF viewers will see as extra word breaks in the middle of words.

In a challenging test document, Ghostscript 10 produces
"Al l f i xt ur es and har dwar e wi l l be pr oper l y and s ecur el y i ns t al l ed." (3 words identified correctly)
while Ghostscript 9 produces
"All fi xtures and h ardware wi ll be properly and securely i nstalled." (6 words identified correctly, still not great)
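One way to arrive at counts like these is to compare word multisets against a reference sentence. The sketch below is an illustration, not necessarily how the figures above were obtained: the reference sentence is an assumption inferred from the two outputs, and `words` and `correct_word_count` are hypothetical names.

```python
from collections import Counter


def words(text):
    """Lowercase the tokens, strip edge punctuation, and count each word."""
    stripped = (tok.strip(".,;:!?").lower() for tok in text.split())
    return Counter(tok for tok in stripped if tok)


def correct_word_count(reference, ocr_output):
    """Count reference words that appear intact in the OCR output."""
    # Counter intersection keeps the minimum count per word, so a word that
    # occurs twice in the reference is credited at most twice.
    return sum((words(reference) & words(ocr_output)).values())


# Assumed reference text, inferred from the two outputs quoted above.
REFERENCE = "All fixtures and hardware will be properly and securely installed."
GS10 = ("Al l f i xt ur es and har dwar e wi l l be pr oper l y "
        "and s ecur el y i ns t al l ed.")
GS9 = "All fi xtures and h ardware wi ll be properly and securely i nstalled."

print(correct_word_count(REFERENCE, GS10))  # 3
print(correct_word_count(REFERENCE, GS9))   # 6
```

Under this counting rule, both figures match the ones quoted above, out of 10 words in the assumed reference.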

jbarlow83 · 2024-12-06T21:19:49Z

@stumpylog I think paperless-ngx should consider pinning Ghostscript 9, based on my findings so far.
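For projects that cannot pin the package right away, a runtime check on the Ghostscript version can at least flag the environment. This is a minimal sketch, assuming the executable is named `gs` and that `gs --version` prints a dotted version string such as `10.02.1`; `parse_version` and `ghostscript_version` are hypothetical helpers, not part of OCRmyPDF.

```python
import re
import shutil
import subprocess


def parse_version(text):
    """Parse '10.02.1' or '9.55' into a tuple of ints, e.g. (10, 2, 1)."""
    match = re.search(r"(\d+)\.(\d+)(?:\.(\d+))?", text)
    if match is None:
        raise ValueError(f"no version number found in {text!r}")
    # A missing patch component is treated as 0.
    return tuple(int(part) for part in match.groups(default="0"))


def ghostscript_version(executable="gs"):
    """Return the installed Ghostscript version, or None if it is not on PATH."""
    if shutil.which(executable) is None:
        return None
    result = subprocess.run(
        [executable, "--version"], capture_output=True, text=True, check=True
    )
    return parse_version(result.stdout)
```

A test suite could call `ghostscript_version()` once at startup and warn, or skip the text comparison tests, when the major version is 10 or higher.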

guilhermebferreira added the triage Issue needs triage label Dec 2, 2024

guilhermebferreira assigned jbarlow83 Dec 2, 2024
