[Bug]: OCR Output Quality Regression on Ubuntu 24.04 #1439

Open
3 tasks done
guilhermebferreira opened this issue Dec 2, 2024 · 2 comments
Labels
triage Issue needs triage

Comments

@guilhermebferreira

What were you trying to do?

After upgrading my container's base image from Ubuntu 22.04 to Ubuntu 24.04, I started experiencing minor but consistent issues with the OCR output generated by ocrmypdf.

I have a suite of unit tests that exercise the OCR step. They had been stable for some time, but some of them started failing after the upgrade. Each test takes a PDF file as input and compares the result with an expected output.

My requirements.txt file looks like this:


cffi==1.17.1
charset-normalizer==3.4.0
cryptography==44.0.0
deprecated==1.2.15
deprecation==2.1.0
grpcio==1.68.1
img2pdf==0.5.1
lxml==5.3.0
markdown-it-py==3.0.0
mdurl==0.1.2
ocrmypdf==16.6.2
packaging==24.2
pdfminer-six==20240706
pi-heif==0.21.0
pikepdf==9.4.2
pillow==11.0.0
pluggy==1.5.0
protobuf==3.20.3
pycparser==2.22
pygments==2.18.0
rich==13.9.4
typing-extensions==4.12.2
wrapt==1.17.0

And I'm running ocrmypdf with the following params:

ocrmypdf {inputpdf} {outputpdf} --force-ocr --pages 1,2 --optimize 0 --tesseract-pagesegmode 6 --pdf-renderer 'hocr' --sidecar {outputtxt}
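
For test suites that call the command from Python, the invocation above could be assembled as an argument list, which avoids shell quoting problems. This is only a sketch: the flags are copied from the command line above, while the helper names `build_ocrmypdf_command` and `run_ocr` are hypothetical and not part of OCRmyPDF.

```python
import subprocess


def build_ocrmypdf_command(input_pdf, output_pdf, sidecar_txt):
    """Return the argv list for the ocrmypdf call shown in this report.

    Hypothetical helper: only the flags come from the report itself.
    """
    return [
        "ocrmypdf",
        str(input_pdf),
        str(output_pdf),
        "--force-ocr",
        "--pages", "1,2",
        "--optimize", "0",
        "--tesseract-pagesegmode", "6",
        "--pdf-renderer", "hocr",
        "--sidecar", str(sidecar_txt),
    ]


def run_ocr(input_pdf, output_pdf, sidecar_txt):
    """Run ocrmypdf and raise CalledProcessError if it exits non-zero."""
    cmd = build_ocrmypdf_command(input_pdf, output_pdf, sidecar_txt)
    subprocess.run(cmd, check=True)
```

Passing a list rather than a formatted string means the quotes around `hocr` are not needed.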

Environment Details:

  • OS: Ubuntu 24.04
  • System Packages:
    • Ghostscript: 10.02.1
    • Tesseract: 5.3.4
    • pngquant: 2.18.0
    • Unpaper: 7.0.0
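
To compare the two base images, it may help to record the versions of the external tools from inside each container. The sketch below assumes each tool accepts `--version` and that the binaries are named `gs`, `tesseract`, `pngquant`, and `unpaper`. The function names are hypothetical.

```python
import shutil
import subprocess

# Assumed binary names for the system packages listed above.
TOOLS = ("gs", "tesseract", "pngquant", "unpaper")


def tool_version(name, timeout=10):
    """Return the first line printed by `<name> --version`, or None."""
    path = shutil.which(name)
    if path is None:
        return None  # tool is not on PATH
    try:
        proc = subprocess.run(
            [path, "--version"],
            capture_output=True,
            text=True,
            timeout=timeout,
        )
    except (OSError, subprocess.TimeoutExpired):
        return None
    # Some tools print their version to stderr instead of stdout.
    output = (proc.stdout or proc.stderr).strip()
    return output.splitlines()[0] if output else None


def collect_versions(tools=TOOLS):
    """Map each tool name to its reported version line (or None)."""
    return {name: tool_version(name) for name in tools}


if __name__ == "__main__":
    for name, version in collect_versions().items():
        print(f"{name}: {version or 'not found'}")
```

Running this in both the 22.04 and 24.04 images and diffing the output shows which system packages changed between them.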

Expected Behavior:

OCR output should match the behavior observed when using Ubuntu 22.04, producing accurate and consistent text output.

Observed Behavior:

  • Misrecognized characters (e.g., "p" becomes "o").
  • Additional spaces introduced in the OCR output.

Additional Information:

The same setup works perfectly when using Ubuntu 22.04.

Where are you installing/running from?

PyPI (pip, poetry, pipx, etc.)

OCRmyPDF version

16.6.2

What operating system are you working on?

Linux

Operating system details and version

Ubuntu 24.04

Simple sanity checks

  • Operating system is currently supported by its vendor (not end of life)
  • Python version is compatible with OCRmyPDF
  • This issue is not about a specific input file

Relevant log output

No response

@jbarlow83
Collaborator

jbarlow83 commented Dec 5, 2024

I did some experiments. The main difference is that Ubuntu 24.04 provides Ghostscript 10.x, while 22.04 provides 9.55. There was a major rewrite of PDF handling between 9.x and 10.x in Ghostscript, and from an OCR perspective the new version is significantly lower in quality: 10.x produces output that most PDF viewers will see as extra word breaks in the middle of words.

In a challenging test document, Ghostscript 10 produces
"Al l f i xt ur es and har dwar e wi l l be pr oper l y and s ecur el y i ns t al l ed." (3 words identified correctly)
while Ghostscript 9 produces
"All fi xtures and h ardware wi ll be properly and securely i nstalled." (6 words identified correctly, still not great)

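The word counts quoted above can be reproduced with a small script. This is a sketch under one assumption: the reference sentence is inferred from the two sample outputs and is not given in the thread. The function names are hypothetical. The script counts reference words that survive intact, and also checks whether the output differs from the reference only in spacing.

```python
import string
from collections import Counter

# Assumption: the intended sentence, inferred from the sample outputs.
REFERENCE = "All fixtures and hardware will be properly and securely installed."
GS10 = ("Al l f i xt ur es and har dwar e wi l l be pr oper l y and "
        "s ecur el y i ns t al l ed.")
GS9 = "All fi xtures and h ardware wi ll be properly and securely i nstalled."


def words(text):
    """Split on whitespace and strip surrounding punctuation."""
    stripped = (w.strip(string.punctuation) for w in text.split())
    return [w for w in stripped if w]


def correct_word_count(reference, ocr_text):
    """Count reference words that also appear intact in the OCR text."""
    overlap = Counter(words(reference)) & Counter(words(ocr_text))
    return sum(overlap.values())


def differs_only_in_spacing(reference, ocr_text):
    """True if the texts are identical once all whitespace is removed."""
    return "".join(reference.split()) == "".join(ocr_text.split())


print(correct_word_count(REFERENCE, GS10))       # 3
print(correct_word_count(REFERENCE, GS9))        # 6
print(differs_only_in_spacing(REFERENCE, GS10))  # True
print(differs_only_in_spacing(REFERENCE, GS9))   # True
```

In this sample both outputs contain the right characters and differ from the reference only in where the spaces fall, which matches the description of extra word breaks.
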
@jbarlow83
Collaborator

jbarlow83 commented Dec 6, 2024

@stumpylog I think paperless-ngx should consider pinning Ghostscript 9, based on my findings so far.
