Missing Words while extracting from PDF #167

Open

abhiwins opened this issue Aug 14, 2024 · 3 comments

abhiwins commented Aug 14, 2024

Many words are missing when text is extracted from the PDF.
Scenario: this happens on text-heavy pages with more than 1,000 words.

Collaborator

lfoppiano commented Aug 22, 2024

Hi @abhiwins, could you please provide some examples, including the input PDF and the output? Also, on which OS/platform did you run it?

Thank you

Author

abhiwins commented Aug 23, 2024 (edited)

I have attached the PDF and an image of the output.

Validated on Ubuntu 20.04 and 24.04.

Test_pdf_word_issue.pdf

calee88 commented Nov 29, 2024 (edited)

I have a similar issue. When I run it on a specific page, it works, but when I process the whole file at once, a character goes missing. pdftohtml v4.03 has no such problem with the same file: https://www.kyobo.com/file/ajax/download?fName=/dtc/pdf/mm/1312890060288_%EB%AC%B4%EB%B0%B0%EB%8B%B9%EA%B5%90%EB%B3%B4%EA%B0%80%EC%A1%B1%EC%82%AC%EB%9E%91%ED%86%B5%ED%95%A9CI%EB%B3%B4%ED%97%98%20%EC%A4%91%EB%8F%84%EB%B6%80%EA%B0%80%ED%8A%B9%EC%95%BD%20%ED%86%B5%ED%95%A9%EC%95%BD%EA%B4%80_2011.08.01_.pdf
On page 221, the character 약 is dropped.
This happens on both v0.4 and v0.5.
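
To pin down exactly which characters are lost, the per-page output and the whole-file output can be compared directly. Below is a minimal Python sketch, not part of the tool itself. The function name `missing_chars` and the sample strings are made up for illustration, and it assumes both outputs have already been reduced to plain text.

```python
from collections import Counter


def missing_chars(reference: str, candidate: str) -> Counter:
    """Count characters that occur more often in `reference` than in `candidate`.

    Whitespace is ignored, so differences in line breaks or spacing
    between the two extractions do not show up as missing characters.
    """
    ref_counts = Counter(ch for ch in reference if not ch.isspace())
    cand_counts = Counter(ch for ch in candidate if not ch.isspace())
    # Counter subtraction keeps only characters with a positive remainder.
    return ref_counts - cand_counts


if __name__ == "__main__":
    # Hypothetical sample text standing in for the two extraction results.
    per_page_text = "보험 약관"
    whole_file_text = "보험 관"
    for ch, count in missing_chars(per_page_text, whole_file_text).items():
        print(f"{ch!r} is missing {count} time(s)")
```

Passing the text from the single-page run as `reference` and the text from the whole-file run as `candidate` lists only the characters that disappeared. It compares counts rather than positions, so it shows what was lost but not where on the page.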
