Missing Words while extracting from PDF #167

Open
abhiwins opened this issue Aug 14, 2024 · 3 comments

Comments

@abhiwins

Many words are missing when text is extracted from the PDF.
Scenario: this happens on text-heavy pages with more than 1000 words.

@lfoppiano
Collaborator

Hi @abhiwins, could you please provide some examples, including the input PDF and the output?
Also, on which OS/platform did you run it?

Thank you

@abhiwins
Author

abhiwins commented Aug 23, 2024

I have attached the PDF and an image of the output.

Validated on Ubuntu 20.04 and 24.04.

Test_pdf_word_issue
Test_pdf_word_issue.pdf

@calee88

calee88 commented Nov 29, 2024

I have a similar issue. When I run it on a specific page, it works. But when I process the whole file at once, a character is missing. pdftohtml v4.03 does not have this problem with the same file. https://www.kyobo.com/file/ajax/download?fName=/dtc/pdf/mm/1312890060288_%EB%AC%B4%EB%B0%B0%EB%8B%B9%EA%B5%90%EB%B3%B4%EA%B0%80%EC%A1%B1%EC%82%AC%EB%9E%91%ED%86%B5%ED%95%A9CI%EB%B3%B4%ED%97%98%20%EC%A4%91%EB%8F%84%EB%B6%80%EA%B0%80%ED%8A%B9%EC%95%BD%20%ED%86%B5%ED%95%A9%EC%95%BD%EA%B4%80_2011.08.01_.pdf
On page 221, the character 약 is dropped.
This happens on both v0.4 and v0.5.
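
In case it helps narrow this down, here is a rough sketch of how the two outputs could be compared once both have been reduced to plain text. It is tool-agnostic and does not call the extractor itself. The function name `missing_tokens` and the sample strings are made up for illustration and are not taken from the actual PDF output.

```python
from collections import Counter


def missing_tokens(reference: str, candidate: str, by_char: bool = False) -> Counter:
    """Return the tokens that occur more often in `reference` than in `candidate`.

    Tokens are whitespace-separated words by default, or individual
    non-whitespace characters when `by_char` is True.
    """
    if by_char:
        ref_tokens = [c for c in reference if not c.isspace()]
        cand_tokens = [c for c in candidate if not c.isspace()]
    else:
        ref_tokens = reference.split()
        cand_tokens = candidate.split()
    # Counter subtraction keeps only the positive differences.
    return Counter(ref_tokens) - Counter(cand_tokens)


if __name__ == "__main__":
    # Hypothetical strings standing in for the two extractions of one page:
    # the single-page run (treated as the reference) and the whole-file run.
    single_page_text = "통합 약관"
    whole_file_text = "통합 관"
    print(missing_tokens(single_page_text, whole_file_text, by_char=True))
```

Character-level comparison is probably the more useful mode for Korean text, since one dropped character changes the whole word and a word-level diff would only report the damaged word.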


3 participants