Many words are missing when text is extracted from the PDF. Scenario: pages with a large amount of text (more than 1,000 words).
Hi @abhiwins, could you please provide some examples, including the input PDF and the output? Also, on which OS/platform did you run it?
Thank you
Attached are the PDF and an image of the output.
Validated on Ubuntu 20.04 and 24.04.
Test_pdf_word_issue.pdf
I have a similar issue. When I run it on a specific page, it works. But when I run it on the whole file at once, a character goes missing. pdftohtml v4.03 does not have this problem. https://www.kyobo.com/file/ajax/download?fName=/dtc/pdf/mm/1312890060288_%EB%AC%B4%EB%B0%B0%EB%8B%B9%EA%B5%90%EB%B3%B4%EA%B0%80%EC%A1%B1%EC%82%AC%EB%9E%91%ED%86%B5%ED%95%A9CI%EB%B3%B4%ED%97%98%20%EC%A4%91%EB%8F%84%EB%B6%80%EA%B0%80%ED%8A%B9%EC%95%BD%20%ED%86%B5%ED%95%A9%EC%95%BD%EA%B4%80_2011.08.01_.pdf On page 221, the character 약 is missing. This happens on both v0.4 and v0.5.
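In case it helps narrow this down, here is a minimal Python sketch for comparing two extraction results, for example the single-page output against the same page taken from the whole-file output. It assumes both outputs have already been saved as plain text. The function name `missing_tokens` is hypothetical and not part of this project or of pdftohtml.

```python
from collections import Counter


def missing_tokens(reference_text: str, extracted_text: str, by_char: bool = False) -> Counter:
    """Return tokens that occur more often in reference_text than in extracted_text.

    Hypothetical helper for illustration only. With by_char=True the comparison
    is per non-whitespace character, which suits scripts such as Korean where a
    single dropped character may not be a separate whitespace-delimited word.
    """
    def tokenize(text: str) -> list:
        if by_char:
            return [ch for ch in text if not ch.isspace()]
        return text.split()

    # Counter subtraction keeps only tokens with a positive remaining count.
    return Counter(tokenize(reference_text)) - Counter(tokenize(extracted_text))
```

Passing the known-good text as `reference_text` and the suspect text as `extracted_text` gives a count of what was dropped, which may be easier to attach to a report than a screenshot. The comparison ignores token order, so it shows what is missing but not where.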