Problem with coordinates in some PDF #43
Comments
The coordinates displayed by Preview for the selection are not correct: the origin is already shifted. I think the coordinates I reported from PDF.js are the correct/expected ones for an origin at (0,0) (x:57, y:90 for the top-left corner of "Association", and not x:85, y:126 from pdfalto or x:83.92, y:124.62 from Preview). Or somehow the page size is not correct and the coordinates should be shifted/rescaled accordingly.
OK, I think I know the problem: there are actually different levels of page boxes (media/crop/bleed), each used by particular printing equipment. When these boxes are not the same size, it leads to such issues. I'll see how to fix this.
So I've made a change to use the crop box by default instead of the media box: b14cd4e
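To illustrate why the choice of page box moves every coordinate, here is a minimal Python sketch (not pdfalto's actual code). It assumes coordinates are expressed from the top-left corner of the chosen box and that each box is given as `[llx, lly, urx, ury]` in PDF user space. The `Box` class, the `media_to_crop` helper and the crop box values are hypothetical: only the 662 x 860 page size comes from the report, and the crop box was picked so that the shift matches the one observed for "Association".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """A PDF page box in user space (origin bottom-left): [llx, lly, urx, ury]."""
    llx: float
    lly: float
    urx: float
    ury: float


def media_to_crop(x, y, media, crop):
    """Re-express a top-left-origin point relative to the crop box
    instead of the media box."""
    dx = crop.llx - media.llx  # how far the left edge moves right
    dy = media.ury - crop.ury  # how far the top edge moves down
    return x - dx, y - dy


# Hypothetical boxes: only the 662 x 860 page size is taken from the report.
media = Box(0, 0, 662, 860)
crop = Box(28, 40, 634, 824)

print(media_to_crop(85, 126, media, crop))  # (57, 90)
```

When the two boxes are identical, the shift is zero, which would explain why the usual documents are unaffected.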
This should be a dynamic option on the pdfalto command line, what do you think?
Just a reminder, this was a legacy from pdf2xml.
Yes, the issue was from pdf2xml! Your fix entirely solves the issue for all my example cases, and everything is fine with usual documents, so many thanks!
I am trying to trace the problem of incorrect coordinates for string elements in some PDFs. One example is the attached PubMed Central PDF. Using it with GROBID and the PDF.js document display + annotations, we see that the bounding boxes for the annotations are not correct (while usually they are!).
The problem is apparently coming from pdfalto, but I am not sure whether it comes from an incorrect page dimension or an incorrect origin point on the page for the string coordinates.
In the attached PDF, all page dimensions are x:662, y:860. On the first page, the first token "Association" is positioned at x:85, y:126, w:115, h:17.8. The x/y proportion is visually incorrect; x and y should be x:57, y:90 (from PDF.js).
On the second page, the first token "Xia" is positioned at x:71, y:64, w:10, h:7.3; once again, x/y is visually clearly not correct. It should be x:42, y:21 (from PDF.js).
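As a rough sanity check on the two hypotheses (shifted origin vs. wrong page scale), the reported values can be compared directly. The short Python sketch below uses only the numbers quoted above; the `tokens` table and the `offsets` and `ratios` helpers are illustrative names, not part of pdfalto or PDF.js.

```python
# (pdfalto, PDF.js) top-left coordinates of each token, as reported above
tokens = {
    "Association": ((85, 126), (57, 90)),
    "Xia": ((71, 64), (42, 21)),
}


def offsets(tokens):
    """Difference pdfalto - PDF.js for each token, as (dx, dy)."""
    return {
        name: (alto[0] - pdfjs[0], alto[1] - pdfjs[1])
        for name, (alto, pdfjs) in tokens.items()
    }


def ratios(tokens):
    """Ratio pdfalto / PDF.js for each token, as (rx, ry)."""
    return {
        name: (alto[0] / pdfjs[0], alto[1] / pdfjs[1])
        for name, (alto, pdfjs) in tokens.items()
    }


print(offsets(tokens))  # {'Association': (28, 36), 'Xia': (29, 43)}
print(ratios(tokens))
```

The x offsets are nearly identical (28 and 29) while the ratios differ widely between the two tokens, which looks more consistent with a shifted origin than with a uniform rescaling. The y offsets agree less well (36 vs. 43), so this is only an indication, not a proof.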
Looking at XmlAltoOutputDev.cc and TextPage::startPage, page coordinates come from GfxState and the page box, but then I saw nothing that looks really related to this :/
PMC5348138.pdf