[Suggestion] Reporting the byte location of images #161

keto33 · 2023-11-03T22:57:31Z

I have tested many PDF-to-text programs, and this one is the most robust. However, handling images is always a question since they are heavy objects and usually unnecessary. If I am correct, starting from version 0.7, GROBID dropped the option of extracting images.

I suggest adding an option to save the byte location of image elements instead of saving the image to disk. In this case, we can later read the image directly from the PDF file whenever needed instead of storing all images on the disk.

Implementing this feature should be trivial since the location and length of the image objects are already known to pdfalto.
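
To make the idea concrete, here is a minimal sketch of how a consumer could use such a byte location afterwards. It assumes pdfalto (or any other tool) has reported an offset and a length for the raw image stream; the function names read_raw_stream and looks_like_jpeg are hypothetical and not part of pdfalto, xpdf, or GROBID. Note that the bytes are still encoded with the stream's filter: a DCTDecode stream is a plain JPEG and can be saved as-is, while other filters (for example FlateDecode) would need decoding first.

```python
def read_raw_stream(pdf_path, offset, length):
    """Return `length` raw bytes starting at byte `offset` of the PDF file.

    `offset` and `length` are assumed to come from an external report
    (the feature suggested in this issue); nothing here parses the PDF.
    """
    with open(pdf_path, "rb") as f:
        f.seek(offset)
        data = f.read(length)
    if len(data) != length:
        raise ValueError("byte range extends past the end of the file")
    return data


def looks_like_jpeg(data):
    """Cheap check: JPEG data starts with the SOI marker FF D8."""
    return data[:2] == b"\xff\xd8"
```

With this, the images never need to be written to disk at extraction time; only the (offset, length) pair has to be stored next to the ALTO output.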

kermitt2 · 2023-11-06T10:26:22Z

Thank you @keto33 !

In pdfalto, you can choose to extract and process images (embedded bitmap and vector graphics) or not with the argument -noImage.

Grobid can extract images (called "assets") or skip them:

using the service processFulltextAssetDocument instead of processFulltextDocument; it returns a zip with the XML and the images,
using the option ignoreAssets in the batch command https://grobid.readthedocs.io/en/latest/Grobid-batch/#processfulltext

Using the byte location in the PDF as an alternative seems like a good idea for pdfalto! It could then be used by Grobid, for example. I don't know whether it would be portable or exactly how to do it, but I will look into it.

keto33 · 2023-11-12T01:10:54Z

Thanks for your kind attention, @kermitt2 !

This actually should be done in xpdf rather than pdfalto itself. I am not a C++ expert, but I am a little bit familiar with xpdf, as I tweaked it for a project.

Since images are stored as stream objects in PDF, xpdf fetches and writes them through the line str = ((DCTStream *)str)->getRawStream(); in ImageOutputDev.cc. Therefore, we need to add a new function, say getStreamBoundaries, in Stream.cc that returns the location and length of the object instead of its content.

I think the existing function getStreamIndex, defined in Parser.h, does part of the job.
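
As an illustration of what such boundaries would be, here is a rough sketch in Python rather than xpdf's C++. It does not use or mirror xpdf's API; stream_boundaries is a hypothetical helper, and it assumes the byte offset of the image object (the position of its "N G obj" header) is already known. It reads a direct /Length value from the stream dictionary when there is one and otherwise falls back to searching for the endstream keyword.

```python
import re

# A direct integer /Length, but not an indirect reference such as "12 0 R".
_DIRECT_LENGTH = re.compile(rb"/Length\s+(\d+)\b(?!\s+\d+\s+R\b)")


def stream_boundaries(pdf, obj_offset):
    """Return (start, length) of the raw stream data of the object at obj_offset.

    Naive sketch: assumes the first "stream" keyword after obj_offset
    belongs to this object and that the file is not damaged.
    """
    kw = pdf.find(b"stream", obj_offset)
    if kw < 0:
        raise ValueError("no stream keyword after the given offset")
    start = kw + len(b"stream")
    # The keyword is followed by CRLF or LF before the data begins.
    if pdf[start:start + 2] == b"\r\n":
        start += 2
    elif pdf[start:start + 1] == b"\n":
        start += 1

    match = _DIRECT_LENGTH.search(pdf[obj_offset:kw])
    if match:
        return start, int(match.group(1))

    # Fallback for an indirect /Length: scan for the closing keyword
    # and drop the end-of-line marker that precedes it.
    end = pdf.find(b"endstream", start)
    if end < 0:
        raise ValueError("no endstream keyword found")
    if pdf[start:end].endswith(b"\r\n"):
        end -= 2
    elif pdf[start:end].endswith((b"\n", b"\r")):
        end -= 1
    return start, end - start
```

A real implementation inside xpdf would not need to scan for keywords, since the parser already knows where the stream starts when it builds the Stream object; the sketch only shows which two numbers would have to be reported.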

kermitt2 added the enhancement New feature or request label Nov 6, 2023
