PDF parsing is poor #7

Open
Roznoshchik opened this issue Feb 12, 2022 · 6 comments
Comments

@Roznoshchik
Owner

The current PDF library leaves a lot to be desired.

It only works for simple PDFs with plain images and text.

Anything more complex, with graphs, charts, etc., comes through very poorly.

One idea is to just work with PDFs as images, and then possibly run OCR on the text content.

But there is a lot that needs to be explored there to render things properly so that it works with Lurnby.

@Artaud

Artaud commented Feb 14, 2022

I have had great success parsing very complex PDFs to HTML using pdf2htmlEX, especially this fork: https://github.com/pdf2htmlEX/pdf2htmlEX (the original is unmaintained).
It doesn't do OCR, though.
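
For anyone who wants to try this from Python, here is a minimal sketch of calling the pdf2htmlEX command-line tool through `subprocess`. It assumes the `pdf2htmlEX` binary is installed and on `PATH`; the helper names `build_pdf2htmlex_command` and `convert_pdf_to_html` are hypothetical and not part of any library.

```python
import shutil
import subprocess
from pathlib import Path


def build_pdf2htmlex_command(pdf_path, dest_dir):
    """Build the argument list for one pdf2htmlEX run (no shell involved)."""
    pdf = Path(pdf_path)
    return [
        "pdf2htmlEX",
        "--dest-dir", str(dest_dir),  # directory that receives the HTML output
        str(pdf),                     # input PDF
        pdf.stem + ".html",           # output file name inside dest_dir
    ]


def convert_pdf_to_html(pdf_path, dest_dir):
    """Run pdf2htmlEX and return the path of the generated HTML file."""
    # Assumption: the binary is installed separately; it is not a Python package.
    if shutil.which("pdf2htmlEX") is None:
        raise RuntimeError("pdf2htmlEX binary not found on PATH")
    Path(dest_dir).mkdir(parents=True, exist_ok=True)
    subprocess.run(build_pdf2htmlex_command(pdf_path, dest_dir), check=True)
    return Path(dest_dir) / (Path(pdf_path).stem + ".html")
```

Passing the arguments as a list avoids shell quoting problems with file names that contain spaces.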

@Roznoshchik
Owner Author

Thanks @Artaud,

I'll try to play with this and see how it works. I think my biggest concern is how it would work on mobile, but I guess that should be secondary to actually having it work for the majority of inputs.

A brief look at some of the samples showed that on mobile there isn't any re-rendering; the whole page just shrinks to a tiny size.

@Roznoshchik
Owner Author

Looking at this more closely, pdf2htmlEX does seem promising, but it's not a Python package, which means that using it on Heroku, where I'm currently hosting the app, would require some extra work.

I'm not sure how to compile C apps to run on Heroku, so the best bet seems to be to convert to a Docker deployment and deploy the Docker image to Heroku.

I've started that process, but it involves quite a lot of changes, so we'll see how it goes.

@Roznoshchik
Owner Author

Roznoshchik commented Feb 24, 2022

I was able to get pdf2htmlEX running in the Docker container, but it's not working with some of the PDFs, likely because of some missing font libraries.

But on closer look I realized that I was mostly able to get the same output using pymupdf which I was already using. I just wasn't using the automatic html conversion. I was building the html manually.
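
As an illustration of the automatic HTML conversion mentioned above, here is a minimal sketch using PyMuPDF's `page.get_text("html")`. It assumes PyMuPDF is installed (imported as `fitz`); the helper names `wrap_pages` and `pdf_to_html` are hypothetical, and this is not Lurnby's actual code.

```python
def wrap_pages(page_fragments):
    """Join per-page HTML fragments, one <section> per page."""
    sections = [
        f'<section class="pdf-page" data-page="{number}">\n{fragment}\n</section>'
        for number, fragment in enumerate(page_fragments, start=1)
    ]
    return "\n".join(sections)


def pdf_to_html(pdf_path):
    """Convert every page of a PDF to HTML with PyMuPDF's built-in exporter."""
    import fitz  # PyMuPDF, third-party; imported lazily so wrap_pages works without it

    with fitz.open(pdf_path) as doc:
        # get_text("html") emits absolutely positioned elements with inline CSS.
        fragments = [page.get_text("html") for page in doc]
    return wrap_pages(fragments)
```

The output keeps the page's visual layout, which is exactly why it carries so much inline styling.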

And I remembered why I made that decision. Both pymupdf and pdf2htmlEX convert the pdf to html, but they do so with a lot of inline css to render the page exactly the same.

This kills many of Lurnby's reader functions like dark/light mode, font size adjustments, etc. It also makes highlighting text a bit annoying because of the way it's rendered. Removing the inline CSS doesn't lead to great layouts either.
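
To make the "removing the inline CSS" experiment concrete, here is a minimal sketch that drops every `style` attribute using only the standard library's `html.parser`. The names `InlineStyleStripper` and `strip_inline_styles` are hypothetical. It leaves `<style>` blocks and class names untouched, so it only covers the inline part of the problem.

```python
from html import escape
from html.parser import HTMLParser


class InlineStyleStripper(HTMLParser):
    """Re-emit HTML unchanged except that style="..." attributes are dropped."""

    def __init__(self):
        # Keep character references as written instead of decoding them.
        super().__init__(convert_charrefs=False)
        self.out = []

    def _emit_tag(self, tag, attrs, closing=""):
        kept = [(name, value) for name, value in attrs if name != "style"]
        text = "".join(
            f" {name}" if value is None else f' {name}="{escape(value, quote=True)}"'
            for name, value in kept
        )
        self.out.append(f"<{tag}{text}{closing}>")

    def handle_starttag(self, tag, attrs):
        self._emit_tag(tag, attrs)

    def handle_startendtag(self, tag, attrs):
        self._emit_tag(tag, attrs, " /")

    def handle_endtag(self, tag):
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append(f"&{name};")

    def handle_charref(self, name):
        self.out.append(f"&#{name};")

    def handle_comment(self, data):
        self.out.append(f"<!--{data}-->")

    def handle_decl(self, decl):
        self.out.append(f"<!{decl}>")


def strip_inline_styles(html):
    """Return html with all inline style attributes removed."""
    parser = InlineStyleStripper()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)
```

Once the positioning styles are gone, the text falls back to normal document flow, which is consistent with the poor layouts described above.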

All of this is maybe fine, but the way in which I'm currently rendering the article content into the reader means that many PDFs, even those converted to HTML using those libraries, will completely break the page layout. To pursue that option, I would need to render a separate reader for PDFs to account for any changes.

Which isn't necessarily a bad thing. Just requires a lot more research and testing to determine if that's the best way forward or not.

Another not-so-great option that I'm considering is to work with PDFs in image format. pymupdf has an option to convert a PDF page to an image. This has its own drawbacks, obviously. The text isn't selectable, it doesn't work for mobile and desktop, etc.

But, it aligns with another feature I'm considering which is the ability to highlight images.

I'm looking at incorporating Mozilla's screenshot library.

This would allow me to capture a portion of a page and then save that image. This way, an image pdf would possibly still be able to be annotated and worked on.

In short, I'm looking at a bunch of seemingly suboptimal options.
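
For reference, the page-to-image option mentioned above can be sketched with PyMuPDF's `page.get_pixmap()`. This assumes PyMuPDF is installed (imported as `fitz`); the helper names, the file naming scheme, and the default of 150 DPI are hypothetical choices for illustration only.

```python
from pathlib import Path


def page_image_name(pdf_path, page_index):
    """Name the PNG for a zero-based page index, e.g. report-p001.png."""
    return f"{Path(pdf_path).stem}-p{page_index + 1:03d}.png"


def render_pages_to_png(pdf_path, out_dir, dpi=150):
    """Render each page of a PDF to a PNG file and return the written paths."""
    import fitz  # PyMuPDF, third-party; imported lazily

    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    zoom = dpi / 72  # PDF user space is 72 units per inch
    matrix = fitz.Matrix(zoom, zoom)
    written = []
    with fitz.open(pdf_path) as doc:
        for index, page in enumerate(doc):
            pixmap = page.get_pixmap(matrix=matrix)
            target = out / page_image_name(pdf_path, index)
            pixmap.save(str(target))
            written.append(target)
    return written
```

A higher DPI gives sharper text on small screens at the cost of larger files.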

@ghost

ghost commented Jun 9, 2023

@Roznoshchik do we have any updates on this?

@Roznoshchik
Owner Author

No, unfortunately. I have been too busy to do anything on this, and the Readwise team has been killing it, so it hasn't felt like there was a strong need for this.

I personally haven't been reading too many PDFs either, so it hasn't been a priority.
