Memory management for large PDFs #19
I think unidoc/#128 (lazy loading) would help with this. We could also put a size limit on the object cache, probably freeing the objects with the oldest load timestamp first, and then just load them again when needed. Note: Lazy loading is supported in v3 with NewPdfReaderLazy().
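The size-limited cache idea above could be sketched roughly as follows. This is a minimal illustration only, not unidoc code: `objectCache`, `cacheEntry`, `newObjectCache` and the `load` callback are hypothetical names, and insertion order in a list stands in for the "oldest load timestamp first" rule.

```go
package main

import "container/list"

// cacheEntry is one loaded object (hypothetical type, not part of unidoc).
type cacheEntry struct {
	objNum int
	data   []byte
}

// objectCache keeps loaded objects under a byte budget and evicts the
// object that was loaded longest ago first.
type objectCache struct {
	maxBytes int
	curBytes int
	order    *list.List            // front = oldest load
	items    map[int]*list.Element // object number -> list element
	load     func(objNum int) ([]byte, error)
}

func newObjectCache(maxBytes int, load func(int) ([]byte, error)) *objectCache {
	return &objectCache{
		maxBytes: maxBytes,
		order:    list.New(),
		items:    make(map[int]*list.Element),
		load:     load,
	}
}

// Len reports how many objects are currently held in memory.
func (c *objectCache) Len() int { return c.order.Len() }

// Get returns the cached object, or loads it again if it was evicted.
func (c *objectCache) Get(objNum int) ([]byte, error) {
	if el, ok := c.items[objNum]; ok {
		return el.Value.(*cacheEntry).data, nil
	}
	data, err := c.load(objNum)
	if err != nil {
		return nil, err
	}
	c.items[objNum] = c.order.PushBack(&cacheEntry{objNum: objNum, data: data})
	c.curBytes += len(data)
	// Evict oldest loads until under budget, but keep the object just loaded.
	for c.curBytes > c.maxBytes && c.order.Len() > 1 {
		oldest := c.order.Front()
		e := oldest.Value.(*cacheEntry)
		c.order.Remove(oldest)
		delete(c.items, e.objNum)
		c.curBytes -= len(e.data)
	}
	return data, nil
}
```

A caller would pass a `load` function that re-parses the object from the file at its known offset. Evicting by load order rather than by last access keeps the bookkeeping trivial, at the cost of sometimes dropping an object that is still in active use.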
Presumably most of the memory is consumed by large stream objects (this would be good to verify against a large corpus). A potential fix for this would be to change [...] The tricky part might be knowing whether the data is encoded or not and applying an encoding. If the stream is changed (e.g. has been encoded), then it needs to be kept in memory, or stored in some temporary storage where it could be re-accessed without staying in memory the entire time, such as boltdb or something similar.
Is there a solution for writing a large PDF whose content doesn't fit in memory?
We're currently working on implementing a solution for handling very large PDFs, including features like lazy writing. If you have any questions or need further assistance, please let us know. Thanks.
Currently it isn't possible to parse/edit/write PDFs whose size approaches the available memory on the computer. Unidoc parses the object tree and stores it in memory. Based on the architecture of a PDF file, we should be able to parse (and, for example, extract text from) arbitrarily large PDFs that greatly exceed the memory capacity of the server. #128 could resolve a lot of this, but it would require that pages/objects can be freed from memory when no longer needed.
Additionally, writing large PDFs would require either caching objects to disk before writing the completed PDF, or streaming the output as each page is completed and releasing its memory afterwards. This would get tricky with shared objects.