Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

clean_contents with sanitize=True removes page contents #4130

Open
C-Saunders opened this issue Dec 9, 2024 · 11 comments
Open

clean_contents with sanitize=True removes page contents #4130

C-Saunders opened this issue Dec 9, 2024 · 11 comments
Labels
fix developed release schedule to be determined upstream bug bug outside this package

Comments

@C-Saunders
Copy link

C-Saunders commented Dec 9, 2024

Description of the bug

We've encountered a number of PDFs recently (all from the same source, so I suspect this is specific to a quirk of their format) where calling clean_contents() removes all visible page content. I found that setting sanitize=False causes the content to be retained.

Our process requires adding text to the PDFs, and we call clean_contents() because we've found that without that, text sometimes isn't successfully added.

I'm happy to add sanitize=False to our code if this isn't a bug. Thanks for taking a look!

How to reproduce the bug

single_page.pdf

import pymupdf

single_page = pymupdf.open("./single_page.pdf")
for page in single_page.pages():
  page.clean_contents()

single_page.save("./single_page_cleaned.pdf", garbage=2, deflate=True)

PyMuPDF version

1.25.0

Operating system

MacOS

Python version

3.11

@JorjMcKie
Copy link
Collaborator

There has been a related bug in the base library - which should have been fixed though. Going to take another look in any case.

But your motivation for using page clean at all seems to relate to misplaced insertion positions - guessing correctly?
If true: This notorious issue has been resolved back in some 1.23.* version. The cause getting unwanted insertion positions is now automatically detected and removed - IAW it should always "just work".

@JorjMcKie
Copy link
Collaborator

Hm, MuPDF's cli mutool indeed does not do it correctly (so PyMuPDF has no chance to do this any better).
I filing a bug in MuPDF's issue system.

@JorjMcKie JorjMcKie added the upstream bug bug outside this package label Dec 10, 2024
@JorjMcKie
Copy link
Collaborator

Here is MuPDF's issue link: https://bugs.ghostscript.com/show_bug.cgi?id=708186

@JorjMcKie
Copy link
Collaborator

So bottom line:
Please try without any cleaning. If you do then get text insertion problems, PyMuPDF has a bug. Please use the sanitize=False option in this case and file a new bug here.

@C-Saunders
Copy link
Author

Thanks so much!

your motivation for using page clean at all seems to relate to misplaced insertion positions - guessing correctly?

Unfortunately I added this to our code over six months ago, and I don't remember the details. The commit message says "Run page.clean_contents() on every page before attempting to add text. This fixes silent failures to add text to the PDF", which implies the text wasn't added at all, but I didn't leave myself a pointer to an example, or add a test, sadly.

In that same commit I also updated PyMuPDF from 1.23.26 to 1.24.2, and hopefully I reproduced the issue with that newer version, but I can't make any guarantees.

I'm going to use sanitize=False for now and I've put it on my todo list to look into this.

@C-Saunders
Copy link
Author

After I made the change to set sanitize=False I started seeing this error for some documents when attempting to call TextWriter.write_text: pymupdf.mupdf.FzErrorArgument: code=4: cannot create dictionary without a document

If I do clean_contents(sanitize=True) the text writing works as expected and without error. Also, I get the same error if I don't call clean_contents.

I could maybe do clean_contents(sanitize=False), then try/except to catch this error and do clean_contents(sanitize=True), but 1) it feels a bit icky and 2) I'm concerned about losing the page content.

It looks like MuPDF already made a release with a fix for the latter issue. Safe to assume that's coming to PyMuPDF soon?

@JorjMcKie
Copy link
Collaborator

Let me restate what happens or does not happen in the 3 alternatives:

  1. page.clean_contents(sanitize=False): The appearance source files of the page are concatenated into one which then replaces any previous separate ones. E.g. the page object might have had this entry /Contents [1 0 R 2 0 R]. Afterwards it will have /Contents 4711 0 R, with a newly created object at xref 4711 which contains the concatenated source uncompressed. The old xrefs 1, 2 are still there but now unreferenced and should be removed by garbage collection, the new object at 4711 should be compressed. As part of that concatenation, some source code cleaning also happens. This includes standard format of numbers (e.g. 0.25000 => .25), multiple white spaces between tokens will be replaced by one space, etc. Any missing stack / unstack commands ("q"/"Q") will also be inserted. The latter takes care of worry-free insertions of text, images, and everything else. No positioning concerns.
  2. page.clean_contents(sanitize=True): includes everything previous. In addition the object references in the page's /Resources dictionary will be synchronized with the references made in the /Contents source. So images displayed will all be named in /Resources, but only and exactly those. Same is true for fonts and all other objects. This entails that these references will be renamed according to some naming convention (images are called "/Im1" ...,) and the respective cross reference numbers (xrefs) will all change. So this is a major thing happening. All previous page.get_images() will now be invalid for instance, and must be re-executed.
  3. Do nothing: just insert your stuff. This should always work since a number of versions. Internal checks will be made to see whether the configuration of the stack / unstack commands is safe. If an unbalanced number of "q" and "Q" commands is detected, then this will be corrected automatically - with least required changes to the /Contents sources. This is definitely the most space-effective alternative and avoids work from happening that has nothing to do with what you want to achieve.

@C-Saunders
Copy link
Author

Do nothing: just insert your stuff. This should always work since a number of versions.

I get pymupdf.mupdf.FzErrorArgument: code=4: cannot create dictionary without a document if I do nothing (don't call clean_contents) and try to insert, or if I use sanitize=False.

If I use sanitize=True, this error is not raised and text is inserted successfully.

I was able to make (by grabbing a PII-free page from a problematic doc) an example that reproduces this problem: FzErrorArgument_single_page.pdf

Happy to open a new issue, if you'd like, since this is starting to feel like a separate topic.

@JorjMcKie
Copy link
Collaborator

Cannot confirm at all. The following works flawlessly:
image

And produces this result as it should:
image

@C-Saunders
Copy link
Author

C-Saunders commented Dec 11, 2024

Interesting! I get the error when I'm using a TextWriter, but I can use insert_text instead, if needed.

import pymupdf

original_pdf = pymupdf.open("FzErrorArgument_single_page.pdf")
page = original_pdf[0]

print("original text = ", page.get_text("text"))
 
text_writer = pymupdf.TextWriter(page.rect)
text_writer.append((100, 100), "Hallo", fontsize=20)

text_writer.write_text(page, color=(1,0,0), opacity=1)

Edit - I replicated your result successfully. I'll switch to using insert_text in all cases for now.

Second edit - it looks like insert_text is about an order of magnitude slower for our use-case, so a fix that works with TextWriter would be really great.

@JorjMcKie JorjMcKie added the fix developed release schedule to be determined label Dec 11, 2024
@JorjMcKie
Copy link
Collaborator

In the meantime I detected that this is a separate bug. The peculiarity of your page is that it has no /Resources dictionary yet - which is something not expected by Textwriter and some other function.
I will open a separate issue for this.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
fix developed release schedule to be determined upstream bug bug outside this package
Projects
None yet
Development

No branches or pull requests

2 participants