Scripts (mostly Bash) to repair, verify, OCR, compress (etc.) PDFs.
Currently in beta status, so expect backward-incompatible changes.
You need to have Bash installed.
The scripts use several software libraries. setup.sh installs them for macOS (via brew) or Ubuntu/Debian.
- Go to the root of this repository:
cd pdf-scripts
- Execute the script:
./pipeline.sh -l deu /path/to/document-in-german.pdf
Please refer to the scripts for the command-line arguments and options. NB: It's not possible to combine options, e.g., use -x -y instead of -xy.
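The restriction on combined options is typical of a hand-written argument loop. The following is a minimal sketch of such a loop, not the parser actually used in this repository; the function name `parse_args` is hypothetical and `-x`/`-y` are the placeholder flags from the note above.

```shell
# Hypothetical parser: every argument is compared as a whole word,
# so "-xy" matches neither "-x" nor "-y" and is rejected.
parse_args() {
  local opt_x=0 opt_y=0
  while [ "$#" -gt 0 ]; do
    case "$1" in
      -x) opt_x=1 ;;
      -y) opt_y=1 ;;
      -*)
        echo "unknown option: $1" >&2
        return 1
        ;;
      *) break ;; # first non-option argument: stop parsing
    esac
    shift
  done
  echo "x=$opt_x y=$opt_y rest=$*"
}
```

With this sketch, `parse_args -x -y file.pdf` accepts both flags, while `parse_args -xy file.pdf` fails with "unknown option".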
Most scripts work on individual PDFs as well as on folders full of PDFs.
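Accepting either a file or a folder can be sketched as follows. This is an illustration only; the helper name `collect_pdfs` is hypothetical, and the actual scripts may walk folders differently (for example, without recursing into subfolders).

```shell
# Hypothetical helper: print one PDF path per line for a file or a folder.
collect_pdfs() {
  local target="$1"
  if [ -d "$target" ]; then
    # Recurse into the folder; match the extension case-insensitively.
    find "$target" -type f -iname '*.pdf' | sort
  elif [ -f "$target" ]; then
    printf '%s\n' "$target"
  else
    echo "not a file or folder: $target" >&2
    return 1
  fi
}
```

A caller could then loop over the result, e.g. `collect_pdfs /path/to/folder | while IFS= read -r pdf; do echo "processing $pdf"; done`.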
OCR PDFs with OCRmyPDF.
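For orientation, a minimal sketch of an OCRmyPDF call wrapped in a function. The wrapper name `ocr_pdf` and the choice of `--skip-text` are assumptions made for illustration, not necessarily what the script in this repository does.

```shell
# Hypothetical wrapper around OCRmyPDF.
# $1: language code (e.g. deu), $2: input PDF, $3: output PDF
ocr_pdf() {
  local lang="$1" src="$2" dst="$3"
  # --skip-text: do not OCR pages that already contain text
  ocrmypdf -l "$lang" --skip-text "$src" "$dst"
}
```

Example: `ocr_pdf deu scan.pdf scan-ocr.pdf`.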
Using: pdftocairo from poppler, mutool clean from MuPDF, qpdf
Caveat: May remove text in OCRd PDFs. Use --check to check for OCRd text in order to preserve it.
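The three tools named above could, for instance, be tried one after another until one of them manages to rewrite the file. The sketch below assumes exactly that fallback order; the function name `repair_pdf` is hypothetical, and the real script may instead chain the tools or pass additional flags.

```shell
# Hypothetical fallback chain: stop at the first tool that succeeds.
# $1: damaged input PDF, $2: repaired output PDF
repair_pdf() {
  local src="$1" dst="$2"
  # Re-render the document through poppler's cairo backend.
  pdftocairo -pdf "$src" "$dst" && return 0
  # Let MuPDF rewrite the file structure.
  mutool clean "$src" "$dst" && return 0
  # Last attempt: let qpdf rewrite the file.
  qpdf "$src" "$dst"
}
```

Example: `repair_pdf broken.pdf repaired.pdf`. The exit status is non-zero only if all three tools fail.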
Checks whether text can be extracted, i.e., whether the PDF already contains a text layer.
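One way to perform such a check is to run a text extractor and look for any non-whitespace output. The sketch below assumes `pdftotext` from poppler for this purpose; the function name `has_text` is hypothetical, and the script in this repository may rely on a different tool.

```shell
# Hypothetical check: exit 0 if the PDF yields extractable text,
# 1 if it yields none, 2 if extraction itself fails.
has_text() {
  local pdf="$1" text
  # "-" sends the extracted text to stdout instead of a file.
  text="$(pdftotext "$pdf" - 2>/dev/null)" || return 2
  case "$text" in
    *[![:space:]]*) return 0 ;; # at least one non-whitespace character
    *) return 1 ;;
  esac
}
```

Example: `has_text scan.pdf && echo "already contains text"`.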
Uses Ghostscript to compress images in PDFs.
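A typical Ghostscript invocation for this purpose looks roughly like the sketch below. The function name `compress_images` and the default preset `/ebook` are assumptions; the script in this repository may use other settings.

```shell
# Hypothetical wrapper around Ghostscript's pdfwrite device.
# $1: input PDF, $2: output PDF,
# $3: optional preset (/screen, /ebook, /printer, /prepress or /default)
compress_images() {
  local src="$1" dst="$2" preset="${3:-/ebook}"
  gs -sDEVICE=pdfwrite \
    -dPDFSETTINGS="$preset" \
    -dNOPAUSE -dBATCH -dQUIET \
    -sOutputFile="$dst" "$src"
}
```

Example: `compress_images big.pdf small.pdf /screen`. Lower-quality presets such as `/screen` downsample images more aggressively than `/printer` or `/prepress`.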
Uses compress_pdf.sh as well as pdfsizeopt to reduce the file size of PDFs.
Remove metadata with exiftool.
Detect OCRd PDFs. See also sort_ocrd_pdfs.sh to sort PDFs.
Combining several of the above scripts.
Bash is still the most-used shell, and the scripts consist mostly of simple conditionals and sequences of CLI commands. This could also be done with Python's psutil, but that would add yet another layer. However, at some point, I will most probably port the scripts to simple POSIX shell.
- https://dangerzone.rocks/
- https://0xacab.org/jvoisin/mat2
- https://github.com/NicolasBernaerts/ubuntu-scripts/blob/master/pdf/pdf-repair
- https://scantailor.org/ (unmaintained)
- focus on Bash v4+
- write Python 3.6+ scripts if Bash gets too complicated
- use Docker images if available
- should run on the major Unix-like OSs (Linux (e.g. Ubuntu), macOS)
- format code with shfmt, e.g., extension for VS Code
- lint scripts with shellcheck, e.g., extension for VS Code
GPLv3.