diff --git a/index.html b/index.html
index be419ac..a5c37e4 100755
--- a/index.html
+++ b/index.html
@@ -159,6 +159,12 @@
@@ -183,7 +189,7 @@ Usage
and format. Pass --pages or -p to choose the specific pages to
image. Passing
--size or -s will specify the desired
image resolution, --density or -d will specify the DPI to rasterize the images
- at during conversion by GraphicsMagick, and --format or -f
+ at during conversion by GraphicsMagick, and --format or -f
will select the format of the final images.
@@ -201,7 +207,7 @@ Usage
pass --pages all. You can use the --ocr and --no-ocr flags to force OCR, or disable it, respectively. By default (if Tesseract is installed) Docsplit will OCR the text of each page for which it fails to extract text
- directly from the document. Docsplit will also attempt to clean up garbage
+ directly from the document. Docsplit will also attempt to clean up garbage
characters in the OCR'd text — to disable this, pass the --no-clean flag.
@@ -272,7 +278,7 @@ Internals
Poppler, PDFTK, Tesseract, and
- LibreOffice libraries.
+ LibreOffice libraries.
Poppler is used to extract text and metadata from PDF documents, PDFTK is used to split them apart into pages, and GraphicsMagick is used to generate the page images (internally, it's rendering them with
@@ -291,7 +297,7 @@ Internals
Change Log
-
+
0.7.6 – Nov. 16, 2014
Docsplit will now automatically use Tesseract's orientation detection model
@@ -308,7 +314,7 @@ Change Log
0.7.2 – Feb. 23, 2013
Bug fixes for LibreOffice support.
-
+
0.7.0 – Feb. 23, 2013
Docsplit now expresses a preference for LibreOffice over OpenOffice, with
@@ -317,81 +323,81 @@ Change Log
Improved unicode support now correctly collects non-ascii characters from pdfinfo.
-
+
0.6.4 – Nov. 12, 2012
-
+
Added a language flag for the Docsplit commandline, fixed several bugs, and began preparations for the deprecation of pdftk.
0.6.2 – Nov. 22, 2011
-
+
Bugfix to escape document names during file type detection.
0.6.1 – Nov. 18, 2011
-
+
Docsplit now supports converting documents using LibreOffice as well as OpenOffice, through JODConverter 3.0 beta4.
0.6.0 – Sept. 13, 2011
-
+
- Docsplit should now handle shelling out for documents with arbitrary
- characters in their filenames correctly, thanks to a series of
+ Docsplit should now handle shelling out for documents with arbitrary
+ characters in their filenames correctly, thanks to a series of
epic patches from Vladimir Rybas.
- A --density option was added for specifying the resolution of
+ A --density option was added for specifying the resolution of
rasterization when generating images from documents.
The image resolution for OCR has been doubled from 200 to 400 DPI —
- this shouldn't make a noticeable difference for normal docs, but will make
+ this shouldn't make a noticeable difference for normal docs, but will make
a world of difference for the fine print.
Docsplit now uses GraphicsMagick's --despeckle before OCR.
0.5.2 – May 13, 2011
-
+
For transparent conversion to PDF, made Docsplit prefer GraphicsMagick over OpenOffice, when the file format is one that GraphicsMagick is able to read: (png, gif, jpg, jpeg, tif, tiff, bmp, pnm, ppm, svg, eps).
0.5.1 – April 26, 2011
-
+
Minor tweaks to the TextCleaner to be more lenient about acronyms with hyphens, and words with four vowels in a row.
0.5.0
-
+
Added a Docsplit::TextCleaner class which is used to post-process OCR'd text, and remove garbage characters that are created when Tesseract encounters non-english text. To disable the cleanup, pass --no-clean.
0.4.1
-
+
Upgraded the JODConverter dependency for PDF conversion via OpenOffice to
- 3.0 beta. Added PNG, GIF, TIF, JPG, and BMP to the list of supported
+ 3.0 beta. Added PNG, GIF, TIF, JPG, and BMP to the list of supported
formats.
0.3.4
-
+
Adding a suggested optimization from the GraphicsMagick list -- only ever generate one page image per GraphicsMagick call. Saves large amounts of disk space for tempfiles on long documents.
0.3.3
-
+
Start using the MAGICK_TMPDIR environment variable to prevent parallel Docsplit runs from having the potential to clobber each other's temporary image files.
0.3.1
- Added a memory limit to GraphicsMagick while generating the TIFFs for
+ Added a memory limit to GraphicsMagick while generating the TIFFs for
Tesseract OCR -- prevents gm from gobbling up all available memory on large files.