Text in PDF Recognized as Image Instead of Text During Parsing #572

Open
kurekj opened this issue Dec 11, 2024 · 0 comments
Labels
question Further information is requested

Question

When parsing CVs with Docling on Ubuntu with Python 3.11, some regions of the PDF that contain text are classified as images instead of being extracted as text. This happens even with OCR enabled and after trying different OCR engines and settings.

Environment:
Docling version: 2.10.0
Docling-Parse version: 3.0.0
Docling-Core version: 2.9.0
Operating System: Ubuntu
Python version: 3.11

Relevant Code:
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import PdfPipelineOptions
from docling.document_converter import DocumentConverter, PdfFormatOption
# The OCR option classes and TableFormerMode used in the commented-out lines
# below are also importable from docling.datamodel.pipeline_options.

IMAGE_RESOLUTION_SCALE = 10.0

pipeline_options = PdfPipelineOptions()
#pipeline_options = PdfPipelineOptions(backend=DoclingParseV2DocumentBackend)
#pipeline_options = PdfPipelineOptions(backend=DoclingParseV2PageBackend)

pipeline_options.do_ocr = True
#pipeline_options.do_table_structure = True
#pipeline_options.table_structure_options.mode = TableFormerMode.ACCURATE  # use more accurate TableFormer model
#pipeline_options.table_structure_options.do_cell_matching = True
pipeline_options.images_scale = IMAGE_RESOLUTION_SCALE
pipeline_options.generate_page_images = True
pipeline_options.generate_picture_images = True
#pipeline_options.ocr_options.bitmap_area_threshold=0.05

# Any of these OCR options can be used: EasyOcrOptions, TesseractOcrOptions, TesseractCliOcrOptions, OcrMacOptions (macOS only), RapidOcrOptions
#ocr_options = EasyOcrOptions(force_full_page_ocr=True)
# ocr_options = TesseractOcrOptions(force_full_page_ocr=True)
# ocr_options = OcrMacOptions(force_full_page_ocr=True)
#ocr_options = RapidOcrOptions(force_full_page_ocr=True)
#ocr_options = TesseractCliOcrOptions(force_full_page_ocr=True)
#pipeline_options.ocr_options = ocr_options

doc_converter = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)
    }
)
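
To narrow down which regions end up as pictures, it may help to count the converted document's items by type after each settings change. The sketch below is only an illustration, not part of the original report: `count_item_types` is a hypothetical helper, `"cv.pdf"` is a placeholder path, and the usage lines assume the `doc_converter` defined above and that `result.document.iterate_items()` yields `(item, level)` pairs.

```python
from collections import Counter
from typing import Any, Iterable, Tuple


def count_item_types(items: Iterable[Tuple[Any, int]]) -> Counter:
    """Count document items by class name (for example TextItem or PictureItem).

    `items` is any iterable of (item, level) pairs; the nesting level is ignored.
    """
    return Counter(type(item).__name__ for item, _level in items)


# Hypothetical usage with the converter configured above:
# result = doc_converter.convert("cv.pdf")  # placeholder path
# print(count_item_types(result.document.iterate_items()))
```

Comparing the counts between runs (for example with and without `force_full_page_ocr=True`) should show whether a given setting moves content from picture items to text items.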