Error dealing with pdf files #436

soham-aiplanet · 2024-10-28T07:18:19Z

This is my code:

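import json

import airbyte as ab  # PyAirbyte


def process_google_drive_documents(folder_url: str, service_account_cred: dict):
    # Configure the Google Drive source with service-account credentials.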
    source = ab.get_source(
        "source-google-drive",
        config={
            "folder_url": folder_url,
            "credentials": {
                "auth_type": "Service",
                "service_account_info": json.dumps(service_account_cred),
            },
            "streams": [
                {
                    "name": "pdf_loader_stream",
                    "globs": ["**"],
                    "format": {"filetype": "unstructured"},
                }
            ],
        },
    )

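    source.check()  # Validate the config and credentials before reading.
    source.select_all_streams()
    read_result = source.read()
    return read_result  # Hand the result back so the caller can inspect the records.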

And here's the error:
[Document(page_content='', metadata={'_ab_source_file_last_modified': '2023-11-28T19:43:49.000000Z', '_ab_source_file_url': 'TermPaper.docx', 'document_key': 'TermPaper.docx', '_ab_source_file_parse_error': "Error parsing record. This could be due to a mismatch between the config's file type and the actual file type, or because the file or record is not parseable. Contact Support if you need assistance.\nfilename=TermPaper.docx message=\n**********************************************************************\n Resource \x1b[93mpunkt_tab\x1b[0m not found.\n Please use the NLTK Downloader to obtain the resource:\n\n \x1b[31m>>> import nltk\n >>> nltk.download('punkt_tab')\n \x1b[0m\n For more information see: https://www.nltk.org/data.html\n\n Attempted to load \x1b[93mtokenizers/punkt_tab/english/\x1b[0m\n\n Searched in:\n - '/home/soham/nltk_data'\n - '/home/soham/work/apps/tryouts/gdrive_integration/.venv-source-google-drive/nltk_data'\n - '/home/soham/work/apps/tryouts/gdrive_integration/.venv-source-google-drive/share/nltk_data'\n - '/home/soham/work/apps/tryouts/gdrive_integration/.venv-source-google-drive/lib/nltk_data'\n - '/usr/share/nltk_data'\n - '/usr/local/share/nltk_data'\n - '/usr/lib/nltk_data'\n - '/usr/local/lib/nltk_data'\n**********************************************************************\n", '_airbyte_raw_id': '01JAQ6ZEB720CS3BNHYVMKFQEC', '_airbyte_extracted_at': datetime.datetime(2024, 10, 21, 9, 36, 50, 530000), '_airbyte_meta': {}, 'last_modified': '2024-10-21T15:06:52.694685'})]

Any idea how to resolve this?
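
Note that the read itself did not raise: the failure is reported inside the record, under the `_ab_source_file_parse_error` metadata key, and `page_content` is empty. A minimal sketch of how such records could be separated from the successfully parsed ones follows. The helper name `split_by_parse_error` is hypothetical, and the sketch assumes that each document exposes a `metadata` dict and that the key is absent or empty when parsing succeeds.

```python
def split_by_parse_error(documents):
    """Split documents into (parsed, failed) using the parse-error metadata key.

    Assumption: a document whose metadata has a non-empty
    '_ab_source_file_parse_error' value could not be parsed.
    """
    parsed, failed = [], []
    for doc in documents:
        error = doc.metadata.get("_ab_source_file_parse_error")
        if error:
            failed.append(doc)
        else:
            parsed.append(doc)
    return parsed, failed
```

Printing `doc.metadata["document_key"]` for each entry in `failed` would show which files need attention, here `TermPaper.docx`.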


pinaak-goel · 2024-11-03T17:13:41Z

Hi @soham-aiplanet!
The error comes from a missing resource in the Natural Language Toolkit (NLTK) library, which the unstructured parser relies on to tokenize the document text.
The traceback shows that the punkt_tab tokenizer data was not found in any of the directories NLTK searched. To resolve this, download punkt_tab, as the error message itself suggests:

import nltk
nltk.download('punkt_tab')

By default the data should be saved to the nltk_data folder in your home directory, which is the first location in the "Searched in" list, so the connector's virtual environment should be able to find it. After downloading it, try running the code again. Please let me know if this works or if you run into any other issues.
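
If the check should run automatically before each read, something like the following could work. It is only a sketch: the traceback names `punkt_tab` as the missing resource, and the helper `ensure_nltk_resource` with its `nltk_module` parameter is hypothetical (the parameter exists only so the logic can be exercised without NLTK or network access). It relies on `nltk.data.find` raising `LookupError` for a missing resource and on `nltk.download` returning a truthy value on success.

```python
def ensure_nltk_resource(resource_path="tokenizers/punkt_tab",
                         package="punkt_tab",
                         nltk_module=None):
    """Return True if the NLTK resource is available, downloading it if needed."""
    if nltk_module is None:
        # Imported lazily so the helper can be defined without NLTK installed.
        import nltk as nltk_module
    try:
        # Raises LookupError when the resource is in none of the search paths.
        nltk_module.data.find(resource_path)
        return True
    except LookupError:
        pass
    # Download the package; the result reports whether the download succeeded.
    return bool(nltk_module.download(package))
```

Calling `ensure_nltk_resource()` under the same user account before `source.read()` should place the data where the connector can see it, assuming the default download location is the home-directory `nltk_data` folder listed in the traceback.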
