Error dealing with pdf files #436

Open
soham-aiplanet opened this issue Oct 28, 2024 · 1 comment

Comments

@soham-aiplanet

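This is my code:

import json

import airbyte as ab  # PyAirbyte; assumed to be what `ab` refers to


def process_google_drive_documents(folder_url: str, service_account_cred: dict):
    # Configure the Google Drive source with a single stream that parses
    # every file in the folder using the "unstructured" format.
    source = ab.get_source(
        "source-google-drive",
        config={
            "folder_url": folder_url,
            "credentials": {
                "auth_type": "Service",
                "service_account_info": json.dumps(service_account_cred),
            },
            "streams": [
                {
                    "name": "pdf_loader_stream",
                    "globs": ["**"],
                    "format": {"filetype": "unstructured"},
                }
            ],
        },
    )

    # Validate the config, select every stream, then read the records.
    source.check()
    source.select_all_streams()
    read_result = source.read()
    return read_result  # added so the caller can inspect the records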

And here's the error:
[Document(page_content='', metadata={'_ab_source_file_last_modified': '2023-11-28T19:43:49.000000Z', '_ab_source_file_url': 'TermPaper.docx', 'document_key': 'TermPaper.docx', '_ab_source_file_parse_error': "Error parsing record. This could be due to a mismatch between the config's file type and the actual file type, or because the file or record is not parseable. Contact Support if you need assistance.\nfilename=TermPaper.docx message=\n**********************************************************************\n Resource \x1b[93mpunkt_tab\x1b[0m not found.\n Please use the NLTK Downloader to obtain the resource:\n\n \x1b[31m>>> import nltk\n >>> nltk.download('punkt_tab')\n \x1b[0m\n For more information see: https://www.nltk.org/data.html\n\n Attempted to load \x1b[93mtokenizers/punkt_tab/english/\x1b[0m\n\n Searched in:\n - '/home/soham/nltk_data'\n - '/home/soham/work/apps/tryouts/gdrive_integration/.venv-source-google-drive/nltk_data'\n - '/home/soham/work/apps/tryouts/gdrive_integration/.venv-source-google-drive/share/nltk_data'\n - '/home/soham/work/apps/tryouts/gdrive_integration/.venv-source-google-drive/lib/nltk_data'\n - '/usr/share/nltk_data'\n - '/usr/local/share/nltk_data'\n - '/usr/lib/nltk_data'\n - '/usr/local/lib/nltk_data'\n**********************************************************************\n", '_airbyte_raw_id': '01JAQ6ZEB720CS3BNHYVMKFQEC', '_airbyte_extracted_at': datetime.datetime(2024, 10, 21, 9, 36, 50, 530000), '_airbyte_meta': {}, 'last_modified': '2024-10-21T15:06:52.694685'})]
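The parse error is easier to read once the terminal color codes (`\x1b[93m`, `\x1b[0m`) are removed. Below is a minimal sketch of pulling the missing NLTK resource name out of each failing record. It assumes the records are available as plain dicts with the `_ab_source_file_parse_error` and `document_key` fields shown above; the helper name `missing_nltk_resources` and the shortened `sample` record are hypothetical, made up for illustration.

```python
import re

# Matches ANSI color sequences such as "\x1b[93m" and "\x1b[0m".
ANSI_ESCAPE = re.compile(r"\x1b\[[0-9;]*m")
# Matches the "Resource <name> not found" line that NLTK prints.
MISSING_RESOURCE = re.compile(r"Resource (\S+) not found")


def missing_nltk_resources(records):
    """Map each failing file to the NLTK resource named in its parse error."""
    missing = {}
    for record in records:
        error = record.get("_ab_source_file_parse_error")
        if not error:
            continue  # record parsed without an error
        clean = ANSI_ESCAPE.sub("", error)
        match = MISSING_RESOURCE.search(clean)
        if match:
            missing[record.get("document_key")] = match.group(1)
    return missing


# Shortened stand-in for the record above (hypothetical, for illustration).
sample = [
    {
        "document_key": "TermPaper.docx",
        "_ab_source_file_parse_error": (
            "message=\n Resource \x1b[93mpunkt_tab\x1b[0m not found.\n"
        ),
    }
]

print(missing_nltk_resources(sample))  # {'TermPaper.docx': 'punkt_tab'}
```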

Any idea how to resolve this?

@pinaak-goel

Hi @soham-aiplanet!
The error message points to a missing resource in the Natural Language Toolkit (NLTK) library.
The punkt_tab tokenizer data is needed to parse the text in the document, but NLTK could not find it in any of the directories listed under "Searched in". To resolve this, download punkt_tab (the resource named in the error, not the older punkt) by running:

import nltk
nltk.download('punkt_tab')

Make sure the data ends up in one of the searched directories, for example /home/soham/nltk_data. After downloading it, try running the code again, and the error should be resolved. Please let me know if this works or if you run into any other issues.
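To avoid downloading on every run, the check can be made conditional. Here is a minimal sketch of that idea, using the resource named in the error message (`punkt_tab`). The helper `ensure_resource` is hypothetical, and it takes the lookup and download functions as arguments so the sketch runs without NLTK installed; the `fake_find` and `fake_download` stand-ins exist only for this demonstration.

```python
def ensure_resource(resource_path, package, find, download):
    """Download `package` only if `find(resource_path)` raises LookupError.

    Returns True if a download was triggered, False if the data was present.
    """
    try:
        find(resource_path)
    except LookupError:
        download(package)
        return True
    return False


# Stand-ins for the real lookup and download functions (hypothetical).
installed = set()


def fake_find(path):
    if path not in installed:
        raise LookupError(path)
    return path


def fake_download(package):
    installed.add("tokenizers/" + package)
    return True


print(ensure_resource("tokenizers/punkt_tab", "punkt_tab", fake_find, fake_download))  # True
print(ensure_resource("tokenizers/punkt_tab", "punkt_tab", fake_find, fake_download))  # False
```

With NLTK installed, the real call would presumably be `ensure_resource("tokenizers/punkt_tab", "punkt_tab", nltk.data.find, nltk.download)`, since `nltk.data.find` raises `LookupError` when a resource is missing.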
