RAG: newlines missing when transforming from pdf to txt #255

Closed
1 task done
mmuller88 opened this issue Feb 12, 2024 · 1 comment
Labels
bug Something isn't working needs-triage This issue or PR still needs to be triaged.

Comments


mmuller88 commented Feb 12, 2024

Describe the bug

So I noticed that newlines are getting removed when RAG is transforming from pdf to txt. That probably decreases the level of accuracy when using similarity search.

I kind of had a workaround, as my ingestion files don't need to be PDFs, so I could just take the txt file like this and leave it as it is for the embedding:

---
event: meetup
title: Langchain AI MVP
date: "2024-02-03"
tags: ["meetup", "langchain"]
---

0:00:00.719,0:00:03.719
okay

0:00:05.400,0:00:09.120
awesome so I will start with the first

0:00:07.980,0:00:11.460
talk

0:00:09.120,0:00:14.280
uh thanks again for attending here to

I'm sure that gave better results than without newlines!
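
To illustrate why the newlines matter for a transcript like the one above, here is a minimal TypeScript sketch. It is not the construct's or Langchain's actual code: `splitTranscript` and the `Cue` type are hypothetical names made up for this example, and the flattening step only approximates what a PDF-to-txt conversion might do. With newlines intact, each timestamped cue can be separated on blank lines; once the newlines are collapsed, the whole transcript becomes a single undifferentiated block.

```typescript
// Hypothetical helper for illustration only; not part of the construct or Langchain.
interface Cue {
  timestamp: string;
  text: string;
}

// Splits a subtitle-style transcript into cues, using blank lines as separators.
function splitTranscript(raw: string): Cue[] {
  return raw
    .split(/\r?\n\s*\r?\n/) // a blank line separates two cues
    .map((block) => block.trim())
    .filter((block) => block.length > 0)
    .map((block) => {
      const lines = block.split(/\r?\n/);
      const timestamp = (lines[0] ?? "").trim();
      const text = lines.slice(1).join(" ").trim();
      return { timestamp, text };
    });
}

// Two cues taken from the transcript above, with newlines preserved.
const withNewlines =
  "0:00:00.719,0:00:03.719\nokay\n\n" +
  "0:00:05.400,0:00:09.120\nawesome so I will start with the first\n";

// Assumption: the conversion collapses every newline into a single space.
const flattened = withNewlines.replace(/\s*\n\s*/g, " ").trim();

console.log(splitTranscript(withNewlines).length); // 2
console.log(splitTranscript(flattened).length); // 1
```

In the flattened case the only cue that comes back has the entire transcript in its `timestamp` field and an empty `text`, so any chunking that relies on line structure has nothing to work with.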

Expected Behavior

Newlines are not removed

Current Behavior

Newlines are removed

Reproduction Steps

Run a RAG ingestion on a PDF and view the resulting txt file.

Possible Solution

No response

Additional Information/Context

No response

CDK CLI Version

2.124.0

Framework Version

No response

Node.js Version

20

OS

macos

Language

Typescript

Language Version

No response

Region experiencing the issue

us-east-1

Code modification

....

Other information

No response

Service quota

  • I have reviewed the service quotas for this construct
@mmuller88 mmuller88 added bug Something isn't working needs-triage This issue or PR still needs to be triaged. labels Feb 12, 2024
Collaborator

krokoko commented Feb 28, 2024

Hi @mmuller88, thank you for reporting this! This method currently relies on Langchain. We will soon merge a new capability that lets users provide their own Lambda business logic in case they want to use a different library or transformation method.
I will close this ticket and mention #284, which should be merged pretty soon.
