[C4GT] Document Uploader ( Text chunking into paragraphs + content retrieval ) #78
I understood the problem statement to be taking the transcriptions and storing the embeddings in the database. I would like to contribute to this issue. Please assign it to me!!
I have good knowledge of working on NLP and I also understand your problem, so I would like to contribute to this issue. Could you please assign it to me?
I have worked on a similar problem statement earlier. We were given paragraphs on several topics; a question on a specific topic was then asked, and we had to retrieve the answer for that query from the given paragraphs. The solution we came up with was to convert the paragraphs into embeddings using a HuggingFace transformer model. The embeddings were indexed using the FAISS indexing library. Then, for a question, we took its embedding and retrieved the closest paragraph embeddings from the index using cosine similarity. Here is the link to the code notebook for reference: click here. We used retrieval-then-question-answering to solve the problem. I think I can work on converting that code into a proper API as required by the project.
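The retrieval step described above can be sketched with plain NumPy (the linked notebook is not reproduced here; the paragraph vectors below are toy stand-ins for transformer embeddings, and a real setup would use FAISS for the index):

```python
import numpy as np

def build_index(paragraph_embeddings):
    """Normalize embeddings so a dot product equals cosine similarity
    (the same trick FAISS's IndexFlatIP relies on)."""
    emb = np.asarray(paragraph_embeddings, dtype=np.float64)
    return emb / np.linalg.norm(emb, axis=1, keepdims=True)

def retrieve(index, query_embedding, top_k=1):
    """Return the indices of the top_k most similar paragraphs."""
    q = np.asarray(query_embedding, dtype=np.float64)
    q = q / np.linalg.norm(q)
    scores = index @ q                      # cosine similarity per paragraph
    return np.argsort(scores)[::-1][:top_k]

# Toy 3-d embeddings standing in for real transformer outputs.
paragraphs = np.array([
    [1.0, 0.0, 0.0],   # paragraph 0
    [0.0, 1.0, 0.0],   # paragraph 1
    [0.7, 0.7, 0.0],   # paragraph 2
])
index = build_index(paragraphs)
query = np.array([0.9, 0.1, 0.0])           # closest to paragraph 0
print(retrieve(index, query, top_k=2))      # → [0 2]
```

With real data, the query embedding comes from the same model as the paragraph embeddings, so both live in the same vector space.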
I have built a FastAPI app to upload the PDF file and extract the text, as described in the "git with basic implementation". I implemented the requirements of "Behavior of Upload API". Please review it.
The approach initially suggested was to create a window over the embeddings and check for any sharp changes. However, we are now not sure whether a change in the similarity score is a good enough signal: a single paragraph may contain information about a variety of things, and this approach would then separate it into different chunks. I think what will be required is this: some sample PDFs are provided here. A simple test can be run on a page to see whether the extracted text is chunked into the same paragraphs as in the PDF.
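The "sharp change in similarity" idea mentioned above can be sketched as follows (the threshold value and the synthetic embeddings are illustrative assumptions, not values from the project):

```python
import numpy as np

def find_boundaries(sentence_embeddings, threshold=0.5):
    """Propose a chunk boundary wherever cosine similarity between
    adjacent sentence embeddings drops below `threshold`
    (a hypothetical cutoff that would need tuning)."""
    emb = np.asarray(sentence_embeddings, dtype=np.float64)
    emb = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    sims = np.sum(emb[:-1] * emb[1:], axis=1)   # sim of sentence i with i+1
    return [i + 1 for i, s in enumerate(sims) if s < threshold]

def chunk(sentences, boundaries):
    """Group sentences into chunks at the detected boundaries."""
    chunks, start = [], 0
    for b in boundaries + [len(sentences)]:
        chunks.append(sentences[start:b])
        start = b
    return chunks

# Synthetic embeddings: the first two sentences are similar, the third is not.
emb = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
sents = ["Crops need water.", "Irrigation helps crops.", "Taxes are due in July."]
print(chunk(sents, find_boundaries(emb)))
```

This also illustrates the concern raised above: a paragraph that genuinely mixes topics would be split at every internal topic shift, so a boundary test against the sample PDFs is a sensible sanity check.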
Okay sir. Currently I am dividing the page into chunks and then doing the embedding. So what I need to do now is first divide the content of the pages by topic and then do the embedding. Have I understood right, Sir? I will explore the PDFs you have attached.
That is correct. Potential flow for solving this could be :
Next steps:
Ok sir
I have been looking at this program since 20 May. I looked at many projects, but finally settled on this one, and I understand everything related to this problem statement; I am submitting only this proposal. I have good knowledge of NLP and have been learning HuggingFace Transformers for the last week. I am interested in this project. I have been doing machine learning for the last year, so I have a good knowledge of the Python language.
I have worked on a more or less similar project before using a HuggingFace transformer model. I would like to contribute to this project.
Hello @GautamR-Samagra Sir, I wanted to contribute to the development of the document uploader API within the AI Toolchain.
Project Details
AI Toolchain is a collection of tools for quickly building and deploying machine learning models for various use cases. Currently, the toolchain includes a text translation model, and more models may be added in the future. It abstracts away the messy details of how a model works, similar to HuggingFace, and gives a clean API that you can orchestrate at a BFF (backend-for-frontend) level.
Features to be implemented
The idea is to implement an async document uploader API that returns embeddings for chunks of the uploaded document. It should store the data for a short period, until the user requests a download; the user can then upload that data wherever they run a search engine. Search itself is not covered by the current problem statement.
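The async upload pattern described above (return a job id immediately, process in the background, hold the result briefly until it is downloaded) can be sketched with the standard library alone. The real API would be FastAPI endpoints, per the thread; the class and method names below are illustrative, not from the project:

```python
import threading
import time
import uuid

class UploadJobStore:
    """Minimal in-memory job tracker; names are illustrative assumptions."""
    def __init__(self):
        self._jobs = {}
        self._lock = threading.Lock()

    def submit(self, document_text, process_fn):
        """Start processing in the background and return a job id at once."""
        job_id = str(uuid.uuid4())
        with self._lock:
            self._jobs[job_id] = {"status": "in_progress", "result": None}

        def run():
            try:
                result = process_fn(document_text)
                with self._lock:
                    self._jobs[job_id] = {"status": "completed", "result": result}
            except Exception:
                with self._lock:
                    self._jobs[job_id] = {"status": "failed", "result": None}

        threading.Thread(target=run, daemon=True).start()
        return job_id

    def status(self, job_id):
        with self._lock:
            return self._jobs[job_id]["status"]

    def download(self, job_id):
        """Return the result and delete it (short-term storage only)."""
        with self._lock:
            return self._jobs.pop(job_id)["result"]

store = UploadJobStore()
jid = store.submit("some text", lambda t: t.upper())  # stand-in for chunk+embed
time.sleep(0.2)  # a real client would poll the status endpoint instead
print(store.status(jid), store.download(jid))
```

In the actual service the `process_fn` would be the chunking-and-embedding pipeline, and `download` would serve the embeddings file before deleting it.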
How it works
Extract the text from the PDF file. Split the extracted text into chunks using cosine distance between sentence embeddings. For each chunk, create vector embeddings using an Instructor model.
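The final embedding step above uses an Instructor model, which embeds (instruction, text) pairs. The sketch below substitutes a deterministic stub for the model so it runs anywhere; a real call would go through the InstructorEmbedding package's `model.encode([[instruction, text]])`, and the instruction string here is only an example:

```python
import hashlib
import numpy as np

def stub_embed(instruction, text, dim=8):
    """Deterministic stand-in for an Instructor model: hashes the
    (instruction, text) pair into a seed and returns a unit vector."""
    digest = hashlib.sha256((instruction + text).encode()).digest()
    rng = np.random.default_rng(int.from_bytes(digest[:4], "big"))
    v = rng.standard_normal(dim)
    return v / np.linalg.norm(v)

def embed_chunks(chunks,
                 instruction="Represent the document paragraph for retrieval:"):
    """One embedding per chunk, as in step 3 of the pipeline."""
    return np.stack([stub_embed(instruction, c) for c in chunks])

chunks = ["Paragraph about irrigation.", "Paragraph about fertiliser."]
embs = embed_chunks(chunks)
print(embs.shape)  # → (2, 8)
```

Swapping the stub for a real Instructor model changes only `stub_embed`; the per-chunk loop and the downstream index stay the same.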
Create APIs to upload the following document Types
Behavior of Upload API
Taken from here
File Status API
The file status can be one of yet_to_start, in_progress, completed, and failed.
Taken from here
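The four statuses named in the File Status API section map naturally onto an enum. The transition table below is an assumption (the ticket only names the states, not the allowed moves between them):

```python
from enum import Enum

class FileStatus(str, Enum):
    """The four statuses from the File Status API section."""
    YET_TO_START = "yet_to_start"
    IN_PROGRESS = "in_progress"
    COMPLETED = "completed"
    FAILED = "failed"

# Allowed transitions -- an illustrative assumption, not from the ticket.
TRANSITIONS = {
    FileStatus.YET_TO_START: {FileStatus.IN_PROGRESS},
    FileStatus.IN_PROGRESS: {FileStatus.COMPLETED, FileStatus.FAILED},
    FileStatus.COMPLETED: set(),   # terminal
    FileStatus.FAILED: set(),      # terminal
}

def can_transition(current, new):
    """Reject impossible moves, e.g. completed -> in_progress."""
    return new in TRANSITIONS[current]

print(can_transition(FileStatus.IN_PROGRESS, FileStatus.COMPLETED))  # → True
```

Subclassing `str` means the values serialize directly in a JSON status response without extra conversion.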
Chunking
Sample PDFs:
https://drive.google.com/drive/u/0/folders/1sAsuh-EFH-xmFYrxzhmj0VRUZYNzsyLw
OpenAI Embedding Alternatives
Learning Path
Complexity
Medium
Skills Required
Python, Knowledge of HuggingFace Transformers, NLP.
Name of Mentors:
@GautamR-Samagra
Project size
8 Weeks
Product Set Up
See the setup here
Acceptance Criteria
Milestone
Every document type supported is a milestone.
Reference
C4GT
This issue is nominated for Code for GovTech (C4GT) 2023 edition.
C4GT is India's first annual coding program to create a community that can build and contribute to global Digital Public Goods. If you want to use Open Source GovTech to create impact, then this is the opportunity for you! More about C4GT here: https://codeforgovtech.in/
The scope of this ticket has now expanded to make it the 'content processing' part of 'FAQ bot'.
The FAQ bot allows a user to provide content input in the form of CSVs, free text, PDFs, audio, and video, and the bot adds it to a 'Content DB'. The user can then interact with the bot via text or speech about related content; the bot identifies relevant content using RAG techniques and responds to the user in a conversational manner.
This ticket covers the content processing part of the bot. It includes the following tasks in its scope: