Retrieval augmented generation (RAG) demos with Mistral, Zephyr, Phi, Gemma, Llama, Aya-Expanse
The demos use quantized models and run on CPU with acceptable inference time. They can run offline without Internet access, thus allowing deployment in an air-gapped environment.
The demos also allow the user to:
- apply propositionizer to document chunks
- perform reranking upon retrieval
- perform hypothetical document embedding (HyDE)
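As background for the last option, HyDE asks an LLM to draft a hypothetical answer to the query and embeds that draft instead of the raw query. A minimal sketch of the flow, assuming LangChain-style `embed_query` and `similarity_search_by_vector` interfaces (the `hyde_retrieve` helper and its arguments are illustrative, not names from this code base):

```python
def hyde_retrieve(query, llm, embedder, vectorstore, k=4):
    # Draft a hypothetical answer with the LLM, then embed the draft and use it
    # for similarity search; the returned chunks still come from the real corpus.
    hypothetical_doc = llm(f"Write a short passage that answers: {query}")
    query_vector = embedder.embed_query(hypothetical_doc)
    return vectorstore.similarity_search_by_vector(query_vector, k=k)
```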
You will need to set up your development environment using conda.

```bash
conda create --name rag python=3.11
conda activate rag
pip install -r requirements.txt
```
We shall use `unstructured` to process PDFs. Refer to its Installation Instructions for Local Development. You will also need to download `punkt_tab` and `averaged_perceptron_tagger_eng` from `nltk`:
```python
import nltk
nltk.download('punkt_tab')
nltk.download('averaged_perceptron_tagger_eng')
```
Note that we shall only use `strategy="fast"` in this demo; extraction of tables from PDFs is still a work in progress.
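For reference, partitioning a PDF with `unstructured` in this mode looks roughly like the following; the file path is a placeholder.

```python
from unstructured.partition.pdf import partition_pdf

# strategy="fast" extracts embedded text quickly without OCR or layout detection,
# which is why table extraction is not covered yet.
elements = partition_pdf(filename="docs/sample.pdf", strategy="fast")
chunks = [el.text for el in elements if el.text]
```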
Activate the environment.

```bash
conda activate rag
```
Using a different LLM might lead to poor responses, or even no response at all; it will require testing, prompt engineering, and code refactoring.
Download and save the models in `./models` and update `config.yaml`. The models used in this demo are:
- Embeddings
- Rerankers (see the reranking sketch after this list):
  - BAAI/bge-reranker-base: save in `models/bge-reranker-base/`
  - facebook/tart-full-flan-t5-xl: save in `models/tart-full-flan-t5-xl/`
- Propositionizer:
  - chentong00/propositionizer-wiki-flan-t5-large: save in `models/propositionizer-wiki-flan-t5-large/`
- LLMs:
  - bartowski/aya-expanse-8b-GGUF
  - bartowski/Llama-3.2-3B-Instruct-GGUF
  - allenai/OLMoE-1B-7B-0924-Instruct-GGUF
  - bartowski/Meta-Llama-3.1-8B-Instruct-GGUF
  - microsoft/Phi-3-mini-4k-instruct-gguf
  - QuantFactory/Meta-Llama-3-8B-Instruct-GGUF
  - lmstudio-ai/gemma-2b-it-GGUF
  - TheBloke/zephyr-7B-beta-GGUF
  - TheBloke/Mistral-7B-Instruct-v0.2-GGUF
  - TheBloke/Llama-2-7B-Chat-GGUF
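The bge reranker listed above is a cross-encoder: it scores each (query, passage) pair directly, and those scores are used to reorder the retrieved chunks. A minimal scoring sketch with `sentence_transformers`, assuming the model was saved to the path above (the query and chunks are placeholders, and the app's own reranking code may differ):

```python
from sentence_transformers import CrossEncoder

# Placeholder query and candidate chunks; in the app these come from retrieval.
query = "What is the warranty period?"
retrieved_docs = ["Chunk about the warranty...", "Chunk about shipping..."]

reranker = CrossEncoder("models/bge-reranker-base")
scores = reranker.predict([(query, doc) for doc in retrieved_docs])
ranked = sorted(zip(scores, retrieved_docs), key=lambda pair: pair[0], reverse=True)
reranked_docs = [doc for _, doc in ranked]
```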
The LLMs can be loaded directly in the app, or they can first be served with an Ollama server.
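As an illustration of the in-app path, a quantized GGUF model can be run on CPU with `llama-cpp-python`; the file name and parameters below are placeholders, not the app's actual loader.

```python
from llama_cpp import Llama

# Load a quantized GGUF model for CPU inference; n_ctx sets the context window.
llm = Llama(
    model_path="models/mistral-7b-instruct-v0.2.Q4_K_M.gguf",  # placeholder file name
    n_ctx=4096,
    n_threads=8,
)
output = llm("[INST] Summarise the uploaded document. [/INST]", max_tokens=256)
print(output["choices"][0]["text"])
```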
You can also choose to use models from Groq. Set `GROQ_API_KEY` in `.env`.
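For the Groq path, a minimal chat call with the `groq` Python client looks roughly like this; the model id is only an example, and the client reads the key set in `.env`.

```python
import os
from dotenv import load_dotenv
from groq import Groq

load_dotenv()  # load GROQ_API_KEY from .env
client = Groq(api_key=os.environ["GROQ_API_KEY"])
response = client.chat.completions.create(
    model="llama-3.1-8b-instant",  # example model id; check Groq's current model list
    messages=[{"role": "user", "content": "Say hello."}],
)
print(response.choices[0].message.content)
```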
Since each model type has its own prompt format, include the format in `./src/prompt_templates.py`. For example, the format used in `openbuddy` models is:

```
"""{system}
User: {user}
Assistant:"""
```
We shall use Phoenix for LLM tracing. Phoenix is an open-source observability library designed for experimentation, evaluation, and troubleshooting. Before running the app, start a Phoenix server:

```bash
python3 -m phoenix.server.main serve
```

The traces can be viewed at http://localhost:6006.
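If the app uses LangChain, one pattern for sending its traces to the local Phoenix server is OpenInference instrumentation; the exact packages and registration call depend on your Phoenix and OpenInference versions, so treat this as a sketch.

```python
from phoenix.otel import register
from openinference.instrumentation.langchain import LangChainInstrumentor

# Point the tracer at the local Phoenix server started above, then instrument LangChain.
tracer_provider = register(endpoint="http://localhost:6006/v1/traces")
LangChainInstrumentor().instrument(tracer_provider=tracer_provider)
```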
We use Streamlit as the interface for the demos. There are three demos:

- Conversational Retrieval

  ```bash
  streamlit run app_conv.py
  ```

- Retrieval QA

  ```bash
  streamlit run app_qa.py
  ```

- Conversational Retrieval using ReAct

  NOTE: This demo uses gemini-1.5-flash as the LLM.

  Create the vector store first and update `config.yaml`:

  ```bash
  python -m vectorize --filepaths <your-filepath>
  ```

  Run the app:

  ```bash
  streamlit run app_react.py
  ```
To get started, upload a PDF and click on `Build VectorDB`. Building the vector DB will take a while.