This repository contains a Retrieval-Augmented Generation (RAG) API that allows querying a vector database (ChromaDB) to retrieve relevant document chunks and generate responses using Ollama (Mistral).
✅ FastAPI-based API for document retrieval and comparison
✅ Embeddings with Ollama (nomic-embed-text) for vector search
✅ PDF document loading and chunking with PyPDFDirectoryLoader
✅ Cosine similarity-based document comparison
✅ Automated testing with pytest
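The cosine similarity behind the document comparison feature can be illustrated with a minimal, dependency-free sketch. This is not the code in `compare_embeddings.py`; the function name `cosine_similarity` and the toy vectors are assumptions made for illustration, and real embeddings from nomic-embed-text would have many more dimensions.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # Cosine similarity is undefined for a zero vector; returning 0.0
        # is a convention chosen for this sketch.
        return 0.0
    return dot / (norm_a * norm_b)


# Toy 3-dimensional "embeddings" (hypothetical values, not real model output).
emb_1 = [1.0, 2.0, 3.0]
emb_2 = [2.0, 4.0, 6.0]   # points in the same direction as emb_1
emb_3 = [-3.0, 0.0, 1.0]  # orthogonal to emb_1 (dot product is 0)

same_direction = cosine_similarity(emb_1, emb_2)
orthogonal = cosine_similarity(emb_1, emb_3)
```

Vectors pointing the same way score close to 1 and orthogonal vectors score close to 0, which is why a value such as 0.87 indicates two closely related texts.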
git clone https://github.com/smshelar/rag_pipeline.git
cd rag_pipeline
python -m venv venv
source venv/bin/activate # On macOS/Linux
venv\Scripts\activate # On Windows
pip install -r requirements.txt
uvicorn rag_api:app --host 0.0.0.0 --port 8000 --reload
Endpoint:
POST /query/
Request Body:
{
"query_text": "What is the company name?"
}
Response:
{
"response": "The company name is ConocoPhillips.",
"sources": ["document_1.pdf(page_num:chunk_num)",
"document_2.pdf(page_num:chunk_num)",
"document_3.pdf(page_num:chunk_num)"]
}
Endpoint:
POST /compare/
Request Body:
{
"query_1": "Impact of climate change",
"query_2": "Rising sea levels"
}
Response:
{
"query_1": "Impact of climate change",
"query_2": "Rising sea levels",
"similarity_score": 0.87,
"source_1": "doc1.pdf",
"source_2": "doc2.pdf"
}
Endpoint:
POST /populate/
Request Body:
{
"reset": true
}
Response:
{
"message": "Database populated with 100 chunks"
}
Run all tests using:
pytest test.py
📁 rag_pipeline
│-- 📂 data/ # Directory for PDFs
│-- 📂 chroma/ # ChromaDB storage
│-- 📜 embedding_function.py # Ollama embedding function
│-- 📜 query.py # Query processing
│-- 📜 compare_embeddings.py # Document similarity comparison
│-- 📜 load_model.py # Data pipeline for ChromaDB
│-- 📜 rag_api.py # FastAPI server
│-- 📜 test.py # Pytest-based tests
│-- 📜 requirements.txt # Dependencies
│-- 📜 README.md # Project Documentation
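The three endpoints documented above (`/populate/`, `/query/`, `/compare/`) can be exercised from Python with a small client. The sketch below uses only the standard library and assumes the server was started locally with the `uvicorn` command shown earlier, so it is reachable at `http://localhost:8000`. The helper names `build_request`, `post_json`, and `demo` are hypothetical and not part of this repository.

```python
import json
import urllib.request

# Assumption: the API is running locally on the port used in the uvicorn command.
BASE_URL = "http://localhost:8000"


def build_request(path: str, payload: dict) -> urllib.request.Request:
    """Build a JSON POST request for one of the API endpoints."""
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        url=f"{BASE_URL}{path}",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def post_json(path: str, payload: dict, timeout: float = 60.0) -> dict:
    """Send the request and decode the JSON response body."""
    request = build_request(path, payload)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


def demo() -> None:
    """Call each endpoint once; requires the server to be running."""
    # Rebuild the vector store from the PDFs in data/.
    print(post_json("/populate/", {"reset": True}))
    # Ask a question against the indexed documents.
    print(post_json("/query/", {"query_text": "What is the company name?"}))
    # Compare two queries by embedding similarity.
    print(post_json("/compare/", {
        "query_1": "Impact of climate change",
        "query_2": "Rising sea levels",
    }))
```

Calling `demo()` with the server up should print one JSON object per endpoint. The generous default timeout is a guess, since generation with a local model may take a while.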
- 🔹 Dockerization for deployment
- 🔹 Support for more document formats (TXT, DOCX)
- 🔹 Advanced ranking using LLM-generated summaries
🚀 Developed with ❤️ using Python, LangChain & FastAPI