A production-ready Retrieval-Augmented Generation (RAG) system built with FastAPI, ChromaDB, and Ollama. This pipeline provides intelligent document processing, hybrid retrieval (BM25 + vector search), and LLM-powered question answering.
- PDF Processing: Advanced PDF-to-Markdown conversion using Docling with OCR support
- Hybrid Retrieval: Combines BM25 sparse retrieval and dense vector search with Reciprocal Rank Fusion (RRF)
- Smart Chunking: Intelligent document chunking with support for code, markdown, and hybrid fallback strategies
- Embedding Cache: SQLite-based embedding cache to avoid redundant computations
- Async Workers: Parallel embedding generation with configurable worker pools
- Vector Store: ChromaDB for efficient similarity search
- LLM Integration: Ollama integration for answer generation with context
- Reranking: Optional cross-encoder reranking for improved relevance
- RESTful API: FastAPI-based API with automatic documentation
- CLI Tool: Command-line interface for easy interaction
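The embedding cache keyed on model and chunk text avoids re-embedding unchanged content. A minimal sketch of how such a SQLite cache might work, using a hash of (model, text) as the key; this is illustrative and the actual schema in `app/embeddings/cache.py` may differ:

```python
# Illustrative SQLite embedding cache keyed by sha256(model + text).
# The project's real cache implementation may use a different schema.
import hashlib
import json
import sqlite3

class EmbeddingCache:
    def __init__(self, path=":memory:"):
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS cache (key TEXT PRIMARY KEY, vec TEXT)"
        )

    @staticmethod
    def _key(model, text):
        # Key on both model and text so switching EMBED_MODEL never reuses stale vectors
        return hashlib.sha256(f"{model}\x00{text}".encode()).hexdigest()

    def get(self, model, text):
        row = self.db.execute(
            "SELECT vec FROM cache WHERE key = ?", (self._key(model, text),)
        ).fetchone()
        return json.loads(row[0]) if row else None

    def put(self, model, text, vector):
        self.db.execute(
            "INSERT OR REPLACE INTO cache VALUES (?, ?)",
            (self._key(model, text), json.dumps(vector)),
        )
        self.db.commit()

cache = EmbeddingCache()
cache.put("bge-m3", "hello world", [0.1, 0.2])
print(cache.get("bge-m3", "hello world"))  # [0.1, 0.2]
```

Keying on the model name as well as the text is what makes an `EMBED_MODEL` change safe: the old vectors simply miss the cache instead of being served with the wrong dimensionality.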
```
                    ┌───────────────────┐
                    │    FastAPI API    │
                    └─────────┬─────────┘
                              │
       ┌──────────────┬───────┴──────┬──────────────┐
       │              │              │              │
 ┌─────▼────┐   ┌─────▼────┐   ┌─────▼─────┐  ┌─────▼────┐
 │  Upload  │   │   Ask    │   │  Health   │  │   CLI    │
 │  Router  │   │ Endpoint │   │   Check   │  │  Client  │
 └─────┬────┘   └─────┬────┘   └───────────┘  └──────────┘
       │              │
       └──────┬───────┘
              │
  ┌───────────▼──────────┐
  │  Docling Converter   │
  │  (PDF → Markdown)    │
  └───────────┬──────────┘
              │
  ┌───────────▼──────────┐
  │   Chunking System    │
  │   - Markdown         │
  │   - Code             │
  │   - Hybrid Fallback  │
  └───────────┬──────────┘
              │
  ┌───────────▼──────────┐
  │  Embedding Workers   │
  │    (Async Pool)      │
  └───────────┬──────────┘
              │
  ┌───────────▼──────────┐
  │   Embedding Cache    │
  │      (SQLite)        │
  └───────────┬──────────┘
              │
  ┌───────────▼──────────┐
  │   ChromaDB Store     │
  └───────────┬──────────┘
              │
  ┌───────────▼──────────┐
  │   Hybrid Retriever   │
  │   - BM25 (Sparse)    │
  │   - Vector (Dense)   │
  │   - RRF Fusion       │
  │   - Reranking        │
  └───────────┬──────────┘
              │
  ┌───────────▼──────────┐
  │      Ollama LLM      │
  │     (Answer Gen)     │
  └──────────────────────┘
```
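The RRF fusion step in the retriever combines the BM25 and vector rankings without needing comparable scores: each document earns `1 / (k + rank)` from every list it appears in, and the sums are re-sorted. A minimal illustrative sketch (not the project's actual implementation; `k = 60` is the conventional constant):

```python
# Illustrative Reciprocal Rank Fusion (RRF): a document's fused score is the
# sum over each ranked list of 1 / (k + rank). Documents appearing in both
# the sparse and dense rankings rise to the top.
def rrf_fuse(rankings, k=60):
    """rankings: list of ranked ID lists (best first). Returns fused IDs, best first."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25_hits = ["doc3", "doc1", "doc7"]   # sparse (keyword) ranking
dense_hits = ["doc1", "doc5", "doc3"]  # dense (vector) ranking
print(rrf_fuse([bm25_hits, dense_hits]))  # doc1 and doc3, found by both, rank first
```

Because RRF only uses ranks, it sidesteps the problem that BM25 scores and cosine distances live on incompatible scales.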
- Python 3.9+
- Ollama installed and running
- GPU recommended for faster processing (optional)
```bash
git clone https://github.com/LathissKhumar/RAG_PIPELINE.git
cd RAG_PIPELINE

python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

pip install -r requirements.txt

# Install Ollama (visit https://ollama.ai/ for instructions)
# Pull required models
ollama pull bge-m3   # Embedding model
ollama pull llama3   # LLM model (or your preferred model)
```

The application will automatically create the required directories on first run:
- `uploaded_pdfs/` - Stores uploaded PDF files
- `converted_mds/` - Stores converted Markdown files
- `chroma_db/` - ChromaDB vector store
- `bm25_index/` - BM25 index files
Configure the application using environment variables. Create a `.env` file in the root directory:
```bash
# Ollama Configuration
# Note: Use either OLLAMA_BASE_URL (recommended) or OLLAMA_URL
OLLAMA_BASE_URL=http://localhost:11434     # Base URL for Ollama API (used by main.py)
OLLAMA_URL=http://localhost:11434          # Alternative URL (used by embeddings/llm modules)
EMBED_MODEL=bge-m3                         # Embedding model name
LLM_MODEL=llama3                           # LLM model for answer generation

# Embedding Worker Configuration
EMBED_WORKERS=3                            # Number of parallel workers
EMBED_BATCH_SIZE=64                        # Chunks per batch
EMBED_BATCH_WAIT_MS=200                    # Batch wait time in milliseconds
EMBED_CACHE_PATH=embeddings_cache.sqlite3  # Cache file path

# ChromaDB Configuration
CHROMA_COLLECTION=documents                # Collection name for vector store

# Hybrid Retrieval Configuration
HYBRID_ALPHA=0.5                           # 0=BM25 only, 1=vector only, 0.5=balanced
USE_RERANKER=1                             # 1=enabled, 0=disabled

# LLM Configuration
OLLAMA_LLM_TIMEOUT=180                     # Request timeout in seconds
OLLAMA_LLM_RETRIES=3                       # Number of retry attempts
OLLAMA_LLM_BACKOFF=1.5                     # Exponential backoff multiplier

# API/CLI Configuration
API_BASE_URL=http://localhost:8000         # API server URL for CLI
ASK_TOP_K=5                                # Default top-k results for CLI
```

| Parameter | Default | Description |
|---|---|---|
| `OLLAMA_BASE_URL` | `http://localhost:11434` | Base URL for Ollama API (health checks) |
| `OLLAMA_URL` | `http://localhost:11434` | Ollama URL for embeddings and LLM calls |
| `EMBED_MODEL` | `bge-m3` | Ollama embedding model name |
| `LLM_MODEL` | `llama3` | Ollama LLM model for answer generation |
| `EMBED_WORKERS` | `3` | Number of parallel embedding workers |
| `EMBED_BATCH_SIZE` | `64` | Batch size for embedding generation |
| `EMBED_BATCH_WAIT_MS` | `200` | Wait time in milliseconds before processing a batch |
| `EMBED_CACHE_PATH` | `embeddings_cache.sqlite3` | SQLite cache file for embeddings |
| `CHROMA_COLLECTION` | `documents` | ChromaDB collection name |
| `HYBRID_ALPHA` | `0.5` | Weight for dense retrieval (0=BM25 only, 1=vector only) |
| `USE_RERANKER` | `1` | Enable cross-encoder reranking (1=yes, 0=no) |
| `OLLAMA_LLM_TIMEOUT` | `180` | LLM API request timeout in seconds |
| `OLLAMA_LLM_RETRIES` | `3` | Number of retry attempts for LLM requests |
| `OLLAMA_LLM_BACKOFF` | `1.5` | Exponential backoff multiplier for retries |
| `API_BASE_URL` | `http://localhost:8000` | API server URL (used by CLI) |
| `ASK_TOP_K` | `5` | Default number of results for CLI queries |
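The `OLLAMA_LLM_TIMEOUT` / `OLLAMA_LLM_RETRIES` / `OLLAMA_LLM_BACKOFF` settings describe a retry-with-exponential-backoff policy. One plausible reading of those semantics, sketched in Python (the `with_retries` helper and `base_delay` parameter are illustrative, not the project's actual API):

```python
# Illustrative retry loop: on failure, sleep base_delay * backoff**attempt
# before trying again, and re-raise after the final attempt.
import time

def with_retries(call, retries=3, backoff=1.5, base_delay=1.0):
    for attempt in range(retries):
        try:
            return call()
        except Exception:
            if attempt == retries - 1:
                raise  # out of attempts: propagate the error
            time.sleep(base_delay * backoff ** attempt)  # 1.0s, 1.5s, 2.25s, ...

attempts = []
def flaky():
    attempts.append(1)
    if len(attempts) < 3:
        raise RuntimeError("transient error")
    return "ok"

print(with_retries(flaky, base_delay=0.001))  # succeeds on the third attempt
```

With the defaults above (`retries=3`, `backoff=1.5`), a request that keeps failing waits roughly 1s and then 1.5s before the final attempt.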
```bash
# Start the FastAPI server
uvicorn app.main:app --host 0.0.0.0 --port 8000 --reload
```

The API will be available at http://localhost:8000. Interactive API documentation: http://localhost:8000/docs
```bash
curl http://localhost:8000/health
```

Response:

```json
{
  "status": "healthy",
  "ollama_available": true,
  "workers_running": true
}
```

```bash
curl -X POST "http://localhost:8000/convert/" \
  -H "Content-Type: multipart/form-data" \
  -F "files=@document1.pdf" \
  -F "files=@document2.pdf"
```

Response:

```json
{
  "successful": [
    {
      "filename": "document1.pdf",
      "md_path": "converted_mds/document1/document1.md",
      "chunks_count": 42
    }
  ],
  "failed": [],
  "total_processed": 1,
  "total_successful": 1,
  "total_failed": 0
}
```

Features:
- Supports multiple file upload (max 10 files per request)
- Maximum file size: 50MB per file
- Automatic OCR for scanned documents
- Smart chunking and embedding
- Duplicate detection and deduplication
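Duplicate detection can be as simple as hashing normalized chunk text and skipping anything already seen. A minimal sketch of this idea (illustrative only; the project's `file_registry.py` may track duplicates differently):

```python
# Illustrative content-hash deduplication: identical chunks (after whitespace
# trimming) are dropped so the vector store is not polluted with repeats.
import hashlib

def dedupe_chunks(chunks):
    seen, unique = set(), []
    for text in chunks:
        digest = hashlib.sha256(text.strip().encode()).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(text)
    return unique

print(dedupe_chunks(["Intro...", "Intro...", "Methods..."]))  # ['Intro...', 'Methods...']
```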
```bash
curl -X POST "http://localhost:8000/ask" \
  -H "Content-Type: application/json" \
  -d '{
    "question": "What is the main topic of the document?",
    "top_k": 5,
    "use_llm": true
  }'
```

Response:

```json
{
  "question": "What is the main topic of the document?",
  "answer": "Based on the provided context, the main topic...",
  "top_k": 5,
  "results": [
    {
      "id": "document1__001",
      "text": "Relevant text chunk...",
      "metadata": {
        "source_md": "converted_mds/document1/document1.md",
        "chunk_index": 1
      },
      "distance": 0.85
    }
  ]
}
```

Query Parameters:
- `question` (required): The question to ask
- `top_k` (optional, default: 10): Number of chunks to retrieve
- `use_llm` (optional, default: true): Generate an LLM answer
- `where` (optional): Metadata filters (ChromaDB format)
- `where_document` (optional): Document content filters
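The same request can be issued from Python with only the standard library. A small sketch (the `build_payload`/`ask` helpers are illustrative, not part of the project; it assumes the server is running at `API_BASE_URL`):

```python
# Minimal /ask client using only the standard library. build_payload mirrors
# the request parameters documented above; ask() POSTs it as JSON.
import json
from urllib import request

API_BASE_URL = "http://localhost:8000"

def build_payload(question, top_k=10, use_llm=True, where=None):
    payload = {"question": question, "top_k": top_k, "use_llm": use_llm}
    if where is not None:  # omit optional filters when unused
        payload["where"] = where
    return payload

def ask(question, **kwargs):
    req = request.Request(
        f"{API_BASE_URL}/ask",
        data=json.dumps(build_payload(question, **kwargs)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with request.urlopen(req) as resp:
        return json.load(resp)

# Example (requires a running server):
# result = ask("What is the main topic?", top_k=5)
# print(result["answer"])
```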
The CLI provides a convenient way to interact with the RAG system from the terminal.
The CLI is installed with the package dependencies:

```bash
python -m app.cli --help
```

```bash
# Basic question
python -m app.cli ask "What is the main topic?"

# Retrieve more results
python -m app.cli ask "Explain the methodology" --top-k 10

# Show source chunks
python -m app.cli ask "What are the key findings?" --show-sources

# Disable the LLM and show only retrieved chunks
python -m app.cli ask "Find information about X" --no-llm
```

CLI Options:
| Option | Description | Default |
|---|---|---|
| `--top-k` | Number of results to return | 5 |
| `--no-llm` | Disable LLM answer generation | False |
| `--show-sources` | Show source chunks with answer | False |
Environment Variables for CLI:
```bash
export API_BASE_URL=http://localhost:8000
export ASK_TOP_K=5
```

Request body:

```python
{
    "question": str,               # Required: Question to ask
    "top_k": int = 10,             # Optional: Number of results
    "use_llm": bool = True,        # Optional: Generate LLM answer
    "where": dict = None,          # Optional: Metadata filter
    "where_document": dict = None  # Optional: Document filter
}
```

Response body:

```python
{
    "question": str,        # Echo of the question
    "answer": str | None,   # LLM-generated answer
    "top_k": int,           # Number of results returned
    "results": [            # Retrieved chunks
        {
            "id": str,         # Chunk ID
            "text": str,       # Chunk text
            "metadata": dict,  # Chunk metadata
            "distance": float  # Similarity score
        }
    ]
}
```

Filter results by metadata fields:
```python
# Find chunks from a specific document
{
    "question": "What is X?",
    "where": {
        "source_md": {"$eq": "converted_mds/document1/document1.md"}
    }
}

# Find recent chunks (if timestamp metadata exists)
{
    "question": "Recent updates?",
    "where": {
        "timestamp": {"$gte": "2024-01-01"}
    }
}
```

```
RAG_PIPELINE/
├── app/
│   ├── main.py                  # FastAPI application
│   ├── cli.py                   # CLI client
│   ├── routers/
│   │   └── upload.py            # Upload endpoints
│   ├── embeddings/
│   │   ├── worker.py            # Async embedding workers
│   │   ├── cache.py             # Embedding cache
│   │   └── ollama_embeddings.py
│   ├── vector_store/
│   │   └── chroma_client.py     # ChromaDB client
│   ├── retrieval/
│   │   ├── hybrid_retriever.py  # Hybrid retrieval
│   │   ├── bm25_retriever.py    # BM25 retrieval
│   │   └── reranker.py          # Cross-encoder reranker
│   ├── llm/
│   │   └── ollama_llm.py        # LLM integration
│   ├── utils/
│   │   ├── docling_converter.py # PDF conversion
│   │   └── file_registry.py     # File tracking
│   └── chunker/
│       ├── markdown_chunker.py
│       ├── code_chunker.py
│       ├── hybrid_fallback.py
│       └── optimizer.py
├── requirements.txt
├── pyproject.toml
└── README.md
```
```bash
# Test Ollama API connection
python test_ollama_api.py

# Test dense retrieval
python test_dense_retrieval.py

# Test embedding with debug info
python test_embedding_debug.py

# Diagnose ChromaDB issues
python diagnose_chroma.py
```

If you need to clear caches and rebuild:

```bash
# Clear all caches and databases
python reset_caches.py

# This will remove:
# - embeddings_cache.sqlite3
# - chroma_db/
# - bm25_index/
# - file_registry.db
```

```bash
# Format code (if using ruff or black)
ruff format .

# Lint code
ruff check .
```

Error: `Ollama health check failed` or `Connection refused`
Solution:
- Ensure Ollama is running: `ollama serve`
- Check the URL: `OLLAMA_BASE_URL=http://localhost:11434`
- Verify models are pulled: `ollama list`

Error: `Embedding dimension mismatch detected`

Solution:
- Clear ChromaDB and re-ingest: `python reset_caches.py`
- Ensure a consistent `EMBED_MODEL` across runs
- Re-upload all documents after clearing

Error: `CUDA out of memory` or system hangs

Solution:
- Reduce `EMBED_BATCH_SIZE` (try 32 or 16)
- Reduce `EMBED_WORKERS` (try 1 or 2)
- Use CPU-only mode: `onnxruntime` instead of `onnxruntime-gpu`
Solution:
- Reduce image quality in Docling settings
- Use GPU acceleration for OCR
- Process files in smaller batches
Solution:
- Check that documents were successfully ingested
- Verify the ChromaDB collection: `python diagnose_chroma.py`
- Check that the BM25 index exists: `ls bm25_index/`
- Rebuild indexes if needed
Enable detailed logging:

```python
import logging
logging.basicConfig(level=logging.DEBUG)
```

This project is available for use and modification. Please check the repository for specific license information.
Contributions are welcome! Please feel free to submit a Pull Request.
For issues and questions, please open an issue on GitHub or contact the maintainer.