A production-ready Retrieval-Augmented Generation (RAG) system built with FastAPI, ChromaDB, and Ollama. This pipeline provides intelligent document processing, hybrid retrieval (BM25 + vector search), and LLM-powered question answering.
- PDF Processing: Advanced PDF-to-Markdown conversion using Docling with OCR support
- Hybrid Retrieval: Combines BM25 sparse retrieval and dense vector search with Reciprocal Rank Fusion (RRF)
- Smart Chunking: Intelligent document chunking with support for code, markdown, and hybrid fallback strategies
- Embedding Cache: SQLite-based embedding cache to avoid redundant computations
- Async Workers: Parallel embedding generation with configurable worker pools
- Vector Store: ChromaDB for efficient similarity search
- LLM Integration: Ollama integration for answer generation with context
- Reranking: Optional cross-encoder reranking for improved relevance
- RESTful API: FastAPI-based API with automatic documentation
- CLI Tool: Command-line interface for easy interaction
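The embedding cache keyed on model and chunk text avoids re-embedding unchanged content. A minimal sketch of how such a SQLite cache might work, using a hash of (model, text) as the key; this is illustrative and the actual schema in `app/embeddings/cache.py` may differ:

```python
# Illustrative SQLite embedding cache keyed by sha256(model + text).
# The project's real cache implementation may use a different schema.
import hashlib
import json
import sqlite3

class EmbeddingCache:
    def __init__(self, path=":memory:"):
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS cache (key TEXT PRIMARY KEY, vec TEXT)"
        )

    @staticmethod
    def _key(model, text):
        # Key on both model and text so switching EMBED_MODEL never reuses stale vectors
        return hashlib.sha256(f"{model}\x00{text}".encode()).hexdigest()

    def get(self, model, text):
        row = self.db.execute(
            "SELECT vec FROM cache WHERE key = ?", (self._key(model, text),)
        ).fetchone()
        return json.loads(row[0]) if row else None

    def put(self, model, text, vector):
        self.db.execute(
            "INSERT OR REPLACE INTO cache VALUES (?, ?)",
            (self._key(model, text), json.dumps(vector)),
        )
        self.db.commit()

cache = EmbeddingCache()
cache.put("bge-m3", "hello world", [0.1, 0.2])
print(cache.get("bge-m3", "hello world"))  # [0.1, 0.2]
```

Keying on the model name as well as the text is what makes an `EMBED_MODEL` change safe: the old vectors simply miss the cache instead of being served with the wrong dimensionality.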
```
                    ┌───────────────────┐
                    │    FastAPI API    │
                    └─────────┬─────────┘
                              │
       ┌──────────────┬───────┴──────┬──────────────┐
       │              │              │              │
 ┌─────▼────┐   ┌─────▼────┐   ┌─────▼─────┐  ┌─────▼────┐
 │  Upload  │   │   Ask    │   │  Health   │  │   CLI    │
 │  Router  │   │ Endpoint │   │   Check   │  │  Client  │
 └─────┬────┘   └─────┬────┘   └───────────┘  └──────────┘
       │              │
       └──────┬───────┘
              │
  ┌───────────▼──────────┐
  │  Docling Converter   │
  │  (PDF → Markdown)    │
  └───────────┬──────────┘
              │
  ┌───────────▼──────────┐
  │   Chunking System    │
  │   - Markdown         │
  │   - Code             │
  │   - Hybrid Fallback  │
  └───────────┬──────────┘
              │
  ┌───────────▼──────────┐
  │  Embedding Workers   │
  │    (Async Pool)      │
  └───────────┬──────────┘
              │
  ┌───────────▼──────────┐
  │   Embedding Cache    │
  │      (SQLite)        │
  └───────────┬──────────┘
              │
  ┌───────────▼──────────┐
  │   ChromaDB Store     │
  └───────────┬──────────┘
              │
  ┌───────────▼──────────┐
  │   Hybrid Retriever   │
  │   - BM25 (Sparse)    │
  │   - Vector (Dense)   │
  │   - RRF Fusion       │
  │   - Reranking        │
  └───────────┬──────────┘
              │
  ┌───────────▼──────────┐
  │      Ollama LLM      │
  │     (Answer Gen)     │
  └──────────────────────┘
```
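The RRF fusion step in the retriever combines the BM25 and vector rankings without needing comparable scores: each document earns `1 / (k + rank)` from every list it appears in, and the sums are re-sorted. A minimal illustrative sketch (not the project's actual implementation; `k = 60` is the conventional constant):

```python
# Illustrative Reciprocal Rank Fusion (RRF): a document's fused score is the
# sum over each ranked list of 1 / (k + rank). Documents appearing in both
# the sparse and dense rankings rise to the top.
def rrf_fuse(rankings, k=60):
    """rankings: list of ranked ID lists (best first). Returns fused IDs, best first."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25_hits = ["doc3", "doc1", "doc7"]   # sparse (keyword) ranking
dense_hits = ["doc1", "doc5", "doc3"]  # dense (vector) ranking
print(rrf_fuse([bm25_hits, dense_hits]))  # doc1 and doc3, found by both, rank first
```

Because RRF only uses ranks, it sidesteps the problem that BM25 scores and cosine distances live on incompatible scales.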
- Python 3.9+
- Ollama installed and running
- GPU recommended for faster processing (optional)
```bash
git clone https://github.com/LathissKhumar/RAG_PIPELINE.git
cd RAG_PIPELINE

python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

pip install -r requirements.txt

# Install Ollama (visit https://ollama.ai/ for instructions)
# Pull required models
ollama pull bge-m3   # Embedding model
ollama pull llama3   # LLM model (or your preferred model)
```

The application will automatically create the required directories on first run:
- `uploaded_pdfs/` - Stores uploaded PDF files
- `converted_mds/` - Stores converted Markdown files
- `chroma_db/` - ChromaDB vector store
- `bm25_index/` - BM25 index files
Configure the application using environment variables. Create a `.env` file in the root directory:
```bash
# Ollama Configuration
# Note: Use either OLLAMA_BASE_URL (recommended) or OLLAMA_URL
OLLAMA_BASE_URL=http://localhost:11434     # Base URL for Ollama API (used by main.py)
OLLAMA_URL=http://localhost:11434          # Alternative URL (used by embeddings/llm modules)
EMBED_MODEL=bge-m3                         # Embedding model name
LLM_MODEL=llama3                           # LLM model for answer generation

# Embedding Worker Configuration
EMBED_WORKERS=3                            # Number of parallel workers
EMBED_BATCH_SIZE=64                        # Chunks per batch
EMBED_BATCH_WAIT_MS=200                    # Batch wait time in milliseconds
EMBED_CACHE_PATH=embeddings_cache.sqlite3  # Cache file path

# ChromaDB Configuration
CHROMA_COLLECTION=documents                # Collection name for vector store

# Hybrid Retrieval Configuration
HYBRID_ALPHA=0.5                           # 0=BM25 only, 1=vector only, 0.5=balanced
USE_RERANKER=1                             # 1=enabled, 0=disabled

# LLM Configuration
OLLAMA_LLM_TIMEOUT=180                     # Request timeout in seconds
OLLAMA_LLM_RETRIES=3                       # Number of retry attempts
OLLAMA_LLM_BACKOFF=1.5                     # Exponential backoff multiplier

# API/CLI Configuration
API_BASE_URL=http://localhost:8000         # API server URL for CLI
ASK_TOP_K=5                                # Default top-k results for CLI
```

| Parameter | Default | Description |
|---|---|---|
| `OLLAMA_BASE_URL` | `http://localhost:11434` | Base URL for Ollama API (health checks) |
| `OLLAMA_URL` | `http://localhost:11434` | Ollama URL for embeddings and LLM calls |
| `EMBED_MODEL` | `bge-m3` | Ollama embedding model name |
| `LLM_MODEL` | `llama3` | Ollama LLM model for answer generation |
| `EMBED_WORKERS` | `3` | Number of parallel embedding workers |
| `EMBED_BATCH_SIZE` | `64` | Batch size for embedding generation |
| `EMBED_BATCH_WAIT_MS` | `200` | Wait time in milliseconds before processing a batch |
| `EMBED_CACHE_PATH` | `embeddings_cache.sqlite3` | SQLite cache file for embeddings |
| `CHROMA_COLLECTION` | `documents` | ChromaDB collection name |
| `HYBRID_ALPHA` | `0.5` | Weight for dense retrieval (0=BM25 only, 1=vector only) |
| `USE_RERANKER` | `1` | Enable cross-encoder reranking (1=yes, 0=no) |
| `OLLAMA_LLM_TIMEOUT` | `180` | LLM API request timeout in seconds |
| `OLLAMA_LLM_RETRIES` | `3` | Number of retry attempts for LLM requests |
| `OLLAMA_LLM_BACKOFF` | `1.5` | Exponential backoff multiplier for retries |
| `API_BASE_URL` | `http://localhost:8000` | API server URL (used by CLI) |
| `ASK_TOP_K` | `5` | Default number of results for CLI queries |
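The `OLLAMA_LLM_TIMEOUT` / `OLLAMA_LLM_RETRIES` / `OLLAMA_LLM_BACKOFF` settings describe a retry-with-exponential-backoff policy. One plausible reading of those semantics, sketched in Python (the `with_retries` helper and `base_delay` parameter are illustrative, not the project's actual API):

```python
# Illustrative retry loop: on failure, sleep base_delay * backoff**attempt
# before trying again, and re-raise after the final attempt.
import time

def with_retries(call, retries=3, backoff=1.5, base_delay=1.0):
    for attempt in range(retries):
        try:
            return call()
        except Exception:
            if attempt == retries - 1:
                raise  # out of attempts: propagate the error
            time.sleep(base_delay * backoff ** attempt)  # 1.0s, 1.5s, 2.25s, ...

attempts = []
def flaky():
    attempts.append(1)
    if len(attempts) < 3:
        raise RuntimeError("transient error")
    return "ok"

print(with_retries(flaky, base_delay=0.001))  # succeeds on the third attempt
```

With the defaults above (`retries=3`, `backoff=1.5`), a request that keeps failing waits roughly 1s and then 1.5s before the final attempt.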
```bash
# Start the FastAPI server
uvicorn app.main:app --host 0.0.0.0 --port 8000 --reload
```

The API will be available at http://localhost:8000. Interactive API documentation: http://localhost:8000/docs
```bash
curl http://localhost:8000/health
```

Response:

```json
{
  "status": "healthy",
  "ollama_available": true,
  "workers_running": true
}
```

```bash
curl -X POST "http://localhost:8000/convert/" \
  -H "Content-Type: multipart/form-data" \
  -F "files=@document1.pdf" \
  -F "files=@document2.pdf"
```

Response:

```json
{
  "successful": [
    {
      "filename": "document1.pdf",
      "md_path": "converted_mds/document1/document1.md",
      "chunks_count": 42
    }
  ],
  "failed": [],
  "total_processed": 1,
  "total_successful": 1,
  "total_failed": 0
}
```

Features:
- Supports multiple file upload (max 10 files per request)
- Maximum file size: 50MB per file
- Automatic OCR for scanned documents
- Smart chunking and embedding
- Duplicate detection and deduplication
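Duplicate detection can be as simple as hashing normalized chunk text and skipping anything already seen. A minimal sketch of this idea (illustrative only; the project's `file_registry.py` may track duplicates differently):

```python
# Illustrative content-hash deduplication: identical chunks (after whitespace
# trimming) are dropped so the vector store is not polluted with repeats.
import hashlib

def dedupe_chunks(chunks):
    seen, unique = set(), []
    for text in chunks:
        digest = hashlib.sha256(text.strip().encode()).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(text)
    return unique

print(dedupe_chunks(["Intro...", "Intro...", "Methods..."]))  # ['Intro...', 'Methods...']
```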
```bash
curl -X POST "http://localhost:8000/ask" \
  -H "Content-Type: application/json" \
  -d '{
    "question": "What is the main topic of the document?",
    "top_k": 5,
    "use_llm": true
  }'
```

Response:

```json
{
  "question": "What is the main topic of the document?",
  "answer": "Based on the provided context, the main topic...",
  "top_k": 5,
  "results": [
    {
      "id": "document1__001",
      "text": "Relevant text chunk...",
      "metadata": {
        "source_md": "converted_mds/document1/document1.md",
        "chunk_index": 1
      },
      "distance": 0.85
    }
  ]
}
```

Query Parameters:
- `question` (required): The question to ask
- `top_k` (optional, default: 10): Number of chunks to retrieve
- `use_llm` (optional, default: true): Generate an LLM answer
- `where` (optional): Metadata filters (ChromaDB format)
- `where_document` (optional): Document content filters
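The same request can be issued from Python with only the standard library. A small sketch (the `build_payload`/`ask` helpers are illustrative, not part of the project; it assumes the server is running at `API_BASE_URL`):

```python
# Minimal /ask client using only the standard library. build_payload mirrors
# the request parameters documented above; ask() POSTs it as JSON.
import json
from urllib import request

API_BASE_URL = "http://localhost:8000"

def build_payload(question, top_k=10, use_llm=True, where=None):
    payload = {"question": question, "top_k": top_k, "use_llm": use_llm}
    if where is not None:  # omit optional filters when unused
        payload["where"] = where
    return payload

def ask(question, **kwargs):
    req = request.Request(
        f"{API_BASE_URL}/ask",
        data=json.dumps(build_payload(question, **kwargs)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with request.urlopen(req) as resp:
        return json.load(resp)

# Example (requires a running server):
# result = ask("What is the main topic?", top_k=5)
# print(result["answer"])
```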
The CLI provides a convenient way to interact with the RAG system from the terminal.
The CLI is installed with the package dependencies:

```bash
python -m app.cli --help
```

```bash
# Basic question
python -m app.cli ask "What is the main topic?"

# Retrieve more results
python -m app.cli ask "Explain the methodology" --top-k 10

# Show source chunks
python -m app.cli ask "What are the key findings?" --show-sources

# Disable the LLM and show only retrieved chunks
python -m app.cli ask "Find information about X" --no-llm
```

CLI Options:
| Option | Description | Default |
|---|---|---|
| `--top-k` | Number of results to return | 5 |
| `--no-llm` | Disable LLM answer generation | False |
| `--show-sources` | Show source chunks with answer | False |
Environment Variables for CLI:
```bash
export API_BASE_URL=http://localhost:8000
export ASK_TOP_K=5
```

Request body:

```python
{
    "question": str,               # Required: Question to ask
    "top_k": int = 10,             # Optional: Number of results
    "use_llm": bool = True,        # Optional: Generate LLM answer
    "where": dict = None,          # Optional: Metadata filter
    "where_document": dict = None  # Optional: Document filter
}
```

Response body:

```python
{
    "question": str,        # Echo of the question
    "answer": str | None,   # LLM-generated answer
    "top_k": int,           # Number of results returned
    "results": [            # Retrieved chunks
        {
            "id": str,         # Chunk ID
            "text": str,       # Chunk text
            "metadata": dict,  # Chunk metadata
            "distance": float  # Similarity score
        }
    ]
}
```

Filter results by metadata fields:
```python
# Find chunks from a specific document
{
    "question": "What is X?",
    "where": {
        "source_md": {"$eq": "converted_mds/document1/document1.md"}
    }
}

# Find recent chunks (if timestamp metadata exists)
{
    "question": "Recent updates?",
    "where": {
        "timestamp": {"$gte": "2024-01-01"}
    }
}
```

```
RAG_PIPELINE/
├── app/
│   ├── main.py                  # FastAPI application
│   ├── cli.py                   # CLI client
│   ├── routers/
│   │   └── upload.py            # Upload endpoints
│   ├── embeddings/
│   │   ├── worker.py            # Async embedding workers
│   │   ├── cache.py             # Embedding cache
│   │   └── ollama_embeddings.py
│   ├── vector_store/
│   │   └── chroma_client.py     # ChromaDB client
│   ├── retrieval/
│   │   ├── hybrid_retriever.py  # Hybrid retrieval
│   │   ├── bm25_retriever.py    # BM25 retrieval
│   │   └── reranker.py          # Cross-encoder reranker
│   ├── llm/
│   │   └── ollama_llm.py        # LLM integration
│   ├── utils/
│   │   ├── docling_converter.py # PDF conversion
│   │   └── file_registry.py     # File tracking
│   └── chunker/
│       ├── markdown_chunker.py
│       ├── code_chunker.py
│       ├── hybrid_fallback.py
│       └── optimizer.py
├── requirements.txt
├── pyproject.toml
└── README.md
```
```bash
# Test Ollama API connection
python test_ollama_api.py

# Test dense retrieval
python test_dense_retrieval.py

# Test embedding with debug info
python test_embedding_debug.py

# Diagnose ChromaDB issues
python diagnose_chroma.py
```

If you need to clear caches and rebuild:

```bash
# Clear all caches and databases
python reset_caches.py

# This will remove:
# - embeddings_cache.sqlite3
# - chroma_db/
# - bm25_index/
# - file_registry.db
```

```bash
# Format code (if using ruff or black)
ruff format .

# Lint code
ruff check .
```

Error: `Ollama health check failed` or `Connection refused`
Solution:
- Ensure Ollama is running: `ollama serve`
- Check the URL: `OLLAMA_BASE_URL=http://localhost:11434`
- Verify models are pulled: `ollama list`

Error: `Embedding dimension mismatch detected`

Solution:
- Clear ChromaDB and re-ingest: `python reset_caches.py`
- Ensure a consistent `EMBED_MODEL` across runs
- Re-upload all documents after clearing

Error: `CUDA out of memory` or system hangs

Solution:
- Reduce `EMBED_BATCH_SIZE` (try 32 or 16)
- Reduce `EMBED_WORKERS` (try 1 or 2)
- Use CPU-only mode: `onnxruntime` instead of `onnxruntime-gpu`
Solution:
- Reduce image quality in Docling settings
- Use GPU acceleration for OCR
- Process files in smaller batches
Solution:
- Check that documents were successfully ingested
- Verify the ChromaDB collection: `python diagnose_chroma.py`
- Check that the BM25 index exists: `ls bm25_index/`
- Rebuild indexes if needed
Enable detailed logging:

```python
import logging
logging.basicConfig(level=logging.DEBUG)
```

This project is available for use and modification. Please check the repository for specific license information.
Contributions are welcome! Please feel free to submit a Pull Request.
For issues and questions, please open an issue on GitHub or contact the maintainer.