RAG Pipeline

A production-ready Retrieval-Augmented Generation (RAG) system built with FastAPI, ChromaDB, and Ollama. This pipeline provides intelligent document processing, hybrid retrieval (BM25 + vector search), and LLM-powered question answering.

🚀 Features

  • PDF Processing: Advanced PDF-to-Markdown conversion using Docling with OCR support
  • Hybrid Retrieval: Combines BM25 sparse retrieval and dense vector search with Reciprocal Rank Fusion (RRF)
  • Smart Chunking: Intelligent document chunking with support for code, markdown, and hybrid fallback strategies
  • Embedding Cache: SQLite-based embedding cache to avoid redundant computations
  • Async Workers: Parallel embedding generation with configurable worker pools
  • Vector Store: ChromaDB for efficient similarity search
  • LLM Integration: Ollama integration for answer generation with context
  • Reranking: Optional cross-encoder reranking for improved relevance
  • RESTful API: FastAPI-based API with automatic documentation
  • CLI Tool: Command-line interface for easy interaction
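The embedding-cache idea above can be sketched as a small SQLite table keyed by a content hash of (model, text), so re-ingesting unchanged chunks skips the embedding call. This is a simplified sketch, not the repository's actual `app/embeddings/cache.py`:

```python
import hashlib
import json
import sqlite3

def cache_key(model: str, text: str) -> str:
    """Stable key: the same (model, text) pair always hashes to the same digest."""
    return hashlib.sha256(f"{model}\x00{text}".encode("utf-8")).hexdigest()

class EmbeddingCache:
    def __init__(self, path: str = ":memory:"):
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS embeddings (key TEXT PRIMARY KEY, vector TEXT)"
        )

    def get(self, model: str, text: str):
        """Return the cached vector, or None on a cache miss."""
        row = self.db.execute(
            "SELECT vector FROM embeddings WHERE key = ?",
            (cache_key(model, text),),
        ).fetchone()
        return json.loads(row[0]) if row else None

    def put(self, model: str, text: str, vector: list) -> None:
        """Store (or overwrite) the vector for this (model, text) pair."""
        self.db.execute(
            "INSERT OR REPLACE INTO embeddings VALUES (?, ?)",
            (cache_key(model, text), json.dumps(vector)),
        )
        self.db.commit()
```

Keying on a hash of the text (rather than a filename) means the cache also deduplicates identical chunks that appear in different documents.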

📋 Table of Contents

  • Architecture
  • Installation
  • Configuration
  • Usage
  • API Documentation
  • Development
  • Troubleshooting
  • License
  • Contributing
  • Support

πŸ— Architecture

┌──────────────────┐
│   FastAPI API    │
└────────┬─────────┘
         │
         ├─────────────┬─────────────┬──────────────┐
         │             │             │              │
    ┌────▼────┐   ┌───▼────┐   ┌────▼─────┐   ┌───▼─────┐
    │ Upload  │   │  Ask   │   │ Health   │   │   CLI   │
    │ Router  │   │Endpoint│   │  Check   │   │  Client │
    └────┬────┘   └───┬────┘   └──────────┘   └─────────┘
         │            │
         │            │
    ┌────▼────────────▼────┐
    │  Docling Converter   │
    │  (PDF → Markdown)    │
    └──────────┬───────────┘
               │
    ┌──────────▼──────────┐
    │  Chunking System    │
    │  - Markdown         │
    │  - Code             │
    │  - Hybrid Fallback  │
    └──────────┬──────────┘
               │
    ┌──────────▼──────────┐
    │  Embedding Workers  │
    │  (Async Pool)       │
    └──────────┬──────────┘
               │
    ┌──────────▼──────────┐
    │  Embedding Cache    │
    │  (SQLite)           │
    └──────────┬──────────┘
               │
    ┌──────────▼──────────┐
    │   ChromaDB Store    │
    └──────────┬──────────┘
               │
    ┌──────────▼──────────┐
    │ Hybrid Retriever    │
    │  - BM25 (Sparse)    │
    │  - Vector (Dense)   │
    │  - RRF Fusion       │
    │  - Reranking        │
    └──────────┬──────────┘
               │
    ┌──────────▼──────────┐
    │    Ollama LLM       │
    │  (Answer Gen)       │
    └─────────────────────┘
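The RRF step in the Hybrid Retriever merges the BM25 and vector rankings by giving each document a score of 1/(k + rank) in every list it appears in. A minimal sketch (k=60 is the commonly used constant; the repository's hybrid_retriever.py may tune this differently):

```python
def rrf_fuse(rankings: list, k: int = 60) -> list:
    """Fuse ranked lists of document IDs via Reciprocal Rank Fusion.

    Each list contributes 1/(k + rank) to a document's total score,
    so documents ranked highly by several retrievers rise to the top.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Example: "a" tops both the sparse and dense lists, so it ranks first overall.
bm25_hits = ["a", "b", "c"]
dense_hits = ["a", "d", "b"]
fused = rrf_fuse([bm25_hits, dense_hits])
```

Because RRF uses only ranks, it never has to reconcile BM25 scores with cosine distances, which live on incompatible scales.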

📦 Installation

Prerequisites

  • Python 3.9+
  • Ollama installed and running
  • GPU recommended for faster processing (optional)

Step 1: Clone the Repository

git clone https://github.com/LathissKhumar/RAG_PIPELINE.git
cd RAG_PIPELINE

Step 2: Create Virtual Environment

python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

Step 3: Install Dependencies

pip install -r requirements.txt

Step 4: Set Up Ollama

# Install Ollama (visit https://ollama.ai/ for instructions)

# Pull required models
ollama pull bge-m3      # Embedding model
ollama pull llama3      # LLM model (or your preferred model)

Step 5: Initialize Directories

The application will automatically create required directories on first run:

  • uploaded_pdfs/ - Stores uploaded PDF files
  • converted_mds/ - Stores converted Markdown files
  • chroma_db/ - ChromaDB vector store
  • bm25_index/ - BM25 index files
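Idempotent startup creation of these directories is a one-liner per path with pathlib; a sketch of the idea:

```python
from pathlib import Path

# Directories the pipeline expects; created idempotently at startup.
REQUIRED_DIRS = ["uploaded_pdfs", "converted_mds", "chroma_db", "bm25_index"]

def ensure_dirs(base: Path = Path(".")) -> None:
    for name in REQUIRED_DIRS:
        # exist_ok=True makes this a no-op when the directory is already there.
        (base / name).mkdir(parents=True, exist_ok=True)
```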

βš™οΈ Configuration

Configure the application using environment variables. Create a .env file in the root directory:

# Ollama Configuration
# Note: Use either OLLAMA_BASE_URL (recommended) or OLLAMA_URL
OLLAMA_BASE_URL=http://localhost:11434  # Base URL for Ollama API (used by main.py)
OLLAMA_URL=http://localhost:11434       # Alternative URL (used by embeddings/llm modules)
EMBED_MODEL=bge-m3                       # Embedding model name
LLM_MODEL=llama3                         # LLM model for answer generation

# Embedding Worker Configuration
EMBED_WORKERS=3                          # Number of parallel workers
EMBED_BATCH_SIZE=64                      # Chunks per batch
EMBED_BATCH_WAIT_MS=200                  # Batch wait time in milliseconds
EMBED_CACHE_PATH=embeddings_cache.sqlite3 # Cache file path

# ChromaDB Configuration
CHROMA_COLLECTION=documents              # Collection name for vector store

# Hybrid Retrieval Configuration
HYBRID_ALPHA=0.5                         # 0=BM25 only, 1=vector only, 0.5=balanced
USE_RERANKER=1                           # 1=enabled, 0=disabled

# LLM Configuration
OLLAMA_LLM_TIMEOUT=180                   # Request timeout in seconds
OLLAMA_LLM_RETRIES=3                     # Number of retry attempts
OLLAMA_LLM_BACKOFF=1.5                   # Exponential backoff multiplier

# API/CLI Configuration
API_BASE_URL=http://localhost:8000       # API server URL for CLI
ASK_TOP_K=5                              # Default top-k results for CLI
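Variables like these are typically read with os.getenv and cast to the right type. A sketch of how such loading might look (names and defaults taken from the documentation here, not necessarily the repository's exact code):

```python
import os

def load_config() -> dict:
    """Read pipeline settings from the environment, falling back to defaults."""
    return {
        "ollama_base_url": os.getenv("OLLAMA_BASE_URL", "http://localhost:11434"),
        "embed_model": os.getenv("EMBED_MODEL", "bge-m3"),
        "llm_model": os.getenv("LLM_MODEL", "llama3"),
        "embed_workers": int(os.getenv("EMBED_WORKERS", "3")),
        "embed_batch_size": int(os.getenv("EMBED_BATCH_SIZE", "64")),
        "hybrid_alpha": float(os.getenv("HYBRID_ALPHA", "0.5")),
        # Flags arrive as strings, so "1" must be compared, not truth-tested.
        "use_reranker": os.getenv("USE_RERANKER", "1") == "1",
    }
```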

Configuration Parameters

Parameter Default Description
OLLAMA_BASE_URL http://localhost:11434 Base URL for Ollama API (health checks)
OLLAMA_URL http://localhost:11434 Ollama URL for embeddings and LLM calls
EMBED_MODEL bge-m3 Ollama embedding model name
LLM_MODEL llama3 Ollama LLM model for answer generation
EMBED_WORKERS 3 Number of parallel embedding workers
EMBED_BATCH_SIZE 64 Batch size for embedding generation
EMBED_BATCH_WAIT_MS 200 Wait time in milliseconds before processing batch
EMBED_CACHE_PATH embeddings_cache.sqlite3 SQLite cache file for embeddings
CHROMA_COLLECTION documents ChromaDB collection name
HYBRID_ALPHA 0.5 Weight for dense retrieval (0=BM25 only, 1=vector only)
USE_RERANKER 1 Enable cross-encoder reranking (1=yes, 0=no)
OLLAMA_LLM_TIMEOUT 180 LLM API request timeout in seconds
OLLAMA_LLM_RETRIES 3 Number of retry attempts for LLM requests
OLLAMA_LLM_BACKOFF 1.5 Exponential backoff multiplier for retries
API_BASE_URL http://localhost:8000 API server URL (used by CLI)
ASK_TOP_K 5 Default number of results for CLI queries
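The OLLAMA_LLM_RETRIES / OLLAMA_LLM_BACKOFF pair describes a standard exponential-backoff retry loop: each failed attempt waits, then the wait is multiplied by the backoff factor. A sketch of the schedule (not the repository's exact implementation):

```python
import time

def call_with_retries(fn, retries: int = 3, backoff: float = 1.5,
                      base_delay: float = 1.0):
    """Call fn(), retrying up to `retries` times with exponential backoff."""
    delay = base_delay
    for attempt in range(retries):
        try:
            return fn()
        except Exception:
            if attempt == retries - 1:
                raise  # out of attempts: surface the last error
            # With backoff=1.5 the waits grow as 1.0s, 1.5s, 2.25s, ...
            time.sleep(delay)
            delay *= backoff
```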

🚀 Usage

Starting the API Server

# Start the FastAPI server
uvicorn app.main:app --host 0.0.0.0 --port 8000 --reload

The API will be available at http://localhost:8000

Interactive API documentation: http://localhost:8000/docs

API Endpoints

1. Health Check

curl http://localhost:8000/health

Response:

{
  "status": "healthy",
  "ollama_available": true,
  "workers_running": true
}

2. Upload and Convert PDFs

curl -X POST "http://localhost:8000/convert/" \
  -H "Content-Type: multipart/form-data" \
  -F "files=@document1.pdf" \
  -F "files=@document2.pdf"

Response:

{
  "successful": [
    {
      "filename": "document1.pdf",
      "md_path": "converted_mds/document1/document1.md",
      "chunks_count": 42
    }
  ],
  "failed": [],
  "total_processed": 1,
  "total_successful": 1,
  "total_failed": 0
}

Features:

  • Supports multiple file upload (max 10 files per request)
  • Maximum file size: 50MB per file
  • Automatic OCR for scanned documents
  • Smart chunking and embedding
  • Duplicate detection and deduplication

3. Ask Questions

curl -X POST "http://localhost:8000/ask" \
  -H "Content-Type: application/json" \
  -d '{
    "question": "What is the main topic of the document?",
    "top_k": 5,
    "use_llm": true
  }'

Response:

{
  "question": "What is the main topic of the document?",
  "answer": "Based on the provided context, the main topic...",
  "top_k": 5,
  "results": [
    {
      "id": "document1__001",
      "text": "Relevant text chunk...",
      "metadata": {
        "source_md": "converted_mds/document1/document1.md",
        "chunk_index": 1
      },
      "distance": 0.85
    }
  ]
}

Query Parameters:

  • question (required): The question to ask
  • top_k (optional, default: 10): Number of chunks to retrieve
  • use_llm (optional, default: true): Generate LLM answer
  • where (optional): Metadata filters (ChromaDB format)
  • where_document (optional): Document content filters
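The same request can be issued from Python with only the standard library; the payload builder below mirrors the parameters above (a sketch assuming the server runs on localhost:8000):

```python
import json
import urllib.request
from typing import Optional

def build_ask_payload(question: str, top_k: int = 10, use_llm: bool = True,
                      where: Optional[dict] = None) -> dict:
    """Assemble the JSON body for POST /ask, omitting unset optional filters."""
    payload = {"question": question, "top_k": top_k, "use_llm": use_llm}
    if where is not None:
        payload["where"] = where
    return payload

def ask(question: str, base_url: str = "http://localhost:8000", **kwargs) -> dict:
    """POST the question to /ask and return the parsed JSON response."""
    req = urllib.request.Request(
        f"{base_url}/ask",
        data=json.dumps(build_ask_payload(question, **kwargs)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```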

CLI Tool

The CLI provides a convenient way to interact with the RAG system from the terminal.

Installation

# The CLI is installed with the package dependencies
python -m app.cli --help

Ask Questions

# Basic question
python -m app.cli ask "What is the main topic?"

# Retrieve more results
python -m app.cli ask "Explain the methodology" --top-k 10

# Show source chunks
python -m app.cli ask "What are the key findings?" --show-sources

# Disable LLM and show only retrieved chunks
python -m app.cli ask "Find information about X" --no-llm

CLI Options:

Option Description Default
--top-k Number of results to return 5
--no-llm Disable LLM answer generation False
--show-sources Show source chunks with answer False

Environment Variables for CLI:

export API_BASE_URL=http://localhost:8000
export ASK_TOP_K=5

📚 API Documentation

Request/Response Models

QueryRequest

{
  "question": str,           # Required: Question to ask
  "top_k": int = 10,        # Optional: Number of results
  "use_llm": bool = True,   # Optional: Generate LLM answer
  "where": dict = None,     # Optional: Metadata filter
  "where_document": dict = None  # Optional: Document filter
}

AskResponse

{
  "question": str,          # Echo of the question
  "answer": str | None,     # LLM-generated answer
  "top_k": int,            # Number of results returned
  "results": [             # Retrieved chunks
    {
      "id": str,           # Chunk ID
      "text": str,         # Chunk text
      "metadata": dict,    # Chunk metadata
      "distance": float    # Similarity score
    }
  ]
}

Metadata Filtering

Filter results by metadata fields:

# Find chunks from a specific document
{
  "question": "What is X?",
  "where": {
    "source_md": {"$eq": "converted_mds/document1/document1.md"}
  }
}

# Find recent chunks (if timestamp metadata exists)
{
  "question": "Recent updates?",
  "where": {
    "timestamp": {"$gte": "2024-01-01"}
  }
}
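These where clauses follow ChromaDB's operator syntax ($eq, $gte, and friends). A small helper can build single-field filters and catch operator typos early (a hypothetical convenience, not part of the repository):

```python
def where_filter(field: str, op: str, value) -> dict:
    """Build a one-field ChromaDB metadata filter, e.g. {"source_md": {"$eq": ...}}."""
    allowed = {"$eq", "$ne", "$gt", "$gte", "$lt", "$lte", "$in", "$nin"}
    if op not in allowed:
        raise ValueError(f"unsupported ChromaDB operator: {op}")
    return {field: {op: value}}
```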

🔧 Development

Project Structure

RAG_PIPELINE/
β”œβ”€β”€ app/
β”‚   β”œβ”€β”€ main.py                 # FastAPI application
β”‚   β”œβ”€β”€ cli.py                  # CLI client
β”‚   β”œβ”€β”€ routers/
β”‚   β”‚   └── upload.py          # Upload endpoints
β”‚   β”œβ”€β”€ embeddings/
β”‚   β”‚   β”œβ”€β”€ worker.py          # Async embedding workers
β”‚   β”‚   β”œβ”€β”€ cache.py           # Embedding cache
β”‚   β”‚   └── ollama_embeddings.py
β”‚   β”œβ”€β”€ vector_store/
β”‚   β”‚   └── chroma_client.py   # ChromaDB client
β”‚   β”œβ”€β”€ retrieval/
β”‚   β”‚   β”œβ”€β”€ hybrid_retriever.py # Hybrid retrieval
β”‚   β”‚   β”œβ”€β”€ bm25_retriever.py   # BM25 retrieval
β”‚   β”‚   └── reranker.py         # Cross-encoder reranker
β”‚   β”œβ”€β”€ llm/
β”‚   β”‚   └── ollama_llm.py      # LLM integration
β”‚   └── utils/
β”‚       β”œβ”€β”€ docling_converter.py # PDF conversion
β”‚       β”œβ”€β”€ file_registry.py    # File tracking
β”‚       └── chunker/
β”‚           β”œβ”€β”€ markdown_chunker.py
β”‚           β”œβ”€β”€ code_chunker.py
β”‚           β”œβ”€β”€ hybrid_fallback.py
β”‚           └── optimizer.py
β”œβ”€β”€ requirements.txt
β”œβ”€β”€ pyproject.toml
└── README.md

Running Tests

# Test Ollama API connection
python test_ollama_api.py

# Test dense retrieval
python test_dense_retrieval.py

# Test embedding with debug info
python test_embedding_debug.py

# Diagnose ChromaDB issues
python diagnose_chroma.py

Reset Caches

If you need to clear caches and rebuild:

# Clear all caches and databases
python reset_caches.py

# This will remove:
# - embeddings_cache.sqlite3
# - chroma_db/
# - bm25_index/
# - file_registry.db

Code Quality

# Format code (if using ruff or black)
ruff format .

# Lint code
ruff check .

πŸ› Troubleshooting

Common Issues

1. Ollama Connection Failed

Error: Ollama health check failed or Connection refused

Solution:

  • Ensure Ollama is running: ollama serve
  • Check the URL: OLLAMA_BASE_URL=http://localhost:11434
  • Verify models are pulled: ollama list

2. Embedding Dimension Mismatch

Error: Embedding dimension mismatch detected

Solution:

  • Clear ChromaDB and re-ingest: python reset_caches.py
  • Ensure consistent EMBED_MODEL across runs
  • Re-upload all documents after clearing

3. Out of Memory

Error: CUDA out of memory or system hangs

Solution:

  • Reduce EMBED_BATCH_SIZE (try 32 or 16)
  • Reduce EMBED_WORKERS (try 1 or 2)
  • Use CPU-only mode: onnxruntime instead of onnxruntime-gpu

4. Slow PDF Processing

Solution:

  • Reduce image quality in Docling settings
  • Use GPU acceleration for OCR
  • Process files in smaller batches

5. No Results from Search

Solution:

  • Check if documents were successfully ingested
  • Verify ChromaDB collection: python diagnose_chroma.py
  • Check BM25 index exists: ls bm25_index/
  • Rebuild indexes if needed

Logs

Enable detailed logging:

import logging
logging.basicConfig(level=logging.DEBUG)

📄 License

This project is available for use and modification. Please check the repository for specific license information.

🤝 Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

📧 Support

For issues and questions, please open an issue on GitHub or contact the maintainer.

πŸ™ Acknowledgments
