abhinavchat/bharat-rag

Bharat-RAG

Bharat-RAG is an India-centric, cloud-neutral, protocol-first retrieval engine for multimodal documents (text, PDFs, scanned images, videos, websites, etc.).

The project has two main goals:

  1. Define the Bharat-RAG Protocol (BRP) for ingestion and retrieval (see the specification for details).
  2. Provide a lightweight reference implementation that can run:
    • on a single laptop (individual users)
    • on-prem / government / enterprise clusters (India-scale)

🚧 Status: Pre-Alpha / Active Development
Core ingestion, retrieval, and RAG answering features are implemented.


Design Principles

  • Protocol-first: BRP is defined as a JSON/HTTP spec that anyone can implement.
  • Implementation-agnostic: No requirement on a specific DB, vector store, or cloud.
  • Lightweight & memory-efficient: Favour small local models, batching, and streaming.
  • Fault-tolerant: Queue-based ingestion, stateless services, retries & dead-letter queues.
  • Cloud-neutral: Can run on bare metal or any cloud; prefers open components.
  • India-centric: Designed for multilingual, scanned, and government/enterprise documents,
    with future integration into India Stack services.
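The fault-tolerance principle (queue-based ingestion with retries and a dead-letter queue) can be pictured with a minimal in-memory sketch. This is not the project's implementation: `process_with_retries`, `max_attempts`, and the `deque`-based queues are hypothetical stand-ins for whatever durable queue a deployment uses.

```python
from collections import deque
from typing import Callable


def process_with_retries(
    jobs: deque,
    handler: Callable[[dict], None],
    max_attempts: int = 3,
) -> deque:
    """Drain `jobs`, retrying failures; return the dead-letter queue.

    Hypothetical sketch: a real deployment would use a durable queue
    and back off between attempts.
    """
    dead_letters: deque = deque()
    while jobs:
        job = jobs.popleft()
        attempts = job.get("attempts", 0) + 1
        try:
            handler(job)
        except Exception as exc:  # any handler failure counts as one attempt
            failed = {**job, "attempts": attempts, "last_error": str(exc)}
            if attempts < max_attempts:
                jobs.append(failed)          # re-queue for another try
            else:
                dead_letters.append(failed)  # give up; keep for inspection
    return dead_letters
```

Because the worker keeps no state between jobs, any number of such workers could in principle drain the same queue; failed jobs end up in one place where they can be inspected and replayed.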

Current Features

✅ Implemented

  • Core Data Models: Collections, Documents, Chunks, Ingestion Jobs
  • Text Ingestion: Plain text, Markdown, DOCX files
  • PDF Ingestion: Page-by-page extraction with metadata
  • Image Ingestion: OCR using EasyOCR (English & Hindi support)
  • Video Ingestion: Audio extraction and transcription using Whisper
  • Website Ingestion: Article extraction from web pages
  • Retrieval API: Semantic search with vector similarity
  • RAG Answering: Context-aware answers with citations
  • Job Tracking: Async ingestion with progress monitoring
  • Observability: Request-scoped logging with context tracking
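As an illustration of "semantic search with vector similarity", the sketch below ranks chunk embeddings by cosine similarity to a query embedding. It is a toy, pure-Python version under stated assumptions: `cosine_similarity`, `top_k_chunks`, and the in-memory dictionary of embeddings are hypothetical, and since pgvector is listed as a prerequisite, the reference implementation presumably performs the ranking in PostgreSQL instead.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # define similarity with a zero vector as 0
    return dot / (norm_a * norm_b)


def top_k_chunks(query_vec, chunks, top_k=5):
    """Return the `top_k` (chunk_id, score) pairs, best match first.

    `chunks` maps a chunk id to its embedding (toy data in this sketch).
    """
    scored = [(cid, cosine_similarity(query_vec, vec)) for cid, vec in chunks.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]
```

The `top_k` parameter here plays the same role as the `top_k` field in the query examples later in this README.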

🚧 Planned

  • 3D asset ingestion
  • Advanced chunking strategies
  • Multi-tenant support
  • Dashboard UI

Quick Start

Prerequisites

  • Python 3.12+
  • PostgreSQL with pgvector extension
  • ffmpeg (for video processing)
    • macOS: brew install ffmpeg
    • Ubuntu/Debian: sudo apt-get install ffmpeg
    • Windows: Download from ffmpeg.org

Installation

  1. Clone the repository

    git clone https://github.com/your-org/bharat-rag.git
    cd bharat-rag
  2. Install dependencies

    # Install uv (Python package manager)
    pip install uv
    
    # Install project dependencies
    uv sync
  3. Set up database

    # Create PostgreSQL database with pgvector
    createdb bharatrag
    psql bharatrag -c "CREATE EXTENSION vector;"
    
    # Set database URL (or use .env file)
    export DATABASE_URL="postgresql+psycopg2://user:password@localhost:5432/bharatrag"
  4. Run migrations

    uv run alembic upgrade head
  5. Start the server

    uv run uvicorn bharatrag.main:app --reload

The API will be available at http://localhost:8000

API Documentation

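Once the server is running, visit the interactive API documentation. FastAPI serves Swagger UI at http://localhost:8000/docs and ReDoc at http://localhost:8000/redoc by default, unless the application overrides these paths.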


Supported Formats

| Format        | Source Type | Features                            |
|---------------|-------------|-------------------------------------|
| txt, md, docx | file        | Full text extraction                |
| pdf           | file        | Page-by-page extraction, metadata   |
| png, jpg, jpeg| file        | OCR (English & Hindi)               |
| mp4, avi, mov | file        | Audio transcription with timestamps |
| html          | url         | Article extraction, metadata        |
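The table above can be turned into a small client-side helper that picks the `source_type` and `format` values for an ingestion request. This is a sketch, not part of the project: `FILE_FORMATS` and `classify_source` are hypothetical names, and it assumes the API accepts each file extension verbatim as the `format` string (only `pdf` is confirmed by the examples in this README).

```python
from pathlib import Path
from urllib.parse import urlparse

# File formats listed in the Supported Formats table.
FILE_FORMATS = {"txt", "md", "docx", "pdf", "png", "jpg", "jpeg", "mp4", "avi", "mov"}


def classify_source(source: str) -> tuple[str, str]:
    """Return (source_type, format) for a local path or an http(s) URL."""
    if urlparse(source).scheme in ("http", "https"):
        return ("url", "html")  # web pages are ingested as html
    ext = Path(source).suffix.lower().lstrip(".")
    if ext not in FILE_FORMATS:
        raise ValueError(f"unsupported format: {ext or '(none)'}")
    return ("file", ext)
```

Rejecting unknown extensions on the client avoids submitting a job that no ingestion handler could process.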

Example Usage

1. Create a Collection

curl -X POST "http://localhost:8000/collections" \
  -H "Content-Type: application/json" \
  -d '{"name": "my-documents"}'

2. Ingest a Document

curl -X POST "http://localhost:8000/ingestion-jobs" \
  -H "Content-Type: application/json" \
  -d '{
    "collection_id": "<collection-id>",
    "source_type": "file",
    "format": "pdf",
    "uri": "file:///path/to/document.pdf"
  }'
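Ingestion runs asynchronously, so a client typically polls the job until it finishes. The sketch below shows one way to do that; it makes several assumptions that this README does not confirm: the job record has a `status` field, `completed` and `failed` are its terminal values, and `fetch_job` is a caller-supplied function that retrieves the current job record (the status route is not shown here, so it is left to the caller).

```python
import time
from typing import Callable


def wait_for_job(
    fetch_job: Callable[[], dict],
    poll_seconds: float = 2.0,
    timeout_seconds: float = 300.0,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Poll `fetch_job` until the job reaches a terminal state.

    Assumptions (not confirmed by this README): the job record has a
    "status" field, and "completed" / "failed" are its terminal values.
    """
    deadline = time.monotonic() + timeout_seconds
    while True:
        job = fetch_job()
        if job.get("status") in ("completed", "failed"):
            return job
        if time.monotonic() >= deadline:
            raise TimeoutError("ingestion job did not finish in time")
        sleep(poll_seconds)
```

Passing `sleep` as a parameter keeps the loop easy to test without real delays.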

3. Query for Relevant Chunks

curl -X POST "http://localhost:8000/query" \
  -H "Content-Type: application/json" \
  -d '{
    "collection_id": "<collection-id>",
    "query": "What is the main topic?",
    "top_k": 5
  }'

4. Get an Answer with Citations

curl -X POST "http://localhost:8000/answer" \
  -H "Content-Type: application/json" \
  -d '{
    "collection_id": "<collection-id>",
    "question": "What is the main topic?",
    "top_k": 5
  }'
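The same `/query` and `/answer` calls can be made from Python using only the standard library. This is a minimal sketch: `build_post` and `send` are hypothetical helper names, and because the response schema is not described here, `send` simply returns the decoded JSON.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # default address from the Quick Start


def build_post(path: str, payload: dict) -> urllib.request.Request:
    """Build (but do not send) a JSON POST request for the given path."""
    return urllib.request.Request(
        url=f"{BASE_URL}{path}",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request: urllib.request.Request) -> dict:
    """Send the request and decode the JSON body (needs a running server)."""
    with urllib.request.urlopen(request) as response:
        return json.load(response)


query_request = build_post(
    "/query",
    {"collection_id": "<collection-id>", "query": "What is the main topic?", "top_k": 5},
)
answer_request = build_post(
    "/answer",
    {"collection_id": "<collection-id>", "question": "What is the main topic?", "top_k": 5},
)
```

Note that the two endpoints name the text field differently: `/query` takes `query`, while `/answer` takes `question`. With the server running, `send(query_request)` performs the call.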

Development

Running Tests

# Run all tests (without database)
BHARATRAG_RUN_DB_TESTS=0 uv run pytest

# Run all tests (with database)
BHARATRAG_RUN_DB_TESTS=1 uv run pytest

# Run specific test file
uv run pytest tests/unit/test_pdf_ingestion.py -v

Code Quality

# Lint, then format
uv run ruff check src/
uv run ruff format src/

Docker

# Build image
docker build -t bharat-rag .

# Run container
docker run -p 8000:8000 bharat-rag

Repository Layout

bharat-rag/
  docs/                   # Documentation (PRD, etc.)
  specs/                  # BRP protocol specifications
  src/bharatrag/          # Reference implementation
    api/                  # FastAPI endpoints
    domain/               # Domain models
    services/             # Business logic
      ingestion_handlers/ # Format-specific handlers
    db/                   # Database models & migrations
  tests/                  # Automated tests
  infra/                  # Docker/K8s manifests
  .github/                # GitHub workflows and templates
  README.md
  CONTRIBUTING.md
  LICENSE

Contributing

We welcome contributions! Please see CONTRIBUTING.md for guidelines.
