Bharat-RAG

Bharat-RAG is an India-centric, cloud-neutral, protocol-first retrieval engine for multimodal documents (text, PDFs, scanned images, videos, websites, etc.).

The project has two main goals:

Define the Bharat-RAG Protocol (BRP) for ingestion and retrieval (see the specification for details).
Provide a lightweight reference implementation that can run:
- on a single laptop (individual users)
- on-prem / government / enterprise clusters (India-scale)

🚧 Status: Pre-Alpha / Active Development
Core ingestion, retrieval, and RAG answering features are implemented.

Design Principles

Protocol-first: BRP is defined as a JSON/HTTP spec that anyone can implement.
Implementation-agnostic: No requirement on a specific DB, vector store, or cloud.
Lightweight & memory-efficient: Favour small local models, batching, and streaming.
Fault-tolerant: Queue-based ingestion, stateless services, retries & dead-letter queues.
Cloud-neutral: Can run on bare metal or any cloud; prefers open components.
India-centric: Designed for multilingual, scanned, and government/enterprise documents,
with future integration into India Stack services.
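
The fault-tolerance principle above (queue-based ingestion with retries and dead-letter queues) can be illustrated with a minimal in-memory sketch. This is not the reference implementation's worker: `drain`, `MAX_ATTEMPTS`, and the `attempts` / `last_error` job fields are hypothetical names chosen for illustration, and a real deployment would presumably use a durable queue rather than a `deque`.

```python
from collections import deque
from typing import Callable

MAX_ATTEMPTS = 3  # hypothetical retry budget, not a documented setting


def drain(queue: deque, handler: Callable[[dict], None], dead_letters: list) -> None:
    """Process jobs until the queue is empty.

    A job whose handler raises is requeued until it has failed
    MAX_ATTEMPTS times, after which it is moved to dead_letters.
    """
    while queue:
        job = queue.popleft()
        try:
            handler(job)
        except Exception as exc:
            job["attempts"] = job.get("attempts", 0) + 1
            job["last_error"] = str(exc)
            if job["attempts"] < MAX_ATTEMPTS:
                queue.append(job)  # retry later, behind the other jobs
            else:
                dead_letters.append(job)  # give up and keep it for inspection
```

Because the worker keeps no state outside the job itself, any stateless process could pick up a requeued job, which is the property the design principle is after.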

Current Features

✅ Implemented

Core Data Models: Collections, Documents, Chunks, Ingestion Jobs
Text Ingestion: Plain text, Markdown, DOCX files
PDF Ingestion: Page-by-page extraction with metadata
Image Ingestion: OCR using EasyOCR (English & Hindi support)
Video Ingestion: Audio extraction and transcription using Whisper
Website Ingestion: Article extraction from web pages
Retrieval API: Semantic search with vector similarity
RAG Answering: Context-aware answers with citations
Job Tracking: Async ingestion with progress monitoring
Observability: Request-scoped logging with context tracking
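
To make "semantic search with vector similarity" concrete, here is a small pure-Python sketch of ranking chunks by cosine similarity. It is an illustration only: the reference implementation stores vectors in PostgreSQL with pgvector, and this README does not say which distance metric it uses, so cosine similarity is an assumption. The names `cosine_similarity` and `top_k` are hypothetical.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec: list[float], chunks: dict[str, list[float]], k: int = 5):
    """Return the k (chunk_id, score) pairs most similar to query_vec."""
    scored = [(chunk_id, cosine_similarity(query_vec, vec)) for chunk_id, vec in chunks.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)  # highest score first
    return scored[:k]
```

In practice the query text and the chunks would first be turned into vectors by an embedding model; the toy vectors a caller passes here stand in for those embeddings.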

🚧 Planned

3D asset ingestion
Advanced chunking strategies
Multi-tenant support
Dashboard UI

Quick Start

Prerequisites

Python 3.12+
PostgreSQL with pgvector extension
ffmpeg (for video processing)
- macOS: brew install ffmpeg
- Ubuntu/Debian: sudo apt-get install ffmpeg
- Windows: Download from ffmpeg.org

Installation

Clone the repository

git clone https://github.com/your-org/bharat-rag.git
cd bharat-rag

Install dependencies

# Install uv (Python package manager)
pip install uv

# Install project dependencies
uv sync

Set up database

# Create PostgreSQL database with pgvector
createdb bharatrag
psql bharatrag -c "CREATE EXTENSION IF NOT EXISTS vector;"

# Set database URL (or use .env file)
export DATABASE_URL="postgresql+psycopg2://user:password@localhost:5432/bharatrag"

Run migrations
```
uv run alembic upgrade head
```

Start the server

uv run uvicorn bharatrag.main:app --reload

The API will be available at http://localhost:8000

API Documentation

Once the server is running, visit:

Swagger UI: http://localhost:8000/docs
ReDoc: http://localhost:8000/redoc

Supported Formats

| Format | Source Type | Features |
| --- | --- | --- |
| `txt`, `md`, `docx` | `file` | Full text extraction |
| `pdf` | `file` | Page-by-page extraction, metadata |
| `png`, `jpg`, `jpeg` | `file` | OCR (English & Hindi) |
| `mp4`, `avi`, `mov` | `file` | Audio transcription with timestamps |
| `html` | `url` | Article extraction, metadata |
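
The table above can be turned into a small client-side helper that picks `source_type` and `format` for a given URI before an ingestion job is submitted. This is a sketch under assumptions: `classify_source` and `FILE_FORMATS` are hypothetical names, and treating every `http(s)` URI as `html` is a simplification, not documented server behaviour.

```python
from pathlib import PurePosixPath

# File extensions listed in the Supported Formats table.
FILE_FORMATS = {"txt", "md", "docx", "pdf", "png", "jpg", "jpeg", "mp4", "avi", "mov"}


def classify_source(uri: str) -> tuple[str, str]:
    """Map a URI to a (source_type, format) pair, or raise ValueError."""
    if uri.startswith(("http://", "https://")):
        return ("url", "html")  # assumption: web pages are ingested as html
    ext = PurePosixPath(uri).suffix.lstrip(".").lower()
    if ext in FILE_FORMATS:
        return ("file", ext)
    raise ValueError(f"unsupported format for {uri!r}")
```

For example, `classify_source("file:///path/to/document.pdf")` yields the same `source_type` and `format` values used in the ingestion example under Example Usage.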

Example Usage

1. Create a Collection

curl -X POST "http://localhost:8000/collections" \
  -H "Content-Type: application/json" \
  -d '{"name": "my-documents"}'

2. Ingest a Document

curl -X POST "http://localhost:8000/ingestion-jobs" \
  -H "Content-Type: application/json" \
  -d '{
    "collection_id": "<collection-id>",
    "source_type": "file",
    "format": "pdf",
    "uri": "file:///path/to/document.pdf"
  }'
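
Since ingestion runs asynchronously, a client usually has to wait for the job to finish before querying. The loop below is a hedged sketch of that polling pattern. The status values in `TERMINAL_STATES`, the `status` field name, and the idea of a per-job status endpoint are all assumptions not confirmed by this README; check the Swagger UI for the actual job schema. The HTTP call is injected as `fetch_status` so the sketch does not invent an endpoint.

```python
import time
from typing import Callable

TERMINAL_STATES = {"completed", "failed"}  # hypothetical status names


def wait_for_job(
    fetch_status: Callable[[], dict],
    interval: float = 1.0,
    max_polls: int = 60,
) -> dict:
    """Call fetch_status() until the job reaches a terminal state.

    fetch_status is any callable that returns the job as a dict,
    for example a function that GETs the job resource from the API.
    """
    for _ in range(max_polls):
        job = fetch_status()
        if job.get("status") in TERMINAL_STATES:
            return job
        time.sleep(interval)  # back off before asking again
    raise TimeoutError(f"job still running after {max_polls} polls")
```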

3. Query for Relevant Chunks

curl -X POST "http://localhost:8000/query" \
  -H "Content-Type: application/json" \
  -d '{
    "collection_id": "<collection-id>",
    "query": "What is the main topic?",
    "top_k": 5
  }'

4. Get an Answer with Citations

curl -X POST "http://localhost:8000/answer" \
  -H "Content-Type: application/json" \
  -d '{
    "collection_id": "<collection-id>",
    "question": "What is the main topic?",
    "top_k": 5
  }'
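
The same calls can be made from Python using only the standard library. The sketch below mirrors the `/answer` request shown above; `build_post`, `send`, and `answer_request` are hypothetical helper names, and the response is returned as parsed JSON without assuming its shape, since the response schema is not described here.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # default address from Quick Start


def build_post(path: str, payload: dict) -> urllib.request.Request:
    """Build a JSON POST request without sending it."""
    return urllib.request.Request(
        BASE_URL + path,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request: urllib.request.Request):
    """Send the request and return the decoded JSON body."""
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def answer_request(collection_id: str, question: str, top_k: int = 5) -> urllib.request.Request:
    """Request body matching the curl example for POST /answer."""
    return build_post(
        "/answer",
        {"collection_id": collection_id, "question": question, "top_k": top_k},
    )
```

With the server running, `send(answer_request("<collection-id>", "What is the main topic?"))` performs the same call as the curl command; `build_post` works the same way for `/collections`, `/ingestion-jobs`, and `/query`.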

Development

Running Tests

# Run tests, skipping those that need a database
BHARATRAG_RUN_DB_TESTS=0 uv run pytest

# Run all tests (with database)
BHARATRAG_RUN_DB_TESTS=1 uv run pytest

# Run specific test file
uv run pytest tests/unit/test_pdf_ingestion.py -v

Code Quality

# Lint, then format
uv run ruff check src/
uv run ruff format src/

Docker

# Build image
docker build -t bharat-rag .

# Run container
docker run -p 8000:8000 bharat-rag

Repository Layout

bharat-rag/
  docs/                   # Documentation (PRD, etc.)
  specs/                  # BRP protocol specifications
  src/bharatrag/          # Reference implementation
    api/                  # FastAPI endpoints
    domain/               # Domain models
    services/             # Business logic
      ingestion_handlers/ # Format-specific handlers
    db/                   # Database models & migrations
  tests/                  # Automated tests
  infra/                  # Docker/K8s manifests
  .github/                # GitHub workflows and templates
  README.md
  CONTRIBUTING.md
  LICENSE

Contributing

We welcome contributions! Please see CONTRIBUTING.md for guidelines.
