Bharat-RAG is an India-centric, cloud-neutral, protocol-first retrieval engine for multimodal documents (text, PDFs, scanned images, videos, websites, etc.).
The project has two main goals:
- Define the Bharat-RAG Protocol (BRP) for ingestion and retrieval (see the specification for details).
- Provide a lightweight reference implementation that can run:
  - on a single laptop (individual users)
  - on-prem / government / enterprise clusters (India-scale)
Status: Pre-Alpha / Active Development
Core ingestion, retrieval, and RAG answering features are implemented.
Design principles:
- Protocol-first: BRP is defined as a JSON/HTTP spec that anyone can implement.
- Implementation-agnostic: No requirement on a specific DB, vector store, or cloud.
- Lightweight & memory-efficient: Favour small local models, batching, and streaming.
- Fault-tolerant: Queue-based ingestion, stateless services, retries & dead-letter queues.
- Cloud-neutral: Can run on bare metal or any cloud; prefers open components.
- India-centric: Designed for multilingual, scanned, and government/enterprise documents,
with future integration into India Stack services.
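
The fault-tolerance principle (queue-based ingestion with retries and dead-letter queues) can be illustrated with a small in-memory sketch. This is not the project's implementation: the `Job` class, the `drain` function, and the `max_attempts` parameter are hypothetical names chosen for illustration, and a real deployment would presumably use a durable queue rather than a `deque`.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Job:
    """Hypothetical ingestion job; the real BRP job model may differ."""
    job_id: str
    uri: str
    attempts: int = 0


def drain(queue, handler, max_attempts=3):
    """Process jobs until the queue is empty.

    A failed job is re-queued until it has been tried max_attempts times,
    after which it is moved to the dead-letter list instead.
    """
    done, dead_letter = [], []
    while queue:
        job = queue.popleft()
        job.attempts += 1
        try:
            handler(job)
        except Exception:
            if job.attempts >= max_attempts:
                dead_letter.append(job)   # give up, keep for inspection
            else:
                queue.append(job)         # retry later
        else:
            done.append(job)
    return done, dead_letter


def demo_handler(job):
    """Stand-in handler: 'bad' URIs always fail, 'flaky' ones fail once."""
    if "bad" in job.uri:
        raise ValueError("unreadable document")
    if "flaky" in job.uri and job.attempts < 2:
        raise TimeoutError("temporary failure")


jobs = deque([
    Job("1", "file:///ok.pdf"),
    Job("2", "file:///flaky.pdf"),
    Job("3", "file:///bad.pdf"),
])
done, dead_letter = drain(jobs, demo_handler)
print("done:", [j.job_id for j in done])
print("dead-letter:", [j.job_id for j in dead_letter])
```

Keeping the handler stateless, with the attempt count stored on the job itself, is what lets any worker pick up a retried job.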
Implemented features:
- Core Data Models: Collections, Documents, Chunks, Ingestion Jobs
- Text Ingestion: Plain text, Markdown, DOCX files
- PDF Ingestion: Page-by-page extraction with metadata
- Image Ingestion: OCR using EasyOCR (English & Hindi support)
- Video Ingestion: Audio extraction and transcription using Whisper
- Website Ingestion: Article extraction from web pages
- Retrieval API: Semantic search with vector similarity
- RAG Answering: Context-aware answers with citations
- Job Tracking: Async ingestion with progress monitoring
- Observability: Request-scoped logging with context tracking
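
To make "semantic search with vector similarity" concrete, here is a minimal pure-Python sketch of top-k ranking by cosine similarity. It only illustrates the idea: the function names and the toy three-dimensional vectors are assumptions, and in a PostgreSQL deployment the ranking would more likely be pushed into the database (pgvector provides a cosine-distance operator, `<=>`), though how the reference implementation does it is not stated here.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat a zero vector as matching nothing
    return dot / (norm_a * norm_b)


def top_k(query_vec, chunk_vectors, k=5):
    """Return the k (chunk_id, score) pairs most similar to query_vec."""
    scored = [
        (chunk_id, cosine_similarity(query_vec, vec))
        for chunk_id, vec in chunk_vectors.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy embeddings; real ones would come from an embedding model.
chunk_vectors = {
    "chunk-a": [1.0, 0.0, 0.0],
    "chunk-b": [0.9, 0.1, 0.0],
    "chunk-c": [0.0, 1.0, 0.0],
}
results = top_k([1.0, 0.0, 0.0], chunk_vectors, k=2)
for chunk_id, score in results:
    print(chunk_id, round(score, 3))
```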
Planned features:
- 3D asset ingestion
- Advanced chunking strategies
- Multi-tenant support
- Dashboard UI
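
As a point of reference for "advanced chunking strategies", the sketch below shows the simplest baseline such strategies usually improve on: fixed-size character windows with overlap. It is an assumption for illustration only; the function name, the parameters, and the choice of character-based windows are not taken from the project.

```python
def chunk_text(text, chunk_size=200, overlap=50):
    """Split text into windows of chunk_size characters.

    Consecutive windows share `overlap` characters so that a sentence cut
    at a boundary still appears whole in at least one chunk.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks


parts = chunk_text("abcdefghij", chunk_size=4, overlap=1)
print(parts)
```

Smarter strategies would typically split on sentence, heading, or page boundaries instead of raw character offsets.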
Prerequisites:
- Python 3.12+
- PostgreSQL with the `pgvector` extension
- ffmpeg (for video processing)
  - macOS: `brew install ffmpeg`
  - Ubuntu/Debian: `sudo apt-get install ffmpeg`
  - Windows: download from ffmpeg.org
Installation:

- Clone the repository

      git clone https://github.com/your-org/bharat-rag.git
      cd bharat-rag

- Install dependencies

      # Install uv (Python package manager)
      pip install uv
      # Install project dependencies
      uv sync

- Set up database

      # Create PostgreSQL database with pgvector
      createdb bharatrag
      psql bharatrag -c "CREATE EXTENSION vector;"
      # Set database URL (or use .env file)
      export DATABASE_URL="postgresql+psycopg2://user:password@localhost:5432/bharatrag"

- Run migrations

      uv run alembic upgrade head

- Start the server

      uv run uvicorn bharatrag.main:app --reload

The API will be available at http://localhost:8000
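
If the migrations or the server cannot reach the database, the shape of `DATABASE_URL` is a reasonable first thing to check. The sketch below uses only the standard library to pull the URL apart. `describe_database_url` is a hypothetical helper, not part of the project, and the fallback to port 5432 assumes the PostgreSQL default.

```python
from urllib.parse import urlsplit

# Example value from the setup step; in practice read it from the
# environment, e.g. os.environ["DATABASE_URL"].
EXAMPLE_URL = "postgresql+psycopg2://user:password@localhost:5432/bharatrag"


def describe_database_url(url):
    """Return the driver, host, port and database name found in url."""
    parts = urlsplit(url)
    if not parts.scheme.startswith("postgresql"):
        raise ValueError(f"expected a postgresql URL, got scheme {parts.scheme!r}")
    database = parts.path.lstrip("/")
    if not parts.hostname or not database:
        raise ValueError("URL must include a host and a database name")
    return {
        "driver": parts.scheme,
        "host": parts.hostname,
        "port": parts.port or 5432,  # PostgreSQL default port
        "database": database,
    }


info = describe_database_url(EXAMPLE_URL)
print(info)
```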
Once the server is running, visit:
- Swagger UI: http://localhost:8000/docs
- ReDoc: http://localhost:8000/redoc
| Format | Source Type | Features |
|---|---|---|
| txt, md, docx | file | Full text extraction |
| pdf | file | Page-by-page extraction, metadata |
| png, jpg, jpeg | file | OCR (English & Hindi) |
| mp4, avi, mov | file | Audio transcription with timestamps |
| html | url | Article extraction, metadata |
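
The format table maps naturally onto the body of an ingestion-job request. The sketch below builds such a body from a URI, using the field names from the ingestion-job request shown in this README (`collection_id`, `source_type`, `format`, `uri`). The helper `build_ingestion_job` is hypothetical, and two things are assumptions rather than documented behaviour: that the `format` value equals the lower-cased file extension, and that any `http`/`https` URI is submitted as `html` with source type `url`.

```python
import json
from pathlib import PurePosixPath
from urllib.parse import urlsplit

# File extensions listed in the supported-formats table (source type "file").
FILE_FORMATS = {"txt", "md", "docx", "pdf", "png", "jpg", "jpeg", "mp4", "avi", "mov"}


def build_ingestion_job(collection_id, uri):
    """Return a request body for POST /ingestion-jobs (illustrative only)."""
    parts = urlsplit(uri)
    if parts.scheme in ("http", "https"):
        # Assumption: web pages use format "html" and source_type "url".
        source_type, fmt = "url", "html"
    else:
        # Assumption: the format value is the lower-cased file extension.
        fmt = PurePosixPath(parts.path).suffix.lstrip(".").lower()
        if fmt not in FILE_FORMATS:
            raise ValueError(f"unsupported format: {fmt!r}")
        source_type = "file"
    return {
        "collection_id": collection_id,
        "source_type": source_type,
        "format": fmt,
        "uri": uri,
    }


payload = build_ingestion_job("<collection-id>", "file:///path/to/document.pdf")
print(json.dumps(payload, indent=2))
```

Rejecting unknown extensions on the client side gives an early, local error instead of a failed ingestion job.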
Create a collection:

    curl -X POST "http://localhost:8000/collections" \
      -H "Content-Type: application/json" \
      -d '{"name": "my-documents"}'

Ingest a document:

    curl -X POST "http://localhost:8000/ingestion-jobs" \
      -H "Content-Type: application/json" \
      -d '{
        "collection_id": "<collection-id>",
        "source_type": "file",
        "format": "pdf",
        "uri": "file:///path/to/document.pdf"
      }'

Search a collection:

    curl -X POST "http://localhost:8000/query" \
      -H "Content-Type: application/json" \
      -d '{
        "collection_id": "<collection-id>",
        "query": "What is the main topic?",
        "top_k": 5
      }'

Ask a question (RAG answer):

    curl -X POST "http://localhost:8000/answer" \
      -H "Content-Type: application/json" \
      -d '{
        "collection_id": "<collection-id>",
        "question": "What is the main topic?",
        "top_k": 5
      }'

Run the tests:

    # Run all tests (without database)
    BHARATRAG_RUN_DB_TESTS=0 uv run pytest
    # Run all tests (with database)
    BHARATRAG_RUN_DB_TESTS=1 uv run pytest
    # Run specific test file
    uv run pytest tests/unit/test_pdf_ingestion.py -v

Format and lint:

    uv run ruff check src/
    uv run ruff format src/

Build and run with Docker:

    # Build image
    docker build -t bharat-rag .
    # Run container
    docker run -p 8000:8000 bharat-rag

Project structure:

    bharat-rag/
      docs/                    # Documentation (PRD, etc.)
      specs/                   # BRP protocol specifications
      src/bharatrag/           # Reference implementation
        api/                   # FastAPI endpoints
        domain/                # Domain models
        services/              # Business logic
        ingestion_handlers/    # Format-specific handlers
        db/                    # Database models & migrations
      tests/                   # Automated tests
      infra/                   # Docker/K8s manifests
      .github/                 # GitHub workflows and templates
      README.md
      CONTRIBUTING.md
      LICENSE

We welcome contributions! Please see CONTRIBUTING.md for guidelines.