A RAG (Retrieval Augmented Generation) application built with LangChain, ChromaDB, and Streamlit. Upload PDF documents and ask natural language questions to retrieve contextually relevant answers powered by OpenAI.
When you ask a question, the system follows this pipeline:
| Step | Component | Description |
|---|---|---|
| 1. Chunk | Document Processor | PDF split into searchable text pieces |
| 2. Embed | OpenAI | Creates vector embeddings (text-embedding-3-large) |
| 3. Hybrid Search | BM25 + Semantic | Combines keyword and meaning-based search |
| 4. Rerank | Jina (local) | Cross-encoder refines result relevance |
| 5. Answer | GPT-4o-mini | Generates response based only on your documents |
Note: The LLM only answers based on your uploaded documents, not general knowledge. Search on the home page includes all documents and collections.
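The table above can be pictured as a chain of five functions. The sketch below uses stand-in stubs for every component — it illustrates the order of operations, not the app's actual classes:

```python
# Illustrative stubs for the 5-step pipeline above — not the app's real classes.

def chunk(text, size=1000, overlap=200):
    """Step 1: split the document into overlapping pieces."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, len(text), step)]

def embed(chunks):
    """Step 2: stand-in for OpenAI text-embedding-3-large (3072-dim vectors)."""
    return [[float(len(c))] for c in chunks]

def hybrid_search(query, chunks, k=3):
    """Step 3: stand-in for BM25 + semantic search with RRF fusion."""
    return sorted(chunks, key=lambda c: -sum(w in c for w in query.split()))[:k]

def rerank(query, candidates):
    """Step 4: a cross-encoder would reorder candidates here."""
    return candidates

def answer(query, context):
    """Step 5: stand-in for GPT-4o-mini, restricted to the retrieved context."""
    return "Answer to {!r} based on {} chunks".format(query, len(context))

doc = "RAG combines retrieval with generation. " * 50
ctx = rerank("What is RAG?", hybrid_search("What is RAG?", chunk(doc)))
print(answer("What is RAG?", ctx))
```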
- 📄 PDF Document Processing: Upload and parse PDF files with automatic text extraction
- 🔗 Semantic Chunking: Intelligent text splitting with configurable chunk sizes and overlap
- 🎯 Vector Embeddings: High-quality embeddings using OpenAI text-embedding-3-large
- 💾 Persistent Vector Store: ChromaDB for efficient similarity search with disk persistence
- 💬 Natural Language Queries: Ask questions in plain language about your documents
- ✨ Contextual Answers: GPT-4o-mini generates answers based solely on document content
- 🔄 Real-time Streaming: Token-by-token answer generation for responsive UX
- 🔃 Document Persistence: Uploaded documents persist across app restarts
- 🧭 Sidebar Navigation: Quick access to all pages (Search, How It Works, Collections)
- 📄 Documents Panel: View document statistics (chunks, size, overlap) and list all uploaded documents on the home page
- 🎨 Consistent Branding: Unified styling and navigation across all pages
- 📱 Responsive Layout: Works well on different screen sizes
- 📁 Document Collections: Organize documents into searchable collections
- 🎯 Scoped Search: Search within specific collections or across all documents
- 📊 Collection Stats: Track document counts and chunk counts per collection
- ⚙️ Per-Collection Settings: Configure chunk size and overlap per collection
- 🗑️ Cascade Deletion: Deleting a collection removes all associated documents and chunks
- 🔄 Duplicate Detection: Prevents uploading the same document twice to a collection
- 📝 Search History: Per-collection search results persist during session navigation
- 🔀 Hybrid Search: Combine semantic (vector) and BM25 (keyword) search with RRF fusion
- 🎛️ Retrieval Presets: Pre-configured profiles accessible from the sidebar dropdown:
  - 🎯 High Precision: Fewer, highly relevant results (k=3, alpha=0.7, reranking on)
  - ⚖️ Balanced: Good mix of precision and recall (k=5, alpha=0.5, reranking on)
  - 🔍 High Recall: More results, broader coverage (k=10, alpha=0.3, reranking off)
- 🏆 Re-ranking: Cross-encoder re-ranking with configurable provider:
  - Jina (local): No API key needed, no network latency - requires `sentence-transformers`
  - Cohere (cloud): Fast API, excellent quality - requires `COHERE_API_KEY`
  - Auto mode (default): Tries Jina (local) first, falls back to Cohere (cloud)
  - Override: Set `provider: "cohere"` in `config.yaml` to force the cloud API
- 📊 Score Visibility: View semantic, BM25, and rerank scores for each result
- 💬 Conversation History: Context-aware follow-up questions within sessions
- 🧪 A/B Testing Framework: Compare all 4 retrieval methods (semantic, BM25, hybrid, hybrid+rerank) with a single click
- 📋 Duplicate Detection: Automatic check before uploading already-indexed PDFs
- 📊 Context Transparency: View exactly which document chunks were used for answers
- 🛡️ Robust Error Handling: Automatic retry logic with exponential backoff for API resilience
- 📝 Structured Logging: Comprehensive logging for debugging and monitoring
- ⚙️ YAML Configuration: Easily customizable settings without code changes
- 🎨 Modular Architecture: Clean separation of concerns for maintainability
- 🧪 Comprehensive Test Suite: 290+ unit and integration tests
Upload PDFs and view your document library with statistics:
Ask questions and get AI-generated answers with source context:
View retrieved chunks with semantic and rerank scores:
Organize documents into searchable collections:
View and manage documents within a collection:
Search within a specific collection for focused results:
The app includes a comprehensive "How It Works" page with interactive guides:
| Guide | Description | Screenshot |
|---|---|---|
| Retrieval Methods | Semantic, BM25, and hybrid search explained | View |
| Precision vs Recall | Trade-offs and preset configurations | View |
| Collections | How to organize documents effectively | View |
| A/B Testing | Compare retrieval methods empirically | View |
| Configuration | All settings explained with examples | View |
| Conversation History | Context-aware follow-up questions | View |
```
semantic-search/
├── app.py                       # Streamlit UI application
├── config.yaml                  # Centralized configuration
├── config_loader.py             # Configuration management
├── core/                        # Core business logic
│   ├── __init__.py
│   ├── document_processor.py    # PDF loading and chunking
│   ├── vector_store.py          # ChromaDB management
│   ├── qa_chain.py              # Question answering pipeline
│   ├── hybrid_retriever.py      # Hybrid search (BM25 + Semantic)
│   ├── bm25_retriever.py        # BM25 keyword search
│   ├── reranker.py              # Cohere/Jina re-ranking
│   ├── conversation.py          # Conversation history management
│   ├── ab_testing.py            # A/B testing framework
│   ├── collection_manager.py    # Collection CRUD operations
│   ├── document_manager.py      # Document CRUD operations
│   ├── search_manager.py        # Unified search interface
│   ├── storage.py               # JSON file persistence
│   └── models/                  # Data models
│       ├── collection.py        # Collection model
│       ├── document.py          # Document model
│       ├── search.py            # Search request/response models
│       ├── responses.py         # API response models
│       └── errors.py            # Custom exceptions
├── pages/                       # Streamlit multi-page app
│   ├── 1_How_It_Works.py        # Interactive documentation
│   └── 2_Collections.py         # Collection management UI
├── ui/                          # Shared UI components
│   ├── __init__.py
│   ├── shared_components.py     # Branding, navigation, CSS
│   └── sidebar_components.py    # Retrieval settings, config display
├── utils/                       # Utility functions
│   ├── __init__.py
│   └── retry_utils.py           # API retry decorators
├── tests/                       # Test suite
│   ├── conftest.py              # pytest fixtures
│   └── test_*.py                # Unit and integration tests
├── data/                        # Data storage
│   ├── collections.json         # Collection metadata
│   └── documents.json           # Document metadata
├── screenshots/                 # Application screenshots
├── pytest.ini                   # pytest configuration
├── requirements.txt             # Python dependencies
├── HOW_IT_WORKS.md              # Detailed technical documentation
├── .env.example                 # Environment variables template
├── .gitignore                   # Git ignore rules
└── README.md                    # This file
```
```
┌──────────────┐
│  PDF Upload  │
└──────┬───────┘
       │
       ▼
┌─────────────────┐
│    Document     │
│    Processor    │  (PDF → Pages → Chunks)
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│  Vector Store   │
│    Manager      │  (Chunks → Embeddings → ChromaDB)
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│  User Question  │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│     Hybrid      │
│    Retriever    │  (Semantic + BM25 → RRF Fusion)
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│    Re-ranker    │
│   (Optional)    │  (Cross-encoder scoring)
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│    QA Chain     │  (Context → GPT-4o-mini → Answer)
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│ Streamed Answer │
└─────────────────┘
```
- Python 3.8 or higher
- OpenAI API key (Get one here)
- 2GB+ free disk space (for ChromaDB vector store)
- `sentence-transformers` - For the Jina re-ranker (local, preferred): `pip install sentence-transformers`
- Cohere API key (Get one here) - For the Cohere re-ranker (cloud fallback)
- The system prefers Jina (local) first, then falls back to Cohere (cloud) if Jina is unavailable
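The Jina-first, Cohere-fallback behavior boils down to an import check plus an API-key check. A minimal sketch of that selection logic (illustrative only — the app's actual implementation lives in `core/reranker.py`):

```python
import os
from typing import Optional

def select_reranker() -> Optional[str]:
    """Sketch of the 'auto' provider mode: prefer the local Jina model,
    fall back to Cohere's cloud API. Not the app's actual code."""
    try:
        import sentence_transformers  # noqa: F401 — local model stack installed?
        return "jina"
    except ImportError:
        pass
    if os.getenv("COHERE_API_KEY"):
        return "cohere"
    return None  # no reranker available; hybrid results are used as-is

print(select_reranker())
```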
```bash
git clone https://github.com/shrimpy8/semantic-serach.git
cd semantic-serach
```

```bash
python -m venv venv
source venv/bin/activate   # On Windows: venv\Scripts\activate
pip install -r requirements.txt
```

```bash
# Copy the example environment file
cp .env.example .env
```

Edit `.env` and add your API keys:

```
OPENAI_API_KEY=your_openai_api_key_here
COHERE_API_KEY=your_cohere_api_key_here   # Optional - for Cohere reranker
```

⚠️ Never commit your `.env` file to version control. It's already in `.gitignore`.

```bash
streamlit run app.py
```

The application will open in your default browser at http://localhost:8501.
The app has three main pages accessible via sidebar navigation:
| Page | Purpose |
|---|---|
| 🔍 Search | Upload documents and search (home page) |
| 📚 How It Works | Interactive documentation and configuration guide |
| 📁 Collections | Organize documents into searchable collections |
- Upload PDF: Click "Select a PDF file" and choose your document
- Wait for Processing: The app extracts text, creates chunks, and generates embeddings
- View Documents Panel: Expand to see document list and statistics (chunks, size, overlap)
- Select Retrieval Preset: Choose High Precision, Balanced, or High Recall from sidebar
- Ask Questions: Type your question in the chat input at the bottom
- View Scores: Each result shows relevance scores (semantic, BM25, rerank)
- View Context: Expand source sections to see the chunks used for the answer
- Navigate to Collections: Click "📁 Collections" in the sidebar navigation
- Create a Collection: Enter a name and optional description, then click "Create Collection"
- Upload Documents: Select a collection, go to the "Upload" tab, and add PDF files
- View Documents: See all documents in a collection with their status and chunk counts
- Search in Collection: Use the "Search" tab to search within a specific collection
- Delete: Remove individual documents or entire collections (cascade deletes all associated data)
| Preset | Results (k) | Alpha | Reranking | Best For |
|---|---|---|---|---|
| 🎯 High Precision | 3 | 0.7 | On | Specific questions, exact answers |
| ⚖️ Balanced | 5 | 0.5 | On | General queries (default) |
| 🔍 High Recall | 10 | 0.3 | Off | Comprehensive research |
- Upload a document on the Search page
- Expand the A/B Testing panel (below Documents panel)
- Enter a test query representative of real usage
- Click "Run Comparison" - automatically tests all 4 methods:
- Semantic (pure vector search)
- BM25 (pure keyword search)
- Hybrid (combined, no reranking)
- Hybrid + Rerank (combined with reranking)
- View results: Average score, latency, and recommended variant
- Export to CSV for deeper analysis
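Conceptually, the comparison runs the same query through each of the four methods, then reports the average relevance score and latency per method. A toy harness (the scoring lambdas are stand-ins, not the app's retrievers):

```python
import time
from statistics import mean

# Stand-in scoring functions — the real app runs actual retrievers.
METHODS = {
    "semantic":      lambda q: [0.82, 0.74, 0.71],
    "bm25":          lambda q: [0.65, 0.60, 0.52],
    "hybrid":        lambda q: [0.85, 0.79, 0.70],
    "hybrid+rerank": lambda q: [0.91, 0.84, 0.77],
}

def run_comparison(query):
    """Score each method on one query; recommend the highest average."""
    results = {}
    for name, retrieve in METHODS.items():
        start = time.perf_counter()
        scores = retrieve(query)
        results[name] = {
            "avg_score": round(mean(scores), 3),
            "latency_ms": (time.perf_counter() - start) * 1000,
        }
    results["recommended"] = max(results, key=lambda n: results[n]["avg_score"])
    return results

report = run_comparison("test query")
print(report["recommended"])  # → hybrid+rerank with these stand-in scores
```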
All settings are managed in config.yaml. Customize without changing code:
```yaml
models:
  embedding:
    name: "text-embedding-3-large"   # OpenAI embedding model
  chat:
    name: "gpt-4o-mini"              # Chat model
    temperature: 0.0                 # 0.0 = deterministic

retrieval_presets:
  high_precision:
    search_k: 3
    alpha: 0.7
    reranking: true
    description: "Fewer, highly relevant results"
  balanced:
    search_k: 5
    alpha: 0.5
    reranking: true
    description: "Good balance of precision and recall"
  high_recall:
    search_k: 10
    alpha: 0.3
    reranking: false
    description: "More results, broader coverage"

hybrid_retrieval:
  enabled: true
  default_method: "hybrid"
  alpha: 0.5      # 0 = BM25 only, 1 = Semantic only
  rrf_k: 60       # RRF fusion constant
  bm25:
    k1: 1.5       # Term frequency saturation
    b: 0.75       # Length normalization

reranking:
  enabled: true
  # Provider options:
  #   - "auto":   jina (local) first, then cohere (cloud)
  #   - "jina":   force local model
  #   - "cohere": force cloud API
  provider: "auto"
  fetch_k_multiplier: 3   # Fetch 3x candidates for reranking

document_processing:
  chunk_size: 1000        # Characters per chunk
  chunk_overlap: 200      # Overlap between chunks
  add_start_index: true   # Add metadata for chunk position
```

| Module | Purpose |
|---|---|
| `core/document_processor.py` | PDF loading, text extraction, chunking |
| `core/vector_store.py` | ChromaDB management, similarity search |
| `core/hybrid_retriever.py` | BM25 + Semantic fusion with RRF |
| `core/reranker.py` | Cohere/Jina cross-encoder re-ranking |
| `core/qa_chain.py` | RAG pipeline, answer generation |
| `core/collection_manager.py` | Collection CRUD operations |
| `core/document_manager.py` | Document CRUD operations |
| `core/search_manager.py` | Unified search interface |
| `ui/shared_components.py` | Branding, navigation, shared CSS |
| `ui/sidebar_components.py` | Retrieval settings, configuration display |
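The `k1` and `b` values in the `bm25` section are the standard Okapi BM25 parameters. A textbook sketch of the single-term score they control (not the app's retriever code):

```python
import math

def bm25_term_score(tf, doc_len, avg_doc_len, n_docs, docs_with_term,
                    k1=1.5, b=0.75):
    """Okapi BM25 contribution of one query term to one document's score.

    k1 caps how much repeated occurrences keep adding (term-frequency
    saturation); b scales the penalty for documents longer than average.
    """
    idf = math.log((n_docs - docs_with_term + 0.5) / (docs_with_term + 0.5) + 1)
    norm = k1 * (1 - b + b * doc_len / avg_doc_len)
    return idf * tf * (k1 + 1) / (tf + norm)

# Saturation in action: 10x the term frequency is far less than 10x the score.
low  = bm25_term_score(tf=1,  doc_len=100, avg_doc_len=100, n_docs=1000, docs_with_term=10)
high = bm25_term_score(tf=10, doc_len=100, avg_doc_len=100, n_docs=1000, docs_with_term=10)
print(low, high)
```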
The project includes a comprehensive test suite with 290+ tests:
```bash
# Run all tests
pytest tests/ -v

# Run specific test categories
pytest tests/ -v -m unit          # Unit tests only
pytest tests/ -v -m integration   # Integration tests

# Test coverage
pytest tests/ --cov=. --cov-report=html
```

Test Files:
- `tests/test_models.py` - Data models (Collection, Document, Search)
- `tests/test_collection_manager.py` - Collection CRUD operations
- `tests/test_document_manager.py` - Document CRUD operations
- `tests/test_search_manager.py` - Search functionality
- `tests/test_vector_store.py` - ChromaDB operations
- `tests/test_hybrid_retriever.py` - Hybrid search logic
- `tests/test_ab_testing.py` - A/B testing framework
- `tests/test_conversation.py` - Conversation history
- `tests/test_config_loader.py` - Configuration loading
- `tests/test_integration.py` - Real ChromaDB integration tests (document lifecycle, orphan cleanup)
- Follow PEP 8 guidelines
- Comprehensive docstrings for all classes and functions
- Type hints for function parameters and returns
- Structured logging with `logging.getLogger(__name__)`
Application logs are written to:
- Console: Real-time output during execution
- File: `semantic_search.log` (persistent logs)
Key log messages:
```
🔍 Auto-selecting reranker...
✅ RERANKER SELECTED: Jina (local)
# or if Jina unavailable:
✅ RERANKER SELECTED: Cohere (cloud)
```
Adjust the logging level in `config.yaml`:

```yaml
logging:
  level: "INFO"   # DEBUG, INFO, WARNING, ERROR, CRITICAL
```

- Never commit `.env` files to version control
- Rotate keys immediately if accidentally exposed
- Use environment variables for all sensitive data
- Review `SECURITY_NOTICE.md` for detailed security guidelines
Add to `.git/hooks/pre-commit`:

```bash
#!/bin/bash
if git diff --cached --name-only | grep -E '\.env$'; then
    echo "❌ ERROR: Attempting to commit .env file!"
    exit 1
fi
exit 0
```

Make it executable:

```bash
chmod +x .git/hooks/pre-commit
```

Error: "Configuration file not found"
- Ensure `config.yaml` exists in the project directory
- Check file permissions
Error: "OpenAI API key not found"
- Verify the `.env` file exists with a valid `OPENAI_API_KEY`
- Ensure `python-dotenv` is installed
Error: "Rate limit exceeded"
- Wait for rate limit reset
- Retry logic should handle this automatically
- Check API quota/billing
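The automatic retry mentioned above typically follows an exponential-backoff pattern. A minimal sketch of such a decorator (the app's decorators live in `utils/retry_utils.py`; this is an illustration, not that code):

```python
import functools
import time

def retry_with_backoff(max_attempts=3, base_delay=1.0):
    """Retry a flaky call, doubling the delay after each failure."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            for attempt in range(max_attempts):
                try:
                    return fn(*args, **kwargs)
                except Exception:
                    if attempt == max_attempts - 1:
                        raise  # out of attempts — surface the error
                    time.sleep(base_delay * 2 ** attempt)
        return wrapper
    return decorator

calls = {"n": 0}

@retry_with_backoff(max_attempts=3, base_delay=0.01)
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("rate limit")
    return "ok"

print(flaky())  # → ok (after two retried failures)
```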
No reranker available
- Install `sentence-transformers` for Jina (preferred): `pip install sentence-transformers`
- Or add `COHERE_API_KEY` to `.env` for Cohere (cloud fallback)
Documents lost after restart
- This is now fixed - documents persist automatically
- If issue persists, check ChromaDB connection
ChromaDB Permission Error
- Ensure write permissions for the `./chroma/db` directory
- Try deleting the directory and restarting
When you upload a PDF:
- File is temporarily saved to disk
- PyPDFLoader extracts text from all pages
- RecursiveCharacterTextSplitter creates overlapping chunks
- Temporary file is cleaned up
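The size/overlap mechanics of chunking reduce to a sliding window. A simplified character-level sketch (the app uses LangChain's `RecursiveCharacterTextSplitter`, which additionally prefers paragraph and sentence boundaries):

```python
def split_text(text, chunk_size=1000, chunk_overlap=200):
    """Naive character-window splitter showing the chunk_size/chunk_overlap
    mechanics from config.yaml — not the real recursive splitter."""
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append({
            "text": text[start:start + chunk_size],
            "start_index": start,   # what add_start_index records as metadata
        })
        if start + chunk_size >= len(text):
            break
    return chunks

chunks = split_text("x" * 2500)
print(len(chunks), [c["start_index"] for c in chunks])  # → 3 [0, 800, 1600]
```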
For each chunk:
- OpenAI text-embedding-3-large generates 3072-dimension vector
- Vector + metadata stored in ChromaDB
- Collection persisted to disk for future sessions
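Similarity search over the stored vectors reduces to cosine similarity against the query embedding. A toy sketch using 3-dimensional stand-ins for the 3072-dimension embeddings (names and values are illustrative):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors — 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy 3-d vectors standing in for 3072-d text-embedding-3-large output.
index = {
    "chunk-about-cats": [1.0, 0.0, 0.0],
    "chunk-about-dogs": [0.6, 0.6, 0.1],
    "chunk-about-tax":  [0.0, 0.1, 0.9],
}
query_vec = [0.95, 0.05, 0.0]  # pretend embedding of "tell me about cats"

best = max(index, key=lambda k: cosine_similarity(query_vec, index[k]))
print(best)  # → chunk-about-cats
```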
When you search:
- Query processed through both semantic and BM25 retrievers
- Results combined using Reciprocal Rank Fusion (RRF)
- Alpha parameter controls the balance (configurable via presets)
- Optional: Cross-encoder re-ranking for improved quality
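Reciprocal Rank Fusion scores each document by summing reciprocal ranks across the two result lists, using the `rrf_k: 60` constant from `config.yaml`. A minimal sketch (the alpha weighting shown here is one plausible way to blend the lists, not necessarily the app's exact formula):

```python
def rrf_fuse(semantic_ranking, bm25_ranking, alpha=0.5, k=60):
    """Fuse two ranked doc-id lists with alpha-weighted Reciprocal Rank Fusion.

    alpha=1.0 keeps only the semantic ranking; alpha=0.0 only BM25.
    """
    scores = {}
    for rank, doc in enumerate(semantic_ranking):
        scores[doc] = scores.get(doc, 0.0) + alpha / (k + rank + 1)
    for rank, doc in enumerate(bm25_ranking):
        scores[doc] = scores.get(doc, 0.0) + (1 - alpha) / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

# d1 and d3 appear in both lists, so they outrank single-list hits.
fused = rrf_fuse(["d1", "d2", "d3"], ["d3", "d1", "d4"])
print(fused)  # → ['d1', 'd3', 'd2', 'd4']
```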
After retrieval:
- Top-K chunks formatted as context
- GPT-4o-mini generates answer based solely on context
- Answer streamed token-by-token to UI
- Source chunks displayed with relevance scores
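Token-by-token streaming amounts to consuming the model's token generator as tokens arrive and re-rendering the accumulated text. A sketch with a fake generator standing in for the OpenAI stream:

```python
def fake_llm_stream(answer):
    """Stand-in for the OpenAI streaming API: yields one token at a time."""
    for token in answer.split(" "):
        yield token + " "

def stream_answer(tokens):
    """Accumulate tokens as they arrive, as the UI placeholder would."""
    rendered = ""
    for token in tokens:
        rendered += token  # Streamlit would update a placeholder here
    return rendered.rstrip()

print(stream_answer(fake_llm_stream("RAG grounds answers in your documents")))
```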
- ✅ Choose the right preset: High Precision for specific queries, High Recall for research
- ✅ Upload well-structured PDFs with clear text (avoid scanned images)
- ✅ Use the A/B testing framework to find optimal settings for your documents
- ✅ Check relevance scores to understand result quality
- ✅ Use Collections to organize documents by topic/project
- ✅ Enable reranking for important queries (at the cost of some latency)
- PDF Only: Currently supports PDF files only (no DOCX, TXT, etc.)
- Text Extraction: Quality depends on PDF structure (scanned PDFs won't work)
- Context Window: Limited to top-K chunks (may miss relevant information)
- English Focused: Best performance with English documents
- Local Storage: Vector store persisted locally (not suitable for multi-user production)
Completed ✅:
- Hybrid search (BM25 + semantic) with RRF fusion
- Re-ranking support (Cohere cloud + Jina local) with auto-selection
- Conversation history and follow-up questions
- Retrieval presets (High Precision / Balanced / High Recall)
- Score visibility for search results
- A/B testing framework for retrieval methods
- Interactive "How It Works" documentation page
- Multi-document collections with isolated searches
- Sidebar navigation across all pages
- Documents panel on home page
- Per-collection search history persistence
- Shared UI components (DRY refactoring)
- Comprehensive test suite (290+ tests)
- Document persistence across app restarts
Future enhancements planned:
- Support for multiple file formats (DOCX, TXT, HTML, Markdown)
- Document similarity and comparison features
- Export search results to CSV/JSON
- Advanced chunking strategies (sentence-based, semantic)
- Citation tracking (show exact page/paragraph references)
- Docker containerization
- Next.js production UI (Stage 2)
- FastAPI backend with Supabase (Stage 2)
- User authentication and multi-user support
Contributions are welcome! Please:
- Fork the repository
- Create a feature branch (`git checkout -b feature/amazing-feature`)
- Commit your changes (`git commit -m 'Add amazing feature'`)
- Push to the branch (`git push origin feature/amazing-feature`)
- Open a Pull Request
This project is licensed under the MIT License - see the LICENSE file for details.
- LangChain - LLM application framework
- ChromaDB - Vector database
- Streamlit - Web application framework
- OpenAI - Embeddings and chat models
- Cohere - Re-ranking API
- Jina AI - Local re-ranking models
Harsh
- GitHub: @shrimpy8
- Repository: semantic-serach
If you encounter any issues or have questions:
- Check the Troubleshooting section
- Review existing GitHub Issues
- Create a new issue with detailed description
Made with ❤️ using LangChain, ChromaDB, and OpenAI