RAG System - Hybrid Retrieval System

A Retrieval-Augmented Generation (RAG) system implementing hybrid (sparse + dense) retrieval and a multi-stage pipeline with reranking.

Features

🔍 Retrieval Methods

  • BM25 Sparse Retrieval: Lexical retrieval based on term-frequency statistics
  • Dense Retrieval: Embeddings generated with sentence-transformers, searched with FAISS
  • Hybrid Retrieval: Combines BM25 and dense retrieval to capture both lexical matches and semantic similarity
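The fusion idea can be sketched in a few lines: normalize each retriever's scores to a common range and take a weighted sum. This is a minimal illustration, not the repository's code; the helper names and the `alpha` weight are assumptions.

```python
def min_max(scores):
    """Normalize a list of scores to [0, 1]; a constant list maps to all zeros."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0] * len(scores)
    return [(s - lo) / (hi - lo) for s in scores]

def fuse(bm25_scores, dense_scores, alpha=0.5):
    """Per-document weighted sum of normalized sparse and dense scores."""
    b, d = min_max(bm25_scores), min_max(dense_scores)
    return [alpha * x + (1 - alpha) * y for x, y in zip(b, d)]

# Doc 0 wins on BM25, doc 2 on dense similarity; fusion balances both signals.
fused = fuse([2.1, 0.5, 1.0], [0.20, 0.35, 0.90], alpha=0.5)
```

With `alpha=0.5` both signals count equally; raising it favors exact lexical matches.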

🔄 Reranking

  • TF-IDF reranking (stable and reliable)
  • Extensible support for neural reranking models (BGE, Cross-Encoder, etc.)
  • Configurable reranking models
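A TF-IDF reranker of this kind can be sketched with scikit-learn as below. This is an illustrative reconstruction under assumed behavior, not the repository's `simple_rerank` implementation.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def tfidf_rerank(query, documents, top_k=3):
    """Score each candidate by TF-IDF cosine similarity to the query."""
    vec = TfidfVectorizer()
    doc_matrix = vec.fit_transform(documents)   # fit vocabulary on candidates
    query_vec = vec.transform([query])          # project query into same space
    sims = cosine_similarity(query_vec, doc_matrix)[0]
    ranked = sorted(zip(documents, sims), key=lambda p: p[1], reverse=True)
    return ranked[:top_k]

docs = [
    "Transformers use self-attention to process sequences in parallel.",
    "LSTMs are a type of recurrent neural network.",
    "FAISS performs fast similarity search over dense vectors.",
]
top = tfidf_rerank("how do transformers handle long sequences", docs, top_k=2)
```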

📊 Multi-stage Retrieval Pipeline

  1. Stage 1: Hybrid retrieval to get candidate documents
  2. Stage 2: Reranking to improve result quality

Installation

Requirements

  • Python 3.12+
  • uv package manager (recommended) or pip

Using uv (Recommended)

# Clone repository
git clone <your-repo-url>
cd rag_system

# Install dependencies
uv sync

# Activate virtual environment
source .venv/bin/activate  # Linux/Mac
# or .venv\Scripts\activate  # Windows

Using pip

pip install faiss-cpu rank-bm25 torch transformers numpy scikit-learn

Usage

Basic Usage

from hybrid_retrieval import multi_stage_retrieval

# Execute query
query = "How do transformers handle long sequences?"
results = multi_stage_retrieval(query)

# View results
for i, (doc, score) in enumerate(results):
    print(f"Document {i+1} (Score: {score:.4f}):")
    print(doc)
    print()

Custom Parameters

# Custom retrieval parameters
results = multi_stage_retrieval(
    query=query,
    initial_k=10,  # stage 1: retrieve 10 candidate documents
    final_k=5      # stage 2: return the 5 most relevant after reranking
)

Direct Component Usage

from hybrid_retrieval import hybrid_retrieval, simple_rerank

# Hybrid retrieval
hybrid_results = hybrid_retrieval(query, k=5)

# Get documents and rerank
documents = [doc for doc, _ in hybrid_results]
reranked_results = simple_rerank(query, documents, top_k=3)

Project Structure

  • hybrid_retrieval.py - Main retrieval and reranking functionality
  • multistage_retrieval.py - Simplified version of multi-stage retrieval
  • myrag.py - Basic RAG implementation
  • main.py - Example main program
  • test_*.py - Various test files

Tech Stack

  • Vector Database: FAISS (Facebook AI Similarity Search)
  • Text Embeddings: sentence-transformers/all-MiniLM-L6-v2
  • Sparse Retrieval: rank-bm25
  • Reranking: scikit-learn (TF-IDF)
  • Deep Learning: PyTorch, Transformers
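At small scale, the dense-retrieval step in this stack reduces to cosine similarity over embedding vectors; FAISS replaces the brute-force search below with an efficient index. Here is a NumPy-only sketch with made-up 4-dimensional "embeddings" (a real system would produce them with the sentence-transformers model above).

```python
import numpy as np

def top_k_cosine(query_vec, doc_matrix, k=2):
    """Return indices of the k documents most cosine-similar to the query."""
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_matrix / np.linalg.norm(doc_matrix, axis=1, keepdims=True)
    sims = d @ q                   # cosine similarity against every document
    return np.argsort(-sims)[:k]   # highest similarity first

doc_matrix = np.array([
    [0.9, 0.1, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.1],
    [0.1, 0.0, 0.9, 0.2],
])
idx = top_k_cosine(np.array([1.0, 0.1, 0.0, 0.0]), doc_matrix, k=2)
```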

Example Output

Query: How do transformers handle long sequences?

Stage 1: Hybrid retrieval...
Initial results:
  1. (Score: 0.6543) Transformers use self-attention mechanisms...
  2. (Score: 0.5432) Recurrent Neural Networks (RNNs)...
  ...

Stage 2: Reranking...
Final results:
Document 1 (Score: 0.3388):
Transformers use self-attention mechanisms to process sequences in parallel...

Document 2 (Score: 0.3234):
Long Short-Term Memory (LSTM) networks are a type of RNN...

Known Issues and Solutions

Cross-Encoder NaN Issues

Some cross-encoder models (e.g. cross-encoder/ms-marco-MiniLM-L-6-v2) produce NaN scores in certain environments. Mitigations:

  • Use a BGE reranker model as an alternative
  • Fall back to TF-IDF reranking, which is stable across environments
  • Error handling with a multi-model retry mechanism is built in
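The retry-and-fallback behavior can be sketched as a wrapper that tries rerankers in order and skips any that fail or emit NaN. The scorer signatures and names here are illustrative, not the repository's API.

```python
import math

def rerank_with_fallback(query, documents, scorers):
    """Try each (name, scorer) in order; skip any that raises or returns NaN."""
    for name, scorer in scorers:
        try:
            scores = scorer(query, documents)
        except Exception:
            continue                        # model failed to load or run
        if any(math.isnan(s) for s in scores):
            continue                        # NaN output: try the next scorer
        ranked = sorted(zip(documents, scores), key=lambda p: p[1], reverse=True)
        return name, ranked
    raise RuntimeError("every reranker failed or produced NaN scores")

# A broken "cross-encoder" stand-in, then a trivial but stable word-overlap scorer.
broken = lambda q, docs: [float("nan")] * len(docs)
stable = lambda q, docs: [len(set(q.split()) & set(d.split())) for d in docs]
name, ranked = rerank_with_fallback(
    "transformers long sequences",
    ["transformers handle long sequences", "lstm networks"],
    [("cross-encoder", broken), ("tf-idf", stable)],
)
```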

Extension Suggestions

  1. Add more reranking models:
     • BGE-reranker-large
     • ColBERT
     • SPLADE
  2. Performance optimization:
     • Document index caching
     • Batch processing
     • GPU acceleration
  3. Feature enhancement:
     • Query expansion
     • Document chunking strategies
     • Relevance feedback
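For the chunking suggestion, a minimal sliding-window splitter might look like this: word-based windows with a fixed overlap so context at chunk boundaries is not lost. The function and parameter names are illustrative.

```python
def chunk_words(text, chunk_size=50, overlap=10):
    """Split text into word windows of chunk_size sharing `overlap` words."""
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break                  # last window already covers the tail
    return chunks

# 120 words with 50-word windows and 10-word overlap -> 3 chunks.
chunks = chunk_words(" ".join(str(i) for i in range(120)), chunk_size=50, overlap=10)
```

Smaller chunks improve retrieval precision; the overlap keeps sentences that straddle a boundary retrievable from either side.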

Contributing

Issues and Pull Requests are welcome!

License

MIT License
