An agent-driven Retrieval-Augmented Generation (RAG) system that retrieves information from a document knowledge base using multiple retrieval strategies and dynamically orchestrates them through a LangGraph workflow.
The system combines:
- Semantic vector retrieval
- Keyword search (BM25)
- Query decomposition
- Cross-encoder reranking
- Self-evaluation and retry strategies
This architecture enables more reliable and grounded answers compared to traditional RAG pipelines.
The pipeline is implemented as a LangGraph workflow that dynamically routes queries through multiple retrieval strategies before generating and evaluating an answer.
The graph is defined in graph_builder.py using LangGraph's StateGraph, and processes every query through the following stages:
| Stage | Description |
|---|---|
| 1. Query Router | Classifies the query as factual or complex |
| 2. Query Decomposition | Breaks complex queries into sub-questions |
| 3. Vector Retrieval | Semantic search via ChromaDB embeddings |
| 4. Keyword Retrieval | BM25-based keyword search |
| 5. Hybrid Combination | Merges and deduplicates results from both retrievers |
| 6. Cross-Encoder Reranking | Reranks results by relevance |
| 7. Answer Generation | Generates a grounded answer from retrieved context |
| 8. Self-Evaluation | Scores the answer on relevance, completeness, and grounding |
| 9. Retry / Fallback | Rewrites query and retries if score is below threshold |
The workflow automatically retries if the generated answer is judged to be low quality.
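In outline, the generate-evaluate-retry loop behaves like the following sketch. The stub callables stand in for the actual LangGraph nodes, and the threshold and retry limit are illustrative values, not the project's configured defaults:

```python
# Illustrative sketch of the workflow's control flow (not the actual
# graph_builder.py code). retrieve/generate/evaluate are stand-ins for
# the real LangGraph nodes.

def run_pipeline(query, retrieve, generate, evaluate,
                 threshold=0.7, max_retries=2):
    """Route a query through retrieval, generation, and self-evaluation,
    retrying when the evaluation score falls below the threshold."""
    for attempt in range(max_retries + 1):
        docs = retrieve(query, attempt)   # hybrid retrieval (stages 3-6)
        answer = generate(query, docs)    # answer generation (stage 7)
        score = evaluate(query, answer)   # self-evaluation (stage 8)
        if score >= threshold:
            return answer, score
    # Stage 9: retries exhausted, return a safe fallback instead of a
    # low-quality answer.
    return ("I could not find relevant information in the knowledge base "
            "to answer this question."), 0.0
```

In the real graph the retry branch also rewrites the query and widens top_k, which is why the sketch passes the attempt number into the retrieval callable.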
The system uses multiple complementary retrieval strategies to maximize the chance of retrieving relevant context. Each method focuses on a different aspect of information retrieval.
Vector search retrieves documents based on semantic similarity, not exact words.
Implementation Overview
- Documents are converted into embeddings using a SentenceTransformer model.
- The embeddings are stored in a Chroma vector database.
- When a query is received, it is embedded and compared against stored document vectors.
- The most semantically similar chunks are returned.
The retrieved documents are returned along with a similarity score and used as context for the LLM.
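The underlying principle can be sketched in plain Python, with precomputed toy vectors standing in for the SentenceTransformer embeddings and a plain list standing in for the Chroma index:

```python
import math

# Toy sketch of semantic retrieval: precomputed vectors stand in for
# SentenceTransformer embeddings, and a plain list replaces Chroma.

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def vector_search(query_vec, doc_vecs, docs, top_k=2):
    """Return the top_k documents ranked by cosine similarity to the query."""
    scored = sorted(zip(docs, doc_vecs),
                    key=lambda pair: cosine(query_vec, pair[1]),
                    reverse=True)
    return [(doc, cosine(query_vec, vec)) for doc, vec in scored[:top_k]]
```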
Keyword search is implemented using BM25, a probabilistic ranking algorithm commonly used in search engines.
Unlike vector search, BM25 focuses on exact keyword matching and term frequency.
Implementation Overview
- Documents are tokenized and cleaned.
- Stopwords are removed.
- A BM25 index is built over the document tokens.
- Queries are matched against the index to rank documents.
Keyword Extraction
The query is normalized and filtered to remove stopwords.
```python
import re
from nltk.corpus import stopwords

# NLTK's English stopword list (downloaded during setup).
STOPWORDS = set(stopwords.words("english"))

tokens = re.findall(r"\b\w+\b", query.lower())
keywords = [word for word in tokens if word not in STOPWORDS]
```

BM25 Index Construction
```python
from rank_bm25 import BM25Okapi

tokenized_docs = [extract_keywords(doc.page_content) for doc in docs]
bm25 = BM25Okapi(tokenized_docs)
```

Retrieval
```python
scores = bm25.get_scores(tokenized_query)
```

Documents with the highest BM25 scores are returned.
To improve retrieval quality, the system combines both retrieval strategies.
Process
- Vector search retrieves semantically relevant documents.
- BM25 retrieves keyword-matching documents.
- Results from both searches are merged.
- Duplicate documents are removed.
This hybrid approach improves:
- Recall (more relevant documents retrieved)
- Precision (better ranking after reranking)
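The merge-and-deduplicate step can be sketched as follows. Deduplication here keys on the document text itself; the actual implementation may key on document IDs or metadata instead:

```python
# Sketch of the hybrid combination step: merge vector and BM25 results,
# dropping duplicates while preserving the original ranking order.

def combine_hybrid(vector_docs, keyword_docs):
    seen = set()
    merged = []
    for doc in vector_docs + keyword_docs:
        if doc not in seen:
            seen.add(doc)
            merged.append(doc)
    return merged
```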
After hybrid retrieval, the results are reranked using a Cross-Encoder model.
Unlike embedding similarity, cross-encoders evaluate the query and document together, producing a more accurate relevance score.
Reranking Process
- Query-document pairs are created.
- Each pair is scored by the cross-encoder.
- Documents are sorted by predicted relevance.
The top-ranked documents are used for answer generation.
This step significantly improves answer quality.
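Structurally, reranking reduces to scoring pairs and sorting. In this sketch, score_pair is a stand-in for the cross-encoder's prediction call; any callable mapping (query, doc) to a float works:

```python
# Sketch of the reranking step. score_pair stands in for the
# cross-encoder model; it scores the query and document *together*,
# unlike independent embedding similarity.

def rerank(query, docs, score_pair, top_k=3):
    """Score each (query, doc) pair and return the top_k docs by score."""
    scored = [(score_pair(query, doc), doc) for doc in docs]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored[:top_k]]
```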
One of the key features of the system is its self-correcting retry mechanism.
Instead of immediately returning a low-quality answer, the system evaluates its output and attempts to improve retrieval automatically.
After generating an answer, the system evaluates it using an LLM.
The evaluation prompt asks the model to score the answer based on:
- factual grounding
- completeness
- relevance
The evaluator returns a score between 0 and 1.
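Because the evaluator is an LLM replying in free-form text, the numeric score has to be parsed defensively. The reply format in this sketch is an assumption for illustration, not the project's actual prompt contract:

```python
import re

# Hypothetical sketch: extract the evaluator's score from free-form
# LLM output. The "Score: 0.86" style reply is an assumed format.

def parse_evaluation_score(llm_output, default=0.0):
    """Return the first number in [0, 1] found in the evaluator's reply.
    Falls back to a default (treated as a failing score) if none is found."""
    for match in re.findall(r"\d?\.\d+|\d+", llm_output):
        value = float(match)
        if 0.0 <= value <= 1.0:
            return value
    return default
```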
The evaluation score determines the next action.
| Score | Action |
|---|---|
| >= threshold | Accept answer |
| < threshold | Retry retrieval |
| retries exceeded | Return fallback |
The threshold is configurable in the graph configuration.
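The resulting decision logic can be sketched as follows; the threshold and retry limit are illustrative defaults, not the project's configured values:

```python
# Sketch of the post-evaluation routing decision from the table above.
# Threshold and max_retries are illustrative; the real values live in
# the graph configuration.

def next_action(score, retries, threshold=0.7, max_retries=2):
    if score >= threshold:
        return "accept"
    if retries >= max_retries:
        return "fallback"
    return "retry"
```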
During the second retry attempt, the query is rewritten to broaden retrieval coverage.
Example:
Original query:
"What is Linux CFS scheduling?"
Expanded query:
"Explain Linux process scheduling and the Completely Fair Scheduler."
This allows the system to retrieve documents that may not match the original query exactly.
If another retry occurs:
- Retrieval top_k is increased
- More documents are searched
- Hybrid retrieval is executed again
Example adjustment:
top_k = 20 → 30
This increases the chances of finding relevant information.
If the system reaches the maximum number of retries and still cannot produce a sufficiently good answer, it returns a safe fallback.
Example:
"I could not find relevant information in the knowledge base to answer this question."
This prevents the LLM from hallucinating unsupported answers.
This architecture ensures that answers are:
- grounded in retrieved documents
- automatically improved if quality is low
- safe against hallucinations
| Component | Role |
|---|---|
| FastAPI | REST API |
| LangGraph | Agent workflow orchestration |
| LangChain | LLM integration |
| ChromaDB | Vector database |
| Sentence Transformers | Embeddings |
| Cross-Encoder | Reranking |
| BM25 | Keyword retrieval |
| Component | Role |
|---|---|
| React 19 | UI framework |
| Vite | Build tool and dev server |
| Tailwind CSS | Styling |
| ReactMarkdown | Markdown rendering |
```
Agentic-RAG-Pipeline/
│
├── backend/
│   ├── graph/
│   │   ├── graph_builder.py   # LangGraph workflow definition
│   │   ├── nodes.py           # Individual node implementations
│   │   └── state.py           # Shared state schema
│   │
│   ├── ingestion.py           # Document chunking and indexing
│   ├── retriever.py           # Vector and BM25 retrieval logic
│   ├── evaluation.py          # LLM-based answer scoring
│   └── server.py              # FastAPI application
│
├── frontend/
│   └── src/
│       ├── components/        # React UI components
│       ├── api.js             # API client
│       └── main.jsx           # App entry point
│
├── requirements.txt
└── README.md
```
- Python 3.9+
- Node.js 18+
- An OpenRouter API key
1. Clone the repository

```bash
git clone https://github.com/vicky150612/Agentic-RAG
cd Agentic-RAG-Pipeline
```

2. Create and activate a virtual environment

```bash
python -m venv venv
```

On Windows:

```bash
venv\Scripts\activate
```

On macOS / Linux:

```bash
source venv/bin/activate
```

3. Install Python dependencies

```bash
pip install -r requirements.txt
```

4. Download NLTK stopwords

```bash
python -m nltk.downloader stopwords
```

5. Configure environment variables

```bash
cp backend/.env.example backend/.env
```

Change the values in the .env file as required.
6. Start the backend server

```bash
cd backend
python server.py
```

The server will be available at http://localhost:8000.
1. Navigate to the frontend directory

```bash
cd frontend
```

2. Install dependencies

```bash
npm install
```

3. Start the development server

```bash
npm run dev
```

The frontend will be available at http://localhost:5173.
GET /health
Returns the current system status.
POST /ingest
Upload one or more documents (.pdf or .txt). Documents are chunked using RecursiveCharacterTextSplitter and stored in the vector database.
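For illustration, here is a much-simplified chunker with the same output shape as the splitter used above: fixed-size character chunks with overlap. The real RecursiveCharacterTextSplitter additionally prefers to break on separators (paragraphs, then sentences) before falling back to raw character offsets:

```python
# Simplified stand-in for LangChain's RecursiveCharacterTextSplitter:
# fixed-size character chunks with a configurable overlap between
# consecutive chunks, so context is not cut mid-thought at boundaries.

def chunk_text(text, chunk_size=500, overlap=50):
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap
    return chunks
```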
POST /query
Request:

```json
{
  "query": "What is Linux fair scheduling?"
}
```

Response:

```json
{
  "query": "What is Linux fair scheduling?",
  "route": "hybrid",
  "subquestions": [
    "What is Linux process scheduling?",
    "How does the Linux kernel schedule tasks?",
    "What is the Completely Fair Scheduler?"
  ],
  "retrieved_docs": [...],
  "keyword_docs": [...],
  "hybrid_docs": [...],
  "final_docs": [...],
  "answer": "The Completely Fair Scheduler (CFS) is the default CPU scheduler in Linux...",
  "evaluation_score": 0.86
}
```

Supported document formats:

- PDF (.pdf)
- Plain text (.txt)
- Streaming responses with Server-Sent Events
- Better evaluation metrics
- Observability with LangSmith
- Graph visualization in frontend
- Multi-document reasoning
