An agent-driven Retrieval-Augmented Generation (RAG) system that retrieves information from a document knowledge base using multiple retrieval strategies and dynamically orchestrates them through a LangGraph workflow.
The system combines:
- Semantic vector retrieval
- Keyword search (BM25)
- Query decomposition
- Cross-encoder reranking
- Self-evaluation and retry strategies
This architecture enables more reliable and grounded answers compared to traditional RAG pipelines.
The pipeline is implemented as a LangGraph workflow that dynamically routes queries through multiple retrieval strategies before generating and evaluating an answer.
The graph is defined in graph_builder.py using LangGraph's StateGraph, and processes every query through the following stages:
| Stage | Description |
|---|---|
| 1. Query Router | Classifies the query as factual or complex |
| 2. Query Decomposition | Breaks complex queries into sub-questions |
| 3. Vector Retrieval | Semantic search via ChromaDB embeddings |
| 4. Keyword Retrieval | BM25-based keyword search |
| 5. Hybrid Combination | Merges and deduplicates results from both retrievers |
| 6. Cross-Encoder Reranking | Reranks results by relevance |
| 7. Answer Generation | Generates a grounded answer from retrieved context |
| 8. Self-Evaluation | Scores the answer on relevance, completeness, and grounding |
| 9. Retry / Fallback | Rewrites query and retries if score is below threshold |
The workflow automatically retries if the generated answer is judged to be low quality.
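In outline, the generate-evaluate-retry loop behaves like the following sketch. The stub callables stand in for the actual LangGraph nodes, and the threshold and retry limit are illustrative values, not the project's configured defaults:

```python
# Illustrative sketch of the workflow's control flow (not the actual
# graph_builder.py code). retrieve/generate/evaluate are stand-ins for
# the real LangGraph nodes.

def run_pipeline(query, retrieve, generate, evaluate,
                 threshold=0.7, max_retries=2):
    """Route a query through retrieval, generation, and self-evaluation,
    retrying when the evaluation score falls below the threshold."""
    for attempt in range(max_retries + 1):
        docs = retrieve(query, attempt)   # hybrid retrieval (stages 3-6)
        answer = generate(query, docs)    # answer generation (stage 7)
        score = evaluate(query, answer)   # self-evaluation (stage 8)
        if score >= threshold:
            return answer, score
    # Stage 9: retries exhausted, return a safe fallback instead of a
    # low-quality answer.
    return ("I could not find relevant information in the knowledge base "
            "to answer this question."), 0.0
```

In the real graph the retry branch also rewrites the query and widens top_k, which is why the sketch passes the attempt number into the retrieval callable.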
The system uses multiple complementary retrieval strategies to maximize the chance of retrieving relevant context. Each method focuses on a different aspect of information retrieval.
Vector search retrieves documents based on semantic similarity, not exact words.
Implementation Overview
- Documents are converted into embeddings using a SentenceTransformer model.
- The embeddings are stored in a Chroma vector database.
- When a query is received, it is embedded and compared against stored document vectors.
- The most semantically similar chunks are returned.
The retrieved documents are returned along with a similarity score and used as context for the LLM.
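The underlying principle can be sketched in plain Python, with precomputed toy vectors standing in for the SentenceTransformer embeddings and a plain list standing in for the Chroma index:

```python
import math

# Toy sketch of semantic retrieval: precomputed vectors stand in for
# SentenceTransformer embeddings, and a plain list replaces Chroma.

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def vector_search(query_vec, doc_vecs, docs, top_k=2):
    """Return the top_k documents ranked by cosine similarity to the query."""
    scored = sorted(zip(docs, doc_vecs),
                    key=lambda pair: cosine(query_vec, pair[1]),
                    reverse=True)
    return [(doc, cosine(query_vec, vec)) for doc, vec in scored[:top_k]]
```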
Keyword search is implemented using BM25, a probabilistic ranking algorithm commonly used in search engines.
Unlike vector search, BM25 focuses on exact keyword matching and term frequency.
Implementation Overview
- Documents are tokenized and cleaned.
- Stopwords are removed.
- A BM25 index is built over the document tokens.
- Queries are matched against the index to rank documents.
Keyword Extraction
The query is normalized and filtered to remove stopwords.
```python
import re
from nltk.corpus import stopwords

# NLTK's English stopword list (downloaded during setup).
STOPWORDS = set(stopwords.words("english"))

tokens = re.findall(r"\b\w+\b", query.lower())
keywords = [word for word in tokens if word not in STOPWORDS]
```

BM25 Index Construction
```python
from rank_bm25 import BM25Okapi

tokenized_docs = [extract_keywords(doc.page_content) for doc in docs]
bm25 = BM25Okapi(tokenized_docs)
```

Retrieval
```python
scores = bm25.get_scores(tokenized_query)
```

Documents with the highest BM25 scores are returned.
To improve retrieval quality, the system combines both retrieval strategies.
Process
- Vector search retrieves semantically relevant documents.
- BM25 retrieves keyword-matching documents.
- Results from both searches are merged.
- Duplicate documents are removed.
This hybrid approach improves:
- Recall (more relevant documents retrieved)
- Precision (better ranking after reranking)
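The merge-and-deduplicate step can be sketched as follows. Deduplication here keys on the document text itself; the actual implementation may key on document IDs or metadata instead:

```python
# Sketch of the hybrid combination step: merge vector and BM25 results,
# dropping duplicates while preserving the original ranking order.

def combine_hybrid(vector_docs, keyword_docs):
    seen = set()
    merged = []
    for doc in vector_docs + keyword_docs:
        if doc not in seen:
            seen.add(doc)
            merged.append(doc)
    return merged
```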
After hybrid retrieval, the results are reranked using a Cross-Encoder model.
Unlike embedding similarity, cross-encoders evaluate the query and document together, producing a more accurate relevance score.
Reranking Process
- Query-document pairs are created.
- Each pair is scored by the cross-encoder.
- Documents are sorted by predicted relevance.
The top-ranked documents are used for answer generation.
This step significantly improves answer quality.
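Structurally, reranking reduces to scoring pairs and sorting. In this sketch, score_pair is a stand-in for the cross-encoder's prediction call; any callable mapping (query, doc) to a float works:

```python
# Sketch of the reranking step. score_pair stands in for the
# cross-encoder model; it scores the query and document *together*,
# unlike independent embedding similarity.

def rerank(query, docs, score_pair, top_k=3):
    """Score each (query, doc) pair and return the top_k docs by score."""
    scored = [(score_pair(query, doc), doc) for doc in docs]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored[:top_k]]
```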
One of the key features of the system is its self-correcting retry mechanism.
Instead of immediately returning a low-quality answer, the system evaluates its output and attempts to improve retrieval automatically.
After generating an answer, the system evaluates it using an LLM.
The evaluation prompt asks the model to score the answer based on:
- factual grounding
- completeness
- relevance
The evaluator returns a score between 0 and 1.
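Because the evaluator is an LLM replying in free-form text, the numeric score has to be parsed defensively. The reply format in this sketch is an assumption for illustration, not the project's actual prompt contract:

```python
import re

# Hypothetical sketch: extract the evaluator's score from free-form
# LLM output. The "Score: 0.86" style reply is an assumed format.

def parse_evaluation_score(llm_output, default=0.0):
    """Return the first number in [0, 1] found in the evaluator's reply.
    Falls back to a default (treated as a failing score) if none is found."""
    for match in re.findall(r"\d?\.\d+|\d+", llm_output):
        value = float(match)
        if 0.0 <= value <= 1.0:
            return value
    return default
```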
The evaluation score determines the next action.
| Score | Action |
|---|---|
| >= threshold | Accept answer |
| < threshold | Retry retrieval |
| retries exceeded | Return fallback |
The threshold is configurable in the graph configuration.
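The resulting decision logic can be sketched as follows; the threshold and retry limit are illustrative defaults, not the project's configured values:

```python
# Sketch of the post-evaluation routing decision from the table above.
# Threshold and max_retries are illustrative; the real values live in
# the graph configuration.

def next_action(score, retries, threshold=0.7, max_retries=2):
    if score >= threshold:
        return "accept"
    if retries >= max_retries:
        return "fallback"
    return "retry"
```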
During the second retry attempt, the query is rewritten to broaden retrieval coverage.
Example:
Original query:
"What is Linux CFS scheduling?"
Expanded query:
"Explain Linux process scheduling and the Completely Fair Scheduler."
This allows the system to retrieve documents that may not match the original query exactly.
If another retry occurs:
- Retrieval top_k is increased
- More documents are searched
- Hybrid retrieval is executed again
Example adjustment:
top_k = 20 → 30
This increases the chances of finding relevant information.
If the system reaches the maximum number of retries and still cannot produce a sufficiently good answer, it returns a safe fallback.
Example:
"I could not find relevant information in the knowledge base to answer this question."
This prevents the LLM from hallucinating unsupported answers.
This architecture ensures that answers are:
- grounded in retrieved documents
- automatically improved if quality is low
- safe against hallucinations
| Component | Role |
|---|---|
| FastAPI | REST API |
| LangGraph | Agent workflow orchestration |
| LangChain | LLM integration |
| ChromaDB | Vector database |
| Sentence Transformers | Embeddings |
| Cross-Encoder | Reranking |
| BM25 | Keyword retrieval |
| Component | Role |
|---|---|
| React 19 | UI framework |
| Vite | Build tool and dev server |
| Tailwind CSS | Styling |
| ReactMarkdown | Markdown rendering |
```
Agentic-RAG-Pipeline/
│
├── backend/
│   ├── graph/
│   │   ├── graph_builder.py   # LangGraph workflow definition
│   │   ├── nodes.py           # Individual node implementations
│   │   └── state.py           # Shared state schema
│   │
│   ├── ingestion.py           # Document chunking and indexing
│   ├── retriever.py           # Vector and BM25 retrieval logic
│   ├── evaluation.py          # LLM-based answer scoring
│   └── server.py              # FastAPI application
│
├── frontend/
│   └── src/
│       ├── components/        # React UI components
│       ├── api.js             # API client
│       └── main.jsx           # App entry point
│
├── requirements.txt
└── README.md
```
- Python 3.9+
- Node.js 18+
- An OpenRouter API key
1. Clone the repository

```bash
git clone https://github.com/vicky150612/Agentic-RAG
cd Agentic-RAG-Pipeline
```

2. Create and activate a virtual environment

```bash
python -m venv venv
```

On Windows:

```bash
venv\Scripts\activate
```

On macOS / Linux:

```bash
source venv/bin/activate
```

3. Install Python dependencies

```bash
pip install -r requirements.txt
```

4. Download NLTK stopwords

```bash
python -m nltk.downloader stopwords
```

5. Configure environment variables

```bash
cp backend/.env.example backend/.env
```

Change the values in the .env file as required.
6. Start the backend server

```bash
cd backend
python server.py
```

The server will be available at http://localhost:8000.
1. Navigate to the frontend directory

```bash
cd frontend
```

2. Install dependencies

```bash
npm install
```

3. Start the development server

```bash
npm run dev
```

The frontend will be available at http://localhost:5173.
GET /health
Returns the current system status.
POST /ingest
Upload one or more documents (.pdf or .txt). Documents are chunked using RecursiveCharacterTextSplitter and stored in the vector database.
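For illustration, here is a much-simplified chunker with the same output shape as the splitter used above: fixed-size character chunks with overlap. The real RecursiveCharacterTextSplitter additionally prefers to break on separators (paragraphs, then sentences) before falling back to raw character offsets:

```python
# Simplified stand-in for LangChain's RecursiveCharacterTextSplitter:
# fixed-size character chunks with a configurable overlap between
# consecutive chunks, so context is not cut mid-thought at boundaries.

def chunk_text(text, chunk_size=500, overlap=50):
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap
    return chunks
```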
POST /query
Request:

```json
{
  "query": "What is Linux fair scheduling?"
}
```

Response:

```json
{
  "query": "What is Linux fair scheduling?",
  "route": "hybrid",
  "subquestions": [
    "What is Linux process scheduling?",
    "How does the Linux kernel schedule tasks?",
    "What is the Completely Fair Scheduler?"
  ],
  "retrieved_docs": [...],
  "keyword_docs": [...],
  "hybrid_docs": [...],
  "final_docs": [...],
  "answer": "The Completely Fair Scheduler (CFS) is the default CPU scheduler in Linux...",
  "evaluation_score": 0.86
}
```

Supported document formats:

- PDF (.pdf)
- Plain text (.txt)
- Streaming responses with Server-Sent Events
- Better evaluation metrics
- Observability with LangSmith
- Graph visualization in frontend
- Multi-document reasoning
