RAG system with LangGraph state machine, hybrid search, SSE streaming, and NeMo Guardrails security layer.
Frontend: Streamlit | Backend: GCP Cloud Run | Vector DB: Qdrant Cloud
┌─────────────┐
│ FastAPI │ REST API (upload, stream endpoints)
└──────┬──────┘
│
▼
┌──────────────────┐
│ NeMo Guardrails │ Input check (LLM-based: jailbreak, prompt injection)
└──────┬───────────┘
│ safe
▼
┌──────────────────┐
│ LangGraph Agent │ State machine → streams tokens via SSE
└──────┬───────────┘
│ full response
▼
┌──────────────────┐
│ NeMo Guardrails │ Output check (LLM-based: harmful content, policy)
└──────────────────┘
┌─────────┐
│ START │
└────┬────┘
│
▼
┌───────────┐
│ Router │ (Classify + rewrite query in one LLM call)
└─────┬─────┘
│
┌──────────┴──────────┐
│ │
always web_search=true
│ │
▼ ▼
┌──────────┐ ┌──────────┐
│ Retrieve │ │WebSearch │
│(Hybrid) │ │ │
└────┬─────┘ └────┬─────┘
│ │
└──────────┬──────────┘
│ (parallel fan-in)
▼
┌─────────────┐
│ Grade Docs │ (Batch LLM: relevant?)
└──────┬──────┘
│
▼
┌──────────┐
│ Generate │ (Stream answer via SSE)
└────┬─────┘
│
▼
┌─────┐
│ END │
└─────┘
Node Descriptions:
- Router: LLM classifies query type and rewrites it for semantic search. If web search is needed, both branches run in parallel.
- Retrieve: Always runs. Hybrid search (60% vector similarity + 40% BM25 keyword), top-5 results.
- WebSearch: Runs in parallel with Retrieve when the router flags web_search=true.
- Grade Docs: Batch LLM grading over the merged result set from both sources.
- Generate: Synthesizes an answer from the graded documents and streams tokens via SSE.
Documents (PDF, DOCX, TXT) are chunked with overlap, embedded using text-embedding-3-small (1536 dimensions), and stored in Qdrant with metadata (filename, page numbers, chunk index).
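A minimal sketch of the chunking step described above. The chunk size, overlap, and metadata field names here are illustrative assumptions, not the project's actual parameters:

```python
def chunk_text(text: str, chunk_size: int = 500, overlap: int = 100) -> list[str]:
    """Split text into fixed-size chunks with overlap (sizes are illustrative)."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(text), step):
        chunk = text[start:start + chunk_size]
        if chunk:
            chunks.append(chunk)
        if start + chunk_size >= len(text):
            break
    return chunks

# Each chunk is stored in Qdrant with metadata so results trace back to the source.
chunks = chunk_text("a" * 1200, chunk_size=500, overlap=100)
payloads = [
    {"filename": "report.pdf", "page": 1, "chunk_index": i, "text": c}
    for i, c in enumerate(chunks)
]
```

Overlap preserves context across chunk boundaries so a sentence split at a boundary still appears whole in at least one chunk.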
Security (NeMo input check):
- LLM-based input rail using the self_check_input prompt
- Colang flows catch: prompt injection, jailbreaks, off-topic requests, system probing, code execution attempts
- Blocked inputs return refusal immediately, before LangGraph runs
Router Node:
- Single LLM call: classifies the query as vectorstore or websearch AND rewrites it for semantic search
- Explicit phrases ("search web", "check online") override to the web search path
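The explicit-phrase override can be sketched as a cheap check that runs before the LLM classifier. The phrase list and function name below are illustrative assumptions:

```python
# Hypothetical phrase list; the actual router's triggers may differ.
WEB_SEARCH_PHRASES = ("search web", "check online")

def route_query(question: str) -> str:
    """Return 'websearch' when an explicit phrase overrides the LLM classifier."""
    lowered = question.lower()
    if any(phrase in lowered for phrase in WEB_SEARCH_PHRASES):
        return "websearch"
    # Otherwise a single LLM call would both classify (vectorstore/websearch)
    # and rewrite the query; we default to vectorstore in this sketch.
    return "vectorstore"
```

Checking for override phrases first avoids an LLM call when the user has already stated the routing intent.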
Retrieve Node (Hybrid Search):
- Vector search: Qdrant cosine similarity (k=5)
- BM25 search: Keyword-based ranking using spaCy tokenization
- Fusion ranking: Weighted combination (60% vector, 40% BM25), scores normalized
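The fusion step above can be sketched as min-max normalization of each score set followed by a weighted sum. Score values and doc IDs are illustrative; the real system's normalization may differ:

```python
def normalize(scores: dict[str, float]) -> dict[str, float]:
    """Min-max normalize scores to [0, 1]; a constant score set maps to 1.0."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    span = hi - lo
    return {k: 1.0 if span == 0 else (v - lo) / span for k, v in scores.items()}

def hybrid_rank(vector_scores, bm25_scores, w_vec=0.6, w_bm25=0.4, k=5):
    """Weighted fusion: 60% vector similarity + 40% BM25, top-k results."""
    v, b = normalize(vector_scores), normalize(bm25_scores)
    fused = {
        doc_id: w_vec * v.get(doc_id, 0.0) + w_bm25 * b.get(doc_id, 0.0)
        for doc_id in set(v) | set(b)
    }
    return sorted(fused.items(), key=lambda kv: kv[1], reverse=True)[:k]

ranked = hybrid_rank(
    vector_scores={"a": 0.9, "b": 0.5, "c": 0.1},
    bm25_scores={"b": 12.0, "c": 3.0, "d": 1.0},
)
```

Normalizing before mixing matters because cosine similarities and raw BM25 scores live on different scales.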
Grade Documents Node:
- Batch LLM grading over the merged result set from both retrieval sources
- Binary relevance scoring (yes/no) per document
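A sketch of the filtering that follows batch grading, assuming the grader returns one free-text yes/no verdict per document (the parsing shown here is an assumption, not the project's actual grader prompt):

```python
def filter_graded(documents: list[str], verdicts: list[str]) -> list[str]:
    """Keep only documents whose grader verdict normalizes to 'yes'."""
    kept = []
    for doc, verdict in zip(documents, verdicts):
        # Tolerate variations like "Yes, relevant" or "yes."
        if verdict.strip().lower().startswith("yes"):
            kept.append(doc)
    return kept
```

Grading the merged set in one batch call keeps latency flat as the number of retrieved chunks grows.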
Generate Node:
- Synthesizes answer from graded documents, streams tokens via SSE
- Includes chat_history for session-aware multi-turn responses
Output check (post-streaming):
- After streaming completes, the full response is checked using NeMo's self_check_output prompt template via a direct LLM call
- NeMo's Colang output patterns are not executed (incompatible with the streaming architecture)
- If the LLM returns "yes" to the policy check, a correction event is sent to the client
POST /api/stream returns text/event-stream. Each event is a JSON object:
data: {"token": "partial text"} # during generation
data: {"done": true, "sources_count": 5, "session_id": "..."} # on completion
data: {"token": "...", "done": true, "correction": true} # if output flagged
data: {"error": "...", "done": true} # on error
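The event format above can be sketched as plain SSE framing: each event is a `data:` line carrying a JSON payload, terminated by a blank line. Function names and sample payloads here are illustrative:

```python
import json

def sse_event(payload: dict) -> str:
    """Format one SSE data event as sent on the text/event-stream response."""
    return f"data: {json.dumps(payload)}\n\n"

def parse_sse(stream: str) -> list[dict]:
    """Parse 'data:' lines back into event dicts (a minimal client-side reader)."""
    events = []
    for line in stream.splitlines():
        if line.startswith("data: "):
            events.append(json.loads(line[len("data: "):]))
    return events

stream = (
    sse_event({"token": "Hello"})
    + sse_event({"token": " world"})
    + sse_event({"done": True, "sources_count": 5, "session_id": "abc"})
)
events = parse_sse(stream)
answer = "".join(e["token"] for e in events if "token" in e)
```

A real client would read the stream incrementally (e.g. with `httpx` or the browser's `EventSource`) rather than buffering it, but the framing is the same.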
Session-based conversation memory via LangGraph MemorySaver checkpointer. Pass a consistent session_id across requests to maintain context. Each session stores chat_history injected into the generation prompt.
LangGraph AgentState (TypedDict) tracks:
- question: Rewritten query (updated by router)
- raw_documents: Merged result set from parallel retrieval branches (reducer: append)
- documents: Filtered document list after grading
- generation: Current answer
- web_search: Routing flag
- generation_attempts: Generation retry counter
- docs_retrieved_total: Total docs retrieved across all sources (reducer: sum)
- chat_history: Multi-turn conversation history
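A sketch of this state schema in LangGraph's TypedDict-plus-reducer style. `Annotated[..., operator.add]` is how LangGraph marks fields to merge when the parallel Retrieve/WebSearch branches fan back in; the exact field types below are assumptions from the list above:

```python
import operator
from typing import Annotated, TypedDict

class AgentState(TypedDict):
    question: str                                        # rewritten by the router
    raw_documents: Annotated[list, operator.add]         # appended across branches
    documents: list                                      # filtered after grading
    generation: str                                      # current answer
    web_search: bool                                     # routing flag
    generation_attempts: int                             # generation retry counter
    docs_retrieved_total: Annotated[int, operator.add]   # summed across sources
    chat_history: list                                   # multi-turn history
```

Without the `operator.add` reducers, the second parallel branch to finish would overwrite the first branch's documents instead of appending to them.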
Controlled via .env:
QDRANT_MODE=local # uses QDRANT_LOCAL_URL (Docker)
QDRANT_MODE=cloud # uses QDRANT_CLOUD_URL + QDRANT_API_KEY
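The mode switch above can be sketched as a small settings helper; the default local URL is an assumption (Qdrant's standard Docker port), and in the app you would pass `dict(os.environ)`:

```python
def qdrant_settings(env: dict) -> dict:
    """Pick the Qdrant endpoint from QDRANT_MODE, mirroring the .env switch."""
    mode = env.get("QDRANT_MODE", "local")
    if mode == "cloud":
        # Cloud mode requires both the URL and the API key to be set.
        return {"url": env["QDRANT_CLOUD_URL"], "api_key": env["QDRANT_API_KEY"]}
    # Local mode: Docker default, no API key.
    return {"url": env.get("QDRANT_LOCAL_URL", "http://localhost:6333"), "api_key": None}
```

Taking the environment as a parameter (rather than reading `os.environ` inside) keeps the function trivially testable.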
RAG metrics tracked per query and accessible via /api/evaluation/stats:
- Retrieval Precision: Ratio of relevant to total retrieved documents
- Latency: End-to-end query processing time
- Web Search Rate: Percentage of queries using external search
- Avg Docs Retrieved: Average number of chunks fetched per query
- Avg Docs Relevant: Average number of chunks passing the grader
- Avg Generation Attempts: Average LLM generation calls per query
All metrics are displayed in the UI.
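The aggregates above could be computed from per-query records along these lines; the record field names are illustrative assumptions, not the actual storage schema:

```python
def aggregate_stats(records: list[dict]) -> dict:
    """Aggregate per-query records into the /api/evaluation/stats metrics."""
    n = len(records)
    if n == 0:
        return {}
    retrieved = sum(r["docs_retrieved"] for r in records)
    relevant = sum(r["docs_relevant"] for r in records)
    return {
        "retrieval_precision": relevant / retrieved if retrieved else 0.0,
        "avg_latency_s": sum(r["latency_s"] for r in records) / n,
        "web_search_rate": sum(1 for r in records if r["web_search"]) / n,
        "avg_docs_retrieved": retrieved / n,
        "avg_docs_relevant": relevant / n,
        "avg_generation_attempts": sum(r["generation_attempts"] for r in records) / n,
    }
```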
POST /api/stream
- Query documents with full RAG pipeline, response streamed via SSE
- Request: {question, session_id?}
- Returns text/event-stream with token events
POST /api/upload
- Upload documents (PDF, DOCX, TXT)
- Response: {document_id, filename, chunks_created, file_size}
GET /api/evaluation/stats
- Aggregated evaluation metrics
LangGraph, LangChain, NeMo Guardrails, Qdrant, FastAPI, OpenAI, PyMuPDF, BM25 + spaCy, Streamlit, Docker

