mdombrov-33/doc-research-agent

Document Research Agent

RAG system with LangGraph state machine, hybrid search, SSE streaming, and NeMo Guardrails security layer.

Frontend: Streamlit | Backend: GCP Cloud Run | Vector DB: Qdrant Cloud

Screenshots

Document Research Agent UI

Qdrant vector store dashboard

Architecture

System Overview

┌─────────────┐
│   FastAPI   │  REST API (upload, stream endpoints)
└──────┬──────┘
       │
       ▼
┌──────────────────┐
│ NeMo Guardrails  │  Input check (LLM-based: jailbreak, prompt injection)
└──────┬───────────┘
       │ safe
       ▼
┌──────────────────┐
│  LangGraph Agent │  State machine → streams tokens via SSE
└──────┬───────────┘
       │ full response
       ▼
┌──────────────────┐
│ NeMo Guardrails  │  Output check (LLM-based: harmful content, policy)
└──────────────────┘

LangGraph Agent State Machine

                    ┌─────────┐
                    │  START  │
                    └────┬────┘
                         │
                         ▼
                   ┌───────────┐
                   │  Router   │ (Classify + rewrite query in one LLM call)
                   └─────┬─────┘
                         │
              ┌──────────┴──────────┐
              │                     │
           always            web_search=true
              │                     │
              ▼                     ▼
        ┌──────────┐          ┌──────────┐
        │ Retrieve │          │WebSearch │
        │(Hybrid)  │          │          │
        └────┬─────┘          └────┬─────┘
             │                     │
             └──────────┬──────────┘
                        │  (parallel fan-in)
                        ▼
                 ┌─────────────┐
                 │ Grade Docs  │ (Batch LLM: relevant?)
                 └──────┬──────┘
                        │
                        ▼
                   ┌──────────┐
                   │ Generate │ (Stream answer via SSE)
                   └────┬─────┘
                        │
                        ▼
                      ┌─────┐
                      │ END │
                      └─────┘

Node Descriptions:

  • Router: LLM classifies query type and rewrites it for semantic search. If web search is needed, both branches run in parallel.
  • Retrieve: Always runs. Hybrid search (60% vector similarity + 40% BM25 keyword), top-5 results.
  • WebSearch: Runs in parallel with Retrieve when router flags web_search=true.
  • Grade Docs: Batch LLM grading over the merged result set from both sources.
  • Generate: Synthesize answer from graded documents, stream tokens via SSE.
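The fan-out/fan-in above can be sketched in plain Python. Node bodies are stubs and the branches run sequentially here; the real agent wires these as LangGraph nodes with true parallel execution:

```python
# Minimal stdlib sketch of the agent's control flow (node bodies stubbed).

def router(state: dict) -> dict:
    # Classify + rewrite in one step (stubbed: only the override check).
    state["web_search"] = "search web" in state["question"].lower()
    return state

def retrieve(state: dict) -> list[str]:
    return ["doc-from-vectorstore"]          # hybrid search stub

def web_search(state: dict) -> list[str]:
    return ["doc-from-web"]                  # web search stub

def grade_docs(docs: list[str]) -> list[str]:
    return [d for d in docs if d]            # batch relevance grading stub

def generate(state: dict, docs: list[str]) -> str:
    return f"answer from {len(docs)} docs"   # streaming generation stub

def run(question: str) -> dict:
    state = {"question": question}
    state = router(state)
    docs = retrieve(state)                   # Retrieve always runs
    if state["web_search"]:                  # WebSearch only when flagged
        docs += web_search(state)            # fan-in merge (parallel in LangGraph)
    state["documents"] = grade_docs(docs)
    state["generation"] = generate(state, state["documents"])
    return state
```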

How It Works

1. Document Upload & Processing

Documents (PDF, DOCX, TXT) are chunked with overlap, embedded using text-embedding-3-small (1536 dimensions), and stored in Qdrant with metadata (filename, page numbers, chunk index).
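Chunking with overlap can be sketched as a sliding window; the chunk size and overlap below are illustrative, since the project's actual values are not stated in this README:

```python
# Sliding-window chunking with overlap (illustrative parameters).
# Consecutive chunks share `overlap` characters so context is not
# cut mid-sentence at chunk boundaries.
def chunk_text(text: str, size: int = 500, overlap: int = 100) -> list[str]:
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]
```

Each resulting chunk would then be embedded and stored in Qdrant alongside its metadata.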

2. Query Flow

Security (NeMo input check):

  • LLM-based input rail using self_check_input prompt
  • Colang flows catch: prompt injection, jailbreaks, off-topic requests, system probing, code execution attempts
  • Blocked inputs return a refusal immediately, before the LangGraph agent runs

Router Node:

  • Single LLM call: classifies the query as vectorstore or websearch AND rewrites it for semantic search
  • Explicit phrases ("search web", "check online") override to web search path
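The explicit-phrase override can be sketched as a cheap string check applied on top of the LLM's classification; the phrase list here is illustrative, not the project's actual list:

```python
# Explicit phrases force the web-search path regardless of the
# LLM classification (phrase list is illustrative).
WEB_OVERRIDES = ("search web", "check online")

def route(question: str, llm_label: str) -> str:
    q = question.lower()
    if any(phrase in q for phrase in WEB_OVERRIDES):
        return "websearch"
    return llm_label  # "vectorstore" or "websearch" from the single LLM call
```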

Retrieve Node (Hybrid Search):

  • Vector search: Qdrant cosine similarity (k=5)
  • BM25 search: Keyword-based ranking using spaCy tokenization
  • Fusion ranking: Weighted combination (60% vector, 40% BM25), scores normalized
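The fusion step can be sketched as follows; the 60/40 weights come from the README, while min-max normalization is an assumption (the README says only that scores are normalized):

```python
# Weighted score fusion: min-max normalize each ranker's scores
# (assumed method), then combine 60% vector + 40% BM25.
def fuse(vector_scores: dict, bm25_scores: dict,
         w_vec: float = 0.6, w_bm25: float = 0.4) -> dict:
    def norm(scores: dict) -> dict:
        if not scores:
            return {}
        lo, hi = min(scores.values()), max(scores.values())
        span = (hi - lo) or 1.0  # avoid division by zero on uniform scores
        return {doc: (s - lo) / span for doc, s in scores.items()}

    v, b = norm(vector_scores), norm(bm25_scores)
    return {doc: w_vec * v.get(doc, 0.0) + w_bm25 * b.get(doc, 0.0)
            for doc in set(v) | set(b)}
```

Documents found by only one ranker simply contribute a zero score from the other, so hybrid hits rank above single-source hits with comparable scores.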

Grade Documents Node:

  • Batch LLM grading over the merged result set from both retrieval sources
  • Binary relevance scoring (yes/no) per document

Generate Node:

  • Synthesizes answer from graded documents, streams tokens via SSE
  • Includes chat_history for session-aware multi-turn responses

Output check (post-streaming):

  • After streaming completes, the full response is checked using NeMo's self_check_output prompt template via a direct LLM call
  • NeMo's Colang output patterns are not executed (incompatible with the streaming architecture)
  • If the LLM answers "yes" to the policy check, a correction event is sent to the client

3. Streaming (SSE)

POST /api/stream returns text/event-stream. Each event is a JSON object:

data: {"token": "partial text"}        # during generation
data: {"done": true, "sources_count": 5, "session_id": "..."}  # on completion
data: {"token": "...", "done": true, "correction": true}       # if output flagged
data: {"error": "...", "done": true}   # on error
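A minimal client-side sketch of consuming this stream: parse each `data:` line as JSON and accumulate tokens until a `done` event arrives (a real client would read the text/event-stream response incrementally over HTTP):

```python
import json

# Parse SSE lines of the form 'data: {...}' into event dicts.
def parse_sse(lines):
    for line in lines:
        if line.startswith("data: "):
            yield json.loads(line[len("data: "):])

# Accumulate token events into the full answer, stopping on done.
def collect_answer(lines) -> str:
    tokens = []
    for event in parse_sse(lines):
        if "token" in event:
            tokens.append(event["token"])
        if event.get("done"):
            break
    return "".join(tokens)
```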

4. Memory

Session-based conversation memory via LangGraph MemorySaver checkpointer. Pass a consistent session_id across requests to maintain context. Each session stores chat_history injected into the generation prompt.

5. State Management

LangGraph AgentState (TypedDict) tracks:

  • question: Rewritten query (updated by router)
  • raw_documents: Merged result set from parallel retrieval branches (reducer: append)
  • documents: Filtered document list after grading
  • generation: Current answer
  • web_search: Routing flag
  • generation_attempts: Generation retry counter
  • docs_retrieved_total: Total docs retrieved across all sources (reducer: sum)
  • chat_history: Multi-turn conversation history
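The state above can be written down with stdlib typing. In LangGraph, `Annotated` reducer metadata (`operator.add` below) tells the framework how to merge updates arriving from the parallel branches; fields without a reducer are simply overwritten:

```python
import operator
from typing import Annotated, TypedDict

class AgentState(TypedDict):
    question: str                                       # rewritten by router
    raw_documents: Annotated[list, operator.add]        # reducer: append (fan-in merge)
    documents: list                                     # filtered after grading
    generation: str                                     # current answer
    web_search: bool                                    # routing flag
    generation_attempts: int                            # retry counter
    docs_retrieved_total: Annotated[int, operator.add]  # reducer: sum
    chat_history: list                                  # multi-turn history
```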

6. Qdrant Modes

Controlled via .env:

QDRANT_MODE=local   # uses QDRANT_LOCAL_URL (Docker)
QDRANT_MODE=cloud   # uses QDRANT_CLOUD_URL + QDRANT_API_KEY
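The mode switch can be sketched as a small selector over the environment; the variable names come from the README, but the selection logic and the localhost default are illustrative assumptions:

```python
# Pick Qdrant connection settings from environment variables
# (env var names from the README; fallback URL is an assumption).
def qdrant_target(env: dict) -> dict:
    if env.get("QDRANT_MODE", "local") == "cloud":
        return {"url": env["QDRANT_CLOUD_URL"], "api_key": env["QDRANT_API_KEY"]}
    return {"url": env.get("QDRANT_LOCAL_URL", "http://localhost:6333")}
```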

7. Evaluation & Monitoring

RAG metrics tracked per query and accessible via /api/evaluation/stats:

  • Retrieval Precision: Ratio of relevant to total retrieved documents
  • Latency: End-to-end query processing time
  • Web Search Rate: Percentage of queries using external search
  • Avg Docs Retrieved: Average number of chunks fetched per query
  • Avg Docs Relevant: Average number of chunks passing the grader
  • Avg Generation Attempts: Average LLM generation calls per query

All metrics are displayed in the UI.
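Retrieval precision as defined above reduces to relevant-over-retrieved per query; averaging the per-query ratios is an illustrative assumption about how the stats endpoint aggregates:

```python
# Retrieval precision: relevant / retrieved per query, averaged
# across queries (aggregation method is an assumption).
def retrieval_precision(per_query: list[tuple[int, int]]) -> float:
    # per_query holds (docs_relevant, docs_retrieved) pairs.
    ratios = [rel / ret for rel, ret in per_query if ret]
    return sum(ratios) / len(ratios) if ratios else 0.0
```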

API Endpoints

POST /api/stream

  • Query documents with full RAG pipeline, response streamed via SSE
  • Request: {question, session_id?}
  • Returns text/event-stream with token events

POST /api/upload

  • Upload documents (PDF, DOCX, TXT)
  • Response: {document_id, filename, chunks_created, file_size}

GET /api/evaluation/stats

  • Aggregated evaluation metrics

Tech Stack

LangGraph, LangChain, NeMo Guardrails, Qdrant, FastAPI, OpenAI, PyMuPDF, BM25 + spaCy, Streamlit, Docker
