GraphRAG — Graph-Enhanced Retrieval-Augmented Generation

A RAG pipeline that combines semantic chunking, a FAISS vector index with a persisted knowledge graph, cross-encoder reranking, and Dijkstra-based graph traversal for explainable question answering over PDFs.

TL;DR

What: End-to-end system that ingests PDFs, builds a persisted vector index + knowledge graph, and answers natural-language queries using graph-guided retrieval.

Why it matters: Improves factual grounding and explainability vs. flat RAG by discovering conceptually-connected context through graph traversal, not just vector similarity.

Key results: Persisted indexes load from disk in under a second (0.56 s for FAISS and 0.12 s for the graph in the example run), traversal paths of 5-8 nodes, and a graph visualization that shows the path the traversal took.

Highlights

Production-ready architecture: FAISS vector store + NetworkX knowledge graph persisted to disk (no rebuild on subsequent runs)
5-stage retrieval pipeline: Query rewrite → FAISS retrieval → cross-encoder reranking → Dijkstra graph expansion → LLM generation
Explainability: Graph traversal visualization (graph_traversal.png) with concept-labeled nodes and weighted edges
Tech stack: Groq LLM (Llama 3.3 70B), HuggingFace embeddings (all-MiniLM-L6-v2), FAISS, spaCy NER, CrossEncoder reranker, NetworkX, matplotlib
Skills demonstrated: Retrieval engineering, embedding models, reranking, graph algorithms (Dijkstra), LLM prompting, persistence strategies, visualization

Quick Start

Installation

# Clone and setup
git clone https://github.com/amruth6002/GraphRAG.git
cd GraphRAG
python -m venv newenv
source newenv/bin/activate  # On Windows: newenv\Scripts\activate

# Install dependencies
pip install langchain langchain-groq langchain-huggingface langchain-community langchain-experimental
pip install faiss-cpu sentence-transformers networkx scikit-learn spacy nltk matplotlib tqdm pydantic python-dotenv pypdf

# Download spaCy model
python -m spacy download en_core_web_sm

# Setup environment
echo 'GROQ_API_KEY="your-groq-api-key-here"' > .env

Usage

# First run: builds indexes + answers query (takes 2-3 min)
python src/ingestion.py --path data/Understanding_Climate_Change.pdf \
    --query "what is the main cause of climate change?"

# Subsequent runs: loads from disk (< 10 seconds)
python src/ingestion.py --query "what are the effects of deforestation?"

# Force rebuild
python src/ingestion.py --rebuild --path data/Understanding_Climate_Change.pdf \
    --query "explain the greenhouse effect"

Example Output

INGESTION PHASE
Loading from database
FAISS time: 0.56 seconds
Loaded graph: 67 nodes, 234 edges
Graph time: 0.12s

[1] Query Rewriting
changed query: What are the primary anthropogenic factors...
query rewriting time: 0.30s

[2] vector retrieval (top 10 from FAISS)
Retrieved 10 chunks
vector retrieval time: 0.08s

[3] cross-encoder reranking (top 5)
reranking time: 0.45s

[4] graph expansion (dijkstra traversal)
  Graph traversal:
    Step 1 - Node 0 [SEED]: Understanding Climate Change...
    Step 2 - Node 5 [NEIGHBOUR]: Rising Temperatures...
    Step 3 - Node 12 [NEIGHBOUR]: National Policies...
  Traversal complete: 8 nodes visited

Graph visualization saved to graph_traversal.png

[6] answer generation
Answer: The primary anthropogenic factors contributing to global climate change...

Visualization showing concept-guided traversal through the knowledge graph

Architecture

Ingestion Phase (Offline, runs once)

PDF Document
    │
    ├──→ SemanticChunker (percentile-based breakpoints)
    │       └──→ FAISS Vector Store ──→ persist to indexes/faiss/
    │
    └──→ KnowledgeGraph Builder
            ├── spaCy NER (extract entities)
            ├── LLM concept extraction
            ├── Cosine similarity + shared concepts
            └──→ NetworkX graph ──→ persist to indexes/knowledge_graph.pkl
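
The edge-weighting step in the diagram can be sketched as follows. This is a minimal, standard-library illustration and not the code in src/util.py: the function names are hypothetical, and normalizing the shared-concept count by the smaller of the two concept sets is an assumption (the README only gives the 0.7 / 0.3 mix).

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def edge_weight(emb_a, emb_b, concepts_a, concepts_b, alpha=0.7, beta=0.3):
    """Blend embedding similarity with concept overlap: alpha * cos + beta * shared.

    ASSUMPTION: the shared-concept term is the overlap count divided by the
    size of the smaller concept set, so it stays in [0, 1].
    """
    set_a, set_b = set(concepts_a), set(concepts_b)
    smaller = min(len(set_a), len(set_b))
    shared_score = len(set_a & set_b) / smaller if smaller else 0.0
    return alpha * cosine_similarity(emb_a, emb_b) + beta * shared_score
```

With this normalization the weight stays between 0 and 1 for non-negative similarities, which keeps it usable as a connection strength during traversal.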

Query Phase (Online, per request)

User Query
    │
    ├─[1]─→ LLM Query Rewrite (more specific, detailed)
    │
    ├─[2]─→ FAISS Retrieval (top-k by cosine similarity)
    │
    ├─[3]─→ CrossEncoder Reranking (ms-marco-MiniLM)
    │           └──→ Top-5 most relevant chunks
    │
    ├─[4]─→ Graph Expansion (Dijkstra traversal)
    │           ├── Map reranked chunks → graph nodes (seeds)
    │           ├── Priority queue by edge weights
    │           ├── Visit neighbors with new concepts
    │           └──→ Ordered context (8-10 chunks)
    │
    └─[5]─→ LLM Answer Generation (single call)
                └──→ Final answer + graph visualization
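
Steps [2] and [3] follow a retrieve-then-rerank pattern: a cheap scorer narrows the corpus, then a more expensive scorer reorders the survivors. The sketch below shows only that control flow. The two toy scorers are hypothetical stand-ins for FAISS similarity search and the cross-encoder, and `two_stage_retrieve` is an illustrative name, not a function from this repository.

```python
def word_overlap(query, chunk):
    """Cheap stand-in for vector similarity: number of shared words."""
    return len(set(query.lower().split()) & set(chunk.lower().split()))


def overlap_ratio(query, chunk):
    """Pricier stand-in for a cross-encoder: shared words per chunk word."""
    words = chunk.lower().split()
    return word_overlap(query, chunk) / len(words) if words else 0.0


def two_stage_retrieve(query, chunks, cheap_score, precise_score,
                       n_retrieved=10, top_k=5):
    """Keep the n_retrieved best chunks by cheap_score, then rerank them."""
    candidates = sorted(chunks, key=lambda c: cheap_score(query, c),
                        reverse=True)[:n_retrieved]
    reranked = sorted(candidates, key=lambda c: precise_score(query, c),
                      reverse=True)
    return reranked[:top_k]


chunks = [
    "burning fossil fuels causes co2 emissions",
    "co2 emissions",
    "forests absorb co2",
    "policy responses",
]
seeds = two_stage_retrieve("what causes co2 emissions", chunks,
                           word_overlap, overlap_ratio,
                           n_retrieved=3, top_k=2)
```

The defaults mirror the numbers used in this README (10 retrieved, top 5 kept); the chunks that survive reranking are the ones that become seed nodes in step [4].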

What Makes This Different from Basic RAG

Feature	Basic RAG	This Project
Chunking	Fixed-size (RecursiveCharacterTextSplitter)	Semantic chunking (splits at meaning boundaries)
Retrieval	Vector similarity only	Vector + knowledge graph traversal
Reranking	None or simple scoring	Cross-encoder reranking (ms-marco-MiniLM)
Query	Raw user query	LLM-rewritten query before retrieval
Persistence	Rebuilt every run	FAISS + graph persisted to disk
Context Assembly	Flat list of top-k chunks	Dijkstra traversal discovers conceptually connected chunks
Explainability	Black box	Graph visualization shows traversal path

How Graph Expansion Works

Traditional RAG stops after retrieving top-k similar chunks. GraphRAG goes further:

1. Build the knowledge graph at ingestion time (not per query):
   - Each chunk = node
   - Edges weighted by: 0.7 × cosine_similarity + 0.3 × shared_concepts
   - Example: "fossil fuels" chunk connects to "CO2 emissions" chunk (high similarity + shared concept: "carbon")
2. At query time, start from reranked seeds:
   - FAISS retrieves 10 chunks → cross-encoder reranks → top 5 become "seed nodes"
3. Dijkstra-like traversal:
   - Priority queue: visit strongest connections first
   - For each node: check if concepts are new → expand to neighbors
   - Stop when max nodes reached (default: 8)
4. Result: context that vector search alone would miss:
   - Example: Query "causes of climate change" retrieves "fossil fuels" (seed)
   - Graph traversal discovers "deforestation" (neighbor via shared concept: "CO2")
   - Final context = fossil fuels + deforestation + methane emissions + policy responses
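
The traversal described above can be sketched with a priority queue from the standard library. This is an illustration under stated assumptions, not the implementation in src/util.py: the graph is a plain dict instead of a NetworkX graph, `expand_context` is a hypothetical name, path strength is taken as the product of edge weights along the path, and a node that contributes no new concepts is kept in the context but not expanded.

```python
import heapq


def expand_context(graph, concepts, seeds, max_nodes=8):
    """Visit nodes strongest-connection-first, starting from the seed nodes.

    graph:    {node: {neighbor: weight}} with weights in (0, 1]
    concepts: {node: set of concept strings}
    seeds:    reranked seed nodes, best first
    """
    # heapq is a min-heap, so strengths are stored negated.
    # The integer counter breaks ties and keeps node labels out of comparisons.
    queue = [(-1.0, rank, seed) for rank, seed in enumerate(seeds)]
    heapq.heapify(queue)
    counter = len(seeds)

    visited, seen_nodes, seen_concepts = [], set(), set()
    while queue and len(visited) < max_nodes:
        neg_strength, _, node = heapq.heappop(queue)
        if node in seen_nodes:
            continue
        seen_nodes.add(node)
        visited.append(node)

        new_concepts = concepts.get(node, set()) - seen_concepts
        if not new_concepts:
            continue  # ASSUMPTION: nothing new here, so do not expand further
        seen_concepts |= new_concepts

        for neighbor, weight in graph.get(node, {}).items():
            if neighbor not in seen_nodes:
                # ASSUMPTION: path strength = product of edge weights so far
                heapq.heappush(queue, (neg_strength * weight, counter, neighbor))
                counter += 1
    return visited


graph = {
    "fossil fuels": {"co2 emissions": 0.9, "policy": 0.4},
    "co2 emissions": {"fossil fuels": 0.9, "deforestation": 0.8},
    "deforestation": {"co2 emissions": 0.8},
    "policy": {"fossil fuels": 0.4},
}
concepts = {
    "fossil fuels": {"carbon", "fuel"},
    "co2 emissions": {"carbon", "co2"},
    "deforestation": {"co2", "forest"},
    "policy": {"policy"},
}
path = expand_context(graph, concepts, seeds=["fossil fuels"])
```

In this toy graph the single seed "fossil fuels" pulls in "deforestation" through "co2 emissions", even though the seed and "deforestation" share no direct edge.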

The visualization shows this path: a green node (start), yellow nodes (traversal path), and a red node (end), with edge thickness indicating connection strength.

Project Structure

GraphRAG/
├── src/
│   ├── ingestion.py          # GraphRAG class: __init__ (ingestion) + run() (query)
│   ├── util.py                # FAISS, KnowledgeGraph, traversal, visualization
│   └── __init__.py
├── data/
│   └── Understanding_Climate_Change.pdf  # Example PDF
├── indexes/                   # Auto-generated (git-ignored)
│   ├── faiss/
│   │   ├── index.faiss
│   │   └── splits.pkl
│   └── knowledge_graph.pkl
├── graph_traversal.png        # Auto-generated visualization
├── .env                       # API keys (git-ignored)
├── .gitignore
└── readme.md

Key Files

File	Purpose	Lines of Code
src/ingestion.py	Pipeline orchestration, CLI, GraphRAG class	~180
src/util.py	FAISS encoding, KnowledgeGraph, Dijkstra traversal, visualization	~330

CLI Arguments

python src/ingestion.py [OPTIONS]

Argument	Type	Default	Description
`--path`	str	`""`	Path to PDF file (required for first run)
`--query`	str	`"what is the main cause of climate change?"`	Question to answer
`--n_retrieved`	int	`10`	Number of chunks to retrieve from FAISS
`--chunk_size`	int	`1000`	Chunk size for semantic splitting
`--chunk_overlap`	int	`200`	Overlap between chunks
`--rebuild`	flag	`false`	Force rebuild persisted indexes
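
For reference, the table above maps onto an `argparse` declaration along these lines. The flag names, types, and defaults come from the table; the `build_parser` helper and the help strings are illustrative, and the actual parser in src/ingestion.py may be organized differently.

```python
import argparse


def build_parser():
    """Declare the CLI flags listed in the table above (illustrative sketch)."""
    parser = argparse.ArgumentParser(
        description="GraphRAG: graph-enhanced question answering over a PDF"
    )
    parser.add_argument("--path", type=str, default="",
                        help="Path to PDF file (required for first run)")
    parser.add_argument("--query", type=str,
                        default="what is the main cause of climate change?",
                        help="Question to answer")
    parser.add_argument("--n_retrieved", type=int, default=10,
                        help="Number of chunks to retrieve from FAISS")
    parser.add_argument("--chunk_size", type=int, default=1000,
                        help="Chunk size for semantic splitting")
    parser.add_argument("--chunk_overlap", type=int, default=200,
                        help="Overlap between chunks")
    parser.add_argument("--rebuild", action="store_true",
                        help="Force rebuild of persisted indexes")
    return parser


# Parse an explicit argument list instead of sys.argv for demonstration.
args = build_parser().parse_args(["--query", "explain the greenhouse effect",
                                  "--rebuild"])
```

Because `--rebuild` is declared with `action="store_true"`, it takes no value: passing the flag sets it to true, omitting it leaves the default of false.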

Tech Stack

Component	Technology	Why
LLM	Llama 3.3 70B via Groq	Fast inference, free tier, 128k context
Embeddings	all-MiniLM-L6-v2 (HuggingFace)	Local, 384-dim, fast, good quality
Reranker	ms-marco-MiniLM-L-6-v2 (CrossEncoder)	Local, trained on MS MARCO for passage reranking
Vector Store	FAISS	Fast similarity search, easy persistence
Knowledge Graph	NetworkX	Flexible graph operations, Dijkstra built-in
NER	spaCy en_core_web_sm	Fast entity extraction
Chunking	LangChain SemanticChunker	Percentile-based semantic boundaries
Framework	LangChain	Prompt management, retriever abstraction

Industry Practices Followed

Offline ingestion / Online querying — Indexes built once, loaded from disk for queries (no rebuild delay)
Persisted vector store + knowledge graph — FAISS save_local() / load_local(), graph pickled
Query rewriting before retrieval — Improved query drives FAISS search, not just final prompt
Two-stage retrieval — Fast approximate (FAISS) → precise reranking (cross-encoder)
Graph-expanded context — Dijkstra traversal discovers conceptually connected chunks beyond vector similarity
Single LLM call for generation — Predictable latency, no iterative answer-checking loops
Rate-limit protection — Sequential concept extraction with time.sleep(1) for Groq free tier
Non-interactive visualization — matplotlib Agg backend saves PNG without blocking
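
The "build once, load from disk afterwards" practice for the pickled graph can be sketched as a small load-or-build helper. This is a generic illustration using only the standard library: `load_or_build` and `build_fn` are hypothetical names, and the FAISS index itself would go through save_local() / load_local() as noted above, not through pickle.

```python
import pickle
from pathlib import Path


def load_or_build(cache_path, build_fn, rebuild=False):
    """Return the cached object if present, else build it and cache it.

    cache_path: where the pickled object lives, e.g. indexes/knowledge_graph.pkl
    build_fn:   zero-argument callable that performs the expensive build
    rebuild:    ignore any existing cache (the role of the --rebuild flag)
    """
    cache_path = Path(cache_path)
    if cache_path.exists() and not rebuild:
        with cache_path.open("rb") as fh:
            return pickle.load(fh)

    obj = build_fn()
    cache_path.parent.mkdir(parents=True, exist_ok=True)
    with cache_path.open("wb") as fh:
        pickle.dump(obj, fh)
    return obj
```

Unpickling executes code embedded in the file, so a cache like this should only be loaded from a directory you control.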

Example Queries & Results

Query 1: "What is the main cause of climate change?"

Traversal path: 5 nodes
Concepts discovered: fossil fuels → CO2 emissions → greenhouse effect → temperature rise → policy responses

Answer excerpt:

"The primary anthropogenic factors contributing to global climate change are the burning of fossil fuels and deforestation, which release greenhouse gases..."

Query 2: "What are the effects of deforestation?"

Traversal path: 7 nodes
Concepts discovered: deforestation → carbon sinks → biodiversity loss → soil erosion → climate feedback loops

Answer excerpt:

"Deforestation contributes to climate change by reducing carbon sequestration capacity and increasing atmospheric CO2..."

Future Enhancements

Metrics dashboard: Add retrieval evaluation (MRR, NDCG) with ground-truth Q&A pairs
Multi-document support: Extend to handle multiple PDFs, track provenance per chunk
Community detection: Cluster graph nodes (Louvain algorithm) and pre-generate summaries (Microsoft GraphRAG approach)
Hybrid search: Add BM25 for keyword-based retrieval alongside vector search
Streaming responses: Use LLM streaming for real-time answer generation
Web UI: Flask/Streamlit interface with interactive graph visualization (D3.js)

License

MIT License - see LICENSE for details

Contact

Amruth - @amruth6002

Project Link: https://github.com/amruth6002/GraphRAG

Acknowledgments

Inspired by Microsoft's GraphRAG and NirDiamant's RAG Techniques
Built with LangChain, FAISS, and NetworkX
LLM powered by Groq

If this project helped you, please star it! It helps others discover this work.
