GraphRAG for Large Engineering PDFs

Production-grade GraphRAG pipeline for building, improving, evaluating, and documenting knowledge graphs over large engineering and technical PDFs.

This repository implements a GraphRAG system that combines deterministic processing with LLM assistance. It covers ontology induction, community-aware retrieval, evaluation, and automated documentation generation.

What This Project Does

Extracts entities and relations from engineering PDFs
Builds a materialized knowledge graph (NetworkX)
Learns ontology types and type hierarchies from data
Detects semantic communities in the graph
Prevents semantic drift using community-aware retrieval
Supports hybrid retrieval (Graph + Vector / FAISS)
Runs targeted LLM improve cycles to fill factual gaps
Evaluates multiple RAG strategies
Generates DOCX and PDF technical documentation automatically
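
As an illustration of the hybrid retrieval idea listed above, the following is a minimal sketch that mixes a vector-similarity score with a graph-proximity score. It is not the repository's implementation: the function names, the `alpha` weight, and the toy data are assumptions, and plain Python cosine similarity and a dict of neighbor sets stand in for FAISS and NetworkX.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def hybrid_rank(query_vec, seed, embeddings, adjacency, alpha=0.5):
    """Rank nodes by a weighted mix of vector similarity and graph proximity.

    `alpha` (hypothetical parameter) weights the vector score; the graph
    score is 1.0 for the seed node and its direct neighbors, else 0.0.
    """
    neighbors = adjacency.get(seed, set())
    scores = {}
    for node, vec in embeddings.items():
        vector_score = cosine(query_vec, vec)
        graph_score = 1.0 if node == seed or node in neighbors else 0.0
        scores[node] = alpha * vector_score + (1 - alpha) * graph_score
    return sorted(scores, key=scores.get, reverse=True)


# Toy data: "invoice" is close in vector space but not connected to "pump".
embeddings = {
    "pump": [1.0, 0.0],
    "valve": [0.6, 0.8],
    "invoice": [0.9, 0.1],
}
adjacency = {"pump": {"valve"}}

ranking = hybrid_rank([1.0, 0.0], "pump", embeddings, adjacency)
print(ranking)
```

In this toy case the graph signal lifts "valve" above "invoice" even though "invoice" has the higher vector similarity, which is the kind of behavior a hybrid strategy is meant to provide.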

Core Pipeline (Implemented)

PDF ingestion and chunking
Entity & relation extraction
Graph construction (graph.pkl)
Ontology induction (type_registry.json, type_hierarchy.json)
Community detection (communities.json)
Hybrid retrieval (Graph + FAISS)
Improve cycle (LLM-assisted, checkpointed)
Strategy evaluation
Automated documentation generation
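
To make the first stage above more concrete, here is a small sketch of fixed-size chunking with overlap. The README does not state how the project chunks text, so the function name, the character-based windowing, and the `size` and `overlap` parameters are all assumptions chosen for illustration.

```python
def chunk_text(text, size=200, overlap=50):
    """Split text into windows of `size` characters.

    Consecutive windows share `overlap` characters so that a sentence cut
    at a boundary still appears whole in at least one neighboring chunk.
    Both parameters are hypothetical defaults, not values from the project.
    """
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("require size > 0 and 0 <= overlap < size")
    step = size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + size])
        if start + size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks


chunks = chunk_text("abcdefghij", size=4, overlap=2)
print(chunks)
```

A real pipeline would more likely split on page, section, or sentence boundaries extracted from the PDF; the overlap idea carries over unchanged.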

Key Artifacts

Artifact	Description
`index/graph.pkl`	Materialized knowledge graph
`index/type_registry.json`	Learned ontology classes
`index/type_hierarchy.json`	Inferred type hierarchy
`index/communities.json`	Graph communities
`index/faiss.index`	FAISS vector index
`index/strategy_comparison.json`	Strategy evaluation results
`index/missed_knowledge.json`	Knowledge gaps found by the improve cycle
`documentation/GraphRAG_Documentation.docx`	Auto-generated documentation (DOCX)
`documentation/GraphRAG_Documentation.pdf`	Auto-generated documentation (PDF, via LibreOffice)
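
A quick way to see which of the artifacts listed above exist after a run is sketched below. The paths come from the table; the helper name `missing_artifacts` and the idea of checking relative to a root directory are assumptions, not part of the repository.

```python
import tempfile
from pathlib import Path

# Relative paths taken from the Key Artifacts table.
EXPECTED_ARTIFACTS = [
    "index/graph.pkl",
    "index/type_registry.json",
    "index/type_hierarchy.json",
    "index/communities.json",
    "index/faiss.index",
    "index/strategy_comparison.json",
    "index/missed_knowledge.json",
    "documentation/GraphRAG_Documentation.docx",
    "documentation/GraphRAG_Documentation.pdf",
]


def missing_artifacts(root="."):
    """Return the expected artifact paths that are not files under `root`."""
    root = Path(root)
    return [rel for rel in EXPECTED_ARTIFACTS if not (root / rel).is_file()]


# Demonstration in a throwaway directory that contains only graph.pkl.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "index").mkdir()
    (Path(tmp) / "index" / "graph.pkl").write_bytes(b"")
    missing = missing_artifacts(tmp)

print(missing)
```

Running `missing_artifacts()` from the repository root after a full pipeline run should return an empty list if every stage produced its output.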

Documentation

Technical documentation is generated automatically using:

python scripts/generate_documentation.py

Outputs:

DOCX (always)
PDF (via LibreOffice headless)
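
For reference, a headless LibreOffice conversion from DOCX to PDF can be sketched as follows. This is not the content of `scripts/generate_documentation.py`; the helper names are hypothetical, and the sketch assumes the LibreOffice binary is available on the PATH as `soffice`.

```python
import shutil
import subprocess
from pathlib import Path


def build_convert_command(docx_path, out_dir, binary="soffice"):
    """Build the LibreOffice command line for a headless DOCX-to-PDF conversion."""
    return [
        binary,
        "--headless",
        "--convert-to", "pdf",
        "--outdir", str(out_dir),
        str(docx_path),
    ]


def convert_to_pdf(docx_path, out_dir, binary="soffice"):
    """Run the conversion; return the PDF path, or None if LibreOffice is absent."""
    if shutil.which(binary) is None:
        return None  # skip PDF output when LibreOffice is not installed
    subprocess.run(build_convert_command(docx_path, out_dir, binary), check=True)
    return Path(out_dir) / (Path(docx_path).stem + ".pdf")


cmd = build_convert_command(
    "documentation/GraphRAG_Documentation.docx", "documentation"
)
print(" ".join(cmd))
```

Returning `None` when the binary is missing is one way to keep the DOCX output unconditional while treating the PDF as optional.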

Design Philosophy

Deterministic structure before LLMs
LLMs used only for factual completion
Community boundaries prevent semantic drift
Explainability and reproducibility over recall
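
The community-boundary principle above can be illustrated with a small filter that keeps only retrieval candidates from the same community as the query's anchor node. The node-to-community mapping used here is an assumed shape (the actual format of `index/communities.json` is not described in this README), and the function name is hypothetical.

```python
def filter_by_community(candidates, query_node, node_to_community):
    """Keep candidates that share a community with `query_node`.

    If the query node has no known community, the candidates are returned
    unchanged rather than dropping everything.
    """
    target = node_to_community.get(query_node)
    if target is None:
        return list(candidates)
    return [n for n in candidates if node_to_community.get(n) == target]


# Assumed mapping: node name -> community id.
node_to_community = {"pump": 0, "impeller": 0, "invoice": 1}

kept = filter_by_community(
    ["impeller", "invoice", "pump"], "pump", node_to_community
)
print(kept)
```

Here "invoice" is dropped because it belongs to a different community, which is how a community boundary can keep off-topic but superficially similar content out of the retrieved context.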

Status

Core GraphRAG engine: ✅ COMPLETE
Ontology & communities: ✅ COMPLETE
Evaluation & documentation: ✅ COMPLETE
Operationalization: 🚧 IN PROGRESS

