
Commit 2c6eff2

Add RAG knowledge graph based experimental example (#144)
1 parent 3d42370 commit 2c6eff2

12 files changed: +1197 additions, 0 deletions

experimental/README.md

Lines changed: 4 additions & 0 deletions

@@ -43,6 +43,10 @@ Experimental examples are sample code and deployments for RAG pipelines that are

This example is able to ingest PDFs, PowerPoint slides, Word and other documents with complex data formats including text, images, slides and tables. It allows users to ask questions through a text interface and optionally with an image query, and it can respond with text and reference images, slides and tables in its response, along with source links and downloads.

* [NVIDIA Knowledge Graph RAG](./knowledge_graph_rag)

  This example implements a GPU-accelerated pipeline for creating and querying knowledge graphs using Retrieval-Augmented Generation (RAG). The approach leverages NVIDIA's AI technologies and the RAPIDS ecosystem to process large-scale datasets efficiently. It allows users to interact through a chat interface, visualize the corresponding knowledge graph, and perform evaluations against synthetic data generated with NVIDIA's Nemotron-4 340B model.

* [Run RAG-LLM in Azure Machine Learning](./AzureML)

  This example shows the configuration changes to using Docker containers and local GPUs that are required

Lines changed: 170 additions & 0 deletions
@@ -0,0 +1,170 @@

# Knowledge Graphs for RAG with NVIDIA AI Foundation Models and Endpoints

This repository implements a GPU-accelerated pipeline for creating and querying knowledge graphs using Retrieval-Augmented Generation (RAG). Our approach leverages NVIDIA's AI technologies and the RAPIDS ecosystem to process large-scale datasets efficiently.

## Overview

This project demonstrates:

- Creation of knowledge graphs from various document sources
- A simple script to download research papers from arXiv for a given topic
- GPU-accelerated graph processing and analysis using NVIDIA's RAPIDS graph analytics library, [cuGraph](https://github.com/rapidsai/cugraph)
- Hybrid semantic search combining keyword and dense vector approaches
- Integration of knowledge graphs into RAG workflows
- Visualization of the knowledge graph through [Gephi-Lite](https://github.com/gephi/gephi-lite), an open-source web app for visualizing large graphs
- Comprehensive evaluation metrics using NVIDIA's Nemotron-4 340B model for synthetic data generation and reward scoring

## Technologies Used

- **Frontend**: Streamlit
- **Graph Representation and Optimization**: cuGraph (RAPIDS), NetworkX
- **Vector Database**: Milvus
- **LLM Models**:
  - NVIDIA AI Playground hosted models for graph creation and querying, providing numerous instruct-fine-tuned options
  - NVIDIA AI Playground hosted Nemotron-4 340B model for synthetic data generation and evaluation reward scoring

## Architecture Diagram

The ingestion system is designed around a high-throughput hosted LLM deployment that can process multiple document chunks in parallel. The LLM can optionally be fine-tuned for triple extraction, which allows a shorter prompt while enabling greater accuracy and more efficient inference.

```mermaid
graph TD
    A[Document Collection] --> B{Document Splitter}
    B --> |Chunk 1| C1[LLM Stream 1]
    B --> |Chunk 2| C2[LLM Stream 2]
    B --> |Chunk 3| C3[LLM Stream 3]
    B --> |...| C4[...]
    B --> |Chunk N| C5[LLM Stream N]
    C1 --> D[Response Parser<br>and Aggregator]
    C2 --> D
    C3 --> D
    C4 --> D
    C5 --> D
    D --> E[GraphML Generator]
    E --> F[Single GraphML File]
```
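
The fan-out/aggregate pattern above can be sketched in a few lines of Python. This is a minimal illustration rather than the repository's implementation (which lives behind the `process_documents` and `save_triples_to_csvs` helpers in `utils.lc_graph`); the prompt, the `subject | relation | object` output format, and the parser are assumptions made for the example.

```python
import concurrent.futures

import networkx as nx
from langchain_nvidia_ai_endpoints import ChatNVIDIA

llm = ChatNVIDIA(model="mistralai/mixtral-8x7b-instruct-v0.1")

PROMPT = (
    "Extract (subject, relation, object) triples from the text below. "
    "Return one triple per line as: subject | relation | object\n\nText:\n{chunk}"
)

def triples_for_chunk(chunk: str) -> list[tuple[str, str, str]]:
    """Send one chunk to the hosted LLM and parse its line-oriented reply."""
    reply = llm.invoke(PROMPT.format(chunk=chunk)).content
    triples = []
    for line in reply.splitlines():
        parts = [p.strip() for p in line.split("|")]
        if len(parts) == 3 and all(parts):
            triples.append((parts[0], parts[1], parts[2]))
    return triples

def build_graphml(chunks: list[str], out_path: str = "knowledge_graph.graphml") -> None:
    """Fan chunks out to parallel LLM calls, then aggregate all triples into one GraphML file."""
    with concurrent.futures.ThreadPoolExecutor(max_workers=8) as pool:
        per_chunk_triples = list(pool.map(triples_for_chunk, chunks))
    graph = nx.DiGraph()
    for triples in per_chunk_triples:
        for subject, relation, obj in triples:
            graph.add_edge(subject, obj, relation=relation)
    nx.write_graphml(graph, out_path)
```

Calling `build_graphml(chunks)` after splitting the documents yields a single GraphML file that the rest of the pipeline (and Gephi-Lite) can load.
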
The inference system combines hybrid retrieval (dense-vector plus sparse keyword search), reranking, and a knowledge graph for multi-hop search:

```mermaid
graph LR
    E(User Query) --> A(FRONTEND<br/>Chat UI<br/>Streamlit)
    A --Dense-Sparse<br>Retrieval--> C(Milvus Vector DB)
    A --Multi-hop<br>Search--> F(Knowledge Graph <br> with cuGraph)
    C --Hybrid<br>Chunks--> X(Reranker)
    X -- Augmented<br/>Prompt--> B((Hosted LLM API<br/>NVIDIA AI Playground))
    F -- Graph Context<br>Triples--> B
    B --> D(Streaming<br/>Chat Response)
```

This architecture shows how the user query is processed through both the Milvus vector DB for traditional retrieval and the cuGraph-backed knowledge graph for multi-hop search. The results from both are then used to augment the prompt sent to the NVIDIA AI Playground backend.
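
A rough sketch of the graph side of this flow is shown below: it loads the GraphML file produced at ingestion time, expands the query entities a few hops out, and folds the resulting triples into the prompt together with the reranked chunks. The entity extraction and the Milvus retrieval are left as inputs here, and the prompt wording is an assumption; the repository's own retrieval is wrapped in its `SearchHandler` class.

```python
import networkx as nx
from langchain_nvidia_ai_endpoints import ChatNVIDIA

llm = ChatNVIDIA(model="mistralai/mixtral-8x7b-instruct-v0.1")
graph = nx.read_graphml("knowledge_graph.graphml")  # written by the ingestion pipeline

def graph_context(entities: list[str], hops: int = 2) -> str:
    """Collect triples within `hops` of the query entities (simple multi-hop expansion)."""
    triples = set()
    for entity in entities:
        if entity not in graph:
            continue
        neighborhood = nx.ego_graph(graph, entity, radius=hops, undirected=True)
        for u, v, data in neighborhood.edges(data=True):
            triples.add(f"({u}, {data.get('relation', 'related_to')}, {v})")
    return "\n".join(sorted(triples))

def answer(question: str, entities: list[str], reranked_chunks: list[str]) -> str:
    """Augment the prompt with both vector-retrieved chunks and graph triples."""
    passages = "\n---\n".join(reranked_chunks)
    prompt = (
        "Answer the question using the retrieved passages and knowledge-graph triples.\n\n"
        f"Passages:\n{passages}\n\n"
        f"Triples:\n{graph_context(entities)}\n\n"
        f"Question: {question}"
    )
    return llm.invoke(prompt).content
```
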
## Setup Steps

Follow these steps to get the chatbot up and running in less than 5 minutes:

### 1. Clone this repository to a Linux machine

```bash
git clone https://github.com/NVIDIA/GenerativeAIExamples/ && cd GenerativeAIExamples/experimental/knowledge_graph_rag
```

### 2. Get an NVIDIA AI Playground API Key

```bash
export NVIDIA_API_KEY="nvapi-*******************"
```

If you don't have an API key, follow [these instructions](https://github.com/NVIDIA/GenerativeAIExamples/blob/main/docs/api-catalog.md#get-an-api-key-for-the-accessing-models-on-the-api-catalog) to sign up for an NVIDIA AI Foundation developer account and obtain access.

### 3. Create a Python virtual environment and activate it

```bash
cd knowledge_graph_rag
pip install virtualenv
python3 -m virtualenv venv
source venv/bin/activate
```

### 4. Install the required packages

```bash
pip install -r requirements.txt
```
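
Optionally, once the requirements are installed you can sanity-check the API key and model access with a short snippet. The model name below is simply the default the app preselects; any chat model from the catalog should work.

```python
import os

from langchain_nvidia_ai_endpoints import ChatNVIDIA

assert os.environ.get("NVIDIA_API_KEY", "").startswith("nvapi-"), "NVIDIA_API_KEY is not set"

# Same default model that the Streamlit app preselects in its sidebar.
llm = ChatNVIDIA(model="mistralai/mixtral-8x7b-instruct-v0.1")
print(llm.invoke("Reply with the single word OK.").content)
```
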
### 5. Set up a hosted Milvus vector database

Follow the instructions [here](https://milvus.io/docs/install_standalone-docker.md) to deploy a hosted Milvus instance for the vector database backend. Note that it must be Milvus 2.4 or later to support [hybrid search](https://milvus.io/docs/multi-vector-search.md); disabling this feature for earlier Milvus versions is not currently supported.
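
For reference, hybrid search in Milvus 2.4 issues several vector requests against one collection and fuses the results before reranking. The sketch below is illustrative only: the collection name matches the one passed to the app's `SearchHandler`, but the field names, metric types, and the placeholder query vectors are assumptions, and in this project these details are handled inside `SearchHandler`.

```python
from pymilvus import AnnSearchRequest, Collection, RRFRanker, connections

connections.connect(uri="http://localhost:19530")  # default Milvus standalone endpoint
collection = Collection("hybrid_demo3")            # collection name used by the app's SearchHandler
collection.load()

# Placeholder query embeddings: in this project a BGE-M3-style model produces both a
# dense vector and a sparse {dimension_index: weight} vector for the query text.
dense_query = [0.0] * 1024
sparse_query = {1: 0.5, 42: 0.25}

dense_request = AnnSearchRequest(data=[dense_query], anns_field="dense_vector",
                                 param={"metric_type": "IP"}, limit=20)
sparse_request = AnnSearchRequest(data=[sparse_query], anns_field="sparse_vector",
                                  param={"metric_type": "IP"}, limit=20)

# Reciprocal-rank fusion merges the two candidate lists into a single ranked result set.
hits = collection.hybrid_search([sparse_request, dense_request], rerank=RRFRanker(),
                                limit=5, output_fields=["text"])
```
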
### 6. Launch the Streamlit frontend

```bash
streamlit run app.py
```

Open the URL in your browser to access the UI and chatbot!

### 7. Upload documents and create the knowledge graph

Upload your own documents to a folder, or use an existing folder for the knowledge graph creation. Note that the implementation currently extracts text from PDFs only. It can be extended to other text file formats using the Unstructured.io data loader in LangChain, as sketched below.
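
A minimal sketch of that extension, assuming the `unstructured` extras are installed; the folder name and glob pattern are placeholders, and the loaded text would still need to pass through the same triple-extraction step as the PDFs.

```python
from langchain_community.document_loaders import DirectoryLoader, UnstructuredFileLoader

# Load .docx files (or any other format unstructured supports) from a folder.
loader = DirectoryLoader("my_documents", glob="**/*.docx",
                         loader_cls=UnstructuredFileLoader)
docs = loader.load()
texts = [doc.page_content for doc in docs]
```
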
## Pipeline Components

1. **Data Ingestion**:
   - ArXiv paper downloader (see the sketch after this list)
   - Arbitrary document folder ingestion
2. **Knowledge Graph Creation**:
   - Uses the API Catalog models through the LangChain NVIDIA AI Endpoints interface
3. **Graph Representation**: cuGraph + RAPIDS + NetworkX
4. **Semantic Search**: Milvus 2.4.x for hybrid (keyword + dense vector) search
5. **RAG Integration**: Custom workflow incorporating knowledge graph retrieval
6. **Evaluation**: Comparison of different RAG approaches using the Nemotron-4 340B model
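
The repository ships its own downloader script; as an independent sketch, the same idea looks roughly like this with the third-party `arxiv` package (the topic, folder, and result count are placeholders).

```python
import os

import arxiv  # pip install arxiv

def download_papers(topic: str, out_dir: str = "papers", max_results: int = 5) -> None:
    """Download the most relevant arXiv PDFs for a topic into a local folder."""
    os.makedirs(out_dir, exist_ok=True)
    client = arxiv.Client()
    search = arxiv.Search(query=topic, max_results=max_results,
                          sort_by=arxiv.SortCriterion.Relevance)
    for result in client.results(search):
        result.download_pdf(dirpath=out_dir)

download_papers("knowledge graphs for retrieval augmented generation")
```
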
## Evaluation Metrics

We've implemented comprehensive evaluation metrics using NVIDIA's Nemotron-4 340B model, which is designed for synthetic data generation and reward scoring. Our evaluation compares different RAG approaches across five key attributes:

1. **Helpfulness**: Overall helpfulness of the response to the prompt.
2. **Correctness**: Inclusion of all pertinent facts without errors.
3. **Coherence**: Consistency and clarity of expression.
4. **Complexity**: Intellectual depth required to write the response.
5. **Verbosity**: Amount of detail included in the response, relative to what is asked for in the prompt.
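
In outline, the comparison loop looks like the sketch below. `score_with_reward_model` is a hypothetical helper standing in for a call to the hosted Nemotron-4 340B reward endpoint; only the aggregation logic is shown.

```python
import pandas as pd

ATTRIBUTES = ["helpfulness", "correctness", "coherence", "complexity", "verbosity"]

def score_with_reward_model(question: str, answer: str) -> dict[str, float]:
    """Hypothetical wrapper: send the (question, answer) pair to the hosted
    Nemotron-4 340B reward model and return its five attribute scores."""
    raise NotImplementedError

def compare_approaches(questions: list[str],
                       answers_by_approach: dict[str, list[str]]) -> pd.DataFrame:
    """Average each attribute per RAG approach so the variants can be compared side by side."""
    rows = []
    for approach, answers in answers_by_approach.items():
        scores = [score_with_reward_model(q, a) for q, a in zip(questions, answers)]
        means = {attr: sum(s[attr] for s in scores) / len(scores) for attr in ATTRIBUTES}
        rows.append({"approach": approach, **means})
    return pd.DataFrame(rows).set_index("approach")
```
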
## Evaluation Results

We compared four RAG approaches on a small representative dataset using the Nemotron-4 340B reward model:

![Evaluation Results](viz.png)

Key takeaways:

- Graph RAG significantly outperforms traditional Text RAG.
- Combined Text and Graph RAG shows promise but doesn't consistently beat the ground truth yet. This may be due to how we structure the augmented prompt for the LLM, and it needs more experimentation.
- Our approach improves on verbosity and coherence compared to ground truth.

While we're not beating long-context ground truth across the board, these results show the potential of integrating knowledge graphs into RAG systems. We're particularly excited about the improvements in verbosity and coherence. Next steps include refining how we combine text and graph retrieval to get the best of both worlds.

## Component Swapping

All components are designed to be swappable. Here are some options:

- **Frontend**: The current Streamlit implementation can be replaced with other web frameworks.
- **Retrieval**: The embedding and reranker models used for semantic search can be swapped for higher-performance alternatives. The number of candidates retrieved prior to reranking and the document chunk size can also be changed.
- **Vector DB**: While we use Milvus, it can be replaced with options like ChromaDB, Pinecone, FAISS, etc. Milvus is designed to be highly performant and to scale on GPU infrastructure.
- **Backend**:
  - Cloud hosted: Currently uses NVIDIA AI Playground APIs, but the backend can also be deployed in a private DGX Cloud or on AWS/Azure/GCP with NVIDIA GPUs and LLMs.
  - On-prem/locally hosted: Smaller models like Llama2-7B or Mistral-7B can be run locally with appropriate hardware. A model can also be fine-tuned specifically for triple extraction in a given use case (see the sketch after this list).
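
As referenced above, here is a minimal sketch of pointing the pipeline's LLM client at either the hosted backend or a locally hosted OpenAI-compatible endpoint (for example, a NIM container). The local URL and model name are illustrative assumptions.

```python
from langchain_nvidia_ai_endpoints import ChatNVIDIA

# Hosted backend: NVIDIA AI Playground / API Catalog (reads NVIDIA_API_KEY from the environment).
hosted_llm = ChatNVIDIA(model="mistralai/mixtral-8x7b-instruct-v0.1")

# Self-hosted backend: point the same client at a local OpenAI-compatible endpoint.
local_llm = ChatNVIDIA(base_url="http://localhost:8000/v1",
                       model="meta/llama3-8b-instruct")
```
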
## Future Work

- Dynamic incorporation of new information into knowledge graphs (continuous knowledge graph updates)
- Further refinement of the evaluation metrics and the combined semantic-graph RAG pipeline
- Investigating the impact of different graph structures and queries on RAG performance (single/multi-hop retrieval, BFS/DFS, etc.)
- Expanding support for various document types and formats (multimodal RAG with knowledge graphs)
- Fine-tuning the Nemotron-4 340B model for domain-specific evaluations

## Contributing

Please open a pull request to this repository; our team appreciates any and all contributions that add features. We will review and respond as soon as possible.

## Acknowledgements

This project utilizes NVIDIA's AI technologies, including the Nemotron-4 340B model, and the RAPIDS ecosystem. We thank the open-source community for their invaluable contributions to the tools and libraries used in this project.

Lines changed: 138 additions & 0 deletions
@@ -0,0 +1,138 @@

# SPDX-FileCopyrightText: Copyright (c) 2023-2024 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

import os
import streamlit as st
from llama_index.core import SimpleDirectoryReader, KnowledgeGraphIndex
from utils.preprocessor import extract_triples
from llama_index.core import ServiceContext
import multiprocessing
import pandas as pd
import networkx as nx
from utils.lc_graph import process_documents, save_triples_to_csvs
from vectorstore.search import SearchHandler
from langchain_nvidia_ai_endpoints import ChatNVIDIA

def load_data(input_dir, num_workers):
    reader = SimpleDirectoryReader(input_dir=input_dir)
    documents = reader.load_data(num_workers=num_workers)
    return documents

def has_pdf_files(directory):
    for file in os.listdir(directory):
        if file.endswith(".pdf"):
            return True
    return False

st.title("Knowledge Graph RAG")

st.subheader("Load Data from Files")

# Variable for documents
if 'documents' not in st.session_state:
    st.session_state['documents'] = None

models = ChatNVIDIA.get_available_models()
available_models = [model.id for model in models if model.model_type=="chat" and "instruct" in model.id]
with st.sidebar:
    llm = st.selectbox("Choose an LLM", available_models, index=available_models.index("mistralai/mixtral-8x7b-instruct-v0.1"))
    st.write("You selected: ", llm)
    llm = ChatNVIDIA(model=llm)

def app():
    # Get the current working directory
    cwd = os.getcwd()

    # Get a list of visible directories in the current working directory
    directories = [d for d in os.listdir(cwd) if os.path.isdir(os.path.join(cwd, d)) and not d.startswith('.') and '__' not in d]

    # Create a dropdown menu for directory selection
    selected_dir = st.selectbox("Select a directory:", directories, index=0)

    # Construct the full path of the selected directory
    directory = os.path.join(cwd, selected_dir)

    if st.button("Process Documents"):
        # Check if the selected directory has PDF files
        res = has_pdf_files(directory)
        if not res:
            st.error("No PDF files found in directory! Only PDF files and text extraction are supported for now.")
            st.stop()
        documents, results = process_documents(directory, llm)
        print(documents)
        st.write(documents)
        search_handler = SearchHandler("hybrid_demo3", use_bge_m3=True, use_reranker=True)
        search_handler.insert_data(documents)
        st.write(f"Processing complete. Total triples extracted: {len(results)}")

        with st.spinner("Saving triples to CSV files with Pandas..."):
            # write the resulting entities to a CSV, relations to a CSV and all triples with IDs to a CSV
            save_triples_to_csvs(results)

        with st.spinner("Loading the CSVs into dataframes..."):
            # Load the triples from the CSV file
            triples_df = pd.read_csv("triples.csv")
            # Load the entities and relations DataFrames
            entities_df = pd.read_csv("entities.csv")
            relations_df = pd.read_csv("relations.csv")

        # with st.spinner("Creating the knowledge graph from these triples..."):
        # Create a mapping from IDs to entity names and relation names
        entity_name_map = entities_df.set_index("entity_id")["entity_name"].to_dict()
        relation_name_map = relations_df.set_index("relation_id")["relation_name"].to_dict()

        # Create the graph from the triples DataFrame
        G = nx.from_pandas_edgelist(
            triples_df,
            source="entity_id_1",
            target="entity_id_2",
            edge_attr="relation_id",
            create_using=nx.DiGraph,
        )

        with st.spinner("Relabeling node integers to strings for future retrieval..."):
            # Relabel the nodes with the actual entity names
            G = nx.relabel_nodes(G, entity_name_map)

            # Relabel the edges with the actual relation names
            edge_attributes = nx.get_edge_attributes(G, "relation_id")

            # Update the edges with the new relation names
            new_edge_attributes = {
                (u, v): relation_name_map[edge_attributes[(u, v)]]
                for u, v in G.edges()
                if edge_attributes[(u, v)] in relation_name_map
            }

            nx.set_edge_attributes(G, new_edge_attributes, "relation")

        with st.spinner("Saving the graph to a GraphML file for further visualization and retrieval..."):
            try:
                nx.write_graphml(G, "knowledge_graph.graphml")

                # Verify by reading it back
                G_loaded = nx.read_graphml("knowledge_graph.graphml")
                if nx.is_directed(G_loaded):
                    st.success("GraphML file is valid and successfully loaded.")
                else:
                    st.error("GraphML file is invalid.")
            except Exception as e:
                st.error(f"Error saving or loading GraphML file: {e}")
                return

        st.success("Done!")

if __name__ == "__main__":
    app()
