This repository contains the implementation of the paper "Can Knowledge Graphs Make Large Language Models More Trustworthy? An Empirical Study over Open-ended Question Answering". The paper introduces OKGQA, a new benchmark for evaluating Knowledge Graph-enhanced LLMs in open-ended question answering, focusing on reducing hallucinations and improving reasoning.
OKGQA is a comprehensive benchmark that combines Knowledge Graphs (KGs) with Large Language Models (LLMs) to enhance the trustworthiness of open-ended question answering. The system leverages DBpedia as its primary knowledge source and implements various techniques to improve answer quality and reduce hallucinations.
- Open-ended QA: Focuses on real-world, free-form question answering that requires complex reasoning
- Diverse Question Types: Includes 10+ categories of QA to mirror practical scenarios
- Noise-resilient Testing (OKGQA-P): Evaluates robustness against perturbed/contaminated KGs
- Hallucination Metrics: Measures factual accuracy via FActScore and SAFE, alongside LLM-as-judge for answer quality (relevance, correctness, etc.)
- Knowledge Graph Integration: Seamless integration with DBpedia for factual grounding
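To illustrate the kind of DBpedia grounding involved, the sketch below builds a one-hop SPARQL query for a given entity. This is a hypothetical helper for illustration only, not the repository's own query-generation code (`generate_query.py`):

```python
# Hypothetical sketch: build a SPARQL query that retrieves the outgoing
# one-hop triples of a DBpedia resource (illustrative, not the actual
# implementation in src/generate_qa/generate_query.py).

def build_one_hop_query(entity: str, limit: int = 50) -> str:
    """Return a SPARQL query for outgoing triples of a DBpedia resource."""
    return f"""
    PREFIX dbr: <http://dbpedia.org/resource/>
    SELECT ?p ?o WHERE {{
        dbr:{entity} ?p ?o .
    }} LIMIT {limit}
    """

query = build_one_hop_query("Albert_Einstein")
```

The resulting query string can then be sent to the public endpoint at `https://dbpedia.org/sparql`, e.g. with SPARQLWrapper (listed in the requirements).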
```
src/
├── build_subgraph/           # Subgraph construction and manipulation
│   ├── build_subgraph.py     # Core subgraph building functionality
│   ├── g_retriever.py        # Graph retrieval utilities
│   ├── perturb_kgs.py        # Knowledge graph perturbation
│   ├── preprocess.py         # Data preprocessing
│   ├── score_function_sG.py  # Scoring functions for subgraphs
│   └── statistics_graph.py   # Graph statistics calculation
├── generate_qa/              # Question-Answer generation
│   ├── main.py               # Main QA generation script
│   ├── generate_query.py     # SPARQL query generation
│   ├── retrieve_wikipedia.py # Wikipedia data retrieval
│   ├── post_process.py       # Post-processing utilities
│   ├── calculate_stat.py     # Statistics calculation
│   └── prompt.txt            # LLM prompts for QA generation
├── config/                   # Configuration files
│   └── config.py             # Main configuration settings
├── utils.py                  # Utility functions
└── eval/                     # Evaluation scripts
```
- `build_subgraph.py`: Constructs knowledge subgraphs from DBpedia data
- `g_retriever.py`: Implements graph retrieval algorithms
- `perturb_kgs.py`: Handles knowledge graph perturbation for robustness testing
- `preprocess.py`: Preprocesses raw data for graph construction
- `score_function_sG.py`: Implements scoring functions for subgraph quality
- `statistics_graph.py`: Calculates graph statistics and metrics
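The perturbation idea behind OKGQA-P (contaminating KG triples to test robustness) can be sketched as a simple tail-swap over (head, relation, tail) triples. The function below is a toy illustration of that idea, not the actual logic in `perturb_kgs.py`:

```python
import random

def perturb_triples(triples, ratio=0.3, seed=0):
    """Replace the tail of a fraction of triples with a random other entity.

    `triples` is a list of (head, relation, tail) tuples; `ratio` is the
    fraction of triples whose tail gets swapped. This is a hypothetical
    stand-in for the perturbation strategies evaluated in OKGQA-P.
    """
    rng = random.Random(seed)
    # candidate entity pool drawn from the graph itself (sorted for determinism)
    entities = sorted({h for h, _, _ in triples} | {t for _, _, t in triples})
    n_perturb = round(len(triples) * ratio)
    idx = rng.sample(range(len(triples)), n_perturb)
    out = list(triples)
    for i in idx:
        h, r, t = out[i]
        # pick a replacement tail guaranteed to differ from the original
        candidates = [e for e in entities if e != t]
        out[i] = (h, r, rng.choice(candidates))
    return out
```

Heads and relations are left intact so the perturbed graph keeps its original shape while a controlled fraction of its facts become wrong.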
- `main.py`: Orchestrates the QA pair generation process
- `generate_query.py`: Generates SPARQL queries for knowledge retrieval
- `retrieve_wikipedia.py`: Fetches relevant Wikipedia content
- `post_process.py`: Post-processes generated QA pairs
- `calculate_stat.py`: Computes statistics for generated QA pairs
- `prompt.txt`: Contains LLM prompts for QA generation
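Wikipedia retrieval of the sort `retrieve_wikipedia.py` performs can be done through the MediaWiki API; the sketch below only builds the request URL for a plain-text article introduction (a hypothetical helper; the repository may use a different endpoint or client):

```python
from urllib.parse import urlencode

# Hypothetical sketch: build a MediaWiki API request URL that fetches the
# plain-text introduction of an article (illustrative, not the actual
# implementation in src/generate_qa/retrieve_wikipedia.py).

def wikipedia_extract_url(title: str) -> str:
    params = urlencode({
        "action": "query",
        "prop": "extracts",   # plain-text page extracts (TextExtracts API)
        "exintro": 1,         # introduction section only
        "explaintext": 1,     # strip HTML from the extract
        "format": "json",
        "titles": title,
    })
    return f"https://en.wikipedia.org/w/api.php?{params}"

url = wikipedia_extract_url("Albert Einstein")
```

The URL can then be fetched with `requests.get(url)` (requests is listed in the requirements) and the extract read from the JSON response.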
- Python 3.8+
- OpenAI API key
- NetworkX (>=3.0)
- SPARQLWrapper (>=2.0.0)
- python-dotenv (>=1.0.0)
- Transformers
- NumPy (>=1.24.0)
- Pandas (>=2.0.0)
- tqdm (>=4.65.0)
- requests (>=2.31.0)
- Clone the repository:
```bash
git clone https://github.com/Y-Sui/OKGQA.git
cd OKGQA
```
- Install dependencies:
```bash
conda env create -f okgqa_39_env_dev.yml
```
- Set up environment variables:
Create a `.env` file in the root directory and add your OpenAI API key:
```
OPENAI_API_KEY=your_api_key_here
```
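In the project itself this file is typically loaded with python-dotenv (`from dotenv import load_dotenv; load_dotenv()`, listed in the requirements). As a rough sketch of what that call does for a simple `KEY=value` file:

```python
import os

# Minimal sketch of python-dotenv's load_dotenv() behavior for a simple
# KEY=value file; use the real library in practice.

def load_env_file(path: str = ".env") -> None:
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue  # skip blank lines and comments
            key, _, value = line.partition("=")
            # like load_dotenv's default, don't override existing variables
            os.environ.setdefault(key.strip(), value.strip())
```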
Set up the configuration for generation and perturbation:
```bash
vim src/config/config.py
```
Key configuration parameters include:
- API keys and endpoints
- Model parameters
- Graph construction settings
- QA generation parameters
- Evaluation metrics
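As an illustration of how such settings might be grouped, the dataclass below is purely hypothetical; the field names are not the actual schema of `config.py`:

```python
from dataclasses import dataclass, field

@dataclass
class OKGQAConfig:
    # All field names below are illustrative, not the real config.py contents.
    openai_api_key: str = ""                          # API keys and endpoints
    sparql_endpoint: str = "https://dbpedia.org/sparql"
    model_name: str = "gpt-4"                         # model parameters
    temperature: float = 0.7
    max_hops: int = 2                                 # graph construction settings
    num_qa_pairs: int = 100                           # QA generation parameters
    perturbation_ratio: float = 0.3                   # for OKGQA-P robustness tests
    metrics: list = field(default_factory=lambda: ["factscore", "safe", "llm_judge"])

cfg = OKGQAConfig(temperature=0.0)
```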
- Generate QA pairs:
```bash
python -m src.generate_qa.main
```
- Generate subgraphs:
```bash
python -m src.build_subgraph.build_subgraph
```
- Generate perturbed subgraphs (for robustness testing in OKGQA-P):
```bash
python -m src.build_subgraph.perturb_kgs
```
The system includes comprehensive evaluation metrics:
- FActScore for factual accuracy
- SAFE for long-form factuality
- LLM-as-judge for relevance and correctness
- Custom metrics for KG integration effectiveness
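At its core, FActScore decomposes an answer into atomic facts and reports the fraction supported by the knowledge source. A minimal sketch of that aggregation follows; the fact-extraction and support-checking steps, performed by an LLM plus a retriever in the real pipeline, are stubbed out here:

```python
def factscore(atomic_facts, is_supported):
    """Fraction of atomic facts judged supported by the knowledge source.

    `atomic_facts` is a list of short claims extracted from an answer;
    `is_supported` maps each claim to True/False. In the real metric both
    steps are done by an LLM and a retriever; here they are given directly.
    """
    if not atomic_facts:
        return 0.0
    return sum(1 for f in atomic_facts if is_supported(f)) / len(atomic_facts)

facts = [
    "Einstein was born in Ulm.",
    "Einstein won the Nobel Prize in 1921.",
    "Einstein invented the telephone.",  # unsupported claim
]
supported = {facts[0]: True, facts[1]: True, facts[2]: False}
score = factscore(facts, supported.get)  # 2/3
```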
Contributions are welcome! Please feel free to submit a Pull Request.
This project is licensed under the MIT License.
If you use this code in your research, please cite:
```bibtex
@misc{sui2025knowledgegraphsmakelarge,
  title={Can Knowledge Graphs Make Large Language Models More Trustworthy? An Empirical Study Over Open-ended Question Answering},
  author={Yuan Sui and Yufei He and Zifeng Ding and Bryan Hooi},
  year={2025},
  eprint={2410.08085},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2410.08085},
}
```