Retrieving evidence for language model queries from knowledge graphs requires balancing broad search across the graph with multi-hop traversal to follow relational links. Similarity-based retrievers provide coverage but remain shallow, whereas traversal-based methods rely on selecting seed nodes to start exploration, which can fail when queries span multiple entities and relations.
We introduce ARK: Adaptive Retriever of Knowledge, an agentic KG retriever that gives a language model control over this breadth-depth tradeoff using a two-operation toolset:
- Global Search: Lexical search (BM25) over node descriptors for broad discovery
- Neighborhood Exploration: One-hop expansion that composes into multi-hop traversal
ARK alternates between breadth-oriented discovery and depth-oriented expansion without relying on fragile seed selection, a preset hop depth, or retrieval training. ARK adapts its tool use to the query, favoring global search for language-heavy queries and neighborhood exploration for relation-heavy queries.
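The interplay of the two operations can be sketched as follows. This is a simplified illustration, not the ARK implementation: the graph is a toy adjacency dict, and BM25 is replaced by a plain token-overlap score.

```python
# Toy stand-in for a text-rich KG (hypothetical data; real ARK runs BM25
# over node descriptors on STaRK-scale graphs).
NODES = {
    "p1": "aspirin small molecule drug",
    "p2": "ibuprofen small molecule drug",
    "d1": "headache disease condition",
}
EDGES = {"p1": ["d1"], "p2": ["d1"], "d1": ["p1", "p2"]}

def global_search(query: str, k: int = 2) -> list[str]:
    """Breadth: rank all nodes by lexical overlap with the query
    (a simplistic stand-in for BM25 over node descriptors)."""
    q = set(query.lower().split())
    scored = sorted(NODES, key=lambda n: -len(q & set(NODES[n].split())))
    return scored[:k]

def neighborhood(node_id: str) -> list[str]:
    """Depth: one-hop expansion; calling it repeatedly composes
    into multi-hop traversal."""
    return EDGES.get(node_id, [])

# The agent alternates the two: discover candidates broadly, then expand.
seeds = global_search("drug for headache")
hops = {n: neighborhood(n) for n in seeds}
```

Because the depth operation is just one hop, the model decides at each step whether to widen the frontier or follow another relation, rather than committing to a fixed traversal depth up front.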
Key Results on STaRK Benchmark:
- 59.1% average Hit@1 and 67.4% average MRR
- Improves average Hit@1 by up to 31.4% and average MRR by up to 28.0% over retrieval-based and training-free agentic methods
- A distilled 8B model retains up to 98.5% of the teacher's Hit@1 via label-free imitation
ARK is evaluated on STaRK, a benchmark for entity-level retrieval over heterogeneous, text-rich knowledge graphs (Wu et al., 2024).
STaRK comprises three large, heterogeneous knowledge graphs:
| Dataset | Domain | Entities | Relations | Avg. Degree |
|---|---|---|---|---|
| AMAZON | E-commerce | ~1M | ~9.4M | 18.2 |
| MAG | Academic | ~1.9M | ~39.8M | 43.5 |
| PRIME | Biomedical | ~129K | ~8.1M | 125.2 |
Each node is associated with text-rich attributes, making STaRK a natural testbed for hybrid retrieval over structured and textual signals.
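To make the "text-rich" aspect concrete, a node in such a graph pairs structure with free text. The record below is purely illustrative (the field names are not the exact STaRK schema):

```python
# Hypothetical example of a text-rich node record; field names are
# illustrative, not the actual STaRK schema.
node = {
    "id": 12345,
    "type": "drug",
    "name": "aspirin",
    "description": "Aspirin is a nonsteroidal anti-inflammatory drug ...",
}

def descriptor(n: dict) -> str:
    """Flatten a node's textual attributes into one descriptor string,
    the kind of field a lexical index such as BM25 is built over."""
    return " ".join(str(n[k]) for k in ("type", "name", "description"))
```

Hybrid retrieval then combines lexical signals from such descriptors with the relational structure of the edges.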
Clone this repository and set up your environment:
```bash
git clone https://github.com/mims-harvard/ark.git
cd ark
```

Install dependencies using uv:

```bash
uv sync
```

For running local models with vLLM, install it separately:

```bash
uv pip install vllm --torch-backend=auto
```

Download the STaRK benchmark data from the official repository:
```bash
# Clone the STaRK repository to get the raw data
git clone https://github.com/snap-stanford/stark.git

# Follow the STaRK instructions to download the knowledge graphs
# The data should be placed in benchmarks/stark/data/raw_graphs/
```

For detailed instructions on downloading the STaRK data, please refer to the STaRK paper and repository.
Convert the raw graph data to parquet format for efficient loading:
```bash
cd benchmarks/stark/preprocessing

# Preprocess each graph
python amazon_to_parquet.py
python mag_to_parquet.py
python prime_to_parquet.py
```

This will create parquet files in `benchmarks/stark/data/graphs/{graph_name}/`.
Create a `.env` file in the project root with your API keys:

```bash
# For Azure OpenAI (GPT-4.1)
AZURE_API_KEY=your_azure_api_key
AZURE_API_BASE=your_azure_endpoint

# For OpenAI
OPENAI_API_KEY=your_openai_api_key
```

Navigate to the STaRK benchmark directory and create symlinks:
```bash
cd benchmarks/stark
ln -s ../../src src
```

Run ARK on a specific graph:

```bash
# Run on PRIME with GPT-4.1 (default: 3 parallel agents)
python main.py --graph_name prime --model_name azure/gpt-4.1 --split test

# Run on MAG
python main.py --graph_name mag --model_name azure/gpt-4.1 --split test

# Run on AMAZON
python main.py --graph_name amazon --model_name azure/gpt-4.1 --split test
```

Available arguments:
- `--graph_name`: Graph to evaluate on (`prime`, `mag`, `amazon`)
- `--model_name`: Model to use (`azure/gpt-4.1`, `Qwen/Qwen3-8B`, etc.)
- `--split`: Data split (`train`, `val`, `test`)
- `--number_of_agents`: Number of parallel agents (default: 3)
- `--limit`: Limit the number of queries (for debugging)
After running experiments, evaluate the results:
```bash
python eval.py --graph_name prime --model_name azure/gpt-4.1 --split test
```

This will output metrics including Hit@1, Hit@5, Recall@10, Recall@20, and MRR.
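For reference, these metrics are standard functions of the ranked retrieval list and the gold entity set. A minimal sketch of their definitions (not the `eval.py` implementation):

```python
def hit_at_k(ranked: list, gold: set, k: int) -> float:
    """1.0 if any gold entity appears in the top-k results, else 0.0."""
    return float(any(x in gold for x in ranked[:k]))

def recall_at_k(ranked: list, gold: set, k: int) -> float:
    """Fraction of gold entities recovered in the top-k results."""
    return len(set(ranked[:k]) & gold) / len(gold)

def mrr(ranked: list, gold: set) -> float:
    """Reciprocal rank of the first gold entity (0.0 if none retrieved)."""
    for i, x in enumerate(ranked, start=1):
        if x in gold:
            return 1.0 / i
    return 0.0
```

For example, with gold entity `"B"` and ranking `["A", "B", "C"]`, Hit@1 is 0, Hit@5 is 1, and the reciprocal rank is 1/2; reported numbers average these over all queries.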
ARK supports distillation of the retrieval policy into smaller models via label-free trajectory imitation.
First, run ARK with the teacher model on training data:
```bash
python main.py --graph_name prime --model_name azure/gpt-4.1 --split train
python main.py --graph_name prime --model_name azure/gpt-4.1 --split val
```

Fine-tune a Qwen model on the collected trajectories:

```bash
python finetune.py --graph_name prime --model_name Qwen/Qwen3-8B --train_queries_limit 6000
```

Configure fine-tuning parameters in `fine_tuning/params.yaml`:
```yaml
graph_name: "prime"
model_name: "Qwen/Qwen3-8B"
train_queries_limit: 6000
val_queries_limit: 200

lora:
  r: 32
  lora_alpha: 64
  lora_dropout: 0.1

training:
  max_length: 16384
  num_train_epochs: 1
  learning_rate: 0.00001
```

Start a vLLM server with the fine-tuned model:
```bash
python -m vllm.entrypoints.openai.api_server \
  --model data/finetuning/prime/Qwen3-8B/explorer/merged \
  --served-model-name Qwen3-8B-graphagent \
  --host 0.0.0.0 \
  --port 8000 \
  --enable-auto-tool-choice \
  --tool-call-parser hermes
```

Then run evaluation with the fine-tuned model:

```bash
python main.py --graph_name prime --model_name Qwen3-8B-graphagent --split test
```

ARK is released under the MIT License. If you use ARK, please consider citing our paper:
```bibtex
@misc{polonuer2026autonomousknowledgegraphexploration,
  title={Autonomous Knowledge Graph Exploration with Adaptive Breadth-Depth Retrieval},
  author={Joaquín Polonuer and Lucas Vittor and Iñaki Arango and Ayush Noori and David A. Clifton and Luciano Del Corro and Marinka Zitnik},
  year={2026},
  eprint={2601.13969},
  archivePrefix={arXiv},
  primaryClass={cs.AI},
  url={https://arxiv.org/abs/2601.13969},
}
```

For any questions or feedback, please open an issue in the GitHub repository or contact Luciano Del Corro and Marinka Zitnik.
