A dataset of question-SPARQL pairs built from Wikidata Query Logs from 2017 to 2018.
The two most important files are:
- `wdql.tar.gz`: WDQL dataset for KGQA (train/val/test split by cluster, all samples per cluster)
- `wdql-one-per-cluster.tar.gz`: WDQL dataset for KGQA (train/val/test split by cluster, one sample per cluster)
Both archives contain three JSONL files: `train.jsonl`, `val.jsonl`, and `test.jsonl`.
Each line in these files is a JSON object; see the example at the end of this README for the exact structure.
Note: If you want to use WDQL for something other than KGQA, you can just concatenate all JSONL files after downloading and extracting
`wdql.tar.gz` or `wdql-one-per-cluster.tar.gz` to get a single file with all question-SPARQL pairs.
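For example, a minimal way to merge the splits in Python (the `data/wdql` path is an assumption; use wherever you extracted the archive):

```python
import json
from pathlib import Path

# Assumed extraction directory; adjust to where you unpacked the archive.
data_dir = Path("data/wdql")

# Concatenate the train/val/test splits into one list of samples.
samples = []
for split in ("train.jsonl", "val.jsonl", "test.jsonl"):
    with open(data_dir / split, encoding="utf-8") as f:
        samples.extend(json.loads(line) for line in f)

print(len(samples), "question-SPARQL pairs")
```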
All assets are available for download at https://wdql.cs.uni-freiburg.de/data:
- `organic-query-logs.tar.gz`: Raw Wikidata SPARQL query logs as TSV files
- `organic.tar.gz`: Processed and deduplicated query logs in a single JSONL file
- `organic-qwen3-next-80b-a3b.tar.gz`: Generated question-SPARQL samples with GRASP
- `organic-qwen3-next-80b-a3b-dataset.tar.gz`: Processed GRASP samples with question embeddings and clusters
- `wdql.tar.gz`: WDQL dataset for KGQA (train/val/test split by cluster, all samples per cluster)
- `wdql-one-per-cluster.tar.gz`: WDQL dataset for KGQA (train/val/test split by cluster, one sample per cluster)
- `wikidata-benchmarks.tar.gz`: Other Wikidata benchmarks (for comparison)
Download and extract these files into a subdirectory named data/ to skip
the corresponding steps in the pipeline below.
| Stage | Count |
|---|---|
| Data Collection | |
| Raw organic SPARQL logs | 3,530,955 |
| After deduplication | 859,305 |
| SPARQL Fixing and Question Generation with GRASP | |
| Processed samples | 314,430 |
| With questions (68.5%) | 215,256 |
| Without questions (31.5%) | 99,174 |
| Model API failure | 78,104 |
| Model output failure | 18,280 |
| Cancelled via CAN | 2,770 |
| Model stuck in loop | 20 |
| Validation | |
| Valid (93.0%) | 200,186 |
| Invalid (7.0%) | 15,070 |
| SPARQL parsing failed | 392 |
| SPARQL execution failed | 3,111 |
| Empty SPARQL result | 11,567 |
| Clustering | |
| Clustered samples (valid) | 200,186 |
| Num. clusters | 103,327 |
| Max. cluster size | 146 |
| Avg. cluster size | 1.94 |
| KGQA Datasets | |
| WDQL (one-per-cluster) | 103,327 |
| Train / Val / Test | 82,661 / 10,333 / 10,333 |
| WDQL | 200,186 |
| Train / Val / Test | 159,815 / 20,485 / 19,886 |
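For context, the three failure modes in the validation stage can be approximated with a short script. The sketch below is an illustration, not the pipeline's actual validation code; it assumes the `rdflib` and `requests` packages, reuses the public QLever endpoint from the pipeline configuration further down, and only adds the prefixes used in the example at the end of this README:

```python
import requests
from rdflib.plugins.sparql.parser import parseQuery

# Only the prefixes from the example at the end of this README;
# real queries may need the full set of Wikidata prefixes.
PREFIXES = (
    "PREFIX wdt: <http://www.wikidata.org/prop/direct/>\n"
    "PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>\n"
)

# Public QLever endpoint, also used as KG_ENDPOINT in the pipeline below.
ENDPOINT = "https://qlever.dev/api/wikidata"


def classify(sparql: str) -> str:
    """Mirror the three validation outcomes from the table above."""
    query = PREFIXES + sparql
    try:
        parseQuery(query)
    except Exception:
        return "SPARQL parsing failed"
    resp = requests.get(
        ENDPOINT,
        params={"query": query},
        headers={"Accept": "application/sparql-results+json"},
        timeout=60,
    )
    if not resp.ok:
        return "SPARQL execution failed"
    # SELECT queries assumed; ASK queries return a "boolean" field instead.
    if not resp.json().get("results", {}).get("bindings"):
        return "Empty SPARQL result"
    return "Valid"
```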
```bash
# Create data directory and install dependencies
mkdir -p data
pip install -r requirements.txt
```

```bash
# Download and extract raw query logs
curl -L https://wdql.cs.uni-freiburg.de/data/organic-query-logs.tar.gz \
| tar -xzv -C data/
```

```bash
# Build organic.jsonl from TSV files
python prepare_input.py data/*.tsv data/
```

```bash
# Checkout wdql branch of GRASP
git clone -b wikidata-query-logs --single-branch git@github.com:ad-freiburg/grasp.git
# Install and setup GRASP (see GRASP README for more details)
cd grasp
pip install -e .
export GRASP_INDEX_DIR=$(pwd)/grasp-indices
mkdir -p $GRASP_INDEX_DIR
# Download and extract GRASP Wikidata index
curl -L https://ad-publications.cs.uni-freiburg.de/grasp/kg-index/wikidata.tar.gz \
| tar -xzv -C $GRASP_INDEX_DIR
# Install vLLM and start server with Qwen-3-Next-80B-A3B
pip install vllm
vllm serve Qwen/Qwen3-Next-80B-A3B-Instruct \
--tool-call-parser hermes \
--enable-auto-tool-choice \
--port 8336
# Start GRASP server with wdql config (runs on port 12345)
# By default, the config expects a Wikidata SPARQL endpoint at localhost:7001, but
# here we set it to the public QLever endpoint instead
KG_ENDPOINT=https://qlever.dev/api/wikidata \
grasp serve configs/wikidata-query-logs/qwen3-next-80b-a3b.yaml
# Run generation script (more options available in the script)
python scripts/run_wikidata_query_logs.py \
data/organic.jsonl \
data/organic-qwen3-next-80b-a3b/ \
http://localhost:12345/run # GRASP server URL
```

```bash
# Process the GRASP samples into a dataset with question embeddings
python generate_dataset.py
```

```bash
# Cluster the samples by their question embeddings
python build_clusters.py
```

```bash
# WDQL uniq dataset (one sample per cluster)
python export_kgqa_dataset.py

# WDQL all dataset (all samples per cluster)
python export_kgqa_dataset.py --output-dir data/wdql \
--samples-per-cluster -1
```

Generate some statistics about WDQL and other Wikidata datasets:

```bash
# Download and extract other Wikidata benchmarks
curl -L https://wdql.cs.uni-freiburg.de/data/wikidata-benchmarks.tar.gz \
| tar -xzv -C data/
# Generate statistics
for bench in data/(wdql|wdql-one-per-cluster|spinach|simplequestions|qald7|wwq|qawiki|lcquad2|qald10); \
do cat $bench/*.jsonl | jq '.sparql' | python sparql_statistics.py \
> $bench/statistics.txt; \
done
```

Note: To generate statistics for `wdql` and `wdql-one-per-cluster`, you need to complete step 5 above or download and extract the corresponding files first.
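Note that the `(…|…)` alternation in the loop above is zsh glob syntax. For other shells, here is an equivalent sketch in Python (assuming the same `data/` layout and that `sparql_statistics.py` reads the JSON-encoded query strings, as produced by `jq '.sparql'`, from stdin):

```python
import json
import subprocess
from pathlib import Path

BENCHMARKS = [
    "wdql", "wdql-one-per-cluster", "spinach", "simplequestions",
    "qald7", "wwq", "qawiki", "lcquad2", "qald10",
]

for bench in BENCHMARKS:
    bench_dir = Path("data") / bench
    # One JSON-encoded SPARQL string per line, mirroring `jq '.sparql'`.
    lines = []
    for jsonl in sorted(bench_dir.glob("*.jsonl")):
        with open(jsonl, encoding="utf-8") as f:
            lines.extend(json.dumps(json.loads(line)["sparql"]) for line in f)
    with open(bench_dir / "statistics.txt", "w", encoding="utf-8") as out:
        subprocess.run(
            ["python", "sparql_statistics.py"],
            input="\n".join(lines) + "\n",
            text=True,
            stdout=out,
            check=True,
        )
```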
Run a Streamlit app to visualize the dataset:
```bash
streamlit run visualize_app.py
```

Note: To run the app, you need to complete step 4 above or download and extract `organic-qwen3-next-80b-a3b-dataset.tar.gz` first.
{ "id": "train_132930", "question": "Works by Victor Hugo with French title \"Les Misérables\"", "sparql": "SELECT ?work WHERE { ?work wdt:P50 ?author . ?author rdfs:label \"Victor Hugo\"@fr . ?work wdt:P1476 \"Les Misérables\"@fr . }", "paraphrases": [ "What works authored by Victor Hugo have the French title \"Les Misérables\"?", "List all works written by Victor Hugo that are titled \"Les Misérables\" in French." ], "info": { // Original SPARQL query from the query logs "raw_sparql": "SELECT ?var1 WHERE { ?var1 <http://www.wikidata.org/prop/direct/P50> ?var2 . ?var2 <http://www.w3.org/2000/01/rdf-schema#label> \"string1\"@fr . ?var1 <http://www.wikidata.org/prop/direct/P1476> ?var3 . ?var3 <http://www.w3.org/2000/01/rdf-schema#label> \"string2\"@fr . }" } }