Wikidata Query Logs Dataset (WDQL)

A dataset of question-SPARQL pairs built from the Wikidata Query Logs from 2017 to 2018.

Overview

The two most important files are:

  • wdql.tar.gz: WDQL dataset for KGQA (train/val/test split by cluster, all samples per cluster)
  • wdql-one-per-cluster.tar.gz: WDQL dataset for KGQA (train/val/test split by cluster, one sample per cluster)

Both archives contain three JSONL files: train.jsonl, val.jsonl, and test.jsonl. Each line in these files is a JSON object with the following structure:

{
  "id": "train_132930",
  "question": "Works by Victor Hugo with French title \"Les Misérables\"",
  "sparql": "SELECT ?work WHERE { ?work wdt:P50 ?author . ?author rdfs:label \"Victor Hugo\"@fr . ?work wdt:P1476 \"Les Misérables\"@fr . }",
  "paraphrases": [
    "What works authored by Victor Hugo have the French title \"Les Misérables\"?",
    "List all works written by Victor Hugo that are titled \"Les Misérables\" in French."
  ],
  "info": {
    // Original SPARQL query from the query logs
    "raw_sparql": "SELECT ?var1 WHERE { ?var1 <http://www.wikidata.org/prop/direct/P50> ?var2 . ?var2 <http://www.w3.org/2000/01/rdf-schema#label> \"string1\"@fr . ?var1 <http://www.wikidata.org/prop/direct/P1476> ?var3 . ?var3 <http://www.w3.org/2000/01/rdf-schema#label> \"string2\"@fr . }"
  }
}
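
The splits can be loaded with a few lines of Python; the path below assumes wdql.tar.gz was extracted into data/wdql/:

import json

# Read the training split line by line; each line is one question-SPARQL pair.
with open("data/wdql/train.jsonl", encoding="utf-8") as f:
    for line in f:
        sample = json.loads(line)
        print(sample["id"], sample["question"])
        print(sample["sparql"])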

Note: If you want to use WDQL for something other than KGQA, you can simply concatenate all JSONL files after downloading and extracting wdql.tar.gz or wdql-one-per-cluster.tar.gz to get a single file with all question-SPARQL pairs.
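
A minimal sketch of that concatenation in Python, again assuming the archive was extracted into data/wdql/:

# Merge the train/val/test splits into a single JSONL file with all pairs.
with open("data/wdql/all.jsonl", "w", encoding="utf-8") as out:
    for split in ("train", "val", "test"):
        with open(f"data/wdql/{split}.jsonl", encoding="utf-8") as f:
            out.write(f.read())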

All Downloads

All assets are available for download at https://wdql.cs.uni-freiburg.de/data:

  • organic-query-logs.tar.gz: Raw Wikidata SPARQL query logs as TSV files
  • organic.tar.gz: Processed and deduplicated query logs in a single JSONL file
  • organic-qwen3-next-80b-a3b.tar.gz: Generated question-SPARQL samples with GRASP
  • organic-qwen3-next-80b-a3b-dataset.tar.gz: Processed GRASP samples with question embeddings and clusters
  • wdql.tar.gz: WDQL dataset for KGQA (train/val/test split by cluster, all samples per cluster)
  • wdql-one-per-cluster.tar.gz: WDQL dataset for KGQA (train/val/test split by cluster, one sample per cluster)
  • wikidata-benchmarks.tar.gz: Other Wikidata benchmarks (for comparison)

Download and extract these files into a subdirectory named data/ to skip the corresponding steps in the pipeline below.
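
The downloads can also be scripted; the following sketch uses only the Python standard library and fetches a single archive into data/ (adjust the file name as needed):

import tarfile
import urllib.request
from pathlib import Path

# Download one archive from the list above and extract it into data/.
name = "wdql.tar.gz"
url = f"https://wdql.cs.uni-freiburg.de/data/{name}"
Path("data").mkdir(exist_ok=True)
archive = Path("data") / name
urllib.request.urlretrieve(url, str(archive))
with tarfile.open(archive, "r:gz") as tar:
    tar.extractall("data")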

Dataset Creation Statistics

Stage                                               Number
Data Collection
    Raw organic SPARQL logs                      3,530,955
    After deduplication                            859,305
SPARQL Fixing and Question Generation with GRASP
    Processed samples                              314,430
        With questions (68.5%)                     215,256
        Without questions (31.5%)                   99,174
            Model API failure                       78,104
            Model output failure                    18,280
            Cancelled via CAN                        2,770
            Model stuck in loop                         20
Validation
    Valid (93.0%)                                  200,186
    Invalid (7.0%)                                  15,070
        SPARQL parsing failed                          392
        SPARQL execution failed                      3,111
        Empty SPARQL result                         11,567
Clustering
    Clustered samples (valid)                      200,186
        Num. clusters                              103,327
        Max. cluster size                              146
        Avg. cluster size                             1.94
KGQA Datasets
    WDQL (one-per-cluster)                         103,327
        Train / Val / Test        82,661 / 10,333 / 10,333
    WDQL                                           200,186
        Train / Val / Test       159,815 / 20,485 / 19,886

Pipeline

Setup

# Create data directory and install dependencies
mkdir -p data
pip install -r requirements.txt

1. Prepare input from query logs

# Download and extract raw query logs
curl -L https://wdql.cs.uni-freiburg.de/data/organic-query-logs.tar.gz \
  | tar -xzv -C data/
# Build organic.jsonl from TSV files
python prepare_input.py data/*.tsv data/
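
prepare_input.py performs this step; purely as an illustration of the idea (the actual TSV columns and the JSONL schema it writes may differ), a deduplication sketch that assumes one SPARQL query per TSV row in the first column could look like this:

import csv
import glob
import json

# Illustrative sketch only; prepare_input.py is the actual implementation.
# Assumes the SPARQL query is in the first TSV column and deduplicates
# queries after collapsing whitespace. The output key "sparql" is an
# assumption about the organic.jsonl schema.
seen = set()
with open("data/organic.jsonl", "w", encoding="utf-8") as out:
    for path in sorted(glob.glob("data/*.tsv")):
        with open(path, encoding="utf-8", newline="") as f:
            for row in csv.reader(f, delimiter="\t"):
                if not row:
                    continue
                query = " ".join(row[0].split())
                if query and query not in seen:
                    seen.add(query)
                    out.write(json.dumps({"sparql": query}) + "\n")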

2. Generate question-SPARQL pairs with GRASP

# Clone the wikidata-query-logs branch of GRASP
git clone -b wikidata-query-logs --single-branch git@github.com:ad-freiburg/grasp.git

# Install and setup GRASP (see GRASP README for more details)
cd grasp
pip install -e .
export GRASP_INDEX_DIR=$(pwd)/grasp-indices
mkdir -p $GRASP_INDEX_DIR

# Download and extract GRASP Wikidata index
curl -L https://ad-publications.cs.uni-freiburg.de/grasp/kg-index/wikidata.tar.gz \
  | tar -xzv -C $GRASP_INDEX_DIR

# Install vLLM and start server with Qwen-3-Next-80B-A3B
pip install vllm
vllm serve Qwen/Qwen3-Next-80B-A3B-Instruct \
  --tool-call-parser hermes \
  --enable-auto-tool-choice \
  --port 8336

# Start GRASP server with wdql config (runs on port 12345)
# By default, the config expects a Wikidata SPARQL endpoint at localhost:7001, but
# here we set it to the public QLever endpoint instead
KG_ENDPOINT=https://qlever.dev/api/wikidata \
  grasp serve configs/wikidata-query-logs/qwen3-next-80b-a3b.yaml

# Run generation script (more options available in the script)
python scripts/run_wikidata_query_logs.py \
  data/organic.jsonl \
  data/organic-qwen3-next-80b-a3b/ \
  http://localhost:12345/run # GRASP server URL

3. Generate dataset and embeddings

python generate_dataset.py

4. Build clusters from embeddings

python build_clusters.py
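
build_clusters.py contains the actual clustering logic. As a rough illustration of the general idea (grouping questions whose embeddings are nearly identical), a greedy cosine-similarity grouping might look like the sketch below; the threshold and the exact algorithm are assumptions, not the script's settings.

import numpy as np

# Illustrative sketch only: assign each question embedding to the first
# cluster whose representative (its first member) is similar enough,
# otherwise open a new cluster. Threshold and method are assumptions.
def greedy_cluster(embeddings: np.ndarray, threshold: float = 0.9) -> list[int]:
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    representatives: list[np.ndarray] = []
    labels: list[int] = []
    for emb in normed:
        if representatives:
            sims = np.array([rep @ emb for rep in representatives])
            best = int(sims.argmax())
            if sims[best] >= threshold:
                labels.append(best)
                continue
        representatives.append(emb)
        labels.append(len(representatives) - 1)
    return labels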

5. Export KGQA dataset using clusters

# WDQL one-per-cluster dataset (one sample per cluster)
python export_kgqa_dataset.py
# Full WDQL dataset (all samples per cluster)
python export_kgqa_dataset.py --output-dir data/wdql \
  --samples-per-cluster -1
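
Both exports split by cluster rather than by individual sample, so near-duplicate questions never appear in more than one split. A minimal sketch of that idea (export_kgqa_dataset.py is the actual implementation; the 80/10/10 ratio matches the counts in the statistics table above):

import random

# Illustrative sketch: assign whole clusters to train/val/test so that all
# samples of a cluster end up in the same split (80/10/10 by cluster count).
def split_by_cluster(cluster_ids, seed=42, ratios=(0.8, 0.1, 0.1)):
    clusters = sorted(set(cluster_ids))
    random.Random(seed).shuffle(clusters)
    n_train = int(ratios[0] * len(clusters))
    n_val = int(ratios[1] * len(clusters))
    train = set(clusters[:n_train])
    val = set(clusters[n_train:n_train + n_val])
    return {c: "train" if c in train else "val" if c in val else "test"
            for c in clusters}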

Statistics

Generate some statistics about WDQL and other Wikidata datasets:

# Download and extract other Wikidata benchmarks
curl -L https://wdql.cs.uni-freiburg.de/data/wikidata-benchmarks.tar.gz \
  | tar -xzv -C data/

# Generate statistics
for bench in wdql wdql-one-per-cluster spinach simplequestions qald7 wwq qawiki lcquad2 qald10; do
  cat data/$bench/*.jsonl | jq '.sparql' | python sparql_statistics.py \
    > data/$bench/statistics.txt
done

Note: To generate statistics for wdql and wdql-one-per-cluster, you need to complete step 5 above or download and extract the corresponding files first.

Visualization

Run a Streamlit app to visualize the dataset:

streamlit run visualize_app.py

Note: To run the app, you need to complete step 4 above or download and extract organic-qwen3-next-80b-a3b-dataset.tar.gz first.
