Wikidata Query Logs Dataset (WDQL)

A dataset of question-SPARQL pairs built from the Wikidata Query Logs from 2017 to 2018.

Overview

The two most important files are:

  • wdql.tar.gz: WDQL dataset for KGQA (train/val/test split by cluster, all samples per cluster)
  • wdql-one-per-cluster.tar.gz: WDQL dataset for KGQA (train/val/test split by cluster, one sample per cluster)

Both archives contain three JSONL files: train.jsonl, val.jsonl, and test.jsonl. Each line in these files is a JSON object with the following structure:

{
  "id": "train_132930",
  "question": "Works by Victor Hugo with French title \"Les Misérables\"",
  "sparql": "SELECT ?work WHERE { ?work wdt:P50 ?author . ?author rdfs:label \"Victor Hugo\"@fr . ?work wdt:P1476 \"Les Misérables\"@fr . }",
  "paraphrases": [
    "What works authored by Victor Hugo have the French title \"Les Misérables\"?",
    "List all works written by Victor Hugo that are titled \"Les Misérables\" in French."
  ],
  "info": {
    // Original SPARQL query from the query logs
    "raw_sparql": "SELECT ?var1 WHERE { ?var1 <http://www.wikidata.org/prop/direct/P50> ?var2 . ?var2 <http://www.w3.org/2000/01/rdf-schema#label> \"string1\"@fr . ?var1 <http://www.wikidata.org/prop/direct/P1476> ?var3 . ?var3 <http://www.w3.org/2000/01/rdf-schema#label> \"string2\"@fr . }"
  }
}
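
The splits can be loaded with a few lines of Python; the path below assumes wdql.tar.gz was extracted into data/wdql/:

import json

# Read the training split line by line; each line is one question-SPARQL pair.
with open("data/wdql/train.jsonl", encoding="utf-8") as f:
    for line in f:
        sample = json.loads(line)
        print(sample["id"], sample["question"])
        print(sample["sparql"])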

Note: If you want to use WDQL for something other than KGQA, you can simply concatenate all JSONL files after downloading and extracting wdql.tar.gz or wdql-one-per-cluster.tar.gz to get a single file with all question-SPARQL pairs.
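
A minimal sketch of that concatenation in Python, again assuming the archive was extracted into data/wdql/:

# Merge the train/val/test splits into a single JSONL file with all pairs.
with open("data/wdql/all.jsonl", "w", encoding="utf-8") as out:
    for split in ("train", "val", "test"):
        with open(f"data/wdql/{split}.jsonl", encoding="utf-8") as f:
            out.write(f.read())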

All Downloads

All assets are available for download at https://wdql.cs.uni-freiburg.de/data:

  • organic-query-logs.tar.gz: Raw Wikidata SPARQL query logs as TSV files
  • organic.tar.gz: Processed and deduplicated query logs in a single JSONL file
  • organic-qwen3-next-80b-a3b.tar.gz: Generated question-SPARQL samples with GRASP
  • organic-qwen3-next-80b-a3b-dataset.tar.gz: Processed GRASP samples with question embeddings and clusters
  • wdql.tar.gz: WDQL dataset for KGQA (train/val/test split by cluster, all samples per cluster)
  • wdql-one-per-cluster.tar.gz: WDQL dataset for KGQA (train/val/test split by cluster, one sample per cluster)
  • wikidata-benchmarks.tar.gz: Other Wikidata benchmarks (for comparison)

Download and extract these files into a subdirectory named data/ to skip the corresponding steps in the pipeline below.
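
The downloads can also be scripted; the following sketch uses only the Python standard library and fetches a single archive into data/ (adjust the file name as needed):

import tarfile
import urllib.request
from pathlib import Path

# Download one archive from the list above and extract it into data/.
name = "wdql.tar.gz"
url = f"https://wdql.cs.uni-freiburg.de/data/{name}"
Path("data").mkdir(exist_ok=True)
archive = Path("data") / name
urllib.request.urlretrieve(url, str(archive))
with tarfile.open(archive, "r:gz") as tar:
    tar.extractall("data")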

Dataset Creation Statistics

Stage                                               Number
Data Collection
    Raw organic SPARQL logs                      3,530,955
    After deduplication                            859,305
SPARQL Fixing and Question Generation with GRASP
    Processed samples                              314,430
        With questions (68.5%)                     215,256
        Without questions (31.5%)                   99,174
            Model API failure                       78,104
            Model output failure                    18,280
            Cancelled via CAN                        2,770
            Model stuck in loop                         20
Validation
    Valid (93.0%)                                  200,186
    Invalid (7.0%)                                  15,070
        SPARQL parsing failed                          392
        SPARQL execution failed                      3,111
        Empty SPARQL result                         11,567
Clustering
    Clustered samples (valid)                      200,186
        Num. clusters                              103,327
        Max. cluster size                              146
        Avg. cluster size                             1.94
KGQA Datasets
    WDQL (one-per-cluster)                         103,327
        Train / Val / Test        82,661 / 10,333 / 10,333
    WDQL                                           200,186
        Train / Val / Test       159,815 / 20,485 / 19,886

Pipeline

Setup

# Create data directory and install dependencies
mkdir -p data
pip install -r requirements.txt

1. Prepare input from query logs

# Download and extract raw query logs
curl -L https://wdql.cs.uni-freiburg.de/data/organic-query-logs.tar.gz \
  | tar -xzv -C data/
# Build organic.jsonl from TSV files
python prepare_input.py data/*.tsv data/
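
prepare_input.py performs this step; purely as an illustration of the idea (the actual TSV columns and the JSONL schema it writes may differ), a deduplication sketch that assumes one SPARQL query per TSV row in the first column could look like this:

import csv
import glob
import json

# Illustrative sketch only; prepare_input.py is the actual implementation.
# Assumes the SPARQL query is in the first TSV column and deduplicates
# queries after collapsing whitespace. The output key "sparql" is an
# assumption about the organic.jsonl schema.
seen = set()
with open("data/organic.jsonl", "w", encoding="utf-8") as out:
    for path in sorted(glob.glob("data/*.tsv")):
        with open(path, encoding="utf-8", newline="") as f:
            for row in csv.reader(f, delimiter="\t"):
                if not row:
                    continue
                query = " ".join(row[0].split())
                if query and query not in seen:
                    seen.add(query)
                    out.write(json.dumps({"sparql": query}) + "\n")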

2. Generate question-SPARQL pairs with GRASP

# Clone the wikidata-query-logs branch of GRASP
git clone -b wikidata-query-logs --single-branch git@github.com:ad-freiburg/grasp.git

# Install and setup GRASP (see GRASP README for more details)
cd grasp
pip install -e .
export GRASP_INDEX_DIR=$(pwd)/grasp-indices
mkdir -p $GRASP_INDEX_DIR

# Download and extract GRASP Wikidata index
curl -L https://ad-publications.cs.uni-freiburg.de/grasp/kg-index/wikidata.tar.gz \
  | tar -xzv -C $GRASP_INDEX_DIR

# Install vLLM and start server with Qwen-3-Next-80B-A3B
pip install vllm
vllm serve Qwen/Qwen3-Next-80B-A3B-Instruct \
  --tool-call-parser hermes \
  --enable-auto-tool-choice \
  --port 8336

# Start GRASP server with wdql config (runs on port 12345)
# By default, the config expects a Wikidata SPARQL endpoint at localhost:7001, but
# here we set it to the public QLever endpoint instead
KG_ENDPOINT=https://qlever.dev/api/wikidata \
  grasp serve configs/wikidata-query-logs/qwen3-next-80b-a3b.yaml

# Run generation script (more options available in the script)
python scripts/run_wikidata_query_logs.py \
  data/organic.jsonl \
  data/organic-qwen3-next-80b-a3b/ \
  http://localhost:12345/run # GRASP server URL

3. Generate dataset and embeddings

python generate_dataset.py

4. Build clusters from embeddings

python build_clusters.py
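
build_clusters.py contains the actual clustering logic. As a rough illustration of the general idea (grouping questions whose embeddings are nearly identical), a greedy cosine-similarity grouping might look like the sketch below; the threshold and the exact algorithm are assumptions, not the script's settings.

import numpy as np

# Illustrative sketch only: assign each question embedding to the first
# cluster whose representative (its first member) is similar enough,
# otherwise open a new cluster. Threshold and method are assumptions.
def greedy_cluster(embeddings: np.ndarray, threshold: float = 0.9) -> list[int]:
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    representatives: list[np.ndarray] = []
    labels: list[int] = []
    for emb in normed:
        if representatives:
            sims = np.array([rep @ emb for rep in representatives])
            best = int(sims.argmax())
            if sims[best] >= threshold:
                labels.append(best)
                continue
        representatives.append(emb)
        labels.append(len(representatives) - 1)
    return labels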

5. Export KGQA dataset using clusters

# WDQL one-per-cluster dataset (one sample per cluster)
python export_kgqa_dataset.py
# Full WDQL dataset (all samples per cluster)
python export_kgqa_dataset.py --output-dir data/wdql \
  --samples-per-cluster -1
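
Both exports split by cluster rather than by individual sample, so near-duplicate questions never appear in more than one split. A minimal sketch of that idea (export_kgqa_dataset.py is the actual implementation; the 80/10/10 ratio matches the counts in the statistics table above):

import random

# Illustrative sketch: assign whole clusters to train/val/test so that all
# samples of a cluster end up in the same split (80/10/10 by cluster count).
def split_by_cluster(cluster_ids, seed=42, ratios=(0.8, 0.1, 0.1)):
    clusters = sorted(set(cluster_ids))
    random.Random(seed).shuffle(clusters)
    n_train = int(ratios[0] * len(clusters))
    n_val = int(ratios[1] * len(clusters))
    train = set(clusters[:n_train])
    val = set(clusters[n_train:n_train + n_val])
    return {c: "train" if c in train else "val" if c in val else "test"
            for c in clusters}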

Statistics

Generate some statistics about WDQL and other Wikidata datasets:

# Download and extract other Wikidata benchmarks
curl -L https://wdql.cs.uni-freiburg.de/data/wikidata-benchmarks.tar.gz \
  | tar -xzv -C data/

# Generate statistics
for bench in wdql wdql-one-per-cluster spinach simplequestions qald7 wwq qawiki lcquad2 qald10; do
  cat data/$bench/*.jsonl | jq '.sparql' | python sparql_statistics.py \
    > data/$bench/statistics.txt
done

Note: To generate statistics for wdql and wdql-one-per-cluster, you need to complete step 5 above or download and extract the corresponding files first.

Visualization

Run a Streamlit app to visualize the dataset:

streamlit run visualize_app.py

Note: To run the app, you need to complete step 4 above or download and extract organic-qwen3-next-80b-a3b-dataset.tar.gz first.
