- [2026-01] 🛠️ Code: Full evaluation code and temporal metrics are released.
Existing temporal QA benchmarks focus on simple fact-seeking queries, while reasoning-intensive retrieval benchmarks lack temporal grounding. TEMPO bridges this gap as the first benchmark combining temporal reasoning with reasoning-intensive retrieval across 13 domains.
TEMPO Features:
- 1,730 Complex Queries: Requiring deep temporal reasoning (tracking changes, trends, cross-period evidence).
- Step-wise Retrieval Planning: 3,976 decomposed steps with gold documents mapped for multi-hop evaluation.
- Novel Temporal Metrics: Introducing Temporal Coverage@k and Temporal Precision@k to measure temporal completeness.
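To make the intent of the two metrics concrete, here is a rough sketch of their general shape. This is an illustration only: the released `temporal_metrics.py` uses an LLM judge to decide temporal relevance, so the exact definitions may differ; here each document is assumed (for illustration) to carry the set of time periods it covers.

```python
def temporal_coverage_at_k(retrieved, gold_periods, k):
    """Fraction of the query's gold time periods covered by the top-k docs."""
    if not gold_periods:
        return 0.0
    covered = set()
    for doc in retrieved[:k]:
        # Assumed structure: doc["periods"] is the set of periods the doc covers.
        covered |= doc["periods"] & set(gold_periods)
    return len(covered) / len(gold_periods)

def temporal_precision_at_k(retrieved, gold_periods, k):
    """Fraction of the top-k docs that cover at least one gold time period."""
    hits = sum(1 for doc in retrieved[:k] if doc["periods"] & set(gold_periods))
    return hits / k
```

Intuitively, Coverage@k rewards retrieving evidence from *all* required time periods (temporal completeness), while Precision@k penalizes top-ranked documents from irrelevant periods.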
| Domain | BM25 | BGE | Contriever | DiVeR | E5 | GritLM | Inst-L | Qwen | Rader | ReasonIR | SBERT | SFR |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Cardano | 13.4 | 13.1 | 12.1 | 29.3 | 35.7 | 21.7 | 14.6 | 20.6 | 18.6 | 22.9 | 21.4 | 28.1 |
| Iota | 9.7 | 36.1 | 38.3 | 38.2 | 41.7 | 36.6 | 34.3 | 28.6 | 19.2 | 41.7 | 33.2 | 37.1 |
| Monero | 2.8 | 14.5 | 9.9 | 20.3 | 20.0 | 14.7 | 16.9 | 11.0 | 21.0 | 19.6 | 15.1 | 23.7 |
| Bitcoin | 6.2 | 14.4 | 13.3 | 17.4 | 16.3 | 19.1 | 15.7 | 11.4 | 14.9 | 16.3 | 14.3 | 17.6 |
| Economics | 5.8 | 12.6 | 16.3 | 27.8 | 25.0 | 17.2 | 17.5 | 17.1 | 22.7 | 20.0 | 15.3 | 21.9 |
| Law | 12.7 | 31.9 | 28.1 | 40.4 | 34.0 | 38.3 | 37.3 | 32.0 | 33.5 | 37.9 | 33.8 | 40.8 |
| Political | 32.7 | 28.2 | 31.6 | 45.5 | 47.9 | 41.4 | 32.6 | 38.1 | 32.4 | 35.4 | 34.6 | 44.9 |
| History | 9.2 | 27.4 | 26.5 | 34.5 | 28.7 | 27.3 | 28.5 | 25.6 | 25.8 | 34.3 | 28.7 | 32.4 |
| Quant | 2.5 | 11.7 | 11.1 | 27.2 | 13.8 | 21.6 | 14.6 | 12.7 | 27.8 | 19.5 | 15.7 | 16.8 |
| Travel | 4.6 | 23.8 | 23.7 | 26.8 | 28.3 | 25.0 | 25.0 | 22.0 | 26.1 | 21.4 | 27.3 | 29.7 |
| Workplace | 6.2 | 27.2 | 23.9 | 42.6 | 32.9 | 30.8 | 36.2 | 30.3 | 36.6 | 30.0 | 34.6 | 31.6 |
| Genealogy | 13.3 | 22.0 | 24.9 | 35.6 | 33.5 | 26.9 | 24.6 | 25.3 | 18.7 | 30.3 | 23.5 | 31.7 |
| HSM | 21.2 | 23.2 | 18.9 | 31.0 | 37.7 | 33.4 | 24.4 | 21.3 | 16.9 | 24.7 | 26.1 | 33.5 |
| Avg. | 10.8 | 22.0 | 21.4 | 32.0 | 30.4 | 27.2 | 24.8 | 22.8 | 24.2 | 27.2 | 24.9 | 30.0 |
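The Avg. row appears to be the unweighted mean over the 13 domains; a quick sanity check for the BM25 column:

```python
# BM25 scores for the 13 domains, in table order (Cardano ... HSM)
bm25 = [13.4, 9.7, 2.8, 6.2, 5.8, 12.7, 32.7, 9.2, 2.5, 4.6, 6.2, 13.3, 21.2]
avg = round(sum(bm25) / len(bm25), 1)  # matches the 10.8 in the Avg. row
```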
Clone the repository and install the dependencies:

```bash
git clone https://github.com/tempo-bench/Tempo.git
cd Tempo
pip install -r requirements.txt
```

Run standard retrieval evaluation with `run.py`, which supports running over all domains or specific ones.
```bash
# Evaluate retrieval on the 'history' domain with BM25
python run.py --task history --model bm25 --output_dir outputs

# Evaluate on ALL domains
python run.py --task all --model bm25 --output_dir outputs
```

For deeper analysis, evaluate the intermediate retrieval steps:

```bash
python run_step.py --task economics --model e5 --output_dir outputs_steps
```

We provide a dedicated script, `temporal_metrics.py`, to calculate advanced temporal metrics such as Temporal Coverage and Temporal Precision. It uses an LLM to judge the temporal relevance of retrieved documents.
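The judge prompt itself is not reproduced here; as a rough illustration only, a temporal-relevance judgment might be phrased like this (a hypothetical prompt, not the one shipped in `temporal_metrics.py`):

```python
# Hypothetical prompt builder for an LLM temporal-relevance judge.
def build_judge_prompt(query: str, document: str) -> str:
    return (
        "You are judging temporal relevance for a retrieval benchmark.\n"
        f"Query: {query}\n"
        f"Document: {document}\n"
        "Does the document provide evidence for the time period(s) the query "
        "asks about? Answer strictly 'yes' or 'no'."
    )
```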
Note: You must complete the installation step above first so that the required provider libraries (e.g., `openai`, `anthropic`, `google-generativeai`) are available.
The script automatically downloads the necessary queries and corpus data from the Hugging Face Hub (dataset `tempo26/Tempo`), so you don't need to download data files manually.
Set the environment variables for your chosen provider:
🔷 Azure OpenAI (Default)

```bash
export AZURE_OPENAI_ENDPOINT="https://your-endpoint.openai.azure.com/"
export AZURE_OPENAI_API_KEY="your-api-key"
export AZURE_DEPLOYMENT_NAME="gpt-4o"
```

🟢 OpenAI

```bash
export OPENAI_API_KEY="sk-..."
export OPENAI_MODEL="gpt-4o"  # Optional
```

🟠 Anthropic (Claude)

```bash
export ANTHROPIC_API_KEY="sk-ant-..."
export ANTHROPIC_MODEL="claude-3-5-sonnet-20240620"  # Optional
```

🔵 Google (Gemini)

```bash
export GOOGLE_API_KEY="AIza..."
export GEMINI_MODEL="gemini-1.5-pro"  # Optional
```

Run the script, specifying the provider with `--provider`:
```bash
# Using Azure (default)
python temporal_metrics.py \
  --model_results_dir ./outputs \
  --k_values 10 20 \
  --provider azure

# Using OpenAI
python temporal_metrics.py --model_results_dir ./outputs --k_values 10 --provider openai

# Using Claude
python temporal_metrics.py --model_results_dir ./outputs --k_values 10 --provider anthropic
```

Comparison with existing benchmarks:

| Benchmark | #Q | #D | Src. | Temp. | Reason. | Expert | Step | Cross |
|---|---|---|---|---|---|---|---|---|
| BRIGHT | 1,384 | 12 | Mixed | ❌ | ✅ | ✅ | ❌ | ❌ |
| RAR-b | 45,745 | 17 | Mixed | ❌ | ✅ | ✅ | ❌ | ❌ |
| NTCIR Temporalia | 100 | Open | News/Blogs | ✅ | ❌ | ❌ | ❌ | ❌ |
| TEMPO (Ours) | 1,730 | 13 | Stack Exch. | ✅ | ✅ | ✅ | ✅ | ✅ |
If you use TEMPO in your work, please cite our paper:
@article{abdallah2026tempo,
title={TEMPO: A Realistic Multi-Domain Benchmark for Temporal Reasoning-Intensive Retrieval},
author={Abdallah, Abdelrahman and Ali, Mohammed and Abdul-Mageed, Muhammad and Jatowt, Adam},
journal={arXiv preprint arXiv:2601.09523},
year={2026}
}