This repository provides the official code and resources for TIMTQE,
a benchmark dataset and evaluation framework for translation quality estimation (QE) on text images,
covering both synthetic (MLQE-PE) and historical (HistMTQE) settings.
The dataset is publicly available on HuggingFace Datasets:
https://huggingface.co/datasets/thinklis/TIMTQE
It includes:
- MLQE-PE – large-scale synthetic subset with rendered text images.
- HistMTQE – human-annotated historical document subset.
For detailed structure and examples, please check the HuggingFace dataset page.
We provide an evaluation toolkit to assess the performance of quality estimation models on TIMTQE.
The main script is evaluate.py, which compares model predictions against human-annotated quality scores.
- Input Format: Predictions should be stored in a JSON, CSV, or TSV file containing at least:
  - `id` – unique identifier of the sample
  - `prediction` – the model's QE score for the translation, typically on a 0–100 scale
  - `label` – the human-annotated reference score
- Normalization: To ensure fair comparison across systems, the script applies z-score normalization to model predictions.
- Metrics: The following evaluation metrics are computed:
  - Pearson correlation – measures the linear relationship between predictions and human scores.
  - Spearman correlation – assesses rank-based consistency between predictions and labels.
  - RMSE – penalizes larger deviations between predictions and reference scores.
  - MAE – captures the average absolute difference between predictions and labels.
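To make the pipeline above concrete, here is a minimal NumPy sketch of the normalization and metrics. It is an illustrative re-implementation, not the code in evaluate.py: the tie-free ranking is a simplification of Spearman's formula, and z-scoring the labels as well (so RMSE/MAE are on a comparable scale) is an assumption about the official script.

```python
import numpy as np

def zscore(x):
    """Z-score normalization: zero mean, unit variance."""
    x = np.asarray(x, dtype=float)
    return (x - x.mean()) / x.std()

def ranks(x):
    """0-based ranks; ties are not averaged (a simplification)."""
    order = np.argsort(x)
    r = np.empty(len(x), dtype=float)
    r[order] = np.arange(len(x), dtype=float)
    return r

def evaluate(preds, labels):
    # ASSUMPTION: labels are z-scored too, so the error metrics are
    # computed on a comparable scale; the official script may differ.
    p, y = zscore(preds), zscore(labels)
    return {
        "pearson": np.corrcoef(p, y)[0, 1],
        "spearman": np.corrcoef(ranks(p), ranks(y))[0, 1],
        "rmse": float(np.sqrt(np.mean((p - y) ** 2))),
        "mae": float(np.mean(np.abs(p - y))),
    }
```

For perfectly rank-consistent predictions (e.g. any positive linear rescaling of the labels), both correlations come out as 1 and, after normalization, RMSE and MAE drop to 0.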
Example usage:

```bash
python evaluate.py \
  --pred_file results/predictions.json \
  --ref_file data/histmtqe/test.json \
  --output_dir outputs/
```

If you use TIMTQE in your research, please cite it as follows:
```bibtex
@ARTICLE{11267222,
  author={Li, Shuo and Bi, Xiaojun and Sun, Yiwen},
  journal={IEEE Signal Processing Letters},
  title={TIMTQE: Benchmarking Machine Translation Quality Estimation for Text Images},
  year={2025},
  volume={},
  number={},
  pages={1-5},
  doi={10.1109/LSP.2025.3636988}
}
```
