
Commit e0dc33a

Truthfulqa multi harness (#3062)
* truthfulqa-multi task
* truthfulqa-multi with chat few-shot
* few shot chat implementation
* changed until so it outputs lists
* changed dataset location
* added MT task
* Create README.md
* do not include MT
* changes for PR
* tag change
* removed yaml extension
* adding task to the table
* fix task configs
* add import exception

---------

Co-authored-by: Baber <[email protected]>
1 parent a7ca043 commit e0dc33a

20 files changed: +438 −0 lines

lm_eval/tasks/README.md

Lines changed: 1 addition & 0 deletions
@@ -150,6 +150,7 @@
  | [translation](translation/README.md) | Tasks focused on evaluating the language translation capabilities of models. | Arabic, English, Spanish, Basque, Hindi, Indonesian, Burmese, Russian, Swahili, Telugu, Chinese |
  | [triviaqa](triviaqa/README.md) | A large-scale dataset for trivia question answering to test general knowledge. | English |
  | [truthfulqa](truthfulqa/README.md) | A QA task aimed at evaluating the truthfulness and factual accuracy of model responses. | English |
+ | [truthfulqa-multi](truthfulqa-multi/README.md) | A multilingual version of TruthfulQA, a QA task aimed at evaluating the truthfulness and factual accuracy of model responses. | English, Spanish, Catalan, Basque, Galician |
  | [turkishmmlu](turkishmmlu/README.md) | A multiple-choice QA test modeled after MMLU, written in Turkish based on Turkish high-school level exams. | Turkish |
  | [unitxt](unitxt/README.md) | A number of tasks implemented using the unitxt library for flexible, shareable, and reusable data preparation and evaluation for generative AI. | English |
  | [unscramble](unscramble/README.md) | Tasks involving the rearrangement of scrambled sentences to test syntactic understanding. | English |
Lines changed: 58 additions & 0 deletions
@@ -0,0 +1,58 @@
# TruthfulQA-Multi

## Paper

Title: `Truth Knows No Language: Evaluating Truthfulness Beyond English`

Abstract: https://arxiv.org/abs/2502.09387v1

We introduce a professionally translated extension of the TruthfulQA benchmark designed to evaluate truthfulness in Basque, Catalan, Galician, and Spanish. Truthfulness evaluations of large language models (LLMs) have primarily been conducted in English. However, the ability of LLMs to maintain truthfulness across languages remains under-explored. Our study evaluates 12 state-of-the-art open LLMs, comparing base and instruction-tuned models using human evaluation, multiple-choice metrics, and LLM-as-a-Judge scoring. Our findings reveal that, while LLMs perform best in English and worst in Basque (the lowest-resourced language), overall truthfulness discrepancies across languages are smaller than anticipated. Furthermore, we show that LLM-as-a-Judge correlates more closely with human judgments than multiple-choice metrics, and that informativeness plays a critical role in truthfulness assessment. Our results also indicate that machine translation provides a viable approach for extending truthfulness benchmarks to additional languages, offering a scalable alternative to professional translation. Finally, we observe that universal knowledge questions are better handled across languages than context- and time-dependent ones, highlighting the need for truthfulness evaluations that account for cultural and temporal variability. Dataset and code are publicly available under open licenses.
### Citation

```text
@misc{figueras2025truthknowslanguageevaluating,
      title={Truth Knows No Language: Evaluating Truthfulness Beyond English},
      author={Blanca Calvo Figueras and Eneko Sagarzazu and Julen Etxaniz and Jeremy Barnes and Pablo Gamallo and Iria De Dios Flores and Rodrigo Agerri},
      year={2025},
      eprint={2502.09387},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2502.09387},
}
```
### Groups, Tags, and Tasks

#### Tags

* `truthfulqa_multi`: These tasks follow the [TruthfulQA dataset](https://arxiv.org/abs/2109.07958), but extend it to new languages.
#### Tasks

* `truthfulqa-multi_mc2_es`: `Multiple-choice, multiple answers in Spanish`
* `truthfulqa-multi_gen_es`: `Answer generation in Spanish`
* `truthfulqa-multi_mc2_ca`: `Multiple-choice, multiple answers in Catalan`
* `truthfulqa-multi_gen_ca`: `Answer generation in Catalan`
* `truthfulqa-multi_mc2_eu`: `Multiple-choice, multiple answers in Basque`
* `truthfulqa-multi_gen_eu`: `Answer generation in Basque`
* `truthfulqa-multi_mc2_gl`: `Multiple-choice, multiple answers in Galician`
* `truthfulqa-multi_gen_gl`: `Answer generation in Galician`
* `truthfulqa-multi_mc2_en`: `Multiple-choice, multiple answers in English`
* `truthfulqa-multi_gen_en`: `Answer generation in English`
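The `mc2` tasks inherit TruthfulQA's MC2 metric: the normalized probability mass the model assigns to the true reference answers for each question. A minimal sketch of that computation, assuming per-answer log-likelihoods are already in hand (the function and its toy inputs are illustrative, not harness code):

```python
import math

def mc2_score(true_lls, false_lls):
    """MC2: fraction of probability mass assigned to the true answers.

    true_lls / false_lls: log-likelihoods the model assigns to each
    true / false reference answer for a single question.
    """
    p_true = sum(math.exp(ll) for ll in true_lls)
    p_false = sum(math.exp(ll) for ll in false_lls)
    return p_true / (p_true + p_false)

# Toy example: the model clearly prefers the true answers.
print(round(mc2_score([-1.0, -2.0], [-3.0, -4.0]), 3))  # → 0.881
```

A score of 0.5 means the model spreads its mass evenly between true and false answers; the per-language `mc2` tasks average this score over all questions.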
### Checklist

For adding novel benchmarks/datasets to the library:

* [X] Is the task an existing benchmark in the literature?
* [X] Have you referenced the original paper that introduced the task?
* [X] If yes, does the original paper provide a reference implementation? If so, have you checked against the reference implementation and documented how to run such a test?

If other tasks on this dataset are already supported:

* [ ] Is the "Main" variant of this task clearly denoted?
* [ ] Have you provided a short sentence in a README on what each new variant adds / evaluates?
* [ ] Have you noted which, if any, published evaluation setups are matched by this variant?

### Changelog
Lines changed: 3 additions & 0 deletions
@@ -0,0 +1,3 @@
include: truthfulqa-multi_gen_common
task: truthfulqa-multi_gen_ca
dataset_name: ca
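Each per-language `gen` file only overrides `task` and `dataset_name`; everything else comes from the shared `truthfulqa-multi_gen_common` config, whose `until` strings cut the completion at the first question or paragraph boundary. A sketch of that truncation behavior (a hypothetical helper, not the harness's own implementation):

```python
def truncate_at(text, stops=("!\n\n", "Q:", ".\n\n")):
    """Return the text before the earliest stop sequence, mimicking
    the `until` behavior of generate_until tasks."""
    cut = len(text)
    for stop in stops:
        idx = text.find(stop)
        if idx != -1:
            cut = min(cut, idx)
    return text[:cut]

print(truncate_at("A: Madrid is the capital.\n\nQ: Next question"))
# → "A: Madrid is the capital"
```

Listing the stops as a YAML list (rather than a single string) matches the "changed until so it outputs lists" fix noted in the commit message.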
Lines changed: 69 additions & 0 deletions
@@ -0,0 +1,69 @@
tag:
  - truthfulqa_multi
dataset_path: HiTZ/truthfulqa-multi
output_type: generate_until
generation_kwargs:
  until:
    - "!\n\n"
    - "Q:"
    - ".\n\n"
training_split: train
validation_split: validation
test_split: null
doc_to_target: "{{'A: ' + best_answer}}"
fewshot_split: train
fewshot_config:
  sampler: first_n
process_docs: !function utils.process_docs_gen
process_results: !function utils.process_results_gen
doc_to_text: "{{'Q: ' + question}}"
should_decontaminate: True
doc_to_decontamination_query: question
metric_list:
  # - metric: bleurt_max
  #   aggregation: mean
  #   higher_is_better: true
  # - metric: bleurt_acc
  #   aggregation: mean
  #   higher_is_better: true
  # - metric: bleurt_diff
  #   aggregation: mean
  #   higher_is_better: true
  - metric: bleu_max
    aggregation: mean
    higher_is_better: true
  - metric: bleu_acc
    aggregation: mean
    higher_is_better: true
  - metric: bleu_diff
    aggregation: mean
    higher_is_better: true
  # - metric: rouge1_max
  #   aggregation: mean
  #   higher_is_better: true
  # - metric: rouge1_acc
  #   aggregation: mean
  #   higher_is_better: true
  # - metric: rouge1_diff
  #   aggregation: mean
  #   higher_is_better: true
  # - metric: rouge2_max
  #   aggregation: mean
  #   higher_is_better: true
  # - metric: rouge2_acc
  #   aggregation: mean
  #   higher_is_better: true
  # - metric: rouge2_diff
  #   aggregation: mean
  #   higher_is_better: true
  # - metric: rougeL_max
  #   aggregation: mean
  #   higher_is_better: true
  # - metric: rougeL_acc
  #   aggregation: mean
  #   higher_is_better: true
  # - metric: rougeL_diff
  #   aggregation: mean
  #   higher_is_better: true
metadata:
  version: 3.0
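The `process_results_gen` function referenced above (defined in the task's `utils.py`, not shown in this diff) is what produces `bleu_max`, `bleu_acc`, and `bleu_diff`: the completion is scored against the sets of correct and incorrect reference answers. A rough sketch of that max/acc/diff pattern, using a crude token-overlap score as a stand-in for sentence BLEU (names and the similarity measure are illustrative, not the harness implementation):

```python
def similarity(a: str, b: str) -> float:
    """Crude token-overlap score, standing in for sentence-level BLEU."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / max(len(ta | tb), 1)

def gen_metrics(completion, correct_answers, incorrect_answers):
    """max/acc/diff pattern used by TruthfulQA-style generation tasks:
    - *_max : best score against any correct answer
    - *_diff: margin over the best incorrect answer
    - *_acc : 1 if the completion is closer to a correct answer
    """
    best_true = max(similarity(completion, a) for a in correct_answers)
    best_false = max(similarity(completion, a) for a in incorrect_answers)
    return {
        "bleu_max": best_true,
        "bleu_diff": best_true - best_false,
        "bleu_acc": int(best_true > best_false),
    }

scores = gen_metrics(
    "The Earth orbits the Sun",
    correct_answers=["The Earth orbits the Sun once a year"],
    incorrect_answers=["The Moon is made of green cheese"],
)
# Here bleu_acc == 1: the completion overlaps the correct answer most.
```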
Lines changed: 3 additions & 0 deletions
@@ -0,0 +1,3 @@
include: truthfulqa-multi_gen_common
task: truthfulqa-multi_gen_en
dataset_name: en
Lines changed: 3 additions & 0 deletions
@@ -0,0 +1,3 @@
include: truthfulqa-multi_gen_common
task: truthfulqa-multi_gen_es
dataset_name: es
Lines changed: 3 additions & 0 deletions
@@ -0,0 +1,3 @@
include: truthfulqa-multi_gen_common
task: truthfulqa-multi_gen_eu
dataset_name: eu
Lines changed: 3 additions & 0 deletions
@@ -0,0 +1,3 @@
include: truthfulqa-multi_gen_common
task: truthfulqa-multi_gen_gl
dataset_name: gl
Lines changed: 3 additions & 0 deletions
@@ -0,0 +1,3 @@
include: truthfulqa-multi_mc_common
task: truthfulqa-multi_mc1_ca
dataset_name: ca
Lines changed: 3 additions & 0 deletions
@@ -0,0 +1,3 @@
include: truthfulqa-multi_mc_common
task: truthfulqa-multi_mc1_en
dataset_name: en
