Commit de496b8

Add eqbench tasks in Spanish and Catalan (#3168)
* Add eqbench tasks in Spanish and Catalan
* Incremented catalan_bench and spanish_bench versions. Added a 'multilingual' folder inside 'eq_bench' and moved the eqbench_ca and eqbench_es .yaml files to that folder. Updated the tasks README with eqbench_es and eqbench_ca, stating in each description both the Hugging Face link and the translation method.
* Fixed tasks table.
* Remove test_task.sh and results folder
* Add utils.py to multilingual folder
1 parent a4752cc commit de496b8

6 files changed: +99 -1 lines changed

lm_eval/tasks/README.md

Lines changed: 2 additions & 0 deletions
@@ -7,6 +7,8 @@ provided to the individual README.md files for each subfolder.
77

88
| Task Family | Description | Language(s) |
99
|--------------------------------------------------------------------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|-------------------------------------------------------------------------------------------------------------------------------|
10+
| [eq-bench_es](eq_bench/README.md) | Spanish version of EQ-Bench (EN). Task for evaluating emotional reasoning through dialogue-based prompts. [Hugging Face](https://huggingface.co/datasets/BSC-LT/EQ-bench_es) |Spanish **Human Translated** |
11+
| [eq-bench_ca](eq_bench/README.md) | Catalan version of EQ-Bench (EN). Task for evaluating emotional reasoning through dialogue-based prompts. [Hugging Face](https://huggingface.co/datasets/BSC-LT/EQ-bench_ca)| Catalan **Human Translated** |
1012
| [aclue](aclue/README.md) | Tasks focusing on ancient Chinese language understanding and cultural aspects. | Ancient Chinese |
1113
| [acp_bench](acpbench/README.md) | Tasks evaluating the reasoning ability about Action, Change, and Planning | English |
1214
| [acp_bench_hard](acpbench/README.md) | Tasks evaluating the reasoning ability about Action, Change, and Planning | English |

lm_eval/tasks/catalan_bench/catalan_bench.yaml

Lines changed: 1 addition & 0 deletions
@@ -6,6 +6,7 @@ task:
   - copa_ca
   - openbookqa_ca
   - parafraseja
+  - eqbench_ca
   - paws_ca
   - piqa_ca
   - siqa_ca
Lines changed: 20 additions & 0 deletions
@@ -0,0 +1,20 @@
+task: eqbench_ca
+dataset_path: BSC-LT/EQ-bench_ca
+output_type: generate_until
+validation_split: test
+doc_to_text: prompt
+doc_to_target: reference_answer_fullscale
+process_results: !function utils.calculate_score_fullscale
+generation_kwargs:
+  do_sample: false
+  temperature: 0.0
+  max_gen_toks: 80
+metric_list:
+  - metric: eqbench
+    aggregation: mean
+    higher_is_better: true
+  - metric: percent_parseable
+    aggregation: mean
+    higher_is_better: true
+metadata:
+  version: 1.0
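
Note on `doc_to_target: reference_answer_fullscale`: the scoring function in utils.py reads this field as a string and turns it into a dict with `eval`. The sketch below shows one way that step could look with `ast.literal_eval` instead. It assumes the field holds a Python dict literal with the keys that utils.py indexes (`emotion1` to `emotion4`, `emotion1_score` to `emotion4_score`). The emotion names and scores are invented for illustration and are not taken from the dataset.

```python
import ast

# Hypothetical value of the `reference_answer_fullscale` field. The key layout
# follows what utils.py indexes; the emotion names and scores are invented.
raw_reference = (
    "{'emotion1': 'Alegria', 'emotion1_score': 2, "
    "'emotion2': 'Tristesa', 'emotion2_score': 7, "
    "'emotion3': 'Ira', 'emotion3_score': 0, "
    "'emotion4': 'Sorpresa', 'emotion4_score': 5}"
)

# ast.literal_eval accepts only Python literals, so it cannot execute
# arbitrary expressions the way eval can.
reference = ast.literal_eval(raw_reference)

# Flatten the numbered keys into {emotion_name: reference_score}.
expected = {
    reference[f"emotion{i}"]: float(reference[f"emotion{i}_score"])
    for i in range(1, 5)
}
print(expected)
```

This is a sketch of an alternative, not what the commit does: the committed utils.py calls `eval` directly on the field.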
Lines changed: 20 additions & 0 deletions
@@ -0,0 +1,20 @@
+task: eqbench_es
+dataset_path: BSC-LT/EQ-bench_es
+output_type: generate_until
+validation_split: test
+doc_to_text: prompt
+doc_to_target: reference_answer_fullscale
+process_results: !function utils.calculate_score_fullscale
+generation_kwargs:
+  do_sample: false
+  temperature: 0.0
+  max_gen_toks: 80
+metric_list:
+  - metric: eqbench
+    aggregation: mean
+    higher_is_better: true
+  - metric: percent_parseable
+    aggregation: mean
+    higher_is_better: true
+metadata:
+  version: 1.0
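
Note on `process_results`: both configs send the generation to `utils.calculate_score_fullscale`, which extracts `name: score` pairs with the regex `(\w+):\s+(\d+)`. The sketch below isolates that parsing step to show what it accepts. The helper name `parse_emotion_scores` and the sample responses are hypothetical. Only the pattern itself comes from the commit.

```python
import re

# Same pattern that utils.calculate_score_fullscale applies to the generation.
SCORE_PATTERN = re.compile(r"(\w+):\s+(\d+)")


def parse_emotion_scores(response: str) -> dict:
    """Return {emotion: score_string} pairs found in a model response."""
    return dict(SCORE_PATTERN.findall(response))


# Hypothetical Spanish response; the emotion names are illustrative only.
response = "Alegría: 2\nTristeza: 7\nIra: 0\nSorpresa: 5"
parsed = parse_emotion_scores(response)
print(parsed)

# `\s+` requires whitespace after the colon, so this yields no pairs.
print(parse_emotion_scores("Ira:7"))

# `\w+` stops at spaces, so only the last word of a multi-word label is kept.
print(parse_emotion_scores("Muy feliz: 3"))
```

In Python 3, `\w` matches Unicode word characters in `str` patterns, so accented names such as `Alegría` are captured whole. A response that omits the space after the colon or uses a multi-word label would not match the reference names and would score as unparseable.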
Lines changed: 54 additions & 0 deletions
@@ -0,0 +1,54 @@
+import math
+import re
+
+
+def calculate_score_fullscale(docs, results):
+    reference = eval(docs["reference_answer_fullscale"])
+    user = dict(re.findall(r"(\w+):\s+(\d+)", results[0]))
+    # First check that the emotions specified in the answer match those in the reference
+    if len(user.items()) != 4:
+        # print('! Error: 4 emotions were not returned')
+        # print(user)
+        return {"eqbench": 0, "percent_parseable": 0}
+    emotions_dict = {}
+    for emotion, user_emotion_score in user.items():
+        for i in range(1, 5):
+            if emotion == reference[f"emotion{i}"]:
+                emotions_dict[emotion] = True
+    if len(emotions_dict) != 4:
+        print("! Error: emotions did not match reference")
+        print(user)
+        return {"eqbench": 0, "percent_parseable": 0}
+
+    difference_tally = (
+        0  # Tally of difference from reference answers for this question
+    )
+
+    # Iterate over each emotion in the user's answers.
+    for emotion, user_emotion_score in user.items():
+        # If this emotion is in the reference, calculate the difference between the user's score and the reference score.
+        for i in range(1, 5):
+            if emotion == reference[f"emotion{i}"]:
+                d = abs(
+                    float(user_emotion_score) - float(reference[f"emotion{i}_score"])
+                )
+                # this will be a value between 0 and 10
+                if d == 0:
+                    scaled_difference = 0
+                elif d <= 5:
+                    # S-shaped scaling function
+                    # https://www.desmos.com/calculator
+                    # 6.5\cdot\ \frac{1}{\left(1\ +\ e^{\left(-1.2\cdot\left(x-4\right)\right)}\right)}
+                    scaled_difference = 6.5 * (1 / (1 + math.e ** (-1.2 * (d - 4))))
+
+                else:
+                    scaled_difference = d
+                difference_tally += scaled_difference
+
+    # Inverting the difference tally so that the closer the answer is to reference, the higher the score.
+    # The adjustment constant is chosen such that answering randomly produces a score of zero.
+    adjust_const = 0.7477
+    final_score = 10 - (difference_tally * adjust_const)
+    final_score_percent = final_score * 10
+
+    return {"eqbench": final_score_percent, "percent_parseable": 100}
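
Note on the scoring curve: the per-emotion penalty in the function above is piecewise (zero for an exact match, a logistic curve up to a difference of 5, linear beyond that). The sketch below restates just that arithmetic so it can be tried on its own. The helper names `scale_difference` and `eqbench_percent` are hypothetical, and `math.exp` is used in place of `math.e ** ...`, which should agree up to floating-point rounding.

```python
import math


def scale_difference(d: float) -> float:
    """Penalty for an absolute score difference d, following the branches in utils.py."""
    if d == 0:
        return 0.0
    if d <= 5:
        # Logistic curve: small differences are penalised far less than large ones.
        return 6.5 / (1 + math.exp(-1.2 * (d - 4)))
    return float(d)


def eqbench_percent(differences, adjust_const: float = 0.7477) -> float:
    """Combine the four per-emotion differences into the reported score."""
    tally = sum(scale_difference(d) for d in differences)
    return (10 - tally * adjust_const) * 10


# Print the penalty at a few differences, including both sides of the d = 5 boundary.
for d in (0, 1, 4, 5, 6):
    print(d, round(scale_difference(d), 3))

# A perfect answer has zero difference on all four emotions.
print(eqbench_percent([0, 0, 0, 0]))
```

At d = 5 the logistic branch evaluates to roughly 4.995, so it meets the linear branch almost continuously. Because the tally is subtracted from 10 before scaling, a response that is far from the reference on every emotion produces a negative score.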

lm_eval/tasks/spanish_bench/spanish_bench.yaml

Lines changed: 2 additions & 1 deletion
@@ -11,8 +11,9 @@ task:
   - xlsum_es
   - paws_es_spanish_bench
   - mgsm_direct_es_spanish_bench
+  - eqbench_es
   - flores_es
   - phrases_es
   - cocoteros_es
 metadata:
-  version: 1.0
+  version: 1.1
