**`lm_eval/tasks/README.md`**
@@ -45,6 +45,7 @@ provided to the individual README.md files for each subfolder.
|[ceval](ceval/README.md)| Tasks that evaluate language understanding and reasoning in an educational context. | Chinese |
|[cmmlu](cmmlu/README.md)| Multi-subject multiple choice question tasks for comprehensive academic assessment. | Chinese |
| code_x_glue | Tasks that involve understanding and generating code across multiple programming languages. | Go, Java, JS, PHP, Python, Ruby |
+|[cnn_dailymail_abisee](cnn_dailymail/README.md)| Task designed to measure the ability to generate multi-sentence abstractive summaries. | English |
|[commonsense_qa](commonsense_qa/README.md)| CommonsenseQA, a multiple-choice QA dataset for measuring commonsense knowledge. | English |
|[copal_id](copal_id/README.md)| Indonesian causal commonsense reasoning dataset that captures local nuances. | Indonesian |
|[coqa](coqa/README.md)| Conversational question answering tasks to test dialog understanding. | English |
@@ -81,7 +82,7 @@ provided to the individual README.md files for each subfolder.
|[hellaswag](hellaswag/README.md)| Tasks to predict the ending of stories or scenarios, testing comprehension and creativity. | English |
|[hendrycks_ethics](hendrycks_ethics/README.md)| Tasks designed to evaluate the ethical reasoning capabilities of models. | English |
|[hendrycks_math](hendrycks_math/README.md)| Mathematical problem-solving tasks to test numerical reasoning and problem-solving. | English |
|[histoires_morales](histoires_morales/README.md)| Evaluation tasks designed to assess moral judgment and ethical reasoning in models within narrative contexts, complementing the single-sentence ETHICS tasks. | French (Some MT) |
|[hrm8k](hrm8k/README.md)| A challenging bilingual math reasoning benchmark for Korean and English. | Korean (Some MT), English (Some MT) |
|[humaneval](humaneval/README.md)| Code generation task that measures functional correctness for synthesizing programs from docstrings. | Python |
|[humaneval_infilling](humaneval_infilling/README.md)| Code generation task that measures fill-in-the-middle capability for synthesizing programs from docstrings. | Python |
@@ -131,7 +132,7 @@ provided to the individual README.md files for each subfolder.
|[mmlu_prox](mmlu_prox/README.md)| A multilingual benchmark that extends MMLU-Pro to multiple typologically diverse languages with human validation. | English, Japanese, Chinese, Korean, French, German, Spanish, Portuguese, Zulu, Swahili, Wolof, Yoruba, Thai, Arabic, Hindi, Bengali, Serbian, Hungarian, Vietnamese, Czech, Marathi, Afrikaans, Nepali, Telugu, Urdu, Russian, Indonesian, Italian, Ukrainian |
|[mmlusr](mmlusr/README.md)| Variation of MMLU designed to be more rigorous. | English |
| model_written_evals | Evaluation tasks auto-generated for evaluating a collection of AI Safety concerns. ||
|[moral_stories](moral_stories/README.md)| Evaluation tasks designed to assess moral judgment and ethical reasoning in models within narrative contexts, complementing the single-sentence ETHICS tasks. | English |
|[mts_dialog](mts_dialog/README.md)| Open-ended healthcare QA from the MTS-Dialog dataset. | English |
|[multiblimp](multiblimp/README.md)| MultiBLiMP is a (synthetic) multilingual benchmark testing models on linguistic minimal pairs to judge grammatical acceptability. | Multiple (101 languages) - Synthetic |
|[mutual](mutual/README.md)| A retrieval-based dataset for multi-turn dialogue reasoning. | English |
**`lm_eval/tasks/cnn_dailymail/README.md`**

Teaching Machines to Read and Comprehend: https://arxiv.org/abs/1506.03340
The CNN/DailyMail dataset is an English-language dataset containing just over 300k unique news articles as written by journalists at CNN and the Daily Mail. It is widely used for abstractive text summarization tasks. This task uses the non-anonymized Version 3.0.0.
## Citation

```
@inproceedings{hermann2015teaching,
title={Teaching Machines to Read and Comprehend},
author={Hermann, Karl Moritz and Kocisky, Tomas and Grefenstette, Edward and Espeholt, Lasse and Kay, Will and Suleyman, Mustafa and Blunsom, Phil},
booktitle={Advances in Neural Information Processing Systems},
year={2015}
}
```
## Groups and Tasks
### Groups
* `summarization`
### Tasks
* `cnn_dailymail`: The Version 3.0.0 (non-anonymized) dataset. It evaluates models on their ability to generate multi-sentence abstractive summaries. It uses a zero-shot prompt format (Summarize the following article...) and evaluates using ROUGE (1/2/L) and BERTScore (P/R/F1).
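As a usage illustration only (not part of the task definition), the sketch below shows one way to run this task through the harness's Python API, assuming the task is registered as `cnn_dailymail` as listed above; the checkpoint name is just a placeholder.

```python
# Sketch: run the cnn_dailymail task end to end via lm_eval's Python API.
# The pretrained checkpoint below is only a placeholder example.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",                                       # Hugging Face backend
    model_args="pretrained=EleutherAI/pythia-160m",   # placeholder checkpoint
    tasks=["cnn_dailymail"],
    batch_size=1,
)
print(results["results"]["cnn_dailymail"])            # ROUGE and BERTScore metrics
```

The same evaluation can also be launched from the command line, e.g. `lm_eval --model hf --model_args pretrained=<checkpoint> --tasks cnn_dailymail`.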
## Checklist
For adding novel benchmarks/datasets to the library:
* [x] Is the task an existing benchmark in the literature?
* [x] Have you referenced the original paper that introduced the task?
* [ ] If yes, does the original paper provide a reference implementation? If so, have you checked against the reference implementation and documented how to run such a test?
* The task uses the standard 3.0.0 (non-anonymized) configuration from HuggingFace. The implementation uses a standard zero-shot prompt: "Summarize the following article:\n\n{{article}}\n\nSummary:" (see the sketch after this list).
* Evaluation metrics include ROUGE (1, 2, L) and BERTScore (Precision, Recall, F1), matching standard reporting practices for abstractive summarization.
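The snippet below is a small, purely illustrative sketch of the prompt construction and metric computation described in the two points above. It is not the harness's internal implementation; it assumes the `datasets`, `rouge_score`, and `bert_score` packages are installed, and the model prediction is a placeholder string.

```python
# Illustrative sketch of the data, prompt template, and metrics described above.
from datasets import load_dataset
from rouge_score import rouge_scorer
import bert_score

# Standard non-anonymized 3.0.0 configuration from the Hugging Face Hub.
ds = load_dataset("cnn_dailymail", "3.0.0", split="validation")
example = ds[0]

# Zero-shot prompt format described above.
prompt = f"Summarize the following article:\n\n{example['article']}\n\nSummary:"
reference = example["highlights"]  # gold multi-sentence summary

# Placeholder model output; in practice this comes from the evaluated model.
prediction = "The article describes recent developments reported by CNN."

# ROUGE-1 / ROUGE-2 / ROUGE-L.
scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
rouge = scorer.score(reference, prediction)

# BERTScore precision / recall / F1.
P, R, F1 = bert_score.score([prediction], [reference], lang="en")

print({k: v.fmeasure for k, v in rouge.items()}, P.item(), R.item(), F1.item())
```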
If other tasks on this dataset are already supported:
* [x] Is the "Main" variant of this task clearly denoted?
* [x] Have you provided a short sentence in a README on what each new variant adds / evaluates?
* [x] Have you noted which, if any, published evaluation setups are matched by this variant?