Commit e0128c9

New Task: Add CNN-DailyMail (3.0.0) (#3426)
* feat: add cnn_dailymail dataset
* Add CNN-DailyMail task
* pacify pre-commit
* add to task table

Co-authored-by: Baber <[email protected]>
1 parent 08495a0 commit e0128c9

4 files changed (+361, -2 lines)

lm_eval/tasks/README.md

Lines changed: 3 additions & 2 deletions
@@ -45,6 +45,7 @@ provided to the individual README.md files for each subfolder.
 | [ceval](ceval/README.md) | Tasks that evaluate language understanding and reasoning in an educational context. | Chinese |
 | [cmmlu](cmmlu/README.md) | Multi-subject multiple choice question tasks for comprehensive academic assessment. | Chinese |
 | code_x_glue | Tasks that involve understanding and generating code across multiple programming languages. | Go, Java, JS, PHP, Python, Ruby |
+| [cnn_dailymail_abisee](cnn_dailymail/README.md) | Task designed to measure the ability to generate multi-sentence abstractive summaries. | English |
 | [commonsense_qa](commonsense_qa/README.md) | CommonsenseQA, a multiple-choice QA dataset for measuring commonsense knowledge. | English |
 | [copal_id](copal_id/README.md) | Indonesian causal commonsense reasoning dataset that captures local nuances. | Indonesian |
 | [coqa](coqa/README.md) | Conversational question answering tasks to test dialog understanding. | English |
@@ -81,7 +82,7 @@ provided to the individual README.md files for each subfolder.
 | [hellaswag](hellaswag/README.md) | Tasks to predict the ending of stories or scenarios, testing comprehension and creativity. | English |
 | [hendrycks_ethics](hendrycks_ethics/README.md) | Tasks designed to evaluate the ethical reasoning capabilities of models. | English |
 | [hendrycks_math](hendrycks_math/README.md) | Mathematical problem-solving tasks to test numerical reasoning and problem-solving. | English |
-| [histoires_morales](histoires_morales/README.md) | Evaluation tasks designed to assess moral judgment and ethical reasoning in models within narrative contexts, complementing the single-sentence ETHICS tasks. | French (Some MT) |
+| [histoires_morales](histoires_morales/README.md) | Evaluation tasks designed to assess moral judgment and ethical reasoning in models within narrative contexts, complementing the single-sentence ETHICS tasks. | French (Some MT) |
 | [hrm8k](hrm8k/README.md) | A challenging bilingual math reasoning benchmark for Korean and English. | Korean (Some MT), English (Some MT) |
 | [humaneval](humaneval/README.md) | Code generation task that measure functional correctness for synthesizing programs from docstrings. | Python |
 | [humaneval_infilling](humaneval_infilling/README.md) | Code generation task that measure fill-in-the-middle capability for synthesizing programs from docstrings. | Python |
@@ -131,7 +132,7 @@ provided to the individual README.md files for each subfolder.
 | [mmlu_prox](mmlu_prox/README.md) | A multilingual benchmark that extends MMLU-Pro to multiple typologically diverse languages with human validation. | English, Japanese, Chinese, Korean, French, German, Spanish, Portuguese, Zulu, Swahili, Wolof, Yoruba, Thai, Arabic, Hindi, Bengali, Serbian, Hungarian, Vietnamese, Czech, Marathi, Afrikaans, Nepali, Telugu, Urdu, Russian, Indonesian, Italian, Ukrainian |
 | [mmlusr](mmlusr/README.md) | Variation of MMLU designed to be more rigorous. | English |
 | model_written_evals | Evaluation tasks auto-generated for evaluating a collection of AI Safety concerns. | |
-| [moral_stories](moral_stories/README.md) | Evaluation tasks designed to assess moral judgment and ethical reasoning in models within narrative contexts, complementing the single-sentence ETHICS tasks. | English |
+| [moral_stories](moral_stories/README.md) | Evaluation tasks designed to assess moral judgment and ethical reasoning in models within narrative contexts, complementing the single-sentence ETHICS tasks. | English |
 | [mts_dialog](mts_dialog/README.md) | Open-ended healthcare QA from the MTS-Dialog dataset. | English |
 | [multiblimp](multiblimp/README.md) | MultiBLiMP is a (synthetic) multilingual benchmark testing models on linguistic minimal pairs to judge grammatical acceptability | Multiple (101 languages) - Synthetic |
 | [mutual](mutual/README.md) | A retrieval-based dataset for multi-turn dialogue reasoning. | English |
lm_eval/tasks/cnn_dailymail/README.md

Lines changed: 45 additions & 0 deletions
# CNN-DailyMail

## Paper
Teaching Machines to Read and Comprehend: https://arxiv.org/abs/1506.03340

The CNN/DailyMail dataset is an English-language dataset containing just over 300k unique news articles written by journalists at CNN and the Daily Mail. It is widely used for abstractive text summarization. This task uses the non-anonymized version 3.0.0.

Homepage: https://huggingface.co/datasets/cnn_dailymail
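For a quick look at the data, the evaluation split can be loaded directly from the Hugging Face Hub; the snippet below is a minimal sketch assuming the `datasets` library is installed:

```python
# Minimal sketch: inspect one test example (assumes the `datasets` package).
from datasets import load_dataset

ds = load_dataset("abisee/cnn_dailymail", "3.0.0", split="test")
example = ds[0]
print(example["article"][:300])   # source news article (truncated for display)
print(example["highlights"])      # reference multi-sentence summary
```

Each record carries an `article` field (the source story) and a `highlights` field (the reference summary); the task maps these to the prompt and the target, respectively.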
## Citation
```
@inproceedings{hermann2015teaching,
  title={Teaching Machines to Read and Comprehend},
  author={Hermann, Karl Moritz and Kocisky, Tomas and Grefenstette, Edward and Espeholt, Lasse and Kay, Will and Suleyman, Mustafa and Blunsom, Phil},
  booktitle={Advances in Neural Information Processing Systems},
  year={2015}
}
```

## Groups and Tasks

### Groups
* `summarization`

### Tasks
* `cnn_dailymail_abisee`: the version 3.0.0 (non-anonymized) dataset. Evaluates models on their ability to generate multi-sentence abstractive summaries, using a zero-shot prompt ("Summarize the following article...") and scoring with ROUGE (1/2/L) and BERTScore (P/R/F1); see the usage example below.
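A quick way to run the task end to end is via the harness's Python API; this is a minimal sketch, and the model below is only an illustrative placeholder:

```python
# Minimal sketch: smoke-test the task on a small subset with a small HF model.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=EleutherAI/pythia-160m",  # placeholder model for illustration
    tasks=["cnn_dailymail_abisee"],
    num_fewshot=0,
    limit=10,  # evaluate only 10 documents for a quick check
)
print(results["results"]["cnn_dailymail_abisee"])
```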
## Checklist
For adding novel benchmarks/datasets to the library:

* [x] Is the task an existing benchmark in the literature?

* [x] Have you referenced the original paper that introduced the task?

* [ ] If yes, does the original paper provide a reference implementation? If so, have you checked against the reference implementation and documented how to run such a test?

* The task uses the standard 3.0.0 (non-anonymized) configuration from the Hugging Face Hub and evaluates on the test split. The implementation uses a standard zero-shot prompt: "Summarize the following article:\n\n{{article}}\n\nSummary:" (see the rendering example below).

* Evaluation metrics include ROUGE (1, 2, L) and BERTScore (Precision, Recall, F1), matching standard reporting practices for abstractive summarization.

If other tasks on this dataset are already supported:

* [x] Is the "Main" variant of this task clearly denoted?

* [x] Have you provided a short sentence in a README on what each new variant adds / evaluates?

* [x] Have you noted which, if any, published evaluation setups are matched by this variant?
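To sanity-check the prompt format described above, the `doc_to_text` Jinja template can be rendered against a single test document; a small sketch assuming the `jinja2` and `datasets` packages:

```python
# Sketch: render the task's prompt template for one test article.
from datasets import load_dataset
from jinja2 import Template

doc = load_dataset("abisee/cnn_dailymail", "3.0.0", split="test")[0]
prompt = Template(
    "Summarize the following article:\n\n{{article}}\n\nSummary:"
).render(article=doc["article"])
print(prompt[:400])  # the article is long; print only the beginning
```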
Lines changed: 34 additions & 0 deletions
task: cnn_dailymail_abisee
dataset_path: abisee/cnn_dailymail
dataset_name: "3.0.0"
output_type: generate_until
test_split: test
doc_to_text: "Summarize the following article:\n\n{{article}}\n\nSummary:"
doc_to_target: "{{highlights}}"
process_results: !function utils.process_results
metric_list:
  - metric: rouge1
    aggregation: mean
    higher_is_better: true
  - metric: rouge2
    aggregation: mean
    higher_is_better: true
  - metric: rougeL
    aggregation: mean
    higher_is_better: true
  - metric: bertscore_precision
    aggregation: mean
    higher_is_better: true
  - metric: bertscore_recall
    aggregation: mean
    higher_is_better: true
  - metric: bertscore_f1
    aggregation: mean
    higher_is_better: true
generation_kwargs:
  max_gen_toks: 128
  do_sample: false
num_fewshot: 0
metadata:
  version: 1.0
description: "CNN/DailyMail dataset for abstractive summarization"
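Per-example scoring is delegated to `utils.process_results`, which is not shown in this excerpt. A hypothetical sketch of such a function, assuming the `rouge_score` and `bert_score` packages and the harness convention that `process_results(doc, results)` returns a dict keyed by the metric names in `metric_list`:

```python
# Hypothetical sketch of utils.process_results; the committed implementation may differ.
# Assumes the `rouge_score` and `bert_score` packages are installed.
from rouge_score import rouge_scorer
import bert_score

_rouge = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)


def process_results(doc, results):
    prediction = results[0].strip()         # generated summary
    reference = doc["highlights"].strip()   # gold multi-sentence summary

    rouge = _rouge.score(reference, prediction)  # dict of Score(precision, recall, fmeasure)
    # bert_score.score takes lists of candidates and references and returns (P, R, F1) tensors
    p, r, f1 = bert_score.score([prediction], [reference], lang="en", verbose=False)

    return {
        "rouge1": rouge["rouge1"].fmeasure,
        "rouge2": rouge["rouge2"].fmeasure,
        "rougeL": rouge["rougeL"].fmeasure,
        "bertscore_precision": p.item(),
        "bertscore_recall": r.item(),
        "bertscore_f1": f1.item(),
    }
```

Returning the ROUGE F-measures and all three BERTScore components keeps the keys aligned with the YAML's `metric_list`, so the harness can apply the declared `mean` aggregation directly.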
