Commit 9bbca97

Author: Griffin Adams
Update README.md to show how to add a new benchmark task
1 parent 87f8769

1 file changed: README.md (+59 −0 lines)

Here, we’ve decided to filter out tokens [in the middle](https://arxiv.org/abs/2307.03172) to fit our cache budget.

### Adding a New Task

We support several long-context benchmarks and are continually adding more tasks.

Adding a new task is simple. We’ll walk through an example of adding the popular CNN/DailyMail summarization dataset. In `task.py`, we create a new class called `CNNDailyMail` that inherits from the abstract `EvaluationTask`.

We need to provide a `DEFAULT_PROMPT_TEMPLATE` that will be used to prompt LLMs when evaluating this task. We also set a `max_tokens` limit of 256 since we want concise summaries. This argument is important because, together with the input prompt length, it determines the maximum size of the cache.

Since our dataset is on HuggingFace, we can simply pass `hf_args` during initialization, and the base class will fetch the dataset from HuggingFace datasets. If your dataset is not on HuggingFace, you can always override the `_download()` method to fetch it from elsewhere (a sketch follows the class definition below).

The final thing we need to do during initialization is specify the metrics for this task. In this case, we use ROUGE, the most commonly used metric for summarization.

```
class CNNDailyMail(EvaluationTask):
    DEFAULT_PROMPT_TEMPLATE = """You will be shown a news article. Your task is to carefully read the article and summarize the main points concisely.
====NEWS ARTICLE====
{article}
"""

    def __init__(
        self, prompt_template=DEFAULT_PROMPT_TEMPLATE, max_tokens=256, **kwargs
    ):
        super().__init__(
            prompt_template,
            max_tokens,
            hf_args=["abisee/cnn_dailymail", "3.0.0"],
            **kwargs,
        )

        # See full list of metrics in metric.py
        self.metrics = {
            "Rouge": AutoMetric.from_name("rouge"),
        }
```

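If your dataset is not hosted on HuggingFace, a minimal `_download()` override might look like the sketch below. Treat everything beyond the method name as an assumption: the class, URL, and JSONL format are hypothetical, and the exact signature and return convention should be checked against the `EvaluationTask` base class in `task.py`.

```
import json
import urllib.request


class MyJsonlTask(EvaluationTask):
    # Hypothetical URL for illustration; point this at your dataset.
    DATA_URL = "https://example.com/my_dataset.jsonl"

    def _download(self):
        # Fetch a JSONL file and parse one row dict per line.
        # (Verify against task.py whether _download should return the
        # rows or store them on the instance.)
        with urllib.request.urlopen(self.DATA_URL) as response:
            lines = response.read().decode("utf-8").splitlines()
        return [json.loads(line) for line in lines if line.strip()]
```
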
Next, we need to implement the abstract method `prepare_row`, which processes a single item in the dataset and returns either a single dict or a list of dicts (a sketch of the multi-dict case follows the example below). The returned dict must contain the keys `context`, `question`, `prompt`, and `labels`. Not every task has a distinct question per sample; ours does not, so we simply provide a dummy question. The `prompt` is what will be used to prompt the LLM for a response, and the `labels` are used to score the model’s generation.

```
def prepare_row(self, row: dict):
    article = row["article"]
    highlights = row["highlights"]
    prompt = self.prompt_template.format(article=article)

    return {
        "context": article,
        "question": "Summarize the main points concisely.",
        "prompt": prompt,
        "labels": highlights,
    }
```

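As mentioned above, `prepare_row` can also return a list of dicts when a single dataset row expands into multiple evaluation samples. Here is a hedged sketch of that case, assuming a hypothetical QA-style row with parallel `questions` and `answers` fields and a prompt template with matching placeholders:

```
def prepare_row(self, row: dict):
    # Hypothetical multi-question row: emit one dict per question.
    context = row["context"]
    return [
        {
            "context": context,
            "question": question,
            "prompt": self.prompt_template.format(
                context=context, question=question
            ),
            "labels": answer,
        }
        for question, answer in zip(row["questions"], row["answers"])
    ]
```
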
Finally, we need to register the task by adding it to the `TASK_MAPPING` dictionary with a friendly string label, as shown below:

```
TASK_MAPPING = {
    ...
    "cnndm": CNNDailyMail,
}
```

**That’s it!** The task is now available as a command-line arg via `python eval.py --tasks cnndm`.

# Getting Involved

We'd **love** for you to get involved and collectively aim to improve `Cold Compress` for future releases.
