Commit 9bbca97

Author: Griffin Adams
Update README.md to show how to add a new benchmark task
1 parent 87f8769

1 file changed: README.md (+59 −0 lines)

Here, we’ve decided to filter out tokens [in the middle](https://arxiv.org/abs/2307.03172) to fit our cache budget.

### Adding a New Task

We support several long-context benchmarks and are continually adding more tasks.

Adding a new task is simple. We’ll walk through an example of adding the popular CNN/DailyMail summarization dataset. In `task.py`, we create a new class called `CNNDailyMail` that inherits from the abstract `EvaluationTask`.

We need to provide a `DEFAULT_PROMPT_TEMPLATE` that will be used to prompt LLMs when evaluating this task. We also set a `max_tokens` limit of 256 since we want concise summaries. This argument is important because, together with the input prompt length, it determines the maximum size of the cache.

Since our dataset is on HuggingFace, we can simply pass `hf_args` during initialization, and the base class will fetch the dataset from HuggingFace datasets. If your dataset is not on HuggingFace, you can always override the `_download()` method to fetch it from elsewhere (a sketch follows the class definition below).

The final thing we need to do during initialization is specify the metrics for this task. In this case, we use ROUGE, the most commonly used metric for summarization.

```
class CNNDailyMail(EvaluationTask):
    DEFAULT_PROMPT_TEMPLATE = """You will be shown a news article. Your task is to carefully read the article and summarize the main points concisely.
====NEWS ARTICLE====
{article}
"""

    def __init__(
        self, prompt_template=DEFAULT_PROMPT_TEMPLATE, max_tokens=256, **kwargs
    ):
        super().__init__(
            prompt_template,
            max_tokens,
            hf_args=["abisee/cnn_dailymail", "3.0.0"],
            **kwargs,
        )

        # See full list of metrics in metric.py
        self.metrics = {
            "Rouge": AutoMetric.from_name("rouge"),
        }
```

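If your dataset is not hosted on HuggingFace, a minimal `_download()` override might look like the sketch below. Treat everything beyond the method name as an assumption: the class, URL, and JSONL format are hypothetical, and the exact signature and return convention should be checked against the `EvaluationTask` base class in `task.py`.

```
import json
import urllib.request


class MyJsonlTask(EvaluationTask):
    # Hypothetical URL for illustration; point this at your dataset.
    DATA_URL = "https://example.com/my_dataset.jsonl"

    def _download(self):
        # Fetch a JSONL file and parse one row dict per line.
        # (Verify against task.py whether _download should return the
        # rows or store them on the instance.)
        with urllib.request.urlopen(self.DATA_URL) as response:
            lines = response.read().decode("utf-8").splitlines()
        return [json.loads(line) for line in lines if line.strip()]
```
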
Next, we need to implement the abstract method `prepare_row`, which processes a single item in the dataset and returns either a single dict or a list of dicts (a sketch of the multi-dict case follows the example below). The returned dict must contain the keys `context`, `question`, `prompt`, and `labels`. Not every task has a distinct question per sample; ours does not, so we simply provide a dummy question. The `prompt` is what will be used to prompt the LLM for a response, and the `labels` are used to score the model’s generation.

```
def prepare_row(self, row: dict):
    article = row["article"]
    highlights = row["highlights"]
    prompt = self.prompt_template.format(article=article)

    return {
        "context": article,
        "question": "Summarize the main points concisely.",
        "prompt": prompt,
        "labels": highlights,
    }
```

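As mentioned above, `prepare_row` can also return a list of dicts when a single dataset row expands into multiple evaluation samples. Here is a hedged sketch of that case, assuming a hypothetical QA-style row with parallel `questions` and `answers` fields and a prompt template with matching placeholders:

```
def prepare_row(self, row: dict):
    # Hypothetical multi-question row: emit one dict per question.
    context = row["context"]
    return [
        {
            "context": context,
            "question": question,
            "prompt": self.prompt_template.format(
                context=context, question=question
            ),
            "labels": answer,
        }
        for question, answer in zip(row["questions"], row["answers"])
    ]
```
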
Finally, we need to register the task by adding it to the `TASK_MAPPING` dictionary with a friendly string label, as shown below:

```
TASK_MAPPING = {
    ...
    "cnndm": CNNDailyMail,
}
```

**That’s it!** The task is now available as a command-line arg via `python eval.py --tasks cnndm`.

# Getting Involved

We'd **love** for you to get involved and collectively aim to improve `Cold Compress` for future releases.
