diff --git a/README.md b/README.md index 2dde05a586f..2aaa45137c5 100644 --- a/README.md +++ b/README.md @@ -21,7 +21,7 @@ pip install lm-eval ## Basic Usage -To evaluate a model, (e.g. GPT-2) on NLU tasks (e.g. LAMBADA, HellaSwag), you can run the following command. +To evaluate a model, (e.g. GPT-2) on NLU tasks (e.g. LAMBADA, HellaSwag), you can run the following command. **When reporting results from eval harness, please include the task versions (shown in `results["versions"]`) for reproducibility.** This allows bug fixes to tasks while also ensuring that previously reported scores are reproducible. See the [Task Versioning](https://github.com/EleutherAI/lm-evaluation-harness#task-versioning) section for more info. ```bash python main.py \ @@ -55,7 +55,7 @@ To evaluate mesh-transformer-jax models that are not available on HF, please inv ## Implementing new tasks -To implement a new task in eval harness, see [this guide](https://github.com/EleutherAI/lm-evaluation-harness/blob/master/task-guide.md). +To implement a new task in eval harness, see [this guide](./docs/task_guide.md). ## Cite as @@ -128,8 +128,9 @@ To implement a new task in eval harness, see [this guide](https://github.com/Ele |openbookqa |✓ |✓ |✓ | 500|acc, acc_norm | |squad2 |✓ |✓ | | 11873|exact, f1, HasAns_exact, HasAns_f1, NoAns_exact, NoAns_f1, best_exact, best_f1| |race |✓ |✓ |✓ | 1045|acc | -|headqa |✓ |✓ |✓ | 2742|acc, acc_norm | |mathqa |✓ |✓ |✓ | 2985|acc, acc_norm | +|headqa_es |✓ |✓ |✓ | 2742|acc, acc_norm | +|headqa_en |✓ |✓ |✓ | 2742|acc, acc_norm | |webqs |✓ | |✓ | 2032|acc | |wsc273 | | |✓ | 273|acc | |winogrande |✓ |✓ | | 1267|acc | @@ -363,7 +364,6 @@ To inspect what the LM inputs look like, you can run the following command: ```bash python write_out.py \ --tasks all_tasks \ - --provide_description \ --num_fewshot 5 \ --num_examples 10 \ --output_base_path /path/to/output/folder diff --git a/docs/description_guide.md b/docs/description_guide.md new file mode 100644 index 00000000000..b3fea0834f2 --- /dev/null +++ b/docs/description_guide.md @@ -0,0 +1,49 @@ +# Description Guide + +![fewshot-example](./img/fewshot_example_gpt3.png) +(Figure from [Brown et al., 2020](https://arxiv.org/pdf/2005.14165.pdf)) + +Task descriptions provide in-context task instruction for your language model. If you'd like to prepend a natural language description to your few-shot examples and prompt, you can do so on a per-task basis via the `description_dict` arg of [`evaluator.evaluate`](../lm_eval/evaluator.py). This `description_dict` must adhere to the following key-value structure: + +- **key**: the task name (`str`) as specified in the lm-eval-harness [task registry](../lm_eval/tasks/__init__.py). +- **value**: the corresponding (`str`) description/prompt for the task identified by **key**. + +```python +description_dict = { + "task_name_1": "description", + "task_name_2": "description", + ... +} +``` + +Note that a task's description will be separated from its following few-shot examples and prompt by a new line as such: + +```python +""" + + + + + +""" +``` + +## Descriptions in File + +One can also interface with the aforementioned [`evaluator.evaluate`](../lm_eval/evaluator.py) (or `evaluator.simple_evaluate`) method from a higher level by simply passing a JSON file path to the `description_dict_path` arg of the command-line interface (CLI) program, `main.py`. The JSON file pointed to should be structured the same as the `description_dict`. E.g. 
for some file at `/your/path/descriptions.json` you may have: + +```json +{ + "cycle_letters": "Please unscramble the letters into a word, and write that word:", + "copa": "Given a premise and one alternative with a causal relation to the premise and another without, choose the more plausible alternative" +} +``` + +which can then be supplied to the CLI as: + +```bash +python main.py \ +--tasks cycle_letters,copa \ +--description_dict_path /your/path/descriptions.json \ +... +``` diff --git a/docs/img/fewshot_example_gpt3.png b/docs/img/fewshot_example_gpt3.png new file mode 100644 index 00000000000..b199736867a Binary files /dev/null and b/docs/img/fewshot_example_gpt3.png differ diff --git a/task-guide.md b/docs/task_guide.md similarity index 94% rename from task-guide.md rename to docs/task_guide.md index 5ea43fc2f41..f3b2c986ba6 100644 --- a/task-guide.md +++ b/docs/task_guide.md @@ -87,8 +87,7 @@ There are 2 standard approaches we follow for downloading data: ``` These methods return `True`/`False` whether or not your task dataset provides documents for each split type. __Note__: if the test set doesn't have publicly available labels, please do not put it down as having a test set. - Lastly, we need to load the documents. In our terminology, a document (`doc`) is a single natural language data example stored in a Python `dict`. E.g.: - `{“question”: “What is the capital of France?”, “answer”: “Paris”}`. Override the following methods to load your data splits from their storage location in `DATASET_PATH`: + Lastly, we need to load the documents. In our terminology, a document (`doc`) is a single natural language data example stored in a Python `dict`. E.g.: `{“question”: “What is the capital of France?”, “answer”: “Paris”}`. Override the following methods to load your data splits from their storage location in `DATASET_PATH`: ```python def training_docs(self): return #... @@ -125,17 +124,9 @@ You can now skip ahead to registering your task
+In the case your task is _not_ multiple-choice, override the following methods for your task class: -In the case your task is not multiple-choice, override the following methods for your task class: - -Put the natural language task description as a single line (no `\n`s) string here. E.g. `"Translate English to French:"` - -```python -def fewshot_description(self): - return "" -``` - -Format your document into a single query prompt __without the answer__ here. This method takes a single `doc` example (in dictionary form) . You should concatenate its members into a nicely formatted prompt. +Format your document into a single query prompt __without the answer__ here. This method takes a single `doc` example of type `dict` with `str` key-value members. You should concatenate these `doc` item values together into a neatly formatted prompt. ```python def doc_to_text(self, doc): @@ -161,11 +152,12 @@ After registering your task, you can now check on your data downloading and veri ```bash python -m scripts.write_out \ - --task \ --output_base_path \ + --tasks \ --sets \ --num_fewshot K \ - --num_examples N + --num_examples N \ + --description_dict_path ``` Open the file specified at the `--output_base_path ` and ensure it passes diff --git a/lm_eval/base.py b/lm_eval/base.py index 927ecb49f0e..0950315f231 100644 --- a/lm_eval/base.py +++ b/lm_eval/base.py @@ -1,6 +1,7 @@ import abc from typing import Iterable import numpy as np +import random import re import os import json @@ -10,7 +11,7 @@ import torch import torch.nn.functional as F -from lm_eval.metrics import mean, weighted_perplexity, weighted_mean +from lm_eval.metrics import mean, weighted_perplexity, weighted_mean, bits_per_byte from lm_eval import utils from abc import abstractmethod @@ -450,11 +451,43 @@ def higher_is_better(self): pass def fewshot_description(self): + import warnings + warnings.warn( + "`fewshot_description` will be removed in future versions. Pass " + "any custom descriptions to the `evaluate` function instead.", + DeprecationWarning) return "" - def fewshot_context(self, doc, num_fewshot, provide_description, rnd): - raw_description = self.fewshot_description() - description = (raw_description + "\n===\n\n") if provide_description and raw_description else "" + @utils.positional_deprecated + def fewshot_context(self, doc, num_fewshot, provide_description=None, rnd=None, description=None): + """ Returns a fewshot context string that is made up of a prepended description + (if provided), the `num_fewshot` number of examples, and an appended prompt example. + + :param doc: dict + The document as returned from training_docs, validation_docs, or test_docs. + :param num_fewshot: int + The number of fewshot examples to provide in the returned context string. + :param provide_description: bool + Not implemented, and this option is deprecated and will be removed in a future version in favor of a different description providing method + :param rnd: random.Random + The pseudo-random number generator used to randomly sample examples. + WARNING: This is currently a required arg although it's optionalized with a default `None`. + :param description: str + The task's description that will be prepended to the fewshot examples. + :returns: str + The fewshot context. + """ + assert rnd is not None, "A `random.Random` generator argument must be provided to `rnd`" + assert not provide_description, ( + "The `provide_description` arg will be removed in future versions. 
To prepend " + "a custom description to the context, supply the corresponding string via the " + "`description` arg." + ) + if provide_description is not None: + # nudge people to not specify it at all + print("WARNING: provide_description is deprecated and will be removed in a future version in favor of description_dict") + + description = description + "\n\n" if description else "" if num_fewshot == 0: labeled_examples = "" @@ -523,16 +556,22 @@ class PerplexityTask(Task, abc.ABC): def has_training_docs(self): return False - def fewshot_description(self): - return "" - def fewshot_examples(self, k, rnd): assert k == 0 return [] - def fewshot_context(self, doc, num_fewshot, provide_description, rnd): + def fewshot_context(self, doc, num_fewshot, provide_description=None, rnd=None, description=None): assert num_fewshot == 0 - assert not provide_description + assert rnd is not None, "A `random.Random` generator argument must be provided to `rnd`" + assert not provide_description, ( + "The `provide_description` arg will be removed in future versions. To prepend " + "a custom description to the context, supply the corresponding string via the " + "`description` arg." + ) + if provide_description is not None: + # nudge people to not specify it at all + print("WARNING: provide_description is deprecated and will be removed in a future version in favor of description_dict") + return "" def higher_is_better(self): @@ -560,14 +599,14 @@ def process_results(self, doc, results): return { "word_perplexity": (loglikelihood, words), "byte_perplexity": (loglikelihood, bytes_), - "bits_per_byte": (-loglikelihood, self.count_bytes(doc)) + "bits_per_byte": (loglikelihood, bytes_), } def aggregation(self): return { "word_perplexity": weighted_perplexity, "byte_perplexity": weighted_perplexity, - "bits_per_byte": weighted_mean + "bits_per_byte": bits_per_byte, } @classmethod diff --git a/lm_eval/evaluator.py b/lm_eval/evaluator.py index 4de59ad5c4a..087f2a13135 100644 --- a/lm_eval/evaluator.py +++ b/lm_eval/evaluator.py @@ -6,19 +6,23 @@ import lm_eval.tasks import lm_eval.base import numpy as np +from lm_eval.utils import positional_deprecated -def simple_evaluate(model, model_args, task_names, +@positional_deprecated +def simple_evaluate(model, model_args=None, tasks=[], num_fewshot=0, batch_size=None, device=None, - no_cache=False, limit=None, bootstrap_iters=100000): + no_cache=False, limit=None, bootstrap_iters=100000, + description_dict=None): """Instantiate and evaluate a model on a list of tasks. - :param model: str - Name of model, see lm_eval.models.get_model - :param model_args: str - String arguments for each model class, see LM.create_from_arg_string - :param task_names: list[str] - List of task names + :param model: Union[str, LM] + Name of model or LM object, see lm_eval.models.get_model + :param model_args: Optional[str] + String arguments for each model class, see LM.create_from_arg_string. + Ignored if `model` argument is a LM object. + :param tasks: list[Union[str, Task]] + List of task names or Task objects. Task objects will be taken to have name task.EVAL_HARNESS_NAME if defined and type(task).__name__ otherwise. 
:param num_fewshot: int Number of examples in few-shot context :param batch_size: int, optional @@ -31,23 +35,39 @@ def simple_evaluate(model, model_args, task_names, Limit the number of examples per task (only use this for testing) :param bootstrap_iters: Number of iterations for bootstrap statistics + :param description_dict: dict[str, str] + Dictionary of custom task descriptions of the form: `task_name: description` :return Dictionary of results """ random.seed(1234) np.random.seed(1234) - lm = lm_eval.models.get_model(model).create_from_arg_string(model_args, { - 'batch_size': batch_size, 'device': device - }) + assert tasks != [], "No tasks specified" + + if isinstance(model, str): + if model_args is None: model_args = "" + lm = lm_eval.models.get_model(model).create_from_arg_string(model_args, { + 'batch_size': batch_size, 'device': device + }) + else: + assert isinstance(model, lm_eval.base.LM) + lm = model if not no_cache: lm = lm_eval.base.CachingLM( lm, 'lm_cache/' + model + '_' + model_args.replace('=', '-').replace(',', '_').replace('/', '-') + '.db' ) - task_dict = lm_eval.tasks.get_task_dict(task_names) - results = evaluate(lm, task_dict, False, num_fewshot, limit) + task_dict = lm_eval.tasks.get_task_dict(tasks) + + results = evaluate( + lm=lm, + task_dict=task_dict, + num_fewshot=num_fewshot, + limit=limit, + description_dict=description_dict + ) # add info about the model and few shot config results["config"] = { @@ -58,19 +78,21 @@ def simple_evaluate(model, model_args, task_names, "device": device, "no_cache": no_cache, "limit": limit, - "bootstrap_iters": bootstrap_iters + "bootstrap_iters": bootstrap_iters, + "description_dict": description_dict } return results -def evaluate(lm, task_dict, provide_description, num_fewshot, limit, bootstrap_iters=100000): +@positional_deprecated +def evaluate(lm, task_dict, provide_description=None, num_fewshot=0, limit=None, bootstrap_iters=100000, description_dict=None): """Instantiate and evaluate a model on a list of tasks. :param lm: obj Language Model :param task_dict: dict[str, Task] - Dictionary of tasks + Dictionary of tasks. Tasks will be taken to have name task.EVAL_HARNESS_NAME if defined and type(task).__name__ otherwise. :param provide_description: bool Not implemented, and this option is deprecated and will be removed in a future version in favor of a different description providing method :param num_fewshot: int @@ -79,6 +101,8 @@ def evaluate(lm, task_dict, provide_description, num_fewshot, limit, bootstrap_i Limit the number of examples per task (only use this for testing) :param bootstrap_iters: Number of iterations for bootstrap statistics + :param description_dict: dict[str, str] + Dictionary of custom task descriptions of the form: `task_name: description` :return Dictionary of results """ @@ -86,6 +110,9 @@ def evaluate(lm, task_dict, provide_description, num_fewshot, limit, bootstrap_i # TODO: todo: implement proper description-providing system assert not provide_description # not implemented. 
+ if provide_description is not None: + # nudge people to not specify it at all + print("WARNING: provide_description is deprecated and will be removed in a future version in favor of description_dict") task_dict_items = [ (name, task) @@ -125,16 +152,16 @@ def evaluate(lm, task_dict, provide_description, num_fewshot, limit, bootstrap_i rnd.seed(42) rnd.shuffle(task_docs) + description = description_dict[task_name] if description_dict and task_name in description_dict else "" + for doc_id, doc in enumerate(itertools.islice(task_docs, 0, limit)): docs[(task_name, doc_id)] = doc - ctx = task.fewshot_context( doc=doc, - provide_description=provide_description, num_fewshot=num_fewshot, - rnd=rnd + rnd=rnd, + description=description ) - reqs = task.construct_requests(doc, ctx) if not isinstance(reqs, (list, tuple)): reqs = [reqs] diff --git a/lm_eval/metrics.py b/lm_eval/metrics.py index c95d4cd61c3..9029ac08ce6 100644 --- a/lm_eval/metrics.py +++ b/lm_eval/metrics.py @@ -52,13 +52,14 @@ def acc_all(items): docs = list(zip(*items))[1] for doc, pred in zip(docs, preds): + paragraph_id = doc["idx"]["paragraph"] question_id = doc["idx"]["question"] - if question_id not in question_scoring_dict: - question_scoring_dict[question_id] = [] + if (paragraph_id, question_id) not in question_scoring_dict: + question_scoring_dict[(paragraph_id, question_id)] = [] gold_label = doc["label"] == 1 - question_scoring_dict[question_id].append(gold_label == pred) + question_scoring_dict[(paragraph_id, question_id)].append(gold_label == pred) acc = np.mean([int(all(x)) for x in question_scoring_dict.values()]) return acc @@ -102,6 +103,9 @@ def weighted_mean(items): def weighted_perplexity(items): return math.exp(-weighted_mean(items)) +def bits_per_byte(items): + return -weighted_mean(items) / math.log(2) + def bleu(items): """The Bilingual Evaluation Understudy Score, or BLEU for short, is a metric diff --git a/lm_eval/models/__init__.py b/lm_eval/models/__init__.py index a12f68a513a..9ffd1ceffb8 100644 --- a/lm_eval/models/__init__.py +++ b/lm_eval/models/__init__.py @@ -1,12 +1,16 @@ from . import gpt2 from . import gpt3 from . import dummy +from . import xglm +from . 
import bigscience MODEL_REGISTRY = { "hf": gpt2.HFLM, "gpt2": gpt2.GPT2LM, "gpt3": gpt3.GPT3LM, "dummy": dummy.DummyLM, + "XGLM": xglm.XGLM, + "bigscience":bigscience.BigScience, } diff --git a/lm_eval/models/bigscience.py b/lm_eval/models/bigscience.py new file mode 100644 index 00000000000..e54b0627164 --- /dev/null +++ b/lm_eval/models/bigscience.py @@ -0,0 +1,84 @@ +import transformers +import torch +from lm_eval.base import BaseLM +# +# +# +# ​ +class BigScience(BaseLM): + + def __init__(self, device='cuda', pretrained='bigscience/tr5b-1B3-multilingual-alpha-checkpoints', revision='global_step118500', subfolder=None, tokenizer=None, batch_size=1): + super().__init__() + + assert isinstance(device, str) + assert isinstance(pretrained, str) + assert isinstance(batch_size, int) + + if device: + self._device = torch.device(device) + else: + self._device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu') + # TODO: update this to be less of a hack once subfolder is fixed in HF + self.bigscience = transformers.AutoModelForCausalLM.from_pretrained( + pretrained, revision=revision + ).to(self.device) + self.bigscience.eval() + # pretrained tokenizer for neo is broken for now so just hard-coding this to gpt2 + self.tokenizer = transformers.AutoTokenizer.from_pretrained( + pretrained if tokenizer is None else tokenizer, revision=revision, subfolder=subfolder) + # assert isinstance(self.tokenizer, ( + # transformers.GPT2Tokenizer, transformers.GPT2TokenizerFast, + # transformers.T5Tokenizer, transformers.T5TokenizerFast, + # )), "this tokenizer has not been checked for compatibility yet!" + self.vocab_size = self.tokenizer.vocab_size + # if isinstance(self.tokenizer, (transformers.GPT2Tokenizer, transformers.GPT2TokenizerFast)): + # assert self.tokenizer.encode('hello\n\nhello') == [31373, 198, 198, 31373], \ + # self.tokenizer.encode('hello\n\nhello') + # multithreading and batching + self.batch_size_per_gpu = batch_size # todo: adaptive batch size + # TODO: fix multi-gpu + # gpus = torch.cuda.device_count() + # if gpus > 1: + # self.gpt2 = nn.DataParallel(self.gpt2) + @property + def eot_token_id(self): + # we use EOT because end of *text* is more accurate for what we're doing than end of *sentence* + return self.tokenizer.eos_token_id + @property + def max_length(self): + try: + return self.bigscience.config.n_ctx + except AttributeError: + # gptneoconfig doesn't have n_ctx apparently + return self.bigscience.config.max_position_embeddings + @property + def max_gen_toks(self): + return 256 + @property + def batch_size(self): + # TODO: fix multi-gpu + return self.batch_size_per_gpu # * gpus + @property + def device(self): + # TODO: fix multi-gpu + return self._device + def tok_encode(self, string: str): + return self.tokenizer.encode(string, add_special_tokens=False) + def tok_decode(self, tokens): + return self.tokenizer.decode(tokens) + def _model_call(self, inps): + """ + inps: a torch tensor of shape [batch, sequence] + the size of sequence may vary from call to call + returns: a torch tensor of shape [batch, sequence, vocab] with the + logits returned from the model + """ + with torch.no_grad(): + return self.bigscience(inps)[0][:, :, :130000] + def _model_generate(self, context, max_length, eos_token_id): + result = self.bigscience.generate( + context, + max_length=max_length, + eos_token_id=eos_token_id, + do_sample=False) + return result diff --git a/lm_eval/models/xglm.py b/lm_eval/models/xglm.py new file mode 100644 index 00000000000..3aac8f9ead2 --- 
/dev/null +++ b/lm_eval/models/xglm.py @@ -0,0 +1,98 @@ +import transformers +import torch +from lm_eval import utils +from lm_eval.base import BaseLM +from tqdm import tqdm + + +class XGLM(BaseLM): + def __init__(self, device='cuda', pretrained='facebook/xglm-1.7B', revision='main', subfolder=None, tokenizer=None, batch_size=1): + super().__init__() + assert isinstance(device, str) + assert isinstance(pretrained, str) + assert isinstance(batch_size, int) + if device: + self._device = torch.device(device) + else: + self._device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu') + # TODO: update this to be less of a hack once subfolder is fixed in HF + self.xglm = transformers.AutoModelForCausalLM.from_pretrained( + pretrained, + # cache_dir="/users/zyong2/data/zyong2/huggingface/xglm" + ).to(self.device) + print(f"🤖 Loading model {pretrained}") + self.xglm.eval() + # pretrained tokenizer for neo is broken for now so just hard-coding this to gpt2 + self.tokenizer = transformers.AutoTokenizer.from_pretrained( + pretrained if tokenizer is None else tokenizer, revision=revision, subfolder=subfolder) + + # assert isinstance(self.tokenizer, ( + # transformers.GPT2Tokenizer, transformers.GPT2TokenizerFast, + # transformers.T5Tokenizer, transformers.T5TokenizerFast, + # )), "this tokenizer has not been checked for compatibility yet!" + self.vocab_size = self.tokenizer.vocab_size + # if isinstance(self.tokenizer, (transformers.GPT2Tokenizer, transformers.GPT2TokenizerFast)): + # assert self.tokenizer.encode('hello\n\nhello') == [31373, 198, 198, 31373], \ + # self.tokenizer.encode('hello\n\nhello') + # multithreading and batching + self.batch_size_per_gpu = batch_size # todo: adaptive batch size + # TODO: fix multi-gpu + # gpus = torch.cuda.device_count() + # if gpus > 1: + # self.gpt2 = nn.DataParallel(self.gpt2) + @property + def eot_token_id(self): + # we use EOT because end of *text* is more accurate for what we're doing than end of *sentence* + return self.tokenizer.eos_token_id + @property + def max_length(self): + try: + return self.xglm.config.n_ctx + except AttributeError: + # gptneoconfig doesn't have n_ctx apparently + return self.xglm.config.max_position_embeddings + @property + def max_gen_toks(self): + return 256 + @property + def batch_size(self): + # TODO: fix multi-gpu + return self.batch_size_per_gpu # * gpus + @property + def device(self): + # TODO: fix multi-gpu + return self._device + def tok_encode(self, string: str): + # HACK: to overcome problem of XGLM tokenizer removing new lines + # we replace newline with SEP token + # WARNING: Since typical SEP token == EOS token + # Generation will stop after the first appearance of SEP token prevnting XGLM from + # outputting Multi line generations + string = string.replace("\n", self.tokenizer.sep_token) + return self.tokenizer.encode(string, add_special_tokens=False) + + def tok_decode(self, tokens): + # HACK: to overcome problem of XGLM tokenizer removing new lines + # replace back the generated sep_tokens with newlines + output = self.tokenizer.decode(tokens) + output = output.replace(self.tokenizer.sep_token, "\n") + print(output) + return output + def _model_call(self, inps): + """ + inps: a torch tensor of shape [batch, sequence] + the size of sequence may vary from call to call + returns: a torch tensor of shape [batch, sequence, vocab] with the + logits returned from the model + """ + with torch.no_grad(): + return self.xglm(inps)[0][:, :, :256008] + + def _model_generate(self, context, max_length, 
eos_token_id): + result = self.xglm.generate( + context, + max_length=max_length, + eos_token_id=eos_token_id, + do_sample=False + ) + return result diff --git a/lm_eval/tasks/__init__.py b/lm_eval/tasks/__init__.py index 53d7e88f16c..4221b811597 100644 --- a/lm_eval/tasks/__init__.py +++ b/lm_eval/tasks/__init__.py @@ -1,6 +1,8 @@ from pprint import pprint +from typing import List, Union import sacrebleu +import lm_eval.base from . import superglue from . import glue @@ -45,6 +47,7 @@ from . import mutual from . import truthfulqa from . import blimp +from . import asdiv ######################################## # Translation tasks @@ -133,7 +136,9 @@ "squad2": squad.SQuAD2, "race": race.RACE, # "naturalqs": naturalqs.NaturalQs, # not implemented yet - "headqa": headqa.HeadQA, + "headqa": headqa.HeadQAEsDeprecated, # for backwards compat - headqa used to default to es + "headqa_es": headqa.HeadQAEs, + "headqa_en": headqa.HeadQAEn, "mathqa": mathqa.MathQA, "webqs": webqs.WebQs, "wsc273": wsc273.WinogradSchemaChallenge273, @@ -164,6 +169,7 @@ "math_num_theory": hendrycks_math.MathNumberTheory, "math_prealgebra": hendrycks_math.MathPrealgebra, "math_precalc": hendrycks_math.MathPrecalculus, + "math_asdiv": asdiv.Asdiv, # arithmetic "arithmetic_2da": arithmetic.Arithmetic2DPlus, @@ -301,8 +307,23 @@ def get_task(task_name): raise KeyError(f"Missing task {task_name}") -def get_task_dict(task_name_list): - return { +def get_task_name_from_object(task_object): + for name, class_ in TASK_REGISTRY.items(): + if class_ is task_object: + return name + + # this gives a mechanism for non-registered tasks to have a custom name anyways when reporting + return task_object.EVAL_HARNESS_NAME if hasattr(task_object, "EVAL_HARNESS_NAME") else type(task_object).__name__ + + +def get_task_dict(task_name_list: List[Union[str, lm_eval.base.Task]]): + task_name_dict = { task_name: get_task(task_name)() - for task_name in task_name_list + for task_name in task_name_list if isinstance(task_name, str) + } + task_name_from_object_dict = { + get_task_name_from_object(task_object): task_object + for task_object in task_name_list if not isinstance(task_object, str) } + assert set(task_name_dict.keys()).isdisjoint(set(task_name_from_object_dict.keys())) + return {**task_name_dict, **task_name_from_object_dict} diff --git a/lm_eval/tasks/anli.py b/lm_eval/tasks/anli.py index 1304c5da2bc..13c4044560e 100644 --- a/lm_eval/tasks/anli.py +++ b/lm_eval/tasks/anli.py @@ -33,10 +33,6 @@ def test_docs(self): if self.has_test_docs(): return self.data["test_r" + str(self.SPLIT)] - def fewshot_description(self): - # TODO: figure out description - return "" - def doc_to_text(self, doc): # OA does this a bit weirdly: they prepend "anli 1: anli 1: " to the beginning # of the prompt (yes, repeating it!). also, " True, False, or Neither?" 
is directly diff --git a/lm_eval/tasks/arc.py b/lm_eval/tasks/arc.py index a0d13abc59d..2a8a9998429 100644 --- a/lm_eval/tasks/arc.py +++ b/lm_eval/tasks/arc.py @@ -29,10 +29,6 @@ def _convert_standard(self, doc): } return out_doc - def fewshot_description(self): - # TODO: figure out description - return "" - def doc_to_text(self, doc): return doc["query"] diff --git a/lm_eval/tasks/arithmetic.py b/lm_eval/tasks/arithmetic.py index 147b66a1754..b3256b5c874 100644 --- a/lm_eval/tasks/arithmetic.py +++ b/lm_eval/tasks/arithmetic.py @@ -21,7 +21,7 @@ def download(self): url = 'https://raw.githubusercontent.com/openai/gpt-3/master/data/' + file_name if not os.path.exists(self.directory): os.makedirs(self.directory) - download_file(url, self.directory+file_name, checksum) + download_file(url, local_file=self.directory+file_name, expected_checksum=checksum) self.set_docs() @abc.abstractmethod diff --git a/lm_eval/tasks/asdiv.py b/lm_eval/tasks/asdiv.py new file mode 100644 index 00000000000..58bcdcd250e --- /dev/null +++ b/lm_eval/tasks/asdiv.py @@ -0,0 +1,121 @@ +""" +ASDiv: A Diverse Corpus for Evaluating and Developing English Math Word Problem Solvers +https://arxiv.org/abs/2106.15772 + +@misc{miao2021diverse, + title={A Diverse Corpus for Evaluating and Developing English Math Word Problem Solvers}, + author={Shen-Yun Miao and Chao-Chun Liang and Keh-Yih Su}, + year={2021}, + eprint={2106.15772}, + archivePrefix={arXiv}, + primaryClass={cs.AI} +} +""" +from lm_eval.base import Task +from pathlib import Path +from best_download import download_file +import xml.etree.ElementTree as ET +from lm_eval.base import rf +from lm_eval.metrics import mean,perplexity +import numpy as np +from zipfile import ZipFile +import os + +#currently ignoring formula for answer generation + +# given a subset, splits return the docs +class Asdiv(Task): + VERSION = 0 + DATASET_PATH = Path("data/asdiv") + + def download(self): + if self.DATASET_PATH.exists(): + return + Path.mkdir(self.DATASET_PATH, parents=True) + url = "https://github.com/chaochun/nlu-asdiv-dataset/archive/55790e5270bb91ccfa5053194b25732534696b50.zip" + checksum = "8f1fe4f6d5f170ec1e24ab78c244153c14c568b1bb2b1dad0324e71f37939a2d" + zip_path = self.DATASET_PATH / "55790e5270bb91ccfa5053194b25732534696b50.zip" + download_file(url, local_file=str(zip_path), expected_checksum=checksum) + with ZipFile(zip_path, "r") as zip: + zip.extractall(self.DATASET_PATH) + os.remove(zip_path) + + def _convert_standard(self, problem): + #TODO: include solution-type and formula + out_doc = { + "question" : problem.find('Question').text, + "body" : problem.find('Body').text, + "answer": problem.find('Answer').text + } + return out_doc + + def load_docs(self, textfilename, tfds=False): + tree = ET.parse(textfilename) + root = tree.getroot() + for pid, problem in enumerate(root.iter('Problem')): + out_doc = self._convert_standard(problem) + yield out_doc + + def has_training_docs(self): + return False + + def has_validation_docs(self): + return True + + def has_test_docs(self): + return False + + def training_docs(self): + raise NotImplementedError("This dataset has no training docs") + + def test_docs(self): + raise NotImplementedError("This dataset has no test docs") + + def validation_docs(self): + data_xml_path = self.DATASET_PATH / "nlu-asdiv-dataset-55790e5270bb91ccfa5053194b25732534696b50/dataset/ASDiv.xml" + return self.load_docs(data_xml_path) + + def fewshot_context(self, doc, num_fewshot, provide_description=None, rnd=None, description=None): + assert 
num_fewshot == 0, "ASDiv is intended only for the zero-shot setting." + return super().fewshot_context( + doc=doc, + num_fewshot=num_fewshot, + rnd=rnd, + description=description + ) + + def fewshot_description(self): + # TODO: add solution-type and formula + desc = "information containing the context of the question\nQuestion: Text of a question.\nAnswer: Answer to the question, based on the passage.\n" + return desc + + def doc_to_text(self, doc): + # TODO: add solution-type + return doc['body'] + '\n' + 'Question:' + doc['question'] + '\n' + 'Answer:' + + def doc_to_target(self, doc): + # TODO: add formula + + answer = doc['answer'].split(' (')[0] + return " " + answer + + def construct_requests(self, doc, ctx): + ll, is_greedy = rf.loglikelihood(ctx, self.doc_to_target(doc)) + return ll, is_greedy + + def process_results(self, doc, results): + ll, is_greedy = results + + return { + 'acc': int(is_greedy) + } + + def aggregation(self): + return { + 'acc': mean + } + + def higher_is_better(self): + return { + 'acc': True + } diff --git a/lm_eval/tasks/blimp.py b/lm_eval/tasks/blimp.py index e8e7bd9f2be..8a52d888caa 100644 --- a/lm_eval/tasks/blimp.py +++ b/lm_eval/tasks/blimp.py @@ -29,9 +29,18 @@ def download(self): self.data["validation"] = self.data["train"] del self.data["train"] - def fewshot_context(self, doc, num_fewshot, provide_description, rnd): + def fewshot_context(self, doc, num_fewshot, provide_description=None, rnd=None, description=None): assert num_fewshot == 0 - assert not provide_description + assert rnd is not None, "A `random.Random` generator argument must be provided to `rnd`" + assert not provide_description, ( + "The `provide_description` arg will be removed in future versions. To prepend " + "a custom description to the context, supply the corresponding string via the " + "`description` arg." + ) + if provide_description is not None: + # nudge people to not specify it at all + print("WARNING: provide_description is deprecated and will be removed in a future version in favor of description_dict") + return "" def doc_to_text(self, doc): diff --git a/lm_eval/tasks/cbt.py b/lm_eval/tasks/cbt.py index 8837caff6dc..e239a630b40 100644 --- a/lm_eval/tasks/cbt.py +++ b/lm_eval/tasks/cbt.py @@ -17,10 +17,6 @@ class CBTBase(HFTask): VERSION = 0 - def fewshot_description(self): - # TODO: Figure out description. 
- return "" - def detokenize(self, text): text = text.replace(" '", "'") text = text.replace(" \n", "\n") diff --git a/lm_eval/tasks/coqa.py b/lm_eval/tasks/coqa.py index beba53a6630..128ac8f8d5d 100644 --- a/lm_eval/tasks/coqa.py +++ b/lm_eval/tasks/coqa.py @@ -16,8 +16,8 @@ def download(self): sh ("""mkdir -p data/coqa""") - download_file("http://downloads.cs.stanford.edu/nlp/data/coqa/coqa-train-v1.0.json", coqa_train_filepath, "b0fdb2bc1bd38dd3ca2ce5fa2ac3e02c6288ac914f241ac409a655ffb6619fa6") - download_file("http://downloads.cs.stanford.edu/nlp/data/coqa/coqa-dev-v1.0.json", coqa_dev_filepath, "dfa367a9733ce53222918d0231d9b3bedc2b8ee831a2845f62dfc70701f2540a") + download_file("http://downloads.cs.stanford.edu/nlp/data/coqa/coqa-train-v1.0.json", local_file=coqa_train_filepath, expected_checksum="b0fdb2bc1bd38dd3ca2ce5fa2ac3e02c6288ac914f241ac409a655ffb6619fa6") + download_file("http://downloads.cs.stanford.edu/nlp/data/coqa/coqa-dev-v1.0.json", local_file=coqa_dev_filepath, expected_checksum="dfa367a9733ce53222918d0231d9b3bedc2b8ee831a2845f62dfc70701f2540a") def has_training_docs(self): return True @@ -36,10 +36,7 @@ def validation_docs(self): def test_docs(self): pass - - def fewshot_description(self): - return "Given a passage and a conversation so far, answer the next question in the conversation." - + def doc_to_text(self, doc): # Given a passage p, the conversation history {q1, a1, . . . qi−1, ai−1} # and a question qi, the task is to predict the answer ai diff --git a/lm_eval/tasks/drop.py b/lm_eval/tasks/drop.py index 97d10983274..1b896f3cdee 100644 --- a/lm_eval/tasks/drop.py +++ b/lm_eval/tasks/drop.py @@ -27,7 +27,7 @@ def download(self): url = "https://s3-us-west-2.amazonaws.com/allennlp/datasets/drop/drop_dataset.zip" checksum = "39d2278a29fd729de301b111a45f434c24834f40df8f4ff116d864589e3249d6" zip_path = self.DATASET_PATH / "drop_dataset.zip" - download_file(url, str(zip_path), checksum) + download_file(url, local_file=str(zip_path), expected_checksum=checksum) with ZipFile(zip_path, "r") as zip: zip.extractall(self.DATASET_PATH) @@ -40,10 +40,6 @@ def has_validation_docs(self): def has_test_docs(self): return False - def fewshot_description(self): - # TODO: figure out description - return "" - def _load_docs(self, docs): for doc in docs: for qa in doc["qa_pairs"]: diff --git a/lm_eval/tasks/glue.py b/lm_eval/tasks/glue.py index dad629c02a7..80a77310a16 100644 --- a/lm_eval/tasks/glue.py +++ b/lm_eval/tasks/glue.py @@ -21,10 +21,6 @@ def has_validation_docs(self): def has_test_docs(self): return False - def fewshot_description(self): - # TODO - return "" - def doc_to_text(self, doc): return "{}\nQuestion: Does this sentence make sense?\nAnswer:".format(doc["sentence"]) @@ -69,9 +65,6 @@ def has_validation_docs(self): def has_test_docs(self): return False - def fewshot_description(self): - return "Indicate if the sentiment of each sentence is positive or negative." - def doc_to_text(self, doc): return "{}\nQuestion: Is this sentence positive or negative?\nAnswer:".format( general_detokenize(doc["sentence"]), @@ -341,9 +334,6 @@ def has_validation_docs(self): def has_test_docs(self): return False - def fewshot_description(self): - return "Indicate if both sentences mean the same thing." 
- def doc_to_text(self, doc): return "Sentence 1: {}\nSentence 2: {}\nQuestion: Do both sentences mean the same thing?\nAnswer:".format( general_detokenize(doc["sentence1"]), @@ -394,9 +384,6 @@ def has_validation_docs(self): def has_test_docs(self): return False - def fewshot_description(self): - return "Indicate if both questions ask the same thing." - def doc_to_text(self, doc): return "Question 1: {}\nQuestion 2: {}\nQuestion: Do both questions ask the same thing?\nAnswer:".format( doc["question1"], @@ -447,10 +434,6 @@ def has_validation_docs(self): def has_test_docs(self): return True - def fewshot_description(self): - return "Indicate if both sentences mean the same thing from a scale of 0-5, " \ - "where 5 means identical and 0 means unrelated." - def doc_to_text(self, doc): return "sentence 1: {}\nsentence 2: {}\nAnswer:".format( doc["sentence1"], diff --git a/lm_eval/tasks/headqa.py b/lm_eval/tasks/headqa.py index 3c66dc064b9..d9ac2d87c13 100644 --- a/lm_eval/tasks/headqa.py +++ b/lm_eval/tasks/headqa.py @@ -2,10 +2,9 @@ from lm_eval.base import MultipleChoiceTask -class HeadQA(HFTask, MultipleChoiceTask): +class HeadQABase(HFTask, MultipleChoiceTask): VERSION = 0 DATASET_PATH = "head_qa" - DATASET_NAME = None def has_training_docs(self): return True @@ -25,9 +24,19 @@ def _convert_standard(self, doc): } return out_doc - def fewshot_description(self): - # TODO: figure out description - return "" - def doc_to_text(self, doc): return doc["query"] + +class HeadQAEn(HeadQABase): + DATASET_NAME = "en" + +class HeadQAEs(HeadQABase): + DATASET_NAME = "es" + +# for backwards compatibility +class HeadQAEsDeprecated(HeadQABase): + DATASET_NAME = "es" + + def __init__(self): + super().__init__() + print("WARNING: headqa is deprecated. Please use headqa_es or headqa_en instead. See https://github.com/EleutherAI/lm-evaluation-harness/pull/240 for more info.") \ No newline at end of file diff --git a/lm_eval/tasks/hellaswag.py b/lm_eval/tasks/hellaswag.py index 762ce473377..56450cf3e6e 100644 --- a/lm_eval/tasks/hellaswag.py +++ b/lm_eval/tasks/hellaswag.py @@ -35,10 +35,5 @@ def _convert_standard(self, doc): } return out_doc - def fewshot_description(self): - return "Label for the relevant action: Sentences describing the " \ - "context, with an incomplete sentence trailing\nanswer that " \ - "plausibly completes the situation." 
- def doc_to_text(self, doc): return doc["query"] diff --git a/lm_eval/tasks/hendrycks_ethics.py b/lm_eval/tasks/hendrycks_ethics.py index 50e94a508cf..d12c0064cf7 100644 --- a/lm_eval/tasks/hendrycks_ethics.py +++ b/lm_eval/tasks/hendrycks_ethics.py @@ -20,7 +20,7 @@ class Ethics(Task): def download(self): if not os.path.exists('data/ethics/done'): sh("mkdir -p data") - download_file("https://people.eecs.berkeley.edu/~hendrycks/ethics.tar", "data/ethics.tar", "40acbf1ac0da79a2aabef394d58889136b8d38b05be09482006de2453fb06333") + download_file("https://people.eecs.berkeley.edu/~hendrycks/ethics.tar", local_file="data/ethics.tar", expected_checksum="40acbf1ac0da79a2aabef394d58889136b8d38b05be09482006de2453fb06333") sh(""" tar -xf data/ethics.tar -C data/ rm data/ethics.tar @@ -237,9 +237,6 @@ def process_doc(self, docs): for doc in docs: yield {"activity": doc[0], "baseline": doc[1], "rating": ""} - def fewshot_description(self): - return "Rate how pleasant each of the following activities is on a scale from 1 (very unpleasant) to 10 (very pleasant).\n\n" - def fewshot_examples(self, k, rnd): # Overwriting fewshot examples as k can be max 5 assert k <= 5, "There are only 5 possible shots for this task. Refer to the V2 for more." @@ -350,9 +347,6 @@ class EthicsVirtue(Ethics): def get_prefix(self): return "virtue/virtue" - def fewshot_description(self): - return "The following is a list of sentences and traits, along with whether the trait is exhibited in that sentence.\n\n" - def process_doc(self, doc): # Append identifiers before shuffling to calculate exact matches lateron & skip the first element of headers return [x + [i] for i, x in enumerate(doc[1:])] diff --git a/lm_eval/tasks/hendrycks_math.py b/lm_eval/tasks/hendrycks_math.py index 379e727d617..657969948c8 100644 --- a/lm_eval/tasks/hendrycks_math.py +++ b/lm_eval/tasks/hendrycks_math.py @@ -18,7 +18,7 @@ class Math(Task): def download(self): if not (self.DATASET_PATH / 'test').exists() or not (self.DATASET_PATH / 'done').exists(): sh(f"mkdir -p {self.DATASET_PATH}") - download_file("https://people.eecs.berkeley.edu/~hendrycks/MATH.tar", f"{self.DATASET_PATH}.tar", "01256fd7cd5430596fdf07e6e6a5827111b5235b7ffed679c662a12f898932da") + download_file("https://people.eecs.berkeley.edu/~hendrycks/MATH.tar", local_file=f"{self.DATASET_PATH}.tar", expected_checksum="01256fd7cd5430596fdf07e6e6a5827111b5235b7ffed679c662a12f898932da") sh(f""" tar -xf {self.DATASET_PATH}.tar -C data/ && touch {self.DATASET_PATH / 'done'} rm {self.DATASET_PATH}.tar @@ -55,9 +55,6 @@ def validation_docs(self): def test_docs(self): return self._load_docs(self.DATASET_PATH / "test" / self.get_file_info()) - def fewshot_description(self): - return "Given a mathematics problem, determine the answer. Simplify your answer as much as possible." 
- def doc_to_text(self, doc): return "Problem: " + doc["problem"] + "\nAnswer:" diff --git a/lm_eval/tasks/hendrycks_test.py b/lm_eval/tasks/hendrycks_test.py index 46c0306fcd0..aa45d608f5e 100644 --- a/lm_eval/tasks/hendrycks_test.py +++ b/lm_eval/tasks/hendrycks_test.py @@ -45,7 +45,7 @@ def __init__(self, subject): def download(self): if not (self.DATASET_PATH / 'done').exists(): sh("mkdir -p data") - download_file("https://people.eecs.berkeley.edu/~hendrycks/data.tar", "data/data.tar", "78a804365a59028188fb19bd1adcadc5e0c260b220a9d8b2e33a5ea7d5fbe3b4") + download_file("https://people.eecs.berkeley.edu/~hendrycks/data.tar", local_file="data/data.tar", expected_checksum="78a804365a59028188fb19bd1adcadc5e0c260b220a9d8b2e33a5ea7d5fbe3b4") sh(""" tar -xf data/data.tar -C data/ rm data/data.tar @@ -114,9 +114,5 @@ def fewshot_examples(self, k, rnd): return rnd.sample(list(self._fewshot_docs), k) - def fewshot_description(self): - subject = self.subject.replace("_", " ") - return f"The following are multiple choice questions (with answers) about {subject}." - def doc_to_text(self, doc): return doc["query"] diff --git a/lm_eval/tasks/lambada.py b/lm_eval/tasks/lambada.py index bcb4ae019c4..300445c6383 100644 --- a/lm_eval/tasks/lambada.py +++ b/lm_eval/tasks/lambada.py @@ -14,8 +14,8 @@ def download(self): if not os.path.exists("data/lambada/lambada_test.jsonl"): download_file( "http://eaidata.bmk.sh/data/lambada_test.jsonl", - "data/lambada/lambada_test.jsonl", - "4aa8d02cd17c719165fc8a7887fddd641f43fcafa4b1c806ca8abc31fabdb226" + local_file="data/lambada/lambada_test.jsonl", + expected_checksum="4aa8d02cd17c719165fc8a7887fddd641f43fcafa4b1c806ca8abc31fabdb226" ) except: # fallback - for some reason best_download doesnt work all the time here @@ -47,10 +47,6 @@ def doc_to_text(self, doc): def doc_to_target(self, doc): return " " + doc['text'].rsplit(' ', 1)[1] - - def fewshot_description(self): - # TODO: figure out description - return "" def construct_requests(self, doc, ctx): ll, is_greedy = rf.loglikelihood(ctx, self.doc_to_target(doc)) diff --git a/lm_eval/tasks/lambada_cloze.py b/lm_eval/tasks/lambada_cloze.py index 90bd4f10cac..dc1d4b168b8 100644 --- a/lm_eval/tasks/lambada_cloze.py +++ b/lm_eval/tasks/lambada_cloze.py @@ -13,6 +13,3 @@ def doc_to_text(self, doc): def doc_to_target(self, doc): return " " + doc['text'].rsplit(' ', 1)[1] - - def fewshot_description(self): - return "Fill in blank:\n" diff --git a/lm_eval/tasks/lambada_multilingual.py b/lm_eval/tasks/lambada_multilingual.py index dd6da10befa..7123ecf01ad 100644 --- a/lm_eval/tasks/lambada_multilingual.py +++ b/lm_eval/tasks/lambada_multilingual.py @@ -32,8 +32,8 @@ def download(self): if not os.path.exists(f): download_file( url, - f, - CHECKSUMS[self.LANG] + local_file=f, + expected_checksum=CHECKSUMS[self.LANG] ) except: # fallback - for some reason best_download doesnt work all the time here diff --git a/lm_eval/tasks/logiqa.py b/lm_eval/tasks/logiqa.py index e403623beba..6341b4a32e2 100644 --- a/lm_eval/tasks/logiqa.py +++ b/lm_eval/tasks/logiqa.py @@ -19,7 +19,7 @@ def download(self): ] for split in splits: file = self.DATASET_PATH / f"{split['name']}.txt" - download_file(f"{base_url}/{split['name']}.txt", str(file), split["checksum"]) + download_file(f"{base_url}/{split['name']}.txt", local_file=str(file), expected_checksum=split["checksum"]) def has_training_docs(self): return True @@ -80,9 +80,5 @@ def validation_docs(self): def test_docs(self): return self._load_docs(self.DATASET_PATH / "Test.txt") - def 
fewshot_description(self): - # TODO: figure out actual description - return "" - def doc_to_text(self, doc): return doc["query"] diff --git a/lm_eval/tasks/mathqa.py b/lm_eval/tasks/mathqa.py index 84e5ab9eca5..a02a5b59bb6 100644 --- a/lm_eval/tasks/mathqa.py +++ b/lm_eval/tasks/mathqa.py @@ -29,9 +29,5 @@ def _convert_standard(self, doc): } return out_doc - def fewshot_description(self): - # TODO: figure out description - return "" - def doc_to_text(self, doc): return doc["query"] diff --git a/lm_eval/tasks/mc_taco.py b/lm_eval/tasks/mc_taco.py index c9b2dd91fca..64a36a01f77 100644 --- a/lm_eval/tasks/mc_taco.py +++ b/lm_eval/tasks/mc_taco.py @@ -39,9 +39,6 @@ def has_validation_docs(self): def has_test_docs(self): return True - def fewshot_description(self): - return "Determine whether the candidate answer is plausible (\"yes\") or not (\"no\")" - def doc_to_text(self, doc): return f"{doc['sentence']}\nQuestion: {doc['question']}\n"\ f"Answer: {doc['answer']}\nPlausible:" diff --git a/lm_eval/tasks/mutual.py b/lm_eval/tasks/mutual.py index 17274a46fd9..99c1508bc5d 100644 --- a/lm_eval/tasks/mutual.py +++ b/lm_eval/tasks/mutual.py @@ -36,8 +36,8 @@ def download(self): master_zip = Path("data/master.zip") download_file( "https://github.com/Nealcly/MuTual/archive/master.zip", - str(master_zip), - "bb325cf6c672f0f02699993a37138b0fa0af6fcfc77ec81dfbe46add4d7b29f9") + local_file=str(master_zip), + expected_checksum="bb325cf6c672f0f02699993a37138b0fa0af6fcfc77ec81dfbe46add4d7b29f9") with zipfile.ZipFile(master_zip, 'r') as zip: zip.extractall("data") Path("data/MuTual-master/data").rename(str(self.BASE_PATH)) @@ -70,10 +70,6 @@ def validation_docs(self): def test_docs(self): return NotImplemented - def fewshot_description(self): - # TODO: figure out fewshot description - return "" - def doc_to_text(self, doc): return self.detokenize(doc["article"]) diff --git a/lm_eval/tasks/naturalqs.py b/lm_eval/tasks/naturalqs.py index f31875240f1..e7a381dcd4b 100644 --- a/lm_eval/tasks/naturalqs.py +++ b/lm_eval/tasks/naturalqs.py @@ -21,10 +21,6 @@ def has_validation_docs(self): def has_test_docs(self): return False - def fewshot_description(self): - # TODO: figure out description - return "" - def training_docs(self): # Cache training for faster few-shot. # Data is too large to fit in memory. 
diff --git a/lm_eval/tasks/openbookqa.py b/lm_eval/tasks/openbookqa.py index 40fc7a026bd..5f87d8a8ec1 100644 --- a/lm_eval/tasks/openbookqa.py +++ b/lm_eval/tasks/openbookqa.py @@ -25,9 +25,5 @@ def _convert_standard(self, doc): } return out_doc - def fewshot_description(self): - # TODO: figure out fewshot description - return "" - def doc_to_text(self, doc): return doc["query"] diff --git a/lm_eval/tasks/pile.py b/lm_eval/tasks/pile.py index 68ff7ed9a8e..a4475832b55 100644 --- a/lm_eval/tasks/pile.py +++ b/lm_eval/tasks/pile.py @@ -10,7 +10,7 @@ class PilePerplexityTask(PerplexityTask, abc.ABC): - VERSION = 0 + VERSION = 1 PILE_SET_NAME = None VAL_PATH = 'data/pile/val.jsonl.zst' @@ -18,9 +18,11 @@ class PilePerplexityTask(PerplexityTask, abc.ABC): def download(self): # TODO: separate pile val/test out by component so we don't have to scan the entire file once per set - os.makedirs("data/pile/", exist_ok=True) - download_file("https://the-eye.eu/public/AI/pile/val.jsonl.zst", self.VAL_PATH, "264c875d8bbd355d8daa9d032b75fd8fb91606218bb84dd1155b203fcd5fab92") - download_file("https://the-eye.eu/public/AI/pile/test.jsonl.zst", self.TEST_PATH, "0bb28c52d0b5596d389bf179ce2d43bf7f7ffae76b0d2d20b180c97f62e0975e") + if not os.path.exists("data/pile/test.jsonl.zst"): + # todo use new best_download fallback api + os.makedirs("data/pile/", exist_ok=True) + download_file("http://eaidata.bmk.sh/data/pile/val.jsonl.zst", local_file=self.VAL_PATH, expected_checksum="264c875d8bbd355d8daa9d032b75fd8fb91606218bb84dd1155b203fcd5fab92") + download_file("http://eaidata.bmk.sh/data/pile/test.jsonl.zst", local_file=self.TEST_PATH, expected_checksum="0bb28c52d0b5596d389bf179ce2d43bf7f7ffae76b0d2d20b180c97f62e0975e") def validation_docs(self): rdr = lm_dataformat.Reader(self.VAL_PATH) diff --git a/lm_eval/tasks/piqa.py b/lm_eval/tasks/piqa.py index 8b43d1af03c..bdf3ec35dca 100644 --- a/lm_eval/tasks/piqa.py +++ b/lm_eval/tasks/piqa.py @@ -18,10 +18,6 @@ def has_validation_docs(self): def has_test_docs(self): return False - def fewshot_description(self): - # TODO: figure out fewshot description - return "" - def _convert_standard(self, doc): out_doc = { "goal": doc["goal"], diff --git a/lm_eval/tasks/prost.py b/lm_eval/tasks/prost.py index 1a634d17c80..e972d39ac03 100644 --- a/lm_eval/tasks/prost.py +++ b/lm_eval/tasks/prost.py @@ -36,13 +36,14 @@ def has_validation_docs(self): def has_test_docs(self): return True - def fewshot_description(self): - # TODO: figure out fewshot description - return "" - - def fewshot_context(self, doc, num_fewshot, provide_description, rnd): + def fewshot_context(self, doc, num_fewshot, provide_description=None, rnd=None, description=None): assert num_fewshot == 0, 'PROST is designed to probe models in a zero-shot fashion only.' 
- return super().fewshot_context(doc, num_fewshot, provide_description, rnd) + return super().fewshot_context( + doc=doc, + num_fewshot=num_fewshot, + rnd=rnd, + description=description + ) def _convert_standard(self, doc): out_doc = { diff --git a/lm_eval/tasks/pubmedqa.py b/lm_eval/tasks/pubmedqa.py index 14335a5de02..c597064f0c6 100644 --- a/lm_eval/tasks/pubmedqa.py +++ b/lm_eval/tasks/pubmedqa.py @@ -23,11 +23,6 @@ def test_docs(self): # HF is labelled as train but its really just for testing return self.data["train"] - def fewshot_description(self): - # Average ctx length in labelled dataset is 238.9 - # 2 few-shot exmamples pushes it beyond context window - return "" - def doc_to_text(self, doc): ctxs = "\n".join(doc["context"]["contexts"]) return "Abstract: {}\nQuestion: {}\nAnswer:".format( diff --git a/lm_eval/tasks/qa4mre.py b/lm_eval/tasks/qa4mre.py index 67810ad747a..c0966c24574 100644 --- a/lm_eval/tasks/qa4mre.py +++ b/lm_eval/tasks/qa4mre.py @@ -32,8 +32,8 @@ def download(self): if not os.path.isfile(f"data/qa4mre/QA4MRE-{year}-{lang}"): download_file( url_path, - f"data/qa4mre/QA4MRE-{year}-{lang}_GS.xml", - sha256sums[year], + local_file=f"data/qa4mre/QA4MRE-{year}-{lang}_GS.xml", + expected_checksum=sha256sums[year], ) def has_training_docs(self): @@ -67,9 +67,6 @@ def load_docs(self, textfilename, tfds=False): out_doc['source'] = src yield out_doc - def fewshot_description(self): - return "" - def test_docs(self): return self.load_docs(f"data/qa4mre/QA4MRE-{self.YEAR}-EN_GS.xml") diff --git a/lm_eval/tasks/quac.py b/lm_eval/tasks/quac.py index bb02b1c4e37..c7ce752233e 100644 --- a/lm_eval/tasks/quac.py +++ b/lm_eval/tasks/quac.py @@ -51,11 +51,6 @@ def validation_docs(self): def test_docs(self): raise NotImplementedError("QuAC has no test docs.") - def fewshot_description(self): - # TODO: figure out fewshot description - desc = "TITLE: Title of the context passage - subtitle of the passage\nPARAGRAPH: Passage describing the relevant information for answering questions.\n\nQ: Text of a question.\n\nA: Answer to the question, based on the passage. 
If it cannot be answered based on the passage, write CANNOTANSWER" - return desc - def load_doc(self, myjson): docs = [] for item in myjson: diff --git a/lm_eval/tasks/race.py b/lm_eval/tasks/race.py index 4525cb4ccdf..64ee000bd19 100644 --- a/lm_eval/tasks/race.py +++ b/lm_eval/tasks/race.py @@ -65,10 +65,6 @@ def validation_docs(self): def test_docs(self): return self._collate_data("test") - def fewshot_description(self): - # TODO: figure out description - return "" - @classmethod def get_answer_option(cls, problem): answer = cls.letter_to_num[problem['answer']] diff --git a/lm_eval/tasks/sat.py b/lm_eval/tasks/sat.py index e4411edfd8b..d75d7923b5a 100644 --- a/lm_eval/tasks/sat.py +++ b/lm_eval/tasks/sat.py @@ -61,10 +61,5 @@ def validation_docs(self): } yield doc - - def fewshot_description(self): - # TODO: figure out actual description - return "" - def doc_to_text(self, doc): return "{} is to {} as".format(*doc['query']) diff --git a/lm_eval/tasks/sciq.py b/lm_eval/tasks/sciq.py index b750354a7b0..e385811937a 100644 --- a/lm_eval/tasks/sciq.py +++ b/lm_eval/tasks/sciq.py @@ -13,8 +13,8 @@ def download(self): os.makedirs('data/sciq', exist_ok=True) download_file( 'https://ai2-public-datasets.s3.amazonaws.com/sciq/SciQ.zip', - 'data/sciq/SciQ.zip', - '7f3312f6ac6b09970b32942d106a8c44ec0dad46a0369f17d635aff8e348a87c', + local_file='data/sciq/SciQ.zip', + expected_checksum='7f3312f6ac6b09970b32942d106a8c44ec0dad46a0369f17d635aff8e348a87c', ) with zipfile.ZipFile("data/sciq/SciQ.zip", "r") as zf: zf.extractall("data/sciq/") @@ -50,9 +50,6 @@ def load_docs(self, textfilename): for record in docs: yield self._convert_standard(record) - def fewshot_description(self): - return "" - def training_docs(self): return self.load_docs("data/sciq/SciQ dataset-2 3/train.json") @@ -63,4 +60,4 @@ def test_docs(self): return self.load_docs("data/sciq/SciQ dataset-2 3/test.json") def doc_to_text(self, doc): - return "{}\nQuestion: {}\nAnswer:".format(doc["source"], doc["query"]).strip() \ No newline at end of file + return "{}\nQuestion: {}\nAnswer:".format(doc["source"], doc["query"]).strip() diff --git a/lm_eval/tasks/squad.py b/lm_eval/tasks/squad.py index 72e1a19b0ed..2a69a67c7bf 100644 --- a/lm_eval/tasks/squad.py +++ b/lm_eval/tasks/squad.py @@ -41,10 +41,6 @@ def training_docs(self): def validation_docs(self): return self.data["validation"] - def fewshot_description(self): - # TODO: figure out description - return "" - def doc_to_text(self, doc): return 'Title: ' + doc['title'] + '\n\n' + 'Background: ' + doc['context'] + '\n\n' + 'Question: ' + doc['question'] + '\n\n' + 'Answer:' diff --git a/lm_eval/tasks/storycloze.py b/lm_eval/tasks/storycloze.py index 3e178facb44..2cc16cf66d7 100644 --- a/lm_eval/tasks/storycloze.py +++ b/lm_eval/tasks/storycloze.py @@ -27,18 +27,12 @@ def load_doc(self, filename): filereader = csv.reader(file) return list(filereader) - def validation_docs(self): return self.load_doc("data/storycloze/cloze_test_val__winter2018-cloze_test_ALL_val - 1 - 1.csv") def test_docs(self): return self.load_doc("data/storycloze/cloze_test_test__winter2018-cloze_test_ALL_test - 1.csv") - - def fewshot_description(self): - # TODO: figure out fewshot description - return "" - def doc_to_text(self, doc): return ' '.join([*doc[1:5]]) diff --git a/lm_eval/tasks/superglue.py b/lm_eval/tasks/superglue.py index 33598f23015..f12b866d01f 100644 --- a/lm_eval/tasks/superglue.py +++ b/lm_eval/tasks/superglue.py @@ -13,7 +13,7 @@ class BoolQ(HFTask): - VERSION = 0 + VERSION = 1 DATASET_PATH = 
"super_glue" DATASET_NAME = "boolq" @@ -26,12 +26,8 @@ def has_validation_docs(self): def has_test_docs(self): return False - def fewshot_description(self): - # TODO: figure out actual description - return "Read the following passages and answer each question with a yes or a no." - def doc_to_text(self, doc): - return f"{doc['passage']}\nQuestion: {doc['question']}\nAnswer:" + return f"{doc['passage']}\nQuestion: {doc['question']}?\nAnswer:" def doc_to_target(self, doc): return " " + yesno(doc['label']) @@ -65,7 +61,7 @@ def aggregation(self): class CommitmentBank(HFTask): - VERSION = 0 + VERSION = 1 DATASET_PATH = "super_glue" DATASET_NAME = "cb" @@ -78,11 +74,6 @@ def has_validation_docs(self): def has_test_docs(self): return False - def fewshot_description(self): - # TODO: figure out actual description - return "Given a premise and a hypothesis, classify whether the author of the premise is committed" \ - "to the truth of the hypothesis. The three possible labels are true, false or neither." - def doc_to_text(self, doc): return "{}\nQuestion: {}. True, False or Neither?\nAnswer:".format( doc["premise"], @@ -93,14 +84,14 @@ def doc_to_target(self, doc): # True = entailment # False = contradiction # Neither = neutral - return " {}".format({0: "True", 1: "Neither", 2: "False"}[doc["label"]]) + return " {}".format({0: "True", 1: "False", 2: "Neither"}[doc["label"]]) def construct_requests(self, doc, ctx): ll_true, _ = rf.loglikelihood(ctx, ' True') - ll_neither, _ = rf.loglikelihood(ctx, ' Neither') ll_false, _ = rf.loglikelihood(ctx, ' False') + ll_neither, _ = rf.loglikelihood(ctx, ' Neither') - return ll_true, ll_neither, ll_false + return ll_true, ll_false, ll_neither def process_results(self, doc, results): gold = doc["label"] @@ -150,11 +141,6 @@ def has_validation_docs(self): def has_test_docs(self): return False - def fewshot_description(self): - # TODO: figure out actual description - return "Given a premise and one alternative with a causal relation to the premise and another without," \ - "choose the more plausible alternative" - def doc_to_text(self, doc): # Drop the period connector = { @@ -202,7 +188,7 @@ def convert_choice(choice): class MultiRC(HFTask): - VERSION = 0 + VERSION = 1 DATASET_PATH = "super_glue" DATASET_NAME = "multirc" @@ -215,10 +201,6 @@ def has_validation_docs(self): def has_test_docs(self): return False - def fewshot_description(self): - # TODO: figure out actual description - return "READING COMPREHENSION ANSWER KEY" - def doc_to_text(self, doc): return f"{doc['paragraph']}\nQuestion: {doc['question']}\nAnswer:" @@ -228,7 +210,7 @@ def doc_to_target(self, doc): @staticmethod def format_answer(answer, label): label_str = "yes" if label else "no" - return f"{label_str}, {answer}" + return f"{answer}\nIs the answer correct? {label_str}" def construct_requests(self, doc, ctx): true_choice = self.format_answer(answer=doc["answer"], label=True) @@ -240,7 +222,8 @@ def construct_requests(self, doc, ctx): return ll_true_choice, ll_false_choice def process_results(self, doc, results): - pred = np.argmax(results) + ll_true_choice, ll_false_choice = results + pred = ll_true_choice > ll_false_choice return { "acc": (pred, doc) } @@ -270,10 +253,6 @@ def has_validation_docs(self): def has_test_docs(self): return False - def fewshot_description(self): - # TODO: figure out actual description - return "" - def training_docs(self): # In ReCoRD, each doc manifests multiple "examples" in the context of few shot example packing. 
# Each doc consists of multiple answer candidates, each of which is scored yes/no. @@ -363,10 +342,6 @@ def has_validation_docs(self): def has_test_docs(self): return False - def fewshot_description(self): - # TODO: figure out actual description - return "" - def doc_to_text(self, doc): return "Sentence 1: {}\nSentence 2: {}\nQuestion: Is the word '{}' used in the same way in the" \ " two sentences above?\nAnswer:".format( @@ -432,12 +407,6 @@ def training_docs(self): ] return self._training_docs - def fewshot_description(self): - return "Final Exam with Answer Key\n" \ - "Instructions: Please carefully read the following passages. " \ - "For each passage, you must identify which noun the pronoun marked in *bold*" \ - " refers to.\n=====" - def doc_to_text(self, doc): raw_passage = doc["text"] # NOTE: HuggingFace span indices are word-based not character-based. diff --git a/lm_eval/tasks/translation.py b/lm_eval/tasks/translation.py index de02946334f..4d65de43fba 100644 --- a/lm_eval/tasks/translation.py +++ b/lm_eval/tasks/translation.py @@ -107,7 +107,7 @@ def doc_to_text(self, doc): language_codes = self.sacrebleu_language_pair.split("-") src_lang = code_to_language(language_codes[0]) tar_lang = code_to_language(language_codes[1]) - return f"{src_lang} phrase: " + doc["src"] + f"\n{tar_lang} phrase:" + return f"\nTranslate {src_lang} to {tar_lang}:\n [{src_lang}] " + doc["src"] + f"\n[{tar_lang}]" def doc_to_target(self, doc): # This shows a single target, though there may be multiple targets in a lang test @@ -132,7 +132,6 @@ def process_results(self, doc, results): if tar_lang_code in NO_SPACE_LANG: doc["ref"] = NO_SPACE_LANG[tar_lang_code]([doc["ref"]])[0] results = NO_SPACE_LANG[tar_lang_code](results) - # These metrics are corpus-level not sentence level, so we'll hide the # results in this dict and compute the corpus score in the aggregate method ref_pred = (doc["ref"], results) @@ -166,12 +165,6 @@ def higher_is_better(self): "ter": False, } - def fewshot_description(self): - language_codes = self.sacrebleu_language_pair.split("-") - src_lang = code_to_language(language_codes[0]) - tar_lang = code_to_language(language_codes[1]) - return f"Translate these {src_lang} phrases to {tar_lang}." 
- def __str__(self): language_codes = self.sacrebleu_language_pair.split("-") src_lang = code_to_language(language_codes[0]) diff --git a/lm_eval/tasks/triviaqa.py b/lm_eval/tasks/triviaqa.py index e61a40bdde2..86ba406ab97 100644 --- a/lm_eval/tasks/triviaqa.py +++ b/lm_eval/tasks/triviaqa.py @@ -12,7 +12,7 @@ class TriviaQA(Task): def download(self): if not os.path.exists('data/triviaqa/unfiltered-web-train.jsonl'): os.makedirs("data/triviaqa/", exist_ok=True) - download_file("http://eaidata.bmk.sh/data/triviaqa-unfiltered.tar.gz", "data/triviaqa/triviaqa-unfiltered.tar.gz", "adc19b42769062d241a8fbe834c56e58598d9322eb6c614e9f33a68a2cf5523e") + download_file("http://eaidata.bmk.sh/data/triviaqa-unfiltered.tar.gz", local_file="data/triviaqa/triviaqa-unfiltered.tar.gz", expected_checksum="adc19b42769062d241a8fbe834c56e58598d9322eb6c614e9f33a68a2cf5523e") sh(""" cd data/triviaqa/ tar -xf triviaqa-unfiltered.tar.gz @@ -36,10 +36,6 @@ def validation_docs(self): def test_docs(self): raise NotImplementedError() - def fewshot_description(self): - # TODO: figure out fewshot description - return "" - def doc_to_text(self, doc): return f"Question: {doc['Question']}\nAnswer:" @@ -56,7 +52,6 @@ def _remove_prefixes(self, aliases): ret.append(alias) return ret - def construct_requests(self, doc, ctx): ret = [] diff --git a/lm_eval/tasks/truthfulqa.py b/lm_eval/tasks/truthfulqa.py index f0b46196bc2..ad66c5ca6ad 100644 --- a/lm_eval/tasks/truthfulqa.py +++ b/lm_eval/tasks/truthfulqa.py @@ -58,7 +58,7 @@ def download(self): Path.mkdir(self.DATASET_PATH, parents=True) mc_url = "https://raw.githubusercontent.com/sylinrl/TruthfulQA/013686a06be7a7bde5bf8223943e106c7250123c/data/mc_task.json" checksum = "6eb4125d25750c0145c4be2dce00440736684ab6f74ce6bff2139571cc758954" - download_file(mc_url, str(self.DATASET_PATH / "mc_task.json"), checksum) + download_file(mc_url, local_file=str(self.DATASET_PATH / "mc_task.json"), expected_checksum=checksum) def has_training_docs(self): return False @@ -85,9 +85,14 @@ def doc_to_text(self, doc): def doc_to_target(self, doc): return " " - def fewshot_context(self, doc, num_fewshot, provide_description, rnd): + def fewshot_context(self, doc, num_fewshot, provide_description=None, rnd=None, description=None): assert num_fewshot == 0, "TruthfulQA is intended only for the zero-shot setting." - return super().fewshot_context(doc, num_fewshot, provide_description, rnd) + return super().fewshot_context( + doc=doc, + num_fewshot=num_fewshot, + rnd=rnd, + description=description + ) def construct_requests(self, doc, ctx): """ Uses RequestFactory to construct Requests and returns an iterable of @@ -163,7 +168,7 @@ def download(self): Path.mkdir(self.DATASET_PATH, parents=True) url = "https://raw.githubusercontent.com/sylinrl/TruthfulQA/013686a06be7a7bde5bf8223943e106c7250123c/TruthfulQA.csv" checksum = "8d7dd15f033196140f032d97d30f037da7a7b1192c3f36f9937c1850925335a2" - download_file(url, str(self.DATASET_PATH / "TruthfulQA.csv"), checksum) + download_file(url, local_file=str(self.DATASET_PATH / "TruthfulQA.csv"), expected_checksum=checksum) def has_training_docs(self): return False @@ -217,9 +222,14 @@ def doc_to_text(self, doc): def doc_to_target(self, doc): return " " - def fewshot_context(self, doc, num_fewshot, provide_description, rnd): + def fewshot_context(self, doc, num_fewshot, provide_description=None, rnd=None, description=None): assert num_fewshot == 0, "TruthfulQA is intended only for the zero-shot setting." 
- return super().fewshot_context(doc, num_fewshot, provide_description, rnd) + return super().fewshot_context( + doc=doc, + num_fewshot=num_fewshot, + rnd=rnd, + description=description + ) def construct_requests(self, doc, ctx): """ Uses RequestFactory to construct Requests and returns an iterable of diff --git a/lm_eval/tasks/unscramble.py b/lm_eval/tasks/unscramble.py index dc742a2ceef..41bccb0a79f 100644 --- a/lm_eval/tasks/unscramble.py +++ b/lm_eval/tasks/unscramble.py @@ -29,7 +29,7 @@ def download(self): if not file.exists(): rawfile = file.parent / (file.name + ".gz") base_url = "https://raw.githubusercontent.com/openai/gpt-3/master/data" - download_file(f"{base_url}/{self.FILENAME}.gz", str(rawfile), self.CHECKSUM) + download_file(f"{base_url}/{self.FILENAME}.gz", local_file=str(rawfile), expected_checksum=self.CHECKSUM) extract_gzip(gz=rawfile, to=file) def has_training_docs(self): @@ -45,9 +45,6 @@ def validation_docs(self): file = self.BASE_PATH / self.FILENAME return (json.loads(line) for line in open(file).read().splitlines()) - def fewshot_description(self): - return "Please unscramble the letters into a word, and write that word:" - def doc_to_text(self, doc): return doc["context"] diff --git a/lm_eval/tasks/webqs.py b/lm_eval/tasks/webqs.py index 51ed0167580..ebab7c8968f 100644 --- a/lm_eval/tasks/webqs.py +++ b/lm_eval/tasks/webqs.py @@ -17,10 +17,6 @@ def has_validation_docs(self): def has_test_docs(self): return True - def fewshot_description(self): - # TODO: figure out description - return "" - def doc_to_text(self, doc): return "Question: " + doc['question'] + '\nAnswer:' @@ -40,7 +36,6 @@ def _remove_prefixes(self, aliases): ret.append(alias) return ret - def construct_requests(self, doc, ctx): ret = [] @@ -62,4 +57,4 @@ def aggregation(self): def higher_is_better(self): return { "acc": True - } \ No newline at end of file + } diff --git a/lm_eval/tasks/wikitext.py b/lm_eval/tasks/wikitext.py index 24f9ec35074..40699863bb1 100644 --- a/lm_eval/tasks/wikitext.py +++ b/lm_eval/tasks/wikitext.py @@ -41,18 +41,14 @@ def wikitext_detokenizer(string): class WikiText(PerplexityTask): - VERSION = 0 + VERSION = 1 def download(self): if not os.path.exists('data/wikitext/wikitext-2-raw/wiki.valid.raw'): os.makedirs("data/wikitext/", exist_ok=True) - download_file("https://s3.amazonaws.com/research.metamind.io/wikitext/wikitext-2-raw-v1.zip", "data/wikitext/wikitext-2-raw-v1.zip", "ef7edb566e3e2b2d31b29c1fdb0c89a4cc683597484c3dc2517919c615435a11") + download_file("https://s3.amazonaws.com/research.metamind.io/wikitext/wikitext-2-raw-v1.zip", local_file="data/wikitext/wikitext-2-raw-v1.zip", expected_checksum="ef7edb566e3e2b2d31b29c1fdb0c89a4cc683597484c3dc2517919c615435a11") sh("cd data/wikitext/ && unzip wikitext-2-raw-v1.zip") - def fewshot_description(self): - # TODO: figure out fewshot description - return "" - def has_validation_docs(self): return True @@ -87,4 +83,4 @@ def doc_to_target(self, doc): def count_words(self, doc): # count number of words in *original doc before detokenization* - return len(re.split(r"\s+", doc)) \ No newline at end of file + return len(re.split(r"\s+", doc)) diff --git a/lm_eval/tasks/winogrande.py b/lm_eval/tasks/winogrande.py index 106c826a692..2e188d7b30c 100644 --- a/lm_eval/tasks/winogrande.py +++ b/lm_eval/tasks/winogrande.py @@ -29,10 +29,6 @@ def has_test_docs(self): def doc_to_text(self, doc): return self.partial_context(doc, doc["option" + doc["answer"]]) - def fewshot_description(self): - # TODO: redo description - return "Winograd 
schema sentence including a either a ___ blank with a missing word, making the pronoun ambiguous, or the same with the word filled in." - @classmethod def partial_context(cls, doc, option): # Substitute the pronoun in the sentence with the specified option diff --git a/lm_eval/tasks/wsc273.py b/lm_eval/tasks/wsc273.py index 20dd5175b60..505557b15c2 100644 --- a/lm_eval/tasks/wsc273.py +++ b/lm_eval/tasks/wsc273.py @@ -53,10 +53,6 @@ def has_validation_docs(self): def has_test_docs(self): return True - def fewshot_description(self): - # TODO: redo description - return "Winograd schema sentence with correct continuation. True. Winograd schema sentence with incorrect continuation. False." - def fewshot_examples(self, k, rnd): # NOTE: `super().fewshot_examples` samples from training docs which are # not available for this test-set-only dataset. diff --git a/lm_eval/utils.py b/lm_eval/utils.py index c3d718a5007..2a8c6d17fe8 100644 --- a/lm_eval/utils.py +++ b/lm_eval/utils.py @@ -1,6 +1,8 @@ import os import re import collections +import functools +import inspect class ExitCodeError(Exception): @@ -138,4 +140,18 @@ def get_original(self, newarr): assert all(cov) - return res \ No newline at end of file + return res + +def positional_deprecated(fn): + """ + A decorator to nudge users into passing only keyword args (`kwargs`) to the + wrapped function, `fn`. + """ + @functools.wraps(fn) + def _wrapper(*args, **kwargs): + if len(args) != 1 if inspect.ismethod(fn) else 0: + print(f"WARNING: using {fn.__name__} with positional arguments is " + "deprecated and will be disallowed in a future version of " + "lm-evaluation-harness!") + return fn(*args, **kwargs) + return _wrapper diff --git a/main.py b/main.py index c63446fcca4..cf13fe1315e 100644 --- a/main.py +++ b/main.py @@ -19,6 +19,7 @@ def parse_args(): parser.add_argument('--output_path', default=None) parser.add_argument('--limit', type=int, default=None) parser.add_argument('--no_cache', action="store_true") + parser.add_argument('--description_dict_path', default=None) return parser.parse_args() @@ -34,15 +35,21 @@ def main(): else: task_names = args.tasks.split(",") + description_dict = {} + if args.description_dict_path: + with open(args.description_dict_path, 'r') as f: + description_dict = json.load(f) + results = evaluator.simple_evaluate( model=args.model, model_args=args.model_args, - task_names=task_names, + tasks=task_names, num_fewshot=args.num_fewshot, batch_size=args.batch_size, device=args.device, no_cache=args.no_cache, limit=args.limit, + description_dict=description_dict ) dumped = json.dumps(results, indent=2) diff --git a/scripts/cost_estimate.py b/scripts/cost_estimate.py index 4339b8dbd21..d2e60bfa0d9 100644 --- a/scripts/cost_estimate.py +++ b/scripts/cost_estimate.py @@ -51,7 +51,14 @@ def main(): values = [] for taskname in task_list.split(","): lm.tokencost = 0 - evaluator.evaluate(lm, {taskname: tasks.get_task(taskname)()}, False, 0, None, bootstrap_iters=10) + evaluator.evaluate( + lm=lm, + task_dict={taskname: tasks.get_task(taskname)()}, + num_fewshot=0, + limit=None, + bootstrap_iters=10, + description_dict=None + ) print(taskname, lm.tokencost) values.append([taskname, lm.tokencost, lm.tokencost / 1000 * 0.0008, lm.tokencost / 1000 * 0.0012, lm.tokencost / 1000 * 0.006, lm.tokencost / 1000 * 0.06]) diff --git a/scripts/fewshot_description_experiment.py b/scripts/fewshot_description_experiment.py deleted file mode 100644 index e6ad97b3404..00000000000 --- a/scripts/fewshot_description_experiment.py +++ /dev/null 
@@ -1,79 +0,0 @@ -import json -import numpy as np -import random -import logging -from lm_eval import models, tasks, evaluator, base - -logging.getLogger("openai").setLevel(logging.WARNING) - - -fewshot_descriptions = [ - "foo", - "bar" -] - -task = "lambada" -num_fewshot = 0 -model = "gpt2" -model_args = "" -limit = None -no_cache = False - - -class CustomDescTask: - def __init__(self, task, desc): - self.task = task - self.desc = desc - - def fewshot_description(): - return self.desc - - self.task.fewshot_description = fewshot_description - - def __getattr__(self, attr): - return getattr(self.task, attr) - - -def main(): - random.seed(42) - np.random.seed(42) - - lm = models.get_model(model).create_from_arg_string(model_args) - - if limit: - print("WARNING: --limit SHOULD ONLY BE USED FOR TESTING. REAL METRICS SHOULD NOT BE COMPUTED USING LIMIT.") - - if not no_cache: - lm = base.CachingLM(lm, 'lm_cache/' + model + '_' + model_args.replace('=', '-').replace(',', '_') + '.db') - - task_dict = tasks.get_task_dict([task]) - - for desc in fewshot_descriptions: - custom_task_dict = {k: CustomDescTask(v, desc) for k, v in task_dict.items()} - - results = evaluator.evaluate(lm, custom_task_dict, True, num_fewshot, limit) - - dumped = json.dumps(results, indent=2) - - print('Description:', desc) - print(dumped) - - # MAKE TABLE - from pytablewriter import MarkdownTableWriter - - writer = MarkdownTableWriter() - writer.headers = ["Task", "Metric", "Value"] - - values = [] - - for k, dic in results.items(): - for m, v in dic.items(): - values.append([k, m, '%.4f' % v]) - k = "" - writer.value_matrix = values - - print(writer.dumps()) - - -if __name__ == "__main__": - main() diff --git a/scripts/get_prompts.py b/scripts/get_prompts.py index cdda6dc43e4..56a9ff79f44 100644 --- a/scripts/get_prompts.py +++ b/scripts/get_prompts.py @@ -9,7 +9,6 @@ print('#', tname) docs = islice(task.validation_docs() if task.has_validation_docs() else task.test_docs(), ct) print() - print('**Zero-Shot Prompt**:', "\n```\n" + task.fewshot_description() + "\n```\n") for i in range(ct): print() doc = next(docs) diff --git a/scripts/write_out.py b/scripts/write_out.py index b7eb30c15ad..2039d3934f3 100644 --- a/scripts/write_out.py +++ b/scripts/write_out.py @@ -1,5 +1,6 @@ import argparse import numpy as np +import json import os import random from lm_eval import tasks @@ -17,6 +18,7 @@ def parse_args(): parser.add_argument('--num_fewshot', type=int, default=1) parser.add_argument('--seed', type=int, default=42) parser.add_argument('--num_examples', type=int, default=1) + parser.add_argument('--description_dict_path', default=None) return parser.parse_args() @@ -29,6 +31,12 @@ def main(): else: task_names = args.tasks.split(",") task_dict = tasks.get_task_dict(task_names) + + description_dict = {} + if args.description_dict_path: + with open(args.description_dict_path, 'r') as f: + description_dict = json.load(f) + os.makedirs(args.output_base_path, exist_ok=True) for task_name, task in task_dict.items(): rnd = random.Random() @@ -47,14 +55,16 @@ def main(): docs = join_iters(iters) + description = description_dict[task_name] if description_dict and task_name in description_dict else "" + with open(os.path.join(args.output_base_path, task_name), "w") as f: for i, doc in zip(range(args.num_examples), docs) if args.num_examples > 0 else enumerate(docs): f.write(EXAMPLE_DIVIDER.format(i=i)) ctx = task.fewshot_context( doc=doc, - provide_description=args.provide_description, num_fewshot=args.num_fewshot, - rnd=rnd + 
rnd=rnd, + description=description ) f.write(ctx + "\n") diff --git a/tests/test_cache.db b/tests/test_cache.db deleted file mode 100644 index 7477f429bf6..00000000000 Binary files a/tests/test_cache.db and /dev/null differ diff --git a/tests/test_description_dict.py b/tests/test_description_dict.py new file mode 100644 index 00000000000..f80f5290638 --- /dev/null +++ b/tests/test_description_dict.py @@ -0,0 +1,42 @@ +import random +import lm_eval.tasks +import lm_eval.models + + +def test_description_dict(): + seed = 42 + num_examples = 1 + task_names = ["hellaswag", "winogrande"] + description_dict = { + "hellaswag": "Label for the relevant action:\nSentences describing context, with an incomplete sentence trailing answer that plausibly completes the situation.", + "winogrande": "Winograd schema sentence including a either a ___ blank with a missing word, making the pronoun ambiguous, or the same with the word filled in.", + } + + task_dict = lm_eval.tasks.get_task_dict(task_names) + for task_name, task in task_dict.items(): + rnd = random.Random() + rnd.seed(seed) + + if task.has_training_docs(): + docs = task.training_docs() + elif set == "val" and task.has_validation_docs(): + docs = task.validation_docs() + elif set == "test" and task.has_test_docs(): + docs = task.test_docs() + + description = ( + description_dict[task_name] + if description_dict and task_name in description_dict + else "" + ) + + for _, doc in ( + zip(range(num_examples), docs) if num_examples > 0 else enumerate(docs) + ): + ctx = task.fewshot_context( + doc=doc, + num_fewshot=1, + rnd=rnd, + description=description, + ) + assert description in ctx diff --git a/tests/test_evaluator.py b/tests/test_evaluator.py index 491e9de899d..363384a05c9 100644 --- a/tests/test_evaluator.py +++ b/tests/test_evaluator.py @@ -48,8 +48,22 @@ def ll_perp_fn(reqs): lm.loglikelihood_rolling = ll_perp_fn limit = 10 - e1 = evaluator.evaluate(lm, task_dict, False, 0, limit, bootstrap_iters=10) - e2 = evaluator.evaluate(lm, task_dict, False, 0, limit, bootstrap_iters=10) + e1 = evaluator.evaluate( + lm=lm, + task_dict=task_dict, + num_fewshot=0, + limit=limit, + bootstrap_iters=10, + description_dict=None + ) + e2 = evaluator.evaluate( + lm=lm, + task_dict=task_dict, + num_fewshot=0, + limit=limit, + bootstrap_iters=10, + description_dict=None + ) # check that caching is working assert e1 == e2 diff --git a/tests/test_tasks.py b/tests/test_tasks.py index 97baeacf8a9..46812798a91 100644 --- a/tests/test_tasks.py +++ b/tests/test_tasks.py @@ -32,7 +32,7 @@ def test_basic_interface(taskname, task_class): limit = None - if taskname in ["triviaqa"]: + if taskname in ["triviaqa"] or taskname.startswith("pile_"): limit = 10000 if task.has_validation_docs(): arr = list(islice(task.validation_docs(), limit)) diff --git a/tests/test_version_stable.py b/tests/test_version_stable.py index d230112de16..7dd36a94b6b 100644 --- a/tests/test_version_stable.py +++ b/tests/test_version_stable.py @@ -99,5 +99,13 @@ def greedy_until(reqs): lm.greedy_until = greedy_until limit = None - result = evaluator.evaluate(lm, task_dict, False, 0, limit, bootstrap_iters=10) + result = evaluator.evaluate( + lm=lm, + task_dict=task_dict, + num_fewshot=0, + limit=limit, + bootstrap_iters=10, + description_dict=None + ) + assert_target(f"{taskname}-v{task_class.VERSION}-res", result) diff --git a/tests/testdata/boolq-v1-loglikelihood b/tests/testdata/boolq-v1-loglikelihood new file mode 100644 index 00000000000..7811121c9fd --- /dev/null +++ 
b/tests/testdata/boolq-v1-loglikelihood @@ -0,0 +1 @@ +6577e0d88572772ef08e64f624c0e3df0953286ae1f118ccef15623b59ffeabf \ No newline at end of file diff --git a/tests/testdata/boolq-v1-res.json b/tests/testdata/boolq-v1-res.json new file mode 100644 index 00000000000..291b9f122d0 --- /dev/null +++ b/tests/testdata/boolq-v1-res.json @@ -0,0 +1 @@ +{"results": {"boolq": {"acc": 0.5048929663608562, "acc_stderr": 0.00874463623355505}}, "versions": {"boolq": 1}} \ No newline at end of file diff --git a/tests/testdata/cb-v1-loglikelihood b/tests/testdata/cb-v1-loglikelihood new file mode 100644 index 00000000000..ad7e928fe6a --- /dev/null +++ b/tests/testdata/cb-v1-loglikelihood @@ -0,0 +1 @@ +77b11f4348eb8a7f57faf95c531fda01ab4bf0e729f91a82451ed8e71ec8e66d \ No newline at end of file diff --git a/tests/testdata/cb-v1-res.json b/tests/testdata/cb-v1-res.json new file mode 100644 index 00000000000..1cff410b2c3 --- /dev/null +++ b/tests/testdata/cb-v1-res.json @@ -0,0 +1 @@ +{"results": {"cb": {"acc": 0.3392857142857143, "acc_stderr": 0.06384226561930825, "f1": 0.2819143819143819}}, "versions": {"cb": 1}} \ No newline at end of file diff --git a/tests/testdata/headqa_en-v0-loglikelihood b/tests/testdata/headqa_en-v0-loglikelihood new file mode 100644 index 00000000000..11f07878fb5 --- /dev/null +++ b/tests/testdata/headqa_en-v0-loglikelihood @@ -0,0 +1 @@ +09da45119b12a0144e3081f8fb790c2a22af7b9c3aac42f54423d348a711fbf5 \ No newline at end of file diff --git a/tests/testdata/headqa_en-v0-res.json b/tests/testdata/headqa_en-v0-res.json new file mode 100644 index 00000000000..6ac5a9c0b8e --- /dev/null +++ b/tests/testdata/headqa_en-v0-res.json @@ -0,0 +1 @@ +{"results": {"headqa_en": {"acc": 0.23559445660102116, "acc_norm": 0.2447118891320204, "acc_norm_stderr": 0.008211629406841468, "acc_stderr": 0.008105688874297972}}, "versions": {"headqa_en": 0}} \ No newline at end of file diff --git a/tests/testdata/headqa_es-v0-loglikelihood b/tests/testdata/headqa_es-v0-loglikelihood new file mode 100644 index 00000000000..9129d834b60 --- /dev/null +++ b/tests/testdata/headqa_es-v0-loglikelihood @@ -0,0 +1 @@ +767ca34d9714edd9fb030ddbcc35a64e5180d1e247b0cb557fbb22fdf971ad1f \ No newline at end of file diff --git a/tests/testdata/headqa_es-v0-res.json b/tests/testdata/headqa_es-v0-res.json new file mode 100644 index 00000000000..0964db9bbb8 --- /dev/null +++ b/tests/testdata/headqa_es-v0-res.json @@ -0,0 +1 @@ +{"results": {"headqa_es": {"acc": 0.23559445660102116, "acc_norm": 0.25018234865062, "acc_norm_stderr": 0.008272783230806014, "acc_stderr": 0.008105688874297972}}, "versions": {"headqa_es": 0}} \ No newline at end of file diff --git a/tests/testdata/multirc-v1-loglikelihood b/tests/testdata/multirc-v1-loglikelihood new file mode 100644 index 00000000000..52a89c6f9ea --- /dev/null +++ b/tests/testdata/multirc-v1-loglikelihood @@ -0,0 +1 @@ +0e793bd6f637a70a04c6f2cda080188fc037961b2f909095fe63f7bdbc4a90c6 \ No newline at end of file diff --git a/tests/testdata/multirc-v1-res.json b/tests/testdata/multirc-v1-res.json new file mode 100644 index 00000000000..938141bbb88 --- /dev/null +++ b/tests/testdata/multirc-v1-res.json @@ -0,0 +1 @@ +{"results": {"multirc": {"acc": 0.046169989506820566, "acc_stderr": 0.006801377886208738}}, "versions": {"multirc": 1}} \ No newline at end of file diff --git a/tests/testdata/pile_arxiv-v1-loglikelihood_rolling b/tests/testdata/pile_arxiv-v1-loglikelihood_rolling new file mode 100644 index 00000000000..3aa1d8c7349 --- /dev/null +++ 
b/tests/testdata/pile_arxiv-v1-loglikelihood_rolling @@ -0,0 +1 @@ +814f9954e44368559602c00f7e85fa3971acdfd0315f508ec7df6318a79c55ec \ No newline at end of file diff --git a/tests/testdata/pile_arxiv-v1-res.json b/tests/testdata/pile_arxiv-v1-res.json new file mode 100644 index 00000000000..05cbab38732 --- /dev/null +++ b/tests/testdata/pile_arxiv-v1-res.json @@ -0,0 +1 @@ +{"results": {"pile_arxiv": {"bits_per_byte": 1.55095665856779e-05, "byte_perplexity": 1.0000107504701365, "word_perplexity": 1.0000819333090385}}, "versions": {"pile_arxiv": 1}} \ No newline at end of file diff --git a/tests/testdata/pile_bookcorpus2-v1-loglikelihood_rolling b/tests/testdata/pile_bookcorpus2-v1-loglikelihood_rolling new file mode 100644 index 00000000000..b37a91cc2de --- /dev/null +++ b/tests/testdata/pile_bookcorpus2-v1-loglikelihood_rolling @@ -0,0 +1 @@ +5c17ddfebeab8c41dabadb6fc216ceda91e3fe5dc95aaf1b2c843d7f11828b03 \ No newline at end of file diff --git a/tests/testdata/pile_bookcorpus2-v1-res.json b/tests/testdata/pile_bookcorpus2-v1-res.json new file mode 100644 index 00000000000..967c14934b8 --- /dev/null +++ b/tests/testdata/pile_bookcorpus2-v1-res.json @@ -0,0 +1 @@ +{"results": {"pile_bookcorpus2": {"bits_per_byte": 1.6780040419457868e-06, "byte_perplexity": 1.000001163104447, "word_perplexity": 1.0000066499426599}}, "versions": {"pile_bookcorpus2": 1}} \ No newline at end of file diff --git a/tests/testdata/pile_books3-v1-loglikelihood_rolling b/tests/testdata/pile_books3-v1-loglikelihood_rolling new file mode 100644 index 00000000000..b483d3b45b4 --- /dev/null +++ b/tests/testdata/pile_books3-v1-loglikelihood_rolling @@ -0,0 +1 @@ +0f8f36f705b999b6d55fa72ff89a82793dd1cb568ab1f8727a6a2086a12b9410 \ No newline at end of file diff --git a/tests/testdata/pile_books3-v1-res.json b/tests/testdata/pile_books3-v1-res.json new file mode 100644 index 00000000000..6ff7a517112 --- /dev/null +++ b/tests/testdata/pile_books3-v1-res.json @@ -0,0 +1 @@ +{"results": {"pile_books3": {"bits_per_byte": 1.2901280503011222e-06, "byte_perplexity": 1.0000008942490204, "word_perplexity": 1.0000052870063607}}, "versions": {"pile_books3": 1}} \ No newline at end of file diff --git a/tests/testdata/pile_dm-mathematics-v1-loglikelihood_rolling b/tests/testdata/pile_dm-mathematics-v1-loglikelihood_rolling new file mode 100644 index 00000000000..2fb27786c54 --- /dev/null +++ b/tests/testdata/pile_dm-mathematics-v1-loglikelihood_rolling @@ -0,0 +1 @@ +d5b7967c0ece8b816f3921a8bd0fad23365349e935b491595e2ad1135af42da6 \ No newline at end of file diff --git a/tests/testdata/pile_dm-mathematics-v1-res.json b/tests/testdata/pile_dm-mathematics-v1-res.json new file mode 100644 index 00000000000..192e9066a42 --- /dev/null +++ b/tests/testdata/pile_dm-mathematics-v1-res.json @@ -0,0 +1 @@ +{"results": {"pile_dm-mathematics": {"bits_per_byte": 8.910951449933553e-05, "byte_perplexity": 1.0000617679162955, "word_perplexity": 1.0002875035042451}}, "versions": {"pile_dm-mathematics": 1}} \ No newline at end of file diff --git a/tests/testdata/pile_enron-v1-loglikelihood_rolling b/tests/testdata/pile_enron-v1-loglikelihood_rolling new file mode 100644 index 00000000000..57dbe764605 --- /dev/null +++ b/tests/testdata/pile_enron-v1-loglikelihood_rolling @@ -0,0 +1 @@ +4baa6ccdc9e3aa9921675ab4400d5e89d7b546b844a8ea28f6461d649066418a \ No newline at end of file diff --git a/tests/testdata/pile_enron-v1-res.json b/tests/testdata/pile_enron-v1-res.json new file mode 100644 index 00000000000..abe7b45f9af --- /dev/null +++ 
b/tests/testdata/pile_enron-v1-res.json @@ -0,0 +1 @@ +{"results": {"pile_enron": {"bits_per_byte": 0.0004564546920781453, "byte_perplexity": 1.000316440339552, "word_perplexity": 1.00224668051869}}, "versions": {"pile_enron": 1}} \ No newline at end of file diff --git a/tests/testdata/pile_europarl-v1-loglikelihood_rolling b/tests/testdata/pile_europarl-v1-loglikelihood_rolling new file mode 100644 index 00000000000..80272607557 --- /dev/null +++ b/tests/testdata/pile_europarl-v1-loglikelihood_rolling @@ -0,0 +1 @@ +e67d3dbccd47d308bfc5b0e66b76d0dfc5e386ebfa94e056562c2281c395543f \ No newline at end of file diff --git a/tests/testdata/pile_europarl-v1-res.json b/tests/testdata/pile_europarl-v1-res.json new file mode 100644 index 00000000000..b948f0d3691 --- /dev/null +++ b/tests/testdata/pile_europarl-v1-res.json @@ -0,0 +1 @@ +{"results": {"pile_europarl": {"bits_per_byte": 1.2477664839621123e-05, "byte_perplexity": 1.000008648895605, "word_perplexity": 1.000063506523818}}, "versions": {"pile_europarl": 1}} \ No newline at end of file diff --git a/tests/testdata/pile_freelaw-v1-loglikelihood_rolling b/tests/testdata/pile_freelaw-v1-loglikelihood_rolling new file mode 100644 index 00000000000..7b5771f4911 --- /dev/null +++ b/tests/testdata/pile_freelaw-v1-loglikelihood_rolling @@ -0,0 +1 @@ +d77f3f68aadd6cbf1290c2f6737b2ed5d5c2a60e4c81a65c280f207783caabe1 \ No newline at end of file diff --git a/tests/testdata/pile_freelaw-v1-res.json b/tests/testdata/pile_freelaw-v1-res.json new file mode 100644 index 00000000000..dd0e0bac36b --- /dev/null +++ b/tests/testdata/pile_freelaw-v1-res.json @@ -0,0 +1 @@ +{"results": {"pile_freelaw": {"bits_per_byte": 4.5623635481434923e-05, "byte_perplexity": 1.0000316243943415, "word_perplexity": 1.000203169094218}}, "versions": {"pile_freelaw": 1}} \ No newline at end of file diff --git a/tests/testdata/pile_github-v1-loglikelihood_rolling b/tests/testdata/pile_github-v1-loglikelihood_rolling new file mode 100644 index 00000000000..cf8251e4f68 --- /dev/null +++ b/tests/testdata/pile_github-v1-loglikelihood_rolling @@ -0,0 +1 @@ +df384c3df3d8f53273e97127c5bb84c17e638acad7d6bc9c91f6dee96d43b639 \ No newline at end of file diff --git a/tests/testdata/pile_github-v1-res.json b/tests/testdata/pile_github-v1-res.json new file mode 100644 index 00000000000..cc06a45501f --- /dev/null +++ b/tests/testdata/pile_github-v1-res.json @@ -0,0 +1 @@ +{"results": {"pile_github": {"bits_per_byte": 0.00013764216145332133, "byte_perplexity": 1.0000954108274611, "word_perplexity": 1.0009643183931227}}, "versions": {"pile_github": 1}} \ No newline at end of file diff --git a/tests/testdata/pile_gutenberg-v1-loglikelihood_rolling b/tests/testdata/pile_gutenberg-v1-loglikelihood_rolling new file mode 100644 index 00000000000..bd7b15927f7 --- /dev/null +++ b/tests/testdata/pile_gutenberg-v1-loglikelihood_rolling @@ -0,0 +1 @@ +02a559f74a9105145e7d4d9c5ddea372b5b4938f5368dc8ffafc39cbe3b4c7ef \ No newline at end of file diff --git a/tests/testdata/pile_gutenberg-v1-res.json b/tests/testdata/pile_gutenberg-v1-res.json new file mode 100644 index 00000000000..6d22ed3ff50 --- /dev/null +++ b/tests/testdata/pile_gutenberg-v1-res.json @@ -0,0 +1 @@ +{"results": {"pile_gutenberg": {"bits_per_byte": 1.7952329146458065e-06, "byte_perplexity": 1.0000012443614075, "word_perplexity": 1.0000072174665404}}, "versions": {"pile_gutenberg": 1}} \ No newline at end of file diff --git a/tests/testdata/pile_hackernews-v1-loglikelihood_rolling b/tests/testdata/pile_hackernews-v1-loglikelihood_rolling new 
file mode 100644 index 00000000000..48b767bfe70 --- /dev/null +++ b/tests/testdata/pile_hackernews-v1-loglikelihood_rolling @@ -0,0 +1 @@ +ec1082ee5a5326e0d57aa4e73b634937140c1de9af95f154e8ab57b05d9b422b \ No newline at end of file diff --git a/tests/testdata/pile_hackernews-v1-res.json b/tests/testdata/pile_hackernews-v1-res.json new file mode 100644 index 00000000000..ea135278b72 --- /dev/null +++ b/tests/testdata/pile_hackernews-v1-res.json @@ -0,0 +1 @@ +{"results": {"pile_hackernews": {"bits_per_byte": 0.00014672607267878518, "byte_perplexity": 1.0001017079354932, "word_perplexity": 1.0006273924348839}}, "versions": {"pile_hackernews": 1}} \ No newline at end of file diff --git a/tests/testdata/pile_nih-exporter-v1-loglikelihood_rolling b/tests/testdata/pile_nih-exporter-v1-loglikelihood_rolling new file mode 100644 index 00000000000..5f76588a813 --- /dev/null +++ b/tests/testdata/pile_nih-exporter-v1-loglikelihood_rolling @@ -0,0 +1 @@ +520ea6e04e8a39dc0b5f63a837429a78a40e63d39d109096101feb8c5b2cf8d8 \ No newline at end of file diff --git a/tests/testdata/pile_nih-exporter-v1-res.json b/tests/testdata/pile_nih-exporter-v1-res.json new file mode 100644 index 00000000000..0e40fc8268a --- /dev/null +++ b/tests/testdata/pile_nih-exporter-v1-res.json @@ -0,0 +1 @@ +{"results": {"pile_nih-exporter": {"bits_per_byte": 0.00035193728014978225, "byte_perplexity": 1.0002439740903082, "word_perplexity": 1.0016712202288802}}, "versions": {"pile_nih-exporter": 1}} \ No newline at end of file diff --git a/tests/testdata/pile_opensubtitles-v1-loglikelihood_rolling b/tests/testdata/pile_opensubtitles-v1-loglikelihood_rolling new file mode 100644 index 00000000000..47805d3b5fe --- /dev/null +++ b/tests/testdata/pile_opensubtitles-v1-loglikelihood_rolling @@ -0,0 +1 @@ +0f1c23a1f4ddec0c2b1ff34de8d1505b0eb9e2868d8edbcc1b6de13d02f32036 \ No newline at end of file diff --git a/tests/testdata/pile_opensubtitles-v1-res.json b/tests/testdata/pile_opensubtitles-v1-res.json new file mode 100644 index 00000000000..1468294732b --- /dev/null +++ b/tests/testdata/pile_opensubtitles-v1-res.json @@ -0,0 +1 @@ +{"results": {"pile_opensubtitles": {"bits_per_byte": 2.1948356082685497e-05, "byte_perplexity": 1.0000152135568616, "word_perplexity": 1.0000856162053249}}, "versions": {"pile_opensubtitles": 1}} \ No newline at end of file diff --git a/tests/testdata/pile_openwebtext2-v1-loglikelihood_rolling b/tests/testdata/pile_openwebtext2-v1-loglikelihood_rolling new file mode 100644 index 00000000000..22046e44058 --- /dev/null +++ b/tests/testdata/pile_openwebtext2-v1-loglikelihood_rolling @@ -0,0 +1 @@ +5d6c19665f429ab1ccbe027da67f42bdaf219f819ab093673976eee55e015ff4 \ No newline at end of file diff --git a/tests/testdata/pile_openwebtext2-v1-res.json b/tests/testdata/pile_openwebtext2-v1-res.json new file mode 100644 index 00000000000..ca433e3c854 --- /dev/null +++ b/tests/testdata/pile_openwebtext2-v1-res.json @@ -0,0 +1 @@ +{"results": {"pile_openwebtext2": {"bits_per_byte": 0.000184802319359215, "byte_perplexity": 1.000128103411166, "word_perplexity": 1.0007951516532847}}, "versions": {"pile_openwebtext2": 1}} \ No newline at end of file diff --git a/tests/testdata/pile_philpapers-v1-loglikelihood_rolling b/tests/testdata/pile_philpapers-v1-loglikelihood_rolling new file mode 100644 index 00000000000..4fbbc241ba9 --- /dev/null +++ b/tests/testdata/pile_philpapers-v1-loglikelihood_rolling @@ -0,0 +1 @@ +339ba5d8c044c4a3ff9b9a8eaa24da1d6c01b72972074eb671a7da049eeb7047 \ No newline at end of file diff --git 
a/tests/testdata/pile_philpapers-v1-res.json b/tests/testdata/pile_philpapers-v1-res.json new file mode 100644 index 00000000000..5a2f77678ab --- /dev/null +++ b/tests/testdata/pile_philpapers-v1-res.json @@ -0,0 +1 @@ +{"results": {"pile_philpapers": {"bits_per_byte": 9.004690592465457e-06, "byte_perplexity": 1.0000062415953748, "word_perplexity": 1.0000409888564146}}, "versions": {"pile_philpapers": 1}} \ No newline at end of file diff --git a/tests/testdata/pile_pile-cc-v1-loglikelihood_rolling b/tests/testdata/pile_pile-cc-v1-loglikelihood_rolling new file mode 100644 index 00000000000..d5369ed3c97 --- /dev/null +++ b/tests/testdata/pile_pile-cc-v1-loglikelihood_rolling @@ -0,0 +1 @@ +731fdef4a43949b179ba0c540148ebc2fa41583dd583ef580dd812076c66a451 \ No newline at end of file diff --git a/tests/testdata/pile_pile-cc-v1-res.json b/tests/testdata/pile_pile-cc-v1-res.json new file mode 100644 index 00000000000..bd2772e32a9 --- /dev/null +++ b/tests/testdata/pile_pile-cc-v1-res.json @@ -0,0 +1 @@ +{"results": {"pile_pile-cc": {"bits_per_byte": 0.0001620742639125056, "byte_perplexity": 1.0001123476295946, "word_perplexity": 1.0006738958554477}}, "versions": {"pile_pile-cc": 1}} \ No newline at end of file diff --git a/tests/testdata/pile_pubmed-abstracts-v1-loglikelihood_rolling b/tests/testdata/pile_pubmed-abstracts-v1-loglikelihood_rolling new file mode 100644 index 00000000000..de5660d60a8 --- /dev/null +++ b/tests/testdata/pile_pubmed-abstracts-v1-loglikelihood_rolling @@ -0,0 +1 @@ +66436569a43163afb2caf422d32c5f329899e74c49865d4d13881fd465fd9976 \ No newline at end of file diff --git a/tests/testdata/pile_pubmed-abstracts-v1-res.json b/tests/testdata/pile_pubmed-abstracts-v1-res.json new file mode 100644 index 00000000000..21b6bb451fe --- /dev/null +++ b/tests/testdata/pile_pubmed-abstracts-v1-res.json @@ -0,0 +1 @@ +{"results": {"pile_pubmed-abstracts": {"bits_per_byte": 0.0005417858444030858, "byte_perplexity": 1.0003756078534862, "word_perplexity": 1.0025884332779}}, "versions": {"pile_pubmed-abstracts": 1}} \ No newline at end of file diff --git a/tests/testdata/pile_pubmed-central-v1-loglikelihood_rolling b/tests/testdata/pile_pubmed-central-v1-loglikelihood_rolling new file mode 100644 index 00000000000..283109f32e0 --- /dev/null +++ b/tests/testdata/pile_pubmed-central-v1-loglikelihood_rolling @@ -0,0 +1 @@ +40b39d120d99a145690444e86acc3e3e24d41e6e0538a75e26929ad84926e5e0 \ No newline at end of file diff --git a/tests/testdata/pile_pubmed-central-v1-res.json b/tests/testdata/pile_pubmed-central-v1-res.json new file mode 100644 index 00000000000..4d4a241ace0 --- /dev/null +++ b/tests/testdata/pile_pubmed-central-v1-res.json @@ -0,0 +1 @@ +{"results": {"pile_pubmed-central": {"bits_per_byte": 2.2812488135667854e-05, "byte_perplexity": 1.0000158125368497, "word_perplexity": 1.000123107107861}}, "versions": {"pile_pubmed-central": 1}} \ No newline at end of file diff --git a/tests/testdata/pile_stackexchange-v1-loglikelihood_rolling b/tests/testdata/pile_stackexchange-v1-loglikelihood_rolling new file mode 100644 index 00000000000..dcf0e64cf0d --- /dev/null +++ b/tests/testdata/pile_stackexchange-v1-loglikelihood_rolling @@ -0,0 +1 @@ +e524bfb3e21cbdaddc117403a50df598520c7bf5b2c60ad8f2372cfa564e79be \ No newline at end of file diff --git a/tests/testdata/pile_stackexchange-v1-res.json b/tests/testdata/pile_stackexchange-v1-res.json new file mode 100644 index 00000000000..2773302990f --- /dev/null +++ b/tests/testdata/pile_stackexchange-v1-res.json @@ -0,0 +1 @@ +{"results": 
{"pile_stackexchange": {"bits_per_byte": 0.0003302063346758449, "byte_perplexity": 1.0002289077852733, "word_perplexity": 1.0016993562258851}}, "versions": {"pile_stackexchange": 1}} \ No newline at end of file diff --git a/tests/testdata/pile_ubuntu-irc-v1-loglikelihood_rolling b/tests/testdata/pile_ubuntu-irc-v1-loglikelihood_rolling new file mode 100644 index 00000000000..ce041998635 --- /dev/null +++ b/tests/testdata/pile_ubuntu-irc-v1-loglikelihood_rolling @@ -0,0 +1 @@ +4eb69e314f0864ec8890e2323d7e76f8a8309692c4f090e2b41bf4be681a811d \ No newline at end of file diff --git a/tests/testdata/pile_ubuntu-irc-v1-res.json b/tests/testdata/pile_ubuntu-irc-v1-res.json new file mode 100644 index 00000000000..0e3b1b25977 --- /dev/null +++ b/tests/testdata/pile_ubuntu-irc-v1-res.json @@ -0,0 +1 @@ +{"results": {"pile_ubuntu-irc": {"bits_per_byte": 2.3513498942121155e-06, "byte_perplexity": 1.0000016298328778, "word_perplexity": 1.0000108866656874}}, "versions": {"pile_ubuntu-irc": 1}} \ No newline at end of file diff --git a/tests/testdata/pile_uspto-v1-loglikelihood_rolling b/tests/testdata/pile_uspto-v1-loglikelihood_rolling new file mode 100644 index 00000000000..4649d3b9b7f --- /dev/null +++ b/tests/testdata/pile_uspto-v1-loglikelihood_rolling @@ -0,0 +1 @@ +789b2bdb31564d512b70f801316f49320a26c83ba361226bac0afb255341d477 \ No newline at end of file diff --git a/tests/testdata/pile_uspto-v1-res.json b/tests/testdata/pile_uspto-v1-res.json new file mode 100644 index 00000000000..599ae44ef43 --- /dev/null +++ b/tests/testdata/pile_uspto-v1-res.json @@ -0,0 +1 @@ +{"results": {"pile_uspto": {"bits_per_byte": 0.000174024142670342, "byte_perplexity": 1.00012063161925, "word_perplexity": 1.0007716198916954}}, "versions": {"pile_uspto": 1}} \ No newline at end of file diff --git a/tests/testdata/pile_wikipedia-v1-loglikelihood_rolling b/tests/testdata/pile_wikipedia-v1-loglikelihood_rolling new file mode 100644 index 00000000000..e44bd276280 --- /dev/null +++ b/tests/testdata/pile_wikipedia-v1-loglikelihood_rolling @@ -0,0 +1 @@ +ef9ec0dd408316ca6537228a6812e839f14b30608973081d41efc47c138338da \ No newline at end of file diff --git a/tests/testdata/pile_wikipedia-v1-res.json b/tests/testdata/pile_wikipedia-v1-res.json new file mode 100644 index 00000000000..4f2314e66b3 --- /dev/null +++ b/tests/testdata/pile_wikipedia-v1-res.json @@ -0,0 +1 @@ +{"results": {"pile_wikipedia": {"bits_per_byte": 0.00024287370359008176, "byte_perplexity": 1.0001683613940646, "word_perplexity": 1.001084677949439}}, "versions": {"pile_wikipedia": 1}} \ No newline at end of file diff --git a/tests/testdata/pile_youtubesubtitles-v1-loglikelihood_rolling b/tests/testdata/pile_youtubesubtitles-v1-loglikelihood_rolling new file mode 100644 index 00000000000..81c2e5ed063 --- /dev/null +++ b/tests/testdata/pile_youtubesubtitles-v1-loglikelihood_rolling @@ -0,0 +1 @@ +68263c52adc0086011e2220b619983935cabb1cc1f5f9f8ee1a74ab2a7457967 \ No newline at end of file diff --git a/tests/testdata/pile_youtubesubtitles-v1-res.json b/tests/testdata/pile_youtubesubtitles-v1-res.json new file mode 100644 index 00000000000..fcf2faa8bc7 --- /dev/null +++ b/tests/testdata/pile_youtubesubtitles-v1-res.json @@ -0,0 +1 @@ +{"results": {"pile_youtubesubtitles": {"bits_per_byte": 3.3827117222045906e-05, "byte_perplexity": 1.000023447445816, "word_perplexity": 1.0001529192262875}}, "versions": {"pile_youtubesubtitles": 1}} \ No newline at end of file diff --git a/tests/testdata/wikitext-v1-loglikelihood_rolling 
b/tests/testdata/wikitext-v1-loglikelihood_rolling new file mode 100644 index 00000000000..f09af45a38c --- /dev/null +++ b/tests/testdata/wikitext-v1-loglikelihood_rolling @@ -0,0 +1 @@ +b6f83e6cf7535ee41b0057c3e2ec2cf7f2fa5a9119b305c479a83091d1142b2c \ No newline at end of file diff --git a/tests/testdata/wikitext-v1-res.json b/tests/testdata/wikitext-v1-res.json new file mode 100644 index 00000000000..122098aec22 --- /dev/null +++ b/tests/testdata/wikitext-v1-res.json @@ -0,0 +1 @@ +{"results": {"wikitext": {"bits_per_byte": 3.202519859941674e-05, "byte_perplexity": 1.0000221984224973, "word_perplexity": 1.000118710696617}}, "versions": {"wikitext": 1}} \ No newline at end of file
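
Taken together, the changes above retire the per-task `fewshot_description` hooks in favor of a user-supplied `description_dict` and keyword-argument call sites for the evaluator. The sketch below mirrors the updated call site in `main.py`; it is a minimal sketch, assuming `lm_eval` at this revision is importable, and the model name, task list, and JSON path are illustrative rather than required values.

```python
import json

from lm_eval import evaluator

# Illustrative path; the file is a JSON mapping of task names to description strings.
with open("/your/path/descriptions.json") as f:
    description_dict = json.load(f)

# Keyword arguments mirror what main.py now passes through from the CLI.
results = evaluator.simple_evaluate(
    model="gpt2",                      # illustrative registered model name
    model_args="",
    tasks=["copa", "cycle_letters"],   # note: `tasks`, not the old `task_names`
    num_fewshot=0,
    batch_size=None,
    device=None,
    no_cache=False,
    limit=10,                          # small limit for testing only, as main.py warns
    description_dict=description_dict,
)
print(json.dumps(results, indent=2))
```

The updated call sites in `main.py`, `scripts/cost_estimate.py`, and the tests all pass keyword arguments, which is the convention the new `positional_deprecated` helper in `lm_eval/utils.py` nudges callers toward.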
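At the task level, a description now reaches the prompt through the `description` keyword of `fewshot_context`, as exercised by `scripts/write_out.py` and `tests/test_description_dict.py`. A minimal sketch of that flow, reusing (as an illustrative value) the wording that the removed Copa `fewshot_description` used to return:

```python
import random

import lm_eval.tasks

# Per-task description; any registered task name works the same way.
description_dict = {
    "copa": "Given a premise and one alternative with a causal relation to the "
            "premise and another without, choose the more plausible alternative",
}

task_dict = lm_eval.tasks.get_task_dict(["copa"])
rnd = random.Random(42)

for task_name, task in task_dict.items():
    doc = next(iter(task.validation_docs()))
    description = description_dict.get(task_name, "")
    ctx = task.fewshot_context(
        doc=doc,
        num_fewshot=0,
        rnd=rnd,
        description=description,
    )
    # The description is prepended to the (here zero-shot) prompt context.
    assert description in ctx
```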