diff --git a/README.md b/README.md index 2dde05a586f..2aaa45137c5 100644 --- a/README.md +++ b/README.md @@ -21,7 +21,7 @@ pip install lm-eval ## Basic Usage -To evaluate a model, (e.g. GPT-2) on NLU tasks (e.g. LAMBADA, HellaSwag), you can run the following command. +To evaluate a model, (e.g. GPT-2) on NLU tasks (e.g. LAMBADA, HellaSwag), you can run the following command. **When reporting results from eval harness, please include the task versions (shown in `results["versions"]`) for reproducibility.** This allows bug fixes to tasks while also ensuring that previously reported scores are reproducible. See the [Task Versioning](https://github.com/EleutherAI/lm-evaluation-harness#task-versioning) section for more info. ```bash python main.py \ @@ -55,7 +55,7 @@ To evaluate mesh-transformer-jax models that are not available on HF, please inv ## Implementing new tasks -To implement a new task in eval harness, see [this guide](https://github.com/EleutherAI/lm-evaluation-harness/blob/master/task-guide.md). +To implement a new task in eval harness, see [this guide](./docs/task_guide.md). ## Cite as @@ -128,8 +128,9 @@ To implement a new task in eval harness, see [this guide](https://github.com/Ele |openbookqa |✓ |✓ |✓ | 500|acc, acc_norm | |squad2 |✓ |✓ | | 11873|exact, f1, HasAns_exact, HasAns_f1, NoAns_exact, NoAns_f1, best_exact, best_f1| |race |✓ |✓ |✓ | 1045|acc | -|headqa |✓ |✓ |✓ | 2742|acc, acc_norm | |mathqa |✓ |✓ |✓ | 2985|acc, acc_norm | +|headqa_es |✓ |✓ |✓ | 2742|acc, acc_norm | +|headqa_en |✓ |✓ |✓ | 2742|acc, acc_norm | |webqs |✓ | |✓ | 2032|acc | |wsc273 | | |✓ | 273|acc | |winogrande |✓ |✓ | | 1267|acc | @@ -363,7 +364,6 @@ To inspect what the LM inputs look like, you can run the following command: ```bash python write_out.py \ --tasks all_tasks \ - --provide_description \ --num_fewshot 5 \ --num_examples 10 \ --output_base_path /path/to/output/folder diff --git a/docs/description_guide.md b/docs/description_guide.md new file mode 100644 index 00000000000..b3fea0834f2 --- /dev/null +++ b/docs/description_guide.md @@ -0,0 +1,49 @@ +# Description Guide + +![fewshot-example](./img/fewshot_example_gpt3.png) +(Figure from [Brown et al., 2020](https://arxiv.org/pdf/2005.14165.pdf)) + +Task descriptions provide in-context task instruction for your language model. If you'd like to prepend a natural language description to your few-shot examples and prompt, you can do so on a per-task basis via the `description_dict` arg of [`evaluator.evaluate`](../lm_eval/evaluator.py). This `description_dict` must adhere to the following key-value structure: + +- **key**: the task name (`str`) as specified in the lm-eval-harness [task registry](../lm_eval/tasks/__init__.py). +- **value**: the corresponding (`str`) description/prompt for the task identified by **key**. + +```python +description_dict = { + "task_name_1": "description", + "task_name_2": "description", + ... +} +``` + +Note that a task's description will be separated from its following few-shot examples and prompt by a new line as such: + +```python +""" + + + + + +""" +``` + +## Descriptions in File + +One can also interface with the aforementioned [`evaluator.evaluate`](../lm_eval/evaluator.py) (or `evaluator.simple_evaluate`) method from a higher level by simply passing a JSON file path to the `description_dict_path` arg of the command-line interface (CLI) program, `main.py`. The JSON file pointed to should be structured the same as the `description_dict`. E.g. 
for some file at `/your/path/descriptions.json` you may have: + +```json +{ + "cycle_letters": "Please unscramble the letters into a word, and write that word:", + "copa": "Given a premise and one alternative with a causal relation to the premise and another without, choose the more plausible alternative" +} +``` + +which can then be supplied to the CLI as: + +```bash +python main.py \ +--tasks cycle_letters,copa \ +--description_dict_path /your/path/descriptions.json \ +... +``` diff --git a/docs/img/fewshot_example_gpt3.png b/docs/img/fewshot_example_gpt3.png new file mode 100644 index 00000000000..b199736867a Binary files /dev/null and b/docs/img/fewshot_example_gpt3.png differ diff --git a/task-guide.md b/docs/task_guide.md similarity index 94% rename from task-guide.md rename to docs/task_guide.md index 5ea43fc2f41..f3b2c986ba6 100644 --- a/task-guide.md +++ b/docs/task_guide.md @@ -87,8 +87,7 @@ There are 2 standard approaches we follow for downloading data: ``` These methods return `True`/`False` whether or not your task dataset provides documents for each split type. __Note__: if the test set doesn't have publicly available labels, please do not put it down as having a test set. - Lastly, we need to load the documents. In our terminology, a document (`doc`) is a single natural language data example stored in a Python `dict`. E.g.: - `{“question”: “What is the capital of France?”, “answer”: “Paris”}`. Override the following methods to load your data splits from their storage location in `DATASET_PATH`: + Lastly, we need to load the documents. In our terminology, a document (`doc`) is a single natural language data example stored in a Python `dict`. E.g.: `{“question”: “What is the capital of France?”, “answer”: “Paris”}`. Override the following methods to load your data splits from their storage location in `DATASET_PATH`: ```python def training_docs(self): return #... @@ -125,17 +124,9 @@ You can now skip ahead to registering your task
+In the case your task is _not_ multiple-choice, override the following methods for your task class: -In the case your task is not multiple-choice, override the following methods for your task class: - -Put the natural language task description as a single line (no `\n`s) string here. E.g. `"Translate English to French:"` - -```python -def fewshot_description(self): - return "" -``` - -Format your document into a single query prompt __without the answer__ here. This method takes a single `doc` example (in dictionary form) . You should concatenate its members into a nicely formatted prompt. +Format your document into a single query prompt __without the answer__ here. This method takes a single `doc` example of type `dict` with `str` key-value members. You should concatenate these `doc` item values together into a neatly formatted prompt. ```python def doc_to_text(self, doc): @@ -161,11 +152,12 @@ After registering your task, you can now check on your data downloading and veri ```bash python -m scripts.write_out \ - --task \ --output_base_path \ + --tasks \ --sets \ --num_fewshot K \ - --num_examples N + --num_examples N \ + --description_dict_path ``` Open the file specified at the `--output_base_path ` and ensure it passes diff --git a/lm_eval/base.py b/lm_eval/base.py index 927ecb49f0e..0950315f231 100644 --- a/lm_eval/base.py +++ b/lm_eval/base.py @@ -1,6 +1,7 @@ import abc from typing import Iterable import numpy as np +import random import re import os import json @@ -10,7 +11,7 @@ import torch import torch.nn.functional as F -from lm_eval.metrics import mean, weighted_perplexity, weighted_mean +from lm_eval.metrics import mean, weighted_perplexity, weighted_mean, bits_per_byte from lm_eval import utils from abc import abstractmethod @@ -450,11 +451,43 @@ def higher_is_better(self): pass def fewshot_description(self): + import warnings + warnings.warn( + "`fewshot_description` will be removed in future versions. Pass " + "any custom descriptions to the `evaluate` function instead.", + DeprecationWarning) return "" - def fewshot_context(self, doc, num_fewshot, provide_description, rnd): - raw_description = self.fewshot_description() - description = (raw_description + "\n===\n\n") if provide_description and raw_description else "" + @utils.positional_deprecated + def fewshot_context(self, doc, num_fewshot, provide_description=None, rnd=None, description=None): + """ Returns a fewshot context string that is made up of a prepended description + (if provided), the `num_fewshot` number of examples, and an appended prompt example. + + :param doc: dict + The document as returned from training_docs, validation_docs, or test_docs. + :param num_fewshot: int + The number of fewshot examples to provide in the returned context string. + :param provide_description: bool + Not implemented, and this option is deprecated and will be removed in a future version in favor of a different description providing method + :param rnd: random.Random + The pseudo-random number generator used to randomly sample examples. + WARNING: This is currently a required arg although it's optionalized with a default `None`. + :param description: str + The task's description that will be prepended to the fewshot examples. + :returns: str + The fewshot context. + """ + assert rnd is not None, "A `random.Random` generator argument must be provided to `rnd`" + assert not provide_description, ( + "The `provide_description` arg will be removed in future versions. 
To prepend " + "a custom description to the context, supply the corresponding string via the " + "`description` arg." + ) + if provide_description is not None: + # nudge people to not specify it at all + print("WARNING: provide_description is deprecated and will be removed in a future version in favor of description_dict") + + description = description + "\n\n" if description else "" if num_fewshot == 0: labeled_examples = "" @@ -523,16 +556,22 @@ class PerplexityTask(Task, abc.ABC): def has_training_docs(self): return False - def fewshot_description(self): - return "" - def fewshot_examples(self, k, rnd): assert k == 0 return [] - def fewshot_context(self, doc, num_fewshot, provide_description, rnd): + def fewshot_context(self, doc, num_fewshot, provide_description=None, rnd=None, description=None): assert num_fewshot == 0 - assert not provide_description + assert rnd is not None, "A `random.Random` generator argument must be provided to `rnd`" + assert not provide_description, ( + "The `provide_description` arg will be removed in future versions. To prepend " + "a custom description to the context, supply the corresponding string via the " + "`description` arg." + ) + if provide_description is not None: + # nudge people to not specify it at all + print("WARNING: provide_description is deprecated and will be removed in a future version in favor of description_dict") + return "" def higher_is_better(self): @@ -560,14 +599,14 @@ def process_results(self, doc, results): return { "word_perplexity": (loglikelihood, words), "byte_perplexity": (loglikelihood, bytes_), - "bits_per_byte": (-loglikelihood, self.count_bytes(doc)) + "bits_per_byte": (loglikelihood, bytes_), } def aggregation(self): return { "word_perplexity": weighted_perplexity, "byte_perplexity": weighted_perplexity, - "bits_per_byte": weighted_mean + "bits_per_byte": bits_per_byte, } @classmethod diff --git a/lm_eval/evaluator.py b/lm_eval/evaluator.py index 4de59ad5c4a..087f2a13135 100644 --- a/lm_eval/evaluator.py +++ b/lm_eval/evaluator.py @@ -6,19 +6,23 @@ import lm_eval.tasks import lm_eval.base import numpy as np +from lm_eval.utils import positional_deprecated -def simple_evaluate(model, model_args, task_names, +@positional_deprecated +def simple_evaluate(model, model_args=None, tasks=[], num_fewshot=0, batch_size=None, device=None, - no_cache=False, limit=None, bootstrap_iters=100000): + no_cache=False, limit=None, bootstrap_iters=100000, + description_dict=None): """Instantiate and evaluate a model on a list of tasks. - :param model: str - Name of model, see lm_eval.models.get_model - :param model_args: str - String arguments for each model class, see LM.create_from_arg_string - :param task_names: list[str] - List of task names + :param model: Union[str, LM] + Name of model or LM object, see lm_eval.models.get_model + :param model_args: Optional[str] + String arguments for each model class, see LM.create_from_arg_string. + Ignored if `model` argument is a LM object. + :param tasks: list[Union[str, Task]] + List of task names or Task objects. Task objects will be taken to have name task.EVAL_HARNESS_NAME if defined and type(task).__name__ otherwise. 
:param num_fewshot: int Number of examples in few-shot context :param batch_size: int, optional @@ -31,23 +35,39 @@ def simple_evaluate(model, model_args, task_names, Limit the number of examples per task (only use this for testing) :param bootstrap_iters: Number of iterations for bootstrap statistics + :param description_dict: dict[str, str] + Dictionary of custom task descriptions of the form: `task_name: description` :return Dictionary of results """ random.seed(1234) np.random.seed(1234) - lm = lm_eval.models.get_model(model).create_from_arg_string(model_args, { - 'batch_size': batch_size, 'device': device - }) + assert tasks != [], "No tasks specified" + + if isinstance(model, str): + if model_args is None: model_args = "" + lm = lm_eval.models.get_model(model).create_from_arg_string(model_args, { + 'batch_size': batch_size, 'device': device + }) + else: + assert isinstance(model, lm_eval.base.LM) + lm = model if not no_cache: lm = lm_eval.base.CachingLM( lm, 'lm_cache/' + model + '_' + model_args.replace('=', '-').replace(',', '_').replace('/', '-') + '.db' ) - task_dict = lm_eval.tasks.get_task_dict(task_names) - results = evaluate(lm, task_dict, False, num_fewshot, limit) + task_dict = lm_eval.tasks.get_task_dict(tasks) + + results = evaluate( + lm=lm, + task_dict=task_dict, + num_fewshot=num_fewshot, + limit=limit, + description_dict=description_dict + ) # add info about the model and few shot config results["config"] = { @@ -58,19 +78,21 @@ def simple_evaluate(model, model_args, task_names, "device": device, "no_cache": no_cache, "limit": limit, - "bootstrap_iters": bootstrap_iters + "bootstrap_iters": bootstrap_iters, + "description_dict": description_dict } return results -def evaluate(lm, task_dict, provide_description, num_fewshot, limit, bootstrap_iters=100000): +@positional_deprecated +def evaluate(lm, task_dict, provide_description=None, num_fewshot=0, limit=None, bootstrap_iters=100000, description_dict=None): """Instantiate and evaluate a model on a list of tasks. :param lm: obj Language Model :param task_dict: dict[str, Task] - Dictionary of tasks + Dictionary of tasks. Tasks will be taken to have name task.EVAL_HARNESS_NAME if defined and type(task).__name__ otherwise. :param provide_description: bool Not implemented, and this option is deprecated and will be removed in a future version in favor of a different description providing method :param num_fewshot: int @@ -79,6 +101,8 @@ def evaluate(lm, task_dict, provide_description, num_fewshot, limit, bootstrap_i Limit the number of examples per task (only use this for testing) :param bootstrap_iters: Number of iterations for bootstrap statistics + :param description_dict: dict[str, str] + Dictionary of custom task descriptions of the form: `task_name: description` :return Dictionary of results """ @@ -86,6 +110,9 @@ def evaluate(lm, task_dict, provide_description, num_fewshot, limit, bootstrap_i # TODO: todo: implement proper description-providing system assert not provide_description # not implemented. 
+ if provide_description is not None: + # nudge people to not specify it at all + print("WARNING: provide_description is deprecated and will be removed in a future version in favor of description_dict") task_dict_items = [ (name, task) @@ -125,16 +152,16 @@ def evaluate(lm, task_dict, provide_description, num_fewshot, limit, bootstrap_i rnd.seed(42) rnd.shuffle(task_docs) + description = description_dict[task_name] if description_dict and task_name in description_dict else "" + for doc_id, doc in enumerate(itertools.islice(task_docs, 0, limit)): docs[(task_name, doc_id)] = doc - ctx = task.fewshot_context( doc=doc, - provide_description=provide_description, num_fewshot=num_fewshot, - rnd=rnd + rnd=rnd, + description=description ) - reqs = task.construct_requests(doc, ctx) if not isinstance(reqs, (list, tuple)): reqs = [reqs] diff --git a/lm_eval/metrics.py b/lm_eval/metrics.py index c95d4cd61c3..9029ac08ce6 100644 --- a/lm_eval/metrics.py +++ b/lm_eval/metrics.py @@ -52,13 +52,14 @@ def acc_all(items): docs = list(zip(*items))[1] for doc, pred in zip(docs, preds): + paragraph_id = doc["idx"]["paragraph"] question_id = doc["idx"]["question"] - if question_id not in question_scoring_dict: - question_scoring_dict[question_id] = [] + if (paragraph_id, question_id) not in question_scoring_dict: + question_scoring_dict[(paragraph_id, question_id)] = [] gold_label = doc["label"] == 1 - question_scoring_dict[question_id].append(gold_label == pred) + question_scoring_dict[(paragraph_id, question_id)].append(gold_label == pred) acc = np.mean([int(all(x)) for x in question_scoring_dict.values()]) return acc @@ -102,6 +103,9 @@ def weighted_mean(items): def weighted_perplexity(items): return math.exp(-weighted_mean(items)) +def bits_per_byte(items): + return -weighted_mean(items) / math.log(2) + def bleu(items): """The Bilingual Evaluation Understudy Score, or BLEU for short, is a metric diff --git a/lm_eval/models/__init__.py b/lm_eval/models/__init__.py index a12f68a513a..9ffd1ceffb8 100644 --- a/lm_eval/models/__init__.py +++ b/lm_eval/models/__init__.py @@ -1,12 +1,16 @@ from . import gpt2 from . import gpt3 from . import dummy +from . import xglm +from . 
import bigscience MODEL_REGISTRY = { "hf": gpt2.HFLM, "gpt2": gpt2.GPT2LM, "gpt3": gpt3.GPT3LM, "dummy": dummy.DummyLM, + "XGLM": xglm.XGLM, + "bigscience":bigscience.BigScience, } diff --git a/lm_eval/models/bigscience.py b/lm_eval/models/bigscience.py new file mode 100644 index 00000000000..e54b0627164 --- /dev/null +++ b/lm_eval/models/bigscience.py @@ -0,0 +1,84 @@ +import transformers +import torch +from lm_eval.base import BaseLM +# +# +# +# ​ +class BigScience(BaseLM): + + def __init__(self, device='cuda', pretrained='bigscience/tr5b-1B3-multilingual-alpha-checkpoints', revision='global_step118500', subfolder=None, tokenizer=None, batch_size=1): + super().__init__() + + assert isinstance(device, str) + assert isinstance(pretrained, str) + assert isinstance(batch_size, int) + + if device: + self._device = torch.device(device) + else: + self._device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu') + # TODO: update this to be less of a hack once subfolder is fixed in HF + self.bigscience = transformers.AutoModelForCausalLM.from_pretrained( + pretrained, revision=revision + ).to(self.device) + self.bigscience.eval() + # pretrained tokenizer for neo is broken for now so just hard-coding this to gpt2 + self.tokenizer = transformers.AutoTokenizer.from_pretrained( + pretrained if tokenizer is None else tokenizer, revision=revision, subfolder=subfolder) + # assert isinstance(self.tokenizer, ( + # transformers.GPT2Tokenizer, transformers.GPT2TokenizerFast, + # transformers.T5Tokenizer, transformers.T5TokenizerFast, + # )), "this tokenizer has not been checked for compatibility yet!" + self.vocab_size = self.tokenizer.vocab_size + # if isinstance(self.tokenizer, (transformers.GPT2Tokenizer, transformers.GPT2TokenizerFast)): + # assert self.tokenizer.encode('hello\n\nhello') == [31373, 198, 198, 31373], \ + # self.tokenizer.encode('hello\n\nhello') + # multithreading and batching + self.batch_size_per_gpu = batch_size # todo: adaptive batch size + # TODO: fix multi-gpu + # gpus = torch.cuda.device_count() + # if gpus > 1: + # self.gpt2 = nn.DataParallel(self.gpt2) + @property + def eot_token_id(self): + # we use EOT because end of *text* is more accurate for what we're doing than end of *sentence* + return self.tokenizer.eos_token_id + @property + def max_length(self): + try: + return self.bigscience.config.n_ctx + except AttributeError: + # gptneoconfig doesn't have n_ctx apparently + return self.bigscience.config.max_position_embeddings + @property + def max_gen_toks(self): + return 256 + @property + def batch_size(self): + # TODO: fix multi-gpu + return self.batch_size_per_gpu # * gpus + @property + def device(self): + # TODO: fix multi-gpu + return self._device + def tok_encode(self, string: str): + return self.tokenizer.encode(string, add_special_tokens=False) + def tok_decode(self, tokens): + return self.tokenizer.decode(tokens) + def _model_call(self, inps): + """ + inps: a torch tensor of shape [batch, sequence] + the size of sequence may vary from call to call + returns: a torch tensor of shape [batch, sequence, vocab] with the + logits returned from the model + """ + with torch.no_grad(): + return self.bigscience(inps)[0][:, :, :130000] + def _model_generate(self, context, max_length, eos_token_id): + result = self.bigscience.generate( + context, + max_length=max_length, + eos_token_id=eos_token_id, + do_sample=False) + return result diff --git a/lm_eval/models/xglm.py b/lm_eval/models/xglm.py new file mode 100644 index 00000000000..3aac8f9ead2 --- 
/dev/null +++ b/lm_eval/models/xglm.py @@ -0,0 +1,98 @@ +import transformers +import torch +from lm_eval import utils +from lm_eval.base import BaseLM +from tqdm import tqdm + + +class XGLM(BaseLM): + def __init__(self, device='cuda', pretrained='facebook/xglm-1.7B', revision='main', subfolder=None, tokenizer=None, batch_size=1): + super().__init__() + assert isinstance(device, str) + assert isinstance(pretrained, str) + assert isinstance(batch_size, int) + if device: + self._device = torch.device(device) + else: + self._device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu') + # TODO: update this to be less of a hack once subfolder is fixed in HF + self.xglm = transformers.AutoModelForCausalLM.from_pretrained( + pretrained, + # cache_dir="/users/zyong2/data/zyong2/huggingface/xglm" + ).to(self.device) + print(f"🤖 Loading model {pretrained}") + self.xglm.eval() + # pretrained tokenizer for neo is broken for now so just hard-coding this to gpt2 + self.tokenizer = transformers.AutoTokenizer.from_pretrained( + pretrained if tokenizer is None else tokenizer, revision=revision, subfolder=subfolder) + + # assert isinstance(self.tokenizer, ( + # transformers.GPT2Tokenizer, transformers.GPT2TokenizerFast, + # transformers.T5Tokenizer, transformers.T5TokenizerFast, + # )), "this tokenizer has not been checked for compatibility yet!" + self.vocab_size = self.tokenizer.vocab_size + # if isinstance(self.tokenizer, (transformers.GPT2Tokenizer, transformers.GPT2TokenizerFast)): + # assert self.tokenizer.encode('hello\n\nhello') == [31373, 198, 198, 31373], \ + # self.tokenizer.encode('hello\n\nhello') + # multithreading and batching + self.batch_size_per_gpu = batch_size # todo: adaptive batch size + # TODO: fix multi-gpu + # gpus = torch.cuda.device_count() + # if gpus > 1: + # self.gpt2 = nn.DataParallel(self.gpt2) + @property + def eot_token_id(self): + # we use EOT because end of *text* is more accurate for what we're doing than end of *sentence* + return self.tokenizer.eos_token_id + @property + def max_length(self): + try: + return self.xglm.config.n_ctx + except AttributeError: + # gptneoconfig doesn't have n_ctx apparently + return self.xglm.config.max_position_embeddings + @property + def max_gen_toks(self): + return 256 + @property + def batch_size(self): + # TODO: fix multi-gpu + return self.batch_size_per_gpu # * gpus + @property + def device(self): + # TODO: fix multi-gpu + return self._device + def tok_encode(self, string: str): + # HACK: to overcome problem of XGLM tokenizer removing new lines + # we replace newline with SEP token + # WARNING: Since typical SEP token == EOS token + # Generation will stop after the first appearance of SEP token prevnting XGLM from + # outputting Multi line generations + string = string.replace("\n", self.tokenizer.sep_token) + return self.tokenizer.encode(string, add_special_tokens=False) + + def tok_decode(self, tokens): + # HACK: to overcome problem of XGLM tokenizer removing new lines + # replace back the generated sep_tokens with newlines + output = self.tokenizer.decode(tokens) + output = output.replace(self.tokenizer.sep_token, "\n") + print(output) + return output + def _model_call(self, inps): + """ + inps: a torch tensor of shape [batch, sequence] + the size of sequence may vary from call to call + returns: a torch tensor of shape [batch, sequence, vocab] with the + logits returned from the model + """ + with torch.no_grad(): + return self.xglm(inps)[0][:, :, :256008] + + def _model_generate(self, context, max_length, 
eos_token_id): + result = self.xglm.generate( + context, + max_length=max_length, + eos_token_id=eos_token_id, + do_sample=False + ) + return result diff --git a/lm_eval/tasks/__init__.py b/lm_eval/tasks/__init__.py index 53d7e88f16c..4221b811597 100644 --- a/lm_eval/tasks/__init__.py +++ b/lm_eval/tasks/__init__.py @@ -1,6 +1,8 @@ from pprint import pprint +from typing import List, Union import sacrebleu +import lm_eval.base from . import superglue from . import glue @@ -45,6 +47,7 @@ from . import mutual from . import truthfulqa from . import blimp +from . import asdiv ######################################## # Translation tasks @@ -133,7 +136,9 @@ "squad2": squad.SQuAD2, "race": race.RACE, # "naturalqs": naturalqs.NaturalQs, # not implemented yet - "headqa": headqa.HeadQA, + "headqa": headqa.HeadQAEsDeprecated, # for backwards compat - headqa used to default to es + "headqa_es": headqa.HeadQAEs, + "headqa_en": headqa.HeadQAEn, "mathqa": mathqa.MathQA, "webqs": webqs.WebQs, "wsc273": wsc273.WinogradSchemaChallenge273, @@ -164,6 +169,7 @@ "math_num_theory": hendrycks_math.MathNumberTheory, "math_prealgebra": hendrycks_math.MathPrealgebra, "math_precalc": hendrycks_math.MathPrecalculus, + "math_asdiv": asdiv.Asdiv, # arithmetic "arithmetic_2da": arithmetic.Arithmetic2DPlus, @@ -301,8 +307,23 @@ def get_task(task_name): raise KeyError(f"Missing task {task_name}") -def get_task_dict(task_name_list): - return { +def get_task_name_from_object(task_object): + for name, class_ in TASK_REGISTRY.items(): + if class_ is task_object: + return name + + # this gives a mechanism for non-registered tasks to have a custom name anyways when reporting + return task_object.EVAL_HARNESS_NAME if hasattr(task_object, "EVAL_HARNESS_NAME") else type(task_object).__name__ + + +def get_task_dict(task_name_list: List[Union[str, lm_eval.base.Task]]): + task_name_dict = { task_name: get_task(task_name)() - for task_name in task_name_list + for task_name in task_name_list if isinstance(task_name, str) + } + task_name_from_object_dict = { + get_task_name_from_object(task_object): task_object + for task_object in task_name_list if not isinstance(task_object, str) } + assert set(task_name_dict.keys()).isdisjoint(set(task_name_from_object_dict.keys())) + return {**task_name_dict, **task_name_from_object_dict} diff --git a/lm_eval/tasks/anli.py b/lm_eval/tasks/anli.py index 1304c5da2bc..13c4044560e 100644 --- a/lm_eval/tasks/anli.py +++ b/lm_eval/tasks/anli.py @@ -33,10 +33,6 @@ def test_docs(self): if self.has_test_docs(): return self.data["test_r" + str(self.SPLIT)] - def fewshot_description(self): - # TODO: figure out description - return "" - def doc_to_text(self, doc): # OA does this a bit weirdly: they prepend "anli 1: anli 1: " to the beginning # of the prompt (yes, repeating it!). also, " True, False, or Neither?" 
is directly diff --git a/lm_eval/tasks/arc.py b/lm_eval/tasks/arc.py index a0d13abc59d..2a8a9998429 100644 --- a/lm_eval/tasks/arc.py +++ b/lm_eval/tasks/arc.py @@ -29,10 +29,6 @@ def _convert_standard(self, doc): } return out_doc - def fewshot_description(self): - # TODO: figure out description - return "" - def doc_to_text(self, doc): return doc["query"] diff --git a/lm_eval/tasks/arithmetic.py b/lm_eval/tasks/arithmetic.py index 147b66a1754..b3256b5c874 100644 --- a/lm_eval/tasks/arithmetic.py +++ b/lm_eval/tasks/arithmetic.py @@ -21,7 +21,7 @@ def download(self): url = 'https://raw.githubusercontent.com/openai/gpt-3/master/data/' + file_name if not os.path.exists(self.directory): os.makedirs(self.directory) - download_file(url, self.directory+file_name, checksum) + download_file(url, local_file=self.directory+file_name, expected_checksum=checksum) self.set_docs() @abc.abstractmethod diff --git a/lm_eval/tasks/asdiv.py b/lm_eval/tasks/asdiv.py new file mode 100644 index 00000000000..58bcdcd250e --- /dev/null +++ b/lm_eval/tasks/asdiv.py @@ -0,0 +1,121 @@ +""" +ASDiv: A Diverse Corpus for Evaluating and Developing English Math Word Problem Solvers +https://arxiv.org/abs/2106.15772 + +@misc{miao2021diverse, + title={A Diverse Corpus for Evaluating and Developing English Math Word Problem Solvers}, + author={Shen-Yun Miao and Chao-Chun Liang and Keh-Yih Su}, + year={2021}, + eprint={2106.15772}, + archivePrefix={arXiv}, + primaryClass={cs.AI} +} +""" +from lm_eval.base import Task +from pathlib import Path +from best_download import download_file +import xml.etree.ElementTree as ET +from lm_eval.base import rf +from lm_eval.metrics import mean,perplexity +import numpy as np +from zipfile import ZipFile +import os + +#currently ignoring formula for answer generation + +# given a subset, splits return the docs +class Asdiv(Task): + VERSION = 0 + DATASET_PATH = Path("data/asdiv") + + def download(self): + if self.DATASET_PATH.exists(): + return + Path.mkdir(self.DATASET_PATH, parents=True) + url = "https://github.com/chaochun/nlu-asdiv-dataset/archive/55790e5270bb91ccfa5053194b25732534696b50.zip" + checksum = "8f1fe4f6d5f170ec1e24ab78c244153c14c568b1bb2b1dad0324e71f37939a2d" + zip_path = self.DATASET_PATH / "55790e5270bb91ccfa5053194b25732534696b50.zip" + download_file(url, local_file=str(zip_path), expected_checksum=checksum) + with ZipFile(zip_path, "r") as zip: + zip.extractall(self.DATASET_PATH) + os.remove(zip_path) + + def _convert_standard(self, problem): + #TODO: include solution-type and formula + out_doc = { + "question" : problem.find('Question').text, + "body" : problem.find('Body').text, + "answer": problem.find('Answer').text + } + return out_doc + + def load_docs(self, textfilename, tfds=False): + tree = ET.parse(textfilename) + root = tree.getroot() + for pid, problem in enumerate(root.iter('Problem')): + out_doc = self._convert_standard(problem) + yield out_doc + + def has_training_docs(self): + return False + + def has_validation_docs(self): + return True + + def has_test_docs(self): + return False + + def training_docs(self): + raise NotImplementedError("This dataset has no training docs") + + def test_docs(self): + raise NotImplementedError("This dataset has no test docs") + + def validation_docs(self): + data_xml_path = self.DATASET_PATH / "nlu-asdiv-dataset-55790e5270bb91ccfa5053194b25732534696b50/dataset/ASDiv.xml" + return self.load_docs(data_xml_path) + + def fewshot_context(self, doc, num_fewshot, provide_description=None, rnd=None, description=None): + assert 
num_fewshot == 0, "ASDiv is intended only for the zero-shot setting." + return super().fewshot_context( + doc=doc, + num_fewshot=num_fewshot, + rnd=rnd, + description=description + ) + + def fewshot_description(self): + # TODO: add solution-type and formula + desc = "information containing the context of the question\nQuestion: Text of a question.\nAnswer: Answer to the question, based on the passage.\n" + return desc + + def doc_to_text(self, doc): + # TODO: add solution-type + return doc['body'] + '\n' + 'Question:' + doc['question'] + '\n' + 'Answer:' + + def doc_to_target(self, doc): + # TODO: add formula + + answer = doc['answer'].split(' (')[0] + return " " + answer + + def construct_requests(self, doc, ctx): + ll, is_greedy = rf.loglikelihood(ctx, self.doc_to_target(doc)) + return ll, is_greedy + + def process_results(self, doc, results): + ll, is_greedy = results + + return { + 'acc': int(is_greedy) + } + + def aggregation(self): + return { + 'acc': mean + } + + def higher_is_better(self): + return { + 'acc': True + } diff --git a/lm_eval/tasks/blimp.py b/lm_eval/tasks/blimp.py index e8e7bd9f2be..8a52d888caa 100644 --- a/lm_eval/tasks/blimp.py +++ b/lm_eval/tasks/blimp.py @@ -29,9 +29,18 @@ def download(self): self.data["validation"] = self.data["train"] del self.data["train"] - def fewshot_context(self, doc, num_fewshot, provide_description, rnd): + def fewshot_context(self, doc, num_fewshot, provide_description=None, rnd=None, description=None): assert num_fewshot == 0 - assert not provide_description + assert rnd is not None, "A `random.Random` generator argument must be provided to `rnd`" + assert not provide_description, ( + "The `provide_description` arg will be removed in future versions. To prepend " + "a custom description to the context, supply the corresponding string via the " + "`description` arg." + ) + if provide_description is not None: + # nudge people to not specify it at all + print("WARNING: provide_description is deprecated and will be removed in a future version in favor of description_dict") + return "" def doc_to_text(self, doc): diff --git a/lm_eval/tasks/cbt.py b/lm_eval/tasks/cbt.py index 8837caff6dc..e239a630b40 100644 --- a/lm_eval/tasks/cbt.py +++ b/lm_eval/tasks/cbt.py @@ -17,10 +17,6 @@ class CBTBase(HFTask): VERSION = 0 - def fewshot_description(self): - # TODO: Figure out description. 
- return "" - def detokenize(self, text): text = text.replace(" '", "'") text = text.replace(" \n", "\n") diff --git a/lm_eval/tasks/coqa.py b/lm_eval/tasks/coqa.py index beba53a6630..128ac8f8d5d 100644 --- a/lm_eval/tasks/coqa.py +++ b/lm_eval/tasks/coqa.py @@ -16,8 +16,8 @@ def download(self): sh ("""mkdir -p data/coqa""") - download_file("http://downloads.cs.stanford.edu/nlp/data/coqa/coqa-train-v1.0.json", coqa_train_filepath, "b0fdb2bc1bd38dd3ca2ce5fa2ac3e02c6288ac914f241ac409a655ffb6619fa6") - download_file("http://downloads.cs.stanford.edu/nlp/data/coqa/coqa-dev-v1.0.json", coqa_dev_filepath, "dfa367a9733ce53222918d0231d9b3bedc2b8ee831a2845f62dfc70701f2540a") + download_file("http://downloads.cs.stanford.edu/nlp/data/coqa/coqa-train-v1.0.json", local_file=coqa_train_filepath, expected_checksum="b0fdb2bc1bd38dd3ca2ce5fa2ac3e02c6288ac914f241ac409a655ffb6619fa6") + download_file("http://downloads.cs.stanford.edu/nlp/data/coqa/coqa-dev-v1.0.json", local_file=coqa_dev_filepath, expected_checksum="dfa367a9733ce53222918d0231d9b3bedc2b8ee831a2845f62dfc70701f2540a") def has_training_docs(self): return True @@ -36,10 +36,7 @@ def validation_docs(self): def test_docs(self): pass - - def fewshot_description(self): - return "Given a passage and a conversation so far, answer the next question in the conversation." - + def doc_to_text(self, doc): # Given a passage p, the conversation history {q1, a1, . . . qi−1, ai−1} # and a question qi, the task is to predict the answer ai diff --git a/lm_eval/tasks/drop.py b/lm_eval/tasks/drop.py index 97d10983274..1b896f3cdee 100644 --- a/lm_eval/tasks/drop.py +++ b/lm_eval/tasks/drop.py @@ -27,7 +27,7 @@ def download(self): url = "https://s3-us-west-2.amazonaws.com/allennlp/datasets/drop/drop_dataset.zip" checksum = "39d2278a29fd729de301b111a45f434c24834f40df8f4ff116d864589e3249d6" zip_path = self.DATASET_PATH / "drop_dataset.zip" - download_file(url, str(zip_path), checksum) + download_file(url, local_file=str(zip_path), expected_checksum=checksum) with ZipFile(zip_path, "r") as zip: zip.extractall(self.DATASET_PATH) @@ -40,10 +40,6 @@ def has_validation_docs(self): def has_test_docs(self): return False - def fewshot_description(self): - # TODO: figure out description - return "" - def _load_docs(self, docs): for doc in docs: for qa in doc["qa_pairs"]: diff --git a/lm_eval/tasks/glue.py b/lm_eval/tasks/glue.py index dad629c02a7..80a77310a16 100644 --- a/lm_eval/tasks/glue.py +++ b/lm_eval/tasks/glue.py @@ -21,10 +21,6 @@ def has_validation_docs(self): def has_test_docs(self): return False - def fewshot_description(self): - # TODO - return "" - def doc_to_text(self, doc): return "{}\nQuestion: Does this sentence make sense?\nAnswer:".format(doc["sentence"]) @@ -69,9 +65,6 @@ def has_validation_docs(self): def has_test_docs(self): return False - def fewshot_description(self): - return "Indicate if the sentiment of each sentence is positive or negative." - def doc_to_text(self, doc): return "{}\nQuestion: Is this sentence positive or negative?\nAnswer:".format( general_detokenize(doc["sentence"]), @@ -341,9 +334,6 @@ def has_validation_docs(self): def has_test_docs(self): return False - def fewshot_description(self): - return "Indicate if both sentences mean the same thing." 
- def doc_to_text(self, doc): return "Sentence 1: {}\nSentence 2: {}\nQuestion: Do both sentences mean the same thing?\nAnswer:".format( general_detokenize(doc["sentence1"]), @@ -394,9 +384,6 @@ def has_validation_docs(self): def has_test_docs(self): return False - def fewshot_description(self): - return "Indicate if both questions ask the same thing." - def doc_to_text(self, doc): return "Question 1: {}\nQuestion 2: {}\nQuestion: Do both questions ask the same thing?\nAnswer:".format( doc["question1"], @@ -447,10 +434,6 @@ def has_validation_docs(self): def has_test_docs(self): return True - def fewshot_description(self): - return "Indicate if both sentences mean the same thing from a scale of 0-5, " \ - "where 5 means identical and 0 means unrelated." - def doc_to_text(self, doc): return "sentence 1: {}\nsentence 2: {}\nAnswer:".format( doc["sentence1"], diff --git a/lm_eval/tasks/headqa.py b/lm_eval/tasks/headqa.py index 3c66dc064b9..d9ac2d87c13 100644 --- a/lm_eval/tasks/headqa.py +++ b/lm_eval/tasks/headqa.py @@ -2,10 +2,9 @@ from lm_eval.base import MultipleChoiceTask -class HeadQA(HFTask, MultipleChoiceTask): +class HeadQABase(HFTask, MultipleChoiceTask): VERSION = 0 DATASET_PATH = "head_qa" - DATASET_NAME = None def has_training_docs(self): return True @@ -25,9 +24,19 @@ def _convert_standard(self, doc): } return out_doc - def fewshot_description(self): - # TODO: figure out description - return "" - def doc_to_text(self, doc): return doc["query"] + +class HeadQAEn(HeadQABase): + DATASET_NAME = "en" + +class HeadQAEs(HeadQABase): + DATASET_NAME = "es" + +# for backwards compatibility +class HeadQAEsDeprecated(HeadQABase): + DATASET_NAME = "es" + + def __init__(self): + super().__init__() + print("WARNING: headqa is deprecated. Please use headqa_es or headqa_en instead. See https://github.com/EleutherAI/lm-evaluation-harness/pull/240 for more info.") \ No newline at end of file diff --git a/lm_eval/tasks/hellaswag.py b/lm_eval/tasks/hellaswag.py index 762ce473377..56450cf3e6e 100644 --- a/lm_eval/tasks/hellaswag.py +++ b/lm_eval/tasks/hellaswag.py @@ -35,10 +35,5 @@ def _convert_standard(self, doc): } return out_doc - def fewshot_description(self): - return "Label for the relevant action: Sentences describing the " \ - "context, with an incomplete sentence trailing\nanswer that " \ - "plausibly completes the situation." 
- def doc_to_text(self, doc): return doc["query"] diff --git a/lm_eval/tasks/hendrycks_ethics.py b/lm_eval/tasks/hendrycks_ethics.py index 50e94a508cf..d12c0064cf7 100644 --- a/lm_eval/tasks/hendrycks_ethics.py +++ b/lm_eval/tasks/hendrycks_ethics.py @@ -20,7 +20,7 @@ class Ethics(Task): def download(self): if not os.path.exists('data/ethics/done'): sh("mkdir -p data") - download_file("https://people.eecs.berkeley.edu/~hendrycks/ethics.tar", "data/ethics.tar", "40acbf1ac0da79a2aabef394d58889136b8d38b05be09482006de2453fb06333") + download_file("https://people.eecs.berkeley.edu/~hendrycks/ethics.tar", local_file="data/ethics.tar", expected_checksum="40acbf1ac0da79a2aabef394d58889136b8d38b05be09482006de2453fb06333") sh(""" tar -xf data/ethics.tar -C data/ rm data/ethics.tar @@ -237,9 +237,6 @@ def process_doc(self, docs): for doc in docs: yield {"activity": doc[0], "baseline": doc[1], "rating": ""} - def fewshot_description(self): - return "Rate how pleasant each of the following activities is on a scale from 1 (very unpleasant) to 10 (very pleasant).\n\n" - def fewshot_examples(self, k, rnd): # Overwriting fewshot examples as k can be max 5 assert k <= 5, "There are only 5 possible shots for this task. Refer to the V2 for more." @@ -350,9 +347,6 @@ class EthicsVirtue(Ethics): def get_prefix(self): return "virtue/virtue" - def fewshot_description(self): - return "The following is a list of sentences and traits, along with whether the trait is exhibited in that sentence.\n\n" - def process_doc(self, doc): # Append identifiers before shuffling to calculate exact matches lateron & skip the first element of headers return [x + [i] for i, x in enumerate(doc[1:])] diff --git a/lm_eval/tasks/hendrycks_math.py b/lm_eval/tasks/hendrycks_math.py index 379e727d617..657969948c8 100644 --- a/lm_eval/tasks/hendrycks_math.py +++ b/lm_eval/tasks/hendrycks_math.py @@ -18,7 +18,7 @@ class Math(Task): def download(self): if not (self.DATASET_PATH / 'test').exists() or not (self.DATASET_PATH / 'done').exists(): sh(f"mkdir -p {self.DATASET_PATH}") - download_file("https://people.eecs.berkeley.edu/~hendrycks/MATH.tar", f"{self.DATASET_PATH}.tar", "01256fd7cd5430596fdf07e6e6a5827111b5235b7ffed679c662a12f898932da") + download_file("https://people.eecs.berkeley.edu/~hendrycks/MATH.tar", local_file=f"{self.DATASET_PATH}.tar", expected_checksum="01256fd7cd5430596fdf07e6e6a5827111b5235b7ffed679c662a12f898932da") sh(f""" tar -xf {self.DATASET_PATH}.tar -C data/ && touch {self.DATASET_PATH / 'done'} rm {self.DATASET_PATH}.tar @@ -55,9 +55,6 @@ def validation_docs(self): def test_docs(self): return self._load_docs(self.DATASET_PATH / "test" / self.get_file_info()) - def fewshot_description(self): - return "Given a mathematics problem, determine the answer. Simplify your answer as much as possible." 
- def doc_to_text(self, doc): return "Problem: " + doc["problem"] + "\nAnswer:" diff --git a/lm_eval/tasks/hendrycks_test.py b/lm_eval/tasks/hendrycks_test.py index 46c0306fcd0..aa45d608f5e 100644 --- a/lm_eval/tasks/hendrycks_test.py +++ b/lm_eval/tasks/hendrycks_test.py @@ -45,7 +45,7 @@ def __init__(self, subject): def download(self): if not (self.DATASET_PATH / 'done').exists(): sh("mkdir -p data") - download_file("https://people.eecs.berkeley.edu/~hendrycks/data.tar", "data/data.tar", "78a804365a59028188fb19bd1adcadc5e0c260b220a9d8b2e33a5ea7d5fbe3b4") + download_file("https://people.eecs.berkeley.edu/~hendrycks/data.tar", local_file="data/data.tar", expected_checksum="78a804365a59028188fb19bd1adcadc5e0c260b220a9d8b2e33a5ea7d5fbe3b4") sh(""" tar -xf data/data.tar -C data/ rm data/data.tar @@ -114,9 +114,5 @@ def fewshot_examples(self, k, rnd): return rnd.sample(list(self._fewshot_docs), k) - def fewshot_description(self): - subject = self.subject.replace("_", " ") - return f"The following are multiple choice questions (with answers) about {subject}." - def doc_to_text(self, doc): return doc["query"] diff --git a/lm_eval/tasks/lambada.py b/lm_eval/tasks/lambada.py index bcb4ae019c4..300445c6383 100644 --- a/lm_eval/tasks/lambada.py +++ b/lm_eval/tasks/lambada.py @@ -14,8 +14,8 @@ def download(self): if not os.path.exists("data/lambada/lambada_test.jsonl"): download_file( "http://eaidata.bmk.sh/data/lambada_test.jsonl", - "data/lambada/lambada_test.jsonl", - "4aa8d02cd17c719165fc8a7887fddd641f43fcafa4b1c806ca8abc31fabdb226" + local_file="data/lambada/lambada_test.jsonl", + expected_checksum="4aa8d02cd17c719165fc8a7887fddd641f43fcafa4b1c806ca8abc31fabdb226" ) except: # fallback - for some reason best_download doesnt work all the time here @@ -47,10 +47,6 @@ def doc_to_text(self, doc): def doc_to_target(self, doc): return " " + doc['text'].rsplit(' ', 1)[1] - - def fewshot_description(self): - # TODO: figure out description - return "" def construct_requests(self, doc, ctx): ll, is_greedy = rf.loglikelihood(ctx, self.doc_to_target(doc)) diff --git a/lm_eval/tasks/lambada_cloze.py b/lm_eval/tasks/lambada_cloze.py index 90bd4f10cac..dc1d4b168b8 100644 --- a/lm_eval/tasks/lambada_cloze.py +++ b/lm_eval/tasks/lambada_cloze.py @@ -13,6 +13,3 @@ def doc_to_text(self, doc): def doc_to_target(self, doc): return " " + doc['text'].rsplit(' ', 1)[1] - - def fewshot_description(self): - return "Fill in blank:\n" diff --git a/lm_eval/tasks/lambada_multilingual.py b/lm_eval/tasks/lambada_multilingual.py index dd6da10befa..7123ecf01ad 100644 --- a/lm_eval/tasks/lambada_multilingual.py +++ b/lm_eval/tasks/lambada_multilingual.py @@ -32,8 +32,8 @@ def download(self): if not os.path.exists(f): download_file( url, - f, - CHECKSUMS[self.LANG] + local_file=f, + expected_checksum=CHECKSUMS[self.LANG] ) except: # fallback - for some reason best_download doesnt work all the time here diff --git a/lm_eval/tasks/logiqa.py b/lm_eval/tasks/logiqa.py index e403623beba..6341b4a32e2 100644 --- a/lm_eval/tasks/logiqa.py +++ b/lm_eval/tasks/logiqa.py @@ -19,7 +19,7 @@ def download(self): ] for split in splits: file = self.DATASET_PATH / f"{split['name']}.txt" - download_file(f"{base_url}/{split['name']}.txt", str(file), split["checksum"]) + download_file(f"{base_url}/{split['name']}.txt", local_file=str(file), expected_checksum=split["checksum"]) def has_training_docs(self): return True @@ -80,9 +80,5 @@ def validation_docs(self): def test_docs(self): return self._load_docs(self.DATASET_PATH / "Test.txt") - def 
fewshot_description(self): - # TODO: figure out actual description - return "" - def doc_to_text(self, doc): return doc["query"] diff --git a/lm_eval/tasks/mathqa.py b/lm_eval/tasks/mathqa.py index 84e5ab9eca5..a02a5b59bb6 100644 --- a/lm_eval/tasks/mathqa.py +++ b/lm_eval/tasks/mathqa.py @@ -29,9 +29,5 @@ def _convert_standard(self, doc): } return out_doc - def fewshot_description(self): - # TODO: figure out description - return "" - def doc_to_text(self, doc): return doc["query"] diff --git a/lm_eval/tasks/mc_taco.py b/lm_eval/tasks/mc_taco.py index c9b2dd91fca..64a36a01f77 100644 --- a/lm_eval/tasks/mc_taco.py +++ b/lm_eval/tasks/mc_taco.py @@ -39,9 +39,6 @@ def has_validation_docs(self): def has_test_docs(self): return True - def fewshot_description(self): - return "Determine whether the candidate answer is plausible (\"yes\") or not (\"no\")" - def doc_to_text(self, doc): return f"{doc['sentence']}\nQuestion: {doc['question']}\n"\ f"Answer: {doc['answer']}\nPlausible:" diff --git a/lm_eval/tasks/mutual.py b/lm_eval/tasks/mutual.py index 17274a46fd9..99c1508bc5d 100644 --- a/lm_eval/tasks/mutual.py +++ b/lm_eval/tasks/mutual.py @@ -36,8 +36,8 @@ def download(self): master_zip = Path("data/master.zip") download_file( "https://github.com/Nealcly/MuTual/archive/master.zip", - str(master_zip), - "bb325cf6c672f0f02699993a37138b0fa0af6fcfc77ec81dfbe46add4d7b29f9") + local_file=str(master_zip), + expected_checksum="bb325cf6c672f0f02699993a37138b0fa0af6fcfc77ec81dfbe46add4d7b29f9") with zipfile.ZipFile(master_zip, 'r') as zip: zip.extractall("data") Path("data/MuTual-master/data").rename(str(self.BASE_PATH)) @@ -70,10 +70,6 @@ def validation_docs(self): def test_docs(self): return NotImplemented - def fewshot_description(self): - # TODO: figure out fewshot description - return "" - def doc_to_text(self, doc): return self.detokenize(doc["article"]) diff --git a/lm_eval/tasks/naturalqs.py b/lm_eval/tasks/naturalqs.py index f31875240f1..e7a381dcd4b 100644 --- a/lm_eval/tasks/naturalqs.py +++ b/lm_eval/tasks/naturalqs.py @@ -21,10 +21,6 @@ def has_validation_docs(self): def has_test_docs(self): return False - def fewshot_description(self): - # TODO: figure out description - return "" - def training_docs(self): # Cache training for faster few-shot. # Data is too large to fit in memory. 
diff --git a/lm_eval/tasks/openbookqa.py b/lm_eval/tasks/openbookqa.py index 40fc7a026bd..5f87d8a8ec1 100644 --- a/lm_eval/tasks/openbookqa.py +++ b/lm_eval/tasks/openbookqa.py @@ -25,9 +25,5 @@ def _convert_standard(self, doc): } return out_doc - def fewshot_description(self): - # TODO: figure out fewshot description - return "" - def doc_to_text(self, doc): return doc["query"] diff --git a/lm_eval/tasks/pile.py b/lm_eval/tasks/pile.py index 68ff7ed9a8e..a4475832b55 100644 --- a/lm_eval/tasks/pile.py +++ b/lm_eval/tasks/pile.py @@ -10,7 +10,7 @@ class PilePerplexityTask(PerplexityTask, abc.ABC): - VERSION = 0 + VERSION = 1 PILE_SET_NAME = None VAL_PATH = 'data/pile/val.jsonl.zst' @@ -18,9 +18,11 @@ class PilePerplexityTask(PerplexityTask, abc.ABC): def download(self): # TODO: separate pile val/test out by component so we don't have to scan the entire file once per set - os.makedirs("data/pile/", exist_ok=True) - download_file("https://the-eye.eu/public/AI/pile/val.jsonl.zst", self.VAL_PATH, "264c875d8bbd355d8daa9d032b75fd8fb91606218bb84dd1155b203fcd5fab92") - download_file("https://the-eye.eu/public/AI/pile/test.jsonl.zst", self.TEST_PATH, "0bb28c52d0b5596d389bf179ce2d43bf7f7ffae76b0d2d20b180c97f62e0975e") + if not os.path.exists("data/pile/test.jsonl.zst"): + # todo use new best_download fallback api + os.makedirs("data/pile/", exist_ok=True) + download_file("http://eaidata.bmk.sh/data/pile/val.jsonl.zst", local_file=self.VAL_PATH, expected_checksum="264c875d8bbd355d8daa9d032b75fd8fb91606218bb84dd1155b203fcd5fab92") + download_file("http://eaidata.bmk.sh/data/pile/test.jsonl.zst", local_file=self.TEST_PATH, expected_checksum="0bb28c52d0b5596d389bf179ce2d43bf7f7ffae76b0d2d20b180c97f62e0975e") def validation_docs(self): rdr = lm_dataformat.Reader(self.VAL_PATH) diff --git a/lm_eval/tasks/piqa.py b/lm_eval/tasks/piqa.py index 8b43d1af03c..bdf3ec35dca 100644 --- a/lm_eval/tasks/piqa.py +++ b/lm_eval/tasks/piqa.py @@ -18,10 +18,6 @@ def has_validation_docs(self): def has_test_docs(self): return False - def fewshot_description(self): - # TODO: figure out fewshot description - return "" - def _convert_standard(self, doc): out_doc = { "goal": doc["goal"], diff --git a/lm_eval/tasks/prost.py b/lm_eval/tasks/prost.py index 1a634d17c80..e972d39ac03 100644 --- a/lm_eval/tasks/prost.py +++ b/lm_eval/tasks/prost.py @@ -36,13 +36,14 @@ def has_validation_docs(self): def has_test_docs(self): return True - def fewshot_description(self): - # TODO: figure out fewshot description - return "" - - def fewshot_context(self, doc, num_fewshot, provide_description, rnd): + def fewshot_context(self, doc, num_fewshot, provide_description=None, rnd=None, description=None): assert num_fewshot == 0, 'PROST is designed to probe models in a zero-shot fashion only.' 
- return super().fewshot_context(doc, num_fewshot, provide_description, rnd) + return super().fewshot_context( + doc=doc, + num_fewshot=num_fewshot, + rnd=rnd, + description=description + ) def _convert_standard(self, doc): out_doc = { diff --git a/lm_eval/tasks/pubmedqa.py b/lm_eval/tasks/pubmedqa.py index 14335a5de02..c597064f0c6 100644 --- a/lm_eval/tasks/pubmedqa.py +++ b/lm_eval/tasks/pubmedqa.py @@ -23,11 +23,6 @@ def test_docs(self): # HF is labelled as train but its really just for testing return self.data["train"] - def fewshot_description(self): - # Average ctx length in labelled dataset is 238.9 - # 2 few-shot exmamples pushes it beyond context window - return "" - def doc_to_text(self, doc): ctxs = "\n".join(doc["context"]["contexts"]) return "Abstract: {}\nQuestion: {}\nAnswer:".format( diff --git a/lm_eval/tasks/qa4mre.py b/lm_eval/tasks/qa4mre.py index 67810ad747a..c0966c24574 100644 --- a/lm_eval/tasks/qa4mre.py +++ b/lm_eval/tasks/qa4mre.py @@ -32,8 +32,8 @@ def download(self): if not os.path.isfile(f"data/qa4mre/QA4MRE-{year}-{lang}"): download_file( url_path, - f"data/qa4mre/QA4MRE-{year}-{lang}_GS.xml", - sha256sums[year], + local_file=f"data/qa4mre/QA4MRE-{year}-{lang}_GS.xml", + expected_checksum=sha256sums[year], ) def has_training_docs(self): @@ -67,9 +67,6 @@ def load_docs(self, textfilename, tfds=False): out_doc['source'] = src yield out_doc - def fewshot_description(self): - return "" - def test_docs(self): return self.load_docs(f"data/qa4mre/QA4MRE-{self.YEAR}-EN_GS.xml") diff --git a/lm_eval/tasks/quac.py b/lm_eval/tasks/quac.py index bb02b1c4e37..c7ce752233e 100644 --- a/lm_eval/tasks/quac.py +++ b/lm_eval/tasks/quac.py @@ -51,11 +51,6 @@ def validation_docs(self): def test_docs(self): raise NotImplementedError("QuAC has no test docs.") - def fewshot_description(self): - # TODO: figure out fewshot description - desc = "TITLE: Title of the context passage - subtitle of the passage\nPARAGRAPH: Passage describing the relevant information for answering questions.\n\nQ: Text of a question.\n\nA: Answer to the question, based on the passage. 
If it cannot be answered based on the passage, write CANNOTANSWER" - return desc - def load_doc(self, myjson): docs = [] for item in myjson: diff --git a/lm_eval/tasks/race.py b/lm_eval/tasks/race.py index 4525cb4ccdf..64ee000bd19 100644 --- a/lm_eval/tasks/race.py +++ b/lm_eval/tasks/race.py @@ -65,10 +65,6 @@ def validation_docs(self): def test_docs(self): return self._collate_data("test") - def fewshot_description(self): - # TODO: figure out description - return "" - @classmethod def get_answer_option(cls, problem): answer = cls.letter_to_num[problem['answer']] diff --git a/lm_eval/tasks/sat.py b/lm_eval/tasks/sat.py index e4411edfd8b..d75d7923b5a 100644 --- a/lm_eval/tasks/sat.py +++ b/lm_eval/tasks/sat.py @@ -61,10 +61,5 @@ def validation_docs(self): } yield doc - - def fewshot_description(self): - # TODO: figure out actual description - return "" - def doc_to_text(self, doc): return "{} is to {} as".format(*doc['query']) diff --git a/lm_eval/tasks/sciq.py b/lm_eval/tasks/sciq.py index b750354a7b0..e385811937a 100644 --- a/lm_eval/tasks/sciq.py +++ b/lm_eval/tasks/sciq.py @@ -13,8 +13,8 @@ def download(self): os.makedirs('data/sciq', exist_ok=True) download_file( 'https://ai2-public-datasets.s3.amazonaws.com/sciq/SciQ.zip', - 'data/sciq/SciQ.zip', - '7f3312f6ac6b09970b32942d106a8c44ec0dad46a0369f17d635aff8e348a87c', + local_file='data/sciq/SciQ.zip', + expected_checksum='7f3312f6ac6b09970b32942d106a8c44ec0dad46a0369f17d635aff8e348a87c', ) with zipfile.ZipFile("data/sciq/SciQ.zip", "r") as zf: zf.extractall("data/sciq/") @@ -50,9 +50,6 @@ def load_docs(self, textfilename): for record in docs: yield self._convert_standard(record) - def fewshot_description(self): - return "" - def training_docs(self): return self.load_docs("data/sciq/SciQ dataset-2 3/train.json") @@ -63,4 +60,4 @@ def test_docs(self): return self.load_docs("data/sciq/SciQ dataset-2 3/test.json") def doc_to_text(self, doc): - return "{}\nQuestion: {}\nAnswer:".format(doc["source"], doc["query"]).strip() \ No newline at end of file + return "{}\nQuestion: {}\nAnswer:".format(doc["source"], doc["query"]).strip() diff --git a/lm_eval/tasks/squad.py b/lm_eval/tasks/squad.py index 72e1a19b0ed..2a69a67c7bf 100644 --- a/lm_eval/tasks/squad.py +++ b/lm_eval/tasks/squad.py @@ -41,10 +41,6 @@ def training_docs(self): def validation_docs(self): return self.data["validation"] - def fewshot_description(self): - # TODO: figure out description - return "" - def doc_to_text(self, doc): return 'Title: ' + doc['title'] + '\n\n' + 'Background: ' + doc['context'] + '\n\n' + 'Question: ' + doc['question'] + '\n\n' + 'Answer:' diff --git a/lm_eval/tasks/storycloze.py b/lm_eval/tasks/storycloze.py index 3e178facb44..2cc16cf66d7 100644 --- a/lm_eval/tasks/storycloze.py +++ b/lm_eval/tasks/storycloze.py @@ -27,18 +27,12 @@ def load_doc(self, filename): filereader = csv.reader(file) return list(filereader) - def validation_docs(self): return self.load_doc("data/storycloze/cloze_test_val__winter2018-cloze_test_ALL_val - 1 - 1.csv") def test_docs(self): return self.load_doc("data/storycloze/cloze_test_test__winter2018-cloze_test_ALL_test - 1.csv") - - def fewshot_description(self): - # TODO: figure out fewshot description - return "" - def doc_to_text(self, doc): return ' '.join([*doc[1:5]]) diff --git a/lm_eval/tasks/superglue.py b/lm_eval/tasks/superglue.py index 33598f23015..f12b866d01f 100644 --- a/lm_eval/tasks/superglue.py +++ b/lm_eval/tasks/superglue.py @@ -13,7 +13,7 @@ class BoolQ(HFTask): - VERSION = 0 + VERSION = 1 DATASET_PATH = 
"super_glue" DATASET_NAME = "boolq" @@ -26,12 +26,8 @@ def has_validation_docs(self): def has_test_docs(self): return False - def fewshot_description(self): - # TODO: figure out actual description - return "Read the following passages and answer each question with a yes or a no." - def doc_to_text(self, doc): - return f"{doc['passage']}\nQuestion: {doc['question']}\nAnswer:" + return f"{doc['passage']}\nQuestion: {doc['question']}?\nAnswer:" def doc_to_target(self, doc): return " " + yesno(doc['label']) @@ -65,7 +61,7 @@ def aggregation(self): class CommitmentBank(HFTask): - VERSION = 0 + VERSION = 1 DATASET_PATH = "super_glue" DATASET_NAME = "cb" @@ -78,11 +74,6 @@ def has_validation_docs(self): def has_test_docs(self): return False - def fewshot_description(self): - # TODO: figure out actual description - return "Given a premise and a hypothesis, classify whether the author of the premise is committed" \ - "to the truth of the hypothesis. The three possible labels are true, false or neither." - def doc_to_text(self, doc): return "{}\nQuestion: {}. True, False or Neither?\nAnswer:".format( doc["premise"], @@ -93,14 +84,14 @@ def doc_to_target(self, doc): # True = entailment # False = contradiction # Neither = neutral - return " {}".format({0: "True", 1: "Neither", 2: "False"}[doc["label"]]) + return " {}".format({0: "True", 1: "False", 2: "Neither"}[doc["label"]]) def construct_requests(self, doc, ctx): ll_true, _ = rf.loglikelihood(ctx, ' True') - ll_neither, _ = rf.loglikelihood(ctx, ' Neither') ll_false, _ = rf.loglikelihood(ctx, ' False') + ll_neither, _ = rf.loglikelihood(ctx, ' Neither') - return ll_true, ll_neither, ll_false + return ll_true, ll_false, ll_neither def process_results(self, doc, results): gold = doc["label"] @@ -150,11 +141,6 @@ def has_validation_docs(self): def has_test_docs(self): return False - def fewshot_description(self): - # TODO: figure out actual description - return "Given a premise and one alternative with a causal relation to the premise and another without," \ - "choose the more plausible alternative" - def doc_to_text(self, doc): # Drop the period connector = { @@ -202,7 +188,7 @@ def convert_choice(choice): class MultiRC(HFTask): - VERSION = 0 + VERSION = 1 DATASET_PATH = "super_glue" DATASET_NAME = "multirc" @@ -215,10 +201,6 @@ def has_validation_docs(self): def has_test_docs(self): return False - def fewshot_description(self): - # TODO: figure out actual description - return "READING COMPREHENSION ANSWER KEY" - def doc_to_text(self, doc): return f"{doc['paragraph']}\nQuestion: {doc['question']}\nAnswer:" @@ -228,7 +210,7 @@ def doc_to_target(self, doc): @staticmethod def format_answer(answer, label): label_str = "yes" if label else "no" - return f"{label_str}, {answer}" + return f"{answer}\nIs the answer correct? {label_str}" def construct_requests(self, doc, ctx): true_choice = self.format_answer(answer=doc["answer"], label=True) @@ -240,7 +222,8 @@ def construct_requests(self, doc, ctx): return ll_true_choice, ll_false_choice def process_results(self, doc, results): - pred = np.argmax(results) + ll_true_choice, ll_false_choice = results + pred = ll_true_choice > ll_false_choice return { "acc": (pred, doc) } @@ -270,10 +253,6 @@ def has_validation_docs(self): def has_test_docs(self): return False - def fewshot_description(self): - # TODO: figure out actual description - return "" - def training_docs(self): # In ReCoRD, each doc manifests multiple "examples" in the context of few shot example packing. 
# Each doc consists of multiple answer candidates, each of which is scored yes/no. @@ -363,10 +342,6 @@ def has_validation_docs(self): def has_test_docs(self): return False - def fewshot_description(self): - # TODO: figure out actual description - return "" - def doc_to_text(self, doc): return "Sentence 1: {}\nSentence 2: {}\nQuestion: Is the word '{}' used in the same way in the" \ " two sentences above?\nAnswer:".format( @@ -432,12 +407,6 @@ def training_docs(self): ] return self._training_docs - def fewshot_description(self): - return "Final Exam with Answer Key\n" \ - "Instructions: Please carefully read the following passages. " \ - "For each passage, you must identify which noun the pronoun marked in *bold*" \ - " refers to.\n=====" - def doc_to_text(self, doc): raw_passage = doc["text"] # NOTE: HuggingFace span indices are word-based not character-based. diff --git a/lm_eval/tasks/translation.py b/lm_eval/tasks/translation.py index de02946334f..4d65de43fba 100644 --- a/lm_eval/tasks/translation.py +++ b/lm_eval/tasks/translation.py @@ -107,7 +107,7 @@ def doc_to_text(self, doc): language_codes = self.sacrebleu_language_pair.split("-") src_lang = code_to_language(language_codes[0]) tar_lang = code_to_language(language_codes[1]) - return f"{src_lang} phrase: " + doc["src"] + f"\n{tar_lang} phrase:" + return f"\nTranslate {src_lang} to {tar_lang}:\n [{src_lang}] " + doc["src"] + f"\n[{tar_lang}]" def doc_to_target(self, doc): # This shows a single target, though there may be multiple targets in a lang test @@ -132,7 +132,6 @@ def process_results(self, doc, results): if tar_lang_code in NO_SPACE_LANG: doc["ref"] = NO_SPACE_LANG[tar_lang_code]([doc["ref"]])[0] results = NO_SPACE_LANG[tar_lang_code](results) - # These metrics are corpus-level not sentence level, so we'll hide the # results in this dict and compute the corpus score in the aggregate method ref_pred = (doc["ref"], results) @@ -166,12 +165,6 @@ def higher_is_better(self): "ter": False, } - def fewshot_description(self): - language_codes = self.sacrebleu_language_pair.split("-") - src_lang = code_to_language(language_codes[0]) - tar_lang = code_to_language(language_codes[1]) - return f"Translate these {src_lang} phrases to {tar_lang}." 
- def __str__(self): language_codes = self.sacrebleu_language_pair.split("-") src_lang = code_to_language(language_codes[0]) diff --git a/lm_eval/tasks/triviaqa.py b/lm_eval/tasks/triviaqa.py index e61a40bdde2..86ba406ab97 100644 --- a/lm_eval/tasks/triviaqa.py +++ b/lm_eval/tasks/triviaqa.py @@ -12,7 +12,7 @@ class TriviaQA(Task): def download(self): if not os.path.exists('data/triviaqa/unfiltered-web-train.jsonl'): os.makedirs("data/triviaqa/", exist_ok=True) - download_file("http://eaidata.bmk.sh/data/triviaqa-unfiltered.tar.gz", "data/triviaqa/triviaqa-unfiltered.tar.gz", "adc19b42769062d241a8fbe834c56e58598d9322eb6c614e9f33a68a2cf5523e") + download_file("http://eaidata.bmk.sh/data/triviaqa-unfiltered.tar.gz", local_file="data/triviaqa/triviaqa-unfiltered.tar.gz", expected_checksum="adc19b42769062d241a8fbe834c56e58598d9322eb6c614e9f33a68a2cf5523e") sh(""" cd data/triviaqa/ tar -xf triviaqa-unfiltered.tar.gz @@ -36,10 +36,6 @@ def validation_docs(self): def test_docs(self): raise NotImplementedError() - def fewshot_description(self): - # TODO: figure out fewshot description - return "" - def doc_to_text(self, doc): return f"Question: {doc['Question']}\nAnswer:" @@ -56,7 +52,6 @@ def _remove_prefixes(self, aliases): ret.append(alias) return ret - def construct_requests(self, doc, ctx): ret = [] diff --git a/lm_eval/tasks/truthfulqa.py b/lm_eval/tasks/truthfulqa.py index f0b46196bc2..ad66c5ca6ad 100644 --- a/lm_eval/tasks/truthfulqa.py +++ b/lm_eval/tasks/truthfulqa.py @@ -58,7 +58,7 @@ def download(self): Path.mkdir(self.DATASET_PATH, parents=True) mc_url = "https://raw.githubusercontent.com/sylinrl/TruthfulQA/013686a06be7a7bde5bf8223943e106c7250123c/data/mc_task.json" checksum = "6eb4125d25750c0145c4be2dce00440736684ab6f74ce6bff2139571cc758954" - download_file(mc_url, str(self.DATASET_PATH / "mc_task.json"), checksum) + download_file(mc_url, local_file=str(self.DATASET_PATH / "mc_task.json"), expected_checksum=checksum) def has_training_docs(self): return False @@ -85,9 +85,14 @@ def doc_to_text(self, doc): def doc_to_target(self, doc): return " " - def fewshot_context(self, doc, num_fewshot, provide_description, rnd): + def fewshot_context(self, doc, num_fewshot, provide_description=None, rnd=None, description=None): assert num_fewshot == 0, "TruthfulQA is intended only for the zero-shot setting." - return super().fewshot_context(doc, num_fewshot, provide_description, rnd) + return super().fewshot_context( + doc=doc, + num_fewshot=num_fewshot, + rnd=rnd, + description=description + ) def construct_requests(self, doc, ctx): """ Uses RequestFactory to construct Requests and returns an iterable of @@ -163,7 +168,7 @@ def download(self): Path.mkdir(self.DATASET_PATH, parents=True) url = "https://raw.githubusercontent.com/sylinrl/TruthfulQA/013686a06be7a7bde5bf8223943e106c7250123c/TruthfulQA.csv" checksum = "8d7dd15f033196140f032d97d30f037da7a7b1192c3f36f9937c1850925335a2" - download_file(url, str(self.DATASET_PATH / "TruthfulQA.csv"), checksum) + download_file(url, local_file=str(self.DATASET_PATH / "TruthfulQA.csv"), expected_checksum=checksum) def has_training_docs(self): return False @@ -217,9 +222,14 @@ def doc_to_text(self, doc): def doc_to_target(self, doc): return " " - def fewshot_context(self, doc, num_fewshot, provide_description, rnd): + def fewshot_context(self, doc, num_fewshot, provide_description=None, rnd=None, description=None): assert num_fewshot == 0, "TruthfulQA is intended only for the zero-shot setting." 
- return super().fewshot_context(doc, num_fewshot, provide_description, rnd) + return super().fewshot_context( + doc=doc, + num_fewshot=num_fewshot, + rnd=rnd, + description=description + ) def construct_requests(self, doc, ctx): """ Uses RequestFactory to construct Requests and returns an iterable of diff --git a/lm_eval/tasks/unscramble.py b/lm_eval/tasks/unscramble.py index dc742a2ceef..41bccb0a79f 100644 --- a/lm_eval/tasks/unscramble.py +++ b/lm_eval/tasks/unscramble.py @@ -29,7 +29,7 @@ def download(self): if not file.exists(): rawfile = file.parent / (file.name + ".gz") base_url = "https://raw.githubusercontent.com/openai/gpt-3/master/data" - download_file(f"{base_url}/{self.FILENAME}.gz", str(rawfile), self.CHECKSUM) + download_file(f"{base_url}/{self.FILENAME}.gz", local_file=str(rawfile), expected_checksum=self.CHECKSUM) extract_gzip(gz=rawfile, to=file) def has_training_docs(self): @@ -45,9 +45,6 @@ def validation_docs(self): file = self.BASE_PATH / self.FILENAME return (json.loads(line) for line in open(file).read().splitlines()) - def fewshot_description(self): - return "Please unscramble the letters into a word, and write that word:" - def doc_to_text(self, doc): return doc["context"] diff --git a/lm_eval/tasks/webqs.py b/lm_eval/tasks/webqs.py index 51ed0167580..ebab7c8968f 100644 --- a/lm_eval/tasks/webqs.py +++ b/lm_eval/tasks/webqs.py @@ -17,10 +17,6 @@ def has_validation_docs(self): def has_test_docs(self): return True - def fewshot_description(self): - # TODO: figure out description - return "" - def doc_to_text(self, doc): return "Question: " + doc['question'] + '\nAnswer:' @@ -40,7 +36,6 @@ def _remove_prefixes(self, aliases): ret.append(alias) return ret - def construct_requests(self, doc, ctx): ret = [] @@ -62,4 +57,4 @@ def aggregation(self): def higher_is_better(self): return { "acc": True - } \ No newline at end of file + } diff --git a/lm_eval/tasks/wikitext.py b/lm_eval/tasks/wikitext.py index 24f9ec35074..40699863bb1 100644 --- a/lm_eval/tasks/wikitext.py +++ b/lm_eval/tasks/wikitext.py @@ -41,18 +41,14 @@ def wikitext_detokenizer(string): class WikiText(PerplexityTask): - VERSION = 0 + VERSION = 1 def download(self): if not os.path.exists('data/wikitext/wikitext-2-raw/wiki.valid.raw'): os.makedirs("data/wikitext/", exist_ok=True) - download_file("https://s3.amazonaws.com/research.metamind.io/wikitext/wikitext-2-raw-v1.zip", "data/wikitext/wikitext-2-raw-v1.zip", "ef7edb566e3e2b2d31b29c1fdb0c89a4cc683597484c3dc2517919c615435a11") + download_file("https://s3.amazonaws.com/research.metamind.io/wikitext/wikitext-2-raw-v1.zip", local_file="data/wikitext/wikitext-2-raw-v1.zip", expected_checksum="ef7edb566e3e2b2d31b29c1fdb0c89a4cc683597484c3dc2517919c615435a11") sh("cd data/wikitext/ && unzip wikitext-2-raw-v1.zip") - def fewshot_description(self): - # TODO: figure out fewshot description - return "" - def has_validation_docs(self): return True @@ -87,4 +83,4 @@ def doc_to_target(self, doc): def count_words(self, doc): # count number of words in *original doc before detokenization* - return len(re.split(r"\s+", doc)) \ No newline at end of file + return len(re.split(r"\s+", doc)) diff --git a/lm_eval/tasks/winogrande.py b/lm_eval/tasks/winogrande.py index 106c826a692..2e188d7b30c 100644 --- a/lm_eval/tasks/winogrande.py +++ b/lm_eval/tasks/winogrande.py @@ -29,10 +29,6 @@ def has_test_docs(self): def doc_to_text(self, doc): return self.partial_context(doc, doc["option" + doc["answer"]]) - def fewshot_description(self): - # TODO: redo description - return "Winograd 
schema sentence including a either a ___ blank with a missing word, making the pronoun ambiguous, or the same with the word filled in." - @classmethod def partial_context(cls, doc, option): # Substitute the pronoun in the sentence with the specified option diff --git a/lm_eval/tasks/wsc273.py b/lm_eval/tasks/wsc273.py index 20dd5175b60..505557b15c2 100644 --- a/lm_eval/tasks/wsc273.py +++ b/lm_eval/tasks/wsc273.py @@ -53,10 +53,6 @@ def has_validation_docs(self): def has_test_docs(self): return True - def fewshot_description(self): - # TODO: redo description - return "Winograd schema sentence with correct continuation. True. Winograd schema sentence with incorrect continuation. False." - def fewshot_examples(self, k, rnd): # NOTE: `super().fewshot_examples` samples from training docs which are # not available for this test-set-only dataset. diff --git a/lm_eval/utils.py b/lm_eval/utils.py index c3d718a5007..2a8c6d17fe8 100644 --- a/lm_eval/utils.py +++ b/lm_eval/utils.py @@ -1,6 +1,8 @@ import os import re import collections +import functools +import inspect class ExitCodeError(Exception): @@ -138,4 +140,18 @@ def get_original(self, newarr): assert all(cov) - return res \ No newline at end of file + return res + +def positional_deprecated(fn): + """ + A decorator to nudge users into passing only keyword args (`kwargs`) to the + wrapped function, `fn`. + """ + @functools.wraps(fn) + def _wrapper(*args, **kwargs): + if len(args) != 1 if inspect.ismethod(fn) else 0: + print(f"WARNING: using {fn.__name__} with positional arguments is " + "deprecated and will be disallowed in a future version of " + "lm-evaluation-harness!") + return fn(*args, **kwargs) + return _wrapper diff --git a/main.py b/main.py index c63446fcca4..cf13fe1315e 100644 --- a/main.py +++ b/main.py @@ -19,6 +19,7 @@ def parse_args(): parser.add_argument('--output_path', default=None) parser.add_argument('--limit', type=int, default=None) parser.add_argument('--no_cache', action="store_true") + parser.add_argument('--description_dict_path', default=None) return parser.parse_args() @@ -34,15 +35,21 @@ def main(): else: task_names = args.tasks.split(",") + description_dict = {} + if args.description_dict_path: + with open(args.description_dict_path, 'r') as f: + description_dict = json.load(f) + results = evaluator.simple_evaluate( model=args.model, model_args=args.model_args, - task_names=task_names, + tasks=task_names, num_fewshot=args.num_fewshot, batch_size=args.batch_size, device=args.device, no_cache=args.no_cache, limit=args.limit, + description_dict=description_dict ) dumped = json.dumps(results, indent=2) diff --git a/scripts/cost_estimate.py b/scripts/cost_estimate.py index 4339b8dbd21..d2e60bfa0d9 100644 --- a/scripts/cost_estimate.py +++ b/scripts/cost_estimate.py @@ -51,7 +51,14 @@ def main(): values = [] for taskname in task_list.split(","): lm.tokencost = 0 - evaluator.evaluate(lm, {taskname: tasks.get_task(taskname)()}, False, 0, None, bootstrap_iters=10) + evaluator.evaluate( + lm=lm, + task_dict={taskname: tasks.get_task(taskname)()}, + num_fewshot=0, + limit=None, + bootstrap_iters=10, + description_dict=None + ) print(taskname, lm.tokencost) values.append([taskname, lm.tokencost, lm.tokencost / 1000 * 0.0008, lm.tokencost / 1000 * 0.0012, lm.tokencost / 1000 * 0.006, lm.tokencost / 1000 * 0.06]) diff --git a/scripts/fewshot_description_experiment.py b/scripts/fewshot_description_experiment.py deleted file mode 100644 index e6ad97b3404..00000000000 --- a/scripts/fewshot_description_experiment.py +++ /dev/null 
@@ -1,79 +0,0 @@ -import json -import numpy as np -import random -import logging -from lm_eval import models, tasks, evaluator, base - -logging.getLogger("openai").setLevel(logging.WARNING) - - -fewshot_descriptions = [ - "foo", - "bar" -] - -task = "lambada" -num_fewshot = 0 -model = "gpt2" -model_args = "" -limit = None -no_cache = False - - -class CustomDescTask: - def __init__(self, task, desc): - self.task = task - self.desc = desc - - def fewshot_description(): - return self.desc - - self.task.fewshot_description = fewshot_description - - def __getattr__(self, attr): - return getattr(self.task, attr) - - -def main(): - random.seed(42) - np.random.seed(42) - - lm = models.get_model(model).create_from_arg_string(model_args) - - if limit: - print("WARNING: --limit SHOULD ONLY BE USED FOR TESTING. REAL METRICS SHOULD NOT BE COMPUTED USING LIMIT.") - - if not no_cache: - lm = base.CachingLM(lm, 'lm_cache/' + model + '_' + model_args.replace('=', '-').replace(',', '_') + '.db') - - task_dict = tasks.get_task_dict([task]) - - for desc in fewshot_descriptions: - custom_task_dict = {k: CustomDescTask(v, desc) for k, v in task_dict.items()} - - results = evaluator.evaluate(lm, custom_task_dict, True, num_fewshot, limit) - - dumped = json.dumps(results, indent=2) - - print('Description:', desc) - print(dumped) - - # MAKE TABLE - from pytablewriter import MarkdownTableWriter - - writer = MarkdownTableWriter() - writer.headers = ["Task", "Metric", "Value"] - - values = [] - - for k, dic in results.items(): - for m, v in dic.items(): - values.append([k, m, '%.4f' % v]) - k = "" - writer.value_matrix = values - - print(writer.dumps()) - - -if __name__ == "__main__": - main() diff --git a/scripts/get_prompts.py b/scripts/get_prompts.py index cdda6dc43e4..56a9ff79f44 100644 --- a/scripts/get_prompts.py +++ b/scripts/get_prompts.py @@ -9,7 +9,6 @@ print('#', tname) docs = islice(task.validation_docs() if task.has_validation_docs() else task.test_docs(), ct) print() - print('**Zero-Shot Prompt**:', "\n```\n" + task.fewshot_description() + "\n```\n") for i in range(ct): print() doc = next(docs) diff --git a/scripts/write_out.py b/scripts/write_out.py index b7eb30c15ad..2039d3934f3 100644 --- a/scripts/write_out.py +++ b/scripts/write_out.py @@ -1,5 +1,6 @@ import argparse import numpy as np +import json import os import random from lm_eval import tasks @@ -17,6 +18,7 @@ def parse_args(): parser.add_argument('--num_fewshot', type=int, default=1) parser.add_argument('--seed', type=int, default=42) parser.add_argument('--num_examples', type=int, default=1) + parser.add_argument('--description_dict_path', default=None) return parser.parse_args() @@ -29,6 +31,12 @@ def main(): else: task_names = args.tasks.split(",") task_dict = tasks.get_task_dict(task_names) + + description_dict = {} + if args.description_dict_path: + with open(args.description_dict_path, 'r') as f: + description_dict = json.load(f) + os.makedirs(args.output_base_path, exist_ok=True) for task_name, task in task_dict.items(): rnd = random.Random() @@ -47,14 +55,16 @@ def main(): docs = join_iters(iters) + description = description_dict[task_name] if description_dict and task_name in description_dict else "" + with open(os.path.join(args.output_base_path, task_name), "w") as f: for i, doc in zip(range(args.num_examples), docs) if args.num_examples > 0 else enumerate(docs): f.write(EXAMPLE_DIVIDER.format(i=i)) ctx = task.fewshot_context( doc=doc, - provide_description=args.provide_description, num_fewshot=args.num_fewshot, - rnd=rnd + 
rnd=rnd, + description=description ) f.write(ctx + "\n") diff --git a/tests/test_cache.db b/tests/test_cache.db deleted file mode 100644 index 7477f429bf6..00000000000 Binary files a/tests/test_cache.db and /dev/null differ diff --git a/tests/test_description_dict.py b/tests/test_description_dict.py new file mode 100644 index 00000000000..f80f5290638 --- /dev/null +++ b/tests/test_description_dict.py @@ -0,0 +1,42 @@ +import random +import lm_eval.tasks +import lm_eval.models + + +def test_description_dict(): + seed = 42 + num_examples = 1 + task_names = ["hellaswag", "winogrande"] + description_dict = { + "hellaswag": "Label for the relevant action:\nSentences describing context, with an incomplete sentence trailing answer that plausibly completes the situation.", + "winogrande": "Winograd schema sentence including a either a ___ blank with a missing word, making the pronoun ambiguous, or the same with the word filled in.", + } + + task_dict = lm_eval.tasks.get_task_dict(task_names) + for task_name, task in task_dict.items(): + rnd = random.Random() + rnd.seed(seed) + + if task.has_training_docs(): + docs = task.training_docs() + elif set == "val" and task.has_validation_docs(): + docs = task.validation_docs() + elif set == "test" and task.has_test_docs(): + docs = task.test_docs() + + description = ( + description_dict[task_name] + if description_dict and task_name in description_dict + else "" + ) + + for _, doc in ( + zip(range(num_examples), docs) if num_examples > 0 else enumerate(docs) + ): + ctx = task.fewshot_context( + doc=doc, + num_fewshot=1, + rnd=rnd, + description=description, + ) + assert description in ctx diff --git a/tests/test_evaluator.py b/tests/test_evaluator.py index 491e9de899d..363384a05c9 100644 --- a/tests/test_evaluator.py +++ b/tests/test_evaluator.py @@ -48,8 +48,22 @@ def ll_perp_fn(reqs): lm.loglikelihood_rolling = ll_perp_fn limit = 10 - e1 = evaluator.evaluate(lm, task_dict, False, 0, limit, bootstrap_iters=10) - e2 = evaluator.evaluate(lm, task_dict, False, 0, limit, bootstrap_iters=10) + e1 = evaluator.evaluate( + lm=lm, + task_dict=task_dict, + num_fewshot=0, + limit=limit, + bootstrap_iters=10, + description_dict=None + ) + e2 = evaluator.evaluate( + lm=lm, + task_dict=task_dict, + num_fewshot=0, + limit=limit, + bootstrap_iters=10, + description_dict=None + ) # check that caching is working assert e1 == e2 diff --git a/tests/test_tasks.py b/tests/test_tasks.py index 97baeacf8a9..46812798a91 100644 --- a/tests/test_tasks.py +++ b/tests/test_tasks.py @@ -32,7 +32,7 @@ def test_basic_interface(taskname, task_class): limit = None - if taskname in ["triviaqa"]: + if taskname in ["triviaqa"] or taskname.startswith("pile_"): limit = 10000 if task.has_validation_docs(): arr = list(islice(task.validation_docs(), limit)) diff --git a/tests/test_version_stable.py b/tests/test_version_stable.py index d230112de16..7dd36a94b6b 100644 --- a/tests/test_version_stable.py +++ b/tests/test_version_stable.py @@ -99,5 +99,13 @@ def greedy_until(reqs): lm.greedy_until = greedy_until limit = None - result = evaluator.evaluate(lm, task_dict, False, 0, limit, bootstrap_iters=10) + result = evaluator.evaluate( + lm=lm, + task_dict=task_dict, + num_fewshot=0, + limit=limit, + bootstrap_iters=10, + description_dict=None + ) + assert_target(f"{taskname}-v{task_class.VERSION}-res", result) diff --git a/tests/testdata/boolq-v1-loglikelihood b/tests/testdata/boolq-v1-loglikelihood new file mode 100644 index 00000000000..7811121c9fd --- /dev/null +++ 
b/tests/testdata/boolq-v1-loglikelihood @@ -0,0 +1 @@ +6577e0d88572772ef08e64f624c0e3df0953286ae1f118ccef15623b59ffeabf \ No newline at end of file diff --git a/tests/testdata/boolq-v1-res.json b/tests/testdata/boolq-v1-res.json new file mode 100644 index 00000000000..291b9f122d0 --- /dev/null +++ b/tests/testdata/boolq-v1-res.json @@ -0,0 +1 @@ +{"results": {"boolq": {"acc": 0.5048929663608562, "acc_stderr": 0.00874463623355505}}, "versions": {"boolq": 1}} \ No newline at end of file diff --git a/tests/testdata/cb-v1-loglikelihood b/tests/testdata/cb-v1-loglikelihood new file mode 100644 index 00000000000..ad7e928fe6a --- /dev/null +++ b/tests/testdata/cb-v1-loglikelihood @@ -0,0 +1 @@ +77b11f4348eb8a7f57faf95c531fda01ab4bf0e729f91a82451ed8e71ec8e66d \ No newline at end of file diff --git a/tests/testdata/cb-v1-res.json b/tests/testdata/cb-v1-res.json new file mode 100644 index 00000000000..1cff410b2c3 --- /dev/null +++ b/tests/testdata/cb-v1-res.json @@ -0,0 +1 @@ +{"results": {"cb": {"acc": 0.3392857142857143, "acc_stderr": 0.06384226561930825, "f1": 0.2819143819143819}}, "versions": {"cb": 1}} \ No newline at end of file diff --git a/tests/testdata/headqa_en-v0-loglikelihood b/tests/testdata/headqa_en-v0-loglikelihood new file mode 100644 index 00000000000..11f07878fb5 --- /dev/null +++ b/tests/testdata/headqa_en-v0-loglikelihood @@ -0,0 +1 @@ +09da45119b12a0144e3081f8fb790c2a22af7b9c3aac42f54423d348a711fbf5 \ No newline at end of file diff --git a/tests/testdata/headqa_en-v0-res.json b/tests/testdata/headqa_en-v0-res.json new file mode 100644 index 00000000000..6ac5a9c0b8e --- /dev/null +++ b/tests/testdata/headqa_en-v0-res.json @@ -0,0 +1 @@ +{"results": {"headqa_en": {"acc": 0.23559445660102116, "acc_norm": 0.2447118891320204, "acc_norm_stderr": 0.008211629406841468, "acc_stderr": 0.008105688874297972}}, "versions": {"headqa_en": 0}} \ No newline at end of file diff --git a/tests/testdata/headqa_es-v0-loglikelihood b/tests/testdata/headqa_es-v0-loglikelihood new file mode 100644 index 00000000000..9129d834b60 --- /dev/null +++ b/tests/testdata/headqa_es-v0-loglikelihood @@ -0,0 +1 @@ +767ca34d9714edd9fb030ddbcc35a64e5180d1e247b0cb557fbb22fdf971ad1f \ No newline at end of file diff --git a/tests/testdata/headqa_es-v0-res.json b/tests/testdata/headqa_es-v0-res.json new file mode 100644 index 00000000000..0964db9bbb8 --- /dev/null +++ b/tests/testdata/headqa_es-v0-res.json @@ -0,0 +1 @@ +{"results": {"headqa_es": {"acc": 0.23559445660102116, "acc_norm": 0.25018234865062, "acc_norm_stderr": 0.008272783230806014, "acc_stderr": 0.008105688874297972}}, "versions": {"headqa_es": 0}} \ No newline at end of file diff --git a/tests/testdata/multirc-v1-loglikelihood b/tests/testdata/multirc-v1-loglikelihood new file mode 100644 index 00000000000..52a89c6f9ea --- /dev/null +++ b/tests/testdata/multirc-v1-loglikelihood @@ -0,0 +1 @@ +0e793bd6f637a70a04c6f2cda080188fc037961b2f909095fe63f7bdbc4a90c6 \ No newline at end of file diff --git a/tests/testdata/multirc-v1-res.json b/tests/testdata/multirc-v1-res.json new file mode 100644 index 00000000000..938141bbb88 --- /dev/null +++ b/tests/testdata/multirc-v1-res.json @@ -0,0 +1 @@ +{"results": {"multirc": {"acc": 0.046169989506820566, "acc_stderr": 0.006801377886208738}}, "versions": {"multirc": 1}} \ No newline at end of file diff --git a/tests/testdata/pile_arxiv-v1-loglikelihood_rolling b/tests/testdata/pile_arxiv-v1-loglikelihood_rolling new file mode 100644 index 00000000000..3aa1d8c7349 --- /dev/null +++ 
b/tests/testdata/pile_arxiv-v1-loglikelihood_rolling @@ -0,0 +1 @@ +814f9954e44368559602c00f7e85fa3971acdfd0315f508ec7df6318a79c55ec \ No newline at end of file diff --git a/tests/testdata/pile_arxiv-v1-res.json b/tests/testdata/pile_arxiv-v1-res.json new file mode 100644 index 00000000000..05cbab38732 --- /dev/null +++ b/tests/testdata/pile_arxiv-v1-res.json @@ -0,0 +1 @@ +{"results": {"pile_arxiv": {"bits_per_byte": 1.55095665856779e-05, "byte_perplexity": 1.0000107504701365, "word_perplexity": 1.0000819333090385}}, "versions": {"pile_arxiv": 1}} \ No newline at end of file diff --git a/tests/testdata/pile_bookcorpus2-v1-loglikelihood_rolling b/tests/testdata/pile_bookcorpus2-v1-loglikelihood_rolling new file mode 100644 index 00000000000..b37a91cc2de --- /dev/null +++ b/tests/testdata/pile_bookcorpus2-v1-loglikelihood_rolling @@ -0,0 +1 @@ +5c17ddfebeab8c41dabadb6fc216ceda91e3fe5dc95aaf1b2c843d7f11828b03 \ No newline at end of file diff --git a/tests/testdata/pile_bookcorpus2-v1-res.json b/tests/testdata/pile_bookcorpus2-v1-res.json new file mode 100644 index 00000000000..967c14934b8 --- /dev/null +++ b/tests/testdata/pile_bookcorpus2-v1-res.json @@ -0,0 +1 @@ +{"results": {"pile_bookcorpus2": {"bits_per_byte": 1.6780040419457868e-06, "byte_perplexity": 1.000001163104447, "word_perplexity": 1.0000066499426599}}, "versions": {"pile_bookcorpus2": 1}} \ No newline at end of file diff --git a/tests/testdata/pile_books3-v1-loglikelihood_rolling b/tests/testdata/pile_books3-v1-loglikelihood_rolling new file mode 100644 index 00000000000..b483d3b45b4 --- /dev/null +++ b/tests/testdata/pile_books3-v1-loglikelihood_rolling @@ -0,0 +1 @@ +0f8f36f705b999b6d55fa72ff89a82793dd1cb568ab1f8727a6a2086a12b9410 \ No newline at end of file diff --git a/tests/testdata/pile_books3-v1-res.json b/tests/testdata/pile_books3-v1-res.json new file mode 100644 index 00000000000..6ff7a517112 --- /dev/null +++ b/tests/testdata/pile_books3-v1-res.json @@ -0,0 +1 @@ +{"results": {"pile_books3": {"bits_per_byte": 1.2901280503011222e-06, "byte_perplexity": 1.0000008942490204, "word_perplexity": 1.0000052870063607}}, "versions": {"pile_books3": 1}} \ No newline at end of file diff --git a/tests/testdata/pile_dm-mathematics-v1-loglikelihood_rolling b/tests/testdata/pile_dm-mathematics-v1-loglikelihood_rolling new file mode 100644 index 00000000000..2fb27786c54 --- /dev/null +++ b/tests/testdata/pile_dm-mathematics-v1-loglikelihood_rolling @@ -0,0 +1 @@ +d5b7967c0ece8b816f3921a8bd0fad23365349e935b491595e2ad1135af42da6 \ No newline at end of file diff --git a/tests/testdata/pile_dm-mathematics-v1-res.json b/tests/testdata/pile_dm-mathematics-v1-res.json new file mode 100644 index 00000000000..192e9066a42 --- /dev/null +++ b/tests/testdata/pile_dm-mathematics-v1-res.json @@ -0,0 +1 @@ +{"results": {"pile_dm-mathematics": {"bits_per_byte": 8.910951449933553e-05, "byte_perplexity": 1.0000617679162955, "word_perplexity": 1.0002875035042451}}, "versions": {"pile_dm-mathematics": 1}} \ No newline at end of file diff --git a/tests/testdata/pile_enron-v1-loglikelihood_rolling b/tests/testdata/pile_enron-v1-loglikelihood_rolling new file mode 100644 index 00000000000..57dbe764605 --- /dev/null +++ b/tests/testdata/pile_enron-v1-loglikelihood_rolling @@ -0,0 +1 @@ +4baa6ccdc9e3aa9921675ab4400d5e89d7b546b844a8ea28f6461d649066418a \ No newline at end of file diff --git a/tests/testdata/pile_enron-v1-res.json b/tests/testdata/pile_enron-v1-res.json new file mode 100644 index 00000000000..abe7b45f9af --- /dev/null +++ 
b/tests/testdata/pile_enron-v1-res.json @@ -0,0 +1 @@ +{"results": {"pile_enron": {"bits_per_byte": 0.0004564546920781453, "byte_perplexity": 1.000316440339552, "word_perplexity": 1.00224668051869}}, "versions": {"pile_enron": 1}} \ No newline at end of file diff --git a/tests/testdata/pile_europarl-v1-loglikelihood_rolling b/tests/testdata/pile_europarl-v1-loglikelihood_rolling new file mode 100644 index 00000000000..80272607557 --- /dev/null +++ b/tests/testdata/pile_europarl-v1-loglikelihood_rolling @@ -0,0 +1 @@ +e67d3dbccd47d308bfc5b0e66b76d0dfc5e386ebfa94e056562c2281c395543f \ No newline at end of file diff --git a/tests/testdata/pile_europarl-v1-res.json b/tests/testdata/pile_europarl-v1-res.json new file mode 100644 index 00000000000..b948f0d3691 --- /dev/null +++ b/tests/testdata/pile_europarl-v1-res.json @@ -0,0 +1 @@ +{"results": {"pile_europarl": {"bits_per_byte": 1.2477664839621123e-05, "byte_perplexity": 1.000008648895605, "word_perplexity": 1.000063506523818}}, "versions": {"pile_europarl": 1}} \ No newline at end of file diff --git a/tests/testdata/pile_freelaw-v1-loglikelihood_rolling b/tests/testdata/pile_freelaw-v1-loglikelihood_rolling new file mode 100644 index 00000000000..7b5771f4911 --- /dev/null +++ b/tests/testdata/pile_freelaw-v1-loglikelihood_rolling @@ -0,0 +1 @@ +d77f3f68aadd6cbf1290c2f6737b2ed5d5c2a60e4c81a65c280f207783caabe1 \ No newline at end of file diff --git a/tests/testdata/pile_freelaw-v1-res.json b/tests/testdata/pile_freelaw-v1-res.json new file mode 100644 index 00000000000..dd0e0bac36b --- /dev/null +++ b/tests/testdata/pile_freelaw-v1-res.json @@ -0,0 +1 @@ +{"results": {"pile_freelaw": {"bits_per_byte": 4.5623635481434923e-05, "byte_perplexity": 1.0000316243943415, "word_perplexity": 1.000203169094218}}, "versions": {"pile_freelaw": 1}} \ No newline at end of file diff --git a/tests/testdata/pile_github-v1-loglikelihood_rolling b/tests/testdata/pile_github-v1-loglikelihood_rolling new file mode 100644 index 00000000000..cf8251e4f68 --- /dev/null +++ b/tests/testdata/pile_github-v1-loglikelihood_rolling @@ -0,0 +1 @@ +df384c3df3d8f53273e97127c5bb84c17e638acad7d6bc9c91f6dee96d43b639 \ No newline at end of file diff --git a/tests/testdata/pile_github-v1-res.json b/tests/testdata/pile_github-v1-res.json new file mode 100644 index 00000000000..cc06a45501f --- /dev/null +++ b/tests/testdata/pile_github-v1-res.json @@ -0,0 +1 @@ +{"results": {"pile_github": {"bits_per_byte": 0.00013764216145332133, "byte_perplexity": 1.0000954108274611, "word_perplexity": 1.0009643183931227}}, "versions": {"pile_github": 1}} \ No newline at end of file diff --git a/tests/testdata/pile_gutenberg-v1-loglikelihood_rolling b/tests/testdata/pile_gutenberg-v1-loglikelihood_rolling new file mode 100644 index 00000000000..bd7b15927f7 --- /dev/null +++ b/tests/testdata/pile_gutenberg-v1-loglikelihood_rolling @@ -0,0 +1 @@ +02a559f74a9105145e7d4d9c5ddea372b5b4938f5368dc8ffafc39cbe3b4c7ef \ No newline at end of file diff --git a/tests/testdata/pile_gutenberg-v1-res.json b/tests/testdata/pile_gutenberg-v1-res.json new file mode 100644 index 00000000000..6d22ed3ff50 --- /dev/null +++ b/tests/testdata/pile_gutenberg-v1-res.json @@ -0,0 +1 @@ +{"results": {"pile_gutenberg": {"bits_per_byte": 1.7952329146458065e-06, "byte_perplexity": 1.0000012443614075, "word_perplexity": 1.0000072174665404}}, "versions": {"pile_gutenberg": 1}} \ No newline at end of file diff --git a/tests/testdata/pile_hackernews-v1-loglikelihood_rolling b/tests/testdata/pile_hackernews-v1-loglikelihood_rolling new 
file mode 100644 index 00000000000..48b767bfe70 --- /dev/null +++ b/tests/testdata/pile_hackernews-v1-loglikelihood_rolling @@ -0,0 +1 @@ +ec1082ee5a5326e0d57aa4e73b634937140c1de9af95f154e8ab57b05d9b422b \ No newline at end of file diff --git a/tests/testdata/pile_hackernews-v1-res.json b/tests/testdata/pile_hackernews-v1-res.json new file mode 100644 index 00000000000..ea135278b72 --- /dev/null +++ b/tests/testdata/pile_hackernews-v1-res.json @@ -0,0 +1 @@ +{"results": {"pile_hackernews": {"bits_per_byte": 0.00014672607267878518, "byte_perplexity": 1.0001017079354932, "word_perplexity": 1.0006273924348839}}, "versions": {"pile_hackernews": 1}} \ No newline at end of file diff --git a/tests/testdata/pile_nih-exporter-v1-loglikelihood_rolling b/tests/testdata/pile_nih-exporter-v1-loglikelihood_rolling new file mode 100644 index 00000000000..5f76588a813 --- /dev/null +++ b/tests/testdata/pile_nih-exporter-v1-loglikelihood_rolling @@ -0,0 +1 @@ +520ea6e04e8a39dc0b5f63a837429a78a40e63d39d109096101feb8c5b2cf8d8 \ No newline at end of file diff --git a/tests/testdata/pile_nih-exporter-v1-res.json b/tests/testdata/pile_nih-exporter-v1-res.json new file mode 100644 index 00000000000..0e40fc8268a --- /dev/null +++ b/tests/testdata/pile_nih-exporter-v1-res.json @@ -0,0 +1 @@ +{"results": {"pile_nih-exporter": {"bits_per_byte": 0.00035193728014978225, "byte_perplexity": 1.0002439740903082, "word_perplexity": 1.0016712202288802}}, "versions": {"pile_nih-exporter": 1}} \ No newline at end of file diff --git a/tests/testdata/pile_opensubtitles-v1-loglikelihood_rolling b/tests/testdata/pile_opensubtitles-v1-loglikelihood_rolling new file mode 100644 index 00000000000..47805d3b5fe --- /dev/null +++ b/tests/testdata/pile_opensubtitles-v1-loglikelihood_rolling @@ -0,0 +1 @@ +0f1c23a1f4ddec0c2b1ff34de8d1505b0eb9e2868d8edbcc1b6de13d02f32036 \ No newline at end of file diff --git a/tests/testdata/pile_opensubtitles-v1-res.json b/tests/testdata/pile_opensubtitles-v1-res.json new file mode 100644 index 00000000000..1468294732b --- /dev/null +++ b/tests/testdata/pile_opensubtitles-v1-res.json @@ -0,0 +1 @@ +{"results": {"pile_opensubtitles": {"bits_per_byte": 2.1948356082685497e-05, "byte_perplexity": 1.0000152135568616, "word_perplexity": 1.0000856162053249}}, "versions": {"pile_opensubtitles": 1}} \ No newline at end of file diff --git a/tests/testdata/pile_openwebtext2-v1-loglikelihood_rolling b/tests/testdata/pile_openwebtext2-v1-loglikelihood_rolling new file mode 100644 index 00000000000..22046e44058 --- /dev/null +++ b/tests/testdata/pile_openwebtext2-v1-loglikelihood_rolling @@ -0,0 +1 @@ +5d6c19665f429ab1ccbe027da67f42bdaf219f819ab093673976eee55e015ff4 \ No newline at end of file diff --git a/tests/testdata/pile_openwebtext2-v1-res.json b/tests/testdata/pile_openwebtext2-v1-res.json new file mode 100644 index 00000000000..ca433e3c854 --- /dev/null +++ b/tests/testdata/pile_openwebtext2-v1-res.json @@ -0,0 +1 @@ +{"results": {"pile_openwebtext2": {"bits_per_byte": 0.000184802319359215, "byte_perplexity": 1.000128103411166, "word_perplexity": 1.0007951516532847}}, "versions": {"pile_openwebtext2": 1}} \ No newline at end of file diff --git a/tests/testdata/pile_philpapers-v1-loglikelihood_rolling b/tests/testdata/pile_philpapers-v1-loglikelihood_rolling new file mode 100644 index 00000000000..4fbbc241ba9 --- /dev/null +++ b/tests/testdata/pile_philpapers-v1-loglikelihood_rolling @@ -0,0 +1 @@ +339ba5d8c044c4a3ff9b9a8eaa24da1d6c01b72972074eb671a7da049eeb7047 \ No newline at end of file diff --git 
a/tests/testdata/pile_philpapers-v1-res.json b/tests/testdata/pile_philpapers-v1-res.json new file mode 100644 index 00000000000..5a2f77678ab --- /dev/null +++ b/tests/testdata/pile_philpapers-v1-res.json @@ -0,0 +1 @@ +{"results": {"pile_philpapers": {"bits_per_byte": 9.004690592465457e-06, "byte_perplexity": 1.0000062415953748, "word_perplexity": 1.0000409888564146}}, "versions": {"pile_philpapers": 1}} \ No newline at end of file diff --git a/tests/testdata/pile_pile-cc-v1-loglikelihood_rolling b/tests/testdata/pile_pile-cc-v1-loglikelihood_rolling new file mode 100644 index 00000000000..d5369ed3c97 --- /dev/null +++ b/tests/testdata/pile_pile-cc-v1-loglikelihood_rolling @@ -0,0 +1 @@ +731fdef4a43949b179ba0c540148ebc2fa41583dd583ef580dd812076c66a451 \ No newline at end of file diff --git a/tests/testdata/pile_pile-cc-v1-res.json b/tests/testdata/pile_pile-cc-v1-res.json new file mode 100644 index 00000000000..bd2772e32a9 --- /dev/null +++ b/tests/testdata/pile_pile-cc-v1-res.json @@ -0,0 +1 @@ +{"results": {"pile_pile-cc": {"bits_per_byte": 0.0001620742639125056, "byte_perplexity": 1.0001123476295946, "word_perplexity": 1.0006738958554477}}, "versions": {"pile_pile-cc": 1}} \ No newline at end of file diff --git a/tests/testdata/pile_pubmed-abstracts-v1-loglikelihood_rolling b/tests/testdata/pile_pubmed-abstracts-v1-loglikelihood_rolling new file mode 100644 index 00000000000..de5660d60a8 --- /dev/null +++ b/tests/testdata/pile_pubmed-abstracts-v1-loglikelihood_rolling @@ -0,0 +1 @@ +66436569a43163afb2caf422d32c5f329899e74c49865d4d13881fd465fd9976 \ No newline at end of file diff --git a/tests/testdata/pile_pubmed-abstracts-v1-res.json b/tests/testdata/pile_pubmed-abstracts-v1-res.json new file mode 100644 index 00000000000..21b6bb451fe --- /dev/null +++ b/tests/testdata/pile_pubmed-abstracts-v1-res.json @@ -0,0 +1 @@ +{"results": {"pile_pubmed-abstracts": {"bits_per_byte": 0.0005417858444030858, "byte_perplexity": 1.0003756078534862, "word_perplexity": 1.0025884332779}}, "versions": {"pile_pubmed-abstracts": 1}} \ No newline at end of file diff --git a/tests/testdata/pile_pubmed-central-v1-loglikelihood_rolling b/tests/testdata/pile_pubmed-central-v1-loglikelihood_rolling new file mode 100644 index 00000000000..283109f32e0 --- /dev/null +++ b/tests/testdata/pile_pubmed-central-v1-loglikelihood_rolling @@ -0,0 +1 @@ +40b39d120d99a145690444e86acc3e3e24d41e6e0538a75e26929ad84926e5e0 \ No newline at end of file diff --git a/tests/testdata/pile_pubmed-central-v1-res.json b/tests/testdata/pile_pubmed-central-v1-res.json new file mode 100644 index 00000000000..4d4a241ace0 --- /dev/null +++ b/tests/testdata/pile_pubmed-central-v1-res.json @@ -0,0 +1 @@ +{"results": {"pile_pubmed-central": {"bits_per_byte": 2.2812488135667854e-05, "byte_perplexity": 1.0000158125368497, "word_perplexity": 1.000123107107861}}, "versions": {"pile_pubmed-central": 1}} \ No newline at end of file diff --git a/tests/testdata/pile_stackexchange-v1-loglikelihood_rolling b/tests/testdata/pile_stackexchange-v1-loglikelihood_rolling new file mode 100644 index 00000000000..dcf0e64cf0d --- /dev/null +++ b/tests/testdata/pile_stackexchange-v1-loglikelihood_rolling @@ -0,0 +1 @@ +e524bfb3e21cbdaddc117403a50df598520c7bf5b2c60ad8f2372cfa564e79be \ No newline at end of file diff --git a/tests/testdata/pile_stackexchange-v1-res.json b/tests/testdata/pile_stackexchange-v1-res.json new file mode 100644 index 00000000000..2773302990f --- /dev/null +++ b/tests/testdata/pile_stackexchange-v1-res.json @@ -0,0 +1 @@ +{"results": 
{"pile_stackexchange": {"bits_per_byte": 0.0003302063346758449, "byte_perplexity": 1.0002289077852733, "word_perplexity": 1.0016993562258851}}, "versions": {"pile_stackexchange": 1}} \ No newline at end of file diff --git a/tests/testdata/pile_ubuntu-irc-v1-loglikelihood_rolling b/tests/testdata/pile_ubuntu-irc-v1-loglikelihood_rolling new file mode 100644 index 00000000000..ce041998635 --- /dev/null +++ b/tests/testdata/pile_ubuntu-irc-v1-loglikelihood_rolling @@ -0,0 +1 @@ +4eb69e314f0864ec8890e2323d7e76f8a8309692c4f090e2b41bf4be681a811d \ No newline at end of file diff --git a/tests/testdata/pile_ubuntu-irc-v1-res.json b/tests/testdata/pile_ubuntu-irc-v1-res.json new file mode 100644 index 00000000000..0e3b1b25977 --- /dev/null +++ b/tests/testdata/pile_ubuntu-irc-v1-res.json @@ -0,0 +1 @@ +{"results": {"pile_ubuntu-irc": {"bits_per_byte": 2.3513498942121155e-06, "byte_perplexity": 1.0000016298328778, "word_perplexity": 1.0000108866656874}}, "versions": {"pile_ubuntu-irc": 1}} \ No newline at end of file diff --git a/tests/testdata/pile_uspto-v1-loglikelihood_rolling b/tests/testdata/pile_uspto-v1-loglikelihood_rolling new file mode 100644 index 00000000000..4649d3b9b7f --- /dev/null +++ b/tests/testdata/pile_uspto-v1-loglikelihood_rolling @@ -0,0 +1 @@ +789b2bdb31564d512b70f801316f49320a26c83ba361226bac0afb255341d477 \ No newline at end of file diff --git a/tests/testdata/pile_uspto-v1-res.json b/tests/testdata/pile_uspto-v1-res.json new file mode 100644 index 00000000000..599ae44ef43 --- /dev/null +++ b/tests/testdata/pile_uspto-v1-res.json @@ -0,0 +1 @@ +{"results": {"pile_uspto": {"bits_per_byte": 0.000174024142670342, "byte_perplexity": 1.00012063161925, "word_perplexity": 1.0007716198916954}}, "versions": {"pile_uspto": 1}} \ No newline at end of file diff --git a/tests/testdata/pile_wikipedia-v1-loglikelihood_rolling b/tests/testdata/pile_wikipedia-v1-loglikelihood_rolling new file mode 100644 index 00000000000..e44bd276280 --- /dev/null +++ b/tests/testdata/pile_wikipedia-v1-loglikelihood_rolling @@ -0,0 +1 @@ +ef9ec0dd408316ca6537228a6812e839f14b30608973081d41efc47c138338da \ No newline at end of file diff --git a/tests/testdata/pile_wikipedia-v1-res.json b/tests/testdata/pile_wikipedia-v1-res.json new file mode 100644 index 00000000000..4f2314e66b3 --- /dev/null +++ b/tests/testdata/pile_wikipedia-v1-res.json @@ -0,0 +1 @@ +{"results": {"pile_wikipedia": {"bits_per_byte": 0.00024287370359008176, "byte_perplexity": 1.0001683613940646, "word_perplexity": 1.001084677949439}}, "versions": {"pile_wikipedia": 1}} \ No newline at end of file diff --git a/tests/testdata/pile_youtubesubtitles-v1-loglikelihood_rolling b/tests/testdata/pile_youtubesubtitles-v1-loglikelihood_rolling new file mode 100644 index 00000000000..81c2e5ed063 --- /dev/null +++ b/tests/testdata/pile_youtubesubtitles-v1-loglikelihood_rolling @@ -0,0 +1 @@ +68263c52adc0086011e2220b619983935cabb1cc1f5f9f8ee1a74ab2a7457967 \ No newline at end of file diff --git a/tests/testdata/pile_youtubesubtitles-v1-res.json b/tests/testdata/pile_youtubesubtitles-v1-res.json new file mode 100644 index 00000000000..fcf2faa8bc7 --- /dev/null +++ b/tests/testdata/pile_youtubesubtitles-v1-res.json @@ -0,0 +1 @@ +{"results": {"pile_youtubesubtitles": {"bits_per_byte": 3.3827117222045906e-05, "byte_perplexity": 1.000023447445816, "word_perplexity": 1.0001529192262875}}, "versions": {"pile_youtubesubtitles": 1}} \ No newline at end of file diff --git a/tests/testdata/wikitext-v1-loglikelihood_rolling 
b/tests/testdata/wikitext-v1-loglikelihood_rolling new file mode 100644 index 00000000000..f09af45a38c --- /dev/null +++ b/tests/testdata/wikitext-v1-loglikelihood_rolling @@ -0,0 +1 @@ +b6f83e6cf7535ee41b0057c3e2ec2cf7f2fa5a9119b305c479a83091d1142b2c \ No newline at end of file diff --git a/tests/testdata/wikitext-v1-res.json b/tests/testdata/wikitext-v1-res.json new file mode 100644 index 00000000000..122098aec22 --- /dev/null +++ b/tests/testdata/wikitext-v1-res.json @@ -0,0 +1 @@ +{"results": {"wikitext": {"bits_per_byte": 3.202519859941674e-05, "byte_perplexity": 1.0000221984224973, "word_perplexity": 1.000118710696617}}, "versions": {"wikitext": 1}} \ No newline at end of file
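
Taken together, the changes above retire the per-task `fewshot_description` hooks in favor of a user-supplied `description_dict` and keyword-argument call sites for the evaluator. The sketch below mirrors the updated call site in `main.py`; it is a minimal sketch, assuming `lm_eval` at this revision is importable, and the model name, task list, and JSON path are illustrative rather than required values.

```python
import json

from lm_eval import evaluator

# Illustrative path; the file is a JSON mapping of task names to description strings.
with open("/your/path/descriptions.json") as f:
    description_dict = json.load(f)

# Keyword arguments mirror what main.py now passes through from the CLI.
results = evaluator.simple_evaluate(
    model="gpt2",                      # illustrative registered model name
    model_args="",
    tasks=["copa", "cycle_letters"],   # note: `tasks`, not the old `task_names`
    num_fewshot=0,
    batch_size=None,
    device=None,
    no_cache=False,
    limit=10,                          # small limit for testing only, as main.py warns
    description_dict=description_dict,
)
print(json.dumps(results, indent=2))
```

The updated call sites in `main.py`, `scripts/cost_estimate.py`, and the tests all pass keyword arguments, which is the convention the new `positional_deprecated` helper in `lm_eval/utils.py` nudges callers toward.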
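At the task level, a description now reaches the prompt through the `description` keyword of `fewshot_context`, as exercised by `scripts/write_out.py` and `tests/test_description_dict.py`. A minimal sketch of that flow, reusing (as an illustrative value) the wording that the removed Copa `fewshot_description` used to return:

```python
import random

import lm_eval.tasks

# Per-task description; any registered task name works the same way.
description_dict = {
    "copa": "Given a premise and one alternative with a causal relation to the "
            "premise and another without, choose the more plausible alternative",
}

task_dict = lm_eval.tasks.get_task_dict(["copa"])
rnd = random.Random(42)

for task_name, task in task_dict.items():
    doc = next(iter(task.validation_docs()))
    description = description_dict.get(task_name, "")
    ctx = task.fewshot_context(
        doc=doc,
        num_fewshot=0,
        rnd=rnd,
        description=description,
    )
    # The description is prepended to the (here zero-shot) prompt context.
    assert description in ctx
```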