# WikiText-103

You can view the WikiText-103 leaderboard [here](https://sotabench.com/benchmarks/language-modelling-on-wikitext-103).

## Getting Started

You'll need the following in the root of your repository (a minimal `sotabench.py` sketch follows this list):

- `sotabench.py` file - contains the benchmarking logic; the server will run this on each commit
- `requirements.txt` file - Python dependencies to be installed before running `sotabench.py`
- `sotabench_setup.sh` *(optional)* - any advanced dependencies or setup, e.g. compilation
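
Here is a rough sketch of how such a minimal `sotabench.py` could be structured (the model name and data path are placeholders and the middle is elided; each step is covered in the sections below):

``` python
from sotabencheval.language_modelling import WikiText103Evaluator

# placeholder model name and data path - see the sections below
evaluator = WikiText103Evaluator(
    model_name="My Language Model",
    local_root="path_to_your_data",
)

# ... load your model, read the test set from evaluator.test_set_path,
# ... and call evaluator.add(log_probs, targets) for each batch

evaluator.save()
```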

You can write whatever you want in your `sotabench.py` file to get language model predictions on the WikiText-103 dataset.

But you will need to record your results for the server, and you'll want to avoid doing things like
downloading the dataset on the server. So you should:

- **Point to the server WikiText-103 data path** - popular datasets are pre-downloaded on the server.
- **Include an Evaluation object** in your `sotabench.py` file to record the results.
- **Use Caching** *(optional)* - to speed up evaluation by hashing the first batch of predictions.

We explain each of these steps below.

## Server Data Location

The WikiText-103 data is located on the server at `.data/nlp/wikitext-103/wikitext-103-v1.zip`, relative to the root of your repository.
The archive contains a folder `wikitext-103` with the following files:

- `wiki.train.tokens`
- `wiki.valid.tokens`
- `wiki.test.tokens`

It is the original zip file released [here](https://blog.einstein.ai/the-wikitext-long-term-dependency-language-modeling-dataset/).
We run the benchmark on `wiki.test.tokens`.
We provide two helper methods that unpack the dataset for you and give you the `pathlib.Path` to the test file.

The first option, `test_set_path`, is available once you instantiate the `WikiText103Evaluator`:

```python
...

evaluator = WikiText103Evaluator(
    model_name="Transformer-XL Large",
    paper_arxiv_id="1901.02860",
    paper_pwc_id="transformer-xl-attentive-language-models",
    local_root='/content/wikitext-103'
)
# test_set_path is a pathlib.Path pointing to wiki.test.tokens
with evaluator.test_set_path.open() as f:
    test_data = torch.tensor(tokenizer.encode(f.read())).to("cuda")
```

There is a second option, `WikiText103Evaluator.get_test_set_path(local_root)`, if you are evaluating multiple
models and need to use the same dataset multiple times. It gives you the path before
you initialize a WikiText evaluator:

```python
from sotabencheval.language_modelling import WikiText103Evaluator

test_file_path = WikiText103Evaluator.get_test_set_path('/home/ubuntu/my_data/wiki103')
with test_file_path.open() as f:
    content = f.read()
```
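
For instance, if a single `sotabench.py` evaluates several models, you could read the test file once and reuse it for each evaluator (the model names below are placeholders):

```python
from sotabencheval.language_modelling import WikiText103Evaluator

DATA_ROOT = '/home/ubuntu/my_data/wiki103'

# read the test set once...
test_file_path = WikiText103Evaluator.get_test_set_path(DATA_ROOT)
with test_file_path.open() as f:
    content = f.read()

# ...then evaluate each model against the same text
for model_name in ["My Model (base)", "My Model (large)"]:
    evaluator = WikiText103Evaluator(model_name=model_name, local_root=DATA_ROOT)
    # tokenize `content`, run the model, call evaluator.add(log_probs, targets)
    evaluator.save()
```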

## How Do I Initialize an Evaluator?

Add this to your code - before you start batching over the dataset and making predictions:

``` python
from sotabencheval.language_modelling import WikiText103Evaluator

evaluator = WikiText103Evaluator(model_name='Model name as found on the paperswithcode.com website')
```

If you are reproducing a model from a paper, then you can enter the arXiv ID. If you
put in the same model name string as on the
[WikiText-103](https://sotabench.com/benchmarks/language-modelling-on-wikitext-103) leaderboard
then you will enable direct comparison with the paper's model. If an arXiv ID is not available, you
can use the `paperswithcode.com` ID instead. Below is an example of an evaluator that matches `Transformer-XL`:

``` python
from sotabencheval.language_modelling import WikiText103Evaluator

evaluator = WikiText103Evaluator(
    model_name="Transformer-XL Large",
    paper_arxiv_id="1901.02860",
    paper_pwc_id="transformer-xl-attentive-language-models",
    local_root="path_to_your_data",
)
```

When run on the server, the above will be compared directly with the paper's result.

## How Do I Evaluate Predictions?

The evaluator object has an `.add(log_probs, targets)` method for submitting predictions, either batch by batch or all at once.
We expect the log probabilities of a batch of target tokens, along with the `targets` tokens themselves.
The `log_probs` argument can be either:

- a 0d tensor (`np.ndarray`/`torch.Tensor`) - the summed log probability of all `targets` tokens
- a 2d tensor (`np.ndarray`/`torch.Tensor`) - the log probability of each target token; `log_probs.shape` should match `targets.shape`
- a 3d tensor (`np.ndarray`/`torch.Tensor`) - the full distribution of log probabilities for each position in the sequence; we will gather the probabilities of the target tokens for you

We recommend the second or third option, as it allows us to check your perplexity calculations.
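
As a rough sketch of the accepted shapes (the values below are illustrative only, and the alternative `evaluator.add` calls are commented out):

``` python
import torch

targets = torch.tensor([[15, 7, 42, 3]])                    # (batch, seq_len) target token ids
vocab_log_probs = torch.randn(1, 4, 1000).log_softmax(-1)   # 3d: (batch, seq_len, vocab_size)

# 3d option: pass the full distribution; the target gathering is done for you
# evaluator.add(vocab_log_probs, targets)

# 2d option: gather the log probability of each target token yourself
token_log_probs = vocab_log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
# evaluator.add(token_log_probs, targets)

# 0d option: a single summed log probability of all targets
# evaluator.add(token_log_probs.sum(), targets)
```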

If your model uses subword tokenization you don't need to convert subwords to full words. You are free to report the probability of each subword: we will adjust the perplexity normalization accordingly. Just make sure to set `subword_tokenization=True` in your evaluator.

Here is an example of how to report results (using PyTorch):

``` python
import torch
import torch.nn.functional as F
from sotabencheval.language_modelling import WikiText103Evaluator

evaluator = WikiText103Evaluator(
    model_name='GPT-2 Small',
    paper_pwc_id="language-models-are-unsupervised-multitask",
    local_root="path_to_your_data",
    subword_tokenization=True
)

# run your data preprocessing; in the case of GPT-2 the preprocessing removes Moses artifacts
with torch.no_grad():
    model.eval()
    for input, target in data_loader:
        output = model(input)
        log_probs = F.log_softmax(output, dim=-1)
        # gather the log probability of each target token
        target_log_probs = log_probs.gather(-1, target.unsqueeze(-1)).squeeze(-1)
        evaluator.add(target_log_probs, target)
```

When you are done, you can get the results locally by running:

``` python
evaluator.get_results()
```

But for the server you want to save the results by running:

``` python
evaluator.save()
```

This method serialises the results and model metadata and stores them in the server database.
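
In practice the two calls are often combined at the end of the script, as in the full example further down:

``` python
evaluator.save()           # record the results for the server
evaluator.print_results()  # optionally also print the metrics locally
```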

## How Do I Cache Evaluation?

Sotabench reruns your script on every commit. This is good because it acts like
continuous integration in checking for bugs and changes, but can be annoying
if the model hasn't changed and evaluation is lengthy.

Fortunately, sotabencheval has caching logic that you can use.

The idea is that after the first batch, we hash the model outputs together with the
current metrics, and this tells us whether the model is the same for the given dataset.
You can include hashing within an evaluation loop as follows (in this
example, for a PyTorch repository):

``` python
with torch.no_grad():
    for input, target in data_loader:
        # ...
        output = model(input)
        log_probs = ...  # compute (or gather) the log probabilities for this batch
        evaluator.add(log_probs, target)

        if evaluator.cache_exists:
            break

evaluator.save()
```
| 171 | + |
| 172 | +If the hash is the same as in the server, we infer that the model hasn't changed, so |
| 173 | +we simply return hashed results rather than running the whole evaluation again. |
| 174 | + |
| 175 | +Caching is very useful if you have large models, or a repository that is evaluating |
| 176 | +multiple models, as it speeds up evaluation significantly. |
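
As a sketch of that multi-model case (`models_to_evaluate`, `build_model` and `compute_log_probs` are hypothetical stand-ins for your own code), each evaluator can stop its loop as soon as its cached result is found:

``` python
import torch
from sotabencheval.language_modelling import WikiText103Evaluator

for model_name, build_model in models_to_evaluate:  # hypothetical registry of models
    evaluator = WikiText103Evaluator(model_name=model_name, local_root="path_to_your_data")
    model = build_model().eval()
    with torch.no_grad():
        for input, target in data_loader:
            evaluator.add(compute_log_probs(model, input), target)
            if evaluator.cache_exists:
                break  # results for this model are already cached on the server
    evaluator.save()
```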

## A Full `sotabench.py` Example

Below we show an implementation for a model from `huggingface/transformers`. It
incorporates all the features explained above: (a) using the server data,
(b) using the WikiText103Evaluator, and (c) caching the evaluation:

``` python
import torch
from tqdm import tqdm
from sotabencheval.language_modelling import WikiText103Evaluator

model = torch.hub.load('huggingface/transformers', 'modelWithLMHead', 'transfo-xl-wt103').to("cuda")
tokenizer = torch.hub.load('huggingface/transformers', 'tokenizer', 'transfo-xl-wt103')

evaluator = WikiText103Evaluator(
    model_name="Transformer-XL Large",
    paper_arxiv_id="1901.02860",
    paper_pwc_id="transformer-xl-attentive-language-models",
    local_root='/content/wikitext-103'
)

# read and tokenize the full test set
with evaluator.test_set_path.open() as f:
    test_data = torch.tensor(tokenizer.encode(f.read()))

seq_len = 128
with torch.no_grad():
    evaluator.reset_timer()
    model.eval()
    # predict each token from its predecessors, in fixed-length chunks, carrying the memory state
    X, Y, mems = test_data[None, :-1], test_data[None, 1:], None
    for s in tqdm(range(0, X.shape[-1], seq_len)):
        x, y = X[..., s:s+seq_len].to("cuda"), Y[..., s:s+seq_len].to("cuda")
        log_probs, mems, *_ = model(input_ids=x, mems=mems)
        evaluator.add(log_probs, y)
        if evaluator.cache_exists:
            break
evaluator.save()
evaluator.print_results()
```

You can run this example on [Google Colab](https://colab.research.google.com/drive/1Qcp1_Fgo_aMtSgf_PV1gFw1DT6hEv7fW).

## Need More Help?

Head on over to the [Natural Language Processing](https://forum.sotabench.com/c/natural-language-processing) section of the sotabench forums if you have any questions or difficulties.