
Commit bc38912

Merge pull request #9 from /pull/7
WikiText-103 branch
2 parents 033be86 + f3568c3 commit bc38912


65 files changed: +721 -3605 lines

.gitignore

Lines changed: 2 additions & 0 deletions
@@ -0,0 +1,2 @@
+__pycache__
+*.egg-info

docs/docs/img/language_model.png

252 KB

docs/docs/wikitext103.md

Lines changed: 222 additions & 0 deletions
@@ -0,0 +1,222 @@
# WikiText-103

![An example text of WikiText-103](img/language_model.png)

You can view the WikiText-103 leaderboard [here](https://sotabench.com/benchmarks/language-modelling-on-wikitext-103).

## Getting Started

You'll need the following in the root of your repository:

- `sotabench.py` file - contains the benchmarking logic; the server runs this on each commit
- `requirements.txt` file - Python dependencies to be installed before running `sotabench.py`
- `sotabench_setup.sh` *(optional)* - any advanced dependencies or setup, e.g. compilation

You can write whatever you want in your `sotabench.py` file to get language model predictions on the WikiText-103 dataset.

But you will need to record your results for the server, and you'll want to avoid things like
downloading the dataset on the server. So you should:

- **Point to the server WikiText-103 data path** - popular datasets are pre-downloaded on the server.
- **Include an Evaluation object** in your `sotabench.py` file to record the results.
- **Use Caching** *(optional)* - to speed up evaluation by hashing the first batch of predictions.

We explain how to do each of these steps below.
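
For orientation, here is a minimal sketch (not a definitive implementation) of how these pieces might fit together in `sotabench.py`; the model and tokenizer loading and the evaluation loop are placeholders, and the evaluator arguments are explained in the sections that follow:

``` python
# sotabench.py - minimal sketch; replace the placeholder model/tokenizer loading
# and the evaluation loop with your own code.
from sotabencheval.language_modelling import WikiText103Evaluator

evaluator = WikiText103Evaluator(
    model_name="My Model Name",            # the name shown on the leaderboard
    local_root="path_to_your_local_data",  # used when running outside the server
)

# model = ...      # load your trained language model here
# tokenizer = ...  # load the matching tokenizer here

with evaluator.test_set_path.open() as f:
    test_text = f.read()

# Batch over the test text, compute log probabilities for the target tokens,
# and report each batch with evaluator.add(log_probs, targets).

evaluator.save()
```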
## Server Data Location

The WikiText-103 development data is located in the root of your repository on the server at `.data/nlp/wikitext-103/wikitext-103-v1.zip`.
The archive contains a folder `wikitext-103` with the following files:

- `wiki.train.tokens`
- `wiki.valid.tokens`
- `wiki.test.tokens`

This is the original zip file released [here](https://blog.einstein.ai/the-wikitext-long-term-dependency-language-modeling-dataset/).
We run the benchmark on the `wiki.test.tokens` file.
We provide two helper methods that unpack the dataset for you and give you the `pathlib.Path` to the test file.

The first option, `test_set_path`, is available once you instantiate the `WikiText103Evaluator`:

```python
...

evaluator = WikiText103Evaluator(
    model_name="Transformer-XL Large",
    paper_arxiv_id="1901.02860",
    paper_pwc_id="transformer-xl-attentive-language-models",
    local_root='/content/wikitext-103'
)
# test_set_path is a pathlib.Path and points to wiki.test.tokens
with evaluator.test_set_path.open() as f:
    test_data = torch.tensor(tokenizer.encode(f.read())).to("cuda")
```

The second option, `WikiText103Evaluator.get_test_set_path(local_root)`, is useful if you are evaluating multiple models and need to use the same
dataset multiple times - it returns the path before you initialize a WikiText evaluator:

```python
from sotabencheval.language_modelling import WikiText103Evaluator

test_file_path = WikiText103Evaluator.get_test_set_path('/home/ubuntu/my_data/wiki103')
with test_file_path.open() as f:
    content = f.read()
```
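
The helpers above are the recommended route, but if you want to inspect the archive yourself, a minimal sketch using Python's standard `zipfile` module might look like this (the archive path and member name follow the layout described above and are assumptions, not part of the sotabencheval API):

``` python
import io
import zipfile

# Location of the pre-downloaded archive on the server (see above).
archive_path = ".data/nlp/wikitext-103/wikitext-103-v1.zip"

with zipfile.ZipFile(archive_path) as zf:
    # The archive contains a `wikitext-103` folder with the three token files.
    with zf.open("wikitext-103/wiki.test.tokens") as member:
        test_text = io.TextIOWrapper(member, encoding="utf-8").read()
```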
## How Do I Initialize an Evaluator?

Add this to your code - before you start batching over the dataset and making predictions:

``` python
from sotabencheval.language_modelling import WikiText103Evaluator

evaluator = WikiText103Evaluator(model_name='Model name as found on the paperswithcode website')
```

If you are reproducing a model from a paper, then you can enter the arXiv ID. If you
put in the same model name string as on the
[WikiText-103](https://sotabench.com/benchmarks/language-modelling-on-wikitext-103) leaderboard
then you will enable direct comparison with the paper's model. If the arXiv ID is not available, you
can use the `paperswithcode.com` ID. Below is an example of an evaluator that matches `Transformer-XL`:

``` python
from sotabencheval.language_modelling import WikiText103Evaluator

evaluator = WikiText103Evaluator(
    model_name="Transformer-XL Large",
    paper_arxiv_id="1901.02860",
    paper_pwc_id="transformer-xl-attentive-language-models",
    local_root="path_to_your_data",
)
```

The above will directly compare with the result of the paper when run on the server.

## How Do I Evaluate Predictions?

The evaluator object has an `.add(log_probs, targets)` method for submitting predictions, by batch or in full.
We expect you to give us the log probabilities of a batch of target tokens and the `targets` tokens themselves.
The `log_probs` can be either:

- a 0d "tensor" (`np.ndarray`/`torch.tensor`) - the summed log probability of all `targets` tokens
- a 2d "tensor" (`np.ndarray`/`torch.tensor`) - the log probability of each target token; `log_probs.shape` should match `targets.shape`
- a 3d "tensor" (`np.ndarray`/`torch.tensor`) - the distribution of log probabilities for each position in the sequence; we will gather the probabilities of the target tokens for you

We recommend the second or third option, as it allows us to check your perplexity calculations (see the sketch below).
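
To illustrate the accepted shapes, here is a small sketch assuming a PyTorch model whose `logits` have shape `(batch, seq_len, vocab_size)` and `targets` of shape `(batch, seq_len)`; in practice you would call `evaluator.add` with just one of these forms per batch:

``` python
import torch.nn.functional as F

# Assumed variables: `logits` (batch, seq_len, vocab_size), `targets` (batch, seq_len),
# and an initialized `evaluator` (WikiText103Evaluator).
log_probs = F.log_softmax(logits, dim=-1)

# 3d: pass the full log-probability distribution; the evaluator gathers
# the target token probabilities for you.
evaluator.add(log_probs, targets)

# 2d: pass the log probability of each target token (same shape as `targets`).
target_log_probs = log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
evaluator.add(target_log_probs, targets)

# 0d: pass a single scalar - the summed log probability of all target tokens.
evaluator.add(target_log_probs.sum(), targets)
```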
If your model uses subword tokenization, you don't need to convert subwords to full words. You are free to report the probability of each subword: we will adjust the perplexity normalization accordingly. Just make sure to set `subword_tokenization=True` in your evaluator.

Here is an example of how to report results (using PyTorch):

``` python
evaluator = WikiText103Evaluator(
    model_name='GPT-2 Small',
    paper_pwc_id="language-models-are-unsupervised-multitask",
    local_root="path_to_your_data",
    subword_tokenization=True
)

# run your data preprocessing; for GPT-2 the preprocessing removes Moses artifacts
with torch.no_grad():
    model.eval()
    for input, target in data_loader:
        output = model(input)
        log_probs = torch.nn.functional.log_softmax(output, dim=-1)
        target_log_probs = log_probs.gather(-1, target.unsqueeze(-1)).squeeze(-1)
        evaluator.add(target_log_probs, target)
```
When you are done, you can get the results locally by running:

``` python
evaluator.get_results()
```

But for the server you want to save the results by running:

``` python
evaluator.save()
```

This method serialises the results and model metadata and stores them in the server database.

## How Do I Cache Evaluation?

Sotabench reruns your script on every commit. This is good because it acts like
continuous integration in checking for bugs and changes, but can be annoying
if the model hasn't changed and evaluation is lengthy.

Fortunately, sotabencheval has caching logic that you can use.

The idea is that after the first batch, we hash the model outputs and the
current metrics, which tells us whether the model is the same for the given dataset.
You can include hashing within an evaluation loop as follows (here for a
PyTorch repository):

``` python
with torch.no_grad():
    for input, target in data_loader:
        # ...
        output = model(input)
        log_probs = torch.nn.functional.log_softmax(output, dim=-1)
        evaluator.add(log_probs, target)

        if evaluator.cache_exists:
            break

evaluator.save()
```

If the hash is the same as on the server, we infer that the model hasn't changed, so
we simply return the cached results rather than running the whole evaluation again.

Caching is very useful if you have large models, or a repository that evaluates
multiple models, as it speeds up evaluation significantly.
## A full sotabench.py example

Below we show an implementation for a model from `huggingface/transformers`. It
incorporates all the features explained above: (a) using the server data,
(b) using the WikiText-103 evaluator, and (c) caching the evaluation:

``` python
import torch
from tqdm import tqdm
from sotabencheval.language_modelling import WikiText103Evaluator

model = torch.hub.load('huggingface/transformers', 'modelWithLMHead', 'transfo-xl-wt103').to("cuda")
tokenizer = torch.hub.load('huggingface/transformers', 'tokenizer', 'transfo-xl-wt103')

evaluator = WikiText103Evaluator(
    model_name="Transformer-XL Large",
    paper_arxiv_id="1901.02860",
    paper_pwc_id="transformer-xl-attentive-language-models",
    local_root='/content/wikitext-103'
)

with evaluator.test_set_path.open() as f:
    test_data = torch.tensor(tokenizer.encode(f.read()))

seq_len = 128
with torch.no_grad():
    evaluator.reset_timer()
    model.eval()
    X, Y, mems = test_data[None, :-1], test_data[None, 1:], None
    for s in tqdm(range(0, X.shape[-1], seq_len)):
        x, y = X[..., s:s+seq_len].to("cuda"), Y[..., s:s+seq_len].to("cuda")
        log_probs, mems, *_ = model(input_ids=x, mems=mems)
        evaluator.add(log_probs, y)
        if evaluator.cache_exists:
            break

evaluator.save()
evaluator.print_results()
```

You can run this example on [Google Colab](https://colab.research.google.com/drive/1Qcp1_Fgo_aMtSgf_PV1gFw1DT6hEv7fW).

## Need More Help?

Head on over to the [Natural Language Processing](https://forum.sotabench.com/c/natural-language-processing) section of the sotabench forums if you have any questions or difficulties.
