
Commit 235d0da

update docs

1 parent d502b48 commit 235d0da

File tree

4 files changed: +31 -32 lines changed

docs/docs/wikitext103.md

Lines changed: 21 additions & 16 deletions
@@ -36,9 +36,11 @@ It is the original zip file released [here](https://blog.einstein.ai/the-wikitex
 We are running the benchmark on the `wiki.test.tokens` dataset.
 We have two helper methods that will unpack the dataset for you and give you the `pathlib.Path` to the test file.
 
-First one `test_set_path` is available once you instantiate the WikiText103Evaluator
+The first option `test_set_path` is available once you instantiate the `WikiText103Evaluator`:
 
 ```python
+...
+
 evaluator = WikiText103Evaluator(
     model_name="Transformer-XL Large",
     paper_arxiv_id="1901.02860",
@@ -50,7 +52,10 @@ with evaluator.test_set_path.open() as f:
     test_data = torch.tensor(tokenizer.encode(f.read())).to("cuda")
 ```
 
-Second option `WikiText103Evaluator.get_test_set_path(local_root)` is there if you need path to the files before you get your first instance of WikiText evaluator, for example if you are going to reuse the data for multiple models.
+There is a second option available if you are evaluating multiple models and need to use the same
+dataset multiple times - `WikiText103Evaluator.get_test_set_path(local_root)`. This will get the path before
+you initialize a WikiText evaluator:
+
 ```python
 from sotabencheval.language_modelling import WikiText103Evaluator
 
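Taken together, the two helpers look roughly like this in practice. A minimal sketch, assuming the `WikiText103Evaluator` API described above; the model list and the tokenization step are placeholders, not part of the library:

```python
from sotabencheval.language_modelling import WikiText103Evaluator

# Fetch the test-set path once, before any evaluator exists, so the same
# data can be reused across several models (the second option above).
test_path = WikiText103Evaluator.get_test_set_path('.')
raw_text = test_path.read_text()

for model_name, arxiv_id in [("Transformer-XL Large", "1901.02860")]:
    evaluator = WikiText103Evaluator(
        model_name=model_name,
        paper_arxiv_id=arxiv_id,
    )
    # `evaluator.test_set_path` (the first option) points at the same file.
    # ... tokenize `raw_text`, run the model, and report results here ...
```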
@@ -72,9 +77,8 @@ evaluator = WikiText103Evaluator(model_name='Model name as found in paperswithco
 If you are reproducing a model from a paper, then you can enter the arXiv ID. If you
 put in the same model name string as on the
 [Wikitext-103](https://sotabench.com/benchmarks/language-modelling-on-wikitext-103) leaderboard
-then you will enable direct comparison with the paper's model.
-If the `arxiv` is not available you can use `paperswithcode.com` id.
-Below is an example of an evaluator that matches `Transformer XL`:
+then you will enable direct comparison with the paper's model. If the `arxiv_id` is not available you
+can use the `paperswithcode.com` id. Below is an example of an evaluator that matches `Transformer XL`:
 
 ``` python
 from sotabencheval.language_modelling import WikiText103Evaluator
@@ -91,18 +95,19 @@ The above will directly compare with the result of the paper when run on the ser
 
 ## How Do I Evaluate Predictions?
 
-The evaluator object has an `.add(log_probs:tensor, targets:tensor)` method to submit predictions by batch or in full.
+The evaluator object has an `.add(log_probs, targets)` method to submit predictions by batch or in full.
 We expect you to give us the log probability of a batch of target tokens and the `target` tokens themselves.
 The `log_probs` can be either:
-- a 0d tensor - summed log probability of all `targets` tokens, or
-- a 2d tensor - log probabilities of each target token, the `log_probs.shape` have to match `targets.shape`
-- a 3d tensor - distribution of log probabilities for each position in the sequence, we will gather the probabilities of target tokens for you.
-It is recommended to use third or second option as it give use a way to check your perplexity calculations.
 
-If your model use subword tokenization you don't need convert subwords to full words.
-You are free to report probability of each subwords, we will adjust the perplexity normalization for you, but make sure to set `subword_tokenization=True` in your evaluator.
+- a 0d "tensor" (`np.ndarray`/`torch.tensor`) - summed log probability of all `targets` tokens
+- a 2d "tensor" (`np.ndarray`/`torch.tensor`) - log probabilities of each target token, the `log_probs.shape` should match `targets.shape`
+- a 3d "tensor" (`np.ndarray`/`torch.tensor`) - distribution of log probabilities for each position in the sequence, we will gather the probabilities of target tokens for you.
+
+It is recommended to use the second or third option as it allows us to check your perplexity calculations.
+
+If your model uses subword tokenization you don't need to convert subwords to full words. You are free to report the probability of each subword: we will adjust the perplexity normalization accordingly. Just make sure to set `subword_tokenization=True` in your evaluator.
 
-Here is an example how to report results (for a PyTorch example):
+Here is an example of how to report results (using PyTorch):
 
 ``` python
 
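As a companion to the list above, a rough sketch of the three accepted `log_probs` shapes. This assumes a `WikiText103Evaluator` instance named `evaluator` as constructed earlier; the random logits stand in for a real model's output:

```python
import torch
import torch.nn.functional as F

batch, seq_len, vocab = 2, 8, 100            # toy sizes, not the real vocab
logits = torch.randn(batch, seq_len, vocab)  # stand-in for model output
targets = torch.randint(vocab, (batch, seq_len))

# 3d: full log-probability distribution per position; the evaluator
# gathers the target-token probabilities for you.
log_probs = F.log_softmax(logits, dim=-1)
evaluator.add(log_probs, targets)

# 2d: gather the per-token log probabilities yourself;
# the shape matches `targets.shape`.
token_log_probs = log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
evaluator.add(token_log_probs, targets)

# 0d: a single summed log probability over all target tokens.
evaluator.add(token_log_probs.sum(), targets)
```

In real use you would pick one of the three forms per batch, not submit all three.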
@@ -175,7 +180,7 @@ multiple models, as it speeds up evaluation significantly.
 
 Below we show an implementation for a model from the `huggingface/transformers`. This
 incorporates all the features explained above: (a) using the server data,
-(b) using the WikiText103 Evaluator, and (c) caching the evaluation logic:
+(b) using the WikiText-103 Evaluator, and (c) caching the evaluation logic:
 
 ``` python
 import torch
@@ -210,8 +215,8 @@ evaluator.save()
 evaluator.print_results()
 ```
 
-You can run this example on google [colab](https://colab.research.google.com/drive/1Qcp1_Fgo_aMtSgf_PV1gFw1DT6hEv7fW).
+You can run this example on [Google Colab](https://colab.research.google.com/drive/1Qcp1_Fgo_aMtSgf_PV1gFw1DT6hEv7fW).
 
 ## Need More Help?
 
-Head on over to the [Natural Language Processing](https://forum.sotabench.com/c/nlp) section of the sotabench forums if you have any questions or difficulties.
+Head on over to the [Natural Language Processing](https://forum.sotabench.com/c/natural-language-processing) section of the sotabench forums if you have any questions or difficulties.
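The caching mentioned in (c) above follows the usual sotabench pattern: after the first batch, the evaluator can check whether a run with the same batch hash is already cached on the server and stop early. A minimal sketch, assuming the `cache_exists` flag and `reset()` method behave here as they do in the other sotabencheval evaluators, with the model wiring elided:

```python
from sotabencheval.language_modelling import WikiText103Evaluator

evaluator = WikiText103Evaluator(
    model_name="Transformer-XL Large",
    paper_arxiv_id="1901.02860",
)

# ... load the model and tokenizer, batch the test set into `batches` ...

evaluator.reset()
for log_probs, targets in batches:  # `batches` is a placeholder
    evaluator.add(log_probs, targets)
    if evaluator.cache_exists:
        # The first batch matched a cached run on the server, so the
        # remaining results are filled in without re-running the model.
        break
evaluator.save()
evaluator.print_results()
```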

sotabencheval/language_modelling/wikitext.py

Lines changed: 4 additions & 7 deletions
@@ -1,19 +1,15 @@
-import os
 import time
-from itertools import islice
 from enum import Enum
 from pathlib import Path
 
 import numpy as np
 
-from sotabenchapi.check import in_check_mode
-from sotabenchapi.client import Client
-from sotabenchapi.core import BenchmarkResult, check_inputs
 from sotabencheval.core import BaseEvaluator
 from sotabencheval.utils import calculate_batch_hash, extract_archive, change_root_if_server, is_server, get_max_memory_allocated
 
+
 class WikiTextDataset(Enum):
-    """Enum used to select dataset on which evaluation is executed. """
+    """Enum used to select the dataset on which evaluation is executed. """
     WikiText103 = ('WikiText-103', 245569, 267735)
     WikiText2 = ('WikiText-2', 245569, 33278)
 
@@ -86,7 +82,7 @@ def _gather_probs(log_probs, targets):
 
 class WikiTextEvaluator(BaseEvaluator):
     task = "Language Modelling"
-    dataset = None # defined in a subclass
+    dataset = None  # defined in a subclass
 
     def __init__(self,
                  local_root: str = '.',
@@ -290,6 +286,7 @@ class WikiText103Evaluator(WikiTextEvaluator):
     """
     dataset = WikiTextDataset.WikiText103
 
+
 class WikiText2Evaluator(WikiTextEvaluator):
     """`WikiText103 <https://sotabench.com/benchmarks/language-modelling-on-wikitext-2>`_ benchmark.
 
sotabencheval/natural_language_inference/multinli.py

Lines changed: 5 additions & 8 deletions
@@ -1,23 +1,18 @@
-import os
 import csv
 import time
 
-from itertools import islice, zip_longest
-from enum import Enum
+from itertools import zip_longest
 from pathlib import Path
 
-import numpy as np
-
-from sotabenchapi.check import in_check_mode
-from sotabenchapi.client import Client
-from sotabenchapi.core import BenchmarkResult, check_inputs
 from sotabencheval.core import BaseEvaluator
 from sotabencheval.utils import calculate_batch_hash, extract_archive, change_root_if_server, is_server, get_max_memory_allocated
 
+
 def read_csv(path):
     with path.open('r') as f:
         yield from csv.DictReader(f, delimiter='\t')
 
+
 def get_path(local_root, local_unzip=False):
     root = Path(change_root_if_server(root=local_root,
                                       server_root=".data/nlp/multinli"))
@@ -27,6 +22,7 @@ def get_path(local_root, local_unzip=False):
         extract_archive(str(root / zip_name), to_path=root)
     return (dataset_path, dataset_path.parent / "dev_mismatched.tsv")
 
+
 class ClassificationEvaluator:
     def __init__(self, file_path):
         self.dataset_path = file_path
@@ -63,6 +59,7 @@ def accuracy(self):
             return (accuracy, f"partial on {self.count} out of {len(self.targets)}")
         return accuracy
 
+
 class MultiNLI(BaseEvaluator):
     task = "Natural Language Inference"
     dataset = 'MultiNLI' # defined in subclass

sotabencheval/version.py

Lines changed: 1 addition & 1 deletion
@@ -15,6 +15,6 @@ def __repr__(self):
             f"build={self.build})"
         )
 
-version = Version(0, 0, 35)
+version = Version(0, 0, 36)
 
 __version__ = str(version)
