
Commit 033be86

Document SQuADEvaluator
1 parent 4a82a3f commit 033be86

6 files changed: +234 additions, -35 deletions

docs/docs/squad.md

Lines changed: 69 additions & 24 deletions
@@ -18,35 +18,12 @@ You can write whatever you want in your `sotabench.py` file to get model predictions
 But you will need to record your results for the server, and you'll want to avoid doing things like
 downloading the dataset on the server. So you should:
 
-- **Point to the server SQuAD data path** - popular datasets are pre-downloaded on the server.
 - **Include an Evaluation object** in `sotabench.py` file to record the results.
+- **Point to the server SQuAD data path** - popular datasets are pre-downloaded on the server.
 - **Use Caching** *(optional)* - to speed up evaluation by hashing the first batch of predictions.
 
 We explain how to do these various steps below.
 
-## Server Data Location
-
-The SQuAD development data is located in the root of your repository on the server at `.data/nlp/squad`.
-In this folder is contained:
-
-- `dev-v1.1.json` - containing SQuAD v1.1 development dataset
-- `dev-v2.0.json` - containing SQuAD v2.0 development dataset
-
-Your local files may have a different file directory structure, so you
-can use control flow like below to change the data path if the script is being
-run on sotabench servers:
-
-``` python
-from sotabencheval.utils import is_server
-
-if is_server():
-    DATA_ROOT = '.data/nlp/squad'
-else: # local settings
-    DATA_ROOT = '/home/ubuntu/my_data/'
-```
-
-This will detect if `sotabench.py` is being run on the server and change behaviour accordingly.
-
 ## How Do I Initialize an Evaluator?
 
 Add this to your code - before you start batching over the dataset and making predictions:
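(The initialization snippet itself lives in unchanged lines that this diff does not show; as a rough sketch, pieced together from the hunk context below and the parameter examples documented in `squad.py` later in this commit, it looks roughly like the following - the exact arguments in the repository may differ.)

``` python
from sotabencheval.question_answering import SQuADEvaluator, SQuADVersion

# Sketch only: the arXiv id and Papers With Code slug are the examples given in
# the SQuADEvaluator docstring, not necessarily the values in the real file.
evaluator = SQuADEvaluator(model_name='SpanBERT',
                           paper_arxiv_id='1907.10529',
                           paper_pwc_id='spanbert-improving-pre-training-by',
                           version=SQuADVersion.V20)
```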
@@ -76,6 +53,28 @@ evaluator = SQuADEvaluator(model_name='SpanBERT',
 
 The above will directly compare with the result of the paper when run on the server.
 
+## Server Data Location
+
+The SQuAD development data is located in the root of your repository on the server at `.data/nlp/squad`.
+This folder contains:
+
+- `dev-v1.1.json` - the SQuAD v1.1 development dataset
+- `dev-v2.0.json` - the SQuAD v2.0 development dataset
+
+You can use `evaluator.dataset_path: Path` to get the path to the dataset JSON file.
+In the example above it resolves to `.data/nlp/squad/dev-v2.0.json` on
+the sotabench server and `./dev-v2.0.json` when run locally.
+If you want to use a non-standard file name or location when running locally,
+you can override the defaults like this:
+
+``` python
+evaluator = SQuADEvaluator(
+    ...,
+    local_root='mydatasets',
+    dataset_filename='data.json'
+)
+```
+
 ## How Do I Evaluate Predictions?
 
 The evaluator object has an `.add(answers: Dict[str, str])` method to submit predictions by batch or in full.
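(For illustration, a minimal sketch of a single call; the question ids below are the ones used in the `SQuADEvaluator.add` docstring added by this commit, and an empty string marks an unanswerable SQuAD v2.0 question.)

``` python
evaluator.add({
    "57296d571d04691400779413": "itself",  # answerable question: submit the answer text
    "5a89117e19b91f001a626f2d": ""         # unanswerable (v2.0): submit an empty string
})
```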
@@ -153,6 +152,52 @@ we simply return hashed results rather than running the whole evaluation again.
 Caching is very useful if you have large models, or a repository that is evaluating
 multiple models, as it speeds up evaluation significantly.
 
+## A Full sotabench.py Example
+
+Below we show an implementation for a model from the AllenNLP repository. This
+incorporates all the features explained above: (a) using the SQuAD Evaluator,
+(b) using a custom dataset location when run locally, and (c) the evaluation caching logic.
+
+``` python
+from sotabencheval.question_answering import SQuADEvaluator, SQuADVersion
+
+from allennlp.data import DatasetReader
+from allennlp.data.iterators import DataIterator
+from allennlp.models.archival import load_archive
+from allennlp.nn.util import move_to_device
+
+def load_model(url, batch_size=64):
+    archive = load_archive(url, cuda_device=0)
+    model = archive.model
+    reader = DatasetReader.from_params(archive.config["dataset_reader"])
+    iterator_params = archive.config["iterator"]
+    iterator_params["batch_size"] = batch_size
+    data_iterator = DataIterator.from_params(iterator_params)
+    data_iterator.index_with(model.vocab)
+    return model, reader, data_iterator
+
+def evaluate(model, dataset, data_iterator, evaluator):
+    model.eval()
+    evaluator.reset_time()
+    for batch in data_iterator(dataset, num_epochs=1, shuffle=False):
+        batch = move_to_device(batch, 0)
+        predictions = model(**batch)
+        answers = {metadata['id']: prediction
+                   for metadata, prediction in zip(batch['metadata'], predictions['best_span_str'])}
+        evaluator.add(answers)
+        if evaluator.cache_exists:
+            break
+
+evaluator = SQuADEvaluator(local_root="data/nlp/squad", model_name="BiDAF (single)",
+                           paper_arxiv_id="1611.01603", version=SQuADVersion.V11)
+
+model, reader, data_iter = \
+    load_model("https://allennlp.s3.amazonaws.com/models/bidaf-model-2017.09.15-charpad.tar.gz")
+dataset = reader.read(evaluator.dataset_path)
+evaluate(model, dataset, data_iter, evaluator)
+evaluator.save()
+print(evaluator.results)
+```
 
 ## Need More Help?
 
docs/docs/wmt.md

Lines changed: 1 addition & 1 deletion
@@ -209,7 +209,7 @@ evaluator = WMTEvaluator(
 model = torch.hub.load('pytorch/fairseq', 'transformer.wmt19.en-de.single_model',
                        force_reload=True, tokenizer='moses', bpe='fastbpe').cuda()
 
-for sid, text in tqdm(evaluator.metrics.source_segments.items()):
+for sid, text in tqdm(evaluator.source_segments.items()):
     translated = model.translate(text)
     evaluator.add({sid: translated})
     if evaluator.cache_exists:
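(This commit also adds a `source_documents` property to `WMTEvaluator` - see `wmt.py` below. A document-level version of the loop above, adapted from that property's docstring example to the fairseq model used here, would look roughly like this sketch:)

``` python
for document in evaluator.source_documents:
    for segment in document.segments:
        translated = model.translate(segment.text)
        evaluator.add({segment.id: translated})
```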

sotabencheval/core/evaluator.py

Lines changed: 1 addition & 1 deletion
@@ -58,7 +58,7 @@ def __init__(self,
                  paper_results: dict = None,
                  model_description=None,):
         """
-        Initializes an BaseEvaluator like object
+        Initializes a BaseEvaluator-like object
 
         :param model_name: (str) The name of the model, for example 'ResNet-101', which will be saved to sotabench.com
         :param paper_arxiv_id: (str, optional) The paper that the model is linked to, e.g. '1906.06423'

sotabencheval/machine_translation/wmt.py

Lines changed: 46 additions & 8 deletions
@@ -38,7 +38,7 @@ class WMTEvaluator(BaseEvaluator):
             model = torch.hub.load('pytorch/fairseq', 'transformer.wmt19.en-de.single_model',
                                    force_reload=True, tokenizer='moses', bpe='fastbpe').cuda()
 
-            for sid, text in tqdm(evaluator.metrics.source_segments.items()):
+            for sid, text in tqdm(evaluator.source_segments.items()):
                 translated = model.translate(text)
                 evaluator.add({sid: translated})
                 if evaluator.cache_exists:
@@ -87,9 +87,9 @@ def __init__(self,
             Ignored when run on sotabench server.
         :param model_name: The name of the model from the
            paper - if you want to link your build to a model from a
-           machine learning paper. See the WMT benchmarks page for model names,
+           machine learning paper. See the WMT benchmarks pages for model names,
            (f.e., https://sotabench.com/benchmarks/machine-translation-on-wmt2014-english-german)
-           on the paper leaderboard or models yet to try tab.
+           on the paper leaderboard or models yet to try tabs.
         :param paper_arxiv_id: Optional linking to arXiv if you
            want to link to papers on the leaderboard; put in the
            corresponding paper's arXiv ID, e.g. '1907.06616'.
@@ -105,7 +105,7 @@ def __init__(self,
 
            Ensure that the metric names match those on the sotabench
            leaderboard - for WMT benchmarks it should be `SacreBLEU` for de-tokenized
-           mix-cased BLEU score and `BLEU score` for tokenized BLEU.
+           case-sensitive BLEU score and `BLEU score` for tokenized BLEU.
         :param model_description: Optional model description.
         :param tokenization: An optional tokenization function to compute tokenized BLEU score.
            It takes a single string - a segment to tokenize, and returns a string with tokens
@@ -178,7 +178,7 @@ def add(self, answers: Dict[str, str]):
                    'bbc.381790#3': 'Sie ist aufgrund von Plänen entstanden, den Namen...'
                })
 
-        .. seealso:: `sotabencheval.machine_translation.TranslationMetrics.source_segments`
+        .. seealso:: `source_segments`
         """
 
         self.metrics.add(answers)
@@ -190,15 +190,53 @@ def add(self, answers: Dict[str, str]):
             )
             self.first_batch_processed = True
 
+    @property
+    def source_segments(self):
+        """
+        Ordered dictionary of all segments to translate with segment ids as keys. The same segment ids
+        have to be used when submitting translations with :func:`add`.
+
+        Examples:
+
+        .. code-block:: python
+
+            for segment_id, text in my_evaluator.source_segments.items():
+                translated = model(text)
+                my_evaluator.add({segment_id: translated})
+
+        .. seealso:: `source_documents`
+        """
+
+        return self.metrics.source_segments
+
+    @property
+    def source_documents(self):
+        """
+        List of all documents to translate.
+
+        Examples:
+
+        .. code-block:: python
+
+            for document in my_evaluator.source_documents:
+                for segment in document.segments:
+                    translated = model(segment.text)
+                    my_evaluator.add({segment.id: translated})
+
+        .. seealso:: `source_segments`
+        """
+
+        return self.metrics.source_documents
+
     def reset(self):
         """
         Removes already added translations
 
         When checking if the model should be rerun on whole dataset it is first run on a smaller subset
         and the results are compared with values cached on sotabench server (the check is not performed
         when running locally.) Ideally, the smaller subset is just the first batch, so no additional
-        computation is needed. However, for more complex multistage pipelines it maybe simpler to
-        run a model twice - on a small dataset and (if necessary) on the full dataset. In that case
+        computation is needed. However, for more complex multistage pipelines it may be simpler to
+        run the model twice - on a small dataset and (if necessary) on the full dataset. In that case
         :func:`reset` needs to be called before the second run so values from the first run are not reported.
 
         .. seealso:: :func:`cache_exists`
@@ -212,7 +250,7 @@ def get_results(self):
         Gets the results for the evaluator. Empty string is assumed for segments for which no translation
         was provided.
 
-        :return: dict with `SacreBLEU` and `BLEU score`
+        :return: dict with `SacreBLEU` and `BLEU score`.
         """
 
         if self.cached_results:

sotabencheval/question_answering/squad.py

Lines changed: 116 additions & 0 deletions
@@ -13,6 +13,53 @@ class SQuADVersion(Enum):
 
 
 class SQuADEvaluator(BaseEvaluator):
+    """Evaluator for Stanford Question Answering Dataset v1.1 and v2.0 benchmarks.
+
+    Examples:
+        Evaluate a BiDAF model from the AllenNLP repository on the SQuAD 1.1 development set:
+
+        .. code-block:: python
+
+            from sotabencheval.question_answering import SQuADEvaluator, SQuADVersion
+
+            from allennlp.data import DatasetReader
+            from allennlp.data.iterators import DataIterator
+            from allennlp.models.archival import load_archive
+            from allennlp.nn.util import move_to_device
+
+            def load_model(url, batch_size=64):
+                archive = load_archive(url, cuda_device=0)
+                model = archive.model
+                reader = DatasetReader.from_params(archive.config["dataset_reader"])
+                iterator_params = archive.config["iterator"]
+                iterator_params["batch_size"] = batch_size
+                data_iterator = DataIterator.from_params(iterator_params)
+                data_iterator.index_with(model.vocab)
+                return model, reader, data_iterator
+
+            def evaluate(model, dataset, data_iterator, evaluator):
+                model.eval()
+                evaluator.reset_time()
+                for batch in data_iterator(dataset, num_epochs=1, shuffle=False):
+                    batch = move_to_device(batch, 0)
+                    predictions = model(**batch)
+                    answers = {metadata['id']: prediction
+                               for metadata, prediction in zip(batch['metadata'], predictions['best_span_str'])}
+                    evaluator.add(answers)
+                    if evaluator.cache_exists:
+                        break
+
+            evaluator = SQuADEvaluator(local_root="data/nlp/squad", model_name="BiDAF (single)",
+                                       paper_arxiv_id="1611.01603", version=SQuADVersion.V11)
+
+            model, reader, data_iter = \
+                load_model("https://allennlp.s3.amazonaws.com/models/bidaf-model-2017.09.15-charpad.tar.gz")
+            dataset = reader.read(evaluator.dataset_path)
+            evaluate(model, dataset, data_iter, evaluator)
+            evaluator.save()
+            print(evaluator.results)
+    """
+
     task = "Question Answering"
 
     def __init__(self,
@@ -24,6 +71,38 @@ def __init__(self,
                  paper_results: dict = None,
                  model_description=None,
                  version: SQuADVersion = SQuADVersion.V20):
+        """
+        Creates an evaluator for SQuAD v1.1 or v2.0 Question Answering benchmarks.
+
+        :param local_root: Path to the directory where the dataset files are located locally.
+            Ignored when run on sotabench server.
+        :param dataset_filename: Local filename of the JSON file with the SQuAD dataset.
+            If None, the standard filename is used, based on :param:`version`.
+            Ignored when run on sotabench server.
+        :param model_name: The name of the model from the
+            paper - if you want to link your build to a model from a
+            machine learning paper. See the SQuAD benchmarks pages for model names,
+            (e.g., https://sotabench.com/benchmarks/question-answering-on-squad11-dev)
+            on the paper leaderboard or models yet to try tabs.
+        :param paper_arxiv_id: Optional linking to arXiv if you
+            want to link to papers on the leaderboard; put in the
+            corresponding paper's arXiv ID, e.g. '1907.10529'.
+        :param paper_pwc_id: Optional linking to Papers With Code;
+            put in the corresponding papers with code URL slug, e.g.
+            'spanbert-improving-pre-training-by'
+        :param paper_results: If the paper model you are reproducing
+            does not have model results on sotabench.com, you can specify
+            the paper results yourself through this argument, where keys
+            are metric names, values are metric values, e.g.:
+
+            {'EM': 0.858, 'F1': 0.873}.
+
+            Ensure that the metric names match those on the sotabench
+            leaderboard - for SQuAD benchmarks it should be `EM` for exact match
+            and `F1` for F1 score. Make sure to use results of evaluation on the development set.
+        :param model_description: Optional model description.
+        :param version: Which dataset to evaluate on, either `SQuADVersion.V11` or `SQuADVersion.V20`.
+        """
         super().__init__(model_name, paper_arxiv_id, paper_pwc_id, paper_results, model_description)
         self.root = change_root_if_server(root=local_root,
                                           server_root=".data/nlp/squad")
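(As a sketch of how the parameters documented above combine, using the example values from this docstring - the model name, ids, and metric numbers are illustrative only:)

``` python
# Hedged sketch: link a paper whose results are not yet on sotabench.com by
# passing paper_results with the metric names documented above (EM / F1 on the dev set).
evaluator = SQuADEvaluator(
    model_name="SpanBERT",
    paper_arxiv_id="1907.10529",
    paper_pwc_id="spanbert-improving-pre-training-by",
    paper_results={'EM': 0.858, 'F1': 0.873},
    version=SQuADVersion.V20,
)
```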
@@ -35,6 +114,23 @@ def __init__(self,
         self.metrics = SQuADMetrics(self.dataset_path, version)
 
     def add(self, answers: Dict[str, str]):
+        """
+        Updates the evaluator with new results
+
+        :param answers: a dictionary, where keys are question ids and values are text answers.
+            For unanswerable questions (SQuAD v2.0) the answer should be an empty string.
+
+        Examples:
+            Update the evaluator with two results:
+
+            .. code-block:: python
+
+                my_evaluator.add({
+                    "57296d571d04691400779413": "itself",
+                    "5a89117e19b91f001a626f2d": ""
+                })
+        """
+
         self.metrics.add(answers)
 
         if not self.first_batch_processed and self.metrics.has_data:
@@ -45,10 +141,30 @@ def add(self, answers: Dict[str, str]):
             self.first_batch_processed = True
 
     def reset(self):
+        """
+        Removes already added answers
+
+        When checking if the model should be rerun on whole dataset it is first run on a smaller subset
+        and the results are compared with values cached on sotabench server (the check is not performed
+        when running locally.) Ideally, the smaller subset is just the first batch, so no additional
+        computation is needed. However, for more complex multistage pipelines it may be simpler to
+        run the model twice - on a small dataset and (if necessary) on the full dataset. In that case
+        :func:`reset` needs to be called before the second run so values from the first run are not reported.
+
+        .. seealso:: :func:`cache_exists`
+        .. seealso:: :func:`reset_time`
+        """
+
         self.metrics.reset()
         self.reset_time()
 
     def get_results(self):
+        """
+        Gets the results for the evaluator.
+
+        :return: dict with `EM` (exact match score) and `F1`.
+        """
+
         if self.cached_results:
             return self.results
         self.results = self.metrics.get_results()
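(As an aside, the two-stage workflow that the `reset()` docstring above describes could look roughly like the sketch below; `run_model`, `small_subset` and `full_dataset` are hypothetical stand-ins for whatever produces a `{question_id: answer}` dict in your pipeline.)

``` python
# Hedged sketch of the "probe first, then rerun in full" pattern from reset():
answers = run_model(small_subset)           # hypothetical helper: {question_id: answer}
evaluator.add(answers)                      # submit the probe predictions
if not evaluator.cache_exists:              # no matching cached results on the server
    evaluator.reset()                       # drop probe answers so they are not reported twice
    evaluator.add(run_model(full_dataset))  # evaluate on the full dataset
evaluator.save()
```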

sotabencheval/version.py

Lines changed: 1 addition & 1 deletion
@@ -15,6 +15,6 @@ def __repr__(self):
             f"build={self.build})"
         )
 
-version = Version(0, 0, 35)
+version = Version(0, 0, 36)
 
 __version__ = str(version)
