feat: introduce results attribute on MMLU evaluator
In order to test the validity of our MMLU results or get information on prior runs,
we need to be able to access the full set of results from the lm_eval.evaluator.simple_evaluate
API. This commit provides that ability by adding a results attribute on the MMLUEvaluator class
and storing the results there.
Signed-off-by: Oleg S <[email protected]>
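As a rough sketch of the intended consumer-facing use (the constructor arguments below are illustrative assumptions, not values taken from this change), the full lm_eval output becomes reachable after a run:

```python
# Minimal usage sketch; the model path and task selection are hypothetical
# placeholders, not taken from this pull request.
from instructlab.eval.mmlu import MMLUEvaluator

evaluator = MMLUEvaluator(
    model_path="path/to/model",        # hypothetical checkpoint location
    tasks=["mmlu_abstract_algebra"],   # hypothetical MMLU subtask
    few_shots=5,
    batch_size="auto",
)
evaluator.run()

# The full lm_eval.evaluator.simple_evaluate output is now retained.
if evaluator.results is not None:
    print(sorted(evaluator.results.keys()))
```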
CHANGELOG.md (1 addition, 0 deletions)
@@ -1,6 +1,7 @@
 ## 0.4.2
 
 * Adds the ability to provide a custom system prompt to the MMLU-based evaluators. When a system prompt is provided, LM-eval applies the chat template under the hood, else it will pass the model a barebones prompt.
+* Adds an `extra_args` parameter to the `.run` method of all MMLU-based evaluators. This way, consumers are able to directly pass any additional arguments they want through to the `lm_eval.evaluators.simple_evaluate` function.
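Building on the sketch above, the `extra_args` hook could forward lm_eval options such as sample logging. The specific keys shown are assumptions about `simple_evaluate`'s keyword arguments, not something this changelog entry guarantees:

```python
# Sketch only: assumes extra_args is a dict of keyword arguments that .run()
# forwards verbatim to lm_eval's simple_evaluate.
evaluator.run(
    extra_args={
        "log_samples": True,   # ask lm_eval to keep per-sample records
        "write_out": False,
    },
)
```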
scripts/test_mmlu.py (52 additions, 1 deletion)
@@ -1,9 +1,41 @@
+# Standard
+from typing import Dict, List, Tuple, TypedDict
+
 # First Party
 from instructlab.eval.mmlu import MMLUEvaluator
 
 SYSTEM_PROMPT = """I am, Red Hat® Instruct Model based on Granite 7B, an AI language model developed by Red Hat and IBM Research, based on the Granite-7b-base language model. My primary function is to be a chat assistant."""
 
 
+class MMLUSample(TypedDict):
+    """
+    Example of a single sample returned from lm_eval when running MMLU.
+    This is not a comprehensive type, just the subset of fields we care about for this test.
+    """
+
+    # Arguments is the list of (prompt, answer) pairs passed to MMLU as few-shot samples.
+    # They will not be present with few_shot=0
+    arguments: List[Tuple[str, str]]
+
+
+def all_samples_contain_system_prompt(
+    samples: Dict[str, List[MMLUSample]], prompt: str
+) -> bool:
+    """
+    Given a mapping of evaluation --> list of results, validates that all few-shot examples
+    included the system prompt
+    """
+    for topic, samples_set in samples.items():
+        for sample in samples_set:
+            for mmlu_prompt, _ in sample["arguments"]:
+                if prompt not in mmlu_prompt:
+                    # we are looking for the exact system prompt, so no need to normalize to lowercase
+                    print(f"found a sample in the '{topic}' MMLU topic set")
+                    return False
+    return True
The remaining hunks touch the module that defines `AbstractMMLUEvaluator`, first documenting the new attribute and then storing and exposing it:

@@ -103,6 +103,7 @@ class AbstractMMLUEvaluator(Evaluator):
         batch_size      batch size for evaluation. Valid values are a positive integer or 'auto' to select the largest batch size that will fit in memory, or 'auto:N' to reselect the largest batch size N times'.
         device          PyTorch device (e.g. "cpu" or "cuda:0") for running models
         system_prompt   system prompt to be used when applying the chat template
+        results         full output from the `lm_eval.evaluator.simple_evaluate` function after MMLU has run.
     """
 
     def __init__(
@@ -124,18 +125,33 @@ def __init__(
         self.few_shots = few_shots
         self.batch_size = batch_size
         self.device = device
+        self._results = None
 
-    def run(self, server_url: str | None = None) -> tuple:
+    @property
+    def results(self) -> Dict[str, Any] | None:
+        """
+        Returns the results of the last MMLU evaluation, if one has taken place.
+
+        Returns:
+            Dict[str, Any] | None: The output from `lm_eval.evaluator.simple_evaluate`
+        """
+        return self._results
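The diff cuts off here, but the pattern it implies is that `run` holds on to whatever `simple_evaluate` returns so the property can surface it later. A self-contained sketch of that pairing follows; the class name, model arguments, and task list are hypothetical, not this file's actual code.

```python
# Illustrative stand-in for the run()/results pairing implied by the diff;
# the simple_evaluate arguments below are placeholders.
from typing import Any, Dict, Optional

from lm_eval.evaluator import simple_evaluate


class ResultsRetainingRunner:
    def __init__(self) -> None:
        self._results: Optional[Dict[str, Any]] = None

    def run(self, **extra_args: Any) -> None:
        # Keep the full lm_eval output instead of discarding it after scoring.
        self._results = simple_evaluate(
            model="hf",
            model_args="pretrained=path/to/model",  # hypothetical checkpoint
            tasks=["mmlu_abstract_algebra"],        # hypothetical task subset
            **extra_args,
        )

    @property
    def results(self) -> Optional[Dict[str, Any]]:
        return self._results
```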