Commit dabff0a

loosen lmeval assertions to upper or lower bound (#1477)
SUMMARY: `lm_eval` end-to-end tests occasionally fail when the actual value is higher than the expected value, even though we only care about whether performance has regressed (example run [here](https://github.com/neuralmagic/llm-compressor-testing/actions/runs/15232878962/job/42842981702)). This PR loosens the check to assert only:

- actual value > expected value - error tolerance, if a higher score is better (generally the case)
- actual value < expected value + error tolerance, if a lower score is better (in case we have PPL (perplexity) checks now or add them in the future)

A minimal sketch of this check follows the commit metadata below.

TEST PLAN:
- [x] Rerun weekly lm-eval tests before merging this in -- https://github.com/neuralmagic/llm-compressor-testing/actions/runs/15281691638

---------

Signed-off-by: Brian Dellabetta <[email protected]>
1 parent: 5c643b0 · commit: dabff0a
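For illustration, here is a minimal sketch of the loosened bound check described in the summary. `check_metric` is a hypothetical helper name for this sketch, not part of the test suite's API:

```python
# Minimal sketch of the one-sided tolerance check; `check_metric` is a
# hypothetical helper, not part of the actual test suite.
def check_metric(
    actual: float, expected: float, tol: float, higher_is_better: bool = True
) -> None:
    if higher_is_better:
        # Scores may exceed expectations freely; only fail on regression.
        assert actual >= expected - tol, f"regression: {actual} < {expected} - {tol}"
    else:
        # For lower-is-better metrics (e.g. perplexity), flip the bound.
        assert actual <= expected + tol, f"regression: {actual} > {expected} + {tol}"
```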

File tree

1 file changed (+26, -9 lines)

tests/lmeval/test_lmeval.py

Lines changed: 26 additions & 9 deletions
@@ -154,29 +154,46 @@ def _run_lm_eval(self):
             batch_size=self.lmeval.batch_size,
         )
 
-        metrics = results["results"][self.lmeval.task]
+        metrics: dict = results["results"][self.lmeval.task]
         for metric_key, expected_val in self.lmeval.metrics.items():
             # stderr metrics are only used as absolute tolerance
             # checks for actual values
             if "stderr" in metric_key:
                 continue
             actual_val = metrics.get(metric_key)
-            # If stderr is provided, use it as absolute tolerance
-            # Otherwise, default to a 5% relative tolerance
+            higher_is_better = results["higher_is_better"][self.lmeval.task].get(
+                metric_key.split(",")[0], True
+            )
             stderr_key = metric_key.replace(",", "_stderr,")
             std_err = self.lmeval.metrics.get(stderr_key)
+
+            # If stderr is provided, use it as absolute tolerance
+            # Otherwise, default to a 5% relative tolerance
             if std_err is None:
                 logger.info(
-                    f"Comparing {metric_key}: Expected {expected_val} "
-                    f"±5%, Got {actual_val}"
+                    f"Comparing {metric_key}: Expecting {expected_val} "
+                    f"relative tolerance ±5%, Got {actual_val}. "
+                    f"Higher is better: {higher_is_better}"
                 )
-                assert numpy.isclose(expected_val, actual_val, rtol=0.05)
+                # If higher is better, assert actual val >= expected val * (1 - 5%)
+                if higher_is_better:
+                    assert actual_val >= expected_val * (0.95)
+                # If higher is worse, assert actual val <= expected val * (1 + 5%)
+                else:
+                    assert actual_val <= expected_val * (1.05)
+
             else:
                 logger.info(
-                    f"Comparing {metric_key}: Expected {expected_val} "
-                    f"±{std_err*100}%, Got {actual_val}"
+                    f"Comparing {metric_key}: Expecting {expected_val} "
+                    f"absolute tolerance ±{std_err*100}%, Got {actual_val}. "
+                    f"Higher is better: {higher_is_better}"
                 )
-                assert numpy.isclose(expected_val, actual_val, atol=std_err)
+                # If higher is better, assert actual val >= expected val - stderr
+                if higher_is_better:
+                    assert actual_val >= expected_val - std_err
+                # If higher is worse, assert actual val <= expected val + stderr
+                else:
+                    assert actual_val <= expected_val + std_err
 
     def tear_down(self):
         timer = get_singleton_manager()
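For context, a quick numeric illustration of why the old two-sided `numpy.isclose` check could fail on an improved score while the new one-sided bound passes (the values here are assumed for illustration, not taken from any CI run):

```python
import numpy

expected, actual = 0.80, 0.86  # assumed: model scored above the baseline

# Old two-sided check: fails even though the score improved.
print(numpy.isclose(expected, actual, rtol=0.05))  # False (0.06 > 0.05 * 0.86)

# New one-sided check for a higher-is-better metric: passes.
print(actual >= expected * 0.95)  # True (0.86 >= 0.76)
```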
