
Commit b06bf56

Improve stability of flaky perplexity test (#1884)
SUMMARY: `tests/llmcompressor/transformers/compression/test_quantization.py:test_perplexity` is currently flaky, [occasionally failing](https://github.com/neuralmagic/llm-compressor-testing/actions/runs/17994161150/job/51234264145) because the recorded `avg_ppl` exceeds the test threshold. Debugging suggests that most of the high-perplexity samples are ones in which the majority of target labels are not trained (i.e. set to `-100`). Averaging the loss over the few remaining tokens is then more volatile and can produce high recorded perplexity values. To correct this, this change filters out samples in which fewer than 25% of the tokens have training labels. This should make the perplexity calculation more consistent while still verifying that the model's perplexity is reasonable.

TEST PLAN: Ran the test locally and all cases passed. Since the test is flaky, however, a local pass does not guarantee the problem is solved.

Signed-off-by: Fynn Schmitt-Ulms <[email protected]>
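The volatility described above can be reproduced in isolation: PyTorch's cross-entropy with `ignore_index=-100` averages only over unmasked positions, so a sample with very few labeled tokens averages over a handful of values and its perplexity swings widely. A minimal sketch (random logits and illustrative sizes; not part of the commit):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
vocab_size, seq_len = 32, 512
logits = torch.randn(seq_len, vocab_size)

# Fully labeled sample vs. a sample with only 2 labeled positions
labels_full = torch.randint(0, vocab_size, (seq_len,))
labels_sparse = labels_full.clone()
labels_sparse[:-2] = -100  # -100 positions are excluded from the loss

# cross_entropy averages over unmasked tokens only, so the sparse
# sample's loss is a mean of just 2 values and is far noisier
loss_full = F.cross_entropy(logits, labels_full, ignore_index=-100)
loss_sparse = F.cross_entropy(logits, labels_sparse, ignore_index=-100)
print(torch.exp(loss_full).item(), torch.exp(loss_sparse).item())
```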
1 parent 8f327a6 commit b06bf56

File tree

1 file changed: +11, -7 lines


tests/llmcompressor/transformers/compression/test_quantization.py

Lines changed: 11 additions & 7 deletions
```diff
@@ -145,15 +145,19 @@ def test_perplexity(setup_model_and_config):
     dispatch_for_generation(model)
 
     total_ppl = 0.0
-    total_non_nan = 0
-    for idx, sample in enumerate(dataloader):
-        if idx >= config["num_eval"]:
+    total_samples = 0
+    for sample in dataloader:
+        if total_samples >= config["num_eval"]:
             break
-        output = model(**tensors_to_device(sample, "cuda:0"))
-        if torch.isnan(output.loss):
+        # -100 in labels indicates that the token is not part of the loss calculation
+        pct_labels_in_sample = (sample["labels"] != -100).to(torch.float).mean().item()
+        if pct_labels_in_sample <= 0.25:
+            # At least 25% of the tokens in the sample must be part of loss calculation
+            # otherwise the perplexity is too volatile and can skew the results
             continue
+        output = model(**tensors_to_device(sample, "cuda:0"))
         total_ppl += torch.exp(output.loss).item()
-        total_non_nan += 1
+        total_samples += 1
 
-    avg_ppl = total_ppl / total_non_nan
+    avg_ppl = total_ppl / total_samples
     assert avg_ppl <= config["ppl_threshold"]
```
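For reference, the coverage check from the diff can be lifted into a standalone helper. This is a sketch assuming the Hugging Face convention that `-100` marks label positions excluded from the loss; the helper name and tensor sizes are illustrative, not part of the commit:

```python
import torch

def has_enough_labels(labels: torch.Tensor, min_pct: float = 0.25) -> bool:
    """Return True if more than min_pct of positions contribute to the loss."""
    pct_labeled = (labels != -100).to(torch.float).mean().item()
    return pct_labeled > min_pct

# 64 of 512 positions labeled -> 12.5% coverage, so the sample is skipped
labels = torch.full((512,), -100)
labels[:64] = 1
print(has_enough_labels(labels))  # False
```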
