
Commit b06bf56

Improve stability of flaky perplexity test (#1884)
SUMMARY: `tests/llmcompressor/transformers/compression/test_quantization.py:test_perplexity` is currently flaky, [occasionally failing](https://github.com/neuralmagic/llm-compressor-testing/actions/runs/17994161150/job/51234264145) because the recorded `avg_ppl` exceeds the test threshold. Debugging suggests that most of the high-perplexity samples are ones in which the majority of target labels are not trained (i.e. set to `-100`). Averaging the loss over the few remaining tokens is then more volatile and can produce high recorded perplexity values. To correct this, this change filters out samples in which fewer than 25% of the tokens have training labels. This should make the perplexity calculation more consistent while still verifying that the model's perplexity is reasonable.

TEST PLAN: Ran the test locally and all cases passed. Since the test is flaky, however, a local pass does not guarantee the problem is solved.

Signed-off-by: Fynn Schmitt-Ulms <[email protected]>
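The volatility described above can be reproduced in isolation: PyTorch's cross-entropy with `ignore_index=-100` averages only over unmasked positions, so a sample with very few labeled tokens averages over a handful of values and its perplexity swings widely. A minimal sketch (random logits and illustrative sizes; not part of the commit):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
vocab_size, seq_len = 32, 512
logits = torch.randn(seq_len, vocab_size)

# Fully labeled sample vs. a sample with only 2 labeled positions
labels_full = torch.randint(0, vocab_size, (seq_len,))
labels_sparse = labels_full.clone()
labels_sparse[:-2] = -100  # -100 positions are excluded from the loss

# cross_entropy averages over unmasked tokens only, so the sparse
# sample's loss is a mean of just 2 values and is far noisier
loss_full = F.cross_entropy(logits, labels_full, ignore_index=-100)
loss_sparse = F.cross_entropy(logits, labels_sparse, ignore_index=-100)
print(torch.exp(loss_full).item(), torch.exp(loss_sparse).item())
```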
1 parent 8f327a6 commit b06bf56

File tree

1 file changed: +11, -7 lines


tests/llmcompressor/transformers/compression/test_quantization.py

Lines changed: 11 additions & 7 deletions
```diff
@@ -145,15 +145,19 @@ def test_perplexity(setup_model_and_config):
     dispatch_for_generation(model)
 
     total_ppl = 0.0
-    total_non_nan = 0
-    for idx, sample in enumerate(dataloader):
-        if idx >= config["num_eval"]:
+    total_samples = 0
+    for sample in dataloader:
+        if total_samples >= config["num_eval"]:
             break
-        output = model(**tensors_to_device(sample, "cuda:0"))
-        if torch.isnan(output.loss):
+        # -100 in labels indicates that the token is not part of the loss calculation
+        pct_labels_in_sample = (sample["labels"] != -100).to(torch.float).mean().item()
+        if pct_labels_in_sample <= 0.25:
+            # At least 25% of the tokens in the sample must be part of loss calculation
+            # otherwise the perplexity is too volatile and can skew the results
             continue
+        output = model(**tensors_to_device(sample, "cuda:0"))
         total_ppl += torch.exp(output.loss).item()
-        total_non_nan += 1
+        total_samples += 1
 
-    avg_ppl = total_ppl / total_non_nan
+    avg_ppl = total_ppl / total_samples
     assert avg_ppl <= config["ppl_threshold"]
```
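For reference, the coverage check from the diff can be lifted into a standalone helper. This is a sketch assuming the Hugging Face convention that `-100` marks label positions excluded from the loss; the helper name and tensor sizes are illustrative, not part of the commit:

```python
import torch

def has_enough_labels(labels: torch.Tensor, min_pct: float = 0.25) -> bool:
    """Return True if more than min_pct of positions contribute to the loss."""
    pct_labeled = (labels != -100).to(torch.float).mean().item()
    return pct_labeled > min_pct

# 64 of 512 positions labeled -> 12.5% coverage, so the sample is skipped
labels = torch.full((512,), -100)
labels[:64] = 1
print(has_enough_labels(labels))  # False
```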
