Improve stability of flaky perplexity test (#1884)
SUMMARY:
`tests/llmcompressor/transformers/compression/test_quantization.py:test_perplexity`
is currently flaky, with the test [occasionally
failing](https://github.com/neuralmagic/llm-compressor-testing/actions/runs/17994161150/job/51234264145)
due to the recorded `avg_ppl` exceeding the test threshold.
Debugging showed that most of the high-perplexity samples are ones where the majority of the target labels are not trained (i.e. set to
`-100`). The loss is then averaged over only the few remaining tokens,
which makes the calculation more volatile and can produce very high
recorded perplexity values.
To correct this, I added a check that filters out samples where fewer
than `25%` of the tokens have training labels. This should make the
perplexity calculation more consistent while still verifying that the
model's perplexity is reasonable. A sketch of the filtering logic is
shown below.
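
For illustration only, a minimal sketch of the kind of check described above, assuming PyTorch tensors of token labels; the names (`has_enough_labels`, `MIN_LABELED_FRACTION`, `sample_perplexity`) are hypothetical and not the actual test code:

```python
import torch

IGNORE_INDEX = -100          # label value used for tokens excluded from the loss
MIN_LABELED_FRACTION = 0.25  # threshold described in this PR


def has_enough_labels(labels: torch.Tensor) -> bool:
    """Return True if at least 25% of the sample's tokens have training labels."""
    labeled = (labels != IGNORE_INDEX).sum().item()
    return labeled / labels.numel() >= MIN_LABELED_FRACTION


def sample_perplexity(mean_loss: torch.Tensor) -> float:
    """Perplexity is the exponential of the mean cross-entropy loss."""
    return torch.exp(mean_loss).item()
```

Samples failing the check would be skipped before computing `avg_ppl`, so a handful of mostly-unlabeled samples no longer dominates the average.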
TEST PLAN:
Ran the test locally and all cases passed. Since the test is flaky,
however, a passing local run does not guarantee the problem is solved.
---------
Signed-off-by: Fynn Schmitt-Ulms <[email protected]>