Fix flaky testPhi4 and testVoxtral by setting temperature=0 #16517
Conversation
Summary:
Both tests were flaky because LLM outputs are non-deterministic with the default temperature of 0.8, which uses RNG-based sampling with a time-based seed. Setting temperature=0 enables greedy argmax decoding, eliminating randomness and making assertions on generated text reliable.
This is consistent with how other LLM tests and production runners in the codebase handle determinism (e.g., test_text_decoder_runner.cpp, test_sampler.cpp, and QNN/QAI Hub runners).
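For reference, the behavioral difference the fix relies on can be sketched in a few lines of Swift. This is an illustrative sketch only, not ExecuTorch's actual sampler code: with temperature equal to 0 the next token is the argmax of the logits, so the same prompt always yields the same continuation, while any positive temperature scales the logits and draws from the resulting distribution via the RNG.

```swift
import Foundation

// Illustrative sketch (not ExecuTorch's sampler implementation): temperature == 0
// takes the argmax of the logits, so decoding is deterministic; temperature > 0
// scales the logits and samples, which is the RNG path that made the tests flaky.
func nextToken(logits: [Float], temperature: Float) -> Int {
  precondition(!logits.isEmpty)
  if temperature == 0 {
    // Greedy argmax decoding: no randomness involved.
    return logits.indices.max { logits[$0] < logits[$1] }!
  }
  // Temperature sampling: scale, exponentiate, then draw from the distribution.
  let scaled = logits.map { $0 / temperature }
  let maxLogit = scaled.max()!
  let weights = scaled.map { Float(exp(Double($0 - maxLogit))) }
  let total = weights.reduce(0, +)
  var draw = Float.random(in: 0..<1) * total  // system-seeded RNG, different each run
  for (index, weight) in weights.enumerated() {
    draw -= weight
    if draw <= 0 { return index }
  }
  return weights.count - 1
}
```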
This fixes 5 flaky tests.
{F1984490494}
Reviewed By: shoumikhin
Differential Revision: D90361187
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/16517
Note: Links to docs will display an error until the docs builds have been completed.
✅ You can merge normally! (1 Unrelated Failure)
As of commit 8c9e574 with merge base 847d70d.
UNSTABLE - The following job is marked as unstable, possibly due to flakiness on trunk:
This comment was automatically generated by Dr. CI and updates every 15 minutes.
Pull request overview
This PR fixes flaky LLM tests by setting temperature=0 in the generation configuration, enabling deterministic greedy decoding instead of random sampling. This eliminates non-deterministic behavior that was causing test failures when assertions checked for specific words in generated text.
Key changes:
- Added temperature=0 to both generation calls in testPhi4
- Added temperature=0 to the generation call in testVoxtral
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| extension/llm/apple/ExecuTorchLLM/tests/TextRunnerTest.swift | Set temperature=0 in both generation calls of testPhi4 to ensure deterministic output when asserting "paris" is in the generated text |
| extension/llm/apple/ExecuTorchLLM/tests/MultimodalRunnerTest.swift | Set temperature=0 in testVoxtral's generation call to ensure deterministic output when asserting "tattoo" is in the generated text |
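For illustration, the shape of the change in the Swift tests is roughly as sketched below. The runner helper, Config builder, and parameter names here are approximations, not verbatim code from TextRunnerTest.swift; the point is only that the generation config now pins temperature to 0 before the test asserts on the generated text.

```swift
import XCTest

// Hypothetical sketch of the testPhi4 change; makePhi4Runner, Config, and
// sequenceLength are assumed names used for illustration only.
final class TextRunnerSketchTests: XCTestCase {
  func testPhi4Deterministic() throws {
    let runner = try makePhi4Runner()       // assumed helper that loads the model
    var output = ""
    try runner.generate("What is the capital of France?", Config {
      $0.sequenceLength = 128               // assumed pre-existing setting
      $0.temperature = 0                    // the fix: greedy decoding, stable output
    }) { token in
      output += token                       // accumulate streamed tokens
    }
    XCTAssertTrue(output.lowercased().contains("paris"))
  }
}
```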