Commit 10f72fc

Fix flaky testPhi4 and testVoxtral by setting temperature=0 (#16517)
Summary: Both tests were flaky because LLM output is non-deterministic at the default temperature of 0.8, which uses RNG-based sampling with a time-based seed. Setting temperature=0 switches to greedy argmax decoding, which eliminates the randomness and makes assertions on the generated text reliable. This matches how other LLM tests and production runners in the codebase handle determinism (e.g., test_text_decoder_runner.cpp, test_sampler.cpp, and the QNN/QAI Hub runners).

This fixes 5 flaky tests. {F1984490494}

Reviewed By: shoumikhin

Differential Revision: D90361187
1 parent f0edae2 commit 10f72fc
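
The mechanism the summary describes can be shown with a small, self-contained sketch. This is not the ExecuTorch sampler; sampleNextToken and the example logits below are hypothetical, and the sketch only illustrates why temperature = 0 is deterministic while the default 0.8 depends on RNG state:

import Foundation

// Hypothetical sketch (not the ExecuTorch sampler): pick the next token
// from raw logits. With temperature == 0, take a plain argmax, which is
// deterministic. Otherwise scale the logits by 1/temperature, softmax
// them, and draw from the result, which depends on the RNG state.
func sampleNextToken(logits: [Double], temperature: Double) -> Int {
  if temperature == 0 {
    // Greedy decoding: always return the highest-scoring token.
    return logits.indices.max { logits[$0] < logits[$1] }!
  }
  // Temperature sampling: softmax over scaled logits, then draw.
  let scaled = logits.map { $0 / temperature }
  let maxLogit = scaled.max()!
  let weights = scaled.map { exp($0 - maxLogit) }  // subtract max for numerical stability
  var draw = Double.random(in: 0..<weights.reduce(0, +))
  for (index, weight) in weights.enumerated() {
    draw -= weight
    if draw < 0 { return index }
  }
  return weights.count - 1  // guard against floating-point rounding
}

let logits = [1.2, 3.4, 0.5]
print(sampleNextToken(logits: logits, temperature: 0))    // always 1
print(sampleNextToken(logits: logits, temperature: 0.8))  // can vary per run

Because greedy decoding always returns the same token for the same logits, the full generated string is reproducible run to run, which is what lets the tests below assert on it.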

2 files changed: +3 additions, -0 deletions

extension/llm/apple/ExecuTorchLLM/__tests__/MultimodalRunnerTest.swift (1 addition, 0 deletions)

@@ -238,6 +238,7 @@ class MultimodalRunnerTest: XCTestCase {
         MultimodalInput(String(format: chatTemplate, userPrompt)),
       ], Config {
         $0.maximumNewTokens = 256
+        $0.temperature = 0
       }) { token in
         text += token
       }

extension/llm/apple/ExecuTorchLLM/__tests__/TextRunnerTest.swift (2 additions, 0 deletions)

@@ -87,6 +87,7 @@ class TextRunnerTest: XCTestCase {
     do {
       try runner.generate(userPrompt, Config {
         $0.sequenceLength = sequenceLength
+        $0.temperature = 0
       }) { token in
         text += token
       }
@@ -100,6 +101,7 @@ class TextRunnerTest: XCTestCase {
     do {
       try runner.generate(userPrompt, Config {
         $0.sequenceLength = sequenceLength
+        $0.temperature = 0
       }) { token in
         text += token
       }
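
For context, the call shape these hunks modify, reduced to a standalone sketch; runner, userPrompt, sequenceLength, and the closing assertion are hypothetical stand-ins for the surrounding test fixture, with only the Config builder usage taken from the diff:

// Condensed from the TextRunnerTest hunks above; everything outside the
// Config closure is a stand-in for the real test code.
var text = ""
try runner.generate(userPrompt, Config {
  $0.sequenceLength = sequenceLength
  $0.temperature = 0  // greedy decoding keeps `text` stable across runs
}) { token in
  text += token
}
XCTAssertTrue(text.contains(expectedAnswer))  // hypothetical assertion

With temperature pinned to 0, the substring assertion no longer depends on the sampler's RNG, which is the whole fix.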
