
Commit 4b18555

Fix text_llm_runner kv cache pos count and use it for generate() (#15295)
### Summary
`pos_` should advance by both the prefill (prompt) size and the number of generated tokens.

### Test plan
CI

cc @larryliu0820 @mergennachin @cccclai @helunwencser @jackzhxng

Co-authored-by: Hansong Zhang <[email protected]>
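For context, the bookkeeping this change establishes can be sketched as follows. The class below is an illustrative stand-in, not the actual ExecuTorch `TextLLMRunner` API; only the `pos_` accounting mirrors the diff.

```cpp
// Minimal sketch: pos_ is the next free slot in the KV cache. It must advance
// by the prompt length during prefill and by the number of generated tokens
// during decoding, so a later call starts writing at the correct slot.
#include <cstdint>
#include <vector>

class ToyTextRunner {
 public:
  // Prefill: feeds the prompt through the model, filling KV cache slots
  // [pos_, pos_ + prompt.size()).
  void prefill(const std::vector<int64_t>& prompt) {
    pos_ += static_cast<int64_t>(prompt.size());
  }

  // Decode: generates up to max_new_tokens, starting from KV cache slot pos_.
  void generate(int64_t max_new_tokens) {
    int64_t num_generated_tokens = decode_from(pos_, max_new_tokens);
    // The point of this commit: account for the tokens produced during
    // decoding so the next call does not overwrite live cache entries.
    pos_ += num_generated_tokens;
  }

  int64_t pos() const { return pos_; }

 private:
  // Stand-in for text_token_generator_->generate(...); returns how many
  // tokens were actually produced (here, always the maximum).
  int64_t decode_from(int64_t /*start_pos*/, int64_t max_new_tokens) {
    return max_new_tokens;
  }

  int64_t pos_ = 0;  // next write position in the KV cache
};

int main() {
  ToyTextRunner runner;
  runner.prefill({1, 2, 3, 4});  // 4 prompt tokens
  runner.generate(8);            // 8 generated tokens
  // pos_ is now 12, so a follow-up turn starts writing at slot 12.
  return runner.pos() == 12 ? 0 : 1;
}
```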
1 parent 8c84780 · commit 4b18555

File tree (1 file changed, +3 −1 lines)

extension/llm/runner/text_llm_runner.cpp

Lines changed: 3 additions & 1 deletion
```diff
@@ -183,11 +183,13 @@ Error TextLLMRunner::generate(
   // Generate max_new_tokens - 1 because prefill already generated 1 token.
   int64_t num_generated_tokens = ET_UNWRAP(text_token_generator_->generate(
       prompt_tokens,
-      num_prompt_tokens,
+      pos_,
       max_new_tokens - 1,
       temperature_ == -1.0f ? config.temperature : temperature_,
       wrapped_callback));
 
+  pos_ += num_generated_tokens;
+
   stats_->inference_end_ms = time_in_ms();
   if (!config.warming) {
     printf("\n");
```
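
The net effect on the counter can be illustrated with made-up token counts; this is not code from the repository, just the arithmetic the added line implies:

```cpp
// Illustrative position arithmetic after this change (values are invented).
#include <cassert>
#include <cstdint>

int main() {
  int64_t pos = 0;

  // Turn 1: prefill a 10-token prompt, then decode 5 new tokens.
  pos += 10;  // prefill advances by the prompt size
  pos += 5;   // the line added in this commit: advance by the generated count
  assert(pos == 15);

  // Turn 2: the next prefill starts at slot 15 rather than slot 10,
  // so earlier KV cache entries are not overwritten.
  pos += 7;
  pos += 4;
  assert(pos == 26);
  return 0;
}
```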
