Commit 52b37e1

fix: Fix off-by-one for limiting cached tokens to before alora start
The off-by-one caused the dummy test script to give inconsistent results depending on whether it included a turn that runs the prompt without the adapter before running it with the adapter.

Branch: gabe-l-hart/alora-support

Signed-off-by: Gabe Goodhart <[email protected]>
1 parent d03d106 commit 52b37e1

1 file changed (+1, -1 lines changed)

tools/server/server.cpp

Lines changed: 1 addition & 1 deletion
@@ -3338,7 +3338,7 @@ struct server_context {
         // if there is an alora invoked, don't cache after the invocation start
         if (slot.alora_invocation_start >= 0) {
             SLT_DBG(slot, "only caching to alora invocation start (n_past=%d, alora_invocation_start=%d)\n", slot.n_past, slot.alora_invocation_start);
-            slot.n_past = std::min(slot.n_past, slot.alora_invocation_start);
+            slot.n_past = std::min(slot.n_past, slot.alora_invocation_start - 1);
         }

         // reuse chunks from the cached prompt by shifting their KV cache in the new position
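
To make the effect of the one-line change concrete, here is a minimal standalone sketch of the clamp. The helper name `clamp_cached_tokens` is hypothetical and does not exist in server.cpp; it only mirrors the patched expression, assuming (as in the diff) that a negative `alora_invocation_start` means no aLoRA was invoked.

```cpp
#include <algorithm>

// Hypothetical helper for illustration only; not part of server.cpp.
// n_past:                 number of prompt tokens currently considered cached
// alora_invocation_start: start of the aLoRA invocation, or < 0 if none
// Returns the number of cached tokens that may be reused.
int clamp_cached_tokens(int n_past, int alora_invocation_start) {
    if (alora_invocation_start >= 0) {
        // After the fix: stop one token short of the invocation start.
        // Before the fix this was std::min(n_past, alora_invocation_start).
        return std::min(n_past, alora_invocation_start - 1);
    }
    // No aLoRA invoked: the cached prefix is left untouched.
    return n_past;
}
```

With 10 cached tokens and an invocation start of 6, the sketch keeps 5 tokens, where the pre-fix expression would have kept 6. A cached prefix that is already shorter than the limit (for example 3) is unchanged. The sketch copies the patched line as-is, so an invocation start of 0 yields -1; the diff does not show whether the surrounding code guards that case.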
