
Commit 41ebbfd

server : fixes + clean-up
1 parent 545df93 commit 41ebbfd

File tree: 2 files changed (+6, -6 lines)


tools/server/README.md

Lines changed: 2 additions & 2 deletions
@@ -587,7 +587,7 @@ These words will not be included in the completion, so make sure to add them to
 - `word`: Stopped due to encountering a stopping word from `stop` JSON array provided
 - `stopping_word`: The stopping word encountered which stopped the generation (or "" if not stopped due to a stopping word)
 - `timings`: Hash of timing information about the completion such as the number of tokens `predicted_per_second`
-- `tokens_cached`: Number of tokens from the prompt which could be re-used from previous completion (`n_past`)
+- `tokens_cached`: Number of tokens from the prompt which could be re-used from previous completion
 - `tokens_evaluated`: Number of tokens evaluated in total from the prompt
 - `truncated`: Boolean indicating if the context size was exceeded during generation, i.e. the number of tokens provided in the prompt (`tokens_evaluated`) plus tokens generated (`tokens predicted`) exceeded the context size (`n_ctx`)

@@ -1045,7 +1045,7 @@ Available metrics:
 - `llamacpp:kv_cache_tokens`: KV-cache tokens.
 - `llamacpp:requests_processing`: Number of requests processing.
 - `llamacpp:requests_deferred`: Number of requests deferred.
-- `llamacpp:n_past_max`: High watermark of the context size observed.
+- `llamacpp:n_tokens_max`: High watermark of the context size observed.

 ### POST `/slots/{id_slot}?action=save`: Save the prompt cache of the specified slot to a file.

tools/server/server.cpp

Lines changed: 4 additions & 4 deletions
@@ -3707,9 +3707,9 @@ struct server_context {
     n_past = slot.prompt.tokens.get_common_prefix(input_tokens);

     // if there is an alora invoked, don't cache after the invocation start
-    if (slot.alora_invocation_start >= 0) {
-        SLT_DBG(slot, "only caching to alora invocation start (n_past=%d, alora_invocation_start=%d)\n", n_past, slot.alora_invocation_start);
-        n_past = std::min(n_past, slot.alora_invocation_start);
+    if (slot.alora_invocation_start > 0) {
+        SLT_DBG(slot, "only caching to alora invocation start (n_past = %d, alora_invocation_start = %d)\n", n_past, slot.alora_invocation_start);
+        n_past = std::min(n_past, slot.alora_invocation_start - 1);
     }

     // reuse chunks from the cached prompt by shifting their KV cache in the new position
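The behavioural change in this hunk is the off-by-one in the clamp: only a positive `alora_invocation_start` now counts as an active aLoRA invocation, and the reusable prefix is cut to one token before it, so the invocation-start token itself is always re-evaluated. Below is a minimal standalone sketch of that rule with plain ints instead of the server's slot state; the zero-means-no-invocation reading is an assumption inferred from the new `> 0` guard, not taken from this commit.

```cpp
// Standalone sketch of the clamping rule above; not the server's actual code.
// Assumption (inferred from the `> 0` guard): a non-positive
// alora_invocation_start means "no aLoRA invocation in this prompt".
#include <algorithm>
#include <cassert>

static int clamp_cached_prefix(int n_past, int alora_invocation_start) {
    if (alora_invocation_start > 0) {
        // keep the reusable prefix strictly before the invocation start token
        n_past = std::min(n_past, alora_invocation_start - 1);
    }
    return n_past;
}

int main() {
    assert(clamp_cached_prefix(32,  0) == 32); // no invocation: full prefix reused
    assert(clamp_cached_prefix(32, 10) ==  9); // cache stops before token index 10
    assert(clamp_cached_prefix( 5, 10) ==  5); // prefix already ends before the invocation
    return 0;
}
```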
@@ -3769,7 +3769,7 @@ struct server_context {
     const auto n_swa = std::max(1, llama_model_n_swa(model));

     // the largest pos_min required for a checkpoint to be useful
-    const auto pos_min_thold = std::max(0, n_past - n_swa - 1);
+    const auto pos_min_thold = std::max(0, n_past - n_swa);

     if (n_past > 0 && n_past < slot.prompt.n_tokens()) {
         const auto pos_min = llama_memory_seq_pos_min(llama_get_memory(ctx), slot.id);
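The second change tightens the checkpoint threshold by one position. For a rough sense of the shift, the snippet below evaluates both formulas on made-up numbers; the surrounding checkpoint-restore logic is not reproduced here.

```cpp
// Hypothetical values, only to show how the threshold moves by one position
// with the change above; this is not the server's checkpoint logic.
#include <algorithm>
#include <cstdio>

int main() {
    const int n_past = 100; // hypothetical length of the reusable prompt prefix
    const int n_swa  = 32;  // hypothetical sliding-window size (clamped to >= 1 upstream)

    const int thold_old = std::max(0, n_past - n_swa - 1); // previous formula -> 67
    const int thold_new = std::max(0, n_past - n_swa);     // updated formula  -> 68

    std::printf("old threshold: %d, new threshold: %d\n", thold_old, thold_new);
    return 0;
}
```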
