
Conversation

@ilopezluna
Contributor

No description provided.

@ilopezluna ilopezluna requested a review from a team December 11, 2025 08:28
@gemini-code-assist
Contributor

Summary of Changes

Hello @ilopezluna, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request integrates a significant update to the core llama.cpp library, specifically bumping it to version b7356. This update is accompanied by extensive refactoring and new features within the server component, primarily focusing on improving generation capabilities and internal architecture. Key changes include enabling multiple completion choices per request, a more sophisticated speculative decoding implementation, and a revamped task and slot management system to support these advanced features. Additionally, the server now boasts a more robust model loading and argument handling mechanism through a new preset system, alongside refinements to HTTP proxying for better compatibility and performance.

Highlights

  • llama.cpp Core Update: The llama.cpp subproject has been updated to commit b7356, bringing in the latest features and bug fixes from the upstream project.
  • Multiple Completion Choices (n_cmpl): The server now supports generating multiple completion choices (n_cmpl > 1) for a single request, removing previous restrictions and allowing for parallel sampling.
  • Enhanced Speculative Decoding: The speculative decoding mechanism has been significantly refined, introducing new slot states (SLOT_STATE_WAIT_OTHER), dedicated fields for draft tokens (i_batch_dft, drafted), and a more integrated sample-and-accept process.
  • Refactored Task and Slot Management: The server's internal task and slot management has been improved with parent-child task relationships, state copying between slots (copy_state_to), and more granular slot states to handle complex generation scenarios.
  • Model Preset System: A new server_presets system has been introduced to manage model configurations and arguments, allowing for more flexible loading of cached, local, and custom-path models, and better argument merging.
  • Improved HTTP Proxying: HTTP proxying for model instances has been enhanced to prevent duplicate headers (e.g., Server, Transfer-Encoding, Content-Length, CORS headers) and correctly handle Content-Type when forwarding responses; a minimal illustrative sketch follows this list.
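
To make the proxy-header point concrete, here is a minimal, hypothetical sketch of the deduplication idea; the helper name, container type, and exact header set are assumptions, not the PR's actual implementation:

```cpp
#include <algorithm>
#include <cctype>
#include <map>
#include <set>
#include <string>

// Hypothetical helper (not the PR's code): drop headers that the proxy layer
// sets itself, so the forwarded upstream response does not carry duplicates
// of Server, Transfer-Encoding, Content-Length, or CORS headers.
static void strip_proxy_owned_headers(std::multimap<std::string, std::string> & headers) {
    static const std::set<std::string> owned = {
        "server", "transfer-encoding", "content-length",
        "access-control-allow-origin", "access-control-allow-methods",
        "access-control-allow-headers",
    };
    for (auto it = headers.begin(); it != headers.end(); ) {
        std::string key = it->first;
        std::transform(key.begin(), key.end(), key.begin(),
                       [](unsigned char c) { return (char) std::tolower(c); });
        if (owned.count(key)) {
            it = headers.erase(it); // the proxy layer will re-add its own value
        } else {
            ++it;
        }
    }
}
```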
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page; here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature               Command               Description
Code Review           /gemini review        Performs a code review for the current pull request in its current state.
Pull Request Summary  /gemini summary       Provides a summary of the current pull request in its current state.
Comment               @gemini-code-assist   Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help                  /gemini help          Displays a list of available commands.

Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

You can also get AI-powered code generation, chat, and code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

Contributor

@sourcery-ai sourcery-ai bot left a comment


Hey there - I've reviewed your changes - here's some feedback:

  • In the speculative decoding acceptance loop, SLT_DBG logs slot.drafted.size() after slot.drafted has been cleared, so the logged accepted/total draft counts will always show 0 for the draft size; capture the original n_draft (already computed) or move the debug log before clearing drafted.
  • The new task_result_state / is_updated plumbing assumes server_task_result::update() is always called before to_json(); if any existing or future call site bypasses server_response_reader::next() and calls to_json() directly, it will hit the assertion—consider either documenting this invariant clearly or providing a safer default path for non-streaming/non-HTTP usages. A minimal guard sketch follows this list.
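
As one possible shape of that safer path, here is a minimal, hypothetical sketch; only update(), to_json(), is_updated, task_result_state, and server_response_reader::next() are names taken from the PR, while the struct layout and error handling are assumptions:

```cpp
#include <stdexcept>
#include <nlohmann/json.hpp>

using json = nlohmann::json;

// Hypothetical sketch of the invariant guard: to_json() refuses to run on a
// result whose update() has not been called, instead of tripping an assert
// deep inside serialization. Struct name and fields are illustrative only.
struct task_result_sketch {
    bool is_updated = false;

    // Normally invoked by server_response_reader::next() before serialization;
    // fills in the per-choice state (task_result_state) for this result.
    void update() {
        // ... populate state ...
        is_updated = true;
    }

    json to_json() const {
        if (!is_updated) {
            // Safer default for non-streaming / non-HTTP callers that bypass next().
            throw std::runtime_error("task result: update() must be called before to_json()");
        }
        return json::object(); // ... serialized fields ...
    }
};
```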
Prompt for AI Agents
Please address the comments from this code review:

## Overall Comments
- In the speculative decoding acceptance loop, `SLT_DBG` logs `slot.drafted.size()` after `slot.drafted` has been cleared, so the logged accepted/total draft counts will always show 0 for the draft size; capture the original `n_draft` (already computed) or move the debug log before clearing `drafted`.
- The new `task_result_state` / `is_updated` plumbing assumes `server_task_result::update()` is always called before `to_json()`; if any existing or future call site bypasses `server_response_reader::next()` and calls `to_json()` directly, it will hit the assertion—consider either documenting this invariant clearly or providing a safer default path for non-streaming/non-HTTP usages.

## Individual Comments

### Comment 1
<location> `llamacpp/native/src/server/server-context.cpp:2573-2582` </location>
<code_context>

                 // the accepted tokens from the speculation
-                const auto ids = common_sampler_sample_and_accept_n(slot.smpl, ctx, draft);
+                const auto ids = common_sampler_sample_and_accept_n(slot.smpl, ctx, slot.i_batch_dft, slot.drafted);
+                slot.i_batch_dft.clear();
+                slot.drafted.clear();
</code_context>

<issue_to_address>
**issue (bug_risk):** Speculative decoding debug log uses `slot.drafted.size()` after the vector is cleared.

After `common_sampler_sample_and_accept_n` you clear `slot.drafted`, but the final `SLT_DBG` still uses `slot.drafted.size()` for the total draft count, so it will always log `0`. Capture the draft count before clearing (e.g., `const auto n_draft = slot.drafted.size();`) and log that, or move the log before `slot.drafted.clear()`.
</issue_to_address>
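
One possible shape of that fix, sketched against the hunk above (assuming no `n_draft` is already in scope at this point; elided code is marked with `// ...`):

```cpp
                // the accepted tokens from the speculation
                const auto n_draft = slot.drafted.size();   // capture before the vectors are cleared
                const auto ids = common_sampler_sample_and_accept_n(slot.smpl, ctx, slot.i_batch_dft, slot.drafted);
                slot.i_batch_dft.clear();
                slot.drafted.clear();

                // ...

                SLT_DBG(slot, "accepted %d/%d draft tokens, new n_tokens = %d\n",
                        (int) ids.size() - 1, (int) n_draft, slot.prompt.n_tokens());
```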

### Comment 2
<location> `llamacpp/native/src/server/server-context.cpp:429-437` </location>
<code_context>
         return res;
     }
+
+    void copy_state_to(server_slot & other) const {
+        llama_memory_seq_rm(llama_get_memory(ctx), other.id, 0, -1);
+        llama_memory_seq_cp(llama_get_memory(ctx), id, other.id, 0, -1);
+        other.n_decoded   = n_decoded;
+        other.n_remaining = n_remaining;
+        other.i_batch     = i_batch;
+        other.n_prompt_tokens_cache     = n_prompt_tokens_cache;
+        other.n_prompt_tokens_processed = n_prompt_tokens_processed;
+        other.prompt = prompt.clone();
+    }
 };
</code_context>

<issue_to_address>
**suggestion (bug_risk):** copy_state_to does not reset child-specific transient state, which can leak speculative/decoder bookkeeping to children.

When copying from a parent to a child in `copy_state_to`, you only copy KV/memory and a few counters, but leave per-generation transient fields on `other` (e.g., `drafted`, `i_batch_dft`, and similar speculative/decoder bookkeeping) unchanged. If the child slot was previously used, these fields may contain stale values and influence later speculative decoding or sampling. Please ensure all speculative/transient generation state on `other` is reset or reinitialized so children always start cleanly from the parent’s state.

Suggested implementation:

```cpp
    void copy_state_to(server_slot & other) const {
        // copy KV / memory state from this slot to the target slot
        llama_memory_seq_rm(llama_get_memory(ctx), other.id, 0, -1);
        llama_memory_seq_cp(llama_get_memory(ctx), id, other.id, 0, -1);

        // copy persistent generation counters / prompt state
        other.n_decoded   = n_decoded;
        other.n_remaining = n_remaining;
        other.i_batch     = i_batch;
        other.n_prompt_tokens_cache     = n_prompt_tokens_cache;
        other.n_prompt_tokens_processed = n_prompt_tokens_processed;
        other.prompt = prompt.clone();

        // reset all transient / speculative decoding state on the child so it starts clean
        // from the parent's state without inheriting stale bookkeeping from a previous use
        other.drafted.clear();
        other.i_batch_dft = 0;

        other.n_draft_accepted = 0;
        other.n_draft_total    = 0;
        other.draft_ratio      = 0.0f;

        other.has_draft        = false;
        other.draft_in_flight  = false;

        other.i_batch_sampling = 0;
        other.sampled_tokens.clear();
    }

```

The above implementation assumes the following transient/speculative members exist on `server_slot` (or its enclosing type):
- `std::vector<...> drafted;`
- `int i_batch_dft;`
- `int n_draft_accepted;`
- `int n_draft_total;`
- `float draft_ratio;`
- `bool has_draft;`
- `bool draft_in_flight;`
- `int i_batch_sampling;`
- `std::vector<...> sampled_tokens;`

If the actual names or types differ in your codebase, adjust the reset section accordingly to:
1. Clear all containers that hold speculative or per-draft data (e.g., `drafted`, `draft_tokens`, etc.).
2. Reset all speculative indices/counters (e.g., `i_batch_dft`, `n_draft_accepted`, `n_draft_total`, acceptance ratios).
3. Reset any flags that indicate draft/speculative activity (e.g., `has_draft`, `draft_in_flight`, `sampling_in_progress`).

Ensure that every field that is mutated only during a single generation/speculative pass is either:
- copied from the parent when it is actually part of the persistent state, or
- explicitly reset here so reused child slots never carry stale speculative state.
</issue_to_address>

Sourcery is free for open source - if you like our reviews please consider sharing them ✨
Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request incorporates a significant update from the upstream llama.cpp repository, introducing several major features and refactorings. Key enhancements include support for generating multiple completions (n > 1) through a new parent-child task model, a substantial rework of the speculative decoding logic for improved performance and organization, and a cleaner state management design for chat messages. The changes also bring more flexible configuration for multi-model serving via presets and improve overall server robustness with better error handling. My review identified a minor logging bug and an opportunity for code simplification, but overall, the changes are of high quality and represent a solid step forward.

Comment on lines +501 to +505
for (auto it = map_idx_to_media.begin(); it != map_idx_to_media.end(); ++it) {
size_t idx = it->first;
const mtmd::input_chunk_ptr & chunk = it->second;
res.map_idx_to_media[idx] = mtmd::input_chunk_ptr(mtmd_input_chunk_copy(chunk.get()));
}

Severity: medium

For better readability and to use a more modern C++ approach, this iterator-based for loop can be replaced with a range-based for loop. This makes the code more concise and easier to understand.

    for (const auto & pair : map_idx_to_media) {
        size_t idx = pair.first;
        const mtmd::input_chunk_ptr & chunk = pair.second;
        res.map_idx_to_media[idx] = mtmd::input_chunk_ptr(mtmd_input_chunk_copy(chunk.get()));
    }
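
If the codebase targets C++17 (llama.cpp generally does), a structured binding would be terser still; an equivalent variant of the suggestion above:

```cpp
    for (const auto & [idx, chunk] : map_idx_to_media) {
        res.map_idx_to_media[idx] = mtmd::input_chunk_ptr(mtmd_input_chunk_copy(chunk.get()));
    }
```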

}

- SLT_DBG(slot, "accepted %d/%d draft tokens, new n_tokens = %d\n", (int) ids.size() - 1, (int) draft.size(), slot.prompt.n_tokens());
+ SLT_DBG(slot, "accepted %d/%d draft tokens, new n_tokens = %d\n", (int) ids.size() - 1, (int) slot.drafted.size(), slot.prompt.n_tokens());

Severity: medium

There appears to be a bug in this debug log. slot.drafted.size() is used to report the number of generated draft tokens, but slot.drafted is cleared on line 2575, causing this to always log 0. The n_draft variable, which captured the size before clearing, should be used instead for accurate logging.

                SLT_DBG(slot, "accepted %d/%d draft tokens, new n_tokens = %d\n", (int) ids.size() - 1, (int) n_draft, slot.prompt.n_tokens());

@ilopezluna ilopezluna merged commit ff48957 into main Dec 11, 2025
9 checks passed
@ilopezluna ilopezluna deleted the bump-llamacpp branch December 11, 2025 10:05
