
Conversation

@allozaur
Collaborator

@allozaur allozaur commented Nov 3, 2025

Close #16097

Add Continue and Save features for chat messages

What's new

Continue button for assistant messages

  • Click the arrow button on any assistant response to keep generating from where it left off
  • Useful for getting longer outputs or continuing after you've edited a response
  • New content gets appended to the existing message

Save button when editing user messages

  • Now you get three options when editing: Cancel, Save, and Send
  • Save keeps your edit without regenerating the response (preserves the conversation below)
  • Send saves and regenerates like before
  • Useful when you just want to fix a typo without losing the assistant's response

Technical notes

  • Added continueAssistantMessage() and editUserMessagePreserveResponses() methods to ChatStore (see the sketch after these notes)
  • Continue feature sends a synthetic "continue" prompt to the API (not saved to the database)
  • Assistant message edits now preserve trailing whitespace for proper continuation
  • Follows existing component architecture patterns
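
For reference, a rough, non-authoritative sketch of the flow described in these notes. Only the two method names come from this PR; the message shape, the requestCompletion helper, and the endpoint wiring are assumptions for illustration, not the actual ChatStore implementation.

// Illustrative sketch only -- shapes and helper names are assumptions;
// only continueAssistantMessage() and editUserMessagePreserveResponses()
// are names introduced by this PR.
interface ChatMessage {
    id: string;
    role: 'user' | 'assistant';
    content: string;
}

class ChatStore {
    messages: ChatMessage[] = [];

    // Continue generating an assistant message: send the history plus an
    // ephemeral "continue" user turn (never persisted to the database),
    // then append the returned text to the existing message.
    async continueAssistantMessage(messageId: string): Promise<void> {
        const target = this.messages.find((m) => m.id === messageId);
        if (!target || target.role !== 'assistant') return;

        const history = this.messages.slice(0, this.messages.indexOf(target) + 1);
        const synthetic: ChatMessage = { id: 'ephemeral', role: 'user', content: 'continue' };

        const continuation = await this.requestCompletion([...history, synthetic]);
        target.content += continuation; // trailing whitespace on the target is preserved
    }

    // Save an edited user message without regenerating the assistant responses below it.
    editUserMessagePreserveResponses(messageId: string, newContent: string): void {
        const target = this.messages.find((m) => m.id === messageId);
        if (target && target.role === 'user') target.content = newContent;
    }

    private async requestCompletion(messages: ChatMessage[]): Promise<string> {
        const res = await fetch('/v1/chat/completions', {
            method: 'POST',
            headers: { 'Content-Type': 'application/json' },
            body: JSON.stringify({ messages: messages.map(({ role, content }) => ({ role, content })) }),
        });
        const data = await res.json();
        return data.choices[0].message.content;
    }
}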

Demos

ggml-org/gpt-oss-20b-GGUF

demo1.mp4

unsloth/Qwen3-Coder-30B-A3B-Instruct-GGUF

demo2.mp4

@allozaur allozaur requested review from ggerganov and ngxson November 3, 2025 15:14
Collaborator Author

@allozaur allozaur left a comment


@ngxson @ggerganov lemme know if you think that this logic works for handling the "Continue" action for assistant messages.

@artyfacialintelagent feel free to test this out and give feedback!

@ggerganov
Member

Is this supposed to work correctly when pressing Continue after stopping a response while it is generating? I am testing with gpt-oss and after Continue the text does not seem to resume as expected.

@allozaur
Collaborator Author

allozaur commented Nov 3, 2025

Is this supposed to work correctly when pressing Continue after stopping a response while it is generating? I am testing with gpt-oss and after Continue the text does not seem to resume as expected.

I've tested it for the edited assistant responses so far. I will take a close look at the stopped generation -> continue flow as well.

@Iq1pl

Iq1pl commented Nov 5, 2025

Is this supposed to work correctly when pressing Continue after stopping a response while it is generating? I am testing with gpt-oss and after Continue the text does not seem to resume as expected.

When using gpt-oss in LM Studio, the model generates a new response instead of continuing the previous text. This is because of the Harmony parser; uninstalling it resolves the issue and the model continues the generation successfully.

@allozaur allozaur force-pushed the 16097-continue-response branch 2 times, most recently from f4c3aeb to b8e4bb4 on November 12, 2025 17:00
@allozaur
Collaborator Author

@ggerganov please check the demos I've attached to the PR description and also test this feature on your end. Looking forward to your feedback!

@ggerganov
Member

Continue feature sends a synthetic "continue" prompt to the API (not saved to the database)

Hm, I wonder why it's done like this. We already have support on the server to continue the assistant message if it is the last one in the request (#13174):

// if the assistant message appears at the end of list, we do not add end-of-turn token
// for ex. this can be useful to modify the reasoning process in reasoning models
bool prefill_assistant_message = !inputs.messages.empty() && inputs.messages.back().role == "assistant" && opt.prefill_assistant;
common_chat_msg last_message;
if (prefill_assistant_message) {
    last_message = inputs.messages.back();
    inputs.messages.pop_back();

    /* sanity check, max one assistant message at the end of the list */
    if (!inputs.messages.empty() && inputs.messages.back().role == "assistant"){
        throw std::runtime_error("Cannot have 2 or more assistant messages at the end of the list.");
    }

    /* TODO: test this properly */
    inputs.reasoning_format = COMMON_REASONING_FORMAT_NONE;

    if ( inputs.enable_thinking ) {
        throw std::runtime_error("Assistant response prefill is incompatible with enable_thinking.");
    }

    inputs.add_generation_prompt = true;
}

The current approach often does not continue properly, as can be seen in the sample videos:

[screenshot]

Using the assistant prefill functionality above would make this work correctly in all cases.
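
For comparison, a hedged sketch of what a prefill-based request from the webui could look like, assuming the server-side prefill above is enabled: the partial assistant text is sent as the last message, and the generated continuation is appended client-side. The function name and data shapes here are illustrative, not part of this PR.

// Sketch (assumed shapes): let the server prefill the response with the
// partial assistant message instead of sending a synthetic "continue" prompt.
async function continueViaPrefill(
    history: { role: string; content: string }[],
    partialAssistantText: string,
): Promise<string> {
    const res = await fetch('/v1/chat/completions', {
        method: 'POST',
        headers: { 'Content-Type': 'application/json' },
        body: JSON.stringify({
            // The assistant message must be the LAST entry for prefill to kick in.
            messages: [...history, { role: 'assistant', content: partialAssistantText }],
        }),
    });
    const data = await res.json();
    // With prefill, the response should contain only the newly generated text,
    // so the caller appends it to the existing partial message.
    return partialAssistantText + data.choices[0].message.content;
}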

@ngxson
Collaborator

ngxson commented Nov 13, 2025

Agree with @ggerganov , it's better to use the prefill assistant message from #13174

Just one thing to note though: I think most templates do not support formatting the reasoning content back to the original, so that's probably the only case where it will break.

@allozaur
Collaborator Author

Thanks guys, I missed that! Will patch it and come back to you.

@allozaur
Collaborator Author

@ggerganov @ngxson

I've updated the logic with 859e496 and I have tested with a few models; only one (Qwen3-VL-32B-Instruct-GGUF) managed to properly continue the assistant message in response to the prefill request. See videos below.

Qwen3-VL-32B-Instruct-GGUF

Qwen3-VL-32B-Instruct-GGUF.mov

ggml-org/gpt-oss-20b-gguf

gpt-oss-20b-gguf.mov

ggml-org/gpt-oss-120b-gguf

gpt-oss-120b-gguf.mov

unsloth/gemma3-12b-it-gguf

demo.mp4

@ggerganov
Member

For me, both Qwen3 and Gemma3 are able to complete successfully. For example, here is Gemma3 12B IT:

webui-continue-0.mp4

It's strange that it didn't work for you.

Regarding gpt-oss - I think that "Continue" has to also send the reasoning in this case. Currently, it is discarded and I think it confuses the model.

@allozaur
Collaborator Author

allozaur commented Nov 13, 2025

Regarding gpt-oss - I think that "Continue" has to also send the reasoning in this case. Currently, it is discarded and I think it confuses the model.

Should we then address the thinking models differently for now, at least from the WebUI perspective?

It's strange that it didn't work for you.

I will do some more testing with other instruct models and make sure all is working right.

@ngxson
Collaborator

ngxson commented Nov 13, 2025

It's likely due to the chat template. I suspect some chat templates (especially jinja) add the generation prompt. Can you verify what the chat template looks like with the POST /apply-template endpoint? (The request body is the same as for /chat/completions.)
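
For anyone reproducing this check, a rough sketch under assumptions: the example messages are made up, and the response is assumed to carry the rendered prompt in a prompt field.

// Sketch: render the template for a conversation that ends with a partial
// assistant message, then inspect whether a new generation header was added.
async function inspectTemplate(): Promise<void> {
    const res = await fetch('/apply-template', {
        method: 'POST',
        headers: { 'Content-Type': 'application/json' },
        body: JSON.stringify({
            messages: [
                { role: 'user', content: 'Write a haiku about autumn.' },
                { role: 'assistant', content: 'Golden leaves drift down' }, // partial response to continue
            ],
        }),
    });
    const data = await res.json();
    console.log(data.prompt); // if this ends with a fresh assistant header, the template reopened the turn
}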

@ggerganov
Member

Regarding gpt-oss - I think that "Continue" has to also send the reasoning in this case. Currently, it is discarded and I think it confuses the model.

Should we then address the thinking models differently for now, at least from the WebUI perspective?

If it's not too complicated, I'd say change the logic so that "Continue" includes the reasoning of the last assistant message for all reasoning models.

@ngxson
Collaborator

ngxson commented Nov 13, 2025

If it's not too complicated, I'd say change the logic so that "Continue" includes the reasoning of the last assistant message for all reasoning models.

The main issue is that some chat templates actively suppress the reasoning content from assistant messages, so I doubt it will work across all models.

Actually, I'm thinking about a more generic approach: we could implement a feature in the backend such that the "raw" generated text (i.e. with <think>, <reasoning>, etc.) can be sent along with the already-parsed version.

I would say for now, we can put a warning in the webui to tell users that this feature is experimental and doesn't work across all models. We can improve it later if it gets more usage.

@allozaur
Collaborator Author

I would say for now, we can put a warning in the webui to tell users that this feature is experimental and doesn't work across all models. We can improve it later if it gets more usage.

Gotcha, @ngxson, let's do that

@ggerganov
Member

I would say for now, we can put a warning in the webui to tell users that this feature is experimental and doesn't work across all models. We can improve it later if it gets more usage.

For reasoning models we can also disable continue altogether - I don't think it is useful for reasoning models as it is, because it loses its reasoning trace. Also, looking at the logs, gpt-oss tends to produce some gibberish tokens that are not displayed in the UI when you use continue:

[screenshots]

@Iq1pl

Iq1pl commented Nov 14, 2025

I don't want to interfere with you experts; I just want to share my insight, as I also struggled with gpt-oss on this issue - see the video. Of course, the implementation of Harmony might be different in llama.cpp, but this is the only way to get the continue feature working for gpt-oss in LM Studio.

If it is possible, rather than disabling the function for all thinking models, adding an option to disable the Harmony parser might be better.
Thank you for your hard work.

output.mp4

@allozaur allozaur force-pushed the 16097-continue-response branch from 941ec7c to d8f952d on November 14, 2025 11:56
@allozaur
Collaborator Author

@ggerganov @ngxson

I've added this setting, and for now the "Continue" icon button is rendered only for non-reasoning models.

[Screenshot taken 2025-11-14 at 12:46]

If it is possible, rather than disabling the function for all thinking models, adding an option to disable the Harmony parser might be better.
Thank you for your hard work.

Maybe we can tackle this in the next iteration of this feature? Idk, @ngxson, do you think it's still worth doing as part of this PR, or should we revisit it in the future?

@ngxson
Collaborator

ngxson commented Nov 14, 2025

If it is possible, rather than disabling the function for all thinking models, adding an option to disable the Harmony parser might be better.

I'm a bit surprised that it doesn't work in LM Studio. IIRC LM Studio doesn't actually modify the content; they parse it for display while still keeping the original generated content under the hood. CC @mattjcly from the LM Studio team (as this is probably a bug).

As I mentioned earlier in #16971 (comment), we can preserve the raw content by introducing a new flag. But this is currently a low-prio task and we can do it later if more users need it.

@allozaur
Collaborator Author

If it is possible, rather than disabling the function for all thinking models, adding an option to disable the Harmony parser might be better.

I'm a bit surprised that it doesn't work in LM Studio. IIRC LM Studio doesn't actually modify the content; they parse it for display while still keeping the original generated content under the hood. CC @mattjcly from the LM Studio team (as this is probably a bug).

As I mentioned earlier in #16971 (comment), we can preserve the raw content by introducing a new flag. But this is currently a low-prio task and we can do it later if more users need it.

@ngxson I guess this doesn't stop us from having this PR reviewed and eventually merged?

@ngxson
Collaborator

ngxson commented Nov 14, 2025

@ngxson I guess this doesn't stop us from having this PR reviewed and eventually merged?

Yes, the current approach in this PR should be enough. Will give it a try a bit later.

@allozaur
Collaborator Author

@ngxson I guess this doesn't stop us from having this PR reviewed and eventually merged?

Yes, the current approach in this PR should be enough. Will give it a try a bit later.

Sure, lemme know!



Development

Successfully merging this pull request may close these issues.

Feature Request: WebUI needs an option to continue AI response (e.g. after editing the response)
