Glm4 mtp optimizations #4

SamuelOliveirads · 2025-10-24T02:26:56Z

I've created this draft to share my findings on what to fix or improve to make MTP usable. Currently, MTP's output quality is good, but its performance is worse than not using it at all. Therefore, it's not enough to be on par with the baseline; we need to be faster.

My initial plan is to find areas for improvement. It's not necessary to implement everything at once, but some of these should be on our radar for the future. They are:

Graph reuse
llama_context::decode calls
Multi-token drafts

There are likely more things to improve, but for now, I find these to be the most impactful. Below are my thoughts on each:

1) Graph Reuse: The baseline implementation always reuses the graph. The process is simple: it stores the graph, and in the next call to llama_context::process_ubatch, it checks whether the stored graph can be reused. If not, the stored graph is deleted and the new one is stored in its place. This works well after the first token is generated, as subsequent graphs are identical. The main bottleneck isn't the repeated call to llama_model::build_graph, but rather ggml_backend_sched_alloc_graph, which has to allocate backend resources for every new graph.

The first fix was simple: just store one graph. In this case, the main model's token generation graph, which is one of the most expensive, will always be reused. On my machine, this gave an uplift of 13.8% for small prompts.
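To make the effect of this fix concrete, here is a minimal sketch of a single-slot graph cache. It is not the llama.cpp code: `GraphParams`, `Graph`, `GraphCache`, and the `main_only` flag are hypothetical names, and the reuse check is reduced to comparing two fields. It assumes that main-model and MTP graphs alternate, so a naive single slot is overwritten on every call.

```cpp
#include <cassert>
#include <cstdint>
#include <memory>

// Hypothetical stand-in for whatever decides if two graphs are interchangeable.
struct GraphParams {
    int32_t n_tokens;
    bool    is_mtp;   // true when the graph belongs to the MTP head

    bool operator==(const GraphParams & o) const {
        return n_tokens == o.n_tokens && is_mtp == o.is_mtp;
    }
};

struct Graph {
    GraphParams params;
};

struct GraphCache {
    bool main_only = false;           // when set, MTP graphs never touch the stored slot
    std::unique_ptr<Graph> stored;    // the single reusable graph
    std::unique_ptr<Graph> scratch;   // throwaway graph for the MTP path
    int n_allocs = 0;                 // counts the expensive (re)allocations

    const Graph & get(const GraphParams & p) {
        assert(p.n_tokens > 0);
        if (main_only && p.is_mtp) {
            // MTP path: always rebuilt, but the main-model graph stays cached.
            scratch = std::make_unique<Graph>(Graph{p});
            ++n_allocs;
            return *scratch;
        }
        if (stored && stored->params == p) {
            return *stored;           // reuse: no allocation needed
        }
        stored = std::make_unique<Graph>(Graph{p});
        ++n_allocs;
        return *stored;
    }
};
```

In this toy model, three generate/draft rounds cost six allocations with the naive slot and four with `main_only` enabled, because only the MTP graph is rebuilt each round.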

Current state: Halted.

After that, I tried to store the graph for every operation, or at least for the ones that don't involve the KV cache. By applying llm_graph_context::cb to certain layers, I could store and reuse the graph, and I was able to compile and test this using only the CPU backend. However, I could not get it working with the offload policy. In theory, the cb function should handle that, but something else seems to be blocking the allocation and compute steps specifically. Could the offload policies of the main model and the MTP be getting mixed? This needs a deeper investigation, and I lack the proper knowledge in this area, so I'm setting it aside for now.

2) decode calls: MTP was successfully implemented inside decode, but it uses the old logic where each operation requires an expensive function call. Here is a comparison of how many calls we make in different scenarios:

LLM - Normal:
- Loop 1: Prompt + Generation = 1 call
- Loop 2: Token generation = 1 call
Draft Model:
- Loop 1: Prompt + Generation -> Draft generation -> Main model validation = 3 calls
- Loop 2: Token generation -> Draft generation -> Main model validation = 3 calls
MTP (Current Slow Implementation):
- Loop 1: Prompt + Generation -> MTP warmup -> MTP draft -> Main model validation -> MTP KV update = 5 calls
- Loop 2: Token generation -> MTP draft -> Main model validation -> MTP KV update = 4 calls

One way to make MTP more usable is to match the number of calls of a typical draft model. To do that, it's necessary to combine the KV cache update and the draft generation into a single call.
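The call counts above can be tallied with a small sketch. Each string stands for one decode call; `Steps`, `n_calls`, and `fuse` are hypothetical helpers, and placing the fused step where the draft used to run is an assumption about ordering (the KV update for the previous round running right before the next draft).

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// One entry per decode call in a generation loop.
using Steps = std::vector<std::string>;

inline int n_calls(const Steps & steps) {
    return static_cast<int>(steps.size());
}

// Replace step `a` with `label` and drop step `b`, modelling two operations
// that now share a single decode call. Returns the input unchanged if either
// step is missing.
Steps fuse(Steps steps, const std::string & a, const std::string & b, const std::string & label) {
    assert(!label.empty());
    auto ia = std::find(steps.begin(), steps.end(), a);
    auto ib = std::find(steps.begin(), steps.end(), b);
    if (ia == steps.end() || ib == steps.end() || ia == ib) {
        return steps;
    }
    *ia = label;      // assigning to an element does not invalidate ib
    steps.erase(ib);
    return steps;
}
```

Applied to the second MTP loop (token generation, MTP draft, main model validation, MTP KV update), fusing the draft and the KV update leaves three calls, the same as the draft-model loop.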

Current state: In progress.

I successfully merged the KV cache update with the draft generation. This required creating a custom batch and sinfo, and changing some logic regarding the embeddings and hidden states necessary for the MTP to work. The version in this branch works in terms of output, meaning it's not breaking quality. However, the draft acceptance rate has dropped to around 25%. I believe this happens because while the first step (KV update) works using the correct hidden state from the main model, the subsequent operation (draft) is using a new hidden state generated by the MTP itself during the update. I still need to confirm this theory and apply a fix to hopefully see the acceptance rate rise back to its previous level.

One last thing: this change will still require a separate warmup call on the first interaction, but this is less impactful than merging the update and draft steps. To merge the warmup step, it would be necessary to track the sinfo to know when the prompt processing has finished its last batch, and then insert a new slot for the draft token.

3) Multi-token drafts: We discussed this in another PR. The problem was that for each new draft token, the MTP's KV cache needed to be updated, which was painful to do before. Now that we are using the decode function, it's more feasible. If the unified update/draft implementation works, we could simply increase the batch and sinfo size to make the model draft more tokens.
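As a rough illustration of why the unified update/draft step helps here, the sketch below chains drafts one after another. It is a sequential toy, not the batched variant described above: `MtpStep` is a hypothetical callback standing in for the fused call, and the `p_min` cutoff is an assumption added so the chain has a stopping rule.

```cpp
#include <cassert>
#include <functional>
#include <vector>

using Token = int;

struct MtpStepResult {
    Token token;   // drafted token
    float prob;    // confidence assigned to the drafted token
};

// Hypothetical fused call: updates the MTP KV cache for `last` at position
// `pos`, then returns the draft for the following position.
using MtpStep = std::function<MtpStepResult(Token last, int pos)>;

// Draft up to n_max tokens, feeding each draft back in as the next input.
// Stops early when the confidence drops below p_min.
std::vector<Token> draft_tokens(const MtpStep & step, Token last, int pos, int n_max, float p_min) {
    assert(n_max >= 0);
    std::vector<Token> drafts;
    for (int i = 0; i < n_max; ++i) {
        const MtpStepResult r = step(last, pos + i);
        if (r.prob < p_min) {
            break;
        }
        drafts.push_back(r.token);
        last = r.token;   // the next draft depends on this one
    }
    return drafts;
}
```

Because every step already performs the KV update for its input token, no separate update call is needed between drafts, which is what made multi-token drafting painful before.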

These are some of my ideas. I'd appreciate any insights you might have on how to better handle some of these things, or even new ideas for improvements that I haven't spotted here.

…llama.cpp into glm4-mtp-graph-cache

SamuelOliveirads added 5 commits October 12, 2025 16:33

mtp-graph(feat): Reactivate graph reuse only for main model path

171346c

Merge branch 'glm4-mtp-batch' of https://github.com/SamuelOliveirads/…

15dff20

…llama.cpp into glm4-mtp-graph-cache

mtp-graph (wip): testing different ways to allow graph reuse

5859cb9

mtp-op (feat): merge update kv and draft into one operation

4812d0a

mtp-op (refactor): fix language for log

b229c6a

SamuelOliveirads mentioned this pull request Nov 2, 2025

server: implement GLM-style MTP ggml-org/llama.cpp#15225

Draft
