Conversation
Ports the Multi-Token Prediction (MTP) architecture to the older `llama.cpp` codebase used by `ikllama`. Changes include:

- Updating `llama_batch` to support `mtp_params`.
- Modifying `llama_decode_internal` (and `encode`) to handle MTP operations (Warmup, Update, Draft).
- Adding public APIs for MTP state management (`llama_set_draft_input_hidden_state`).
- Adapting the embedding extraction logic to skip MTP update passes.
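The batch-level plumbing described above might look roughly like the following sketch. Only `llama_batch`, `mtp_params`, the Warmup/Update/Draft operations, and the "skip embeddings on MTP update" behaviour come from the PR text; every other name and the struct layout are illustrative assumptions, not the actual ik_llama.cpp code.

```cpp
#include <cassert>

// Illustrative sketch only: the real ik_llama.cpp types differ.
// Per the PR, llama_batch gains an mtp_params member describing which
// MTP operation (Warmup, Update, Draft) the batch performs.
enum mtp_op { MTP_OP_NONE, MTP_OP_WARMUP, MTP_OP_UPDATE, MTP_OP_DRAFT };

struct mtp_params {
    mtp_op op = MTP_OP_NONE;
};

struct llama_batch_sketch {      // hypothetical stand-in for the extended llama_batch
    int        n_tokens = 0;
    mtp_params mtp;
};

// Per the PR, embedding extraction skips MTP update passes: only
// batches that are not MTP updates should yield embeddings.
bool should_extract_embeddings(const llama_batch_sketch & batch) {
    return batch.mtp.op != MTP_OP_UPDATE;
}
```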
I think it would be better to first achieve a performance improvement via MTP before adding MTP for more models.
Have you tried …? I found that …, but I never investigated too deeply why, as I've now moved on to using …. Maybe worth a try though.
Have you tried using Qwen3.5-35B as a draft for Qwen3.5-397B? I remember doing that with older models and getting a decent speedup.
I'm really only using …
Are there any standardized tests to check an LLM's tool-call performance etc. that can be run locally?

Not really, but you will very quickly find out if it starts hallucinating the tool calls in the chat (…).
Okay, what about using smol-IQ1_KT fully GPU-offloaded as a draft for a larger quant with only the head and the KV-cache offloaded? I'm getting about 31 tps decode at zero ctx and 21 tps at 32k ctx. [EDIT]: naaah, I don't think it's worth it. 21 tps at 32k ctx is already slow enough. Hmm... I should probably finally try with double EPYC.
@ikawrakow To be honest, I already had the GLM5 and use it fairly often, so I wanted to add it to have a point of comparison. As for other MTPs, I don’t plan on adding them for now, especially since we don’t retain the layer and it’s unlikely anyone would want to re-quantize just to test a slow feature.
@jukofyork With MLA 1 or 3 I saw slightly lower performance; for me the ordering was no MLA > MLA3 > MLA1. To be honest, I haven't been fine-tuning the arguments for a while, but since you mentioned -draft-min, I have an idea in mind that might help better define that parameter; I'll see how it works in practice later.
@magikRUKKOLA Could you give me some details about the arguments used? I tested it with Kimi K2.5, thinking there was an incompatibility with MTP, then I tested it with GLM5 without MTP and didn't get any errors.
```bash
/opt/ik_llama.cpp/ik_llama.cpp/build/bin/llama-server \
--model /opt/ubergarm/GLM-5-GGUF/smol-IQ2_KS/GLM-5-smol-IQ2_KS-00001-of-00006.gguf \
--alias ubergarm/GLM-5-smol-IQ2_KS \
--ctx-size $((128 * 1024)) \
-b $((1024)) -ub $((1024)) \
--mlock \
--temp 0.0 --top-p 1.0 --top-k 0 \
-ctk q6_0 \
-ctv q6_0 \
-mtp \
-khad \
-ger \
-smgs \
-sas \
-muge \
-mea 16 \
-amb 16 \
--merge-qkv \
--graph-reduce-type bf16 \
--split-mode layer \
--main-gpu 0 \
--max-gpu 0 \
--n-gpu-layers 99 \
--threads $(grep ^cpu\\scores /proc/cpuinfo | uniq | awk '{print $4}' | xargs -I{} echo "{}-0" | bc) \
--host 0.0.0.0 \
--port 8080 \
--log-enable \
--logdir /var/log/ \
--jinja \
--special \
--verbosity 1 \
--verbose-prompt \
--reasoning-format auto \
--prompt-cache "$HOME/.cache/ik_llama.cpp/prompt-cache.bin" --prompt-cache-all \
--slot-save-path "$HOME/.cache/ik_llama.cpp/slot.bin" \
--lookup-cache-dynamic "$HOME/.cache/ik_llama.cpp/slot.bin" \
--keep -1 \
--slot-prompt-similarity 0.35 \
--metrics \
-cuda fusion=1
```

[EDIT]: woops. I had to use …
On mainline …

[EDITED]: …
What arguments should I use once again? How do I set the draft size? [EDIT]: Oh, I see. So via the …
It's with …
@magikRUKKOLA I wasn't able to reproduce the same error with your arguments; the only difference was that I couldn't fully offload such a large model to the GPU. That said, there were some errors that occurred, and they were fixed after the most recent rebase of the branch. Since your first test was done before that, please try making a new pull. To provide more context, the models that have MTP and support it are GLM 4.5/4.6/4.7 and 5.0. You can try running the -mtp flag with any other model and it will be disabled (I used Kimi K2.5 as a test to see if this logic was causing your crash). Currently, MTP only supports --draft-max and --draft-p-min.
@jukofyork I believe that certain parameters, such as draft-max, draft-min, and p-min, could be optimized, perhaps using a controller that adjusts the parameters based on the hit rate of the speculative model. Since you're running some tests, are there any parameters you'd like me to test?
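A minimal sketch of that controller idea: track the draft acceptance (hit) rate with an exponential moving average and grow or shrink the draft length between bounds. All names, thresholds, and constants here are hypothetical, chosen only to illustrate the feedback loop; nothing like this exists in ik_llama.cpp yet.

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical adaptive controller for the draft length (--draft-max).
// After each verification pass we observe how many drafted tokens were
// accepted, smooth that rate with an EMA, and nudge the draft length:
// a high hit rate means longer drafts pay off, a low one means they
// waste draft-model compute. Thresholds are illustrative.
struct draft_controller {
    double ema   = 0.75;  // smoothed acceptance rate (optimistic prior)
    double alpha = 0.10;  // EMA weight for new observations
    int n_draft  = 4;     // current draft length
    int n_min = 1, n_max = 16;

    // accepted / drafted token counts from one verification pass
    void update(int accepted, int drafted) {
        if (drafted == 0) return;
        const double rate = (double) accepted / drafted;
        ema = (1.0 - alpha) * ema + alpha * rate;
        if (ema > 0.80) {
            n_draft = std::min(n_draft + 1, n_max);
        } else if (ema < 0.50) {
            n_draft = std::max(n_draft - 1, n_min);
        }
    }
};
```

A real controller would also have to weigh the relative cost of draft versus target tokens (which differs between full-GPU and hybrid setups), but the EMA-plus-bounds loop is the smallest version of the idea.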
Aha! Yes, it does not crash indeed. It's like without …. Overall, with …
Don't worry, one day it will be optimized enough to be worth it (I hope).
Should I re-try with hybrid inference?
See the posts in this thread, starting here: ggml-org/llama.cpp#10466 (comment). I tried to simplify it to the bare minimum here: …, but nobody seemed interested and mainline …. The key thing from all my experiments is that you can't really just use a fixed ….
Some kind of adaptive controller would be the next step, but there was pretty much zero interest in that discussion and PR... I'm also not convinced the current logic is correct: ggml-org/llama.cpp#10466 (comment). The code has got so many tricky optimisations in it now, though. I think you can show that if …. If you look at the costs for my …
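The cost argument can be made concrete with a toy model: assume each drafted token is accepted independently with probability p and costs a fraction c of a target-model token. The best draft length then depends on p, which is one way to see why a single fixed setting can't suit every context. This is a simplifying sketch, not the cost model used by any llama.cpp implementation:

```cpp
#include <cmath>

// Toy cost model for speculative decoding. Simplifying assumption:
// i.i.d. per-token acceptance probability p (real acceptance is
// position- and context-dependent).
//
// Each verification pass accepts a geometric prefix of the n drafted
// tokens plus one token from the target model:
//   E[tokens] = 1 + p + p^2 + ... + p^n = (1 - p^(n+1)) / (1 - p)
double expected_tokens(double p, int n) {
    if (p >= 1.0) return n + 1.0;
    return (1.0 - std::pow(p, n + 1)) / (1.0 - p);
}

// If drafting one token costs c target-token units, one pass costs
// n*c + 1 units, so the throughput gain over plain decoding is:
double speedup(double p, int n, double c) {
    return expected_tokens(p, n) / (n * c + 1.0);
}
```

For example, with c = 0.1 a draft length of 4 is a clear win at p = 0.8 but gives only a marginal gain at p = 0.4, which is the intuition behind adapting the draft length to the observed acceptance rate.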
@magikRUKKOLA If you want to test whether the GLM5 MTP code works, go ahead, I appreciate it; but in terms of performance, it shouldn't make much of a difference.
@jukofyork This is great material; I need more time to read through the details, but I'll definitely use it when I start working on this feature. I believe parameter inference can be done in real time, which allows adapting the settings to the user's needs and use cases. At the end of the session, a snapshot of the current metrics could be provided so that the user can use it as a default in the future if they wish.
GLM5 IQ2_KL, without … / with …
The performance loss is consistent with my tests, which leads me to believe that the initial gains will be in hybrid/CPU-only inference, but that in the future the main gains will come from the GPU. |

Add MTP support for GLM-5. To try it, use the -mtp flag to activate it, and --draft-max / --draft-p-min to control how many tokens are drafted.
Tests applied
I copied the "Top" section of the YouTube article from Wikipedia: https://en.wikipedia.org/wiki/YouTube

GLM 5 smol-IQ2_KS - Draft size = 10, p-min = 0.85, -ot "blk.78..*=CUDA1", --seed 42
Without MTP vs With MTP