server: implement GLM-style MTP #15225

Draft
wants to merge 4 commits into master

Conversation


@F1LM1 commented on Aug 11, 2025

This is very much a draft/proof of concept I'm playing with, just one idea for an MTP implementation. Planning to test on GLM-4.5 because it's the only model out there for which we've preserved NextN tensors.

From what I can tell:

  • the three models with MTP implemented in vLLM right now are all "DeepseekV3-style,"
  • they each have only one MTP head, which predicts the token at position n+2,
  • the MTP layers take as input the output embedding from the last conventional layer plus their own input embedding (see the sketch after this list).
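
For reference, here is a minimal sketch (against the ggml graph-building API) of the dataflow through one such head. The enorm/hnorm/eh_proj/shared-head weight names are just my reading of the NextN tensors, and the single NextN transformer layer itself is elided, so treat this as an illustration of the structure rather than the actual implementation:

```cpp
#include "ggml.h"

// Sketch of how a single DeepseekV3/GLM-style MTP head produces logits for
// token t+2 from the state available after decoding token t+1. The single
// NextN transformer layer (attention + FFN) is elided; weight names follow
// the NextN tensor naming as I understand it and may not match exactly.
static ggml_tensor * build_mtp_head(
        ggml_context * ctx,
        ggml_tensor  * h_prev,    // last-layer output embedding of token t+1  [n_embd, n_tokens]
        ggml_tensor  * tok_embd,  // input embedding of the sampled token t+1  [n_embd, n_tokens]
        ggml_tensor  * enorm_w,   // RMS-norm weight for the token embedding
        ggml_tensor  * hnorm_w,   // RMS-norm weight for the hidden state
        ggml_tensor  * eh_proj,   // projection back to model width            [2*n_embd, n_embd]
        ggml_tensor  * head_w,    // shared output head                        [n_embd, n_vocab]
        float          rms_eps) {
    // normalize the two inputs independently
    ggml_tensor * e = ggml_mul(ctx, ggml_rms_norm(ctx, tok_embd, rms_eps), enorm_w);
    ggml_tensor * h = ggml_mul(ctx, ggml_rms_norm(ctx, h_prev,   rms_eps), hnorm_w);

    // concatenate along the embedding dimension and project back to n_embd
    ggml_tensor * cur = ggml_mul_mat(ctx, eh_proj, ggml_concat(ctx, e, h, 0));

    // ... the single NextN transformer layer would run here ...

    // shared output head -> logits for token t+2
    return ggml_mul_mat(ctx, head_w, cur);
}
```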

So implementation-wise it seems like:

  • we should try to reuse the existing speculative decode functionality (including nice things like main-model KV cache management, the various samplers, etc.),
  • but a lot of the full draft-model functionality is redundant or even harmful here, e.g. context/cache management for the draft model and vocab matching,
  • it probably makes sense to write a new, vastly simplified function like mtp_speculative_gen_draft in speculative.cpp and branch into it in server.cpp when a slot has MTP, versus the existing common_speculative_gen_draft (see the sketch after this list).
  • AFAICT the server.cpp loop currently alternates between a conventional forward pass and a draft pass, which in the MTP case will probably sabotage the performance gains: alternating one plain decode (1 token) with one speculative decode (at most 2 tokens, given a single MTP head and zero rejections) yields at most 3 tokens per 2 passes, i.e. 1.5 tok/pass instead of the 2 tok/pass we'd get if every pass were speculative. Let me know if this isn't the case!—but if it is, we should probably avoid doing non-speculative decodes after the first response token.
  • It also doesn't make sense to have to manage a distinct ctx_dft in this case. It's a bit hacky, but I was thinking we could just set ctx_dft = ctx and have both the normal and MTP passes write over the shared ctx logits. I think this minimizes the required code changes elsewhere.
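
To make the proposed branching concrete, here is a rough sketch. It leans on several assumptions: llama_mtp_decode() does not exist and stands in for whatever mechanism ends up running only the NextN head, and slot.has_mtp, slot.spec, params_spec and prompt_tgt are placeholder names rather than the real server.cpp variables.

```cpp
#include "common.h"   // llama_tokens, common_batch_add
#include "llama.h"

// speculative.cpp (sketch): a vastly simplified draft generator for MTP.
// Unlike common_speculative_gen_draft there is no separate draft model or
// context: it reuses the main context, so no prompt replay, no draft-cache
// reconciliation and no vocab matching. llama_mtp_decode() is hypothetical;
// it stands in for running only the NextN head on top of the hidden state
// left behind by the main model's last decode.
llama_tokens mtp_speculative_gen_draft(
        llama_context * ctx,      // shared with the main model (ctx_dft == ctx)
        llama_token     id_last,  // token just sampled by the main model
        llama_pos       n_past) { // position where id_last sits
    llama_batch batch = llama_batch_init(1, 0, 1);
    common_batch_add(batch, id_last, n_past, { 0 }, true);

    llama_mtp_decode(ctx, batch); // hypothetical MTP-only pass, writes logits into ctx

    // greedy pick of the single draft token; a real version would go through
    // the common sampler so the existing acceptance logic applies
    const float * logits  = llama_get_logits_ith(ctx, 0);
    const int     n_vocab = llama_vocab_n_tokens(llama_model_get_vocab(llama_get_model(ctx)));

    llama_token id_draft = 0;
    for (llama_token id = 1; id < n_vocab; ++id) {
        if (logits[id] > logits[id_draft]) {
            id_draft = id;
        }
    }

    llama_batch_free(batch);
    return { id_draft };
}
```

The corresponding branch in the server loop might then look roughly like this:

```cpp
// server.cpp (sketch): use the MTP head when the slot's model carries NextN
// weights, otherwise fall back to the existing draft-model path
llama_tokens draft;
if (slot.has_mtp) {
    draft = mtp_speculative_gen_draft(ctx, id_last, slot.n_past);
} else {
    draft = common_speculative_gen_draft(slot.spec, params_spec, prompt_tgt, id_last);
}
```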

This is my first time (1) working with ML stuff outside of Python and (2) attempting to contribute, so patience is appreciated :)

@ggerganov added the hot label on Aug 12, 2025
Labels: examples, hot, server