Conversation
Pull request overview
Adds TurboMind (C++ backend) support for GLM-4.7-Flash / GLM4 MoE Lite, including new routing + attention-kernel capabilities needed for this model family and mixed-quant deployment.
Changes:
- Introduces mixed quantization in TurboMind via a new `ffn_weight_type` (separate from `weight_type` and `expert_weight_type`) and threads it through Python config → YAML → C++ param usage.
- Adds HeadDim=576 support across TurboMind attention / decoding / KV-cache utilities via new dispatches and many codegen instantiations.
- Adds `noaux_tc` MoE routing (sigmoid/softmax scoring + correction bias) with a new CUDA kernel + wiring into the MoE layer and weight export.
Reviewed changes
Copilot reviewed 43 out of 43 changed files in this pull request and generated 8 comments.
| File | Description |
|---|---|
| src/turbomind/turbomind.cc | Parse new MoE + mixed-quant YAML fields (scoring_func, router_n_groups, ffn_weight_type). |
| src/turbomind/models/llama/moe_ffn_layer.cc | Adds noaux_tc routing dispatch and passes correction bias + scoring mode into routing kernel. |
| src/turbomind/models/llama/mla_utils.cu | Fix MLA QKV copy to iterate over heads and add explicit V padding. |
| src/turbomind/models/llama/llama_params.h | Adds ffn_weight_type and new MoE params (scoring_func, router_n_groups). |
| src/turbomind/models/llama/LlamaDenseWeight.h | Adds score_correction_bias tensor to MoE weight struct. |
| src/turbomind/models/llama/LlamaDenseWeight.cc | Allocates/registers gate.score_correction_bias when topk_method == noaux_tc. |
| src/turbomind/models/llama/LlamaDecoderLayerWeight.cc | Uses ffn_weight_type for shared-expert/dense-FFN allocation in MoE layers; keeps attention on weight_type. |
| src/turbomind/kernels/gemm/moe_utils_v2.h | Declares invokeMoeGate_NoAuxTC. |
| src/turbomind/kernels/gemm/moe_utils_v2.cu | Implements noaux_tc routing CUDA kernel + scan integration. |
| src/turbomind/kernels/attention/reduce.cu | Instantiates reduce kernel for HeadDim=576 (fp16/bf16). |
| src/turbomind/kernels/attention/kv_cache_utils_v2.cu | Adds HeadDim=576 dispatch for KV process/flatten v2. |
| src/turbomind/kernels/attention/impl_884.h | Fixes V accumulation loop bound from hardcoded 8 to compile-time V_N. |
| src/turbomind/kernels/attention/decoding_config.h | Adds Sm70 decoding config specialization for HeadDim=576. |
| src/turbomind/kernels/attention/decoding.cu | Adds HeadDim=576 decoding dispatch. |
| src/turbomind/kernels/attention/codegen/decoding_sm80_576_f16_u8.cu | Adds decoding instantiations for Sm80 f16 + u8 KV cache @ 576. |
| src/turbomind/kernels/attention/codegen/decoding_sm80_576_f16_u4.cu | Adds decoding instantiations for Sm80 f16 + u4 KV cache @ 576. |
| src/turbomind/kernels/attention/codegen/decoding_sm80_576_f16_f16.cu | Adds decoding instantiations for Sm80 f16 + f16 KV cache @ 576. |
| src/turbomind/kernels/attention/codegen/decoding_sm80_576_bf16_u8.cu | Adds decoding instantiations for Sm80 bf16 + u8 KV cache @ 576. |
| src/turbomind/kernels/attention/codegen/decoding_sm80_576_bf16_u4.cu | Adds decoding instantiations for Sm80 bf16 + u4 KV cache @ 576. |
| src/turbomind/kernels/attention/codegen/decoding_sm80_576_bf16_bf16.cu | Adds decoding instantiations for Sm80 bf16 + bf16 KV cache @ 576. |
| src/turbomind/kernels/attention/codegen/decoding_sm75_576_f16_u8.cu | Adds decoding instantiations for Sm75 f16 + u8 KV cache @ 576. |
| src/turbomind/kernels/attention/codegen/decoding_sm75_576_f16_u4.cu | Adds decoding instantiations for Sm75 f16 + u4 KV cache @ 576. |
| src/turbomind/kernels/attention/codegen/decoding_sm75_576_f16_f16.cu | Adds decoding instantiations for Sm75 f16 + f16 KV cache @ 576. |
| src/turbomind/kernels/attention/codegen/decoding_sm70_576_f16_u8.cu | Adds decoding instantiations for Sm70 f16 + u8 KV cache @ 576. |
| src/turbomind/kernels/attention/codegen/decoding_sm70_576_f16_u4.cu | Adds decoding instantiations for Sm70 f16 + u4 KV cache @ 576. |
| src/turbomind/kernels/attention/codegen/decoding_sm70_576_f16_f16.cu | Adds decoding instantiations for Sm70 f16 + f16 KV cache @ 576. |
| src/turbomind/kernels/attention/codegen/attention_sm80_576_f16.cu | Adds context-attention instantiation for Sm80 f16 @ 576. |
| src/turbomind/kernels/attention/codegen/attention_sm80_576_bf16.cu | Adds context-attention instantiation for Sm80 bf16 @ 576. |
| src/turbomind/kernels/attention/codegen/attention_sm75_576_f16.cu | Adds context-attention instantiation for Sm75 f16 @ 576. |
| src/turbomind/kernels/attention/codegen/attention_sm70_576_f16.cu | Adds context-attention instantiation for Sm70 f16 @ 576. |
| src/turbomind/kernels/attention/attention_universal.h | Changes KV processing rule to only inline-process in decoding; affects kernel behavior selection. |
| src/turbomind/kernels/attention/attention_config.h | Adds Sm70 HeadDim=576 attention config (MMA_884) and includes SIMT impl. |
| src/turbomind/kernels/attention/attention.cu | Adds HeadDim=576 dispatch for context attention. |
| src/turbomind/kernels/attention/CMakeLists.txt | Adds new codegen compilation units for HeadDim=576. |
| lmdeploy/turbomind/supported_models.py | Registers Glm4MoeLiteForCausalLM mapping to glm4-moe-lite. |
| lmdeploy/turbomind/deploy/target_model/base.py | Adds dtype matching for tm-parameter copies to avoid byte-size mismatches. |
| lmdeploy/turbomind/deploy/source_model/llama.py | Adds rope_parameters support for newer Transformers config layout. |
| lmdeploy/turbomind/deploy/source_model/glm4_moe_lite.py | Adds GLM4 MoE Lite reader/model and exports correction bias key. |
| lmdeploy/turbomind/deploy/source_model/deepseek2.py | Refines per-layer FFN key filtering and adds MLA folding metadata + scoring config fields. |
| lmdeploy/turbomind/deploy/source_model/init.py | Exposes Glm4MoeLiteModel in source-model package init. |
| lmdeploy/turbomind/deploy/module.py | Exports correction bias and adds MLA folding logic during conversion; adjusts export ordering and bounds. |
| lmdeploy/turbomind/deploy/converter.py | Adds ffn_weight_type + mixed-AWQ handling to config generation. |
| lmdeploy/turbomind/deploy/config.py | Adds config fields: ffn_weight_type, scoring_func, router_n_groups. |
Comments suppressed due to low confidence (2)

lmdeploy/turbomind/deploy/module.py:7

`identity` is imported but not used in this module. Please remove the unused import to avoid lint failures and keep the module clean.

`from .parameter import get_params`
Dense layers (no MoE experts, e.g. layer 0 in GLM-4.7-Flash) are typically excluded from quantization via `modules_to_not_convert`. Their FFN weights are fp16, so `LlamaFfnWeight` must use `weight_type` (fp16) instead of the global `ffn_weight_type` (int4). Only MoE layers need `ffn_weight_type` for their shared experts. Without this fix, the C++ engine allocates int4 buffers for layer 0's FFN but Python writes fp16 weights, causing tensor name mismatches (`w1.weight` vs `w1.qweight`) and leaving layer 0 uninitialized → garbage output.
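A minimal sketch of the selection rule this fix implements (the function name and string dtypes here are illustrative, not actual TurboMind APIs):

```python
# Illustrative sketch of the per-layer FFN weight-type rule described above.
# The function name and string dtypes are hypothetical, not TurboMind APIs.
def ffn_type_for_layer(is_moe_layer: bool, weight_type: str, ffn_weight_type: str) -> str:
    # Dense layers (in modules_to_not_convert) keep fp16 FFN weights, so
    # they must allocate with weight_type; only MoE layers use the
    # quantized ffn_weight_type for their shared experts.
    return ffn_weight_type if is_moe_layer else weight_type

print(ffn_type_for_layer(False, "fp16", "int4"))  # layer 0 (dense FFN)
print(ffn_type_for_layer(True, "fp16", "int4"))   # MoE layers (shared experts)
```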
The HeadDim=576 specialization was using CTA_Q=16 with only 1 warp, while the generic Sm70 config uses CTA_Q=64 with 4 warps. The shared memory budget permits CTA_Q=64: max(Q=72.5KB, K+V+P=76.8KB) = 76.8KB < 96KB. This gives 4x more tensor core parallelism per CTA and 4x faster cooperative global memory loads, matching the generic Sm70 config's warp count.
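The budget arithmetic can be sanity-checked with a back-of-envelope sketch. The tile layout below is an assumption, so the totals land slightly under the 72.5/76.8 KB figures quoted above, which presumably include alignment padding this sketch ignores:

```python
# Back-of-envelope smem estimate for the Sm70 HeadDim=576 config above.
# Tile shapes are assumptions about the layout, not the exact TurboMind one.
ELEM = 2  # bytes per fp16 element
CTA_Q, CTA_S, HEAD_DIM = 64, 32, 576

q_bytes = CTA_Q * HEAD_DIM * ELEM                 # staged Q tile
k_bytes = CTA_S * HEAD_DIM * ELEM                 # K tile
v_bytes = CTA_S * HEAD_DIM * ELEM                 # V tile
p_bytes = CTA_Q * CTA_S * ELEM                    # S/P score tile
peak = max(q_bytes, k_bytes + v_bytes + p_bytes)  # Q phase vs K+V+P phase

print(f"Q tile : {q_bytes / 1024:.1f} KiB")
print(f"K+V+P  : {(k_bytes + v_bytes + p_bytes) / 1024:.1f} KiB")
assert peak < 96 * 1024  # fits V100's 96 KiB shared-memory limit
```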
…ndant memsets, missing forward decl
- base.py: Cache `_turbomind` import outside loop, narrow except to ImportError
- llama.py: Use local `rope_scaling` var instead of `model_arg['rope_scaling']` to avoid KeyError when only `rope_parameters` is present
- moe_utils_v2.cu: Remove redundant per-token `masks` clear in noaux_tc kernel (already initialized via `cudaMemsetAsync` before kernel launch)
- moe_ffn_layer.cc: Move `accum`/`masks` clears into respective branches — noaux_tc handles its own clearing internally, V2 needs caller-side `accum` clear only
- attention_universal.h: Forward-declare `DecodingCtaMap` to fix include-order dependency on cta_map.h
b8fb39f to b8699b0
- converter.py: Fix `torch_dtype` deprecation — prefer `config.dtype` over `config.torch_dtype` (transformers v5+ compat)
- internlm2.py: Add missing `if not kind` guard in `_ffn()` to prevent KeyError when `Ffn.apply()` probes for parameter keys with `kind=None`
- test_moe_gate_noaux_tc.py: Reference tests for noaux_tc MoE routing algorithm (sigmoid/softmax scoring, correction bias, top-k selection, renormalization, routed_scale, NaN/Inf handling, GLM-4.7-Flash config)
- test_converter.py: Add `test_torch_dtype_fallback` and `test_ffn_reader_kind_none`
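For orientation, the routing algorithm those reference tests exercise can be sketched in numpy — a plain reference of the steps named above (sigmoid/softmax scoring, correction-biased top-k, renormalization, routed_scale), omitting group-limited selection and NaN/Inf handling; names are illustrative, not the kernel's API:

```python
import numpy as np

def noaux_tc_route(logits, correction_bias, top_k, routed_scale=1.0, scoring="sigmoid"):
    """Reference noaux_tc routing sketch (not the CUDA kernel)."""
    if scoring == "sigmoid":
        scores = 1.0 / (1.0 + np.exp(-logits))
    else:  # softmax scoring
        e = np.exp(logits - logits.max(axis=-1, keepdims=True))
        scores = e / e.sum(axis=-1, keepdims=True)
    # The correction bias only influences WHICH experts are chosen;
    # the combine weights use the unbiased scores.
    choice = scores + correction_bias
    topk_idx = np.argsort(-choice, axis=-1)[..., :top_k]
    topk_scores = np.take_along_axis(scores, topk_idx, axis=-1)
    weights = topk_scores / topk_scores.sum(axis=-1, keepdims=True) * routed_scale
    return topk_idx, weights

idx, w = noaux_tc_route(np.array([[0.0, 1.0, 2.0, 3.0]]), np.zeros(4), top_k=2)
print(idx)      # experts with the highest biased scores win
print(w.sum())  # per-token weights renormalize to routed_scale
```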
b8699b0 to de580e3
Hi @lapy, could you help resolve the conflicts? Would you be able to include this change in the current PR?
- Merge latest main into glm4-moe-lite-minimal to pick up DeepSeek v3.2 / GLM-5 updates
- Resolve DeepSeek2/Llama source_model conflicts and keep MLA + MoE routing changes
- Gate turbomind test_quantization target behind BUILD_TEST to reduce build time

Co-authored-by: Cursor <cursoragent@cursor.com>
Is only QuantTrio/GLM-4.7-Flash-AWQ available? Can't I use the original GLM-4.7-Flash through my 4×32GB V100s with lmdeploy?
```python
def _ffn(self, i: int, kind: str):
    """Get ffn kind for layer i."""
    if not kind:
        return self.filter(self.ffn_pattern)
```
This misses the other parameter `i`, i.e. the layer id.
```cpp
model_param_.weight_type        = data_type_from_string(model["weight_type"].as<std::string>());
model_param_.expert_weight_type = data_type_from_string(model["expert_weight_type"].as<std::string>());
model_param_.ffn_weight_type    = data_type_from_string(
```
How about `model_param_.ffn_weight_type = data_type_from_string(model["ffn_weight_type"].as<std::string>(model_param_.weight_type));`?
@lapy I've left a few more comments. Looks good overall! I'm running the evaluation tests and will share the results later.
I can deploy GLM-4.7-Flash on H800 by adding this patch:

```python
for tm_tensor in tm_params[name]:
    # Match TurboMind tensor dtype to avoid byte_size mismatch (e.g. f32 256b vs f16 128b)
    if _tm is not None:
        if tm_tensor.type == _tm.DataType.TYPE_FP32 and torch_tensor.dtype in [torch.float16, torch.bfloat16]:
            torch_tensor = torch_tensor.float()
        elif tm_tensor.type == _tm.DataType.TYPE_FP16 and torch_tensor.dtype == torch.float32:
            torch_tensor = torch_tensor.half()
    tm_tensor.copy_from(torch_tensor)
```
Motivation
#4320 added GLM-4.7-Flash support for the PyTorch backend only. This PR adds full TurboMind (C++) backend support, enabling significantly faster inference with tensor-core-accelerated attention, native AWQ int4 quantization, and optimized MoE routing — all critical for production deployment of this 4.7B MoE model.
GLM-4.7-Flash uses an MLA + MoE architecture with two properties that required dedicated engine work:
- An effective attention head dim of 576 (`kv_lora_rank=512` + `qk_rope_dim=64`) — not previously supported by any attention kernel in TurboMind.
- `noaux_tc` MoE routing with sigmoid scoring and per-expert correction bias — a new routing strategy not present in the existing DeepSeek2 codebase.

Additionally, this enables mixed AWQ quantization (e.g. QuantTrio/GLM-4.7-Flash-AWQ) where attention weights remain fp16 while FFN/expert weights are int4, requiring a new three-tier weight type system in the C++ engine.
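The MLA handling above relies on folding the `kv_b_proj` compression matrices into the query/output projections (detailed under Modification §1); the fold is plain matmul associativity. A numpy sketch with illustrative shapes (not GLM-4.7-Flash's actual dims):

```python
import numpy as np

# MLA kc-fold sketch: scoring q_nope against the compressed KV latent can
# absorb w_kc into the query projection. All shapes are illustrative only.
heads, nope_dim, kv_lora_rank = 4, 128, 512
rng = np.random.default_rng(0)

w_kc = rng.standard_normal((heads, nope_dim, kv_lora_rank))  # per-head kc slice of kv_b_proj
q_nope = rng.standard_normal((heads, nope_dim))              # per-head query (nope part)
c_kv = rng.standard_normal(kv_lora_rank)                     # compressed KV latent for one token

# Unfolded: decompress the key first, then score per head: q · (W_kc @ c_kv)
unfolded = np.einsum("hd,hdr,r->h", q_nope, w_kc, c_kv)
# Folded: absorb W_kc into the query path, score directly in latent space
q_latent = np.einsum("hd,hdr->hr", q_nope, w_kc)  # what a folded q_b_proj would emit
folded = q_latent @ c_kv

assert np.allclose(unfolded, folded)  # identical scores, no runtime decomposition
```

The vc half folds into `o_proj` by the same associativity argument.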
Modification
43 files changed, +922 / −79 lines across 5 commits.
1. Model Registration & Weight Conversion

- New `Glm4MoeLiteModel`/`Glm4MoeLiteReader` extending `DeepSeek2Model`, handling the model's specific config (`topk_method='noaux_tc'`, `scoring_func='sigmoid'`, `e_score_correction_bias`).
- MLA folding: when `kv_lora_rank != qk_nope_head_dim` (512 ≠ 576 for GLM-4.7-Flash), the kc/vc compression matrices from `kv_b_proj` are folded into `q_b_proj` and `o_proj` respectively, yielding an effective `head_dim=576` that the existing attention kernels can process without runtime decomposition.
- Extended `DeepSeek2Reader` with layer-specific regex patterns for FFN weight keys to avoid collisions between dense layer 0 and MoE shared-expert layers.

2. Mixed Quantization Support (Three-Tier Weight Types)
Introduced `ffn_weight_type` alongside the existing `weight_type` and `expert_weight_type` in both the Python config and the C++ `ModelParam`: `weight_type` covers attention, `ffn_weight_type` covers dense-FFN / shared-expert weights, and `expert_weight_type` covers the routed experts.

Dense layers (listed in `modules_to_not_convert`) use `weight_type` for FFN allocation; MoE layers use `ffn_weight_type` for shared experts. This prevents buffer type mismatches that would cause garbage output.

3. Attention Kernels for HeadDim=576
- New `MMA_884` tensor core kernel with `CTA_Q=64`, `CTA_S=32`, `WARP_Q=16`, `WARP_S=32` — 4 warps, 76.8 KB shared memory (within V100's 96 KB limit). `CTA_S` reduced from 64 to fit the large head dimension.
- `impl_884.h`: V accumulation loop bound changed from a hardcoded `8` (correct only for HeadDim=128) to the compile-time `V_N` — critical for HeadDim=576, where `V_N=36`.
- New HeadDim=576 dispatches/instantiations for `invokeProcessKV_v2`, `invokeFlattenKV_v2`, and `invokeReduceV3`.

4. MoE `noaux_tc` Routing Kernel

- New CUDA kernel `MoeGateNoAuxTCKernel` (~200 lines) implementing the auxiliary-loss-free routing: `scores = sigmoid(logits)` (or softmax, configurable via `scoring_func`), then `scores_for_choice = scores + correction_bias` for top-k selection.
- Wired into `moe_ffn_layer.cc` with a new `topk_method == "noaux_tc"` dispatch path.
- `MoeFfnWeight` extended with a `score_correction_bias` tensor.

5. MLA Copy Kernel Fix
- `mla_copy_qkv_kernel` now iterates over heads (was single-head assignment) — needed when `head_num > blockDim.y`.
- Explicit zero-padding of V for `di >= v_head_dim`.

BC-breaking (Optional)
No backward-compatibility breaking changes. The new `ffn_weight_type` config field defaults to `weight_type` in the C++ YAML reader, so existing model configs without this field continue to work identically.

Use cases (Optional)
Tested on 2× V100-32GB with QuantTrio/GLM-4.7-Flash-AWQ — coherent output, ~27 GB VRAM on a single GPU with AWQ quantization active.

Checklist
- Tests for `noaux_tc` routing are not yet added.