Conversation
Pull request overview
Adds TurboMind (C++ backend) support for GLM-4.7-Flash / GLM4 MoE Lite, including new routing + attention-kernel capabilities needed for this model family and mixed-quant deployment.
Changes:
- Introduces mixed quantization in TurboMind via a new `ffn_weight_type` (separate from `weight_type` and `expert_weight_type`) and threads it through Python config → YAML → C++ param usage.
- Adds HeadDim=576 support across TurboMind attention / decoding / KV-cache utilities via new dispatches and many codegen instantiations.
- Adds `noaux_tc` MoE routing (sigmoid/softmax scoring + correction bias) with a new CUDA kernel + wiring into the MoE layer and weight export.
Reviewed changes
Copilot reviewed 43 out of 43 changed files in this pull request and generated 8 comments.
| File | Description |
|---|---|
| src/turbomind/turbomind.cc | Parse new MoE + mixed-quant YAML fields (scoring_func, router_n_groups, ffn_weight_type). |
| src/turbomind/models/llama/moe_ffn_layer.cc | Adds noaux_tc routing dispatch and passes correction bias + scoring mode into routing kernel. |
| src/turbomind/models/llama/mla_utils.cu | Fix MLA QKV copy to iterate over heads and add explicit V padding. |
| src/turbomind/models/llama/llama_params.h | Adds ffn_weight_type and new MoE params (scoring_func, router_n_groups). |
| src/turbomind/models/llama/LlamaDenseWeight.h | Adds score_correction_bias tensor to MoE weight struct. |
| src/turbomind/models/llama/LlamaDenseWeight.cc | Allocates/registers gate.score_correction_bias when topk_method == noaux_tc. |
| src/turbomind/models/llama/LlamaDecoderLayerWeight.cc | Uses ffn_weight_type for shared-expert/dense-FFN allocation in MoE layers; keeps attention on weight_type. |
| src/turbomind/kernels/gemm/moe_utils_v2.h | Declares invokeMoeGate_NoAuxTC. |
| src/turbomind/kernels/gemm/moe_utils_v2.cu | Implements noaux_tc routing CUDA kernel + scan integration. |
| src/turbomind/kernels/attention/reduce.cu | Instantiates reduce kernel for HeadDim=576 (fp16/bf16). |
| src/turbomind/kernels/attention/kv_cache_utils_v2.cu | Adds HeadDim=576 dispatch for KV process/flatten v2. |
| src/turbomind/kernels/attention/impl_884.h | Fixes V accumulation loop bound from hardcoded 8 to compile-time V_N. |
| src/turbomind/kernels/attention/decoding_config.h | Adds Sm70 decoding config specialization for HeadDim=576. |
| src/turbomind/kernels/attention/decoding.cu | Adds HeadDim=576 decoding dispatch. |
| src/turbomind/kernels/attention/codegen/decoding_sm80_576_f16_u8.cu | Adds decoding instantiations for Sm80 f16 + u8 KV cache @ 576. |
| src/turbomind/kernels/attention/codegen/decoding_sm80_576_f16_u4.cu | Adds decoding instantiations for Sm80 f16 + u4 KV cache @ 576. |
| src/turbomind/kernels/attention/codegen/decoding_sm80_576_f16_f16.cu | Adds decoding instantiations for Sm80 f16 + f16 KV cache @ 576. |
| src/turbomind/kernels/attention/codegen/decoding_sm80_576_bf16_u8.cu | Adds decoding instantiations for Sm80 bf16 + u8 KV cache @ 576. |
| src/turbomind/kernels/attention/codegen/decoding_sm80_576_bf16_u4.cu | Adds decoding instantiations for Sm80 bf16 + u4 KV cache @ 576. |
| src/turbomind/kernels/attention/codegen/decoding_sm80_576_bf16_bf16.cu | Adds decoding instantiations for Sm80 bf16 + bf16 KV cache @ 576. |
| src/turbomind/kernels/attention/codegen/decoding_sm75_576_f16_u8.cu | Adds decoding instantiations for Sm75 f16 + u8 KV cache @ 576. |
| src/turbomind/kernels/attention/codegen/decoding_sm75_576_f16_u4.cu | Adds decoding instantiations for Sm75 f16 + u4 KV cache @ 576. |
| src/turbomind/kernels/attention/codegen/decoding_sm75_576_f16_f16.cu | Adds decoding instantiations for Sm75 f16 + f16 KV cache @ 576. |
| src/turbomind/kernels/attention/codegen/decoding_sm70_576_f16_u8.cu | Adds decoding instantiations for Sm70 f16 + u8 KV cache @ 576. |
| src/turbomind/kernels/attention/codegen/decoding_sm70_576_f16_u4.cu | Adds decoding instantiations for Sm70 f16 + u4 KV cache @ 576. |
| src/turbomind/kernels/attention/codegen/decoding_sm70_576_f16_f16.cu | Adds decoding instantiations for Sm70 f16 + f16 KV cache @ 576. |
| src/turbomind/kernels/attention/codegen/attention_sm80_576_f16.cu | Adds context-attention instantiation for Sm80 f16 @ 576. |
| src/turbomind/kernels/attention/codegen/attention_sm80_576_bf16.cu | Adds context-attention instantiation for Sm80 bf16 @ 576. |
| src/turbomind/kernels/attention/codegen/attention_sm75_576_f16.cu | Adds context-attention instantiation for Sm75 f16 @ 576. |
| src/turbomind/kernels/attention/codegen/attention_sm70_576_f16.cu | Adds context-attention instantiation for Sm70 f16 @ 576. |
| src/turbomind/kernels/attention/attention_universal.h | Changes KV processing rule to only inline-process in decoding; affects kernel behavior selection. |
| src/turbomind/kernels/attention/attention_config.h | Adds Sm70 HeadDim=576 attention config (MMA_884) and includes SIMT impl. |
| src/turbomind/kernels/attention/attention.cu | Adds HeadDim=576 dispatch for context attention. |
| src/turbomind/kernels/attention/CMakeLists.txt | Adds new codegen compilation units for HeadDim=576. |
| lmdeploy/turbomind/supported_models.py | Registers Glm4MoeLiteForCausalLM mapping to glm4-moe-lite. |
| lmdeploy/turbomind/deploy/target_model/base.py | Adds dtype matching for tm-parameter copies to avoid byte-size mismatches. |
| lmdeploy/turbomind/deploy/source_model/llama.py | Adds rope_parameters support for newer Transformers config layout. |
| lmdeploy/turbomind/deploy/source_model/glm4_moe_lite.py | Adds GLM4 MoE Lite reader/model and exports correction bias key. |
| lmdeploy/turbomind/deploy/source_model/deepseek2.py | Refines per-layer FFN key filtering and adds MLA folding metadata + scoring config fields. |
| lmdeploy/turbomind/deploy/source_model/init.py | Exposes Glm4MoeLiteModel in source-model package init. |
| lmdeploy/turbomind/deploy/module.py | Exports correction bias and adds MLA folding logic during conversion; adjusts export ordering and bounds. |
| lmdeploy/turbomind/deploy/converter.py | Adds ffn_weight_type + mixed-AWQ handling to config generation. |
| lmdeploy/turbomind/deploy/config.py | Adds config fields: ffn_weight_type, scoring_func, router_n_groups. |
Comments suppressed due to low confidence (2)

lmdeploy/turbomind/deploy/module.py:7

`identity` is imported but not used in this module. Please remove the unused import to avoid lint failures and keep the module clean.

`from .parameter import get_params`
Dense layers (no MoE experts, e.g. layer 0 in GLM-4.7-Flash) are typically excluded from quantization via `modules_to_not_convert`. Their FFN weights are fp16, so `LlamaFfnWeight` must use `weight_type` (fp16) instead of the global `ffn_weight_type` (int4). Only MoE layers need `ffn_weight_type` for their shared experts. Without this fix, the C++ engine allocates int4 buffers for layer 0's FFN but Python writes fp16 weights, causing tensor name mismatches (`w1.weight` vs `w1.qweight`) and leaving layer 0 uninitialized → garbage output.
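A minimal sketch of the selection rule this fix implements (the function name and string dtypes here are illustrative, not actual TurboMind APIs):

```python
# Illustrative sketch of the per-layer FFN weight-type rule described above.
# The function name and string dtypes are hypothetical, not TurboMind APIs.
def ffn_type_for_layer(is_moe_layer: bool, weight_type: str, ffn_weight_type: str) -> str:
    # Dense layers (in modules_to_not_convert) keep fp16 FFN weights, so
    # they must allocate with weight_type; only MoE layers use the
    # quantized ffn_weight_type for their shared experts.
    return ffn_weight_type if is_moe_layer else weight_type

print(ffn_type_for_layer(False, "fp16", "int4"))  # layer 0 (dense FFN)
print(ffn_type_for_layer(True, "fp16", "int4"))   # MoE layers (shared experts)
```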
The HeadDim=576 specialization was using CTA_Q=16 with only 1 warp, while the generic Sm70 config uses CTA_Q=64 with 4 warps. The shared memory budget permits CTA_Q=64: max(Q=72.5KB, K+V+P=76.8KB) = 76.8KB < 96KB. This gives 4x more tensor core parallelism per CTA and 4x faster cooperative global memory loads, matching the generic Sm70 config's warp count.
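The budget arithmetic can be sanity-checked with a back-of-envelope sketch. The tile layout below is an assumption, so the totals land slightly under the 72.5/76.8 KB figures quoted above, which presumably include alignment padding this sketch ignores:

```python
# Back-of-envelope smem estimate for the Sm70 HeadDim=576 config above.
# Tile shapes are assumptions about the layout, not the exact TurboMind one.
ELEM = 2  # bytes per fp16 element
CTA_Q, CTA_S, HEAD_DIM = 64, 32, 576

q_bytes = CTA_Q * HEAD_DIM * ELEM                 # staged Q tile
k_bytes = CTA_S * HEAD_DIM * ELEM                 # K tile
v_bytes = CTA_S * HEAD_DIM * ELEM                 # V tile
p_bytes = CTA_Q * CTA_S * ELEM                    # S/P score tile
peak = max(q_bytes, k_bytes + v_bytes + p_bytes)  # Q phase vs K+V+P phase

print(f"Q tile : {q_bytes / 1024:.1f} KiB")
print(f"K+V+P  : {(k_bytes + v_bytes + p_bytes) / 1024:.1f} KiB")
assert peak < 96 * 1024  # fits V100's 96 KiB shared-memory limit
```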
…ndant memsets, missing forward decl
- base.py: Cache `_turbomind` import outside loop, narrow except to ImportError
- llama.py: Use local `rope_scaling` var instead of `model_arg['rope_scaling']` to avoid KeyError when only `rope_parameters` is present
- moe_utils_v2.cu: Remove redundant per-token `masks` clear in noaux_tc kernel (already initialized via `cudaMemsetAsync` before kernel launch)
- moe_ffn_layer.cc: Move `accum`/`masks` clears into respective branches — noaux_tc handles its own clearing internally, V2 needs caller-side `accum` clear only
- attention_universal.h: Forward-declare `DecodingCtaMap` to fix include-order dependency on cta_map.h
b8fb39f to b8699b0
- converter.py: Fix `torch_dtype` deprecation — prefer `config.dtype` over `config.torch_dtype` (transformers v5+ compat)
- internlm2.py: Add missing `if not kind` guard in `_ffn()` to prevent KeyError when `Ffn.apply()` probes for parameter keys with `kind=None`
- test_moe_gate_noaux_tc.py: Reference tests for noaux_tc MoE routing algorithm (sigmoid/softmax scoring, correction bias, top-k selection, renormalization, routed_scale, NaN/Inf handling, GLM-4.7-Flash config)
- test_converter.py: Add `test_torch_dtype_fallback` and `test_ffn_reader_kind_none`
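For orientation, the routing algorithm those reference tests exercise can be sketched in numpy — a plain reference of the steps named above (sigmoid/softmax scoring, correction-biased top-k, renormalization, routed_scale), omitting group-limited selection and NaN/Inf handling; names are illustrative, not the kernel's API:

```python
import numpy as np

def noaux_tc_route(logits, correction_bias, top_k, routed_scale=1.0, scoring="sigmoid"):
    """Reference noaux_tc routing sketch (not the CUDA kernel)."""
    if scoring == "sigmoid":
        scores = 1.0 / (1.0 + np.exp(-logits))
    else:  # softmax scoring
        e = np.exp(logits - logits.max(axis=-1, keepdims=True))
        scores = e / e.sum(axis=-1, keepdims=True)
    # The correction bias only influences WHICH experts are chosen;
    # the combine weights use the unbiased scores.
    choice = scores + correction_bias
    topk_idx = np.argsort(-choice, axis=-1)[..., :top_k]
    topk_scores = np.take_along_axis(scores, topk_idx, axis=-1)
    weights = topk_scores / topk_scores.sum(axis=-1, keepdims=True) * routed_scale
    return topk_idx, weights

idx, w = noaux_tc_route(np.array([[0.0, 1.0, 2.0, 3.0]]), np.zeros(4), top_k=2)
print(idx)      # experts with the highest biased scores win
print(w.sum())  # per-token weights renormalize to routed_scale
```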
b8699b0 to de580e3
Hi @lapy, could you help resolve the conflicts? Would you be able to include this change in the current PR?
- Merge latest main into glm4-moe-lite-minimal to pick up DeepSeek v3.2 / GLM-5 updates
- Resolve DeepSeek2/Llama source_model conflicts and keep MLA + MoE routing changes
- Gate turbomind test_quantization target behind BUILD_TEST to reduce build time

Co-authored-by: Cursor <cursoragent@cursor.com>
Is only QuantTrio/GLM-4.7-Flash-AWQ available? Can't I use the original GLM-4.7-Flash through my 4×32GB V100s with lmdeploy?
```python
def _ffn(self, i: int, kind: str):
    """Get ffn kind for layer i."""
    if not kind:
        return self.filter(self.ffn_pattern)
```
This misses the other parameter `i`, i.e. the layer id.
```cpp
model_param_.weight_type        = data_type_from_string(model["weight_type"].as<std::string>());
model_param_.expert_weight_type = data_type_from_string(model["expert_weight_type"].as<std::string>());
model_param_.ffn_weight_type    = data_type_from_string(
```
How about `model_param_.ffn_weight_type = data_type_from_string(model["ffn_weight_type"].as<std::string>(model_param_.weight_type));`?
@lapy I've left a few more comments. Looks good overall! I'm running the evaluation tests and will share the results later.
I can deploy GLM-4.7-Flash on H800 by adding this patch:

```python
for tm_tensor in tm_params[name]:
    # Match TurboMind tensor dtype to avoid byte_size mismatch (e.g. f32 256b vs f16 128b)
    if _tm is not None:
        if tm_tensor.type == _tm.DataType.TYPE_FP32 and torch_tensor.dtype in [torch.float16, torch.bfloat16]:
            torch_tensor = torch_tensor.float()
        elif tm_tensor.type == _tm.DataType.TYPE_FP16 and torch_tensor.dtype == torch.float32:
            torch_tensor = torch_tensor.half()
    tm_tensor.copy_from(torch_tensor)
```
Motivation
#4320 added GLM-4.7-Flash support for the PyTorch backend only. This PR adds full TurboMind (C++) backend support, enabling significantly faster inference with tensor-core-accelerated attention, native AWQ int4 quantization, and optimized MoE routing — all critical for production deployment of this 4.7B MoE model.
GLM-4.7-Flash uses an MLA + MoE architecture with two properties that required dedicated engine work:
- An effective attention head dim of 576 (`kv_lora_rank=512` + `qk_rope_dim=64`) — not previously supported by any attention kernel in TurboMind.
- `noaux_tc` MoE routing with sigmoid scoring and per-expert correction bias — a new routing strategy not present in the existing DeepSeek2 codebase.

Additionally, this enables mixed AWQ quantization (e.g. QuantTrio/GLM-4.7-Flash-AWQ) where attention weights remain fp16 while FFN/expert weights are int4, requiring a new three-tier weight type system in the C++ engine.
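The MLA handling above relies on folding the `kv_b_proj` compression matrices into the query/output projections (detailed under Modification §1); the fold is plain matmul associativity. A numpy sketch with illustrative shapes (not GLM-4.7-Flash's actual dims):

```python
import numpy as np

# MLA kc-fold sketch: scoring q_nope against the compressed KV latent can
# absorb w_kc into the query projection. All shapes are illustrative only.
heads, nope_dim, kv_lora_rank = 4, 128, 512
rng = np.random.default_rng(0)

w_kc = rng.standard_normal((heads, nope_dim, kv_lora_rank))  # per-head kc slice of kv_b_proj
q_nope = rng.standard_normal((heads, nope_dim))              # per-head query (nope part)
c_kv = rng.standard_normal(kv_lora_rank)                     # compressed KV latent for one token

# Unfolded: decompress the key first, then score per head: q · (W_kc @ c_kv)
unfolded = np.einsum("hd,hdr,r->h", q_nope, w_kc, c_kv)
# Folded: absorb W_kc into the query path, score directly in latent space
q_latent = np.einsum("hd,hdr->hr", q_nope, w_kc)  # what a folded q_b_proj would emit
folded = q_latent @ c_kv

assert np.allclose(unfolded, folded)  # identical scores, no runtime decomposition
```

The vc half folds into `o_proj` by the same associativity argument.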
Modification
43 files changed, +922 / −79 lines across 5 commits.
1. Model Registration & Weight Conversion

- New `Glm4MoeLiteModel`/`Glm4MoeLiteReader` extending `DeepSeek2Model`, handling the model's specific config (`topk_method='noaux_tc'`, `scoring_func='sigmoid'`, `e_score_correction_bias`).
- MLA folding: when `kv_lora_rank != qk_nope_head_dim` (512 ≠ 576 for GLM-4.7-Flash), the kc/vc compression matrices from `kv_b_proj` are folded into `q_b_proj` and `o_proj` respectively, yielding an effective `head_dim=576` that the existing attention kernels can process without runtime decomposition.
- Extended `DeepSeek2Reader` with layer-specific regex patterns for FFN weight keys to avoid collisions between dense layer 0 and MoE shared-expert layers.

2. Mixed Quantization Support (Three-Tier Weight Types)
Introduced `ffn_weight_type` alongside the existing `weight_type` and `expert_weight_type` in both the Python config and the C++ `ModelParam`: `weight_type` covers attention, `ffn_weight_type` covers dense-FFN / shared-expert weights, and `expert_weight_type` covers the routed experts.

Dense layers (listed in `modules_to_not_convert`) use `weight_type` for FFN allocation; MoE layers use `ffn_weight_type` for shared experts. This prevents buffer type mismatches that would cause garbage output.

3. Attention Kernels for HeadDim=576
- New `MMA_884` tensor core kernel with `CTA_Q=64`, `CTA_S=32`, `WARP_Q=16`, `WARP_S=32` — 4 warps, 76.8 KB shared memory (within V100's 96 KB limit). `CTA_S` reduced from 64 to fit the large head dimension.
- `impl_884.h`: V accumulation loop bound changed from a hardcoded `8` (correct only for HeadDim=128) to the compile-time `V_N` — critical for HeadDim=576, where `V_N=36`.
- New HeadDim=576 dispatches/instantiations for `invokeProcessKV_v2`, `invokeFlattenKV_v2`, and `invokeReduceV3`.

4. MoE `noaux_tc` Routing Kernel

- New CUDA kernel `MoeGateNoAuxTCKernel` (~200 lines) implementing the auxiliary-loss-free routing: `scores = sigmoid(logits)` (or softmax, configurable via `scoring_func`), then `scores_for_choice = scores + correction_bias` for top-k selection.
- Wired into `moe_ffn_layer.cc` with a new `topk_method == "noaux_tc"` dispatch path.
- `MoeFfnWeight` extended with a `score_correction_bias` tensor.

5. MLA Copy Kernel Fix
- `mla_copy_qkv_kernel` now iterates over heads (was single-head assignment) — needed when `head_num > blockDim.y`.
- Explicit zero-padding of V for `di >= v_head_dim`.

BC-breaking (Optional)
No backward-compatibility breaking changes. The new `ffn_weight_type` config field defaults to `weight_type` in the C++ YAML reader, so existing model configs without this field continue to work identically.

Use cases (Optional)
Tested on 2× V100-32GB with QuantTrio/GLM-4.7-Flash-AWQ — coherent output, ~27 GB VRAM on a single GPU with AWQ quantization active.

Checklist
- Tests for `noaux_tc` routing are not yet added.