[feat]: add expert bias support to AMX MoE kernel for gpt-oss-120B #1862
vshortt73 wants to merge 4 commits into kvcache-ai:main
Conversation
gpt-oss-120B uses expert biases and a custom activation function that the AMX MoE kernel did not support, causing fallback to a Python forward path at 0.04 t/s. This adds native C++ bias support, achieving 0.23 t/s (5.75x) with coherent output verified against GPU reference (zero diff).

Changes:
- common.hpp: Add gate_bias, up_bias, down_bias pointers and gemm1_alpha, gemm1_clamp_limit parameters to GeneralMOEConfig (nullable, backward compatible — existing models unaffected)
- amx.hpp: Add act_fn_alpha() AVX-512 activation for the gpt-oss formula gate * sigmoid(gate * alpha) * (up + 1) with asymmetric clamping
- moe_base.hpp: Add bias application in apply_activation() (gate/up bias after GEMM1, before activation) and new apply_down_bias() (after GEMM2, before weighted merge). Integrated into both prefill and decode paths.
- ext_bindings.cpp: Expose new config fields via pybind11
- amx.py: Propagate _interleaved_gate_up flag to loader, attach bias tensors to MOEConfig from Python-side pointers
- loader.py: Add MXFP4 dequantization (E2M1 lookup + E8M0 block scales), mxfp4_packed format detection, interleaved gate/up split (::2/1::2) for models that use interleaved rather than concatenated layout

Resolves: kvcache-ai#1861

Co-Authored-By: Claude Code <noreply@anthropic.com>
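For reference, the activation formula above can be sketched in scalar Python. This is an illustrative sketch, not the AVX-512 kernel; the asymmetric clamping convention (gate clamped from above only, up clamped on both sides) is assumed from the public gpt-oss reference implementation.

```python
import math

def act_fn_alpha(gate: float, up: float, alpha: float = 1.702, limit: float = 7.0) -> float:
    # gate * sigmoid(gate * alpha) * (up + 1), with asymmetric clamping:
    # gate is clamped from above only, up is clamped to [-limit, +limit]
    g = min(gate, limit)
    u = max(min(up, limit), -limit)
    return g * (1.0 / (1.0 + math.exp(-g * alpha))) * (u + 1.0)

print(act_fn_alpha(0.0, 0.0))  # 0.0: sigmoid(0) = 0.5, but gate = 0 zeroes the product
```

Note this is not standard SiLU: the `(up + 1)` term means a zero up-projection still passes the gated value through, which is why skipping the biases destroys the output rather than merely perturbing it.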
Comprehensive documentation covering the AMX kernel bias implementation, MXFP4→BF16 conversion pipeline, interleaved weight layout discovery, kernel correctness validation, and throughput analysis. Co-Authored-By: Claude Code <noreply@anthropic.com>
Summary of Changes

Hello @vshortt73, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request significantly enhances the AMX MoE kernel's capability to support the GPT-OSS-120B model. It introduces crucial features like expert biases and a custom activation function, alongside adapting to a unique interleaved weight layout. These changes are vital for accurate and efficient inference of GPT-OSS-120B, leading to a substantial performance improvement and coherent output.

Highlights
Activity
Code Review
This pull request introduces expert bias support and a custom activation function to the AMX MoE kernel for the gpt-oss-120B model, enhancing inference throughput. However, it presents a security risk by trusting raw memory pointers for bias tensors (gate_bias, up_bias, down_bias) without validation, potentially leading to out-of-bounds memory reads. Additionally, critical Tensor Parallelism (TP) correctness issues were found: gate_bias and up_bias indexing does not account for TP sharding, and down_bias is incorrectly applied, causing an erroneous bias multiplication in the final result.
kt-kernel/operators/amx/moe_base.hpp (Outdated)

```cpp
}

void apply_down_bias(int activated_expert, int qlen) {
  if (config_.down_bias == nullptr) return;
```
In a Tensor Parallel (TP) configuration, apply_down_bias is executed by every TP part. Since merge_results performs a summation of the outputs from all TP parts, the down_bias will be added tp_count times to the final result. To ensure correctness, the bias should only be added by one TP part (e.g., tp_part_idx == 0).
```cpp
if (config_.down_bias == nullptr || tp_part_idx != 0) return;
```
kt-kernel/operators/amx/moe_base.hpp (Outdated)

```cpp
ggml_bf16_t* gate_bias_ptr = has_bias ?
    (ggml_bf16_t*)config_.gate_bias + (size_t)expert_idx * config_.intermediate_size : nullptr;
ggml_bf16_t* up_bias_ptr = has_bias ?
    (ggml_bf16_t*)config_.up_bias + (size_t)expert_idx * config_.intermediate_size : nullptr;
```

This code calculates memory offsets for gate_bias, up_bias, and down_bias using expert_idx * size without validating that the provided memory buffers are sufficiently large. This could lead to out-of-bounds memory reads, potentially leaking sensitive information or causing a crash. Furthermore, the indexing for gate_bias_ptr and up_bias_ptr is incorrect when using Tensor Parallelism (tp_count > 1), as intermediate_size is sharded while the bias pointers are unsharded, leading to incorrect access and striding.
```cpp
int tp_count = config_.pool->config.subpool_count;
size_t full_intermediate_size = (size_t)config_.intermediate_size * tp_count;
size_t tp_offset = (size_t)tp_part_idx * config_.intermediate_size;
// Bias pointers for this expert (if present)
ggml_bf16_t* gate_bias_ptr = has_bias ?
    (ggml_bf16_t*)config_.gate_bias + (size_t)expert_idx * full_intermediate_size + tp_offset : nullptr;
ggml_bf16_t* up_bias_ptr = has_bias ?
    (ggml_bf16_t*)config_.up_bias + (size_t)expert_idx * full_intermediate_size + tp_offset : nullptr;
```

```cpp
ggml_bf16_t* bias_ptr = (ggml_bf16_t*)config_.down_bias +
    (size_t)expert_idx * config_.hidden_size;
```
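The indexing scheme in the suggestion can be sanity-checked with a small Python sketch (a hypothetical helper with illustrative sizes, mirroring the pointer arithmetic): the bias tensor is stored unsharded, so each part must stride by the full intermediate size and offset into its own slice.

```python
# Hypothetical helper mirroring the suggested pointer arithmetic: bias is laid
# out unsharded as [num_experts, intermediate_size * tp_count], and each TP
# part owns one contiguous slice of the intermediate dimension.
def bias_offset(expert_idx, tp_part_idx, intermediate_size, tp_count):
    full_intermediate = intermediate_size * tp_count
    return expert_idx * full_intermediate + tp_part_idx * intermediate_size

# With 2 TP parts and a per-part intermediate_size of 4, expert 1's bias
# occupies elements [8, 16); part 0 reads from offset 8, part 1 from 12.
offsets = [bias_offset(1, p, intermediate_size=4, tp_count=2) for p in range(2)]
print(offsets)  # [8, 12]
```

Striding by the sharded intermediate_size instead would make expert 1's part 0 read from offset 4, i.e. the second half of expert 0's bias.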
Hello @vshortt73, Thank you for your feedback. I understand your perspective that this pull request represents a work in progress, and that you haven't declared it as production-ready or fully secure. My purpose as a code assistant is to identify potential issues and opportunities for improvement based on general software engineering best practices and the review criteria provided to me, which include correctness and security. These criteria are fundamental to code quality, regardless of whether the code is commercial, a prototype, or a work in progress. When code is …
- apply_down_bias: only execute on tp_part_idx == 0 to prevent bias being added tp_count times after merge_results sums all TP parts
- gate_bias/up_bias: stride by full (unsharded) intermediate_size and offset to TP part slice — fixes incorrect memory access when tp_count > 1
- E2M1 lookup tensor: create on w_blocks.device to prevent implicit cross-device transfer during MXFP4 dequantization

No behavioral change when tp_count == 1 (single-GPU).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Fixed the TP correctness issues (1 & 2) and the E2M1 device placement (4).
ErvinXie left a comment
Nice work — this is a well-structured addition. The kernel changes are backward compatible, TP correctness was addressed in the follow-up commit, and the MXFP4 dequantization is memory-conscious. A few items to address before merging:
Should Fix
1. GPT_OSS_KERNEL_WORK.md should not live in repo root
A 681-line technical report in the repository root is not ideal. Please move it under doc/ (e.g., doc/en/kt-kernel/GPT-OSS-120B-Kernel-Work.md) or remove it from this PR and link to it externally.
2. actual_hidden / actual_intermediate change affects all models
In amx.py, replacing self.hidden_size / self.moe_intermediate_size with tensor-derived dimensions:
```python
actual_hidden = self.gate_weights[0].shape[1]
actual_intermediate = self.gate_weights[0].shape[0]
```

This is a broader change that runs for every model, not just gpt-oss. For existing models config values and tensor shapes should match, so it's likely safe — but it's a silent behavioral change. Please either:
- Add a comment explaining why this is necessary (GPU padding issue) and that it's intentional for all models, or
- Guard it with a condition so it only applies when dimensions actually differ
Suggestions (non-blocking)
3. Bias tensor lifetime
Bias tensors are set externally (by kt_ep_wrapper, not in this PR) and passed to C++ via data_ptr(). If the Python-side tensor gets garbage collected, the C++ pointer becomes dangling. The current code uses self._gate_bias_tensor etc., which should keep them alive — just make sure the external caller stores them on self (not as locals).
4. MXFP4 gate/up split assumes concatenated layout
_load_experts_mxfp4_packed splits gate/up via [:mid, :] (concatenated), while _load_experts_packed now supports interleaved ([::2, :]). A brief comment in the MXFP4 path explaining why interleaved handling is not needed there would help future readers.
5. Uninitialized alpha_vec / limit_vec in apply_activation
```cpp
__m512 alpha_vec, limit_vec;
if (use_alpha) {
  alpha_vec = _mm512_set1_ps(config_.gemm1_alpha);
  limit_vec = _mm512_set1_ps(config_.gemm1_clamp_limit);
}
```

These are only read inside if (use_alpha) so it's not a bug, but declaring them uninitialized is easy to misread. Consider initializing at declaration or moving them inside the if block.
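The concatenated vs. interleaved gate/up layouts discussed in item 4 can be illustrated with a small Python sketch (toy row indices, not real weights):

```python
# Toy illustration of the two fused gate_up layouts the loader distinguishes.
rows = list(range(8))  # stand-in for the 2*intermediate rows of a fused gate_up projection

# Concatenated layout (MXFP4 packed path): gate is the first half, up the second.
mid = len(rows) // 2
gate_cat, up_cat = rows[:mid], rows[mid:]

# Interleaved layout (gpt-oss BF16 packed path): gate and up rows alternate.
gate_int, up_int = rows[::2], rows[1::2]

print(gate_cat, up_cat)  # [0, 1, 2, 3] [4, 5, 6, 7]
print(gate_int, up_int)  # [0, 2, 4, 6] [1, 3, 5, 7]
```

Applying the wrong split silently produces tensors of the correct shape with scrambled gate/up pairing, which is why the `_interleaved_gate_up` flag has to be threaded through to the loader.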
Overall this is solid kernel work with good testing (zero-diff against GPU reference). The TP fixes in 08b203a look correct. Thanks for the detailed technical report and for filing the related issues!
- Move GPT_OSS_KERNEL_WORK.md to doc/en/kt-kernel/GPT-OSS-120B-Kernel-Work.md
- Add comment explaining actual_hidden/actual_intermediate applies to all models
- Add bias tensor lifetime safety comment at data_ptr() usage
- Add comment in MXFP4 loader explaining why concatenated split is correct
- Initialize alpha_vec/limit_vec at declaration in moe_base.hpp

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Thanks for the thorough review — all great catches. Addressed everything in 2e86c87:

Should Fix:
1. Technical report moved — GPT_OSS_KERNEL_WORK.md → doc/en/kt-kernel/GPT-OSS-120B-Kernel-Work.md
2. actual_hidden/actual_intermediate — Added a comment explaining this is intentional for all models. GPU kernels may pad dimensions (e.g. FlashInfer MXFP4 rounds to 256-byte alignment), but CPU weights use real model dimensions. For existing models where no padding occurs, tensor shapes already match config values, so it's a no-op.

Suggestions:
3. Bias tensor lifetime — Added a comment at the data_ptr() usage site noting that callers must store bias tensors on self to prevent GC while C++ holds the pointer. The current external caller (kt_ep_wrapper) already does this correctly.
4. MXFP4 concatenated split — Added a comment explaining that MXFP4 packed blocks store gate/up in separate halves (unlike the BF16 packed format where gpt-oss uses interleaved layout), so the concatenated split is correct there.
5. Uninitialized alpha_vec/limit_vec — Changed to const with ternary initialization so they're always initialized at declaration. Zero overhead, no ambiguity.

Let me know if anything else needs attention.

Best,
Victor
Summary
Adds native expert bias support to the C++ AMX MoE kernel, enabling coherent inference of gpt-oss-120B through KTransformers+SGLang.
gpt-oss-120B has three features the AMX kernel did not handle:
- gate_up_proj_bias and down_proj_bias applied after GEMM
- gate * sigmoid(gate * α) * (up + 1) activation with α=1.702 and clamp=±7.0 (not standard SiLU)
- an interleaved gate/up weight layout rather than the concatenated layout the loader assumed

Without this, the model fell back to a Python MoE forward path at 0.04 t/s. With this change: 0.23 t/s (5.75x speedup), coherent output verified against GPU reference (zero diff). Expert biases contribute 72.8% of total output magnitude — skipping them produces noise, not slightly-wrong answers.
Changes
C++ kernel (backward compatible — all fields nullable, existing models unaffected):
C++ kernel (backward compatible — all fields nullable, existing models unaffected):
- common.hpp: Add gate_bias, up_bias, down_bias pointers + gemm1_alpha, gemm1_clamp_limit to GeneralMOEConfig
- amx.hpp: Add act_fn_alpha() AVX-512 activation with asymmetric clamping
- moe_base.hpp: Bias application in apply_activation() (after GEMM1) and new apply_down_bias() (after GEMM2, before weighted merge). Both prefill and decode paths.
- ext_bindings.cpp: Expose new fields via pybind11

Python layer:
- amx.py: Propagate _interleaved_gate_up flag to loader, attach bias tensors to C++ config
- loader.py: MXFP4→BF16 dequantization (E2M1 lookup + E8M0 block scales), mxfp4_packed format auto-detection, interleaved gate/up split (::2/1::2)

Hardware tested
Related issues
- #1859 — SGLang passes num_gpu_experts but kt-kernel expects gpu_experts_mask (API mismatch, already patched in SGLang wrapper)
- #1860 — MXFP4 GPU allocation ignores --kt-num-gpu-experts (still open)

Detailed technical report
A comprehensive write-up covering the kernel work, the interleaved weight layout discovery, standalone validation, throughput analysis, and quantized backend investigation is available at:
https://github.com/vshortt73/ktransformers/blob/feat/gpt-oss-expert-bias-support/GPT_OSS_KERNEL_WORK.md
(Will add to this branch if reviewers want it included.)
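As a companion to the MXFP4→BF16 pipeline described above, here is a hedged numpy sketch of the dequantization step. The nibble order, packing layout, and block size of 32 are assumptions for illustration; the E2M1 value table and E8M0 power-of-two scales follow the OCP Microscaling (MX) spec.

```python
import numpy as np

# E2M1 lookup: 4-bit codes 0-7 are the non-negative values, 8-15 their negations.
E2M1 = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0,
                 -0.0, -0.5, -1.0, -1.5, -2.0, -3.0, -4.0, -6.0], dtype=np.float32)

def dequant_mxfp4(packed: np.ndarray, scales_e8m0: np.ndarray, block: int = 32) -> np.ndarray:
    """Unpack two 4-bit codes per byte (low nibble first, an assumption),
    look up E2M1 values, and apply one E8M0 scale (2**(e - 127)) per block."""
    lo = packed & 0x0F
    hi = packed >> 4
    codes = np.stack([lo, hi], axis=-1).reshape(-1)
    vals = E2M1[codes]
    scales = np.exp2(scales_e8m0.astype(np.int32) - 127).astype(np.float32)
    return (vals.reshape(-1, block) * scales[:, None]).reshape(-1)

# One block of 32 elements: every code is 0x2 (E2M1 value 1.0), and the
# scale byte 128 decodes to 2**(128 - 127) = 2, so every output is 2.0.
out = dequant_mxfp4(np.full(16, 0x22, dtype=np.uint8), np.array([128], dtype=np.uint8))
print(out[:4])  # [2. 2. 2. 2.]
```

Since every dequantized value is a small E2M1 value times a power of two, the lookup-table-plus-shift structure maps naturally onto the vectorized loader path.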
Test plan
Verified output uses the expected channel tokens (<|channel|>analysis / <|channel|>final).

🤖 Generated with Claude Code