
[feat]: add expert bias support to AMX MoE kernel for gpt-oss-120B#1862

Open
vshortt73 wants to merge 4 commits into kvcache-ai:main from vshortt73:feat/gpt-oss-expert-bias-support

Conversation

@vshortt73

Summary

Adds native expert bias support to the C++ AMX MoE kernel, enabling coherent inference of gpt-oss-120B through KTransformers+SGLang.

gpt-oss-120B has three features the AMX kernel did not handle:

  • Expert biases: gate_up_proj_bias and down_proj_bias applied after GEMM
  • Custom activation: gate * sigmoid(gate * α) * (up + 1) with α=1.702 and clamp=±7.0 (not standard SiLU)
  • Interleaved gate/up layout: even indices = gate, odd indices = up (not concatenated halves)

Without this, the model fell back to a Python MoE forward path at 0.04 t/s. With this change: 0.23 t/s (5.75x speedup), coherent output verified against GPU reference (zero diff). Expert biases contribute 72.8% of total output magnitude — skipping them produces noise, not slightly-wrong answers.
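
The custom activation above can be sketched as a scalar Python reference. The formula itself is from the PR description; the exact clamp convention (gate clamped only from above, up clamped symmetrically to ±limit) is an assumption inferred from "asymmetric clamping":

```python
import math

def gpt_oss_act(gate: float, up: float, alpha: float = 1.702, limit: float = 7.0) -> float:
    """Scalar reference for gate * sigmoid(gate * alpha) * (up + 1).

    Clamp convention assumed: gate clamped from above only,
    up clamped symmetrically to [-limit, +limit].
    """
    gate = min(gate, limit)
    up = max(-limit, min(up, limit))
    sig = 1.0 / (1.0 + math.exp(-alpha * gate))
    return gate * sig * (up + 1.0)
```

This mirrors, per lane, what the PR's act_fn_alpha() in amx.hpp computes with AVX-512.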

Changes

C++ kernel (backward compatible — all fields nullable, existing models unaffected):

  • common.hpp: Add gate_bias, up_bias, down_bias pointers + gemm1_alpha, gemm1_clamp_limit to GeneralMOEConfig
  • amx.hpp: Add act_fn_alpha() AVX-512 activation with asymmetric clamping
  • moe_base.hpp: Bias application in apply_activation() (after GEMM1) and new apply_down_bias() (after GEMM2, before weighted merge). Both prefill and decode paths.
  • ext_bindings.cpp: Expose new fields via pybind11

Python layer:

  • amx.py: Propagate _interleaved_gate_up flag to loader, attach bias tensors to C++ config
  • loader.py: MXFP4→BF16 dequantization (E2M1 lookup + E8M0 block scales), mxfp4_packed format auto-detection, interleaved gate/up split (::2/1::2)
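
The interleaved gate/up split is plain stride-2 slicing; a minimal sketch, shown here on a list of rows (the loader does the equivalent slicing on torch tensors):

```python
def split_gate_up_interleaved(rows):
    """Split fused gate_up rows: even indices are gate, odd indices are up.

    Mirrors the loader's w[::2] / w[1::2] slicing, demonstrated on a
    list of rows rather than a torch tensor.
    """
    return rows[::2], rows[1::2]
```

By contrast, the concatenated convention would be rows[:len(rows)//2] and rows[len(rows)//2:].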

Hardware tested

  • GPU: NVIDIA RTX 5090 (32GB)
  • CPU: AMD Ryzen 9 9900X (Zen 5, AVX-512 BF16)
  • RAM: 64GB DDR5-5600

Related issues

  • kvcache-ai#1861: BF16 backend cannot parse gpt-oss packed/fused expert weight format

Detailed technical report

A comprehensive write-up covering the kernel work, the interleaved weight layout discovery, standalone validation, throughput analysis, and quantized backend investigation is available at:
https://github.com/vshortt73/ktransformers/blob/feat/gpt-oss-expert-bias-support/GPT_OSS_KERNEL_WORK.md
(Will add to this branch if reviewers want it included.)

Test plan

  • Standalone single-expert test: GPU reference vs CPU AMX output → zero diff
  • Weight layout verification: interleaved split shows distinct gate/up distributions (std 0.019 vs 0.037)
  • End-to-end inference: coherent output with reasoning structure (<|channel|>analysis/<|channel|>final)
  • Backward compatibility: no changes to DeepSeek/Qwen paths (all bias fields default to nullptr)
  • CI tests (requesting reviewer guidance on test infrastructure)

🤖 Generated with Claude Code

vshortt73 and others added 2 commits February 22, 2026 19:11
gpt-oss-120B uses expert biases and a custom activation function that the
AMX MoE kernel did not support, causing fallback to a Python forward path
at 0.04 t/s. This adds native C++ bias support, achieving 0.23 t/s (5.75x)
with coherent output verified against GPU reference (zero diff).

Changes:
- common.hpp: Add gate_bias, up_bias, down_bias pointers and gemm1_alpha,
  gemm1_clamp_limit parameters to GeneralMOEConfig (nullable, backward
  compatible — existing models unaffected)
- amx.hpp: Add act_fn_alpha() AVX-512 activation for gpt-oss formula:
  gate * sigmoid(gate * alpha) * (up + 1) with asymmetric clamping
- moe_base.hpp: Add bias application in apply_activation() (gate/up bias
  after GEMM1, before activation) and new apply_down_bias() (after GEMM2,
  before weighted merge). Integrated into both prefill and decode paths.
- ext_bindings.cpp: Expose new config fields via pybind11
- amx.py: Propagate _interleaved_gate_up flag to loader, attach bias
  tensors to MOEConfig from Python-side pointers
- loader.py: Add MXFP4 dequantization (E2M1 lookup + E8M0 block scales),
  mxfp4_packed format detection, interleaved gate/up split (::2/1::2) for
  models that use interleaved rather than concatenated layout
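
The MXFP4 dequantization described above can be sketched in pure Python. The E2M1 value table is the standard FP4 set; the low-nibble-first packing and the 32-element block sharing one E8M0 scale are assumptions based on the OCP MX convention, and the real loader operates on torch tensors:

```python
# 4-bit E2M1 codes 0..15 -> values (sign bit is the high bit of the nibble)
E2M1_LUT = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0,
            -0.0, -0.5, -1.0, -1.5, -2.0, -3.0, -4.0, -6.0]

def dequant_mxfp4(packed: bytes, scale_e8m0: int) -> list:
    """Dequantize MXFP4 codes sharing one E8M0 block scale.

    Each byte packs two E2M1 codes (low nibble first, assumed); the
    shared scale is the power of two 2**(scale_e8m0 - 127).
    """
    scale = 2.0 ** (scale_e8m0 - 127)
    out = []
    for b in packed:
        out.append(E2M1_LUT[b & 0x0F] * scale)  # low nibble
        out.append(E2M1_LUT[b >> 4] * scale)    # high nibble
    return out
```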

Resolves: kvcache-ai#1861

Co-Authored-By: Claude Code <noreply@anthropic.com>
Comprehensive documentation covering the AMX kernel bias implementation,
MXFP4→BF16 conversion pipeline, interleaved weight layout discovery,
kernel correctness validation, and throughput analysis.

Co-Authored-By: Claude Code <noreply@anthropic.com>
@gemini-code-assist

Summary of Changes

Hello @vshortt73, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the AMX MoE kernel's capability to support the GPT-OSS-120B model. It introduces crucial features like expert biases and a custom activation function, alongside adapting to a unique interleaved weight layout. These changes are vital for accurate and efficient inference of GPT-OSS-120B, leading to a substantial performance improvement and coherent output.

Highlights

  • Expert Bias Support: Added native expert bias support for gate_up_proj_bias and down_proj_bias to the C++ AMX MoE kernel, enabling accurate inference for models like gpt-oss-120B.
  • Custom Activation Function: Implemented a custom activation function (gate * sigmoid(gate * α) * (up + 1)) with asymmetric clamping, specifically required by gpt-oss-120B.
  • Interleaved Weight Layout Handling: Enabled the kernel to correctly process interleaved gate/up weight layouts (even indices = gate, odd indices = up), which differs from standard concatenated halves.
  • Performance Improvement: Achieved a 5.75x speedup for gpt-oss-120B inference on CPU (from 0.04 t/s to 0.23 t/s) compared to Python fallback.
  • Backward Compatibility: Ensured backward compatibility for existing models by making all new bias and activation fields nullable in the C++ configuration.


Changelog
  • kt-kernel/ext_bindings.cpp
    • Exposed new gate_bias, up_bias, down_bias pointers and gemm1_alpha, gemm1_clamp_limit to Python via pybind11.
  • kt-kernel/operators/amx/la/amx.hpp
    • Added act_fn_alpha for GPT-OSS specific activation with asymmetric clamping.
  • kt-kernel/operators/amx/moe_base.hpp
    • Integrated apply_down_bias call after GEMM2 in both prefill and decode paths.
    • Modified apply_activation to conditionally apply gate/up biases and use act_fn_alpha.
    • Implemented apply_down_bias function to add down-projection biases.
  • kt-kernel/operators/common.hpp
    • Extended GeneralMOEConfig with new gate_bias, up_bias, down_bias pointers and gemm1_alpha, gemm1_clamp_limit fields.
  • kt-kernel/python/utils/amx.py
    • Propagated _interleaved_gate_up flag to the loader.
    • Adjusted hidden_size and intermediate_size for MOEConfig using actual tensor dimensions.
    • Attached bias tensors (_gate_bias_tensor, _up_bias_tensor, _down_bias_tensor) and activation parameters (_gemm1_alpha, _gemm1_clamp_limit) to the C++ MOEConfig.
  • kt-kernel/python/utils/loader.py
    • Implemented _dequant_mxfp4_to_bf16 for MXFP4 to BF16 dequantization.
    • Added auto-detection for mxfp4_packed format.
    • Created _load_experts_mxfp4_packed to handle MXFP4 weight loading and dequantization.
    • Modified _load_experts_packed to support interleaved gate/up layout based on a flag.


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces expert bias support and a custom activation function to the AMX MoE kernel for the gpt-oss-120B model, enhancing inference throughput. However, it presents a security risk by trusting raw memory pointers for bias tensors (gate_bias, up_bias, down_bias) without validation, potentially leading to out-of-bounds memory reads. Additionally, critical Tensor Parallelism (TP) correctness issues were found: gate_bias and up_bias indexing does not account for TP sharding, and down_bias is incorrectly applied, causing an erroneous bias multiplication in the final result.

}

void apply_down_bias(int activated_expert, int qlen) {
if (config_.down_bias == nullptr) return;

Severity: high

In a Tensor Parallel (TP) configuration, apply_down_bias is executed by every TP part. Since merge_results performs a summation of the outputs from all TP parts, the down_bias will be added tp_count times to the final result. To ensure correctness, the bias should only be added by one TP part (e.g., tp_part_idx == 0).

    if (config_.down_bias == nullptr || tp_part_idx != 0) return;
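
The over-counting is easy to see numerically (toy values; merge_results here is a Python stand-in for the kernel's TP reduction that sums per-part outputs):

```python
def merge_results(partials):
    """Sum per-TP-part partial outputs elementwise (toy stand-in)."""
    return [sum(col) for col in zip(*partials)]

tp_count = 4
bias = 1.0
partial_out = 0.25  # each part's bias-free contribution; true total is 1.0

# Bug: every part adds the bias, so the merged result carries it tp_count times.
wrong = merge_results([[partial_out + bias] for _ in range(tp_count)])

# Fix: only part 0 adds the bias.
right = merge_results([[partial_out + (bias if i == 0 else 0.0)]
                       for i in range(tp_count)])
```

Here wrong comes out as 5.0 (1.0 of output plus four copies of the bias) while right is the intended 2.0.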

Comment on lines +690 to +693
ggml_bf16_t* gate_bias_ptr = has_bias ?
(ggml_bf16_t*)config_.gate_bias + (size_t)expert_idx * config_.intermediate_size : nullptr;
ggml_bf16_t* up_bias_ptr = has_bias ?
(ggml_bf16_t*)config_.up_bias + (size_t)expert_idx * config_.intermediate_size : nullptr;

Severity: medium (security)

This code calculates memory offsets for gate_bias, up_bias, and down_bias using expert_idx * size without validating that the provided memory buffers are sufficiently large. This could lead to out-of-bounds memory reads, potentially leaking sensitive information or causing a crash. Furthermore, the indexing for gate_bias_ptr and up_bias_ptr is incorrect when using Tensor Parallelism (tp_count > 1), as intermediate_size is sharded while bias pointers are unsharded, leading to incorrect access and striding.

      int tp_count = config_.pool->config.subpool_count;
      size_t full_intermediate_size = (size_t)config_.intermediate_size * tp_count;
      size_t tp_offset = (size_t)tp_part_idx * config_.intermediate_size;

      // Bias pointers for this expert (if present)
      ggml_bf16_t* gate_bias_ptr = has_bias ?
          (ggml_bf16_t*)config_.gate_bias + (size_t)expert_idx * full_intermediate_size + tp_offset : nullptr;
      ggml_bf16_t* up_bias_ptr = has_bias ?
          (ggml_bf16_t*)config_.up_bias + (size_t)expert_idx * full_intermediate_size + tp_offset : nullptr;

Comment on lines +754 to +755
ggml_bf16_t* bias_ptr = (ggml_bf16_t*)config_.down_bias +
(size_t)expert_idx * config_.hidden_size;

Severity: medium (security)

Similar to the issue in apply_activation, this function accesses config_.down_bias using an offset derived from expert_idx without validating the underlying buffer size. This can lead to an out-of-bounds read if the bias tensor provided by the model is smaller than expected.

@gemini-code-assist

Hello @vshortt73,

Thank you for your feedback. I understand your perspective that this pull request represents a work in progress, and that you haven't declared it as production-ready or fully secure.

My purpose as a code assistant is to identify potential issues and opportunities for improvement based on general software engineering best practices and the review criteria provided to me, which include correctness and security. These criteria are fundamental to code quality, regardless of whether the code is commercial, a prototype, or a work-in-progress.

- apply_down_bias: only execute on tp_part_idx == 0 to prevent bias
  being added tp_count times after merge_results sums all TP parts
- gate_bias/up_bias: stride by full (unsharded) intermediate_size and
  offset to TP part slice — fixes incorrect memory access when tp_count > 1
- E2M1 lookup tensor: create on w_blocks.device to prevent implicit
  cross-device transfer during MXFP4 dequantization

No behavioral change when tp_count == 1 (single-GPU).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@vshortt73

Fixed the TP correctness issues (1 & 2) and the E2M1 device placement (4) in
08b203a. The raw pointer validation concern (3) is not applicable — this is
kernel-level C++ in a hot inner loop where the caller is responsible for
buffer validity, consistent with every CUDA/AMX/BLAS kernel in existence.


@ErvinXie ErvinXie left a comment


Nice work — this is a well-structured addition. The kernel changes are backward compatible, TP correctness was addressed in the follow-up commit, and the MXFP4 dequantization is memory-conscious. A few items to address before merging:


Should Fix

1. GPT_OSS_KERNEL_WORK.md should not live in repo root

A 681-line technical report in the repository root is not ideal. Please move it under doc/ (e.g., doc/en/kt-kernel/GPT-OSS-120B-Kernel-Work.md) or remove it from this PR and link to it externally.

2. actual_hidden / actual_intermediate change affects all models

In amx.py, replacing self.hidden_size / self.moe_intermediate_size with tensor-derived dimensions:

actual_hidden = self.gate_weights[0].shape[1]
actual_intermediate = self.gate_weights[0].shape[0]

This is a broader change that runs for every model, not just gpt-oss. For existing models config values and tensor shapes should match, so it's likely safe — but it's a silent behavioral change. Please either:

  • Add a comment explaining why this is necessary (GPU padding issue) and that it's intentional for all models, or
  • Guard it with a condition so it only applies when dimensions actually differ

Suggestions (non-blocking)

3. Bias tensor lifetime

Bias tensors are set externally (by kt_ep_wrapper, not in this PR) and passed to C++ via data_ptr(). If the Python-side tensor gets garbage collected, the C++ pointer becomes dangling. The current code uses self._gate_bias_tensor etc., which should keep them alive — just make sure the external caller stores them on self (not as locals).

4. MXFP4 gate/up split assumes concatenated layout

_load_experts_mxfp4_packed splits gate/up via [:mid, :] (concatenated), while _load_experts_packed now supports interleaved ([::2, :]). A brief comment in the MXFP4 path explaining why interleaved handling is not needed there would help future readers.

5. Uninitialized alpha_vec / limit_vec in apply_activation

__m512 alpha_vec, limit_vec;
if (use_alpha) {
    alpha_vec = _mm512_set1_ps(config_.gemm1_alpha);
    limit_vec = _mm512_set1_ps(config_.gemm1_clamp_limit);
}

These are only read inside if (use_alpha) so it's not a bug, but declaring them uninitialized is easy to misread. Consider initializing at declaration or moving them inside the if block.


Overall this is solid kernel work with good testing (zero-diff against GPU reference). The TP fixes in 08b203a look correct. Thanks for the detailed technical report and for filing the related issues!

- Move GPT_OSS_KERNEL_WORK.md to doc/en/kt-kernel/GPT-OSS-120B-Kernel-Work.md
- Add comment explaining actual_hidden/actual_intermediate applies to all models
- Add bias tensor lifetime safety comment at data_ptr() usage
- Add comment in MXFP4 loader explaining why concatenated split is correct
- Initialize alpha_vec/limit_vec at declaration in moe_base.hpp

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@vshortt73
Copy link
Author

vshortt73 commented Feb 26, 2026 via email



Development

Successfully merging this pull request may close these issues.

BF16 backend cannot parse gpt-oss packed/fused expert weight format

2 participants