BF16 backend cannot parse gpt-oss packed/fused expert weight format #1861

@vshortt73

Description

Reminder

  • I have read the above rules and searched the existing issues.

System Info

  • kt-kernel: main @ 16a8b98
  • SGLang: main @ 45095bac7
  • Model: openai/gpt-oss-120b (HuggingFace safetensors, MXFP4)
  • GPU: RTX 5090 (32GB)
  • CPU: AMD Ryzen 9 9900X (Zen 5, AVX-512 BF16)

Reproduction

Repository: kvcache-ai/ktransformers (related to #1655)

Summary

The BF16SafeTensorLoader defaults to DeepSeek MoE weight format and cannot parse gpt-oss's packed fused expert tensors. This is a separate failure mode from the GGUF/MXFP4 type 39 issue already documented in #1655.

Details

Weight format mismatch

The BF16SafeTensorLoader expects DeepSeek-style per-expert weight keys:

model.layers.0.mlp.experts.0.gate_proj.weight
model.layers.0.mlp.experts.0.up_proj.weight
model.layers.0.mlp.experts.0.down_proj.weight

gpt-oss uses packed fused tensors with MXFP4 block/scale format:

model.layers.0.mlp.experts.gate_up_proj_blocks
model.layers.0.mlp.experts.gate_up_proj_scales
model.layers.0.mlp.experts.gate_up_proj_bias
model.layers.0.mlp.experts.down_proj_blocks
model.layers.0.mlp.experts.down_proj_scales
model.layers.0.mlp.experts.down_proj_bias

All 128 experts are packed into single tensors per layer (fused gate+up projection), with separate block quantization scales. There is no per-expert dimension in the key naming.

Observed behavior

[BF16SafeTensorLoader] No MoE format detected, defaulting to: deepseek

The loader (kt-kernel/python/utils/loader.py, line 463) does not match gpt-oss's key pattern and falls back to the DeepSeek format, so weight loading fails.
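A minimal sketch of how a key-pattern check could distinguish the two layouts, assuming the loader can inspect the safetensors key names before loading (the function name and return strings are hypothetical, not the actual kt-kernel API):

```python
def detect_moe_format(keys):
    """Classify the MoE weight layout from safetensors key names.

    Returns "gpt_oss" for packed fused expert tensors, "deepseek" for
    per-expert keys, or None when neither pattern matches.
    """
    for k in keys:
        # gpt-oss: one fused tensor per layer, no per-expert index in the key
        if ".mlp.experts.gate_up_proj_blocks" in k:
            return "gpt_oss"
        # DeepSeek: an expert index sits between "experts" and the proj name
        if ".mlp.experts.0.gate_proj.weight" in k:
            return "deepseek"
    return None
```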

Three confirmed failure paths for gpt-oss on KTransformers

For completeness, here are all three integration paths attempted and their failure modes:

| Backend | Weight source | Failure | Root cause |
| --- | --- | --- | --- |
| BF16 | HF safetensors | Format mismatch | Packed fused expert tensors, not per-expert keys |
| LLAMAFILE | MXFP4 GGUF | ValueError: 39 is not a valid GGMLQuantizationType | gguf 0.17.1 lacks MXFP4 type 39 |
| LLAMAFILE | Q4_K_M GGUF | Testing in progress | Standard quant type, should be compatible |

Suggestion

To support gpt-oss (and likely future models using packed/fused MoE formats):

  1. Add a gpt-oss format detector in the BF16SafeTensorLoader alongside the existing DeepSeek detector
  2. Implement unpacking logic for fused gate_up_proj_blocks/scales into individual expert weights
  3. Alternatively, document that gpt-oss requires the LLAMAFILE backend with standard (non-MXFP4) GGUF quantizations
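As a rough sketch of step 2: MXFP4 stores 32 FP4 (E2M1) values per block, packed two per byte, with one shared E8M0 exponent scale per block. A numpy dequantizer under those assumptions could look like the following (nibble ordering and the exact gpt-oss tensor shapes are guesses, not verified against the checkpoint):

```python
import numpy as np

# FP4 (E2M1) code points; index = 4-bit nibble, high bit is the sign.
FP4_VALUES = np.array(
    [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0,
     -0.0, -0.5, -1.0, -1.5, -2.0, -3.0, -4.0, -6.0],
    dtype=np.float32,
)

def dequantize_mxfp4(blocks: np.ndarray, scales: np.ndarray) -> np.ndarray:
    """Dequantize MXFP4 blocks/scales into float32.

    blocks: uint8 array [..., n_blocks, 16] -- 16 bytes = 32 FP4 values
    scales: uint8 array [..., n_blocks]     -- E8M0 exponent, bias 127
    Returns float32 array [..., n_blocks * 32].
    """
    lo = FP4_VALUES[blocks & 0x0F]   # low nibble of each byte
    hi = FP4_VALUES[blocks >> 4]     # high nibble of each byte
    # Interleave low/high nibbles per byte (ordering is an assumption).
    vals = np.stack([lo, hi], axis=-1).reshape(*blocks.shape[:-1], -1)
    # E8M0 scale: 2**(stored_exponent - 127), one per 32-value block.
    scale = np.exp2(scales.astype(np.float32) - 127.0)[..., None]
    return vals * scale
```

Per-expert slicing would then just index the leading expert dimension of the fused gate_up_proj tensor before or after dequantization.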

Note on successful llama.cpp baseline

For reference, gpt-oss-120b runs successfully on the same hardware via llama.cpp with --override-tensor expert offloading at 37.95 t/s (15 of 36 layers' experts on GPU, remainder on CPU via DDR5). The model architecture is functional — it's specifically the KTransformers integration paths that need gpt-oss format support.

Others

No response
