Description
Reminder
- I have read the above rules and searched the existing issues.
System Info
- kt-kernel: main @ 16a8b98
- SGLang: main @ 45095bac7
- Model: openai/gpt-oss-120b (HuggingFace safetensors, MXFP4)
- GPU: RTX 5090 (32GB)
- CPU: AMD Ryzen 9 9900X (Zen 5, AVX-512 BF16)
Reproduction
Repository: kvcache-ai/ktransformers (possibly related to #1655)
Summary
The BF16SafeTensorLoader defaults to the DeepSeek MoE weight format and cannot parse gpt-oss's packed, fused expert tensors. This is a separate failure mode from the GGUF/MXFP4 type 39 issue already documented in #1655.
Details
Weight format mismatch
The BF16SafeTensorLoader expects DeepSeek-style per-expert weight keys:
```
model.layers.0.mlp.experts.0.gate_proj.weight
model.layers.0.mlp.experts.0.up_proj.weight
model.layers.0.mlp.experts.0.down_proj.weight
```
gpt-oss uses packed fused tensors with MXFP4 block/scale format:
```
model.layers.0.mlp.experts.gate_up_proj_blocks
model.layers.0.mlp.experts.gate_up_proj_scales
model.layers.0.mlp.experts.gate_up_proj_bias
model.layers.0.mlp.experts.down_proj_blocks
model.layers.0.mlp.experts.down_proj_scales
model.layers.0.mlp.experts.down_proj_bias
```
All 128 experts are packed into single tensors per layer (fused gate+up projection), with separate block quantization scales. There is no per-expert dimension in the key naming.
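The two naming schemes are distinguishable from key names alone, so a format check could run before any weights are read. A minimal sketch (function and regex names are hypothetical, not the loader's actual API):

```python
import re

# DeepSeek-style keys carry a per-expert index; gpt-oss keys name a single
# fused tensor per layer with a _blocks/_scales/_bias suffix.
DEEPSEEK_RE = re.compile(r"\.mlp\.experts\.\d+\.(gate|up|down)_proj\.weight$")
GPTOSS_RE = re.compile(r"\.mlp\.experts\.(gate_up|down)_proj_(blocks|scales|bias)$")

def detect_moe_format(keys):
    """Return 'gpt-oss', 'deepseek', or None based on tensor key names."""
    if any(GPTOSS_RE.search(k) for k in keys):
        return "gpt-oss"
    if any(DEEPSEEK_RE.search(k) for k in keys):
        return "deepseek"
    return None
```

Checking for the gpt-oss pattern first means the packed format is never silently mistaken for per-expert keys.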
Observed behavior
```
[BF16SafeTensorLoader] No MoE format detected, defaulting to: deepseek
```
The loader (kt-kernel/python/utils/loader.py line 463) fails to match gpt-oss's key pattern and falls back to DeepSeek format, which then fails to load the weights correctly.
Three confirmed failure paths for gpt-oss on KTransformers
For completeness, here are all three integration paths attempted and their failure modes:
| Backend | Weight Source | Failure | Root Cause |
|---|---|---|---|
| BF16 | HF safetensors | Format mismatch | Packed fused expert tensors, not per-expert keys |
| LLAMAFILE | MXFP4 GGUF | `ValueError: 39 is not a valid GGMLQuantizationType` | gguf 0.17.1 lacks MXFP4 type 39 |
| LLAMAFILE | Q4_K_M GGUF | Testing in progress | Standard quant type, should be compatible |
Suggestion
To support gpt-oss (and likely future models using packed/fused MoE formats):
- Add a gpt-oss format detector in the BF16SafeTensorLoader alongside the existing DeepSeek detector
- Implement unpacking logic for the fused gate_up_proj_blocks/scales tensors into individual expert weights
- Alternatively, document that gpt-oss requires the LLAMAFILE backend with standard (non-MXFP4) GGUF quantizations
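To make the unpacking suggestion concrete, here is a stdlib-only sketch of MXFP4 dequantization for a single 32-element block, following the OCP Microscaling spec (FP4 E2M1 values, E8M0 power-of-two block scales). The nibble ordering and the helper itself are assumptions for illustration, not the checkpoint's documented layout:

```python
# FP4 (E2M1) code points: 8 magnitudes, high bit of the nibble is the sign.
FP4_VALUES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0,
              -0.0, -0.5, -1.0, -1.5, -2.0, -3.0, -4.0, -6.0]

def dequant_mxfp4_block(block_bytes, scale_byte):
    """Dequantize one MXFP4 block: 16 packed bytes (two FP4 nibbles each,
    low nibble first -- nibble order is an assumption here) plus one E8M0
    scale byte (a power of two with bias 127). Returns 32 floats."""
    scale = 2.0 ** (scale_byte - 127)
    out = []
    for b in block_bytes:
        out.append(FP4_VALUES[b & 0x0F] * scale)  # low nibble
        out.append(FP4_VALUES[b >> 4] * scale)    # high nibble
    return out
```

A full unpacker would apply this per block across the fused gate_up_proj tensor, then split the result along the fused dimension into per-expert gate and up matrices before handing them to the BF16 path.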
Note on successful llama.cpp baseline
For reference, gpt-oss-120b runs successfully on the same hardware via llama.cpp with --override-tensor expert offloading at 37.95 t/s (15 of 36 layers' experts on GPU, remainder on CPU via DDR5). The model architecture is functional — it's specifically the KTransformers integration paths that need gpt-oss format support.
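For anyone reproducing the baseline, the offload described above can be approximated with an invocation along these lines (the binary path, GGUF filename, and tensor-name regex are assumptions; llama.cpp's `--override-tensor` takes regex=buffer pairs, and the range below pushes layers 15-35 expert tensors to CPU to match the 15-of-36 split):

```shell
./llama-server \
  -m gpt-oss-120b-mxfp4.gguf \
  --n-gpu-layers 99 \
  --override-tensor "blk\.(1[5-9]|2[0-9]|3[0-5])\.ffn_.*_exps\.=CPU"
```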
Others
No response