
gpt-oss implementation #1739

@sbhavani

Description

This outlines the current status of gpt-oss features that need to be implemented in Megatron Core, leveraging Transformer Engine.

✅ UPDATE: All core GPT-OSS functionality is now available in Megatron Core (training) and Megatron Bridge (checkpoint conversion).

MoE Layer

Enabled Bias
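Unlike the bias-free MLPs typical of recent LLMs, the gpt-oss experts carry bias terms on their projections, which the MoE layer has to support. A toy top-1-routed MoE forward in numpy showing where those biases enter (the routing scheme and ReLU here are illustrative only, not gpt-oss's actual router or activation):

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_ff, n_experts = 8, 16, 4

# Per-expert weights AND biases -- gpt-oss enables bias on the expert
# projections, unlike the bias-free expert MLPs Megatron defaults assume.
W_in  = rng.standard_normal((n_experts, d_model, d_ff)) * 0.1
b_in  = rng.standard_normal((n_experts, d_ff)) * 0.1
W_out = rng.standard_normal((n_experts, d_ff, d_model)) * 0.1
b_out = rng.standard_normal((n_experts, d_model)) * 0.1
W_router = rng.standard_normal((d_model, n_experts)) * 0.1

def moe_forward(x):
    """Top-1 routed MoE forward for a batch of token vectors x: (T, d_model)."""
    logits = x @ W_router                                    # (T, n_experts)
    expert = logits.argmax(axis=-1)                          # hard top-1 routing
    gate = np.exp(logits.max(-1)) / np.exp(logits).sum(-1)   # softmax prob of winner
    out = np.empty_like(x)
    for t in range(x.shape[0]):
        e = expert[t]
        h = np.maximum(x[t] @ W_in[e] + b_in[e], 0.0)   # biased up-projection
        out[t] = gate[t] * (h @ W_out[e] + b_out[e])    # biased down-projection
    return out

y = moe_forward(rng.standard_normal((5, d_model)))
```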

Attention Mechanisms

Alternating Sliding-Window Attention Pattern

  • Status: Supported. Infrastructure exists for per-layer attention patterns and TE-backed sliding-window attention.
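gpt-oss interleaves sliding-window and full-attention layers. A minimal numpy sketch of the per-layer mask pattern (the layer parity and the small window size here are illustrative; gpt-oss uses a 128-token window):

```python
import numpy as np

def causal_mask(T, window=None):
    """Boolean mask over (query, key) positions: True = may attend.
    Causal, optionally restricted to the trailing `window` keys
    (inclusive of the query position itself)."""
    q = np.arange(T)[:, None]
    k = np.arange(T)[None, :]
    mask = k <= q                      # causal: no attending to the future
    if window is not None:
        mask &= k > q - window         # sliding window: only the last `window` keys
    return mask

# Alternating pattern: even layers windowed, odd layers full causal
n_layers, T, window = 4, 6, 3
layer_masks = [causal_mask(T, window if l % 2 == 0 else None)
               for l in range(n_layers)]
```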

Attention Sinks
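In gpt-oss, each attention head carries a learnable "sink" logit that joins the softmax normalization but contributes no value vector, letting a head put probability mass on nothing. A minimal single-head sketch of the mechanism (shapes and the scalar-sink formulation are illustrative):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention_with_sink(q, k, v, sink_logit):
    """Single-head attention where `sink_logit` enters the softmax
    denominator but has no value vector: the sink absorbs mass, so
    each row of attention weights sums to <= 1."""
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)                              # (Tq, Tk)
    sink_col = np.full((scores.shape[0], 1), sink_logit)
    probs = softmax(np.concatenate([scores, sink_col], axis=1))[:, :-1]
    return probs @ v

rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((3, 4)) for _ in range(3))
out = attention_with_sink(q, k, v, sink_logit=0.0)
```

With the sink logit driven to negative infinity, the mechanism reduces to ordinary softmax attention, which is a useful sanity check.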

Activation Functions

Custom SwiGLU with Clamping

  • Status: Supported
  • Implementation: Megatron Core added a partially fused version as a "custom quick GeGLU"; an FP8-aware fused kernel has been merged into Transformer Engine
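The clamping behavior can be sketched in numpy. The formula below follows my reading of the open-sourced gpt-oss reference code (the 7.0 clamp limit, the 1.702 quick-GELU coefficient, and the +1 shift on the linear half are taken from there and should be treated as assumptions):

```python
import numpy as np

def gpt_oss_swiglu(x, limit=7.0, alpha=1.702):
    """Clamped SwiGLU ("custom quick GeGLU") sketch.

    x holds the concatenated [gate, linear] halves of the up-projection.
    Both halves are clamped before the quick-GELU gate x * sigmoid(alpha*x)
    is applied; the linear half gets a +1 shift.
    """
    x_glu, x_lin = np.split(x, 2, axis=-1)
    x_glu = np.clip(x_glu, None, limit)        # clamp gate half from above only
    x_lin = np.clip(x_lin, -limit, limit)      # clamp linear half symmetrically
    gate = x_glu * (1.0 / (1.0 + np.exp(-alpha * x_glu)))   # quick GELU
    return gate * (x_lin + 1.0)

# Large inputs saturate at the clamp limit instead of overflowing
y = gpt_oss_swiglu(np.array([[0.0, 100.0, 0.0, 100.0]]))
```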

Positional Encodings

YaRN RoPE Scaling

  • Status: Fully Supported
  • Implementation:
    • YaRN scaling to 128k+ context
    • Integration with existing RoPE
    • YaRN for general RoPE/GPT models
    • Convergence validation
  • Usage: --position-embedding-type yarn with YaRN configuration parameters
  • Reference: arXiv:2309.00071
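A minimal numpy sketch of the YaRN frequency blending from arXiv:2309.00071: high-frequency RoPE dimensions keep their original frequencies, low-frequency dimensions are interpolated (divided by the scale factor), and a linear ramp mixes the middle band. The beta_fast/beta_slow defaults follow the paper; this is illustrative, not Megatron's implementation:

```python
import math
import numpy as np

def yarn_inv_freq(dim, base=10000.0, scale=8.0, orig_max_pos=4096,
                  beta_fast=32.0, beta_slow=1.0):
    """Blend interpolated and original RoPE inverse frequencies per YaRN."""
    half = dim // 2
    inv_freq = base ** (-np.arange(half) / half)   # standard RoPE inv. freqs

    # Dimension index at which a wavelength completes n_rot rotations
    # over the original context length.
    def corr_dim(n_rot):
        return half * math.log(orig_max_pos / (n_rot * 2 * math.pi)) \
               / math.log(base)

    low = max(math.floor(corr_dim(beta_fast)), 0)
    high = min(math.ceil(corr_dim(beta_slow)), half - 1)
    ramp = np.clip((np.arange(half) - low) / max(high - low, 1), 0.0, 1.0)
    extrap_weight = 1.0 - ramp                     # 1.0 = keep original freq
    return inv_freq * extrap_weight + (inv_freq / scale) * (1.0 - extrap_weight)

inv_freq = yarn_inv_freq(dim=128)
# YaRN also rescales attention temperature ("mscale") by 0.1*ln(s) + 1
mscale = 0.1 * math.log(8.0) + 1.0
```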

Megatron Bridge Support

Megatron Bridge provides full GPT-OSS integration:

  • Checkpoint Conversion: Hugging Face ↔ Megatron format
  • Pre-configured Providers: GPTOSSProvider20B and GPTOSSProvider120B
  • Quantization Support: Handles MXFP4 weight dequantization
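MXFP4, per the OCP Microscaling (MX) spec, packs 4-bit E2M1 elements that share one E8M0 power-of-two scale per 32-element block. A sketch of the dequantization a converter has to perform on such weights (illustrative; this is not Megatron Bridge's actual code):

```python
import numpy as np

# FP4 E2M1 magnitude table (OCP Microscaling spec): low 3 bits index the
# magnitude, the high bit is the sign.
FP4_E2M1 = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def dequant_mxfp4(codes, scales_e8m0, block=32):
    """Dequantize MXFP4: 4-bit E2M1 elements sharing one E8M0 scale per
    `block`-element group. E8M0 scale value = 2**(e8m0 - 127)."""
    codes = np.asarray(codes, dtype=np.uint8)
    sign = np.where(codes & 0x8, -1.0, 1.0)
    mag = FP4_E2M1[codes & 0x7]
    vals = (sign * mag).reshape(-1, block)
    scale = 2.0 ** (np.asarray(scales_e8m0, dtype=np.float64) - 127.0)
    return (vals * scale[:, None]).reshape(-1)

codes = np.full(32, 7, dtype=np.uint8)        # every element encodes +6.0
w = dequant_mxfp4(codes, scales_e8m0=[126])   # block scale = 2**-1 = 0.5
```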

Megatron Bridge + Megatron-LM Example

PR #2383 provides end-to-end example scripts covering checkpoint conversion (convert_mcore_bf16_checkpoint_from_hf.py) and training/fine-tuning (training_gptoss_20b_h100_bf16_fp8.sh).

Credits: @cuichenx for the core implementation, @yiakwy-xpu-ml-framework-team for the example scripts.
