GPT-OSS performance and memory usage #1604

@william-baker-inflection

Description

I have been able to fit GPT-OSS 20B on 2 H100 nodes with GRPO Math, but I would expect it to fit on a single node, given that Qwen3 30B does.
Secondly, the performance of GPT-OSS 20B is much lower than expected: training workers are running at ~400 TPS per GPU, while Qwen3 30B runs at ~850 TPS (training worker group tokens per second per GPU).

I am not sure whether 16-bit training is working as expected and is what is causing the reduced throughput and increased memory usage (see the dtype-check sketch after the config). Here is my config:

```yaml
defaults: grpo-gptoss-20b-8n8g-megatron.yaml
policy:
  precision: "bfloat16"
  megatron_cfg:
    defer_fp32_logits: true
    activation_checkpointing: true
    optimizer:
      use_precision_aware_optimizer: true
      bf16: true
      fp16: false
  generation:
    vllm_cfg:
      tensor_parallel_size: 4
cluster:
  num_nodes: 2
```

Using this as a base
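
In case it helps with reproducing or debugging: here is a minimal sketch of how one could check whether the loaded parameters actually end up in bf16 rather than fp32. All names here are assumptions; `model` stands for whatever torch module the Megatron training worker holds, grabbed from a debugger or a hook, not an API from this repo.

```python
# Hypothetical dtype audit for the policy's torch module (assumed accessible).
from collections import Counter

import torch


def summarize_param_dtypes(model: torch.nn.Module) -> None:
    """Print how many parameter elements (and roughly how many GiB) each dtype holds."""
    counts: Counter = Counter()
    for p in model.parameters():
        counts[p.dtype] += p.numel()
    for dtype, numel in counts.items():
        bytes_per_elem = torch.empty((), dtype=dtype).element_size()
        gib = numel * bytes_per_elem / 2**30
        print(f"{dtype}: {numel:,} elements (~{gib:.1f} GiB)")
```

If a large fraction of elements show up as `torch.float32` here, that would line up with both the extra memory and the dtype-mismatch warnings below.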

As a side note, 120B gives a lot of conversion warnings:
megatron.bridge.models.conversion.param_mapping:warning: Dtype mismatch between HuggingFace weights and Megatron module. HF dtype: torch.bfloat16. Megatron dtype: torch.float32.
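
If those warnings mean the Megatron-side weights are materialized in fp32, that alone would roughly double the weight memory, which could be part of why a second node is needed. A rough back-of-envelope calculation (my own arithmetic, not measured):

```python
# Illustrative weight-memory estimate for a ~20B-parameter model.
params = 20e9
bf16_gib = params * 2 / 2**30  # ~37 GiB if weights stay in bf16
fp32_gib = params * 4 / 2**30  # ~75 GiB if weights are promoted to fp32
print(f"bf16 weights: ~{bf16_gib:.0f} GiB, fp32 weights: ~{fp32_gib:.0f} GiB")
```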
