Labels: Performance, bug, community-request, external, x-inflection
Description
I have been able to fit GPT-OSS 20B on 2 H100 nodes with GRPO Math, but I would expect it to fit on a single node, given that Qwen3 30B fits on one.
Secondly, the performance of GPT-OSS 20B is much lower than expected: the training workers run at ~400 TPS per GPU (training worker group tokens/sec/GPU), while Qwen3 30B runs at ~850 TPS.
I am not sure whether 16-bit training is working as expected, and whether that explains the lower-than-expected throughput and higher-than-expected memory usage, but here is my config:
```yaml
defaults: grpo-gptoss-20b-8n8g-megatron.yaml

policy:
  precision: "bfloat16"
  megatron_cfg:
    defer_fp32_logits: true
    activation_checkpointing: true
    optimizer:
      use_precision_aware_optimizer: true
      bf16: true
      fp16: false
  generation:
    vllm_cfg:
      tensor_parallel_size: 4

cluster:
  num_nodes: 2
```
Using this as a base.
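For reference, this is roughly the single-node variant I would expect to be able to run (an untested sketch on my part; keeping tensor_parallel_size: 4 and only dropping to one node is my assumption, not something I have verified fits in memory):

```yaml
# Hypothetical single-node override (untested) - same base and precision
# settings as above, only the cluster size changes.
defaults: grpo-gptoss-20b-8n8g-megatron.yaml

policy:
  precision: "bfloat16"
  megatron_cfg:
    defer_fp32_logits: true
    activation_checkpointing: true
    optimizer:
      use_precision_aware_optimizer: true
      bf16: true
      fp16: false
  generation:
    vllm_cfg:
      tensor_parallel_size: 4  # assumption: TP=4 still divides the 8 GPUs of a single node

cluster:
  num_nodes: 1  # the change I would expect to be possible given the 20B model size
```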
As a side note, the 120B model gives a lot of conversion warnings:
```
megatron.bridge.models.conversion.param_mapping:warning: Dtype mismatch between HuggingFace weights and Megatron module. HF dtype: torch.bfloat16. Megatron dtype: torch.float32.
```