Labels: Performance, bug, community-request, external, x-inflection
Description
I have been able to fit GPT-OSS 20B on 2 H100 nodes with GRPO Math, but I would expect it to fit on a single node, given that Qwen3 30B fits on one.
Secondly, the performance of GPT-OSS 20B is much lower than expected: the training workers run at ~400 TPS per GPU (training worker group tokens/sec/GPU), while Qwen3 30B runs at ~850 TPS.
I am not sure whether 16-bit training is working as expected, and whether that explains the lower-than-expected throughput and higher-than-expected memory usage, but here is my config:
```yaml
defaults: grpo-gptoss-20b-8n8g-megatron.yaml

policy:
  precision: "bfloat16"
  megatron_cfg:
    defer_fp32_logits: true
    activation_checkpointing: true
    optimizer:
      use_precision_aware_optimizer: true
      bf16: true
      fp16: false
  generation:
    vllm_cfg:
      tensor_parallel_size: 4

cluster:
  num_nodes: 2
```
Using this as a base.
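For reference, this is roughly the single-node variant I would expect to be able to run (an untested sketch on my part; keeping tensor_parallel_size: 4 and only dropping to one node is my assumption, not something I have verified fits in memory):

```yaml
# Hypothetical single-node override (untested) - same base and precision
# settings as above, only the cluster size changes.
defaults: grpo-gptoss-20b-8n8g-megatron.yaml

policy:
  precision: "bfloat16"
  megatron_cfg:
    defer_fp32_logits: true
    activation_checkpointing: true
    optimizer:
      use_precision_aware_optimizer: true
      bf16: true
      fp16: false
  generation:
    vllm_cfg:
      tensor_parallel_size: 4  # assumption: TP=4 still divides the 8 GPUs of a single node

cluster:
  num_nodes: 1  # the change I would expect to be possible given the 20B model size
```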
As a side note, the 120B model gives a lot of conversion warnings:
```
megatron.bridge.models.conversion.param_mapping:warning: Dtype mismatch between HuggingFace weights and Megatron module. HF dtype: torch.bfloat16. Megatron dtype: torch.float32.
```