Closed
Labels: bug (Something isn't working)
Description
Describe the bug
I'm trying to train a Qwen3_8B model. These are the relevant config options used:

```yaml
model:
  # recompute_granularity: full
  # recompute_method: uniform
  # recompute_num_layers: 1
  seq_length: 4096
  max_position_embeddings: 4096
  tensor_model_parallel_size: 2
  pipeline_model_parallel_size: 1
  # ...
optimizer:
  # optimizer: adam (default)
  use_distributed_optimizer: true
  # ...
checkpoint:
  save: ${cluster_config.exp_dir}
  load: ${cluster_config.exp_dir}
  save_interval: 2500
  # ...
```
The actual sequence length is twice 4096 (i.e., 8192), and I'm using a single H100 node (8× H100 HBM3 GPUs).
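For context, here is a back-of-envelope estimate of the steady-state per-GPU memory for this setup. All numbers are my own assumptions (8B params, bf16 weights, Adam, TP=2, so DP=4 across the 8 GPUs), not taken from the log:

```python
# Rough per-GPU memory estimate for an ~8B-param model trained in bf16
# with Adam and Megatron's distributed optimizer. Hypothetical numbers,
# only meant to bound the steady-state footprint before checkpointing.
N = 8e9            # approximate parameter count (assumption)
TP, DP = 2, 4      # tensor-parallel and data-parallel sizes (8 GPUs)
GB = 1024**3

params_bf16 = 2 * N / TP          # bf16 weights, sharded by TP
grads = 4 * N / TP                # grad buffer (assuming fp32), TP-sharded
# The distributed optimizer shards the fp32 main params plus Adam's
# m and v states across the DP ranks (4 + 4 + 4 bytes per param):
opt_states = 12 * N / (TP * DP)

print(f"weights   ~{params_bf16 / GB:.1f} GB")
print(f"grads     ~{grads / GB:.1f} GB")
print(f"optimizer ~{opt_states / GB:.1f} GB")
```

Activations come on top of this, so the steady-state training footprint already sits fairly close to the 80 GB HBM limit; any transient allocation at checkpoint time has little headroom.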
Steps/Code to reproduce bug
Training runs fine, but checkpoint saving fails with an OOM. I suspect peak memory during checkpointing exceeds what training itself requires.
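One plausible mechanism (an assumption on my part, not confirmed from the log): with `use_distributed_optimizer: true`, each DP rank holds only a 1/DP shard of the fp32 optimizer state, and producing a full (non-sharded) checkpoint requires gathering those shards into a full-size buffer, which is a transient allocation on top of the steady-state training memory. A minimal illustrative sketch, not Megatron's actual code:

```python
# Illustrative only: gathering DP-sharded fp32 optimizer shards into one
# full tensor before writing a checkpoint transiently allocates the
# full-size buffer in addition to the shards already resident.
import numpy as np

DP = 4
full_numel = 1_000_000                 # stand-in size; real model is ~8B params
# each rank holds a 1/DP shard of the fp32 main params / Adam states
shard = np.zeros(full_numel // DP, dtype=np.float32)

def save_gathered(shards):
    # all-gather equivalent: the concatenated result is the full fp32 copy
    full = np.concatenate(shards)      # transient full-size allocation
    return full.nbytes                 # 4 bytes per parameter

spike = save_gathered([shard] * DP)
print(spike)
```

If this is what happens, the extra ~4 bytes/param spike at save time would explain an OOM even though the training steps themselves fit.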
Expected behavior
No OOM
Additional context
Wandb log (might only be accessible inside NVIDIA): https://wandb.ai/joc/megatron/runs/jvlv139p/files/output.log