Training works fine but checkpointing OOMs out #1748

@abhinavg4

Description

Describe the bug

I'm trying to train a Qwen3_8B model. These are the relevant config options:

```yaml
model:
  # recompute_granularity: full
  # recompute_method: uniform
  # recompute_num_layers: 1

  seq_length: 4096
  max_position_embeddings: 4096
  tensor_model_parallel_size: 2
  pipeline_model_parallel_size: 1

# .......
optimizer:
  # optimizer: adam (default)
  use_distributed_optimizer: true
# .........
checkpoint:
  save: ${cluster_config.exp_dir}
  load: ${cluster_config.exp_dir}
  save_interval: 2500
# ..........
```

The actual sequence length is twice 4096 (8192), and I'm running on a single H100 node (8× H100 HBM3 GPUs).

Steps/Code to reproduce bug

Training runs fine, but checkpointing fails with an OOM. Peak memory during the checkpoint save appears to exceed what training itself requires.
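A rough back-of-envelope estimate of why the save could spike (my own assumption, not measured from this run): with `use_distributed_optimizer: true`, each data-parallel rank normally holds only a shard of the fp32 optimizer state, but if the checkpoint path gathers the full state onto a rank before writing, memory jumps by roughly the unsharded state size. Assuming Adam keeps fp32 master weights plus two fp32 moments (~12 bytes/param):

```python
# Illustrative estimate only; the 12 bytes/param figure assumes fp32 master
# params + fp32 exp_avg + fp32 exp_avg_sq for Adam.
def optimizer_state_gib(num_params: int, bytes_per_param: int = 12) -> float:
    """Size of the full (unsharded) optimizer state in GiB."""
    return num_params * bytes_per_param / 2**30

params = 8_000_000_000  # Qwen3_8B, ~8B parameters
dp_ranks = 4            # 8 GPUs / tensor_model_parallel_size=2 -> 4 DP ranks

full = optimizer_state_gib(params)   # what a full gather would materialize
sharded = full / dp_ranks            # steady-state per-rank shard

print(f"full optimizer state: {full:.1f} GiB, per-rank shard: {sharded:.1f} GiB")
```

A full gather on this estimate would be ~89 GiB, which alone exceeds the 80 GB of a single H100, so even a partial gather on top of model + activations could plausibly OOM at save time.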

Expected behavior

No OOM

Additional context

Wandb log (may only be accessible inside NVIDIA): https://wandb.ai/joc/megatron/runs/jvlv139p/files/output.log

Labels

bug (Something isn't working)