Training works fine but checkpointing OOMs out #1748

@abhinavg4

Description

Describe the bug

I'm trying to train a Qwen3_8B model. These are the relevant config options:

```yaml
model:
  # recompute_granularity: full
  # recompute_method: uniform
  # recompute_num_layers: 1

  seq_length: 4096
  max_position_embeddings: 4096
  tensor_model_parallel_size: 2
  pipeline_model_parallel_size: 1

# .......
optimizer:
  # optimizer: adam (default)
  use_distributed_optimizer: true
# .........
checkpoint:
  save: ${cluster_config.exp_dir}
  load: ${cluster_config.exp_dir}
  save_interval: 2500
# ..........
```

The actual sequence length is twice 4096 (8192), and I'm running on a single H100 node (8× H100 HBM3 GPUs).

Steps/Code to reproduce bug

Training runs fine, but checkpointing fails with an OOM. Peak memory during the checkpoint save appears to exceed what training itself requires.
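A rough back-of-envelope estimate of why the save could spike (my own assumption, not measured from this run): with `use_distributed_optimizer: true`, each data-parallel rank normally holds only a shard of the fp32 optimizer state, but if the checkpoint path gathers the full state onto a rank before writing, memory jumps by roughly the unsharded state size. Assuming Adam keeps fp32 master weights plus two fp32 moments (~12 bytes/param):

```python
# Illustrative estimate only; the 12 bytes/param figure assumes fp32 master
# params + fp32 exp_avg + fp32 exp_avg_sq for Adam.
def optimizer_state_gib(num_params: int, bytes_per_param: int = 12) -> float:
    """Size of the full (unsharded) optimizer state in GiB."""
    return num_params * bytes_per_param / 2**30

params = 8_000_000_000  # Qwen3_8B, ~8B parameters
dp_ranks = 4            # 8 GPUs / tensor_model_parallel_size=2 -> 4 DP ranks

full = optimizer_state_gib(params)   # what a full gather would materialize
sharded = full / dp_ranks            # steady-state per-rank shard

print(f"full optimizer state: {full:.1f} GiB, per-rank shard: {sharded:.1f} GiB")
```

A full gather on this estimate would be ~89 GiB, which alone exceeds the 80 GB of a single H100, so even a partial gather on top of model + activations could plausibly OOM at save time.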

Expected behavior

No OOM

Additional context

Wandb log (may only be accessible inside NVIDIA): https://wandb.ai/joc/megatron/runs/jvlv139p/files/output.log

Labels

bug (Something isn't working)