DeepSpeed Stage 3 in Lightning leads to NaN and Inf values in the model parameters. #20534

@LittleFlyingSheep

Bug description

I am using Lightning with DeepSpeed stage 3 to train a model with precision "16-mixed". However, the model parameters contain NaN and Inf values at the first training step. When I switch the strategy to DDP, the issue does not occur.

I initialize my trainer as:

trainer = Trainer(
        max_epochs=max_epochs,
        logger=logger,
        callbacks=[checkpoint_callback, lr_monitor],
        sync_batchnorm=sync_batchnorm,
        check_val_every_n_epoch=None,
        val_check_interval=every_n_train_steps * accumulate_grad_batches,
        devices="auto",
        accelerator="gpu",
        precision="16-mixed",
        strategy=deepspeed_stage_3,
        accumulate_grad_batches=accumulate_grad_batches, 
    )

What version are you seeing the problem on?

v2.5

How to reproduce the bug

trainer = Trainer(
        max_epochs=max_epochs,
        logger=logger,
        callbacks=[checkpoint_callback, lr_monitor],
        sync_batchnorm=sync_batchnorm,
        check_val_every_n_epoch=None,
        val_check_interval=every_n_train_steps * accumulate_grad_batches,
        devices="auto",
        accelerator="gpu",
        precision="16-mixed",
        strategy=deepspeed_stage_3,
        accumulate_grad_batches=accumulate_grad_batches,
    )
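
To confirm which parameters contain the NaN and Inf values after the first step, a check along these lines could be used. It is a minimal sketch in plain PyTorch, not part of Lightning or DeepSpeed; the helper name `find_nonfinite_parameters` is hypothetical. One assumption to keep in mind: under DeepSpeed stage 3 the parameters are partitioned across ranks, so a rank may see only a placeholder for a parameter unless it is gathered first (for example inside `deepspeed.zero.GatheredParameters`), and the sketch simply skips empty tensors.

```python
import torch
from torch import nn


def find_nonfinite_parameters(module: nn.Module) -> dict[str, tuple[int, int]]:
    """Map each parameter name to (nan_count, inf_count) if it has non-finite values."""
    report = {}
    for name, param in module.named_parameters():
        data = param.detach()
        if data.numel() == 0:
            # Assumption: with ZeRO stage 3 a rank may hold only an empty
            # placeholder for a partitioned parameter, so there is nothing to check.
            continue
        nan_count = int(torch.isnan(data).sum())
        inf_count = int(torch.isinf(data).sum())
        if nan_count or inf_count:
            report[name] = (nan_count, inf_count)
    return report


if __name__ == "__main__":
    # Small demonstration on a toy layer with deliberately corrupted values.
    layer = nn.Linear(4, 2)
    with torch.no_grad():
        layer.weight[0, 0] = float("nan")
        layer.bias[1] = float("inf")
    print(find_nonfinite_parameters(layer))
```

Calling the helper on the LightningModule (for instance from `on_train_batch_end`) on both the DeepSpeed and the DDP runs would show whether the non-finite values appear in specific layers or across the whole model.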

Error messages and logs

# Error messages and logs here please

Environment

Current environment
#- PyTorch Lightning Version (e.g., 2.5.0):
#- PyTorch Version (e.g., 2.5):
#- Python version (e.g., 3.12):
#- OS (e.g., Linux):
#- CUDA/cuDNN version:
#- GPU models and configuration:
#- How you installed Lightning(`conda`, `pip`, source):

More info

No response

Metadata

Labels

bug (Something isn't working)
repro needed (The issue is missing a reproducible example)
ver: 2.5.x
