Open
Labels
bug (Something isn't working), repro needed (The issue is missing a reproducible example), ver: 2.5.x
Description
Bug description
I am trying to use Lightning with DeepSpeed stage 3 to train a model with precision "16-mixed". However, the model parameters contain NaN and Inf values after the first step. When I switch the strategy to DDP, the issue does not occur.
I initialize my trainer as:
trainer = Trainer(
    max_epochs=max_epochs,
    logger=logger,
    callbacks=[checkpoint_callback, lr_monitor],
    sync_batchnorm=sync_batchnorm,
    check_val_every_n_epoch=None,
    val_check_interval=every_n_train_steps * accumulate_grad_batches,
    devices="auto",
    accelerator="gpu",
    precision="16-mixed",
    strategy=deepspeed_stage_3,
    accumulate_grad_batches=accumulate_grad_batches,
)
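To narrow down which parameters contain the NaN and Inf values described above, a check along the following lines might help. This is only a minimal sketch: the helper name `find_nonfinite_params` is hypothetical and not part of Lightning or DeepSpeed. It also assumes the parameters are fully materialized on the current rank. Under DeepSpeed stage 3 the parameters are partitioned across ranks, so they would probably need to be gathered first, and the sketch simply skips empty tensors.

```python
import torch
from torch import nn


def find_nonfinite_params(module: nn.Module) -> dict[str, tuple[int, int]]:
    """Return {parameter name: (nan_count, inf_count)} for parameters
    that contain at least one NaN or Inf value.

    Hypothetical helper for debugging; not a Lightning or DeepSpeed API.
    """
    bad = {}
    for name, param in module.named_parameters():
        data = param.detach()
        if data.numel() == 0:
            # Assumption: an empty tensor is a partitioned (ZeRO stage 3)
            # placeholder on this rank, so there is nothing to inspect here.
            continue
        nan_count = int(torch.isnan(data).sum())
        inf_count = int(torch.isinf(data).sum())
        if nan_count or inf_count:
            bad[name] = (nan_count, inf_count)
    return bad
```

Calling this on the model after the first optimizer step (for example from the `on_train_batch_end` hook) and comparing the output between the DeepSpeed and DDP runs would show whether all parameters or only specific layers are affected.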
What version are you seeing the problem on?
v2.5
How to reproduce the bug
trainer = Trainer(
    max_epochs=max_epochs,
    logger=logger,
    callbacks=[checkpoint_callback, lr_monitor],
    sync_batchnorm=sync_batchnorm,
    check_val_every_n_epoch=None,
    val_check_interval=every_n_train_steps * accumulate_grad_batches,
    devices="auto",
    accelerator="gpu",
    precision="16-mixed",
    strategy=deepspeed_stage_3,
    accumulate_grad_batches=accumulate_grad_batches,
)
Error messages and logs
# Error messages and logs here please
Environment
Current environment
#- PyTorch Lightning Version (e.g., 2.5.0):
#- PyTorch Version (e.g., 2.5):
#- Python version (e.g., 3.12):
#- OS (e.g., Linux):
#- CUDA/cuDNN version:
#- GPU models and configuration:
#- How you installed Lightning(`conda`, `pip`, source):
More info
No response