I am training a Llama model with ZenFlow using the fine-tuning script released in the DeepSpeedExamples repository. Interestingly, the loss values become NaN after the update_interval number of steps preconfigured in the DeepSpeed config. For example, with update_interval set to 4, the loss becomes NaN from the 5th step onward. The code used is available here: https://github.com/deepspeedai/DeepSpeedExamples/tree/master/training/DeepSpeed-ZenFlow/finetuning
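For context, the ZenFlow-related part of my DeepSpeed config follows the example's defaults; the sketch below (written as the equivalent Python dict, with key names beyond update_interval reconstructed from memory, so they may not match the example's ds_config exactly) just shows where update_interval sits:

```python
# Approximate shape of the relevant config section; update_interval is the
# only value I changed, the remaining keys are reconstructed from memory of
# the example's config and may be slightly off.
zero_optimization_section = {
    "stage": 2,
    "offload_optimizer": {"device": "cpu", "pin_memory": True},
    "zenflow": {
        "update_interval": 4,  # the loss turns NaN right after this many steps
        # other ZenFlow options are left at the example's defaults
    },
}
```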
My software configuration and versions:
torch==2.5.0+cu118
transformers==4.57.3
deepspeed: latest version, cloned from GitHub
datasets==4.4.1
The jobs run on DGX H200 GPUs with an AMD EPYC 7742 processor.
I would like to understand why the loss turns NaN and, more importantly, why it happens only after the update_interval-th step. Could I please get some help with this?
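In case it helps with diagnosis, this is the kind of check I can add just before the optimizer step to confirm whether the gradients are already non-finite when the accumulated update is applied (plain PyTorch, not ZenFlow-specific, and the function name is just my own):

```python
import torch

def log_nonfinite_grads(model: torch.nn.Module, step: int) -> None:
    # Report any parameter whose gradient contains inf/NaN; calling this right
    # before the optimizer step should show whether the blow-up originates in
    # the gradients or in the (offloaded) update itself.
    for name, param in model.named_parameters():
        if param.grad is not None and not torch.isfinite(param.grad).all():
            print(f"[step {step}] non-finite gradient in {name}")
```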