
[BUG] ZenFlow: loss values become NaN after `update_interval` number of steps. #7759

@karthik2603-theBrogrammer

Description

I was training a Llama model with ZenFlow, using the script released in the DeepSpeed-Examples repository. I noticed that the loss becomes NaN once training passes the `update_interval` configured in the DeepSpeed config. For example, with `update_interval` set to 4, the loss is NaN from the 5th step onward. The code used is available here: https://github.com/deepspeedai/DeepSpeedExamples/tree/master/training/DeepSpeed-ZenFlow/finetuning
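
For context, the ZenFlow-related portion of the DeepSpeed config looks roughly like the sketch below. Apart from `update_interval`, the surrounding field names are only my reading of the example config in the linked repo, so treat them as approximate:

```python
# Sketch of the relevant part of the DeepSpeed config, written as the dict
# passed to deepspeed.initialize(config=...). Only "update_interval" is the
# value discussed above; the ZeRO stage-2 CPU-offload fields and the exact
# "zenflow" sub-keys are assumptions based on the example finetuning config.
ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "bf16": {"enabled": True},
    "zero_optimization": {
        "stage": 2,
        "offload_optimizer": {"device": "cpu", "pin_memory": True},
        "zenflow": {
            "update_interval": 4,  # loss turns NaN at step 5, right after the first interval
        },
    },
}
```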

My software configuration and versions:

torch==2.5.0+cu118
transformers==4.57.3
deepspeed: latest version, cloned from GitHub
datasets==4.4.1

The jobs run on DGX H200 GPUs with an AMD EPYC 7742 processor.

I would like to understand why the loss turns NaN and, more importantly, why it happens only after the `update_interval`-th step. Could I please get some help with this?
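
In case it helps narrow this down, below is a minimal sketch of the NaN probe I plan to add around the training loop from the linked script, assuming `engine` and `loader` are the DeepSpeed engine and dataloader created there:

```python
import torch

# Sketch only: `engine` is the DeepSpeed engine returned by deepspeed.initialize
# and `loader` is the existing dataloader from the finetuning script (both names
# are assumptions about that script).
for step, batch in enumerate(loader, start=1):
    loss = engine(**batch).loss
    engine.backward(loss)
    engine.step()

    # The loss is already NaN at step update_interval + 1, so checking the live
    # model weights right after each step should show whether the optimizer
    # update at the interval boundary is what corrupts them.
    bad_params = [
        name for name, p in engine.module.named_parameters()
        if not torch.isfinite(p).all()
    ]
    if bad_params or not torch.isfinite(loss):
        print(f"step {step}: loss={loss.item()}, non-finite params={bad_params[:5]}")
        break
```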
