I am training a Llama model with ZenFlow using the fine-tuning script released in the DeepSpeedExamples repository. Interestingly, the loss values become NaN after the update_interval number of steps preconfigured in the DeepSpeed config. For example, with update_interval set to 4, the loss becomes NaN from the 5th step onward. The code used is available here: https://github.com/deepspeedai/DeepSpeedExamples/tree/master/training/DeepSpeed-ZenFlow/finetuning
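For context, the ZenFlow-related part of my DeepSpeed config follows the example's defaults; the sketch below (written as the equivalent Python dict, with key names beyond update_interval reconstructed from memory, so they may not match the example's ds_config exactly) just shows where update_interval sits:

```python
# Approximate shape of the relevant config section; update_interval is the
# only value I changed, the remaining keys are reconstructed from memory of
# the example's config and may be slightly off.
zero_optimization_section = {
    "stage": 2,
    "offload_optimizer": {"device": "cpu", "pin_memory": True},
    "zenflow": {
        "update_interval": 4,  # the loss turns NaN right after this many steps
        # other ZenFlow options are left at the example's defaults
    },
}
```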
My software configuration and versions:
torch==2.5.0+cu118
transformers==4.57.3
deepspeed: latest version, cloned from GitHub
datasets==4.4.1
The jobs run on DGX H200 GPUs with an AMD EPYC 7742 processor.
I would like to understand why the loss turns NaN and, more importantly, why it happens only after the update_interval-th step. Could I please get some help with this?
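In case it helps with diagnosis, this is the kind of check I can add just before the optimizer step to confirm whether the gradients are already non-finite when the accumulated update is applied (plain PyTorch, not ZenFlow-specific, and the function name is just my own):

```python
import torch

def log_nonfinite_grads(model: torch.nn.Module, step: int) -> None:
    # Report any parameter whose gradient contains inf/NaN; calling this right
    # before the optimizer step should show whether the blow-up originates in
    # the gradients or in the (offloaded) update itself.
    for name, param in model.named_parameters():
        if param.grad is not None and not torch.isfinite(param.grad).all():
            print(f"[step {step}] non-finite gradient in {name}")
```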