Description
I am trying to fine-tune a model similar to DeepSeek-V3, but with fewer parameters.
At the beginning the loss was about 0.5. Around iteration 15 the loss suddenly jumped from about 2 to 11. What could be the possible reasons?
NeMo version: NeMo-25.09
GPU device: H100
Warmup steps: 50
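For reference, the lr values printed in the log below are consistent with a plain linear warmup over the 50 warmup steps. Here is a minimal sketch (not the actual NeMo scheduler) that reproduces those values, assuming a peak learning rate of roughly 3e-5 inferred from the printed numbers; it only shows that the lr is still ramping up when the spike happens around step 15.

```python
# Minimal sketch of a linear warmup ramp.
# peak_lr is an assumption inferred from the logged lr values, not from the config.
warmup_steps = 50
peak_lr = 3e-5

def warmup_lr(step: int) -> float:
    """Linear warmup: lr grows proportionally with the step index."""
    return peak_lr * (step + 1) / (warmup_steps + 1)

for step in range(20):
    print(f"step {step:2d}  lr {warmup_lr(step):.3e}")
# step  0  lr 5.882e-07   (matches the first logged lr)
# step 16  lr 1.000e-05   (matches iteration 16 in the log)
```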
=============logs:==========
0: Training epoch 0, iteration 0/4679 | lr: 5.882e-07 | global_batch_size: 48 | global_step: 0 | reduced_train_loss: 0.4985 | train_step_timing in s: 314.5
0: Training epoch 0, iteration 1/4679 | lr: 1.176e-06 | global_batch_size: 48 | global_step: 1 | reduced_train_loss: 0.507 | train_step_timing in s: 71.29 | consumed_samples: 96
0: Training epoch 0, iteration 2/4679 | lr: 1.765e-06 | global_batch_size: 48 | global_step: 2 | reduced_train_loss: 0.4823 | train_step_timing in s: 68.68 | consumed_samples: 144
0: Training epoch 0, iteration 3/4679 | lr: 2.353e-06 | global_batch_size: 48 | global_step: 3 | reduced_train_loss: 0.4619 | train_step_timing in s: 73.19 | consumed_samples: 192
0: Training epoch 0, iteration 4/4679 | lr: 2.941e-06 | global_batch_size: 48 | global_step: 4 | reduced_train_loss: 0.4728 | train_step_timing in s: 73.02 | consumed_samples: 240
0: [NeMo I 2025-12-18 15:49:42 nemo_logging:393] Running garbage collection at train global_step: 5
0: Training epoch 0, iteration 5/4679 | lr: 3.529e-06 | global_batch_size: 48 | global_step: 5 | reduced_train_loss: 0.4576 | train_step_timing in s: 68.46 | consumed_samples: 288
0: Training epoch 0, iteration 6/4679 | lr: 4.118e-06 | global_batch_size: 48 | global_step: 6 | reduced_train_loss: 0.5541 | train_step_timing in s: 71.64 | consumed_samples: 336
0: Training epoch 0, iteration 7/4679 | lr: 4.706e-06 | global_batch_size: 48 | global_step: 7 | reduced_train_loss: 0.5393 | train_step_timing in s: 70.94 | consumed_samples: 384
0: Training epoch 0, iteration 8/4679 | lr: 5.294e-06 | global_batch_size: 48 | global_step: 8 | reduced_train_loss: 0.5548 | train_step_timing in s: 68.33 | consumed_samples: 432
0: Training epoch 0, iteration 9/4679 | lr: 5.882e-06 | global_batch_size: 48 | global_step: 9 | reduced_train_loss: 0.5739 | train_step_timing in s: 69.96 | consumed_samples: 480
0: [NeMo I 2025-12-18 15:55:32 nemo_logging:393] Running garbage collection at train global_step: 10
0: Training epoch 0, iteration 10/4679 | lr: 6.471e-06 | global_batch_size: 48 | global_step: 10 | reduced_train_loss: 0.9849 | train_step_timing in s: 73.56 | consumed_samples: 528
0: Training epoch 0, iteration 11/4679 | lr: 7.059e-06 | global_batch_size: 48 | global_step: 11 | reduced_train_loss: 1.289 | train_step_timing in s: 64.88 | consumed_samples: 576
0: Training epoch 0, iteration 12/4679 | lr: 7.647e-06 | global_batch_size: 48 | global_step: 12 | reduced_train_loss: 1.609 | train_step_timing in s: 66.76 | consumed_samples: 624
0: Training epoch 0, iteration 13/4679 | lr: 8.235e-06 | global_batch_size: 48 | global_step: 13 | reduced_train_loss: 1.95 | train_step_timing in s: 68.83 | consumed_samples: 672
0: Training epoch 0, iteration 14/4679 | lr: 8.824e-06 | global_batch_size: 48 | global_step: 14 | reduced_train_loss: 2.992 | train_step_timing in s: 67.16 | consumed_samples: 720
0: [NeMo I 2025-12-18 16:01:14 nemo_logging:393] Running garbage collection at train global_step: 15
0: Training epoch 0, iteration 15/4679 | lr: 9.412e-06 | global_batch_size: 48 | global_step: 15 | reduced_train_loss: 11.1 | train_step_timing in s: 70.31 | consumed_samples: 768
0: Training epoch 0, iteration 16/4679 | lr: 1e-05 | global_batch_size: 48 | global_step: 16 | reduced_train_loss: 9.352 | train_step_timing in s: 68.33 | consumed_samples: 816
0: Training epoch 0, iteration 17/4679 | lr: 1.059e-05 | global_batch_size: 48 | global_step: 17 | reduced_train_loss: 10.33 | train_step_timing in s: 64.93 | consumed_samples: 864
0: Training epoch 0, iteration 18/4679 | lr: 1.118e-05 | global_batch_size: 48 | global_step: 18 | reduced_train_loss: 9.795 | train_step_timing in s: 66.78 | consumed_samples: 912
0: Training epoch 0, iteration 19/4679 | lr: 1.176e-05 | global_batch_size: 48 | global_step: 19 | reduced_train_loss: 9.725 | train_step_timing in s: 65.86 | consumed_samples: 960
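To make the trend easier to see, here is a small sketch that parses log lines in the format shown above and prints iteration, lr, and reduced_train_loss side by side. The regular expression is written against this exact format and may need adjusting for other log variants; `train.log` is a hypothetical path to the captured log.

```python
import re

# Regex written against the log format shown above; adjust if the format differs.
PATTERN = re.compile(
    r"iteration (\d+)/\d+ \| lr: ([\d.e+-]+) \| .*?"
    r"reduced_train_loss: ([\d.]+)"
)

def parse_log(lines):
    """Yield (iteration, lr, loss) tuples from NeMo training log lines."""
    for line in lines:
        match = PATTERN.search(line)
        if match:
            step, lr, loss = match.groups()
            yield int(step), float(lr), float(loss)

with open("train.log") as f:  # hypothetical path to the saved training log
    for step, lr, loss in parse_log(f):
        print(f"iter {step:3d}  lr {lr:.3e}  loss {loss:6.3f}")
```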