Commit aee63f8

Remove incorrect gradient_accumulation_steps scaling in trainer (#23)
The training loss is normal now

[rank0]:[titan] 2025-04-05 23:51:49,581 - root - INFO - step: 2 token: 4,194,304.0 loss: 10.3805 memory: 45.47GiB(57.49%) tps: 13,964 tflops: 39.76 mfu: 4.02%
[rank0]:[titan] 2025-04-05 23:51:51,656 - root - INFO - step: 3 token: 6,291,456.0 loss: 10.3369 memory: 45.47GiB(57.49%) tps: 126,411 tflops: 359.91 mfu: 36.39%
[rank0]:[titan] 2025-04-05 23:51:53,703 - root - INFO - step: 4 token: 8,388,608.0 loss: 10.2823 memory: 45.47GiB(57.49%) tps: 128,063 tflops: 364.62 mfu: 36.87%
[rank0]:[titan] 2025-04-05 23:51:55,797 - root - INFO - step: 5 token: 10,485,760.0 loss: 10.2228 memory: 45.47GiB(57.49%) tps: 125,228 tflops: 356.55 mfu: 36.05%
[rank0]:[titan] 2025-04-05 23:51:57,863 - root - INFO - step: 6 token: 12,582,912.0 loss: 10.1647 memory: 45.47GiB(57.49%) tps: 126,922 tflops: 361.37 mfu: 36.54%
[rank0]:[titan] 2025-04-05 23:51:59,970 - root - INFO - step: 7 token: 14,680,064.0 loss: 10.1129 memory: 45.47GiB(57.49%) tps: 124,500 tflops: 354.48 mfu: 35.84%
[rank0]:[titan] 2025-04-05 23:52:02,089 - root - INFO - step: 8 token: 16,777,216.0 loss: 10.0557 memory: 45.47GiB(57.49%) tps: 123,729 tflops: 352.28 mfu: 35.62%
[rank0]:[titan] 2025-04-05 23:52:04,185 - root - INFO - step: 9 token: 18,874,368.0 loss: 10.0028 memory: 45.47GiB(57.49%) tps: 125,141 tflops: 356.30 mfu: 36.03%
[rank0]:[titan] 2025-04-05 23:52:06,295 - root - INFO - step: 10 token: 20,971,520.0 loss: 9.9596 memory: 45.47GiB(57.49%) tps: 124,282 tflops: 353.86 mfu: 35.78%
[rank0]:[titan] 2025-04-05 23:52:08,427 - root - INFO - step: 11 token: 23,068,672.0 loss: 9.9244 memory: 45.47GiB(57.49%) tps: 123,017 tflops: 350.25 mfu: 35.41%
[rank0]:[titan] 2025-04-05 23:52:10,542 - root - INFO - step: 12 token: 25,165,824.0 loss: 9.8943 memory: 45.47GiB(57.49%) tps: 123,983 tflops: 353.00 mfu: 35.69%
[rank0]:[titan] 2025-04-05 23:52:12,630 - root - INFO - step: 13 token: 27,262,976.0 loss: 9.8745 memory: 45.47GiB(57.49%) tps: 125,640 tflops: 357.72 mfu: 36.17%
[rank0]:[titan] 2025-04-05 23:52:14,722 - root - INFO - step: 14 token: 29,360,128.0 loss: 9.8488 memory: 45.47GiB(57.49%) tps: 125,349 tflops: 356.89 mfu: 36.09%
[rank0]:[titan] 2025-04-05 23:52:16,789 - root - INFO - step: 15 token: 31,457,280.0 loss: 9.8301 memory: 45.47GiB(57.49%) tps: 126,920 tflops: 361.36 mfu: 36.54%
[rank0]:[titan] 2025-04-05 23:52:18,885 - root - INFO - step: 16 token: 33,554,432.0 loss: 9.8149 memory: 45.47GiB(57.49%) tps: 125,090 tflops: 356.15 mfu: 36.01%
[rank0]:[titan] 2025-04-05 23:52:20,950 - root - INFO - step: 17 token: 35,651,584.0 loss: 9.7987 memory: 45.47GiB(57.49%) tps: 127,052 tflops: 361.74 mfu: 36.58%
[rank0]:[titan] 2025-04-05 23:52:23,019 - root - INFO - step: 18 token: 37,748,736.0 loss: 9.7819 memory: 45.47GiB(57.49%) tps: 126,738 tflops: 360.85 mfu: 36.49%
[rank0]:[titan] 2025-04-05 23:52:25,151 - root - INFO - step: 19 token: 39,845,888.0 loss: 9.7650 memory: 45.47GiB(57.49%) tps: 123,030 tflops: 350.29 mfu: 35.42%
[rank0]:[titan] 2025-04-05 23:52:27,262 - root - INFO - step: 20 token: 41,943,040.0 loss: 9.7428 memory: 45.47GiB(57.49%) tps: 124,196 tflops: 353.61 mfu: 35.75%
[rank0]:[titan] 2025-04-05 23:52:29,352 - root - INFO - step: 21 token: 44,040,192.0 loss: 9.7238 memory: 45.47GiB(57.49%) tps: 125,497 tflops: 357.31 mfu: 36.13%
[rank0]:[titan] 2025-04-05 23:52:31,468 - root - INFO - step: 22 token: 46,137,344.0 loss: 9.6967 memory: 45.47GiB(57.49%) tps: 123,924 tflops: 352.84 mfu: 35.68%
[rank0]:[titan] 2025-04-05 23:52:33,543 - root - INFO - step: 23 token: 48,234,496.0 loss: 9.6748 memory: 45.47GiB(57.49%) tps: 126,354 tflops: 359.75 mfu: 36.38%
[rank0]:[titan] 2025-04-05 23:52:35,641 - root - INFO - step: 24 token: 50,331,648.0 loss: 9.6474 memory: 45.47GiB(57.49%) tps: 125,026 tflops: 355.97 mfu: 35.99%
[rank0]:[titan] 2025-04-05 23:52:37,731 - root - INFO - step: 25 token: 52,428,800.0 loss: 9.6214 memory: 45.47GiB(57.49%) tps: 125,474 tflops: 357.25 mfu: 36.12%
[rank0]:[titan] 2025-04-05 23:52:39,844 - root - INFO - step: 26 token: 54,525,952.0 loss: 9.5926 memory: 45.47GiB(57.49%) tps: 124,087 tflops: 353.30 mfu: 35.72%
[rank0]:[titan] 2025-04-05 23:52:41,906 - root - INFO - step: 27 token: 56,623,104.0 loss: 9.5642 memory: 45.47GiB(57.49%) tps: 127,227 tflops: 362.24 mfu: 36.63%
[rank0]:[titan] 2025-04-05 23:52:43,973 - root - INFO - step: 28 token: 58,720,256.0 loss: 9.5307 memory: 45.47GiB(57.49%) tps: 126,886 tflops: 361.27 mfu: 36.53%
[rank0]:[titan] 2025-04-05 23:52:46,084 - root - INFO - step: 29 token: 60,817,408.0 loss: 9.5042 memory: 45.47GiB(57.49%) tps: 124,214 tflops: 353.66 mfu: 35.76%
[rank0]:[titan] 2025-04-05 23:52:48,168 - root - INFO - step: 30 token: 62,914,560.0 loss: 9.4670 memory: 45.47GiB(57.49%) tps: 125,806 tflops: 358.19 mfu: 36.22%
[rank0]:[titan] 2025-04-05 23:52:50,290 - root - INFO - step: 31 token: 65,011,712.0 loss: 9.4326 memory: 45.47GiB(57.49%) tps: 123,599 tflops: 351.91 mfu: 35.58%
[rank0]:[titan] 2025-04-05 23:52:52,399 - root - INFO - step: 32 token: 67,108,864.0 loss: 9.3994 memory: 45.47GiB(57.49%) tps: 124,360 tflops: 354.08 mfu: 35.80%
[rank0]:[titan] 2025-04-05 23:52:54,496 - root - INFO - step: 33 token: 69,206,016.0 loss: 9.3723 memory: 45.47GiB(57.49%) tps: 125,051 tflops: 356.04 mfu: 36.00%
[rank0]:[titan] 2025-04-05 23:52:56,644 - root - INFO - step: 34 token: 71,303,168.0 loss: 9.3265 memory: 45.47GiB(57.49%) tps: 122,133 tflops: 347.73 mfu: 35.16%
[rank0]:[titan] 2025-04-05 23:52:58,743 - root - INFO - step: 35 token: 73,400,320.0 loss: 9.2878 memory: 45.47GiB(57.49%) tps: 124,930 tflops: 355.70 mfu: 35.97%
[rank0]:[titan] 2025-04-05 23:53:00,837 - root - INFO - step: 36 token: 75,497,472.0 loss: 9.2547 memory: 45.47GiB(57.49%) tps: 125,233 tflops: 356.56 mfu: 36.05%
[rank0]:[titan] 2025-04-05 23:53:02,925 - root - INFO - step: 37 token: 77,594,624.0 loss: 9.2114 memory: 45.47GiB(57.49%) tps: 125,597 tflops: 357.60 mfu: 36.16%
[rank0]:[titan] 2025-04-05 23:53:05,027 - root - INFO - step: 38 token: 79,691,776.0 loss: 9.1741 memory: 45.47GiB(57.49%) tps: 124,765 tflops: 355.23 mfu: 35.92%
[rank0]:[titan] 2025-04-05 23:53:07,090 - root - INFO - step: 39 token: 81,788,928.0 loss: 9.1278 memory: 45.47GiB(57.49%) tps: 127,133 tflops: 361.97 mfu: 36.60%
[rank0]:[titan] 2025-04-05 23:53:09,163 - root - INFO - step: 40 token: 83,886,080.0 loss: 9.0843 memory: 45.47GiB(57.49%) tps: 126,523 tflops: 360.23 mfu: 36.42%
[rank0]:[titan] 2025-04-05 23:53:11,250 - root - INFO - step: 41 token: 85,983,232.0 loss: 9.0468 memory: 45.47GiB(57.49%) tps: 125,615 tflops: 357.65 mfu: 36.16%
[rank0]:[titan] 2025-04-05 23:53:13,367 - root - INFO - step: 42 token: 88,080,384.0 loss: 9.0057 memory: 45.47GiB(57.49%) tps: 123,906 tflops: 352.78 mfu: 35.67%
[rank0]:[titan] 2025-04-05 23:53:15,441 - root - INFO - step: 43 token: 90,177,536.0 loss: 8.9706 memory: 45.47GiB(57.49%) tps: 126,451 tflops: 360.03 mfu: 36.40%
[rank0]:[titan] 2025-04-05 23:53:17,537 - root - INFO - step: 44 token: 92,274,688.0 loss: 8.9185 memory: 45.47GiB(57.49%) tps: 125,151 tflops: 356.33 mfu: 36.03%
[rank0]:[titan] 2025-04-05 23:53:19,602 - root - INFO - step: 45 token: 94,371,840.0 loss: 8.8834 memory: 45.47GiB(57.49%) tps: 126,965 tflops: 361.49 mfu: 36.55%
[rank0]:[titan] 2025-04-05 23:53:21,651 - root - INFO - step: 46 token: 96,468,992.0 loss: 8.8337 memory: 45.47GiB(57.49%) tps: 128,034 tflops: 364.54 mfu: 36.86%
[rank0]:[titan] 2025-04-05 23:53:23,726 - root - INFO - step: 47 token: 98,566,144.0 loss: 8.7890 memory: 45.47GiB(57.49%) tps: 126,409 tflops: 359.91 mfu: 36.39%
[rank0]:[titan] 2025-04-05 23:53:25,824 - root - INFO - step: 48 token: 100,663,296.0 loss: 8.7448 memory: 45.47GiB(57.49%) tps: 125,001 tflops: 355.90 mfu: 35.99%
[rank0]:[titan] 2025-04-05 23:53:27,906 - root - INFO - step: 49 token: 102,760,448.0 loss: 8.6949 memory: 45.47GiB(57.49%) tps: 125,957 tflops: 358.62 mfu: 36.26%
[rank0]:[titan] 2025-04-05 23:53:28,097 - root - INFO - [GC] Peforming periodical GC collection. 0.19 seconds.
[rank0]:[titan] 2025-04-05 23:53:30,162 - root - INFO - step: 50 token: 104,857,600.0 loss: 8.6557 memory: 45.47GiB(57.49%) tps: 116,257 tflops: 331.00 mfu: 33.47%
[rank0]:[titan] 2025-04-05 23:53:32,235 - root - INFO - step: 51 token: 106,954,752.0 loss: 8.6059 memory: 45.47GiB(57.49%) tps: 126,504 tflops: 360.18 mfu: 36.42%
[rank0]:[titan] 2025-04-05 23:53:34,335 - root - INFO - step: 52 token: 109,051,904.0 loss: 8.5615 memory: 45.47GiB(57.49%) tps: 124,891 tflops: 355.59 mfu: 35.95%
[rank0]:[titan] 2025-04-05 23:53:36,414 - root - INFO - step: 53 token: 111,149,056.0 loss: 8.5215 memory: 45.47GiB(57.49%) tps: 126,139 tflops: 359.14 mfu: 36.31%
[rank0]:[titan] 2025-04-05 23:53:38,526 - root - INFO - step: 54 token: 113,246,208.0 loss: 8.4800 memory: 45.47GiB(57.49%) tps: 124,212 tflops: 353.66 mfu: 35.76%
[rank0]:[titan] 2025-04-05 23:53:40,630 - root - INFO - step: 55 token: 115,343,360.0 loss: 8.4325 memory: 45.47GiB(57.49%) tps: 124,676 tflops: 354.97 mfu: 35.89%
[rank0]:[titan] 2025-04-05 23:53:42,728 - root - INFO - step: 56 token: 117,440,512.0 loss: 8.3893 memory: 45.47GiB(57.49%) tps: 124,981 tflops: 355.84 mfu: 35.98%
[rank0]:[titan] 2025-04-05 23:53:44,838 - root - INFO - step: 57 token: 119,537,664.0 loss: 8.3429 memory: 45.47GiB(57.49%) tps: 124,269 tflops: 353.82 mfu: 35.78%
[rank0]:[titan] 2025-04-05 23:53:46,938 - root - INFO - step: 58 token: 121,634,816.0 loss: 8.2986 memory: 45.47GiB(57.49%) tps: 124,900 tflops: 355.61 mfu: 35.96%
[rank0]:[titan] 2025-04-05 23:53:49,038 - root - INFO - step: 59 token: 123,731,968.0 loss: 8.2578 memory: 45.47GiB(57.49%) tps: 124,923 tflops: 355.68 mfu: 35.96%
[rank0]:[titan] 2025-04-05 23:53:51,120 - root - INFO - step: 60 token: 125,829,120.0 loss: 8.2063 memory: 45.47GiB(57.49%) tps: 125,980 tflops: 358.69 mfu: 36.27%
[rank0]:[titan] 2025-04-05 23:53:53,208 - root - INFO - step: 61 token: 127,926,272.0 loss: 8.1683 memory: 45.47GiB(57.49%) tps: 125,587 tflops: 357.57 mfu: 36.15%
[rank0]:[titan] 2025-04-05 23:53:55,298 - root - INFO - step: 62 token: 130,023,424.0 loss: 8.1136 memory: 45.47GiB(57.49%) tps: 125,497 tflops: 357.31 mfu: 36.13%
[rank0]:[titan] 2025-04-05 23:53:57,365 - root - INFO - step: 63 token: 132,120,576.0 loss: 8.0831 memory: 45.47GiB(57.49%) tps: 126,886 tflops: 361.27 mfu: 36.53%
[rank0]:[titan] 2025-04-05 23:53:59,448 - root - INFO - step: 64 token: 134,217,728.0 loss: 8.0376 memory: 45.47GiB(57.49%) tps: 125,891 tflops: 358.43 mfu: 36.24%
[rank0]:[titan] 2025-04-05 23:54:01,576 - root - INFO - step: 65 token: 136,314,880.0 loss: 8.0242 memory: 45.47GiB(57.49%) tps: 123,241 tflops: 350.89 mfu: 35.48%
[rank0]:[titan] 2025-04-05 23:54:03,664 - root - INFO - step: 66 token: 138,412,032.0 loss: 7.9497 memory: 45.47GiB(57.49%) tps: 125,623 tflops: 357.67 mfu: 36.17%
[rank0]:[titan] 2025-04-05 23:54:05,784 - root - INFO - step: 67 token: 140,509,184.0 loss: 7.9354 memory: 45.47GiB(57.49%) tps: 123,675 tflops: 352.12 mfu: 35.60%
[rank0]:[titan] 2025-04-05 23:54:07,877 - root - INFO - step: 68 token: 142,606,336.0 loss: 7.8769 memory: 45.47GiB(57.49%) tps: 125,341 tflops: 356.87 mfu: 36.08%
[rank0]:[titan] 2025-04-05 23:54:09,982 - root - INFO - step: 69 token: 144,703,488.0 loss: 7.8338 memory: 45.47GiB(57.49%) tps: 124,581 tflops: 354.70 mfu: 35.87%
[rank0]:[titan] 2025-04-05 23:54:12,095 - root - INFO - step: 70 token: 146,800,640.0 loss: 7.7986 memory: 45.47GiB(57.49%) tps: 124,088 tflops: 353.30 mfu: 35.72%
[rank0]:[titan] 2025-04-05 23:54:14,195 - root - INFO - step: 71 token: 148,897,792.0 loss: 7.7634 memory: 45.47GiB(57.49%) tps: 124,851 tflops: 355.47 mfu: 35.94%
[rank0]:[titan] 2025-04-05 23:54:16,324 - root - INFO - step: 72 token: 150,994,944.0 loss: 7.7366 memory: 45.47GiB(57.49%) tps: 123,198 tflops: 350.77 mfu: 35.47%
[rank0]:[titan] 2025-04-05 23:54:18,411 - root - INFO - step: 73 token: 153,092,096.0 loss: 7.7049 memory: 45.47GiB(57.49%) tps: 125,646 tflops: 357.74 mfu: 36.17%
[rank0]:[titan] 2025-04-05 23:54:20,472 - root - INFO - step: 74 token: 155,189,248.0 loss: 7.6418 memory: 45.47GiB(57.49%) tps: 127,283 tflops: 362.40 mfu: 36.64%
[rank0]:[titan] 2025-04-05 23:54:22,550 - root - INFO - step: 75 token: 157,286,400.0 loss: 7.6175 memory: 45.47GiB(57.49%) tps: 126,214 tflops: 359.36 mfu: 36.34%
[rank0]:[titan] 2025-04-05 23:54:24,637 - root - INFO - step: 76 token: 159,383,552.0 loss: 7.5993 memory: 45.47GiB(57.49%) tps: 125,635 tflops: 357.71 mfu: 36.17%
[rank0]:[titan] 2025-04-05 23:54:26,708 - root - INFO - step: 77 token: 161,480,704.0 loss: 7.5597 memory: 45.47GiB(57.49%) tps: 126,671 tflops: 360.66 mfu: 36.47%
[rank0]:[titan] 2025-04-05 23:54:28,815 - root - INFO - step: 78 token: 163,577,856.0 loss: 7.5185 memory: 45.47GiB(57.49%) tps: 124,435 tflops: 354.29 mfu: 35.82%
[rank0]:[titan] 2025-04-05 23:54:30,942 - root - INFO - step: 79 token: 165,675,008.0 loss: 7.5014 memory: 45.47GiB(57.49%) tps: 123,304 tflops: 351.07 mfu: 35.50%
[rank0]:[titan] 2025-04-05 23:54:33,059 - root - INFO - step: 80 token: 167,772,160.0 loss: 7.4652 memory: 45.47GiB(57.49%) tps: 123,891 tflops: 352.74 mfu: 35.67%
[rank0]:[titan] 2025-04-05 23:54:35,174 - root - INFO - step: 81 token: 169,869,312.0 loss: 7.4508 memory: 45.47GiB(57.49%) tps: 124,056 tflops: 353.21 mfu: 35.71%
[rank0]:[titan] 2025-04-05 23:54:37,266 - root - INFO - step: 82 token: 171,966,464.0 loss: 7.4039 memory: 45.47GiB(57.49%) tps: 125,331 tflops: 356.84 mfu: 36.08%
[rank0]:[titan] 2025-04-05 23:54:39,330 - root - INFO - step: 83 token: 174,063,616.0 loss: 7.3982 memory: 45.47GiB(57.49%) tps: 127,112 tflops: 361.91 mfu: 36.59%
[rank0]:[titan] 2025-04-05 23:54:41,444 - root - INFO - step: 84 token: 176,160,768.0 loss: 7.3625 memory: 45.47GiB(57.49%) tps: 124,028 tflops: 353.13 mfu: 35.71%
[rank0]:[titan] 2025-04-05 23:54:43,541 - root - INFO - step: 85 token: 178,257,920.0 loss: 7.3299 memory: 45.47GiB(57.49%) tps: 125,058 tflops: 356.06 mfu: 36.00%
[rank0]:[titan] 2025-04-05 23:54:45,617 - root - INFO - step: 86 token: 180,355,072.0 loss: 7.2925 memory: 45.47GiB(57.49%) tps: 126,346 tflops: 359.73 mfu: 36.37%
[rank0]:[titan] 2025-04-05 23:54:47,701 - root - INFO - step: 87 token: 182,452,224.0 loss: 7.2773 memory: 45.47GiB(57.49%) tps: 125,853 tflops: 358.33 mfu: 36.23%
[rank0]:[titan] 2025-04-05 23:54:49,823 - root - INFO - step: 88 token: 184,549,376.0 loss: 7.2687 memory: 45.47GiB(57.49%) tps: 123,575 tflops: 351.84 mfu: 35.58%
[rank0]:[titan] 2025-04-05 23:54:51,891 - root - INFO - step: 89 token: 186,646,528.0 loss: 7.2467 memory: 45.47GiB(57.49%) tps: 126,867 tflops: 361.21 mfu: 36.52%
[rank0]:[titan] 2025-04-05 23:54:54,008 - root - INFO - step: 90 token: 188,743,680.0 loss: 7.2506 memory: 45.47GiB(57.49%) tps: 123,863 tflops: 352.66 mfu: 35.66%
[rank0]:[titan] 2025-04-05 23:54:56,142 - root - INFO - step: 91 token: 190,840,832.0 loss: 7.1907 memory: 45.47GiB(57.49%) tps: 122,908 tflops: 349.94 mfu: 35.38%
[rank0]:[titan] 2025-04-05 23:54:58,236 - root - INFO - step: 92 token: 192,937,984.0 loss: 7.1888 memory: 45.47GiB(57.49%) tps: 125,253 tflops: 356.62 mfu: 36.06%
[rank0]:[titan] 2025-04-05 23:55:00,301 - root - INFO - step: 93 token: 195,035,136.0 loss: 7.1362 memory: 45.47GiB(57.49%) tps: 126,993 tflops: 361.57 mfu: 36.56%
[rank0]:[titan] 2025-04-05 23:55:02,373 - root - INFO - step: 94 token: 197,132,288.0 loss: 7.1514 memory: 45.47GiB(57.49%) tps: 126,584 tflops: 360.41 mfu: 36.44%
[rank0]:[titan] 2025-04-05 23:55:04,503 - root - INFO - step: 95 token: 199,229,440.0 loss: 7.0960 memory: 45.47GiB(57.49%) tps: 123,150 tflops: 350.63 mfu: 35.45%
[rank0]:[titan] 2025-04-05 23:55:06,621 - root - INFO - step: 96 token: 201,326,592.0 loss: 7.0635 memory: 45.47GiB(57.49%) tps: 123,785 tflops: 352.44 mfu: 35.64%
[rank0]:[titan] 2025-04-05 23:55:08,705 - root - INFO - step: 97 token: 203,423,744.0 loss: 7.0486 memory: 45.47GiB(57.49%) tps: 125,889 tflops: 358.43 mfu: 36.24%
[rank0]:[titan] 2025-04-05 23:55:10,805 - root - INFO - step: 98 token: 205,520,896.0 loss: 7.0309 memory: 45.47GiB(57.49%) tps: 124,869 tflops: 355.52 mfu: 35.95%
[rank0]:[titan] 2025-04-05 23:55:12,914 - root - INFO - step: 99 token: 207,618,048.0 loss: 7.0032 memory: 45.47GiB(57.49%) tps: 124,338 tflops: 354.01 mfu: 35.80%
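
As a rough consistency check on the log above, the per-step token increment can be compared against the reported throughput. The sketch below uses only values copied from the step 3 and step 4 lines. Reading `tps` as a per-rank figure, and the resulting ratio as a count of data-parallel ranks, is an assumption; the commit does not state either.

```python
from datetime import datetime

# Values copied from the step 3 and step 4 log lines.
fmt = "%Y-%m-%d %H:%M:%S,%f"
t3 = datetime.strptime("2025-04-05 23:51:51,656", fmt)
t4 = datetime.strptime("2025-04-05 23:51:53,703", fmt)
tokens_step3 = 6_291_456
tokens_step4 = 8_388_608
tps_step4 = 128_063  # tokens per second reported at step 4

step_seconds = (t4 - t3).total_seconds()
tokens_per_step = tokens_step4 - tokens_step3
tokens_from_tps = tps_step4 * step_seconds

# Assumption: tps is measured per rank, so this ratio estimates how many
# ranks contribute to the logged token counter.
ratio = tokens_per_step / tokens_from_tps

print(f"step time: {step_seconds:.3f} s")
print(f"logged tokens per step: {tokens_per_step:,}")
print(f"tokens implied by tps: {tokens_from_tps:,.0f}")
print(f"ratio: {ratio:.2f}")
```

If the ratio comes out close to a whole number, the token counter and the throughput figure agree with each other under that per-rank reading, which is what one would hope for once the extra scaling is gone.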
1 parent 0de4470 commit aee63f8

1 file changed: +4 -8 lines changed

flame/train.py

Lines changed: 4 additions & 8 deletions
@@ -580,9 +580,7 @@ def main(job_config: JobConfig):
 input_ids, labels = batch["input_ids"], batch["labels"]
 
 # Update metrics processor state before forward/backward
-metric_logger.ntokens_since_last_log += (
-    labels.numel() * job_config.training.gradient_accumulation_steps
-)
+metric_logger.ntokens_since_last_log += labels.numel()
 metric_logger.data_loading_times.append(
     time.perf_counter() - data_load_start
 )
@@ -703,19 +701,17 @@ def main(job_config: JobConfig):
     # Use dist_mean/max on the accumulated loss for the step
     global_avg_loss, global_max_loss = (
         dist_utils.dist_mean(
-            loss * job_config.training.gradient_accumulation_steps,
+            loss,
             world_mesh["dp_cp"],
         ),
         dist_utils.dist_max(
-            loss * job_config.training.gradient_accumulation_steps,
+            loss,
             world_mesh["dp_cp"],
         ),
     )
 else:
     # Scale back the loss before logging
-    global_avg_loss = global_max_loss = (
-        loss.item() * job_config.training.gradient_accumulation_steps
-    )
+    global_avg_loss = global_max_loss = loss.item()
 
 # Update train state tokens and elapsed time
 time_now = time.perf_counter()
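
One way to see why the extra multiplication is wrong is a toy accumulation loop. This is a minimal sketch, not the code in `flame/train.py`. It assumes that every micro-batch passes through the token-counting line once, and that each micro-batch loss is divided by the number of accumulation steps before being summed; the diff shows neither. All names (`micro_batches`, `accum_steps`, `tokens_scaled`, and so on) and all values are hypothetical.

```python
# Toy model of one optimizer step with gradient accumulation.
# Hypothetical values; nothing here is taken from flame/train.py.
accum_steps = 4
micro_batches = [
    {"num_labels": 1024, "loss": 2.0},
    {"num_labels": 1024, "loss": 4.0},
    {"num_labels": 1024, "loss": 6.0},
    {"num_labels": 1024, "loss": 8.0},
]

tokens_correct = 0
tokens_scaled = 0
step_loss = 0.0
for mb in micro_batches:
    # Counting once per micro-batch already covers the whole step.
    tokens_correct += mb["num_labels"]
    # The removed scaling would count every micro-batch accum_steps times.
    tokens_scaled += mb["num_labels"] * accum_steps
    # Assumption: each micro-batch loss is pre-divided before accumulation,
    # so the running sum is already the mean over the step.
    step_loss += mb["loss"] / accum_steps

true_mean_loss = sum(mb["loss"] for mb in micro_batches) / len(micro_batches)
scaled_loss = step_loss * accum_steps  # what the removed code would log

print("tokens:", tokens_correct, "vs scaled:", tokens_scaled)
print("loss:", step_loss, "vs scaled:", scaled_loss, "true mean:", true_mean_loss)
```

Under these assumptions the scaled counter and the scaled loss are both inflated by a factor of `accum_steps`, while the unscaled values match the true per-step totals. That would be consistent with the commit title, though the actual accumulation logic lives outside the hunks shown here.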
