Commit 15ad92b
Fix ping-pong buffer index reset and remove redundant stream sync (deepspeedai#7805)
Fix deepspeedai#7804 and deepspeedai#7188
After investigating the code in
`deepspeed/runtime/zero/stage_1_and_2.py`, I have identified the root
cause. The communication-overlap regression was introduced in
PR deepspeedai#7371. While the additional two-stream synchronization
in that PR fixes gradient corruption, it effectively disables the
overlapping behavior.
The underlying issue causing the gradient corruption (which deepspeedai#7371
attempted to fix) was actually introduced in PR deepspeedai#6993. In that PR,
`bucket.clear()` incorrectly resets the ping-pong buffer index to 0 at
the end of `reduce_ipg_grads`. This reset disrupts the buffer index
swapping mechanism within
`reduce_independent_p_g_buckets_and_remove_grads`.
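The effect of the reset can be illustrated with a toy model. This is only a sketch, not the DeepSpeed implementation: `ToyBucket`, `fill_sequence`, and the `reset_index` flag are hypothetical names, and the order of "reduce and clear, then swap" is an assumption made for illustration.

```python
class ToyBucket:
    """Hypothetical stand-in for a gradient bucket backed by two ping-pong buffers."""

    def __init__(self):
        self.index = 0    # which of the two buffers is currently being filled
        self.params = []  # gradients staged for the pending reduction

    def clear(self, reset_index):
        self.params.clear()
        if reset_index:
            # Models the behavior described above: clearing also resets the index.
            self.index = 0


def fill_sequence(reset_index, flushes=4):
    """Return the buffer index that gets filled before each flush."""
    bucket = ToyBucket()
    filled = []
    for _ in range(flushes):
        filled.append(bucket.index)       # gradients are copied into this buffer
        bucket.clear(reset_index)         # end of the reduction step
        bucket.index = 1 - bucket.index   # swap so the next fill uses the other buffer
    return filled


without_reset = fill_sequence(reset_index=False)
with_reset = fill_sequence(reset_index=True)
```

In this model, `without_reset` alternates between the two buffers, while `with_reset` reuses the same buffer on consecutive fills. With asynchronous reduction, reusing a buffer that may still be in flight is the kind of pattern that can corrupt gradients.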
To fix this, this change removes L121 in
`deepspeed/runtime/zero/stage_1_and_2.py` so that the buffer index is no
longer reset. It also removes the stream synchronization logic introduced
in deepspeedai#7371, restoring the `overlap_comm=True` functionality.
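For context, the affected code path is the one enabled by a ZeRO stage 2 configuration with communication overlap turned on. The fragment below is a minimal sketch of such a config as a Python dict. The batch size is a placeholder, and pairing `overlap_comm` with `contiguous_gradients` reflects my understanding of when the ping-pong buffers are used, not something stated in this commit.

```python
# Hypothetical config fragment; values are placeholders, not tuned recommendations.
ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "zero_optimization": {
        "stage": 2,                    # this commit touches stage_1_and_2.py
        "overlap_comm": True,          # overlap gradient reduction with backward compute
        "contiguous_gradients": True,  # assumption: needed for the ping-pong buffer path
    },
}
```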
---------
Signed-off-by: szlent <metarufolds@gmail.com>
Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
Co-authored-by: Masahiro Tanaka <mtanaka@anyscale.com>

1 parent 3bc882f commit 15ad92b
1 file changed: 7 additions, 8 deletions

The diff bodies were not preserved; only the line numbers of each hunk survive.

| Hunk (original lines) | Removed (original line numbers) | Added (new line numbers) |
|---|---|---|
| 118-124 | 121 | none |
| 1052-1062 | 1055, 1057-1059 | 1054 |
| 1500-1507 | 1503-1504 | 1499-1503 |
| 1536-1542 | 1539 | 1538 |