Commit a5f0a80
fix: resolve DCP test hanging and distributed test errors (#330)
* fix(test): prevent dcp integration tests from hanging
In test_e2e_s3_file_system.py, multi_process_dcp_save_load() spawns
multiple processes to execute run(). A race condition occurred when
faster processes destroyed the shared process group in cleanup() while
slower processes were still running.
Isolated process group initialization with explicit TCP arguments, and
added a barrier synchronization point to ensure all processes complete
before cleanup occurs.
* ci: run DCP and distributed integration tests sequentially
Removed the -n auto flag from the DCP and distributed training tests in
the CI workflow, so these tests no longer run in parallel. This change
addresses stability issues in multi-process tests by preventing
potential race conditions.

1 parent bf8cb20 commit a5f0a80
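The process group fix described in the commit message can be sketched roughly as follows. This is an illustration under stated assumptions, not the code from this commit: the `gloo` backend, the loopback address, and the helpers `find_free_port`, `tcp_init_method`, and `launch` are hypothetical, and the real `run()` and `cleanup()` in test_e2e_s3_file_system.py may be structured differently.

```python
import socket


def find_free_port() -> int:
    """Ask the OS for an unused TCP port (hypothetical helper)."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.bind(("127.0.0.1", 0))
        return sock.getsockname()[1]


def tcp_init_method(port: int, host: str = "127.0.0.1") -> str:
    """Build an explicit TCP rendezvous URL for init_process_group."""
    return f"tcp://{host}:{port}"


def run(rank: int, world_size: int, port: int) -> None:
    """Worker body: init, do the work, wait for every rank, then clean up."""
    # Imported inside the worker so this module loads without PyTorch.
    import torch.distributed as dist

    # Explicit TCP arguments give each test run its own rendezvous
    # instead of relying on shared environment variables.
    dist.init_process_group(
        backend="gloo",  # assumption: CPU-only backend
        init_method=tcp_init_method(port),
        rank=rank,
        world_size=world_size,
    )
    try:
        # ... DCP save/load would happen here ...

        # No rank proceeds to cleanup until all ranks have arrived,
        # so a fast rank cannot destroy the group early.
        dist.barrier()
    finally:
        dist.destroy_process_group()


def launch(world_size: int = 2) -> None:
    """Spawn world_size workers; each is called as run(rank, world_size, port)."""
    import torch.multiprocessing as mp

    port = find_free_port()  # chosen once, shared by all ranks
    mp.spawn(run, args=(world_size, port), nprocs=world_size, join=True)
```

With PyTorch installed, calling `launch()` under an `if __name__ == "__main__":` guard should start two workers that only tear down the group after both reach the barrier.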
File tree
2 files changed, +19 -6 lines changed:
- .github/workflows
- s3torchconnector/tst/e2e/dcp