Commit a5f0a80

fix: resolve DCP test hanging and distributed test errors (#330)
* fix(test): prevent dcp integration tests from hanging

  In test_e2e_s3_file_system.py, multi_process_dcp_save_load() spawns multiple processes to execute run(), and a race condition occurs when faster processes destroy the shared process group in cleanup(). This change isolates process group initialization with explicit TCP arguments and adds a barrier synchronization point so that all processes complete before cleanup occurs.

* ci: run DCP and distributed integration tests sequentially

  Removed the -n auto flag from the DCP and distributed training tests in the CI workflow. This addresses stability issues in multi-process tests by preventing potential race conditions, which improves test reliability.
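
The race described in the first bullet can be illustrated with a small standard-library analogy. This is only a sketch and not the repository's code: `SharedGroup` and `run_workers` are hypothetical names, threads stand in for the spawned processes, and `threading.Barrier` stands in for `dist.barrier()`. In this sketch the failure shows up as an exception rather than a hang, but the ordering problem is the same: a fast worker tears down a shared resource while slower workers still need it.

```python
import threading
import time


class SharedGroup:
    """Hypothetical stand-in for a shared process group."""

    def __init__(self):
        self.alive = True

    def use(self):
        # Any use after teardown fails, mimicking work on a destroyed group.
        if not self.alive:
            raise RuntimeError("group already destroyed")

    def destroy(self):
        self.alive = False


def run_workers(world_size, use_barrier):
    """Run `world_size` workers and return a list of (rank, error) tuples."""
    group = SharedGroup()
    barrier = threading.Barrier(world_size)
    errors = []

    def worker(rank):
        try:
            time.sleep(0.05 * rank)  # higher ranks are deliberately slower
            group.use()
            if use_barrier:
                barrier.wait()  # nobody cleans up until everyone has finished
            group.destroy()
        except RuntimeError as exc:
            errors.append((rank, str(exc)))

    threads = [threading.Thread(target=worker, args=(r,)) for r in range(world_size)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return errors


if __name__ == "__main__":
    print("without barrier:", run_workers(3, use_barrier=False))
    print("with barrier:   ", run_workers(3, use_barrier=True))
```

Without the barrier, rank 0 destroys the group before the slower ranks have used it. With the barrier, every worker finishes its work before any of them begins cleanup.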
1 parent bf8cb20 commit a5f0a80

2 files changed: +19 -6 lines changed

.github/workflows/python-integration.yml

Lines changed: 9 additions & 2 deletions
@@ -88,7 +88,14 @@ jobs:
           CI_REGION=${{ matrix.test-run.region }} \
           CI_BUCKET=${{ matrix.test-run.bucket }} \
           CI_STORAGE_CLASS=${{ matrix.test-run.storage-class }} \
-          pytest s3torchconnector/tst/e2e --ignore-glob '*/**/test_e2e_s3_lightning_checkpoint.py' --ignore-glob '*/**/dcp' -n auto
+          pytest s3torchconnector/tst/e2e --ignore-glob '*/**/test_e2e_s3_lightning_checkpoint.py' --ignore-glob '*/**/dcp' --ignore-glob '*/**/test_distributed_training.py' -n auto
+
+      - name: s3torchconnector ${{ matrix.test-run.name }} distributed training integration tests
+        run: |
+          CI_REGION=${{ matrix.test-run.region }} \
+          CI_BUCKET=${{ matrix.test-run.bucket }} \
+          CI_STORAGE_CLASS=${{ matrix.test-run.storage-class }} \
+          pytest s3torchconnector/tst/e2e/test_distributed_training.py
 
       - name: Install Lightning dependency
         run: |
@@ -110,7 +117,7 @@ jobs:
           CI_REGION=${{ matrix.test-run.region }} \
           CI_BUCKET=${{ matrix.test-run.bucket }} \
           CI_STORAGE_CLASS=${{ matrix.test-run.storage-class }} \
-          pytest s3torchconnector/tst/e2e/dcp -n auto
+          pytest s3torchconnector/tst/e2e/dcp
 
       - name: s3torchconnectorclient ${{ matrix.test-run.name }} integration tests
         run: |

s3torchconnector/tst/e2e/dcp/test_e2e_s3_file_system.py

Lines changed: 10 additions & 4 deletions
@@ -24,13 +24,19 @@ def generate_random_port():
 
 
 def setup(rank, world_size, port):
-    os.environ["MASTER_ADDR"] = "localhost"
-    os.environ["MASTER_PORT"] = port
-    dist.init_process_group("gloo", rank=rank, world_size=world_size)
+    dist.init_process_group(
+        backend="gloo",
+        world_size=world_size,
+        rank=rank,
+        init_method=f"tcp://127.0.0.1:{port}",
+    )
 
 
 def cleanup():
-    dist.destroy_process_group()
+    # Synchronization point: Barrier ensures all process groups reach this point
+    dist.barrier()
+    if dist.is_initialized():
+        dist.destroy_process_group()
 
 
 def run(
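
The new setup() passes the rendezvous address through init_method instead of the MASTER_ADDR and MASTER_PORT environment variables. The body of generate_random_port() is not part of this diff, so the following is only a sketch of one common way to obtain a port and build such a URL. `find_free_port` and `tcp_init_method` are hypothetical helpers, not functions from the repository, and the sketch uses only the standard library.

```python
import socket


def find_free_port(host="127.0.0.1"):
    """Ask the OS for a currently unused TCP port (hypothetical helper)."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.bind((host, 0))  # port 0 lets the OS pick a free port
        return sock.getsockname()[1]


def tcp_init_method(port, host="127.0.0.1"):
    """Build a tcp:// rendezvous URL in the same shape the diff uses."""
    return f"tcp://{host}:{port}"


if __name__ == "__main__":
    port = find_free_port()
    print(tcp_init_method(port))
```

The port is released when the socket closes, so another program could claim it before the process group binds to it. For a test helper that is usually acceptable, but it is not a guarantee.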
