Refactors `process_group_tests.py` by allenwang28 · Pull Request #103 · meta-pytorch/torchft

allenwang28 · 2025-02-07T20:11:01Z

What does this PR do?

As part of #97, this PR refactors process_group_test:

- Renames _test_pg to run_collectives and extends it to accept a list of collectives by name.
- Breaks up ProcessGroupTest into three test classes: GlooTest, NCCLTests and DummyTests:
  - GlooTest runs every test using Gloo, NCCLTest using NCCL, etc.
  - This allows some niceties, like marking once that we want to skip all NCCL tests.
- Adds shutdown() and garbage collection to avoid extraneous messages and warnings like:

Traceback (most recent call last):
  File "/home/allencwang/workspace/torchft/torchft/process_group.py", line 824, in _future_handler
    cmd = future_queue.get(timeout=timedelta(seconds=10.0))
  File "/home/allencwang/workspace/torchft/torchft/multiprocessing.py", line 45, in get
    raise RuntimeError(f"process is not alive {self._p.exitcode}")
RuntimeError: process is not alive -15
[rank0]:[W207 10:50:12.933128109 CudaIPCTypes.cpp:16] Producer process has been terminated before all shared CUDA tensors released. See Note [Sharing CUDA tensors]

Why is this needed?

As part of #102, I noticed some mismatches between which collectives ran on which backends (matrix is here). Grouping the tests by backend lets us define explicitly which collectives should be tested on each.

d4l3k · 2025-02-07T20:38:31Z

torchft/process_group_test.py

        self.assertIs(wrapper.parent, pg)

-        works = _test_pg(wrapper)
+        works = run_collective(pg=wrapper, collective="allreduce")


this seems like a pretty big decrease in coverage?

Added functionality back

d4l3k · 2025-02-07T20:42:41Z

torchft/process_group_test.py

+    shape: torch.Size = example_tensor.shape
+    dtype: torch.dtype = example_tensor.dtype
+    coll = getattr(pg, collective)
+    args_list = _build_args(pg=pg, collective=collective, example_tensor=example_tensor)


What's the intention behind pulling this out? I'm not really convinced that this makes it all that much cleaner

In some ways I think I'd prefer if we got rid of the arg generation and instead flatten this out with direct calls i.e.

    if collective == "allreduce":
        work = pg.allreduce(...)
        work.wait()
    ...

I agree, I've removed the arg generation and included it in place for run_collective
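A minimal sketch of the flattened shape suggested in this thread, with one direct call per collective instead of generated argument lists. `FakeWork` and `FakeProcessGroup` are hypothetical stand-ins with simplified signatures, and plain Python lists take the place of tensors; none of this is the code from the PR.

```python
from typing import List


class FakeWork:
    """Hypothetical handle returned by a collective call."""

    def __init__(self) -> None:
        self.waited = False

    def wait(self) -> bool:
        self.waited = True
        return True


class FakeProcessGroup:
    """Hypothetical process group that only records which collectives ran."""

    def __init__(self) -> None:
        self.calls: List[str] = []

    def allreduce(self, tensors: List[List[float]]) -> FakeWork:
        self.calls.append("allreduce")
        return FakeWork()

    def allgather(
        self, outputs: List[List[List[float]]], inputs: List[List[float]]
    ) -> FakeWork:
        self.calls.append("allgather")
        return FakeWork()

    def broadcast(self, tensors: List[List[float]]) -> FakeWork:
        self.calls.append("broadcast")
        return FakeWork()


def run_collective(
    pg: FakeProcessGroup, collective: str, example: List[float]
) -> FakeWork:
    # One explicit branch per collective: the arguments are visible at the
    # call site rather than hidden behind an argument-building helper.
    if collective == "allreduce":
        work = pg.allreduce([list(example)])
    elif collective == "allgather":
        work = pg.allgather([[list(example)]], [list(example)])
    elif collective == "broadcast":
        work = pg.broadcast([list(example)])
    else:
        raise ValueError(f"unsupported collective: {collective}")
    work.wait()
    return work


def run_collectives(
    pg: FakeProcessGroup, collectives: List[str], example: List[float]
) -> List[FakeWork]:
    # Run each named collective in order on the same process group.
    return [run_collective(pg, name, example) for name in collectives]
```

The branches repeat some structure, but a reader can see exactly what each collective is called with, and an unknown name fails loudly.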

d4l3k · 2025-02-07T20:44:23Z

torchft/process_group_test.py

+        pg = ProcessGroupBabyNCCL(timeout=timedelta(seconds=10))
+        try:
+            pg.configure(self.store_addr, 0, 1)


This seems really slow -- how fast does this run? Launching the subprocess is pretty slow so would actually prefer to run these all on the same PG

If you want prettier printing we can use subtests?

i.e.

    for collective in collectives:
        with self.subTest(collective=collective):
            ...

Good callout: with parameterized it took ~36s; without it, ~16s. I've removed parameterized.

allenwang28 · 2025-02-10T21:33:11Z

updates in #102 are likely cleaner, so I am going to deprecate this PR

Refactors process_group_tests to run collectives parameterized

4c3b78e

facebook-github-bot added the CLA Signed label Feb 7, 2025

allenwang28 added 2 commits February 7, 2025 12:13

minor adjustment, move check_tensors out of the loop

5f73001

linter

0c7ac68

allenwang28 marked this pull request as ready for review February 7, 2025 20:28

d4l3k requested changes Feb 7, 2025


allenwang28 added 5 commits February 7, 2025 13:18

removes arg generation

c2840b4

add the ability to run a set of collectives

435449c

linters

ec8ae32

rename input_tensors to tensors_to_check

f3ed6a4

slight cleanup

c95fa35

allenwang28 mentioned this pull request Feb 10, 2025

Adds reduce_scatter into torchft #102

Merged

allenwang28 closed this Feb 10, 2025

This was referenced Feb 12, 2025

Adds more collectives to ProcessGroups #108

Merged

process_group_test - Enhance fault tolerance collective tests #109

Closed

allenwang28 wants to merge 8 commits into meta-pytorch:main from allenwang28:pg_test_refactor


3 participants
