Adds `reduce_scatter` into `torchft` by allenwang28 · Pull Request #102 · meta-pytorch/torchft

allenwang28 · 2025-02-06T21:55:19Z

What does this PR do?

Partially addresses #97 by adding reduce_scatter into torchft.

Concretely, this consists of a few pieces:

- Introduces reduce_scatter into ProcessGroup, following the signature [here](https://github.com/pytorch/pytorch/blob/11f69808c64a65c68a4452250ba7719dcff27c78/torch/csrc/distributed/c10d/PyProcessGroup.hpp#L203).
- In ProcessGroup*, follows the behavior of the other collectives:
  - In ProcessGroupWrapper, it depends on the parent implementation
  - In ProcessGroupDummy, it writes from the first input into output
  - In ProcessGroupBaby, it asserts inputs and moves underlying storage into shared memory
- Adds ReduceScatterOptions to _PickleSafeOptions.
- Introduces reduce_scatter as an option in _test_pg. This required a new function, _should_run_collective, because some backends (e.g. Gloo) do not support reduce_scatter. The function takes the collective, backend, and device and mirrors the logic of the published supported-collective matrix.
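For readers less familiar with the collective, here is a minimal pure-Python sketch of what reduce_scatter computes with a sum reduction, together with the single-process shortcut described above for ProcessGroupDummy. It uses plain lists rather than tensors, and `simulate_reduce_scatter` and `dummy_reduce_scatter` are hypothetical names for illustration only, not torchft or PyTorch APIs.

```python
from typing import List


def simulate_reduce_scatter(inputs_per_rank: List[List[List[float]]]) -> List[List[float]]:
    """Simulate a sum reduce_scatter across all ranks in one process.

    inputs_per_rank[r][k] is the chunk that rank r contributes toward rank k.
    The result's entry k is the elementwise sum of chunk k over all ranks,
    which is what rank k would receive.
    """
    world_size = len(inputs_per_rank)
    outputs = []
    for k in range(world_size):
        chunks = [inputs_per_rank[r][k] for r in range(world_size)]
        outputs.append([sum(vals) for vals in zip(*chunks)])
    return outputs


def dummy_reduce_scatter(output: List[float], input_chunks: List[List[float]]) -> None:
    """Single-process stand-in: copy the first input chunk into the output."""
    output[:] = input_chunks[0]


# Two simulated ranks, each contributing one chunk per destination rank.
inputs = [
    [[1.0, 2.0], [3.0, 4.0]],      # rank 0's chunks for rank 0 and rank 1
    [[10.0, 20.0], [30.0, 40.0]],  # rank 1's chunks for rank 0 and rank 1
]
result = simulate_reduce_scatter(inputs)
```

In this sketch `result[0]` holds the reduced chunk destined for rank 0 and `result[1]` the one for rank 1. The dummy variant skips the reduction entirely, which is enough for tests that only check shapes and plumbing.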

Tests

Presubmits, and:

$ pytest torchft/process_group_test.py 
============================================= test session starts =============================================
platform linux -- Python 3.10.16, pytest-8.3.4, pluggy-1.5.0
rootdir: /home/allencwang/workspace/torchft
configfile: pytest.ini
plugins: typeguard-2.13.3
collected 16 items                                                                                            

torchft/process_group_test.py ................                                                          [100%]

============================================= 16 passed in 31.44s =============================================
[rank0]:[W206 14:54:24.777939032 CudaIPCTypes.cpp:16] Producer process has been terminated before all shared CUDA tensors released. See Note [Sharing CUDA tensors]

Next steps

The logic of _should_run_collective is a bit confusing, as it allows "non defined backends" like ErrorSwallowing* through, to mimic the old behavior before this change. Testing here could become a bit unwieldy as we add more collectives and so a future step could be to refactor the testing.
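To make the pass-through behavior concrete, here is a table-driven sketch of a support check. It is not the PR's `_should_run_collective`: the function name, the `KNOWN_BACKENDS` set, and the table layout are assumptions, and only the Gloo-on-CPU entries for reduce_scatter and all_to_all come from this discussion.

```python
# Collectives that a known backend is assumed NOT to support, keyed by
# (backend, device). Only the Gloo/CPU entry reflects this PR's discussion;
# a real table would need to follow the published support matrix.
UNSUPPORTED = {
    ("gloo", "cpu"): {"reduce_scatter", "all_to_all"},
}

# Backends the table claims to know about (an assumption for this sketch).
KNOWN_BACKENDS = {"gloo", "nccl"}


def should_run_collective(collective: str, backend: str, device: str) -> bool:
    """Return True if the test should attempt this collective."""
    backend = backend.lower()
    if backend not in KNOWN_BACKENDS:
        # Wrappers and other "non defined" backends are let through,
        # mimicking the behavior from before the check existed.
        return True
    return collective not in UNSUPPORTED.get((backend, device), set())
```

The pass-through branch is what makes the logic confusing: an unknown backend name is treated as supporting everything, so a typo in the backend string would silently run every collective.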

One nice change could be to parameterize tests by the collective. This will make potentially failing collectives more explicit and will reduce the time it takes to run individual tests. Likely can do this in the next PR.
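One way to parameterize by collective, sketched with only the standard library, is to generate one test method per collective. This is an illustration rather than what torchft does: `FakeGroup`, `CollectiveTest`, and the `COLLECTIVES` list are hypothetical, and a decorator-based parametrization from a test library would serve the same purpose.

```python
import unittest

# Hypothetical list of collective names to exercise.
COLLECTIVES = ["allreduce", "allgather", "broadcast", "reduce_scatter"]


class FakeGroup:
    """Hypothetical stand-in for a process group; each collective echoes its name."""

    def __getattr__(self, name):
        if name in COLLECTIVES:
            return lambda *args: name
        raise AttributeError(name)


class CollectiveTest(unittest.TestCase):
    """Test methods are attached below, one per collective."""


def _make_test(coll_str):
    def test(self):
        pg = FakeGroup()
        self.assertEqual(getattr(pg, coll_str)(), coll_str)
    return test


# Generate test_allreduce, test_allgather, ... so that each collective
# passes or fails under its own name in the test report.
for _coll in COLLECTIVES:
    setattr(CollectiveTest, f"test_{_coll}", _make_test(_coll))
```

Each generated method sets up its own group, which is why this style can cost more wall-clock time when process group setup is expensive.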

d4l3k · 2025-02-07T00:35:46Z

torchft/process_group_test.py

+                return True
+            return False
+        else:  # cpu
+            if collective_str in ["reduce_scatter", "all_to_all"]:


oh wow -- didn't realize we don't support these on Gloo, good to know! cc @c-p-i-o

ye, we miss many APIs on Gloo.

this approach seems nice and explicit. but is it possible to instead just try: the test, and except: some specific NYI error? (i'm not sure if we raise a consistent type of NYI exception from backends?)

Yeah this is a good idea, I modified the block as follows:

    for coll_str, args in collectives:
        try:
            coll = getattr(pg, coll_str)
            work = coll(*args)
            works[coll_str] = work
            work.wait()
            fut = work.get_future()
            fut.wait()
            # Check that all tensor arguments have the expected shapes and dtypes
            check_tensors(args)
        except RuntimeError as e:
            if f"does not support {coll_str}" in str(e):
                # Skip collectives that are not supported by the backend.
                continue
            raise e
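The skip-on-unsupported pattern can be tried without a real backend. The sketch below assumes a backend that signals a missing collective with a RuntimeError whose message contains "does not support <name>". `FakeWork`, `FakeGlooLikeGroup`, and `run_collectives` are hypothetical names, and the argument shapes are simplified.

```python
class FakeWork:
    """Hypothetical handle returned by a collective call."""

    def wait(self):
        return True


class FakeGlooLikeGroup:
    """Hypothetical backend that lacks reduce_scatter."""

    def allreduce(self, tensors):
        return FakeWork()

    def reduce_scatter(self, outputs, inputs):
        raise RuntimeError("FakeGlooLikeGroup does not support reduce_scatter")


def run_collectives(pg, collectives):
    """Run each collective; skip ones the backend reports as unsupported."""
    ran, skipped = [], []
    for coll_str, args in collectives:
        try:
            work = getattr(pg, coll_str)(*args)
            work.wait()
            ran.append(coll_str)
        except RuntimeError as e:
            if f"does not support {coll_str}" in str(e):
                skipped.append(coll_str)
                continue
            raise  # any other RuntimeError is a genuine failure
    return ran, skipped


ran, skipped = run_collectives(
    FakeGlooLikeGroup(),
    [("allreduce", ([1.0],)), ("reduce_scatter", ([0.0], [[1.0]]))],
)
```

Matching on the message text keeps unrelated RuntimeErrors visible, at the cost of relying on each backend wording its error consistently.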

torchft/process_group_test.py

d4l3k

LGTM, thanks for adding this!

allenwang28 · 2025-02-10T21:29:56Z

Updated the test to be simpler, so I removed the utility functions I previously added. This should remove the need for a test refactor. I wanted to parameterize by collective, but #103 shows that tests got much slower after doing this. I will deprecate #103.

I have also added an explicit NotImplementedError for reduce_scatter within ProcessGroupBabyGloo, because otherwise the exception lives within the tx queue (see here). Noticed this as test_baby_gloo_apis would fail here, and that would be a nasty issue for a downstream user. Therefore, adding this explicitly in the API is ultimately cleaner.
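The difference between failing in the caller and failing inside the worker can be sketched as follows. This is a simplified illustration, not the ProcessGroupBabyGloo code: a plain list stands in for the tx queue, and both class names are hypothetical.

```python
class QueueBackedGroup:
    """Hypothetical group that forwards every call to a worker via a queue."""

    def __init__(self):
        # Stand-in for the tx queue of commands sent to the worker process.
        self.tx = []

    def _submit(self, name, *args):
        self.tx.append((name, args))

    def allreduce(self, tensors):
        self._submit("allreduce", tensors)

    def reduce_scatter(self, outputs, inputs):
        # Without an override, an unsupported call would only fail later,
        # inside the worker, and the error would travel back via the queue.
        self._submit("reduce_scatter", outputs, inputs)


class GlooLikeQueueBackedGroup(QueueBackedGroup):
    """Hypothetical Gloo-like variant that rejects reduce_scatter up front."""

    def reduce_scatter(self, outputs, inputs):
        # Fail in the caller's process, before anything is enqueued.
        raise NotImplementedError(
            "GlooLikeQueueBackedGroup does not support reduce_scatter."
        )
```

Since Python's NotImplementedError is a subclass of RuntimeError, an `except RuntimeError` handler in a caller would still catch it.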

d4l3k

LGTM

allenwang28 added 2 commits February 6, 2025 12:39

initial commit for reduce_scatter

d076a54

fixes reduce_scatter function signature, refactors test and adds reduce_scatter test

a425493

facebook-github-bot added the CLA Signed label Feb 6, 2025

fixes test

5190414

d4l3k reviewed Feb 7, 2025

allenwang28 marked this pull request as ready for review February 7, 2025 16:55

d4l3k approved these changes Feb 7, 2025

allenwang28 mentioned this pull request Feb 7, 2025

Refactors process_group_tests.py #103

Closed

allenwang28 added 3 commits February 10, 2025 13:15

adds explicit NotImplementedError to reduce_scatter in gloo, simplify the test suite

45fac86

Merge branch 'main' into collectives

afcb7f6

fix tests after merge

dc448ec

allenwang28 added 2 commits February 10, 2025 13:50

add explicit error for ProcessGroupGloo

f8d2ac5

notimplementederror->runtimeerror

7aaf7db

allenwang28 requested a review from wconstab February 10, 2025 22:05

d4l3k approved these changes Feb 10, 2025

wconstab approved these changes Feb 10, 2025

allenwang28 merged commit e55542a into meta-pytorch:main Feb 10, 2025
6 checks passed

allenwang28 deleted the collectives branch February 10, 2025 23:25

allenwang28 merged 8 commits into meta-pytorch:main from allenwang28:collectives
