Skip to content

[BC breaking] Add StackTensor support to hl.signal & hl.wait (as_ptrs) #261

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Aug 5, 2025

Conversation

joydddd
Copy link
Contributor

@joydddd joydddd commented Jul 10, 2025

@joydddd joydddd force-pushed the joydddd/stack/13 branch from dd98a26 to 4c4b3ae Compare July 10, 2025 18:52
@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Meta Open Source bot. label Jul 10, 2025
@joydddd joydddd force-pushed the joydddd/stack/13 branch 2 times, most recently from 4eb10c4 to dac9b8f Compare July 10, 2025 23:13
@joydddd joydddd changed the base branch from main to joydddd/stack/14 July 10, 2025 23:13
@joydddd joydddd changed the base branch from joydddd/stack/14 to main July 11, 2025 00:32
@joydddd joydddd changed the base branch from main to joydddd/stack/14 July 11, 2025 00:32
@joydddd joydddd changed the base branch from joydddd/stack/14 to main July 11, 2025 01:08
@joydddd joydddd force-pushed the joydddd/stack/13 branch 2 times, most recently from 053923a to 1a74563 Compare July 11, 2025 01:08
@joydddd joydddd changed the base branch from main to joydddd/stack/15 July 11, 2025 01:09
@joydddd joydddd force-pushed the joydddd/stack/15 branch from bc6ea11 to 5348847 Compare July 11, 2025 01:16
@joydddd joydddd changed the base branch from joydddd/stack/15 to main July 11, 2025 01:17
@joydddd joydddd force-pushed the joydddd/stack/13 branch from 1a74563 to a82b355 Compare July 11, 2025 01:17
@joydddd joydddd changed the base branch from main to joydddd/stack/15 July 11, 2025 01:17
@joydddd joydddd changed the base branch from joydddd/stack/15 to main July 11, 2025 02:19
@joydddd joydddd force-pushed the joydddd/stack/13 branch from a82b355 to 9f46f56 Compare July 11, 2025 02:19
@joydddd joydddd changed the base branch from main to joydddd/stack/15 July 11, 2025 02:19
@joydddd joydddd marked this pull request as ready for review July 11, 2025 02:19
@joydddd joydddd requested review from jansel, yf225 and drisspg July 11, 2025 02:19
)
N = signal_pad_ptrs.size(0)
for i in hl.grid(pad_shape):
offset = i * 4 # number of btypes of each pointer (torch.int32)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Calculating offset to pointers is kinda ugly here and it breaks the no-index calculation / pointer-handling abstraction. The tensor metadata (i.e. size, stride etc.) is lost when passed in as Tensor of data_ptrs, therefore offset calculation need to be explicit here for indirect barriers.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we pass the offsets into the function so we can take into account the size of the dtype?

@joydddd joydddd changed the base branch from joydddd/stack/15 to main July 14, 2025 18:18
@joydddd joydddd force-pushed the joydddd/stack/13 branch from 9f46f56 to e646848 Compare July 14, 2025 18:35
Comment on lines +252 to +253
signal_pad_list = [
torch.ones(4, device=DEVICE, dtype=torch.int32) for _ in range(4)
]
signal_pad_ptrs = torch.as_tensor(
[p.data_ptr() for p in signal_pad_list], device=DEVICE, dtype=torch.uint64
)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the motivation for multiple signal pads here?

Could we have a multi-dimensional pad instead like:

signal_pad_list = torch.ones([4, 4], device=DEVICE, dtype=torch.int32)

then make pass in an index like (multicast_tile, i)?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The motivation is to implementation cross rank sync i.e. each barrier lives on a difference card, and we signal our peers and then wait for signal from our peers to make sure all cards have reached this kernel.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See test/test_distributed in #245

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I guess the tensor of pointers is required in this case -- let's figure out a higher level abstraction to make this cleaner.

@joydddd joydddd force-pushed the joydddd/stack/13 branch from e646848 to ab85dfe Compare July 15, 2025 18:40
@joydddd joydddd changed the base branch from joydddd/stack/17 to main July 29, 2025 01:14
@joydddd joydddd changed the base branch from main to joydddd/stack/17 July 29, 2025 17:39
@joydddd joydddd changed the base branch from joydddd/stack/17 to main July 30, 2025 06:10
@joydddd joydddd changed the base branch from main to joydddd/stack/17 July 30, 2025 06:10
@joydddd joydddd changed the base branch from joydddd/stack/17 to main July 30, 2025 20:33
@joydddd joydddd changed the base branch from main to joydddd/stack/17 July 30, 2025 20:33
@joydddd joydddd changed the base branch from joydddd/stack/17 to main July 30, 2025 21:38
@joydddd joydddd changed the base branch from main to joydddd/stack/17 July 30, 2025 21:39
@joydddd joydddd changed the base branch from joydddd/stack/17 to main August 4, 2025 21:21
@joydddd joydddd changed the base branch from main to joydddd/stack/17 August 4, 2025 21:22
@joydddd joydddd changed the base branch from joydddd/stack/17 to main August 4, 2025 21:22
@joydddd joydddd changed the base branch from main to joydddd/stack/17 August 4, 2025 21:23
@joydddd joydddd changed the base branch from joydddd/stack/17 to main August 4, 2025 21:44
@joydddd joydddd changed the base branch from main to joydddd/stack/17 August 4, 2025 21:44
@joydddd joydddd changed the title [BC breaking] Add MulticastTensor support to hl.signal & hl.wait (as_ptrs) [BC breaking] Add StackTensor support to hl.signal & hl.wait (as_ptrs) Aug 5, 2025
joydddd added a commit that referenced this pull request Aug 5, 2025
…ptrs)

stack-info: PR: #261, branch: joydddd/stack/13
@joydddd joydddd changed the base branch from joydddd/stack/17 to main August 5, 2025 18:52
joydddd added a commit that referenced this pull request Aug 5, 2025
…ptrs)

stack-info: PR: #261, branch: joydddd/stack/13
@joydddd joydddd changed the title [BC breaking] Add StackTensor support to hl.signal & hl.wait (as_ptrs) [BC breaking] Add MulticastTensor support to hl.signal & hl.wait (as_ptrs) Aug 5, 2025
@joydddd joydddd changed the title [BC breaking] Add MulticastTensor support to hl.signal & hl.wait (as_ptrs) [BC breaking] Add StackTensor support to hl.signal & hl.wait (as_ptrs) Aug 5, 2025
@joydddd joydddd merged commit 0211a67 into main Aug 5, 2025
6 of 10 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CLA Signed This label is managed by the Meta Open Source bot.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants