
[Benchmark] Add all reduce benchmark #393

Open: joydddd wants to merge 1 commit into main from joydddd/stack/21

@joydddd joydddd commented Jul 29, 2025

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Meta Open Source bot. label Jul 29, 2025
joydddd added a commit that referenced this pull request Jul 29, 2025
stack-info: PR: #393, branch: joydddd/stack/21
@joydddd joydddd force-pushed the joydddd/stack/21 branch from 8a962c5 to 8c301ef Compare July 29, 2025 17:39
joydddd commented Jul 29, 2025

| shape | dtype | multimem | oneshot | twoshot | nccl | helion_oneshot | kraken_oneshot | Best Backend |
|---|---|---|---|---|---|---|---|---|
| (4093) | torch.bfloat16 | nan | nan | nan | 32.812 | 17.448 | 18.285 | helion_oneshot |
| (4096) | torch.bfloat16 | 12.399 | 16.412 | 19.269 | 22.764 | 14.873 | 14.254 | multimem |
| (5000) | torch.bfloat16 | 15.321 | 16.834 | 15.442 | 27.810 | 15.044 | 13.743 | kraken_oneshot |
| (8192) | torch.bfloat16 | 11.815 | 17.330 | 15.885 | 22.541 | 15.118 | 13.736 | multimem |
| (8193) | torch.bfloat16 | nan | nan | nan | 26.715 | 19.025 | 19.417 | helion_oneshot |
| (16k) | torch.bfloat16 | 12.031 | 17.499 | 15.525 | 24.240 | 15.658 | 13.852 | multimem |
| (16380) | torch.bfloat16 | 14.134 | 16.840 | 15.759 | 29.297 | 15.565 | 14.471 | multimem |
| (16387) | torch.bfloat16 | nan | nan | nan | 34.670 | 22.661 | 19.729 | kraken_oneshot |
| (32k) | torch.bfloat16 | 12.058 | 18.290 | 16.244 | 24.535 | 16.866 | 15.408 | multimem |
| (64k) | torch.bfloat16 | 12.663 | 20.521 | 18.591 | 24.497 | 19.169 | 17.372 | multimem |
| (128k) | torch.bfloat16 | 12.912 | 25.540 | 19.263 | 24.780 | 24.817 | 21.413 | multimem |
| (256k) | torch.bfloat16 | 13.796 | 30.951 | 18.982 | 25.564 | 33.128 | 29.261 | multimem |
| (512k) | torch.bfloat16 | 16.783 | 43.295 | 19.861 | 33.252 | 55.684 | 40.593 | multimem |
| (1m) | torch.bfloat16 | 20.311 | 66.030 | 23.859 | 55.766 | 103.020 | 62.850 | multimem |
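
For readers checking the Best Backend column: it appears to be the backend with the lowest time in each row, with nan entries skipped. The sketch below reproduces that selection for two rows copied from the table above. The helper name `best_backend` and the data layout are hypothetical and are not taken from the benchmark script in this PR.

```python
import math

# Column order as in the table above.
BACKENDS = ["multimem", "oneshot", "twoshot", "nccl",
            "helion_oneshot", "kraken_oneshot"]

# Timings copied from the (4093) and (4096) rows of the table above.
ROWS = {
    "(4093)": [math.nan, math.nan, math.nan, 32.812, 17.448, 18.285],
    "(4096)": [12.399, 16.412, 19.269, 22.764, 14.873, 14.254],
}


def best_backend(times):
    """Return the backend with the lowest time, skipping nan entries."""
    valid = [(t, name) for t, name in zip(times, BACKENDS) if not math.isnan(t)]
    return min(valid)[1]


for shape, times in ROWS.items():
    print(shape, best_backend(times))
```

Filtering nan explicitly matters because comparisons against nan are always false, so a plain `min` over the raw row would give an order-dependent result.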

Helion overhead compared to Kraken (Triton baseline)

  1. Helion launches an additional kernel to convert the pointer to the signal_pad_ptrs array into a Tensor.
    Kernel + launch overhead: 2-3 us.

  2. An additional for loop is needed to create multicast_tile. This can be solved by enabling [:] indexing with hl.wait / hl.signal.
    The performance overhead is minor.

  3. For (512k) and (1m), the optimal config uses a persistent kernel with partial SMs plus a prologue.
    Kraken config: NUM_SM = 24. Each SM performs the symmetric_memory sync outside of the virtual_pid for loop.

  4. Helion unrolls a tuple of tensors with the same shape, stride, and offset, and calculates the offset for each tensor separately.
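
To make the persistent-kernel point concrete, here is a minimal pure-Python sketch of how a fixed number of programs might each loop over several virtual pids. It assumes a strided assignment (program `pid` takes tiles `pid`, `pid + NUM_SM`, and so on), which is a common persistent-kernel pattern but is not confirmed to be what Kraken does, and it reads "pre-log" as a prologue that runs once per program before the loop. The function name `persistent_schedule` is hypothetical.

```python
NUM_SM = 24  # value quoted for the Kraken config


def persistent_schedule(num_tiles, num_sm=NUM_SM):
    """Map each persistent program to the virtual pids it would process.

    Assumed strided assignment: program `pid` handles
    pid, pid + num_sm, pid + 2 * num_sm, ... up to num_tiles.
    """
    schedule = {}
    for pid in range(num_sm):
        # Prologue: the cross-device sync would run here, once per program,
        # outside of the loop over virtual pids.
        schedule[pid] = list(range(pid, num_tiles, num_sm))
    return schedule


sched = persistent_schedule(num_tiles=100)
print(len(sched), sched[0])
```

With this layout the sync cost is paid `NUM_SM` times in total instead of once per tile, which is one plausible reason the persistent config wins at the largest shapes.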

@joydddd joydddd changed the base branch from joydddd/stack/18 to main July 30, 2025 06:10
@joydddd joydddd changed the base branch from main to joydddd/stack/18 July 30, 2025 06:10
joydddd commented Jul 30, 2025

Use a custom C++ from_blob module to convert signal_pad_ptrs_dev to a tensor.

Benchmark results for allreduce on 8 devices (time in us):

| shape | dtype | multimem | oneshot | twoshot | nccl | helion_oneshot | kraken_oneshot | Best Backend |
|---|---|---|---|---|---|---|---|---|
| (4093) | torch.bfloat16 | nan | nan | nan | 32.835 | 15.110 | 17.817 | helion_oneshot |
| (4096) | torch.bfloat16 | 12.070 | 16.807 | 15.716 | 28.733 | 12.797 | 13.692 | multimem |
| (5000) | torch.bfloat16 | 11.606 | 16.931 | 16.006 | 27.992 | 12.456 | 13.839 | multimem |
| (8192) | torch.bfloat16 | 11.983 | 16.828 | 15.531 | 22.860 | 12.526 | 14.152 | multimem |
| (8193) | torch.bfloat16 | nan | nan | nan | 26.676 | 16.445 | 19.056 | helion_oneshot |
| (16k) | torch.bfloat16 | 11.875 | 19.216 | 18.553 | 24.183 | 12.978 | 14.349 | multimem |
| (16380) | torch.bfloat16 | 12.047 | 17.193 | 15.615 | 29.370 | 12.844 | 14.334 | multimem |
| (16387) | torch.bfloat16 | nan | nan | nan | 29.815 | 17.536 | 20.474 | helion_oneshot |
| (32k) | torch.bfloat16 | 12.202 | 17.960 | 16.332 | 24.358 | 15.665 | 14.748 | multimem |
| (64k) | torch.bfloat16 | 12.552 | 20.044 | 18.470 | 24.726 | 16.546 | 17.160 | multimem |
| (128k) | torch.bfloat16 | 12.862 | 25.099 | 18.502 | 24.946 | 20.598 | 20.648 | multimem |
| (256k) | torch.bfloat16 | 14.038 | 30.213 | 19.259 | 25.010 | 30.090 | 29.323 | multimem |
| (512k) | torch.bfloat16 | 17.003 | 43.470 | 19.799 | 33.173 | 50.720 | 40.312 | multimem |
| (1m) | torch.bfloat16 | 20.393 | 65.007 | 23.570 | 57.914 | 91.041 | 63.734 | multimem |

The performance gap between Helion and Kraken now exists only for shapes >= 512k, where the optimal config uses a persistent kernel with partial SMs plus a prologue.
Kraken config: NUM_SM = 24. Each SM performs the symmetric_memory sync outside of the virtual_pid for loop.
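
To put a number on the remaining gap, the following sketch computes the Helion-to-Kraken time ratio for the two largest shapes, using values copied from the table above. The helper name `slowdown` is hypothetical and is not part of the benchmark script.

```python
# Times copied from the (512k) and (1m) rows of the table above.
TIMES_US = {
    "(512k)": {"helion_oneshot": 50.720, "kraken_oneshot": 40.312},
    "(1m)": {"helion_oneshot": 91.041, "kraken_oneshot": 63.734},
}


def slowdown(row):
    """Ratio of Helion time to Kraken time; values above 1 mean Helion is slower."""
    return row["helion_oneshot"] / row["kraken_oneshot"]


for shape, row in TIMES_US.items():
    print(f"{shape}: {slowdown(row):.2f}x")
```

On these numbers Helion is about 1.26x slower at (512k) and about 1.43x slower at (1m).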

@joydddd joydddd changed the base branch from joydddd/stack/18 to main July 30, 2025 20:33
@joydddd joydddd force-pushed the joydddd/stack/21 branch from 8c301ef to 331d20a Compare July 30, 2025 20:33
@joydddd joydddd changed the title Add all reduce benchmark [Benchmark] Add all reduce benchmark Jul 30, 2025
@joydddd joydddd changed the base branch from main to joydddd/stack/18 July 30, 2025 20:33
@joydddd joydddd changed the base branch from joydddd/stack/18 to main July 30, 2025 21:38
@joydddd joydddd force-pushed the joydddd/stack/21 branch from 331d20a to bcdadde Compare July 30, 2025 21:38
@joydddd joydddd changed the base branch from main to joydddd/stack/18 July 30, 2025 21:39
@joydddd joydddd marked this pull request as ready for review August 4, 2025 17:27
@joydddd joydddd requested review from jansel, yf225, drisspg and oulgen August 4, 2025 17:27
@joydddd joydddd changed the base branch from joydddd/stack/18 to main August 4, 2025 21:22
@joydddd joydddd changed the base branch from main to joydddd/stack/18 August 4, 2025 21:23
@joydddd joydddd changed the base branch from joydddd/stack/18 to main August 4, 2025 21:44
@joydddd joydddd changed the base branch from main to joydddd/stack/18 August 4, 2025 21:44
joydddd added a commit that referenced this pull request Aug 5, 2025
stack-info: PR: #393, branch: joydddd/stack/21
@joydddd joydddd changed the base branch from joydddd/stack/18 to main August 5, 2025 18:52
@joydddd joydddd changed the base branch from main to joydddd/stack/18 August 5, 2025 20:23
@joydddd joydddd changed the base branch from joydddd/stack/18 to main August 5, 2025 20:39
@joydddd joydddd changed the base branch from main to joydddd/stack/18 August 5, 2025 20:39
@joydddd joydddd changed the base branch from joydddd/stack/18 to main August 5, 2025 20:44
@joydddd joydddd changed the base branch from main to joydddd/stack/18 August 5, 2025 20:45
joydddd added a commit that referenced this pull request Aug 5, 2025
stack-info: PR: #393, branch: joydddd/stack/21
@joydddd joydddd changed the base branch from joydddd/stack/18 to main August 5, 2025 22:28
@joydddd joydddd changed the base branch from main to joydddd/stack/18 August 5, 2025 22:28
@joydddd joydddd changed the base branch from joydddd/stack/18 to main August 5, 2025 22:36
@joydddd joydddd changed the base branch from main to joydddd/stack/18 August 5, 2025 22:36
@joydddd joydddd changed the base branch from joydddd/stack/18 to main August 7, 2025 23:06
@joydddd joydddd changed the base branch from main to joydddd/stack/18 August 7, 2025 23:06
joydddd added a commit that referenced this pull request Aug 8, 2025
stack-info: PR: #393, branch: joydddd/stack/21
@joydddd joydddd changed the base branch from joydddd/stack/18 to main August 8, 2025 15:11
stack-info: PR: #393, branch: joydddd/stack/21