
[Benchmark] Add all gather matmul benchmark #400


Open: joydddd wants to merge 1 commit into base joydddd/stack/21

joydddd
Contributor

@joydddd joydddd commented Jul 30, 2025

Stacked PRs:


[Benchmark] Add all gather matmul benchmark

joydddd added a commit that referenced this pull request Jul 30, 2025
stack-info: PR: #400, branch: joydddd/stack/22
@joydddd joydddd force-pushed the joydddd/stack/22 branch from 0513a58 to d87a64a Compare July 30, 2025 06:10
@meta-cla meta-cla bot added the CLA Signed This label is managed by the Meta Open Source bot. label Jul 30, 2025
@joydddd
Contributor Author

joydddd commented Jul 30, 2025

| shape | dtype | nccl | torch_symm_mem | helion | kraken | Best Backend |
| --- | --- | --- | --- | --- | --- | --- |
| (256, 256, 256) | torch.bfloat16 | 41.156 | 431.319 | 119.549 | 111.359 | nccl |
| (384, 384, 384) | torch.bfloat16 | 42.737 | 424.313 | 137.685 | 105.681 | nccl |
| (512, 512, 512) | torch.bfloat16 | 51.284 | 952.367 | 147.106 | 180.390 | nccl |
| (640, 640, 640) | torch.bfloat16 | 54.416 | 946.200 | 124.764 | nan | nccl |
| (768, 768, 768) | torch.bfloat16 | 2373.010 | 467.570 | 127.340 | nan | helion |
| (896, 896, 896) | torch.bfloat16 | 79.772 | 450.538 | 230.704 | nan | nccl |
| (1024, 1024, 1024) | torch.bfloat16 | 100.879 | 628.553 | 144.646 | 161.596 | nccl |
| (1152, 1152, 1152) | torch.bfloat16 | 122.012 | 628.340 | 164.184 | nan | nccl |
| (1280, 1280, 1280) | torch.bfloat16 | 159.020 | 433.261 | 183.333 | nan | nccl |
| (1408, 1408, 1408) | torch.bfloat16 | 194.298 | 433.295 | 196.417 | nan | nccl |
| (1536, 1536, 1536) | torch.bfloat16 | 211.553 | 431.485 | 206.027 | 297.015 | helion |
| (1664, 1664, 1664) | torch.bfloat16 | 251.526 | 427.265 | 581.858 | nan | nccl |
| (1792, 1792, 1792) | torch.bfloat16 | 286.406 | 678.410 | 246.517 | nan | helion |
| (1920, 1920, 1920) | torch.bfloat16 | 341.697 | 974.870 | 264.127 | nan | helion |
| (2048, 2048, 2048) | torch.bfloat16 | 380.024 | 446.875 | 287.984 | 481.179 | helion |
| (2176, 2176, 2176) | torch.bfloat16 | 445.310 | 477.962 | 333.809 | nan | helion |
| (2304, 2304, 2304) | torch.bfloat16 | 496.317 | 457.464 | 363.813 | nan | helion |
| (2432, 2432, 2432) | torch.bfloat16 | 597.861 | 460.951 | 397.363 | nan | helion |
| (2560, 2560, 2560) | torch.bfloat16 | 655.093 | 489.344 | 430.963 | 804.186 | helion |
| (2688, 2688, 2688) | torch.bfloat16 | 775.004 | 1021.574 | 1146.624 | nan | nccl |
| (2816, 2816, 2816) | torch.bfloat16 | 839.021 | 562.788 | 691.636 | nan | torch_symm_mem |
| (2944, 2944, 2944) | torch.bfloat16 | 973.901 | 625.908 | 649.289 | nan | torch_symm_mem |
| (3072, 3072, 3072) | torch.bfloat16 | 1012.294 | 680.064 | 737.418 | 722.340 | torch_symm_mem |
| (3200, 3200, 3200) | torch.bfloat16 | 1458.743 | 1001.001 | 927.776 | nan | helion |
| (3328, 3328, 3328) | torch.bfloat16 | 1340.722 | 961.890 | 2509.657 | nan | torch_symm_mem |
| (3456, 3456, 3456) | torch.bfloat16 | 2240.075 | 2239.796 | 1171.304 | nan | helion |
| (3584, 3584, 3584) | torch.bfloat16 | 2181.338 | 2044.006 | 1729.832 | 1456.677 | kraken |
| (3712, 3712, 3712) | torch.bfloat16 | 4645.679 | 3967.047 | 2498.389 | nan | helion |
| (3840, 3840, 3840) | torch.bfloat16 | 1950.385 | 1464.106 | 1650.844 | nan | torch_symm_mem |
| (3968, 3968, 3968) | torch.bfloat16 | 2937.086 | 2874.260 | 1747.495 | nan | helion |
| (4096, 4096, 4096) | torch.bfloat16 | 2819.741 | 6313.106 | 1725.565 | 1867.402 | helion |
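
In every row above, the Best Backend entry matches the backend with the smallest non-nan measurement. The table states no units, so treating the numbers as lower-is-better timings is an inference from the data, not something the benchmark output says. A minimal sketch of that selection rule, with the hypothetical helper name `best_backend`:

```python
import math

def best_backend(measurements):
    """Pick the backend with the smallest finite value.

    `measurements` maps backend name -> measured value; nan marks a
    backend with no result for that shape. Lower-is-better is an
    assumption inferred from the table, not stated in the source.
    """
    finite = {name: v for name, v in measurements.items() if not math.isnan(v)}
    return min(finite, key=finite.get)

# Two rows copied from the table above.
row_768 = {"nccl": 2373.010, "torch_symm_mem": 467.570,
           "helion": 127.340, "kraken": float("nan")}
row_3584 = {"nccl": 2181.338, "torch_symm_mem": 2044.006,
            "helion": 1729.832, "kraken": 1456.677}

print(best_backend(row_768))   # helion
print(best_backend(row_3584))  # kraken
```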

@joydddd joydddd changed the base branch from joydddd/stack/21 to main July 30, 2025 20:33
joydddd added a commit that referenced this pull request Jul 30, 2025
stack-info: PR: #400, branch: joydddd/stack/22
@joydddd joydddd force-pushed the joydddd/stack/22 branch from d87a64a to a9da45b Compare July 30, 2025 20:33
@joydddd joydddd changed the base branch from main to joydddd/stack/21 July 30, 2025 20:33
@joydddd joydddd changed the base branch from joydddd/stack/21 to main July 30, 2025 21:38
joydddd added a commit that referenced this pull request Jul 30, 2025
stack-info: PR: #400, branch: joydddd/stack/22
@joydddd joydddd force-pushed the joydddd/stack/22 branch from a9da45b to e482622 Compare July 30, 2025 21:38
@joydddd joydddd changed the base branch from main to joydddd/stack/21 July 30, 2025 21:39
@joydddd
Contributor Author

joydddd commented Jul 30, 2025

Optimization implemented in Kraken but not supported in Helion:

(a, out) = ag_matmul(a_shared, b), where a = all_gather(a_shared) and out = a @ b.
For a tile of a that originates from the local a_shared, there is a potential time saving from reading it directly through a_shared and skipping the wait for the cudaMemcpy.
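
To make the semantics concrete, here is a pure-Python reference sketch in which ranks are simulated as a list of row shards. The helper names (`all_gather`, `matmul`, `ag_matmul_reference`, `local_rows`) are illustrative and are not the benchmark's API; the sketch only shows that the rows of the gathered `a` belonging to this rank are identical to the local shard, which is why they can be read without waiting for the copy.

```python
def matmul(x, y):
    """Naive dense matmul on lists of rows."""
    cols = list(zip(*y))
    return [[sum(p * q for p, q in zip(row, col)) for col in cols] for row in x]

def all_gather(shards):
    """Concatenate every rank's rows in rank order (simulated, no communication)."""
    return [row for shard in shards for row in shard]

def ag_matmul_reference(shards, b):
    """Reference result: a = all_gather(shards), out = a @ b."""
    a = all_gather(shards)
    return a, matmul(a, b)

def local_rows(rank, m_per_rank):
    """Row range of the gathered a that is backed by this rank's own shard."""
    return range(rank * m_per_rank, (rank + 1) * m_per_rank)

# Two simulated ranks, two rows each.
shards = [[[1, 2], [3, 4]], [[5, 6], [7, 8]]]
b = [[1, 0], [0, 1]]
a, out = ag_matmul_reference(shards, b)

# The rows owned by rank 1 are exactly its local shard.
rank, m_per_rank = 1, 2
local_view = [a[i] for i in local_rows(rank, m_per_rank)]
```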

Helion does not support conditionally computing the tile offset and conditionally selecting a different tensor_descriptor for tensor_descriptor.load, i.e.:

if xx:
    a_load_desc = a_shared_desc
    a_load_offset = a_shared_stride_0 * ....
else:
    a_load_desc = a_desc
    a_load_offset = a_stride_0 * ....
a_load_desc.load(a_load_offset)

The same access pattern can be implemented in Helion as:

if xx: 
    a_tile = a_shared[tile.index - RANK * M_per_RANK]
else: 
    a_tile = a[tile]

But this generates two separate tensor_descriptor loads, one in each branch, which breaks Triton's data prefetching.
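
The index arithmetic behind the two variants can be sketched on the host in plain Python. This is neither Helion nor Triton code: `select_source` and `load_tile` are hypothetical helpers, and the tile is assumed to be a contiguous range of rows. The point is that the branch only chooses a buffer and an offset, after which a single load is issued.

```python
def select_source(tile_start, tile_rows, rank, m_per_rank):
    """Return (source_name, row_offset) for one tile of rows of the gathered a.

    A tile that lies entirely inside this rank's row range is redirected to
    a_shared, with its offset rebased by rank * m_per_rank.
    """
    local_begin = rank * m_per_rank
    local_end = local_begin + m_per_rank
    if local_begin <= tile_start and tile_start + tile_rows <= local_end:
        return "a_shared", tile_start - local_begin
    return "a", tile_start

def load_tile(a, a_shared, tile_start, tile_rows, rank, m_per_rank):
    """Choose buffer and offset first, then perform one load."""
    source, offset = select_source(tile_start, tile_rows, rank, m_per_rank)
    buf = a_shared if source == "a_shared" else a
    return buf[offset:offset + tile_rows]

# Gathered a has 8 rows; rank 1 owns rows 4..7.
a = [[i] for i in range(8)]
rank, m_per_rank = 1, 4
a_shared = a[4:8]

local_tile = load_tile(a, a_shared, 4, 2, rank, m_per_rank)   # read via a_shared
remote_tile = load_tile(a, a_shared, 0, 2, rank, m_per_rank)  # read via a
```

Both paths return the same rows as slicing the gathered `a` directly; only the buffer being read differs.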

@joydddd joydddd marked this pull request as ready for review August 4, 2025 17:27
@joydddd joydddd requested review from yf225, jansel, oulgen and drisspg August 4, 2025 17:28
@joydddd joydddd changed the base branch from joydddd/stack/21 to main August 4, 2025 21:22
joydddd added a commit that referenced this pull request Aug 4, 2025
stack-info: PR: #400, branch: joydddd/stack/22
@joydddd joydddd changed the base branch from main to joydddd/stack/21 August 4, 2025 21:23
@joydddd joydddd changed the base branch from joydddd/stack/21 to main August 4, 2025 21:44
joydddd added a commit that referenced this pull request Aug 4, 2025
stack-info: PR: #400, branch: joydddd/stack/22
@joydddd joydddd changed the base branch from main to joydddd/stack/21 August 5, 2025 20:45
joydddd added a commit that referenced this pull request Aug 5, 2025
stack-info: PR: #400, branch: joydddd/stack/22
@joydddd joydddd changed the base branch from joydddd/stack/21 to main August 5, 2025 22:28
joydddd added a commit that referenced this pull request Aug 5, 2025
stack-info: PR: #400, branch: joydddd/stack/22
@joydddd joydddd changed the base branch from main to joydddd/stack/21 August 5, 2025 22:28
@joydddd joydddd changed the base branch from joydddd/stack/21 to main August 5, 2025 22:36
joydddd added a commit that referenced this pull request Aug 5, 2025
stack-info: PR: #400, branch: joydddd/stack/22
@joydddd joydddd changed the base branch from main to joydddd/stack/21 August 5, 2025 22:36
@jansel
Contributor

jansel commented Aug 6, 2025

If xx is a constexpr would that fix it too? Can it be in this case?

@joydddd joydddd changed the base branch from joydddd/stack/21 to main August 7, 2025 23:06
joydddd added a commit that referenced this pull request Aug 7, 2025
stack-info: PR: #400, branch: joydddd/stack/22
@joydddd joydddd changed the base branch from main to joydddd/stack/21 August 7, 2025 23:06
joydddd added a commit that referenced this pull request Aug 8, 2025
stack-info: PR: #400, branch: joydddd/stack/22
@joydddd joydddd changed the base branch from joydddd/stack/21 to main August 8, 2025 15:11
joydddd added a commit that referenced this pull request Aug 8, 2025
stack-info: PR: #400, branch: joydddd/stack/22
@joydddd joydddd changed the base branch from main to joydddd/stack/21 August 8, 2025 15:11
stack-info: PR: #400, branch: joydddd/stack/22
@joydddd joydddd changed the base branch from joydddd/stack/21 to main August 8, 2025 15:52
@joydddd joydddd changed the base branch from main to joydddd/stack/21 August 8, 2025 15:52
@joydddd
Contributor Author

joydddd commented Aug 8, 2025

> If xx is a constexpr would that fix it too? Can it be in this case?

Yep. If xx is a constexpr, the software pipelining should be fine.
In this case, however, xx depends on the loop iterator, and in most cases the loop is too large to unroll (via tl.static_range) :(
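
A small host-side sketch of why the predicate cannot be a single compile-time constant, assuming the loop in question walks row tiles of the gathered a (the thread does not say which loop it is; `local_predicate_per_iteration` and `block_m` are hypothetical names):

```python
def local_predicate_per_iteration(m_total, block_m, rank, m_per_rank):
    """Evaluate the 'tile is local' predicate for every loop iteration."""
    lo = rank * m_per_rank
    hi = lo + m_per_rank
    return [lo <= start and start + block_m <= hi
            for start in range(0, m_total, block_m)]

# 8 rows in total, tiles of 2 rows, rank 1 owns rows 4..7.
flags = local_predicate_per_iteration(8, 2, 1, 4)
print(flags)  # [False, False, True, True]

# The predicate takes both values across iterations, so it is only known
# per iteration, i.e. only after the loop has been fully unrolled.
varies_across_loop = len(set(flags)) > 1
```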
