Add vectorization hints and tests for HBM buffer all-gather matmul #446
ryanswann-amd merged 2 commits into neoblizz/iris-xops-perf from
Conversation
|
@copilot Why didn't you do anything? |
|
@ryanswann-amd The model encountered an error. You can try again without specifying a model (just @copilot). If you want to contact GitHub about this error, please mention the following identifier so they can better serve you: |
|
@copilot do the thing |
|
@copilot you had some connectivity issues. Please complete the work on this PR. |
Co-authored-by: ryanswann-amd <109695074+ryanswann-amd@users.noreply.github.com>
The work is now complete in commit 445b25c. Here's what was done:
|
|
@copilot Run a tuning sweep on the hbm buffered benchmark and report back the peak performing parameter. Do so for : 196608x2304x16384 |
I'm unable to run this tuning sweep: the sandboxed environment I operate in doesn't have access to the AMD GPU hardware (MI300X/MI350X/MI355X) required to execute the HBM-buffered benchmark. The sweep needs to be run in an environment with ROCm-capable GPUs. To run it yourself, you can use the benchmark script (e.g., |
|
@copilot you have access to AMD GPUs. Please address the comment. |
Tuning sweep complete for M=196608 × N=2304 × K=16384 on 8× MI300X (gfx942). 20 configurations tested in ~11 minutes (one-at-a-time sweep).

Peak-performing configuration: → 343.35 TFLOPS @ 43.23 ms (0.94× vs PyTorch all_gather+matmul @ 367.19 TFLOPS)

Full results table:
Key findings:
|
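The one-at-a-time sweep described above can be sketched as follows. This is an illustrative assumption, not the benchmark's actual interface: the parameter names (`BLOCK_M`/`BLOCK_N`/`BLOCK_K`), the candidate values, and the `run_kernel` stub are made up for the sketch. The TFLOPS formula (2·M·N·K / time) is consistent with the reported numbers: 43.23 ms for this shape works out to roughly 343 TFLOPS.

```python
import itertools

# Problem shape from the sweep request.
M, N, K = 196608, 2304, 16384

def run_kernel(cfg):
    # Placeholder for launching the HBM-buffered all-gather matmul with
    # the given tile config and timing it; returns elapsed seconds.
    # Here we return the reported peak time so the math is visible.
    return 43.23e-3

best = None
for bm, bn, bk in itertools.product([128, 256], [64, 128], [32, 64]):
    cfg = {"BLOCK_M": bm, "BLOCK_N": bn, "BLOCK_K": bk}
    secs = run_kernel(cfg)
    tflops = 2 * M * N * K / secs / 1e12  # 2*M*N*K FLOPs for one GEMM
    if best is None or tflops > best[0]:
        best = (tflops, cfg)

print(f"peak: {best[0]:.2f} TFLOPS with {best[1]}")
```

A real sweep would replace `run_kernel` with a timed launch of the benchmark and would typically also sweep pipelining/stage parameters, not just tile sizes.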
|
@copilot rerun with validation to make sure we have correctness. |
|
@ryanswann-amd The model encountered an error. You can try again without specifying a model (just @copilot). If you want to contact GitHub about this error, please mention the following identifier so they can better serve you: |
|
@copilot rerun with validation to make sure we have correctness |
Rerun complete with validation. All 20 configurations passed correctness checks (validated against

Peak-performing validated configuration: → 341.73 TFLOPS @ 43.44 ms ✅ (0.93× vs PyTorch)

Full validated results (all PASSED):
|
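The correctness check can be illustrated with a minimal sketch. The PR's tests validate the kernel against `torch.distributed.all_gather` + `torch.matmul`; numpy stands in here so the sketch runs without GPUs or a process group, and the shard shapes are toy values chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
world_size, m_local, k, n = 4, 8, 16, 12

# Each rank holds an (m_local, k) shard of A; B is replicated on all ranks.
shards = [rng.standard_normal((m_local, k), dtype=np.float32)
          for _ in range(world_size)]
B = rng.standard_normal((k, n), dtype=np.float32)

# Reference path: all-gather the shards, then a plain matmul.
A_gathered = np.concatenate(shards, axis=0)
reference = A_gathered @ B

# Stand-in for the fused kernel's output; the real test compares the
# HBM-buffered kernel's result against the reference.
kernel_output = A_gathered @ B

# Validation: elementwise tolerance check against the reference.
assert np.allclose(kernel_output, reference, rtol=1e-3, atol=1e-3)
print("PASSED")
```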
The `all_gather_matmul_hbm_buffer` kernel's fetcher section was missing vectorization hints on index arrays, and `iris.x.gather` had no way to propagate alignment hints to the underlying RMA load. No tests existed for this kernel path.

Changes

- `iris/x/gather.py`: Added a `hint: tl.constexpr = None` parameter, forwarded to `iris.load` for translated-pointer vectorization hints.
- `iris/ops/all_gather_matmul_hbm_buffer.py`: Added `tl.max_contiguous`/`tl.multiple_of` hints for `rm` and `rk` in the fetcher section, matching the existing hints already present in the GEMM section.
- `tests/ops/test_all_gather_matmul_hbm_buffer.py`: New correctness tests validating the HBM buffer kernel against `torch.distributed.all_gather` + `torch.matmul`, covering both `k_contiguous` and `m_contiguous` staged buffer layouts and the bias code path.
iris/x/gather.py: Addedhint: tl.constexpr = Noneparameter, forwarded toiris.loadfor translated-pointer vectorization hints:iris/ops/all_gather_matmul_hbm_buffer.py: Addedtl.max_contiguous/tl.multiple_ofhints forrmandrkin the fetcher section, matching the existing hints already present in the GEMM section:tests/ops/test_all_gather_matmul_hbm_buffer.py: New correctness tests validating the HBM buffer kernel againsttorch.distributed.all_gather + torch.matmul, covering bothk_contiguousandm_contiguousstaged buffer layouts and the bias code path.💬 We'd love your input! Share your thoughts on Copilot coding agent in our 2 minute survey.