Open
Labels
examples (Examples showcasing Iris APIs and usage), iris (Iris project issue), question (Further information is requested)
Description
Problem Description
Hello, I have some questions and doubts about examples/09_gemm_one_shot_all_reduce/matmul_wrapper.py
- The matmul wrapper returns c instead of c_global. However, judging from the kernel_gemm implementation, c_global should be the buffer that holds the final accumulated output. The relevant code in matmul_wrapper.py is:
  # starting from line 192
  matmul._call(
      a=a,
      b=b,
      c=c,
      c_global=c_global,  # c_global holds the acc output
      xxx
  )
  return c  # but here c is returned
- Compared with the various gemm-allscatter implementations, the current one-shot gemm-allreduce implementation looks more like a fused-sequential design, is that right? I tested it on another platform and found its performance slightly worse than a naive torch implementation. I can think of several possible factors:
  - Torch matmul is faster than Triton matmul for certain shapes.
  - What are the side effects of frequent atomic_cas calls during the signal wait? Could they cause heavy L1 instruction stalls or cache pollution?
  - The newly added reduce post-processing step might slow down the original GEMM.
So I wonder which of the following would be the better next step for improving performance, from your perspective:
  - Add workgroup specialization to dedicate separate CUs to compute and reduce.
  - Run an unfused gemm_allscatter (producer) overlapped with ring-reduce kernels (consumer). One issue with multi-stream kernels, though, is that the atomic-xxx wait loop in the consumer kernel might hurt the producer kernel's performance, because the two streams may compete for CUs.
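To make the workgroup specialization option concrete, here is a minimal host-side sketch in plain Python of how workgroup ids could be split into compute and reduce roles. It assumes a persistent-kernel style launch where each workgroup strides over tiles. The names split_workgroups and tiles_for are hypothetical and are not part of Iris or Triton, and the mapping of workgroups to CUs is decided by the hardware scheduler, so this only illustrates the partitioning idea.

```python
def split_workgroups(num_workgroups, num_reduce):
    """Partition workgroup ids into compute and reduce roles (hypothetical helper)."""
    if not 0 < num_reduce < num_workgroups:
        raise ValueError("need at least one workgroup in each role")
    num_compute = num_workgroups - num_reduce
    compute_ids = list(range(num_compute))
    reduce_ids = list(range(num_compute, num_workgroups))
    return compute_ids, reduce_ids


def tiles_for(pid, role_ids, total_tiles):
    """Tiles handled by workgroup `pid`, striding over tiles within its own role."""
    local = role_ids.index(pid)  # position of this workgroup inside its role
    return list(range(local, total_tiles, len(role_ids)))


# Illustrative numbers only: 8 workgroups, 2 of them reserved for the reduce role.
compute_ids, reduce_ids = split_workgroups(num_workgroups=8, num_reduce=2)
total_tiles = 16
schedule = {pid: ("compute", tiles_for(pid, compute_ids, total_tiles)) for pid in compute_ids}
schedule.update({pid: ("reduce", tiles_for(pid, reduce_ids, total_tiles)) for pid in reduce_ids})
```

In a real kernel the same decision would be a branch on the program id at the top of the kernel body. The ratio of compute to reduce workgroups would need tuning per shape, since reserving too many for reduce takes throughput away from the GEMM.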
I would really like to try implementing this. However, I can only try it on ROCm-compatible platforms, so I may need some help later with review and validation on AMD platforms.
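For the first question above (c versus c_global), the following toy model in plain Python shows why returning the wrong buffer would matter. It assumes, as the issue reads the kernel, that each rank writes its local partial product to c and that the reduced sum over all ranks lands in c_global. That reading is not verified against the actual kernel_gemm code, and simulate_one_shot_all_reduce is a hypothetical stand-in, not the Iris implementation.

```python
def matmul(a, b):
    """Plain-Python matrix multiply on nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def simulate_one_shot_all_reduce(a_shards, b_shards):
    """Toy model (assumption): rank r writes its partial product to c[r],
    then every rank sums all partials into its own c_global[r]."""
    c_per_rank = [matmul(a, b) for a, b in zip(a_shards, b_shards)]
    rows, cols = len(c_per_rank[0]), len(c_per_rank[0][0])
    reduced = [[sum(c[i][j] for c in c_per_rank) for j in range(cols)] for i in range(rows)]
    c_global_per_rank = [[row[:] for row in reduced] for _ in c_per_rank]
    return c_per_rank, c_global_per_rank


# The K dimension (4) is split evenly across 2 simulated ranks.
a_full = [[1, 2, 3, 4], [5, 6, 7, 8]]
b_full = [[1, 0], [0, 1], [1, 1], [2, 0]]
a_shards = [[row[:2] for row in a_full], [row[2:] for row in a_full]]
b_shards = [b_full[:2], b_full[2:]]

reference = matmul(a_full, b_full)
c, c_global = simulate_one_shot_all_reduce(a_shards, b_shards)

print("c_global matches reference:", c_global[0] == reference)
print("c matches reference:", c[0] == reference)
```

Under this assumption only c_global equals the full product, so a quick way to settle the question on real hardware would be to compare both returned buffers against a torch.matmul reference followed by an all-reduce.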
Operating System
Ubuntu 22.04
CPU
AMD Ryzen
GPU
AMD MI300
ROCm Version
6.3
ROCm Component
No response
Steps to Reproduce
No response
(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support
No response
Additional Information
No response