[Question]: 09_gemm_one_shot_all_reduce implementation #231

@hebais

Problem Description

Hello, I have some questions and doubts about examples/09_gemm_one_shot_all_reduce/matmul_wrapper.py

  1. The matmul wrapper returns c instead of c_global. However, judging from the kernel_gemm implementation, c_global should be the buffer that holds the final accumulated output. The code in question, from matmul_wrapper.py:
    # starting from line 192
    matmul._call(
        a=a,
        b=b,
        c=c,
        c_global=c_global,  # c_global holds the acc output
        xxx
    )
    return c  # but here c is returned
  2. Compared to the various gemm-allscatter implementations, the current one-shot gemm-allreduce implementation looks more like a fused-sequential design, right? I've tested it on another platform and found that it performs somewhat worse than the naive torch implementation. I am considering several possible factors:
    1. Torch matmul is faster than Triton matmul for certain shapes.
    2. What are the side effects of frequent atomic_cas calls (during the signal wait)? Do they cause heavy L1 instruction stalls or cache pollution?
    3. The newly added reduce post-processing step might hurt the original GEMM performance.
     So I wonder which of the following would be the better future direction for performance improvement, from your perspective:
    1. Add workgroup specialization to separate CUs for compute and reduce.
    2. Run an unfused gemm_allscatter (producer) overlapped with ring-reduce kernels (consumer). One issue I see with multi-stream kernels is that the atomic-xxx wait loop in the consumer kernel might hurt the producer kernel's performance, because the two streams may compete for CUs.
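
To make the signal-wait concern concrete, here is a minimal CPU-side sketch of the pattern I mean: a consumer spinning on a flag with compare-and-swap while a producer finishes its work. It uses Python threads purely as an analogy. The `Flag` class, `wait_for_signal`, and the timings are hypothetical names and values of my own, not Iris or Triton APIs, and the sketch says nothing about actual GPU stalls or cache behavior. It only shows how the poll count grows when the wait loop has no backoff.

```python
import threading
import time


class Flag:
    """Hypothetical stand-in for a device-side signal word with atomic ops."""

    def __init__(self):
        self._value = 0
        self._lock = threading.Lock()

    def atomic_cas(self, expected, desired):
        # Compare-and-swap: return the old value, swap only if it matched.
        with self._lock:
            old = self._value
            if old == expected:
                self._value = desired
            return old

    def atomic_store(self, value):
        with self._lock:
            self._value = value


def wait_for_signal(flag, backoff_s=0.0):
    """Spin until the flag is 1 and consume it (1 -> 0). Returns the poll count."""
    polls = 0
    while True:
        polls += 1
        if flag.atomic_cas(1, 0) == 1:
            return polls
        if backoff_s:
            time.sleep(backoff_s)  # back off between polls to reduce contention


def producer(flag, work_s):
    time.sleep(work_s)    # stands in for computing a GEMM tile
    flag.atomic_store(1)  # signal "tile ready"


def run(backoff_s):
    flag = Flag()
    worker = threading.Thread(target=producer, args=(flag, 0.05))
    worker.start()
    polls = wait_for_signal(flag, backoff_s)
    worker.join()
    return polls


tight = run(0.0)      # tight spin: polls as fast as possible
relaxed = run(0.005)  # same wait, sleeping 5 ms between polls
print("polls without backoff:", tight)
print("polls with backoff:", relaxed)
```

Every poll in the tight loop is an atomic operation on the shared flag, which is the traffic I suspect could interfere with a producer kernel running on the same device.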

I would really like to try implementing this, but I can only try it on ROCm-compatible platforms. So I would probably need some help with review and validation on AMD platforms later on.
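
For reference, here is a small pure-Python sketch of how I read the first question. It assumes each rank holds a shard of the K dimension, that c is the rank-local partial product, and that c_global is the buffer every rank adds its partial into. These assumptions come from my reading of kernel_gemm, and the function names below are hypothetical, not the example's actual code.

```python
def matmul(a, b):
    """Plain-Python matrix multiply for small lists of lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def one_shot_all_reduce_gemm(a_shards, b_shards):
    """Simulate a one-shot all-reduce GEMM with K split across ranks.

    Returns (c_local, c_global), each indexed by rank.
    """
    world = len(a_shards)
    m, n = len(a_shards[0]), len(b_shards[0][0])
    # Each rank computes a partial product from its K-shard (the "c" buffer).
    c_local = [matmul(a_shards[r], b_shards[r]) for r in range(world)]
    # Every rank owns a zero-initialized reduction buffer (the "c_global" buffer).
    c_global = [[[0] * n for _ in range(m)] for _ in range(world)]
    # One-shot: each rank adds its partial into every rank's c_global.
    for src in range(world):
        for dst in range(world):
            for i in range(m):
                for j in range(n):
                    c_global[dst][i][j] += c_local[src][i][j]
    return c_local, c_global


# A is 2x4 and B is 4x2, with K=4 split into two shards of 2.
a_shards = [[[1, 2], [5, 6]], [[3, 4], [7, 8]]]
b_shards = [[[1, 0], [0, 1]], [[1, 0], [0, 1]]]
c_local, c_global = one_shot_all_reduce_gemm(a_shards, b_shards)

print(c_local[0])   # [[1, 2], [5, 6]]    partial result only
print(c_global[0])  # [[4, 6], [12, 14]]  full A @ B
```

Under these assumptions, returning c gives the caller only the local partial, while c_global matches the full product on every rank. That is why the return value in the wrapper looks suspicious to me.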

Operating System

Ubuntu 22.04

CPU

AMD Ryzen

GPU

AMD MI300

ROCm Version

6.3

Labels: examples (Examples showcasing Iris APIs and usage), iris (Iris project issue), question (Further information is requested)