An example where gemm and all-scatter are independent #232
Merged
Contributor
Pull Request Overview
This PR introduces a new example demonstrating independent GEMM and all-scatter operations in a distributed setting. The implementation provides two algorithmic approaches: bulk synchronous all-scatter and ring-based all-reduce, with comprehensive benchmarking and validation capabilities for multi-GPU matrix multiplication scenarios.
- Adds distributed GEMM implementations with two communication strategies (bulk synchronous and ring-based)
- Provides a comprehensive benchmarking framework with timing, validation, and trace collection
- Integrates with Iris library for multi-GPU memory management and communication primitives
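For intuition, the two communication strategies can be modeled in plain Python. The sketch below is an illustration only, not the PR's Triton kernels: lists stand in for per-GPU tiles, and the function names `bulk_all_scatter` and `ring_all_reduce` are hypothetical.

```python
def bulk_all_scatter(local_results):
    """Bulk synchronous model: every rank publishes its full local result
    to all peers in one phase; after a barrier, each rank holds every
    rank's tile, ordered by source rank."""
    world_size = len(local_results)
    return [[local_results[src] for src in range(world_size)]
            for _dst in range(world_size)]


def ring_all_reduce(local_results):
    """Ring model: each rank repeatedly forwards its partial sum to the
    next rank and adds its own contribution; after world_size - 1 steps
    every rank holds the total."""
    world_size = len(local_results)
    acc = list(local_results)  # running partial sum held by each rank
    for _step in range(world_size - 1):
        # Each rank receives the neighbor's partial sum and adds its own tile.
        acc = [acc[(rank - 1) % world_size] + local_results[rank]
               for rank in range(world_size)]
    return acc
```

The bulk synchronous variant trades a single heavy communication phase for simple synchronization; the ring variant pipelines many small neighbor exchanges, which is why the PR describes its synchronization as more complex.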
Reviewed Changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| `matmul_wrapper.py` | PyTorch autograd wrapper for distributed GEMM kernel execution with debugging and timing support |
| `gemm_all_scatter_bulk_synchronous.py` | Triton kernels for persistent GEMM and bulk synchronous all-scatter communication |
| `gemm_all_reduce_ring_based.py` | Alternative implementation using ring-based all-reduce with more complex synchronization |
| `benchmark.py` | Comprehensive benchmarking script with distributed execution, validation, and performance measurement |
examples/20_gemm_all_scatter_independent/gemm_all_reduce_ring_based.py
…e 20 (#234)
Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: neoblizz <9790745+neoblizz@users.noreply.github.com>
Co-authored-by: mawad-amd <112003944+mawad-amd@users.noreply.github.com>
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
examples/20_gemm_all_scatter_independent/gemm_all_scatter_bulk_synchronous.py
mawad-amd approved these changes on Oct 14, 2025.
This pull request introduces two new files to the `examples/20_gemm_all_scatter_independent` directory, providing benchmarking and kernel implementations for distributed GEMM (General Matrix Multiply) using Triton and Iris. The changes add a full benchmarking script and a ring-based all-reduce GEMM kernel, enabling efficient multi-GPU matrix multiplication and communication. These additions support flexible configuration, validation, and performance measurement for distributed compute scenarios.

New benchmarking and execution script

`benchmark.py` supports distributed execution with PyTorch and command-line configuration for matrix dimensions, datatypes, block sizes, and benchmarking/validation modes. The script manages process spawning, distributed setup, memory allocation, kernel timing, validation, and performance logging.

New Triton kernel implementation for distributed GEMM

`gemm_all_reduce_ring_based.py` implements two Triton kernels: `persistent_gemm` for local matrix multiplication and `persistent_all_reduce` for ring-based distributed reduction and scatter of results. These kernels use advanced synchronization and communication primitives for efficient multi-GPU execution.

Integration with Iris and validation utilities
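A "persistent" kernel, as used here, launches a fixed pool of programs that each loop over many output tiles, rather than launching one program per tile. A rough pure-Python model of that tile-scheduling idea (the function name and striding scheme are illustrative, not necessarily the kernels' actual schedule):

```python
def persistent_tile_schedule(num_tiles, num_programs):
    """Assign each persistent program the list of output tiles it will
    process, striding through the tile space by the program count so the
    work is spread evenly across the fixed pool of programs."""
    return [list(range(pid, num_tiles, num_programs))
            for pid in range(num_programs)]

# e.g. 10 output tiles distributed across 4 persistent programs
schedule = persistent_tile_schedule(10, 4)
```

This style keeps occupancy stable and lets each program overlap compute on one tile with communication for another, which is what makes the independent GEMM and all-scatter phases in this example possible.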
Performance measurement and output
Support for flexible configuration and debugging
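The flexible command-line configuration described above (matrix dimensions, datatype, block sizes, validation/benchmark modes) might look roughly like the stdlib-only sketch below. All flag names and defaults here are assumptions for illustration; the actual `benchmark.py` interface may differ.

```python
import argparse


def parse_args(argv=None):
    """Parse benchmark configuration: GEMM shape, datatype, tile sizes,
    and which modes (validation, timing) to run."""
    p = argparse.ArgumentParser(description="Distributed GEMM benchmark (sketch)")
    p.add_argument("-m", type=int, default=4096, help="rows of A")
    p.add_argument("-n", type=int, default=4096, help="cols of B")
    p.add_argument("-k", type=int, default=4096, help="inner (reduction) dimension")
    p.add_argument("--datatype", choices=["fp16", "bf16", "fp32"], default="fp16",
                   help="element type for the GEMM operands")
    p.add_argument("--blk_m", type=int, default=128, help="tile rows per block")
    p.add_argument("--blk_n", type=int, default=128, help="tile cols per block")
    p.add_argument("--validate", action="store_true",
                   help="check results against a reference matmul")
    p.add_argument("--benchmark", action="store_true",
                   help="time the kernels and report throughput")
    return p.parse_args(argv)
```

A script structured this way can be driven per-rank after process spawning, with each spawned worker receiving the same parsed configuration.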