An example where gemm and all-scatter are independent #232

Merged: neoblizz merged 5 commits into main from muhosama/all-scatter-gemm-separatev2 on Oct 14, 2025
Conversation

@neoblizz (Member) commented:

This pull request introduces two new files to the examples/20_gemm_all_scatter_independent directory, providing benchmarking and kernel implementations for distributed GEMM (General Matrix Multiply) using Triton and Iris. The changes add a full benchmarking script and a ring-based all-reduce GEMM kernel, enabling efficient multi-GPU matrix multiplication and communication. These additions support flexible configuration, validation, and performance measurement for distributed compute scenarios.

New benchmarking and execution script

  • Added a comprehensive benchmarking script benchmark.py, supporting distributed execution with PyTorch, command-line configuration for matrix dimensions, datatypes, block sizes, and benchmarking/validation modes. The script manages process spawning, distributed setup, memory allocation, kernel timing, validation, and performance logging.
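The description above mentions command-line configuration for matrix dimensions, datatypes, and block sizes. As a rough sketch of what such an interface can look like, here is a minimal argparse setup; the flag names and defaults are illustrative guesses, not the actual options exposed by the PR's benchmark.py:

```python
# Hypothetical sketch of a benchmark CLI; actual flag names in the PR may differ.
import argparse

def build_parser():
    p = argparse.ArgumentParser(description="Distributed GEMM benchmark (sketch)")
    p.add_argument("-m", type=int, default=8192, help="rows of A and C")
    p.add_argument("-n", type=int, default=8192, help="cols of B and C")
    p.add_argument("-k", type=int, default=8192, help="inner (reduction) dimension")
    p.add_argument("--datatype", choices=["fp16", "bf16", "fp32"], default="fp16")
    p.add_argument("--BLK_M", type=int, default=128, help="tile rows per program")
    p.add_argument("--BLK_N", type=int, default=128, help="tile cols per program")
    p.add_argument("--BLK_K", type=int, default=64, help="tile depth per iteration")
    p.add_argument("--validate", action="store_true", help="check against a reference GEMM")
    p.add_argument("--benchmark", action="store_true", help="collect kernel timings")
    return p

# Example invocation with overridden dimensions and validation enabled:
args = build_parser().parse_args(["-m", "4096", "--validate"])
print(args.m, args.datatype, args.validate)
```

A real distributed run would additionally spawn one process per GPU and pass these parsed arguments through to each rank.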

New Triton kernel implementation for distributed GEMM

  • Added gemm_all_reduce_ring_based.py, implementing two Triton kernels: persistent_gemm for local matrix multiplication and persistent_all_reduce for ring-based distributed reduction and scatter of results. These kernels use advanced synchronization and communication primitives for efficient multi-GPU execution.
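To make the ring-based reduction concrete, the following pure-Python model simulates the communication schedule a ring reduce-scatter follows: each rank's partial result is split into world_size chunks, and in each of world_size - 1 steps every rank forwards one chunk to its neighbor, which accumulates it. Device buffers, Iris remote stores, and synchronization are replaced with plain list writes; all names here are illustrative, not taken from the PR's kernels:

```python
# Pure-Python model of a ring reduce-scatter schedule (names illustrative).
def ring_reduce_scatter(partials):
    """partials[r] is rank r's partial result, pre-split into world_size
    scalar chunks. After world_size - 1 steps, rank r holds the fully
    reduced chunk (r + 1) % world_size."""
    world = len(partials)
    # buffers[r][c] models rank r's on-device copy of chunk c.
    buffers = [list(p) for p in partials]
    for step in range(world - 1):
        # All sends in a step happen logically in parallel: snapshot first,
        # then apply, so no rank sees a value updated within the same step.
        sends = [(r, (r - step) % world, buffers[r][(r - step) % world])
                 for r in range(world)]
        for src, chunk, value in sends:
            dst = (src + 1) % world
            buffers[dst][chunk] += value  # neighbor accumulates the chunk
    # Each rank ends up owning one fully reduced chunk of the result.
    return [buffers[r][(r + 1) % world] for r in range(world)]

# 4 ranks, chunk value r*10 + c; every chunk c should reduce to 60 + 4*c.
out = ring_reduce_scatter([[r * 10 + c for c in range(4)] for r in range(4)])
```

The actual kernels operate on GEMM output tiles rather than scalars and overlap these steps with signaling primitives, but the rank-and-chunk arithmetic is the same.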

Integration with Iris and validation utilities

  • Both files integrate with the Iris library for device memory management, inter-GPU communication, and synchronization, and use shared utilities for validation and timestamping.

Performance measurement and output

  • The benchmarking script collects kernel timings, calculates TFLOPS, and outputs results and traces in JSON format for further analysis.
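The TFLOPS figure follows from the standard operation count for a dense GEMM: an M x N x K multiply performs 2*M*N*K floating-point operations (one multiply and one add per inner-product term). A minimal version of that bookkeeping, with the helper name being illustrative:

```python
# Standard GEMM throughput accounting (helper name is illustrative).
def gemm_tflops(m, n, k, elapsed_ms):
    """TFLOPS for a dense m x n x k GEMM completing in elapsed_ms milliseconds."""
    flops = 2.0 * m * n * k          # one multiply + one add per term
    seconds = elapsed_ms * 1e-3
    return flops / seconds / 1e12

# An 8192^3 GEMM finishing in 10 ms sustains roughly 110 TFLOPS.
rate = gemm_tflops(8192, 8192, 8192, 10.0)
```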

Support for flexible configuration and debugging

  • Both files support multiple datatypes, block sizes, debugging, and tracing options, allowing for extensive experimentation and profiling.

@github-actions bot added the in-progress (We are working on it) and iris (Iris project issue) labels on Oct 13, 2025
@neoblizz neoblizz marked this pull request as ready for review October 13, 2025 19:40
Copilot AI review requested due to automatic review settings October 13, 2025 19:40
Copilot AI (Contributor) left a comment:

Pull Request Overview

This PR introduces a new example demonstrating independent GEMM and all-scatter operations in a distributed setting. The implementation provides two algorithmic approaches: bulk synchronous all-scatter and ring-based all-reduce, with comprehensive benchmarking and validation capabilities for multi-GPU matrix multiplication scenarios.

  • Adds distributed GEMM implementations with two communication strategies (bulk synchronous and ring-based)
  • Provides a comprehensive benchmarking framework with timing, validation, and trace collection
  • Integrates with Iris library for multi-GPU memory management and communication primitives
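For contrast with the ring schedule, the bulk synchronous strategy can be modeled as two cleanly separated phases: every rank first finishes its local GEMM on its partition of the output, then, after a barrier, pushes its block to every peer so that all ranks hold the full result. The sketch below is a pure-Python stand-in (remote stores become list writes); the function name and data layout are assumptions, not the PR's actual code:

```python
# Pure-Python model of a bulk synchronous all-scatter phase (names assumed).
def bulk_synchronous_all_scatter(local_blocks):
    """local_blocks[r] is the output block rank r computed locally.
    Returns one full-size result buffer per rank, each holding every block."""
    world = len(local_blocks)
    # Every rank owns a full result buffer with one slot per source rank.
    results = [[None] * world for _ in range(world)]
    # Communication phase (after an implied barrier ending the compute phase):
    # rank `src` writes its block into slot `src` of every peer's buffer.
    for src in range(world):
        for dst in range(world):
            results[dst][src] = local_blocks[src]
    return results

blocks = ["block0", "block1", "block2"]
full = bulk_synchronous_all_scatter(blocks)
```

The trade-off versus the ring variant is simplicity (one barrier, independent writes) against the finer-grained overlap the ring's step-by-step synchronization permits.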

Reviewed Changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 1 comment.

Reviewed files:

  • matmul_wrapper.py: PyTorch autograd wrapper for distributed GEMM kernel execution with debugging and timing support
  • gemm_all_scatter_bulk_synchronous.py: Triton kernels for persistent GEMM and bulk synchronous all-scatter communication
  • gemm_all_reduce_ring_based.py: alternative implementation using ring-based all-reduce with more complex synchronization
  • benchmark.py: comprehensive benchmarking script with distributed execution, validation, and performance measurement

Copilot AI and others added 3 commits October 13, 2025 13:41
…e 20 (#234)

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: neoblizz <9790745+neoblizz@users.noreply.github.com>
Co-authored-by: mawad-amd <112003944+mawad-amd@users.noreply.github.com>
Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: neoblizz <9790745+neoblizz@users.noreply.github.com>
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
@neoblizz neoblizz requested a review from Copilot October 14, 2025 00:24
Copilot AI (Contributor) left a comment:

Pull Request Overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated 3 comments.

@neoblizz neoblizz merged commit cf56267 into main Oct 14, 2025
4 of 8 checks passed
@neoblizz neoblizz deleted the muhosama/all-scatter-gemm-separatev2 branch October 14, 2025 00:31
