
Hoist num_xcds query #205

Merged: mawad-amd merged 1 commit into main from muhaawad/hoist-num_xcds-query on Oct 9, 2025
Conversation

@mawad-amd (Collaborator) commented Oct 9, 2025

Motivation

Hoist the num_xcds query, which goes through the HIP APIs and a ROCm version query, out of the matmul loop.

Technical Details

This call sits on the critical path of the benchmark, and the value does not need to be re-queried on every iteration. With the query inside the loop, performance drops to tens of TFLOPs.
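A minimal sketch of the hoisting pattern, for illustration only. `query_num_xcds` and `run_matmul` are hypothetical stand-ins, not the Iris API; the counter just makes the number of queries visible.

```python
# Hypothetical stand-ins: neither function is part of Iris or HIP.
query_stats = {"calls": 0}


def query_num_xcds():
    """Pretend device query; counts how often it is invoked."""
    query_stats["calls"] += 1
    return 8  # assumed value, for illustration only


def run_matmul(num_xcds):
    """Placeholder for the kernel launch that consumes num_xcds."""
    return num_xcds


def benchmark_query_in_loop(iterations):
    # Before: the query runs once per iteration, on the timed path.
    for _ in range(iterations):
        run_matmul(query_num_xcds())


def benchmark_query_hoisted(iterations):
    # After: the query runs once, outside the timed loop.
    num_xcds = query_num_xcds()
    for _ in range(iterations):
        run_matmul(num_xcds)
```

With 126 iterations, the first variant issues 126 queries and the second issues one, which is why the cost disappears from the measured time.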

Test Plan

Test Result

Apptainer> python examples/10_gemm_all_scatter_wg_specialization/benchmark.py --benchmark -m 16384 -n 16384 -k 16384 --BLK_M 128 --BLK_N 128 --BLK_K 64 --gsize_m 6 --gemm_sms 256 --validate -r 8
[Iris] [1/8] Validating...
[Iris] [0/8] Validating...
[Iris] [7/8] Validating...
[Iris] [3/8] Validating...
[Iris] [4/8] Validating...
[Iris] [5/8] Validating...
[Iris] [6/8] Validating...
[Iris] [2/8] Validating...
[Iris] [6/8] Final C validation passed.
[Iris] [7/8] Final C validation passed.
[Iris] [3/8] Final C validation passed.
[Iris] [2/8] Final C validation passed.
[Iris] [1/8] Final C validation passed.
[Iris] [5/8] Final C validation passed.
[Iris] [0/8] Final C validation passed.
[Iris] [4/8] Final C validation passed.
[Iris] [0/8] Validating local C...
[Iris] [4/8] Validating local C...
[Iris] [6/8] Validating local C...
[Iris] [1/8] Validating local C...
[Iris] [2/8] Validating local C...
[Iris] [3/8] Validating local C...
[Iris] [7/8] Validating local C...
[Iris] [5/8] Validating local C...
[Iris] [0/8] Validation completed
[Iris] [0/8] Benchmarking...
[Iris] [4/8] Validation completed
[Iris] [6/8] Validation completed
[Iris] [1/8] Validation completed
[Iris] [3/8] Validation completed
[Iris] [2/8] Validation completed
[Iris] [4/8] Benchmarking...
[Iris] [7/8] Validation completed
[Iris] [6/8] Benchmarking...
[Iris] [2/8] Benchmarking...
[Iris] [5/8] Validation completed
[Iris] [1/8] Benchmarking...
[Iris] [3/8] Benchmarking...
[Iris] [7/8] Benchmarking...
[Iris] [5/8] Benchmarking...
[Iris] [6/8] tile matmul + all_scatter (total_tiles=2048): 4.050 ms  2171.865 tflops
[Iris] [0/8] tile matmul + all_scatter (total_tiles=2048): 4.057 ms  2168.361 tflops
[Iris] [7/8] tile matmul + all_scatter (total_tiles=2048): 4.047 ms  2173.719 tflops
[Iris] [1/8] tile matmul + all_scatter (total_tiles=2048): 4.054 ms  2169.950 tflops
[Iris] [3/8] tile matmul + all_scatter (total_tiles=2048): 4.052 ms  2170.794 tflops
[Iris] [4/8] tile matmul + all_scatter (total_tiles=2048): 4.049 ms  2172.236 tflops
[Iris] [5/8] tile matmul + all_scatter (total_tiles=2048): 4.047 ms  2173.229 tflops
[Iris] [2/8] tile matmul + all_scatter (total_tiles=2048): 4.050 ms  2171.925 tflops
{
    "world_size": 8,
    "m": 16384,
    "n": 2048,
    "k": 16384,
    "debug": false,
    "validate": true,
    "trace_tiles": false,
    "benchmark": true,
    "datatype": "fp16",
    "output_file": "log.json",
    "BLK_M": 128,
    "BLK_N": 128,
    "BLK_K": 64,
    "gsize_m": 6,
    "heap_size": 8589934592,
    "gemm_sms": 256,
    "num_sms": 304,
    "num_ranks": 8,
    "M": 16384,
    "N": 16384,
    "K": 16384,
    "success": true,
    "gemm_registers": null,
    "gemm_spills": null,
    "tflops": 2168.361009409875,
    "total_ms": 4.056562991142273,
    "gemm_ms": 3.5572790607573492,
    "gemm_experiments": 126
}
Apptainer> 
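The reported `tflops` value appears consistent with the usual GEMM convention of 2·M·N·K floating-point operations divided by `total_ms`, using the full N (16384) rather than the per-rank `n` (2048). The sketch below checks that arithmetic; the function name is hypothetical and the formula is inferred from the numbers in the log, not taken from the benchmark source.

```python
def gemm_tflops(m, n, k, elapsed_ms):
    """TFLOP/s for an m x n x k GEMM, assuming 2*m*n*k floating-point operations."""
    flops = 2 * m * n * k
    seconds = elapsed_ms * 1e-3
    return flops / seconds * 1e-12


# Values copied from the JSON summary above (M, N, K, total_ms).
reported_tflops = 2168.361009409875
computed_tflops = gemm_tflops(16384, 16384, 16384, 4.056562991142273)

# The two agree to within rounding, which supports the assumed formula.
assert abs(computed_tflops - reported_tflops) < 0.01
```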

Submission Checklist

Copilot AI review requested due to automatic review settings October 9, 2025 05:44
@mawad-amd mawad-amd requested review from BKP and neoblizz as code owners October 9, 2025 05:44

Copilot AI left a comment

Pull Request Overview

This PR optimizes performance by moving the num_xcds query outside of the matrix multiplication loop. The change hoists the iris.hip.get_num_xcc() call to class initialization time as a class variable, eliminating repeated HIP API calls during computation.

  • Converts repeated iris.hip.get_num_xcc() calls to a single class-level initialization
  • Updates all matmul wrapper implementations to use the cached value
  • Minor import formatting improvement in one file

Reviewed Changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated no comments.

| File | Description |
| --- | --- |
| examples/07_gemm_all_scatter/matmul_wrapper.py | Adds cached _num_xcds class variable and uses it instead of repeated API calls |
| examples/08_gemm_atomics_all_reduce/matmul_wrapper.py | Adds cached _num_xcds class variable and uses it instead of repeated API calls |
| examples/09_gemm_one_shot_all_reduce/matmul_wrapper.py | Adds cached _num_xcds class variable and uses it instead of repeated API calls |
| examples/10_gemm_all_scatter_wg_specialization/matmul_wrapper.py | Adds cached _num_xcds class variable, uses it instead of repeated API calls, and formats import statement |
| examples/11_gemm_all_scatter_producer_consumer/matmul_wrapper.py | Adds cached _num_xcds class variable and uses it instead of repeated API calls |
| examples/12_gemm_all_scatter_bulk_synchronous/matmul_wrapper.py | Adds cached _num_xcds class variable and uses it instead of repeated API calls |

@github-actions bot added the in-progress (We are working on it) and iris (Iris project issue) labels on Oct 9, 2025
@mawad-amd mawad-amd merged commit 5dbc2fc into main Oct 9, 2025
15 checks passed
@mawad-amd mawad-amd deleted the muhaawad/hoist-num_xcds-query branch October 9, 2025 06:36
neoblizz pushed a commit that referenced this pull request Oct 9, 2025
Xinyu-Kang pushed a commit that referenced this pull request Oct 14, 2025