Skip to content

manual schedule of a transpose in output cached smem#6008

Merged
github-actions[bot] merged 6 commits intomainfrom
llu/transpose_output_smem
Feb 26, 2026
Merged

manual schedule of a transpose in output cached smem#6008
github-actions[bot] merged 6 commits intomainfrom
llu/transpose_output_smem

Conversation

@liqiangxl
Copy link
Collaborator

This PR adds a manual scheduling test case demonstrating how to perform a transpose on cached output shared memory using a TMA store.
The transpose scheduler may choose to apply the transpose on either cached input or cached output, depending on the number of inputs and outputs. The guiding principle is to minimize the total number of required transposes, e.g. will do output transpose when there are more inputs than outputs.

@liqiangxl liqiangxl marked this pull request as ready for review February 24, 2026 17:27
@liqiangxl liqiangxl requested a review from rdspring1 February 24, 2026 17:27
@github-actions
Copy link

github-actions bot commented Feb 24, 2026

Review updated until commit 7bd7044

Auto-merge Status

✅ Internal CI is finished
✅ No failed checks
✅ PR is mergeable
ℹ️ PR mergeable_state: clean

Description

  • Adds new test case TransposeOutputSmem demonstrating manual transpose scheduling on output cached shared memory

  • Implements TMA store with output swizzle for transpose operations (loads rows, writes columns)

  • Includes detailed tiling, swizzling, and parallelization scheduling for optimal memory access

  • Provides performance comparison between TMA load vs non-TMA load approaches (852ms vs 814ms on GB200)

Changes walkthrough

Relevant files
Enhancement
test_transpose.cpp
Add output cached transpose scheduling test                           

tests/cpp/test_transpose.cpp

  • Add new test TransposeInputSmem replacing
    TransposeTMALoadOptionalStore
  • Add comprehensive TransposeOutputSmem test demonstrating output-cached
    transpose scheduling
  • Implement TMA store with output swizzle, tiling, and parallelization
    strategies
  • Include performance metrics and detailed scheduling comments
  • +156/-1 

    PR Reviewer Guide

    Here are some key observations to aid the review process:

    🧪 PR contains tests
    ⚡ No major issues detected

    @greptile-apps
    Copy link
    Contributor

    greptile-apps bot commented Feb 24, 2026

    Greptile Summary

    Adds TransposeOutputSmem test demonstrating manual scheduling of transpose operation on output cached shared memory with TMA store. The test complements the existing TransposeInputSmem test by showing the alternative approach where transpose happens at output rather than input.

    Key changes:

    • New test case manually schedules a 2D transpose [I0, I1][I1, I0] with TMA store on output
    • Thread scheduling: each thread loads multiple rows from input, writes as multiple columns to output
    • Uses 128-byte TMA swizzle with configurable tile sizes (tile_i0=32, tile_i1=64)
    • Test validates correctness on 16384×32768 float tensors

    Issues found:

    • Comment/code mismatch on line 1717 (says tile_i1 but checks tile_i0)

    Confidence Score: 4/5

    • Safe to merge - test-only change with minor documentation inconsistency
    • Test-only change that adds new manual scheduling demonstration. Found one comment/code mismatch that should be corrected but doesn't affect functionality. Issues from previous review threads remain unaddressed but were already flagged.
    • No files require special attention beyond addressing the comment on line 1717

    Important Files Changed

    Filename Overview
    tests/cpp/test_transpose.cpp Adds TransposeOutputSmem test demonstrating transpose on output cached smem with TMA store; minor comment inconsistency found

    Last reviewed commit: 7bd7044

    Copy link
    Contributor

    @greptile-apps greptile-apps bot left a comment

    Choose a reason for hiding this comment

    The reason will be displayed to describe this comment to others. Learn more.

    1 file reviewed, 3 comments

    Edit Code Review Agent Settings | Greptile

    ref_tv->split(-3, chunks_per_thread);
    // [BIDx, tile_i0/chunk/cpt, cpt, chunk, tile_i1]
    ref_tv->merge(-4, -1);
    // [BIDx, tile_i1/chunk/cpt * tile_i0, cpt, chunk]
    Copy link
    Contributor

    Choose a reason for hiding this comment

    The reason will be displayed to describe this comment to others. Learn more.

    comment appears to have incorrect merge operation description - merge result should be tile_i0/chunk/cpt * tile_i1 not tile_i1/chunk/cpt * tile_i0

    Suggested change
    // [BIDx, tile_i1/chunk/cpt * tile_i0, cpt, chunk]
    // [BIDx, tile_i0/chunk/cpt * tile_i1, cpt, chunk]

    // without tma load, 814 ms on GB200.
    TEST_F(TransposeTMA, TransposeOutputSmem) {
    NVFUSER_TEST_CUDA_ARCH_GUARD(9, 0);
    const bool use_tma_load = false;
    Copy link
    Contributor

    Choose a reason for hiding this comment

    The reason will be displayed to describe this comment to others. Learn more.

    hardcoded to false makes the TMA load path untested - consider parameterizing or adding second test variant

    Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

    const int64_t dtype_bytes =
    dataTypeSizeByte(output_smem_cache->getDataType().value());
    const int64_t elements_per_chunk = swizzle_chunk_bytes / dtype_bytes;
    // tile_i1 must equal tma_swizzle_bytes / dtype_bytes.
    Copy link
    Contributor

    Choose a reason for hiding this comment

    The reason will be displayed to describe this comment to others. Learn more.

    comment says tile_i1 but code checks tile_i0

    Suggested change
    // tile_i1 must equal tma_swizzle_bytes / dtype_bytes.
    // tile_i0 must equal tma_swizzle_bytes / dtype_bytes.

    @liqiangxl
    Copy link
    Collaborator Author

    !build

    @liqiangxl liqiangxl added the enable-auto-merge Auto-merge a PR when: 1) PR mergeable 2) Internal CI complete 3) No failures label Feb 26, 2026
    @github-actions github-actions bot merged commit 3ac302f into main Feb 26, 2026
    20 checks passed
    @github-actions github-actions bot removed the enable-auto-merge Auto-merge a PR when: 1) PR mergeable 2) Internal CI complete 3) No failures label Feb 26, 2026
    @github-actions github-actions bot deleted the llu/transpose_output_smem branch February 26, 2026 21:49
    Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

    Labels

    None yet

    Projects

    None yet

    Development

    Successfully merging this pull request may close these issues.

    2 participants