manual schedule of a transpose in output cached smem by liqiangxl · Pull Request #6008 · NVIDIA/Fuser

liqiangxl · 2026-02-24T17:27:43Z

This PR adds a manual scheduling test case demonstrating how to perform a transpose on cached output shared memory using a TMA store.
The transpose scheduler may apply the transpose on either the cached input or the cached output, depending on the number of inputs and outputs. The guiding principle is to minimize the total number of transposes required; for example, it transposes on the output side when there are more inputs than outputs.
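The principle above can be sketched as a small decision helper. This is only an illustration: the name `chooseTransposeSide` and the cost model (one shared-memory transpose per tensor on the chosen side) are assumptions, not nvFuser's actual heuristic.

```cpp
#include <cstdint>

// Hypothetical illustration of the "fewest transposes" principle.
enum class TransposeSide { Input, Output };

// Assumes one transpose is needed per tensor on the chosen side, so the
// cheaper side is the one with fewer tensors. Ties go to the input side
// here; that tie-break is an arbitrary choice for this sketch.
TransposeSide chooseTransposeSide(int64_t num_inputs, int64_t num_outputs) {
  return num_inputs > num_outputs ? TransposeSide::Output
                                  : TransposeSide::Input;
}
```

With two inputs and one output, the helper picks the output side, which is the case the new test schedules by hand.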

github-actions · 2026-02-24T17:29:17Z

Review updated until commit 7bd7044

Auto-merge Status

✅ Internal CI is finished
✅ No failed checks
✅ PR is mergeable
ℹ️ PR mergeable_state: clean

Description

Adds new test case TransposeOutputSmem demonstrating manual transpose scheduling on output cached shared memory
Implements TMA store with output swizzle for transpose operations (loads rows, writes columns)
Includes detailed tiling, swizzling, and parallelization scheduling for optimal memory access
Provides performance comparison between TMA load vs non-TMA load approaches (852ms vs 814ms on GB200)
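To show why an output swizzle helps when rows are loaded but columns are written, here is a minimal sketch of an XOR swizzle. It assumes the common 128-byte pattern in which each row holds eight 16-byte chunks and the physical chunk index is the logical index XOR-ed with the row index modulo 8. The function names are hypothetical and the PR's exact index math may differ.

```cpp
#include <cstdint>
#include <set>

// Assumed layout: a 128-byte row holds 8 chunks of 16 bytes, and the
// swizzle pattern repeats every 8 rows.
constexpr int64_t kChunksPerRow = 8;

// Physical chunk slot for logical chunk `chunk` in row `row` (XOR swizzle).
int64_t swizzledChunk(int64_t row, int64_t chunk) {
  return chunk ^ (row % kChunksPerRow);
}

// Number of distinct chunk slots touched when the same logical chunk is
// accessed across `rows` consecutive rows (a column-wise access).
int64_t distinctSlotsInColumn(int64_t chunk, int64_t rows) {
  std::set<int64_t> slots;
  for (int64_t r = 0; r < rows; ++r) {
    slots.insert(swizzledChunk(r, chunk));
  }
  return static_cast<int64_t>(slots.size());
}
```

Without the XOR, a column-wise access over 8 rows would hit the same chunk slot 8 times; with it, the 8 accesses land in 8 different slots, which is the property that removes bank conflicts in this model.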

Changes walkthrough

Relevant files

Enhancement

tests/cpp/test_transpose.cpp (+156/-1): `Add output cached transpose scheduling test`
Add new test `TransposeInputSmem` replacing `TransposeTMALoadOptionalStore`
Add comprehensive `TransposeOutputSmem` test demonstrating output-cached transpose scheduling
Implement TMA store with output swizzle, tiling, and parallelization strategies
Include performance metrics and detailed scheduling comments

PR Reviewer Guide

Here are some key observations to aid the review process:

🧪 PR contains tests
⚡ No major issues detected

greptile-apps · 2026-02-24T17:29:43Z

Greptile Summary

Adds TransposeOutputSmem test demonstrating manual scheduling of transpose operation on output cached shared memory with TMA store. The test complements the existing TransposeInputSmem test by showing the alternative approach where transpose happens at output rather than input.

Key changes:

New test case manually schedules a 2D transpose [I0, I1] → [I1, I0] with TMA store on output
Thread scheduling: each thread loads multiple rows from input, writes as multiple columns to output
Uses 128-byte TMA swizzle with configurable tile sizes (tile_i0=32, tile_i1=64)
Test validates correctness on 16384×32768 float tensors
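The tiling in the summary can be made concrete with a CPU reference. This sketch uses the tile sizes quoted above (tile_i0=32, tile_i1=64); the function names are hypothetical and the loop nest only mirrors the tiling idea, not the GPU schedule or the TMA store.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Tile sizes quoted in the summary.
constexpr int64_t kTileI0 = 32;
constexpr int64_t kTileI1 = 64;

// Number of tiles covering an [i0, i1] tensor (rounding up on each axis).
int64_t numTiles(int64_t i0, int64_t i1) {
  return ((i0 + kTileI0 - 1) / kTileI0) * ((i1 + kTileI1 - 1) / kTileI1);
}

// CPU reference: transpose a row-major [i0, i1] matrix into a row-major
// [i1, i0] matrix, visiting the input one tile at a time.
std::vector<float> tiledTranspose(
    const std::vector<float>& in, int64_t i0, int64_t i1) {
  std::vector<float> out(in.size());
  for (int64_t t0 = 0; t0 < i0; t0 += kTileI0) {
    for (int64_t t1 = 0; t1 < i1; t1 += kTileI1) {
      for (int64_t r = t0; r < std::min(t0 + kTileI0, i0); ++r) {
        for (int64_t c = t1; c < std::min(t1 + kTileI1, i1); ++c) {
          out[c * i0 + r] = in[r * i1 + c];
        }
      }
    }
  }
  return out;
}
```

For the 16384×32768 test shape, `numTiles` gives 512 × 512 = 262144 tiles, assuming one tile maps to one block.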

Issues found:

Comment/code mismatch on line 1717 (says tile_i1 but checks tile_i0)

Confidence Score: 4/5

Safe to merge - test-only change with minor documentation inconsistency
Test-only change that adds new manual scheduling demonstration. Found one comment/code mismatch that should be corrected but doesn't affect functionality. Issues from previous review threads remain unaddressed but were already flagged.
No files require special attention beyond addressing the comment on line 1717

Important Files Changed

Filename	Overview
tests/cpp/test_transpose.cpp	Adds TransposeOutputSmem test demonstrating transpose on output cached smem with TMA store; minor comment inconsistency found

Last reviewed commit: 7bd7044

greptile-apps

1 file reviewed, 3 comments

greptile-apps · 2026-02-24T17:29:47Z

tests/cpp/test_transpose.cpp

+  ref_tv->split(-3, chunks_per_thread);
+  // [BIDx, tile_i0/chunk/cpt, cpt, chunk, tile_i1]
+  ref_tv->merge(-4, -1);
+  // [BIDx, tile_i1/chunk/cpt * tile_i0, cpt, chunk]


The comment appears to describe the merge incorrectly: the merged axis should be tile_i0/chunk/cpt * tile_i1, not tile_i1/chunk/cpt * tile_i0.

Suggested change

-  // [BIDx, tile_i1/chunk/cpt * tile_i0, cpt, chunk]
+  // [BIDx, tile_i0/chunk/cpt * tile_i1, cpt, chunk]
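The corrected label can be checked with a toy model that tracks only axis names. This is a sketch, not the nvFuser API: it assumes `split(axis, f)` replaces an axis X with [X/f, f], that `merge(a, b)` puts the merged axis at a's position and removes b, and that the domain before the split was [BIDx, tile_i0/chunk, chunk, tile_i1] (inferred from the diff comments).

```cpp
#include <string>
#include <vector>

// Toy model of a loop domain: each axis is just a label.
using Axes = std::vector<std::string>;

// Map a possibly negative axis index to a non-negative one.
int normalizeAxis(const Axes& axes, int i) {
  return i < 0 ? i + static_cast<int>(axes.size()) : i;
}

// split(axis, factor): replace axis X with [X/factor, factor].
void split(Axes& axes, int axis, const std::string& factor) {
  const int i = normalizeAxis(axes, axis);
  axes[i] = axes[i] + "/" + factor;
  axes.insert(axes.begin() + i + 1, factor);
}

// merge(outer, inner): the outer position becomes "outer * inner" and the
// inner axis is removed.
void merge(Axes& axes, int outer, int inner) {
  const int o = normalizeAxis(axes, outer);
  const int i = normalizeAxis(axes, inner);
  axes[o] = axes[o] + " * " + axes[i];
  axes.erase(axes.begin() + i);
}

// Replays the two scheduling calls from the diff on the assumed start state.
Axes replaySchedule() {
  Axes axes = {"BIDx", "tile_i0/chunk", "chunk", "tile_i1"};
  split(axes, -3, "cpt");  // [BIDx, tile_i0/chunk/cpt, cpt, chunk, tile_i1]
  merge(axes, -4, -1);     // [BIDx, tile_i0/chunk/cpt * tile_i1, cpt, chunk]
  return axes;
}
```

Both labels give the same numeric extent, since (tile_i0/chunk/cpt) * tile_i1 equals (tile_i1/chunk/cpt) * tile_i0, so the fix matters for documenting which axis was split rather than for the generated code.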

greptile-apps · 2026-02-24T17:29:48Z

tests/cpp/test_transpose.cpp

+// without tma load, 814 ms on GB200.
+TEST_F(TransposeTMA, TransposeOutputSmem) {
+  NVFUSER_TEST_CUDA_ARCH_GUARD(9, 0);
+  const bool use_tma_load = false;


Hardcoding `use_tma_load` to false leaves the TMA load path untested. Consider parameterizing the test or adding a second test variant.

greptile-apps · 2026-02-26T15:01:22Z

tests/cpp/test_transpose.cpp

+  const int64_t dtype_bytes =
+      dataTypeSizeByte(output_smem_cache->getDataType().value());
+  const int64_t elements_per_chunk = swizzle_chunk_bytes / dtype_bytes;
+  // tile_i1 must equal tma_swizzle_bytes / dtype_bytes.


The comment says tile_i1, but the code checks tile_i0.

Suggested change

-  // tile_i1 must equal tma_swizzle_bytes / dtype_bytes.
+  // tile_i0 must equal tma_swizzle_bytes / dtype_bytes.
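The constraint in the corrected comment works out as follows for the test's float data. The 128-byte swizzle width comes from the review summary; the 16-byte chunk size and the helper names are assumptions for this sketch, since the diff excerpt does not show the value of `swizzle_chunk_bytes`.

```cpp
#include <cstdint>

// 128-byte TMA swizzle (from the summary); 16-byte chunks are assumed.
constexpr int64_t kTmaSwizzleBytes = 128;
constexpr int64_t kSwizzleChunkBytes = 16;

// Elements held by one swizzle chunk for a given element size.
int64_t elementsPerChunk(int64_t dtype_bytes) {
  return kSwizzleChunkBytes / dtype_bytes;
}

// tile_i0 must span exactly one swizzle row:
// tile_i0 == tma_swizzle_bytes / dtype_bytes.
bool tileMatchesSwizzle(int64_t tile_i0, int64_t dtype_bytes) {
  return tile_i0 == kTmaSwizzleBytes / dtype_bytes;
}
```

For 4-byte floats this gives 128 / 4 = 32 elements, matching tile_i0=32 rather than tile_i1=64, which is consistent with the code checking tile_i0.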

liqiangxl · 2026-02-26T20:45:30Z

!build

liqiangxl added 5 commits February 23, 2026 13:03

output smem swizzle only

b20047d

remove bank conflicts

c4b66dc

Merge branch 'main' into llu/transpose_output_smem

3651e80

clean

6893f63

non tma load

352977c

liqiangxl marked this pull request as ready for review February 24, 2026 17:27

liqiangxl requested a review from rdspring1 February 24, 2026 17:27

greptile-apps bot reviewed Feb 24, 2026

update comment

7bd7044

greptile-apps bot reviewed Feb 26, 2026

rdspring1 approved these changes Feb 26, 2026

liqiangxl added the enable-auto-merge label (auto-merge a PR when: 1) PR mergeable 2) Internal CI complete 3) No failures) Feb 26, 2026

github-actions bot merged commit 3ac302f into main Feb 26, 2026
20 checks passed

github-actions bot removed the enable-auto-merge label Feb 26, 2026

github-actions bot deleted the llu/transpose_output_smem branch February 26, 2026 21:49

github-actions[bot] merged 6 commits into main from llu/transpose_output_smem