manual schedule of a transpose in output cached smem #6008
github-actions[bot] merged 6 commits into main
Review updated until commit 7bd7044
Auto-merge Status: ✅ Internal CI is finished
PR Reviewer Guide
Here are some key observations to aid the review process:
🧪 PR contains tests
⚡ No major issues detected
Greptile Summary
Confidence Score: 4/5
Last reviewed commit: 7bd7044
    ref_tv->split(-3, chunks_per_thread);
    // [BIDx, tile_i0/chunk/cpt, cpt, chunk, tile_i1]
    ref_tv->merge(-4, -1);
    // [BIDx, tile_i1/chunk/cpt * tile_i0, cpt, chunk]
The comment describes the merge result incorrectly: `merge(-4, -1)` produces `tile_i0/chunk/cpt * tile_i1`, not `tile_i1/chunk/cpt * tile_i0`.
Suggested change:
    - // [BIDx, tile_i1/chunk/cpt * tile_i0, cpt, chunk]
    + // [BIDx, tile_i0/chunk/cpt * tile_i1, cpt, chunk]
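The axis bookkeeping behind this review comment can be checked with a small stand-alone model. The sketch below is not nvFuser code: `Axes`, `wrap`, `split`, `merge`, and `scheduleFromSnippet` are hypothetical helpers that track symbolic extents as strings. It assumes an inner split that replaces one axis with `[extent/factor, factor]`, a merge that places its result at the lower of the two positions, and a pre-split layout of `[BIDx, tile_i0/chunk, chunk, tile_i1]` inferred from the comments in the excerpt.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// Toy model of a loop domain: each axis is just a symbolic extent string.
using Axes = std::vector<std::string>;

// Convert a possibly negative (Python-style) axis index to a position.
int wrap(const Axes& axes, int axis) {
  const int n = static_cast<int>(axes.size());
  const int pos = axis < 0 ? axis + n : axis;
  assert(pos >= 0 && pos < n);
  return pos;
}

// Assumed inner split: replace the axis with [extent/factor, factor].
void split(Axes& axes, int axis, const std::string& factor) {
  const int pos = wrap(axes, axis);
  axes[pos] = axes[pos] + "/" + factor;
  axes.insert(axes.begin() + pos + 1, factor);
}

// Assumed merge: combine outer and inner axes into "outer * inner" and
// place the result at the lower of the two original positions.
void merge(Axes& axes, int outer_axis, int inner_axis) {
  const int o = wrap(axes, outer_axis);
  const int i = wrap(axes, inner_axis);
  assert(o != i);
  const std::string merged = axes[o] + " * " + axes[i];
  const int lo = std::min(o, i);
  const int hi = std::max(o, i);
  axes.erase(axes.begin() + hi);  // erase the higher position first
  axes.erase(axes.begin() + lo);
  axes.insert(axes.begin() + lo, merged);
}

// Replays the two scheduling calls from the reviewed excerpt.
Axes scheduleFromSnippet() {
  Axes axes = {"BIDx", "tile_i0/chunk", "chunk", "tile_i1"};  // assumed start
  split(axes, -3, "cpt");  // [BIDx, tile_i0/chunk/cpt, cpt, chunk, tile_i1]
  merge(axes, -4, -1);     // [BIDx, tile_i0/chunk/cpt * tile_i1, cpt, chunk]
  return axes;
}
```

Calling `scheduleFromSnippet()` under these assumptions yields a four-axis domain whose second entry is `tile_i0/chunk/cpt * tile_i1`, which matches the suggested comment rather than the original one.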
    // without tma load, 814 ms on GB200.
    TEST_F(TransposeTMA, TransposeOutputSmem) {
      NVFUSER_TEST_CUDA_ARCH_GUARD(9, 0);
      const bool use_tma_load = false;
Hardcoding `use_tma_load` to `false` leaves the TMA load path untested. Consider parameterizing the test or adding a second test variant.
    const int64_t dtype_bytes =
        dataTypeSizeByte(output_smem_cache->getDataType().value());
    const int64_t elements_per_chunk = swizzle_chunk_bytes / dtype_bytes;
    // tile_i1 must equal tma_swizzle_bytes / dtype_bytes.
The comment says `tile_i1`, but the code checks `tile_i0`.
Suggested change:
    - // tile_i1 must equal tma_swizzle_bytes / dtype_bytes.
    + // tile_i0 must equal tma_swizzle_bytes / dtype_bytes.
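The arithmetic behind this constraint can be sketched in isolation. The helpers below (`elementsPerSpan`, `tileMatchesSwizzle`) are hypothetical and not part of nvFuser, and the concrete sizes used (a 128-byte swizzle, a 16-byte chunk, a 2-byte data type) are illustrative assumptions, not values taken from the PR.

```cpp
#include <cassert>
#include <cstdint>

// Number of elements of width dtype_bytes that fit in span_bytes.
// Mirrors `swizzle_chunk_bytes / dtype_bytes` from the excerpt.
constexpr std::int64_t elementsPerSpan(
    std::int64_t span_bytes, std::int64_t dtype_bytes) {
  return span_bytes / dtype_bytes;
}

// True when the tile extent equals the swizzle width measured in elements,
// i.e. tile_i0 == tma_swizzle_bytes / dtype_bytes with no remainder.
constexpr bool tileMatchesSwizzle(
    std::int64_t tile_i0,
    std::int64_t tma_swizzle_bytes,
    std::int64_t dtype_bytes) {
  return tma_swizzle_bytes % dtype_bytes == 0 &&
      tile_i0 == tma_swizzle_bytes / dtype_bytes;
}

// Assumed example: 2-byte elements, 16-byte chunks, 128-byte swizzle.
static_assert(elementsPerSpan(16, 2) == 8, "8 elements per chunk");
static_assert(tileMatchesSwizzle(64, 128, 2), "tile_i0 of 64 matches");
static_assert(!tileMatchesSwizzle(32, 128, 2), "tile_i0 of 32 does not");
```

With these assumed sizes, a tile extent of 64 satisfies the check and a 4-byte data type would require a tile extent of 32 instead.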
!build |
This PR adds a manual scheduling test case demonstrating how to perform a transpose on cached output shared memory using a TMA store.
The transpose scheduler may apply the transpose to either the cached input or the cached output, depending on the number of inputs and outputs. The guiding principle is to minimize the total number of transposes required; for example, it transposes the output when there are more inputs than outputs.
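The guiding principle can be illustrated with a minimal sketch. This is not the scheduler's actual heuristic: `TransposeSide` and `chooseTransposeSide` are hypothetical names, the sketch assumes one transpose per tensor on the chosen side, and the tie-breaking rule (prefer the cached input when the counts are equal) is an assumption the description does not specify.

```cpp
#include <cassert>
#include <cstdint>

// Which cached tensors carry the transpose (hypothetical enum).
enum class TransposeSide { CachedInput, CachedOutput };

// Pick the side with fewer tensors so that fewer transposes are needed.
// Assumption: ties go to the cached input.
constexpr TransposeSide chooseTransposeSide(
    std::int64_t num_inputs, std::int64_t num_outputs) {
  return num_inputs > num_outputs ? TransposeSide::CachedOutput
                                  : TransposeSide::CachedInput;
}

// Transposes required for a given choice, one per tensor on that side.
constexpr std::int64_t transposesNeeded(
    TransposeSide side, std::int64_t num_inputs, std::int64_t num_outputs) {
  return side == TransposeSide::CachedOutput ? num_outputs : num_inputs;
}

// Three inputs and one output: transposing the output needs one transpose.
static_assert(
    chooseTransposeSide(3, 1) == TransposeSide::CachedOutput,
    "more inputs than outputs selects the output side");
```

For a fusion with three inputs and one output, this sketch selects the cached output and needs a single transpose, whereas transposing the inputs would need three.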