@adedespirlet adedespirlet commented Jan 26, 2026

This PR extends the LDS swizzling mechanism to support mismatched instruction granularities between the producer and the consumer (e.g., 2-element gather_to_lds writes vs. 8-element reads).
The current implementation applies swizzling only when the producer and consumer share the same granularity. If they differ, swizzling is skipped entirely, leaving potential bank conflicts unaddressed.

Implementation:

  1. Swizzle unit: a unified swizzle unit is defined as the maximum granularity across the gather and read ops. For example, if gather_to_lds writes 8 elements and vector.load reads 4 elements, the swizzle unit is 8.
  2. Max_phase: max_phase is now calculated relative to the swizzle unit rather than the individual operation granularities. This prevents out-of-bounds (OOB) accesses when a large swizzle unit is multiplied by a phase count derived from a smaller-granularity operation.
  3. Internal offset handling: an internal_offset is computed to preserve the thread's position within the swizzle unit when an operation accesses only a chunk of that unit.
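
The three points above can be illustrated with a minimal Python sketch. It is not the code from this PR: it assumes an XOR-based swizzle applied per row, a power-of-two number of swizzle units per row, and that max_phase equals that number of units. The function names and the `row_elems` parameter are hypothetical and chosen only for illustration.

```python
def swizzle_unit(write_granularity: int, read_granularity: int) -> int:
    # Unified unit: the larger of the producer and consumer granularities.
    return max(write_granularity, read_granularity)


def max_phase(row_elems: int, unit: int) -> int:
    # Assumption: phase count is the number of swizzle units in one row,
    # so it is derived from the swizzle unit, not from the smaller granularity.
    return row_elems // unit


def swizzle_column(row: int, col: int, row_elems: int,
                   write_granularity: int, read_granularity: int) -> int:
    unit = swizzle_unit(write_granularity, read_granularity)
    phases = max_phase(row_elems, unit)
    # XOR stays inside the row only if the phase count is a power of two.
    assert phases > 0 and phases & (phases - 1) == 0

    # Split the column into the unit it belongs to and the position inside it.
    unit_index, internal_offset = divmod(col, unit)
    # Permute whole units with XOR; the position inside the unit is preserved.
    swizzled_unit = unit_index ^ (row % phases)
    return swizzled_unit * unit + internal_offset
```

With a 32-element row, 2-element writes, and 8-element reads, the unit is 8 and there are 4 phases, so every swizzled column stays below 32. If the phase count were instead derived from the 2-element granularity (16 phases), `unit_index ^ phase` could reach 15, and multiplying by the 8-element unit would land well outside the row, which is the OOB case described in point 2.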

What has been added:

  • Added a lit test for the new swizzling implementation (test_gather_to_shared_with_mixed_granularity_swizzling)
  • Added a new option enable_swizzle (default=False) to WaveCompileOptions for explicit control over swizzling in tests
  • Updated test_gather_to_shared_wave_tile_aligned_coalescing to use enable_swizzle=False to avoid interference with coalescing affine map verification
  • Updated test_gather_to_shared_scaled_dims with new swizzle pattern
  • Updated testScaledBatchedGemmMXFP4Codegen to reflect new VGPR counts and waitcnt patterns with swizzling enabled

Limitations

Swizzling currently works only for the GEMM kernels and is disabled for all attention kernels. Enabling it there will require separate analysis.

@adedespirlet adedespirlet force-pushed the swizzle2 branch 2 times, most recently from ad90228 to a5fa0fc, January 27, 2026 10:34
Updated the max_phase calculation to be relative to the swizzle_unit (the maximum granularity). This prevents the out-of-bounds accesses that occur when a large swizzle unit is multiplied by a phase count derived from a smaller granularity.

Signed-off-by: Aurore De Spirlet <[email protected]>
Signed-off-by: Aurore De Spirlet <[email protected]>
Set enable_swizzle=False in test_gather_to_shared_wave_tile_aligned_coalescing to avoid interference with coalescing affine map verification

Signed-off-by: Aurore De Spirlet <[email protected]>
@adedespirlet adedespirlet force-pushed the swizzle2 branch 3 times, most recently from e286075 to 3b4d615, January 28, 2026 11:47