[AMD] Always pipeline small loads on RDNA (triton-lang#8063)

ptrojahn · paultrojahnamd · web-flow · commit b285609db19c · 2025-09-04T12:09:44.000-07:00
On RDNA, we always pipeline through registers, and s_wait_loadcnt can
only check completion of loads in the order they were dispatched. If
small loads are not pipelined, waiting on them can force a wait on the
pipelined loads as well, negating the benefits of pipelining.
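
The ordering problem can be illustrated with a small host-side model.
This is only a sketch and not code from this change: LoadQueue,
waitCount, waitFor and the load names are hypothetical, and the model
assumes loads retire strictly in issue order and that the only wait
primitive is "block until at most n loads are outstanding".

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical model of an in-order load counter (not a Triton or AMD API).
struct LoadQueue {
  std::vector<std::string> outstanding; // oldest load first

  void issue(const std::string &name) { outstanding.push_back(name); }

  // Counter-style wait: retire the oldest loads until at most n remain.
  // Returns the loads this wait had to complete.
  std::vector<std::string> waitCount(std::size_t n) {
    std::vector<std::string> retired;
    while (outstanding.size() > n) {
      retired.push_back(outstanding.front());
      outstanding.erase(outstanding.begin());
    }
    return retired;
  }

  // Wait for one named load. Because completion is in issue order, every
  // load issued before it is retired as well.
  std::vector<std::string> waitFor(const std::string &name) {
    for (std::size_t i = 0; i < outstanding.size(); ++i)
      if (outstanding[i] == name)
        return waitCount(outstanding.size() - i - 1);
    return {};
  }
};

// Usage sketch comparing the two issue orders.
inline void demo() {
  // Small load not pipelined: it is issued after the prefetch for a later
  // iteration, so waiting on it also drains the prefetch.
  LoadQueue unpipelined;
  unpipelined.issue("prefetch_next_iter");
  unpipelined.issue("small_load");
  assert(unpipelined.waitFor("small_load").size() == 2);

  // Small load pipelined: it was issued earlier, so the newer prefetch can
  // stay in flight while we wait.
  LoadQueue pipelined;
  pipelined.issue("small_load");
  pipelined.issue("prefetch_next_iter");
  assert(pipelined.waitFor("small_load").size() == 1);
}
```

In the first case the wait on the small load also blocks on the
prefetch, which is the situation this change avoids by no longer
filtering small loads out of pipelining on RDNA.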

Co-authored-by: Paul Trojahn <paul.trojahn@amd.com>
diff --git a/third_party/amd/lib/TritonAMDGPUTransforms/StreamPipeline.cpp b/third_party/amd/lib/TritonAMDGPUTransforms/StreamPipeline.cpp
@@ -555,7 +555,8 @@ preprocessLoop(triton::AMD::ModuleAxisInfoAnalysis &axisInfoAnalysis,
     isaFamily = triton::AMD::deduceISAFamily(*arch);
 
   bool pipelineWithoutDot = forOp->hasAttr(mlir::triton::kNumStagesAttrName);
-  bool filterSmallVectors = isaFamily != triton::AMD::ISAFamily::CDNA4;
+  bool filterSmallVectors =
+      isaFamily != triton::AMD::ISAFamily::CDNA4 && !isRDNA(isaFamily);
   llvm::MapVector<Operation *, std::pair<int, Operation *>> loadOpToIndLevel =
       triton::gpu::loadOpsToIndirectionLevel(forOp, pipelineWithoutDot,
                                              axisInfoAnalysis, numStages,