[XLA:GPU] Add documentation to Priority Fusion pass.

derdrdirk · Google-ML-Automation · commit 1351df86ec0d · 2025-11-26T10:14:26.000-08:00
PiperOrigin-RevId: 837168939
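
For readers of this change, the loop described in the new header comment can be sketched as a small standalone program. This is a toy illustration under stated assumptions, not the pass's implementation: the graph is restricted to a linear chain, and `ToyResult`, `BenefitFn`, `Name`, and `RunToyPriorityFusion` are hypothetical names. The caller-supplied `benefit` callback stands in for the cost model's `time_unfused - time_fused` estimate.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <map>
#include <set>
#include <string>
#include <utility>
#include <vector>

// Hypothetical toy types; none of these exist in XLA.
using Members = std::vector<std::string>;  // ops merged into one chain node
using BenefitFn =
    std::function<int(const std::string& producer, const std::string& consumer)>;

struct ToyResult {
  std::vector<std::string> pop_order;    // producers in the order they were popped
  std::vector<std::string> final_chain;  // node names once the queue is empty
};

// A single op keeps its name; merged ops are shown as fusion(A+B+...).
std::string Name(const Members& members) {
  if (members.size() == 1) return members[0];
  std::string name = "fusion(";
  for (std::size_t i = 0; i < members.size(); ++i) {
    if (i > 0) name += "+";
    name += members[i];
  }
  return name + ")";
}

// Runs the six documented steps on a chain where ops[i] feeds ops[i + 1].
// Assumes op names are unique.
ToyResult RunToyPriorityFusion(const std::vector<std::string>& ops,
                               const BenefitFn& benefit) {
  std::vector<Members> chain;
  for (const std::string& op : ops) chain.push_back({op});

  // Ordered (priority, producer) pairs; the highest priority is the last element.
  std::set<std::pair<int, std::string>> queue;
  std::map<std::string, int> queued;  // producer name -> priority currently queued

  auto dequeue = [&](const std::string& name) {
    auto it = queued.find(name);
    if (it == queued.end()) return;
    queue.erase(std::make_pair(it->second, name));
    queued.erase(it);
  };
  // Recomputes the priority of the node at index i as a producer.
  // Only a positive benefit is enqueued.
  auto refresh = [&](std::size_t i) {
    const std::string name = Name(chain[i]);
    dequeue(name);
    if (i + 1 >= chain.size()) return;  // the last node has no consumer
    const int priority = benefit(name, Name(chain[i + 1]));
    if (priority > 0) {
      queue.emplace(priority, name);
      queued[name] = priority;
    }
  };

  for (std::size_t i = 0; i < chain.size(); ++i) refresh(i);  // steps 1-2

  ToyResult result;
  while (!queue.empty()) {                                 // step 6
    const std::string producer = queue.rbegin()->second;  // step 3
    dequeue(producer);
    result.pop_order.push_back(producer);

    std::size_t i = 0;
    while (i < chain.size() && Name(chain[i]) != producer) ++i;
    assert(i + 1 < chain.size() && "a queued producer always has a consumer");

    dequeue(Name(chain[i + 1]));  // the consumer is replaced by the fused node
    chain[i].insert(chain[i].end(), chain[i + 1].begin(), chain[i + 1].end());
    chain.erase(chain.begin() + i + 1);  // step 4

    refresh(i);                 // step 5: the new fused node
    if (i > 0) refresh(i - 1);  // step 5: the operand whose consumer changed
  }
  for (const Members& members : chain) result.final_chain.push_back(Name(members));
  return result;
}
```

Calling `RunToyPriorityFusion({"A", "B", "C"}, benefit)` with a `benefit` that returns 10 for A into B, 5 for B into C, and 8 for fusion(A+B) into C follows the example in the comment: A is popped first, then fusion(A+B), and the chain ends as a single fusion(A+B+C) node. The sketch omits producer duplication and multi-consumer graphs, both of which the comment mentions.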
diff --git a/xla/service/gpu/transforms/priority_fusion.h b/xla/service/gpu/transforms/priority_fusion.h
@@ -39,6 +39,45 @@ limitations under the License.
 namespace xla {
 namespace gpu {
 
+// PriorityFusion is the main fusion pass for XLA:GPU. It is an HLO pass that
+// assigns a priority to each producer instruction based on the estimated
+// performance benefit of fusing it into its consumers. The benefit is
+// calculated using a performance cost model:
+//
+//   priority = time_unfused - time_fused
+//
+// Note: If fusing a producer into its consumers requires duplicating the
+// producer, the cost model accounts for this duplication.
+//
+// The algorithm can be summarized in the following steps:
+// 1. For each producer, call the cost model to estimate the potential benefit
+//    of fusing it into all of its consumers.
+// 2. Put all producers with a positive benefit into a priority queue, ordered
+//    by benefit.
+// 3. Pop the producer with the highest priority from the queue.
+// 4. Fuse the producer with its consumers. This may result in a new fusion
+//    instruction, or merging into an existing fusion.
+// 5. Recompute the priorities of the operands of the fused instructions
+//    and of instructions whose consumers have changed, then update their
+//    entries in the priority queue.
+// 6. If the queue is not empty, go to step 3.
+//
+// Example:
+// Consider A -> B -> C, where A, B, and C are fusible operations.
+// The fusible producers are A and B.
+//
+// Priorities are computed:
+//  - P(A) = benefit of fusing A into B.
+//  - P(B) = benefit of fusing B into C.
+//
+// Assuming P(A)=10 and P(B)=5, the queue is [(A,10), (B,5)].
+//  - A is popped and fused into B, creating fusion(A+B).
+//  - The graph becomes fusion(A+B) -> C.
+//  - The priority of fusion(A+B) is computed: P(fusion(A+B))=8.
+//  - The queue becomes [(fusion(A+B),8)], replacing the entry for B.
+//  - fusion(A+B) is popped and fused into C, creating fusion(A+B+C).
+//  - The queue becomes empty, and fusion terminates.
+//
 class PriorityFusion : public HloModulePass {
  public:
   PriorityFusion(tsl::thread::ThreadPool* thread_pool,