[SWDEV-531526] [SWDEV-527340] Allocation of buffers ordered before compute (#2276)

jataylo · web-flow · commit 8b22352c1509 · 2025-06-27T07:06:50.000-05:00
Ensure fused nodes that allocate buffers come before kernels that use those buffers.

In one example we observed:
- op8 creates buf10, which mutates buf8
- the triton_poi_fused_index_put_lift_fresh_2 kernel tries to use buf8 and buf9
- op6_op7_op16 (fused node) creates buf8 and buf9

But the standard topological sort didn't ensure that the fused node creating buf8 and buf9 came before the kernel using them.

After this PR we identify that op8 performs a mutation on buf8, find the node responsible for creating that buffer (op6_op7_op16), and add an explicit dependency so that op8 depends on op6_op7_op16 and the graph is ordered accordingly.

Note: this issue is not seen in PT2.7, and it is not clear why. We will hold back on upstreaming this until we observe a similar issue on nightly.

Reproducer code (simplified from megatron): https://gist.github.com/jataylo/10bedef08323441c588d2965ad963ae8

Execute with
> torchrun --nproc_per_node 1 repro.py

Before PR
```
[rank0]:   File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/output_code.py", line 466, in __call__
[rank0]:     return self.current_callable(inputs)
[rank0]:   File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/utils.py", line 2128, in run
[rank0]:     return model(new_inputs)
[rank0]:   File "/tmp/torchinductor_root/gp/cgpe6weswyihhm442ugdhqxypbr7urxgk3adfr25onncik6tvthr.py", line 423, in call
[rank0]:     triton_poi_fused_index_put_lift_fresh_2.run(buf9, buf8, 256, grid=grid(256), stream=stream0)
[rank0]: UnboundLocalError: local variable 'buf9' referenced before assignment
```

Note that the simpler repro fails on both CUDA and ROCm and shows a logic issue across PT2.6; more details are in the gist.
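
The ordering rule described in this message can be sketched outside of Inductor as a small, self-contained example. This is only an illustration under stated assumptions: `ToyNode` and `toy_toposort` are hypothetical names, not Inductor classes or APIs, and buffers are modelled as plain strings. The sketch follows the same idea as the change to `topological_sort_schedule`: besides the regular dependencies, the depth-first visit also follows each mutated buffer back to the node that creates it, and non-mutating nodes are visited before mutating ones.

```python
from dataclasses import dataclass
from typing import Dict, List, Set, Tuple


@dataclass
class ToyNode:
    """Hypothetical stand-in for a scheduler node (not the Inductor class)."""
    name: str
    buffers: Tuple[str, ...] = ()  # buffers this node creates
    deps: Tuple[str, ...] = ()     # buffers this node reads
    mutates: Tuple[str, ...] = ()  # buffers mutated by this node's outputs


def toy_toposort(nodes: List[ToyNode]) -> List[str]:
    """Return node names ordered so that producers precede mutators."""
    # Map every buffer name to the node that creates it.
    buf_to_node: Dict[str, ToyNode] = {}
    for node in nodes:
        for buf in node.buffers:
            buf_to_node[buf] = node

    seen: Set[str] = set()
    order: List[str] = []

    def visit(node: ToyNode) -> None:
        if node.name in seen:
            return
        seen.add(node.name)
        # Regular dependencies: visit the producers of buffers we read.
        for buf in sorted(node.deps):
            if buf in buf_to_node:
                visit(buf_to_node[buf])
        # Mutation dependencies: visit the producer of each mutated buffer.
        for buf in node.mutates:
            producer = buf_to_node.get(buf)
            if producer is not None and producer is not node:
                visit(producer)
        order.append(node.name)

    # Non-mutating nodes first, then mutating nodes.
    for node in nodes:
        if not node.mutates:
            visit(node)
    for node in nodes:
        if node.mutates:
            visit(node)
    return order


# Toy version of the case above: op8 is listed first, but it mutates buf8,
# which is created by the fused node op6_op7_op16.
example_nodes = [
    ToyNode("op8", buffers=("buf10",), mutates=("buf8",)),
    ToyNode("op6_op7_op16", buffers=("buf8", "buf9")),
]
example_order = toy_toposort(example_nodes)
```

In this toy input op8 appears first in the list, yet `example_order` places op6_op7_op16 ahead of it, because the buffer that op8 mutates is traced back to its creating node.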
diff --git a/torch/_inductor/scheduler.py b/torch/_inductor/scheduler.py
@@ -2247,21 +2247,43 @@ def topological_sort_schedule(
         name_to_node: Dict[str, BaseSchedulerNode] = dict()
         result: List[BaseSchedulerNode] = []
 
+        def has_mutations(node: BaseSchedulerNode) -> bool:
+            return any(buf.get_mutations() for buf in node.get_outputs())
+
         def visit(n: BaseSchedulerNode) -> None:
             if n not in seen:
                 seen.add(n)
+
+                # Visit regular dependencies
                 for dep in sorted(n.unmet_dependencies, key=lambda d: d.name):
                     # We only care about doing toposort within `nodes`
                     if dep.name not in name_to_node:
                         continue
                     visit(name_to_node[dep.name])
+
+                # Visit mutation dependencies
+                for buf in n.get_outputs():
+                    for mutation in buf.get_mutations():
+                        if mutation in name_to_node and name_to_node[mutation] != n:
+                            visit(name_to_node[mutation])
+
                 result.append(n)
 
+        # Build name to node mapping
         for node in nodes:
             for name in node.get_buffer_names():
                 name_to_node[name] = node
+
+        # Visit non-mutation nodes first
+        for node in nodes:
+            if not has_mutations(node):
+                visit(node)
+
+        # Then visit mutation nodes
         for node in nodes:
-            visit(node)
+            if has_mutations(node):
+                visit(node)
+
         return result
 
     def _get_unmet_dep_nodes(self, snode: BaseSchedulerNode) -> List[BaseSchedulerNode]: