[FRONTEND] Clone saved exception before raising (#8115)

aakhundov · web-flow · commit 6fa1dd664c73 · 2025-09-09T18:05:31.000Z
triton-lang/triton#7857 introduced re-raising the compilation error in subsequent `run` calls on a `CompiledKernel` instance. To do this, `functools.partial` was used to save the error in the closure of a function that then replaces the `self.run` method.

Once raised, the error saved in the closure gets a `__traceback__` attached to it, and the traceback holds on to the local variables of every stack frame it passed through. This is problematic because the `CompiledKernel` instance is then stored in the global kernel cache, which lives for the duration of the program. As a result, the objects referenced from the stack trace (e.g., Tensor inputs to the Triton kernel) are never freed, which is a memory leak.

This PR fixes the issue by cloning the exception *before* raising it. The cloning has to happen in two places: before creating the closure with `functools.partial`, and inside `_raise_error`. If `_raise_error` raised the saved exception instance directly, that instance would get a traceback attached to it, leading to the same problem.

P.S. triton-lang/triton#7857 caused some jobs to fail in PyTorch CI. The error: CUDAGraph capture in PT2 complains about dangling tensors after the model run. Investigation pointed to the issue fixed by this PR. For more details, see the Triton update tracker issue in PyTorch: pytorch/pytorch#159704.
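
The leak and the fix can be reproduced outside Triton with a minimal, standalone sketch. The names below (`Payload`, `call_with_payload`, `_raise_saved`, `_raise_clone`) are hypothetical and are not part of the Triton code base; `Payload` stands in for a tensor argument, and the behavior described assumes CPython's reference counting.

```python
import copy
import functools
import gc
import weakref


class Payload:
    """Hypothetical stand-in for a large object such as a tensor."""


def _raise_saved(err, *args, **kwargs):
    # Raises the very instance stored in the partial: it accumulates a traceback.
    raise err


def _raise_clone(err, *args, **kwargs):
    # Raises a fresh copy: the stored instance never gets a traceback.
    raise copy.deepcopy(err)


def call_with_payload(run):
    payload = Payload()  # local variable living in this stack frame
    ref = weakref.ref(payload)
    try:
        run(payload)
    except RuntimeError:
        pass
    return ref  # lets the caller check whether payload was freed


saved = RuntimeError("compilation failed")
leaky_run = functools.partial(_raise_saved, saved)
safe_run = functools.partial(_raise_clone, copy.deepcopy(saved))

leaky_ref = call_with_payload(leaky_run)
gc.collect()
leaked = leaky_ref() is not None  # payload kept alive via saved.__traceback__

safe_ref = call_with_payload(safe_run)
gc.collect()
freed = safe_ref() is None  # nothing long-lived references the frames

print("leaked with saved instance:", leaked)
print("freed with cloned instance:", freed)
```

For a plain exception like the one in this sketch, `copy.deepcopy` rebuilds the object from its type and arguments, so the copy starts with `__traceback__` set to `None`. That is why the instance held by `safe_run` never ends up referencing the frames of its callers.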
diff --git a/python/triton/compiler/compiler.py b/python/triton/compiler/compiler.py
@@ -15,6 +15,7 @@
 import functools
 import os
 import time
+import copy
 
 # - ^\s*tt\.func\s+ : match the start of the string, any leading whitespace, the keyword func,
 #    and any following whitespace
@@ -404,7 +405,7 @@ def __missing__(self, key):
 
 
 def _raise_error(err, *args, **kwargs):
-    raise err
+    raise copy.deepcopy(err)
 
 
 class CompiledKernel:
@@ -445,7 +446,13 @@ def _init_handles(self):
             return
 
         def raise_(err):
-            self._run = functools.partial(_raise_error, err)
+            # clone the exception object so that the one saved in the closure
+            # of the partial function below doesn't get assigned a stack trace
+            # after the subsequent raise. otherwise, the CompiledKernel instance
+            # saved in the (global) kernel cache will keep references to all the
+            # locals in the traceback via the exception instance in the closure.
+            cloned_err = copy.deepcopy(err)
+            self._run = functools.partial(_raise_error, cloned_err)
             raise err
 
         device = driver.active.get_current_device()