[Triton] Generate local MLIR reproducers when possible (#5155)

Mogball · web-flow · commit d5e06fed0275 · 2024-11-14T15:10:35.000-08:00
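When a reproducer path is set through the `TRITON_REPRODUCER_PATH` env
var, the pass manager dumps a standard MLIR reproducer before each pass
manager invocation. This PR also enables local crash reproducer
generation, written to the same path: if the pass pipeline fails at any
point, MLIR tries to narrow the reproducer down to the specific pass that
failed. Local reproducer generation requires single-threaded execution,
so multithreading is disabled on the MLIR context whenever a reproducer
path is set.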
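
As a rough usage sketch, the env var can be set from Python before any kernel is compiled. Only the name `TRITON_REPRODUCER_PATH` comes from this PR; the helper `set_reproducer_path` and the file location are hypothetical and for illustration only, and the sketch assumes the variable is read when a kernel is compiled.

```python
import os
import tempfile


def set_reproducer_path(path: str) -> str:
    """Hypothetical helper: point TRITON_REPRODUCER_PATH at `path`.

    Creates the parent directory so the reproducer file can be written,
    then exports the env var for the current process.
    """
    path = os.path.abspath(path)
    os.makedirs(os.path.dirname(path), exist_ok=True)
    os.environ["TRITON_REPRODUCER_PATH"] = path
    return path


# Illustrative location only; any writable path should do.
repro = set_reproducer_path(
    os.path.join(tempfile.gettempdir(), "triton_repro", "kernel.mlir")
)

# Assumption: kernels are compiled after this point, so the env var is
# already set when the pass manager runs.
print(f"MLIR reproducer will be written to {repro}")
```

Because every stage writes to the same path, the file holds the reproducer for the most recent stage, or the narrowed local reproducer if a stage failed.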
diff --git a/README.md b/README.md
@@ -176,6 +176,9 @@ For detailed instructions on how to debug Triton's frontend, please refer to thi
    kernels. Use `MLIR_ENABLE_DUMP=kernelName` to dump for a specific kernel only.
   - Triton cache can interfere with the dump. In cases where `MLIR_ENABLE_DUMP=1` does not work, try cleaning your triton cache: `rm -r ~/.triton/cache/*`
 - `LLVM_IR_ENABLE_DUMP=1` dumps the IR before every pass run over the LLVM IR.
+- `TRITON_REPRODUCER_PATH=<reproducer_path>` will generate an MLIR reproducer file
+  at `<reproducer_path>` before each MLIR compiler stage. If any of the stages fail,
+  `<reproducer_path>` will be a local MLIR reproducer captured right before the failing pass.
 - `TRITON_INTERPRET=1` uses the Triton interpreter instead of running on the
   GPU.  You can insert Python breakpoints in your kernel code!
 - `TRITON_ENABLE_LLVM_DEBUG=1` passes `-debug` to LLVM, printing a lot of
diff --git a/python/src/ir.cc b/python/src/ir.cc
@@ -1707,7 +1707,14 @@ void init_triton_ir(py::module &&m) {
           auto anchorName = self.getOpAnchorName();
           auto passes = self.getPasses();
           Operation *op = mod.getOperation();
+          // Save a reproducer for the current pass manager invocation
+          // immediately.
           makeReproducer(anchorName, passes, op, reproducerPath);
+          // But if the pass manager crashes, attempt to generate a local
+          // reproducer instead.
+          mod.getContext()->disableMultithreading();
+          self.enableCrashReproducerGeneration(reproducerPath,
+                                               /*genLocalReproducer=*/true);
         }
 
         if (triton::tools::getBoolEnv("TRITON_ENABLE_LLVM_DEBUG")) {
@@ -1740,6 +1747,12 @@ void init_triton_ir(py::module &&m) {
           self.enableTiming();
         }
 
+        // Run the pass manager under a source manager diagnostic handler, which
+        // enables emitted MLIR diagnostics to directly reference Python source
+        // code.
+        llvm::SourceMgr sourceMgr;
+        SourceMgrDiagnosticHandler diagHandler(sourceMgr, mod.getContext(),
+                                               llvm::errs());
         if (failed(self.run(mod.getOperation())))
           throw std::runtime_error("PassManager::run failed");
       });