
Commit 9e92724

[Testing] Add a div0 check in the benchmarking function (#6868)
At Meta we try to reuse the Triton benchmarking infrastructure when comparing our Triton kernels against native baselines. We have found a [rare case where comparing to a CK baseline registers as "0ms"](https://github.com/pytorch-labs/tritonbench/blob/a13002697ff55096f495cd132d35cdc414ce36bf/tritonbench/operators/fp8_gemm_rowwise/operator.py#L204). This crashes our workstream, so this change adds a simple division-by-zero check to prevent the issue. The default of 1000 repeats is chosen arbitrarily.
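For illustration, a standalone sketch of the guarded repeat-count computation described above (not the library code itself; the names mirror the diff below, and the 1000-repeat fallback is the arbitrary default mentioned):

```python
def compute_n_repeat(rep: float, estimate_ms: float) -> int:
    """Pick how many times to replay the kernel inside the CUDA graph."""
    if estimate_ms == 0:
        # Extremely fast (or mis-measured) kernels would otherwise hit a
        # ZeroDivisionError in rep / estimate_ms; fall back to a fixed count.
        return 1000
    return max(1, int(rep / estimate_ms))

# With the default rep=20 ms and a 0.05 ms estimate this yields 400 repeats;
# with a 0 ms estimate it now returns 1000 instead of raising.
assert compute_n_repeat(20, 0.05) == 400
assert compute_n_repeat(20, 0) == 1000
```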
1 parent 74d7e4d commit 9e92724

python/triton/testing.py

Lines changed: 5 additions & 1 deletion
@@ -95,7 +95,11 @@ def do_bench_cudagraph(fn, rep=20, grad_to_none=None, quantiles=None, return_mod
         end_event.record()
         torch.cuda.synchronize()
         estimate_ms = start_event.elapsed_time(end_event) / 5
-        n_repeat = max(1, int(rep / estimate_ms))
+        # Rewrite to avoid possible division by 0 issues with fast benchmarks
+        if estimate_ms == 0:
+            n_repeat = 1000
+        else:
+            n_repeat = max(1, int(rep / estimate_ms))
         # step 2 - construct a cuda graph with `n_repeat` unrolled function calls to minimize
         # host overhead
         g = torch.cuda.CUDAGraph()
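For context, a minimal usage sketch of the patched helper (assumes a CUDA-capable GPU; the matmul workload and tensor shapes are illustrative, not part of the commit):

```python
import torch
from triton.testing import do_bench_cudagraph

# Illustrative workload: benchmark a small fp16 matmul under CUDA graphs.
# With the patched helper, a run whose 5-iteration estimate rounds to 0 ms
# falls back to 1000 repeats instead of raising ZeroDivisionError.
a = torch.randn(128, 128, device="cuda", dtype=torch.float16)
b = torch.randn(128, 128, device="cuda", dtype=torch.float16)
ms = do_bench_cudagraph(lambda: torch.matmul(a, b), rep=20)
print(f"mean kernel time: {ms:.4f} ms")
```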
