Commit 7a4cfe7

wenqiny and Jokeren authored
[INTERPRETER] Fix tl.max consistency issue (#7349)
# `tl.max` shows different behavior from `torch.max` and the interpreter

Related issue: #6635

## Summary

I found that `tl.max()` does not return `nan` when the input tensor contains a `nan`, whereas both `torch.max` and the interpreter return `nan` if one is present in the input. It looks like:

```
tl.max("nan", "inf") = "inf"
torch.max("nan", "inf") = "nan"
```

I think this may cause data-consistency issues with torch, as in the issue mentioned above.

## Case

<details>
<summary> Simple repro </summary>

```python
import triton
import triton.language as tl
import torch


@triton.jit
def max_kernel(X_ptr, Y_ptr, BLOCK_SIZE: tl.constexpr):
    offsets = tl.arange(0, BLOCK_SIZE)  # [0, 1, ..., 63]
    x = tl.load(X_ptr + offsets)  # Load 64 elements
    max_val = tl.max(x, axis=0)  # Compute max across all 64 elements
    # Only one block (e.g., block 0) writes out the result
    if tl.program_id(0) == 0:
        tl.store(Y_ptr, max_val)


inf = float("Inf")
# inf = float(999999)
nan = float("nan")

# A tensor containing one "nan" scalar; every other element is "inf"
# (any finite float would also work).
x = torch.tensor([nan] + [inf] * 63, device='cuda:0')

# Output tensor: just one element to hold the max
y = torch.empty(1, device='cuda')

# Launch the kernel with 1 block
max_kernel[1,](x, y, BLOCK_SIZE=64)

# Fetch the result
print("Max value triton:", y[0].item())
print("Max value torch:", torch.max(x).item())  # for validation
```

</details>

When we run the code above on **a tensor containing one "nan" scalar, with every other element "inf" or any finite float**, the output is:

```
Max value triton: inf
Max value torch: nan
```

If we run triton in interpreter mode, the output is:

```
Max value triton: nan
Max value torch: nan
```

We can see that **triton behaves differently from torch and the interpreter**.

## Root cause

The root cause is that `_elementwise_max` generates an [`arith.maxnumf`](https://mlir.llvm.org/docs/Dialects/ArithOps/#arithmaxnumf-arithmaxnumfop) op by default, which means **"If one of the arguments is NaN, then the result is the other argument."**

https://github.com/triton-lang/triton/blob/3b41514dc2526628deadbe5271b5596ffa2fb820/python/triton/language/semantic.py#L380-L381

## Next steps

This PR is still in a draft stage, but if it makes sense to you, I will fix any tests that break and extend the change to `tl.min` as well. Separately, I've noticed that `tl.argmax` also exhibits an inconsistency with torch and the interpreter; I plan to address that in a follow-up PR once this one is finalized.

<!---
The core Triton is a small number of people, and we receive many PRs (thank you!). To help us review your code more quickly, **if you are a new contributor (less than 3 PRs merged) we ask that you complete the following tasks and include the filled-out checklist in your PR description.** Complete the following tasks before sending your PR, and replace `[ ]` with `[x]` to indicate you have done them.
-->

# New contributor declaration

- [x] I am not making a trivial change, such as fixing a typo in a comment.
- [x] I have written a PR description following these [rules](https://cbea.ms/git-commit/#why-not-how).
- [x] I have run `pre-commit run --from-ref origin/main --to-ref HEAD`.
- Select one of the following.
  - [x] I have added tests.
    - `/test` for `lit` tests
    - `/unittest` for C++ tests
    - `/python/test` for end-to-end tests
  - [ ] This PR does not need a test because `It's in RFC stage, will add test later`.
- Select one of the following.
  - [x] I have not added any `lit` tests.
  - [ ] The `lit` tests I have added follow these [best practices](https://mlir.llvm.org/getting_started/TestingGuide/#filecheck-best-practices), including the "tests should be minimal" section. (Usually running Python code and using the instructions it generates is not minimal.)

---------

Co-authored-by: Keren Zhou <[email protected]>
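The `arith.maxnumf` semantics described in the root cause can be mirrored in plain NumPy (which is what the interpreter runs on): `np.fmax` follows the same rule (a NaN operand is ignored and the other argument is returned), while `np.maximum` propagates NaN the way `torch.max` does on a pair of scalars. A minimal sketch, NumPy only, no GPU required:

```python
import math

import numpy as np

# arith.maxnumf semantics: if one argument is NaN, the result is the
# other argument -- np.fmax behaves the same way.
print(np.fmax(np.nan, np.inf))     # inf

# NaN-propagating semantics, as torch.max exhibits on this input:
print(np.maximum(np.nan, np.inf))  # nan

# Sanity checks
assert np.fmax(np.nan, np.inf) == np.inf
assert math.isnan(np.maximum(np.nan, np.inf))
```

This is the same split the PR resolves: compiled `tl.max` follows the `fmax`-style "nan ignore" rule, so the interpreter is updated to match it.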
1 parent 00d5ca7 commit 7a4cfe7

File tree

2 files changed

+45
-2
lines changed

python/test/unit/language/test_core.py

Lines changed: 43 additions & 0 deletions
```diff
@@ -2422,6 +2422,49 @@ def kernel(X, Z, BLOCK: tl.constexpr):
     assert z[0] == 0


+@pytest.mark.interpreter
+def test_max_min_with_nan(device):
+    # In triton, we implement a "nan ignore" style: if there is a NaN in the
+    # reduce dimension, we ignore it and return the max/min of the remaining
+    # numbers. This differs from torch.max/min, which propagate NaN.
+    @triton.jit
+    def max_kernel(x_ptr, y_ptr, BLOCK_SIZE: tl.constexpr):
+        offsets = tl.arange(0, BLOCK_SIZE)
+        x = tl.load(x_ptr + offsets)
+
+        max_val = tl.max(x, axis=0)
+
+        if tl.program_id(0) == 0:
+            tl.store(y_ptr, max_val)
+
+    @triton.jit
+    def min_kernel(x_ptr, y_ptr, BLOCK_SIZE: tl.constexpr):
+        offsets = tl.arange(0, BLOCK_SIZE)
+        x = tl.load(x_ptr + offsets)
+
+        min_val = tl.min(x, axis=0)
+
+        if tl.program_id(0) == 0:
+            tl.store(y_ptr, min_val)
+
+    BLOCK_SIZE = 64
+    x = torch.rand((1, BLOCK_SIZE), dtype=torch.float32, device=device)
+    # NaN should be ignored, not returned, by tl.max/tl.min
+    x[0, 0] = float('nan')
+    # Expected output for tl.min
+    x[0, 1] = float('-inf')
+    # Expected output for tl.max
+    x[0, 2] = float('inf')
+
+    y = torch.ones(1, device=device)
+
+    max_kernel[(1, )](x, y, BLOCK_SIZE=BLOCK_SIZE)
+    assert y[0] == float('inf')
+
+    min_kernel[(1, )](x, y, BLOCK_SIZE=BLOCK_SIZE)
+    assert y[0] == float('-inf')
+
 def get_reduced_dtype(dtype_str, op):
     if op in ('argmin', 'argmax'):
         return 'int32'
```

python/triton/runtime/interpreter.py

Lines changed: 2 additions & 2 deletions
```diff
@@ -934,9 +934,9 @@ def apply_impl(self, input):
         elif self.combine_fn == tl.standard._argmax_combine_tie_break_left:
             return self.min_max(input[0], val_reduce_op=np.max, idx_reduce_op=np.argmax)
         elif self.combine_fn == tl.standard._elementwise_max:
-            return self.min_max(input[0], val_reduce_op=np.max, idx_reduce_op=None)
+            return self.min_max(input[0], val_reduce_op=np.nanmax, idx_reduce_op=None)
         elif self.combine_fn == tl.standard._elementwise_min:
-            return self.min_max(input[0], val_reduce_op=np.min, idx_reduce_op=None)
+            return self.min_max(input[0], val_reduce_op=np.nanmin, idx_reduce_op=None)
         elif self.combine_fn == tl.standard._sum_combine:
             return self.sum(input[0])
         else:
```
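The effect of swapping `np.max`/`np.min` for `np.nanmax`/`np.nanmin` in the interpreter can be checked with a small NumPy snippet, independent of the interpreter internals: `np.max` propagates NaN, while `np.nanmax` ignores it and returns the largest non-NaN element, which matches the "nan ignore" behavior of compiled `tl.max`. A sketch:

```python
import math

import numpy as np

# Reduction input with one NaN, mirroring the new test's setup.
x = np.array([np.nan, np.inf, 1.0, -np.inf], dtype=np.float32)

# Before the fix the interpreter used np.max/np.min, which propagate NaN.
assert math.isnan(np.max(x))
assert math.isnan(np.min(x))

# After the fix it uses np.nanmax/np.nanmin, which ignore NaN and so
# agree with compiled tl.max/tl.min on this input.
assert np.nanmax(x) == np.inf
assert np.nanmin(x) == -np.inf
```

Note that `np.nanmax` still returns NaN (with a warning) when *every* element is NaN, so the all-NaN case remains an edge case for both backends.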
