
Commit a94dee0

Add extra attribute in PrintOp to propagate signness info (intel#4363)
The core Triton team is a small number of people, and we receive many PRs (thank you!). To help us review your code more quickly, **if you are a new contributor (less than 3 PRs merged) we ask that you complete the following tasks and include the filled-out checklist in your PR description.**

--------

This PR aims to address triton-lang/triton#4248: correctly `device_print` a value when it is a signed integer. The signedness info is lost when lowering TTGIR to LLIR (e.g. `i32` is always **signless** in MLIR), but the **lowered** data type is currently used for constructing the format specifier in the `PrintOpToLLVM` implementation (triton-lang/triton#4248 (comment)), so a negative value is printed out as an unsigned int, which confuses users (a short standalone sketch of this reinterpretation follows this description).

A minimal reproducer is

```python
import torch
import triton
import triton.language as tl

@triton.jit
def print_kernel(ptr):
    value = tl.load(ptr)
    tl.device_print("value in kernel from device_print", value)

print_kernel[(1,)](torch.tensor(10, dtype=torch.int32).cuda())
print_kernel[(1,)](torch.tensor(-10, dtype=torch.int32).cuda())
print_kernel[(1,)](torch.tensor((1 << 31) + 1000, dtype=torch.uint32).cuda())
```

Currently, it prints

```
pid (0, 0, 0) idx () value in kernel from device_print: 10
...
pid (0, 0, 0) idx () value in kernel from device_print: 4294967286
...
pid (0, 0, 0) idx () value in kernel from device_print: 2147484648
```

(always as an unsigned int).

This PR adds an extra `isSigned` attribute to `PrintOp` to indicate whether each operand of the `PrintOp` should be printed as signed. With this, the program above now prints correctly:

```
pid (0, 0, 0) idx () value in kernel from device_print: 10
...
pid (0, 0, 0) idx () value in kernel from device_print: -10
...
pid (0, 0, 0) idx () value in kernel from device_print: 2147484648
```

Extra LIT tests and Python unit tests are added as well; also manually verified that they fail without the fix and pass with it, by running

```
$ pytest python/test/unit/language/test_subprocess.py
$ cd python/build/cmake.linux-x86_64-cpython-3.10; lit test
```

**Alternative considered**: add `uint32` to the Triton MLIR data type definitions and then rely on the Triton op data type to determine the format specifier, retaining the original signedness info, as in commit triton-lang/triton@f7a7407. However, as the PR reviewer pointed out, that means adding a new data type to the Triton IR just for this purpose, which is overkill and introduces unnecessary maintenance overhead, and is thus less ideal.

--------

Complete the following tasks before sending your PR, and replace `[ ]` with `[x]` to indicate you have done them.

- [X] I am not making a trivial change, such as fixing a typo in a comment.
- [X] I have written a PR description following these [rules](https://cbea.ms/git-commit/#why-not-how).
- [X] I have run `pre-commit run --from-ref origin/main --to-ref HEAD`.
- Select one of the following.
  - [X] I have added tests.
    - `/test` for `lit` tests
    - `/unittest` for C++ tests
    - `/python/test` for end-to-end tests
  - [ ] This PR does not need a test because `FILL THIS IN`.
- Select one of the following.
  - [ ] I have not added any `lit` tests.
  - [X] The `lit` tests I have added follow these [best practices](https://mlir.llvm.org/getting_started/TestingGuide/#filecheck-best-practices), including the "tests should be minimal" section. (Usually running Python code and using the instructions it generates is not minimal.)
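For reference, a minimal sketch (plain Python, independent of Triton; the helper name is illustrative only) of the two's-complement reinterpretation that a `%u` specifier performs on a negative 32-bit value, reproducing the 4294967286 seen in the buggy output above:

```python
import struct

def as_unsigned_u32(value: int) -> int:
    # Reinterpret a 32-bit two's-complement bit pattern as unsigned,
    # which is effectively what printing a signed i32 with "%u" does.
    return struct.unpack("<I", struct.pack("<i", value))[0]

print(as_unsigned_u32(10))    # 10
print(as_unsigned_u32(-10))   # 4294967286, matching the incorrect output above
```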
1 parent 9c7f9f8 commit a94dee0

File tree

11 files changed (+78, -28 lines)


include/triton/Dialect/Triton/IR/TritonOps.td

Lines changed: 8 additions & 2 deletions

```diff
@@ -820,8 +820,14 @@ def TT_HistogramOp : TT_Op<"histogram", [Pure]> {
 //
 // Print Op
 //
-def TT_PrintOp : TT_Op<"print", [MemoryEffects<[MemWrite<GlobalMemory>]>]>,
-  Arguments<(ins StrAttr:$prefix, BoolAttr:$hex, Variadic<AnyTypeOf<[TT_Type]>>:$args)> {
+def TT_PrintOp : TT_Op<"print", [SameVariadicOperandSize, MemoryEffects<[MemWrite<GlobalMemory>]>]> {
+  let arguments = (
+    ins
+    StrAttr:$prefix,
+    BoolAttr:$hex,
+    Variadic<AnyTypeOf<[TT_Type]>>:$args,
+    DenseI32ArrayAttr:$isSigned
+  );
   let summary = "Device-side print, as in CUDA for debugging";
   let description = [{
     `tt.print` takes a literal string prefix and an arbitrary number of scalar or tensor arguments that should be printed.
```

lib/Conversion/TritonGPUToLLVM/PrintOpToLLVM.cpp

Lines changed: 14 additions & 13 deletions

```diff
@@ -44,7 +44,10 @@ struct PrintOpConversion : public ConvertOpToLLVMPattern<triton::PrintOp> {
       return success();
     }
 
+    assert(op.getNumOperands() == op.getIsSigned().size());
+
     for (size_t i = 0; i < op.getNumOperands(); i++) {
+      bool isSigned = op.getIsSigned()[i] > 0;
       // Elements of the tensor that are resident in this GPU thread.
       auto elems = unpackLLElements(loc, adaptor.getOperands()[i], rewriter);
 
@@ -76,7 +79,7 @@ struct PrintOpConversion : public ConvertOpToLLVMPattern<triton::PrintOp> {
       if (!elems.empty()) {
         printTensor(op.getPrefix(), /*operand=*/i,
                     /*numOperands=*/op.getNumOperands(), elems, pid, indices,
-                    dimWidths, op.getHex(), rewriter);
+                    dimWidths, op.getHex(), rewriter, isSigned);
       }
     }
     rewriter.eraseOp(op);
@@ -87,7 +90,7 @@ struct PrintOpConversion : public ConvertOpToLLVMPattern<triton::PrintOp> {
                    ArrayRef<Value> elems, std::array<Value, 3> pid,
                    ArrayRef<SmallVector<Value>> indices,
                    ArrayRef<int> dimWidths, bool hex,
-                   ConversionPatternRewriter &rewriter) const {
+                   ConversionPatternRewriter &rewriter, bool isSigned) const {
     assert(!elems.empty());
     assert(elems.size() == indices.size());
     assert(dimWidths.size() == indices.front().size());
@@ -151,7 +154,8 @@ struct PrintOpConversion : public ConvertOpToLLVMPattern<triton::PrintOp> {
       }
 
       auto elem = elems[i];
-      os << getFormatSubstr(elem, hex);
+
+      os << getFormatSubstr(elem, hex, /*width=*/std::nullopt, isSigned);
       printfOperands.push_back(elem);
 
       // It's the same format string each iteration, but it's a lot easier if we
@@ -169,8 +173,10 @@ struct PrintOpConversion : public ConvertOpToLLVMPattern<triton::PrintOp> {
   }
 
   std::string getFormatSubstr(Value value, bool hex = false,
-                              std::optional<int> width = std::nullopt) const {
+                              std::optional<int> width = std::nullopt,
+                              bool isSigned = false) const {
     Type type = value.getType();
+    // If the `value` is a pointer, just return %p.
    if (isa<LLVM::LLVMPointerType>(type)) {
      return "%p";
    }
@@ -192,21 +198,16 @@ struct PrintOpConversion : public ConvertOpToLLVMPattern<triton::PrintOp> {
       prefix += std::to_string(*width);
     } else if (hex) {
       prefix += "0";
-      prefix += std::to_string(value.getType().getIntOrFloatBitWidth() / 4);
+      prefix += std::to_string(type.getIntOrFloatBitWidth() / 4);
     }
 
     if (type.isBF16() || type.isF16() || type.isF32() || type.isF64()) {
       return prefix + "f";
-    } else if (type.isSignedInteger()) {
-      if (type.getIntOrFloatBitWidth() == 64)
-        return prefix + "lli";
-      else
-        return prefix + "i";
-    } else if (type.isUnsignedInteger() || type.isSignlessInteger()) {
+    } else if (type.isInteger()) {
       if (type.getIntOrFloatBitWidth() == 64)
-        return prefix + "llu";
+        return prefix + (isSigned ? "lli" : "llu");
       else
-        return prefix + "u";
+        return prefix + (isSigned ? "i" : "u");
     }
     assert(false && "not supported type");
     return "";
```

python/src/ir.cc

Lines changed: 5 additions & 5 deletions

```diff
@@ -1520,11 +1520,11 @@ void init_triton_ir(py::module &&m) {
       })
       .def("create_print",
            [](TritonOpBuilder &self, const std::string &prefix, bool hex,
-              const std::vector<Value> &values) -> void {
-             self.create<PrintOp>(
-                 StringAttr::get(self.getBuilder().getContext(),
-                                 llvm::StringRef(prefix)),
-                 hex, values);
+              const std::vector<Value> &values,
+              const std::vector<int32_t> &isSigned) -> void {
+             auto prefixAttr = StringAttr::get(self.getBuilder().getContext(),
+                                               llvm::StringRef(prefix));
+             self.create<PrintOp>(prefixAttr, hex, values, isSigned);
       })
       .def("create_assert",
            [](TritonOpBuilder &self, Value &condition,
```

python/test/unit/language/print_helper.py

Lines changed: 6 additions & 0 deletions

```diff
@@ -104,6 +104,12 @@ def test_print(func: str, data_type: str, device: str):
     elif func == "device_print_scalar":
         scalar = torch.tensor(42, dtype=x.dtype, device="cuda")
         kernel_device_print_scalar[(1, )](scalar, num_warps=num_warps)
+    elif func == "device_print_negative":
+        x = -x
+        kernel_device_print[(1, )](x, y, num_warps=num_warps, BLOCK=N)
+    elif func == "device_print_uint":
+        x = torch.arange((1 << 31), (1 << 31) + N, device=device).to(getattr(torch, data_type))
+        kernel_device_print[(1, )](x, y, num_warps=num_warps, BLOCK=N)
     elif func == "print":
         kernel_print[(1, )](x, y, num_warps=num_warps, BLOCK=N)
     elif func == "device_print_large":
```

python/test/unit/language/test_subprocess.py

Lines changed: 9 additions & 2 deletions

```diff
@@ -38,6 +38,8 @@ def is_interpreter():
     ("device_print_hex", "int32"),
     ("device_print_hex", "int64"),
     ("device_print_pointer", "int32"),
+    ("device_print_negative", "int32"),
+    ("device_print_uint", "uint32"),
 ])
 def test_print(func_type: str, data_type: str, device: str):
     proc = subprocess.run(
@@ -62,9 +64,10 @@ def test_print(func_type: str, data_type: str, device: str):
     # Format is
     #   pid (<x>, <y>, <z>) idx (<i1>, <i2>, ...) <prefix> (operand <n>) <elem>
     expected_lines = Counter()
-    if func_type == "print" or func_type == "device_print":
+    if func_type in ("print", "device_print", "device_print_uint"):
         for i in range(N):
-            line = f"pid (0, 0, 0) idx ({i:3}) x: {i}"
+            offset = (1 << 31) if data_type == "uint32" else 0
+            line = f"pid (0, 0, 0) idx ({i:3}) x: {i + offset}"
             if data_type.startswith("float"):
                 line += ".000000"
             expected_lines[line] = 1
@@ -73,6 +76,10 @@ def test_print(func_type: str, data_type: str, device: str):
             if data_type.startswith("float"):
                 line += ".000000"
             expected_lines[line] = N
+    elif func_type == "device_print_negative":
+        for i in range(N):
+            line = f"pid (0, 0, 0) idx ({i:3}) x: {-i}"
+            expected_lines[line] = 1
     elif func_type == "device_print_hex":
         for i in range(N):
             line = f"pid (0, 0, 0) idx ({i:3}) x: 0x"
```

python/triton/language/semantic.py

Lines changed: 2 additions & 1 deletion

```diff
@@ -1535,7 +1535,8 @@ def device_print(prefix: str, args: List[tl.tensor], hex: bool, builder: ir.buil
         prefix = " " + prefix
 
     new_args = [arg.handle for arg in args]
-    return tl.tensor(builder.create_print(prefix, hex, new_args), tl.void)
+    is_signed = [arg.dtype in (tl.int1, tl.int8, tl.int16, tl.int32, tl.int64) for arg in args]
+    return tl.tensor(builder.create_print(prefix, hex, new_args, is_signed), tl.void)
 
 
 def device_assert(cond: tl.tensor, msg: str, file_name: str, func_name, lineno: int, builder: ir.builder) -> tl.tensor:
```
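To illustrate the per-argument mapping the new list comprehension produces, a small standalone sketch (the dtype names here are plain strings standing in for the `tl` dtypes, purely for illustration):

```python
# Stand-in for the tl.int* dtypes checked in device_print above.
signed_dtypes = {"int1", "int8", "int16", "int32", "int64"}

# One flag per printed argument; unsigned and floating-point dtypes stay False.
arg_dtypes = ["int32", "uint32", "float32", "int64"]
is_signed = [dtype in signed_dtypes for dtype in arg_dtypes]
print(is_signed)  # [True, False, False, True]
```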

python/triton/runtime/interpreter.py

Lines changed: 4 additions & 1 deletion

```diff
@@ -628,7 +628,10 @@ def create_extern_elementwise(self, libName, libPath, symbol, argList, retType,
     def create_inline_asm(self, inlineAsm, constraints, values, type, isPure, pack):
         raise NotImplementedError("inline_asm not supported in interpreter mode")
 
-    def create_print(self, prefix, hex, values):
+    def create_print(self, prefix, hex, values, isSigned):
+        # NOTE: the `isSigned` variable is not really used here; because Signness is already known
+        # by `values` themselves in python interpreter, thus not really needed here;
+        # it is only used for triton PrintOpToLLVM to correctly construct the format specifier.
         # Interpreter's device_print function has a different format than Triton's device_print
         msg = f"({self.grid_idx[0]}, {self.grid_idx[1]}, {self.grid_idx[2]})"
         if prefix:
```

test/Conversion/tritongpu_to_llvm.mlir

Lines changed: 27 additions & 1 deletion

```diff
@@ -1639,7 +1639,33 @@ module attributes {"triton_gpu.num-ctas" = 1 : i32, "triton_gpu.num-warps" = 4 :
   // CHECK-LABEL: print_ptr
   // CHECK: llvm.call @vprintf(%{{.*}}, %{{.*}}) : (!llvm.ptr, !llvm.ptr) -> i32
   tt.func @print_ptr(%arg0 : tensor<256x!tt.ptr<i32>, #blocked0>) {
-    tt.print "ptr: " {hex = false} : %arg0 : tensor<256x!tt.ptr<i32>, #blocked0>
+    tt.print "ptr: " {hex = false, isSigned = array<i32: 0>} : %arg0 : tensor<256x!tt.ptr<i32>, #blocked0>
+    tt.return
+  }
+}
+
+// -----
+#blocked0 = #triton_gpu.blocked<{sizePerThread = [1], threadsPerWarp = [32], warpsPerCTA = [4], order = [0], CTAsPerCGA = [1], CTASplitNum = [1], CTAOrder = [0]}>
+module attributes {"triton_gpu.num-ctas" = 1 : i32, "triton_gpu.num-warps" = 4 : i32} {
+  // Test that %u format specifier is used if isSigned is false
+  // CHECK: llvm.mlir.global internal constant @printfFormat_0("{{.*}}int32 tensor: %u{{.*}}")
+  // CHECK-LABEL: print_int32_tensor_issigned_off
+  // CHECK: llvm.call @vprintf(%{{.*}}, %{{.*}}) : (!llvm.ptr, !llvm.ptr) -> i32
+  tt.func @print_int32_tensor_issigned_off(%arg0 : i32) {
+    tt.print "int32 tensor: " {hex = false, isSigned = array<i32: 0>} : %arg0 : i32
+    tt.return
+  }
+}
+
+// -----
+#blocked0 = #triton_gpu.blocked<{sizePerThread = [1], threadsPerWarp = [32], warpsPerCTA = [4], order = [0], CTAsPerCGA = [1], CTASplitNum = [1], CTAOrder = [0]}>
+module attributes {"triton_gpu.num-ctas" = 1 : i32, "triton_gpu.num-warps" = 4 : i32} {
+  // Test that %i format specifier is used if isSigned is true
+  // CHECK: llvm.mlir.global internal constant @printfFormat_0("{{.*}}int32 tensor: %i{{.*}}")
+  // CHECK-LABEL: print_int32_tensor_issigned_on
+  // CHECK: llvm.call @vprintf(%{{.*}}, %{{.*}}) : (!llvm.ptr, !llvm.ptr) -> i32
+  tt.func @print_int32_tensor_issigned_on(%arg0 : i32) {
+    tt.print "int32 tensor: " {hex = false, isSigned = array<i32: 1>} : %arg0 : i32
     tt.return
   }
 }
```

test/Triton/ops.mlir

Lines changed: 1 addition & 1 deletion

```diff
@@ -186,7 +186,7 @@ tt.func @dot_ops_infer(%ptr: !tt.ptr<f32>, %v : f32) {
 // CHECK-LABEL: @print_no_arg
 tt.func @print_no_arg(%arg0: !tt.ptr<f32>) {
   // CHECK: tt.print "test"
-  tt.print "test" { hex = false }
+  tt.print "test" { hex = false, isSigned = array<i32: 0>}
   %0 = tt.load %arg0 : !tt.ptr<f32>
   tt.store %arg0, %0 : !tt.ptr<f32>
   tt.return
```

test/TritonGPU/accelerate-matmul.mlir

Lines changed: 1 addition & 1 deletion

```diff
@@ -119,7 +119,7 @@ module attributes {"triton_gpu.target" = "cuda:80", "triton_gpu.num-ctas" = 1 :
     // CHECK: tt.dot {{.*}} -> tensor<2x16x16xf32, #[[MMA1]]>
     %11 = tt.dot %8, %9, %10, inputPrecision = tf32 : tensor<2x16x16xf32, #triton_gpu.dot_op<{opIdx = 0, parent = #blocked3}>> * tensor<2x16x16xf32, #triton_gpu.dot_op<{opIdx = 1, parent = #blocked3}>> -> tensor<2x16x16xf32, #blocked3>
     %12 = triton_gpu.convert_layout %11 : tensor<2x16x16xf32, #blocked3> -> tensor<2x16x16xf32, #blocked>
-    tt.print ": " {hex = false} : %12 : tensor<2x16x16xf32, #blocked>
+    tt.print ": " {hex = false, isSigned = array<i32: 0>} : %12 : tensor<2x16x16xf32, #blocked>
     tt.return
   }
 }
```
