I traced a numerical error in an INT8 quantized MatMul to a misunderstanding of Python's rounding mechanism. It would therefore be helpful to document Triton's alternative to round(), namely tl.extra.cuda.libdevice.rint(), as mentioned in triton-lang/triton#4449.
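To illustrate the kind of discrepancy involved, here is a minimal plain-Python sketch. It assumes the quantization step divides by a scale and converts the result to an integer. The helper names `quantize_truncate` and `quantize_round` are hypothetical, and the snippet does not use Triton. It only shows how truncation and Python's round-half-to-even `round()` can produce different INT8 values.

```python
# Hypothetical helpers for illustration only; not part of Triton.

def quantize_truncate(x: float, scale: float) -> int:
    """Quantize by truncating toward zero, as Python's int() does."""
    q = int(x / scale)
    return max(-128, min(127, q))  # clamp to the INT8 range


def quantize_round(x: float, scale: float) -> int:
    """Quantize with Python's round(), which rounds ties to the even integer."""
    q = round(x / scale)
    return max(-128, min(127, q))  # clamp to the INT8 range


# Python's round() sends exact ties to the nearest even integer.
ties = [round(v) for v in (0.5, 1.5, 2.5, -0.5, -1.5)]

# Truncation and round-to-nearest can differ by one quantization step.
values = [0.9, 1.5, 2.5, -0.9, -2.75]
truncated = [quantize_truncate(v, 1.0) for v in values]
rounded = [quantize_round(v, 1.0) for v in values]

print("ties:     ", ties)
print("truncated:", truncated)
print("rounded:  ", rounded)
```

A one-step difference per element looks small, but it accumulates across the reduction dimension of a MatMul, which is why the choice of rounding function seems worth documenting explicitly.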