Document Triton's alternative to round()

I traced a numerical error in an INT8 quantized matmul to a misunderstanding of how Python's round() behaves. It would therefore be helpful to document Triton's alternative to round(), namely tl.extra.cuda.libdevice.rint(), as mentioned in https://github.com/triton-lang/triton/issues/4449.
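
For context, here is a minimal plain-Python sketch of how common rounding conventions disagree on ties, which is the kind of mismatch that can show up in a quantization path. It does not run Triton. The helper names `round_half_away` and `quantize_int8` are hypothetical and exist only for illustration. It relies on two facts: Python's built-in round() rounds ties to the nearest even integer, and int() truncates toward zero. As far as I understand, CUDA's rint is also documented as rounding ties to even, so it should match the first column rather than the second.

```python
import math


def round_half_away(x: float) -> int:
    """Round to nearest, ties away from zero (the 'schoolbook' rule)."""
    return int(math.copysign(math.floor(abs(x) + 0.5), x))


def quantize_int8(x: float, scale: float, rounder) -> int:
    """Hypothetical INT8 quantizer: scale, round with `rounder`, clamp to [-128, 127]."""
    q = rounder(x / scale)
    return max(-128, min(127, q))


samples = [0.5, 1.5, 2.5, -0.5, -2.5]

half_even = [round(v) for v in samples]            # Python built-in: ties to even
half_away = [round_half_away(v) for v in samples]  # ties away from zero
truncated = [int(v) for v in samples]              # truncation toward zero

for v, e, a, t in zip(samples, half_even, half_away, truncated):
    print(f"{v:5.1f}  half-even={e:2d}  half-away={a:2d}  trunc={t:2d}")

# The same input quantizes differently depending on the rounding rule.
print(quantize_int8(2.5, 1.0, round), quantize_int8(2.5, 1.0, round_half_away))
```

If a reference implementation and a kernel use different rules from this table, tie values such as 2.5 will quantize to different integers, which is enough to produce small but systematic mismatches.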