
Commit 243c58e

add RFC link and mention that code is subject to change

1 parent 1d57543

1 file changed: prototype_source/max_autotune_on_CPU_tutorial.rst (3 additions, 1 deletion)
@@ -9,7 +9,8 @@ Prerequisites:
 
 Introduction
 ------------
-``max-autotune`` mode for the Inductor CPU backend in ``torch.compile`` profiles multiple implementations of operations at compile time and selects the best-performing one,
+``max-autotune`` mode for the Inductor CPU backend in ``torch.compile`` (`RFC link <https://github.com/pytorch/pytorch/issues/125683>`_)
+profiles multiple implementations of operations at compile time and selects the best-performing one,
 trading longer compilation times for improved runtime performance. This enhancement is particularly beneficial for GEMM-related operations.
 In the Inductor CPU backend, we’ve introduced a C++ template-based GEMM implementation as an alternative to the ATen-based approach that relies on oneDNN and MKL libraries.
 This is similar to the max-autotune mode on CUDA, where implementations from ATen, Triton, and CUTLASS are considered.
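
For context, enabling the mode this hunk documents looks like the following minimal sketch (not part of the diff; the toy module, feature sizes, and batch shape are illustrative assumptions):

.. code:: python

    import torch

    class M(torch.nn.Module):
        def __init__(self):
            super().__init__()
            self.linear = torch.nn.Linear(1024, 1024, bias=True)

        def forward(self, x):
            return torch.nn.functional.relu(self.linear(x))

    model = M().eval()
    # mode="max-autotune" asks Inductor to profile the candidate implementations
    # (e.g., ATen-based vs. the CPP GEMM template) at compile time and keep the fastest.
    compiled = torch.compile(model, mode="max-autotune")
    with torch.no_grad():
        out = compiled(torch.randn(64, 1024))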
@@ -85,6 +86,7 @@ We could check the generated output code by setting ``export TORCH_LOGS="+output
 When the CPP template is selected, we won't have ``torch.ops.mkldnn._linear_pointwise.default`` (for bfloat16) or ``torch.ops.mkl._mkl_linear.default`` (for float32)
 in the generated code anymore; instead, we'll find a kernel based on the CPP GEMM template, ``cpp_fused__to_copy_relu_1``
 (only part of the code is shown below for simplicity), with the bias and relu epilogues fused inside the CPP GEMM template kernel.
+The generated code differs by CPU architecture and is implementation-specific, so it is subject to change.
 
 .. code:: python
 
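
To reproduce the check this hunk describes, a minimal sketch (not part of the diff; the module and shapes are illustrative assumptions, and the logged kernel name varies by CPU architecture, as the added line notes):

.. code:: python

    # Mirror the tutorial's ``export TORCH_LOGS="+output_code"``; the variable
    # must be set before ``torch`` is imported.
    import os
    os.environ["TORCH_LOGS"] = "+output_code"

    import torch

    model = torch.nn.Linear(1024, 1024).eval().to(torch.bfloat16)
    compiled = torch.compile(model, mode="max-autotune")
    with torch.no_grad():
        compiled(torch.randn(64, 1024, dtype=torch.bfloat16))
    # If the CPP GEMM template wins autotuning, the logged output code contains a
    # cpp_fused_* kernel instead of torch.ops.mkldnn._linear_pointwise.default.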
