
Conversation

Valentine233 (Collaborator)

Description:

  1. Reuse the schema of qscaled_dot_product and extend it to support the FP8 dtype (dispatch sketched below).
  2. Support both the fused attention kernel and the fallback math kernel for FP8 SDPA.
  3. Support pattern matching for FP8 SDPA.
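
For item 1, a minimal sketch of the dtype-based dispatch that reusing the qscaled_dot_product schema implies. Signatures here are heavily simplified assumptions (the real op also carries attention-mask, scale, and quantization arguments), and qscaled_dot_product_dispatch is an illustrative name, not the PR's actual entry point:

#include <ATen/ATen.h>

// Simplified forward declarations; the PR's actual kernels take many more
// arguments (mask, dropout, quantization scales, ...).
at::Tensor int8_sdpa_fused_kernel(const at::Tensor& q, const at::Tensor& k, const at::Tensor& v);
at::Tensor fp8_sdpa_fused_kernel(const at::Tensor& q, const at::Tensor& k, const at::Tensor& v);

// One entry point, routed by input dtype: u8 keeps the existing int8 path,
// while Float8_e4m3fn takes the new FP8 path added in this PR.
at::Tensor qscaled_dot_product_dispatch(
    const at::Tensor& query, const at::Tensor& key, const at::Tensor& value) {
  const auto dtype = query.scalar_type();
  if (dtype == at::ScalarType::Byte) {
    return int8_sdpa_fused_kernel(query, key, value);
  }
  if (dtype == at::ScalarType::Float8_e4m3fn) {
    return fp8_sdpa_fused_kernel(query, key, value);
  }
  TORCH_CHECK(false, "qscaled_dot_product: unsupported dtype ", dtype);
}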


pytorch-bot bot commented Aug 5, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/ao/2689

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 85c0ad7 with merge base 6e9bf26:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@meta-cla bot added the CLA Signed label on Aug 5, 2025
@Valentine233 marked this pull request as draft on August 5, 2025 07:13
@Valentine233 added the topic: not user facing label on Aug 5, 2025
Valentine233 (Collaborator, Author)

@CaoE @jianan-gu Please help review, thanks~

template <typename scalar_t, typename mask_t,
          int64_t q_split_size, int64_t kv_split_size>
inline typename std::enable_if_t<std::is_same_v<scalar_t, at::Float8_e4m3fn>, void>
fp8_sdpa_fused_kernel_impl(

Is it possible to merge the int8 and fp8 implementations?

Valentine233 (Collaborator, Author)

We cannot merge the two implementations, as fp8 uses flash attention while int8 does not.

@@ -157,7 +157,7 @@ def _check_common(
 )
 @config.patch({"freezing": True})
 def _test_sdpa_int8_rewriter(self):
-from torch.export import export_for_training
+from torch.export import export

If this test covers fp8, we'd better rename it.


Also the file name.

Valentine233 (Collaborator, Author)

Thanks, modified.

auto tmp4 = at::vec::clamp(tmp3, vec_min_val, vec_max_val);
_store(out + i, tmp4, size - i);
}
val = vec_tmp_sum.reduce_add();
CaoE commented Aug 22, 2025

Does this function need a NaN guard?

Valentine233 (Collaborator, Author)

We would not get an extremely large value here, because this is part of the safe softmax, where the row max has already been subtracted.
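
For context, a minimal scalar sketch of the safe-softmax property the reply relies on (illustrative only, not the PR's vectorized kernel code):

#include <algorithm>
#include <cmath>
#include <vector>

// Safe softmax: subtract the row max before exponentiating, so every exp()
// argument is <= 0 and every exp() result lies in (0, 1]. The running sum is
// therefore bounded by the row length and cannot overflow to Inf/NaN.
std::vector<float> safe_softmax(const std::vector<float>& row) {
  const float row_max = *std::max_element(row.begin(), row.end());
  std::vector<float> out(row.size());
  float sum = 0.f;
  for (size_t i = 0; i < row.size(); ++i) {
    out[i] = std::exp(row[i] - row_max);  // in (0, 1]
    sum += out[i];
  }
  for (float& v : out) {
    v /= sum;  // sum >= 1, since the max element contributes exp(0) = 1
  }
  return out;
}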

return output.transpose(1, 2);
} else {
#endif // CPU_CAPABILITY_AVX512
std::cout << "int8_sdpa_math_kernel" << std::endl;
CaoE commented Aug 22, 2025

Seems to be an oversight (leftover debug print). Remove this.

Valentine233 (Collaborator, Author)

Thanks, modified.

#ifdef CPU_CAPABILITY_AVX512
if (at::native::cpublas::could_pack(dtype)) {
at::Tensor output = at::empty_like(query, query.options()).transpose(1, 2);
std::cout << "int8_sdpa_fused_kernel" << std::endl;

Seems to be an oversight (leftover debug print) as well.

Valentine233 (Collaborator, Author)

Thanks, modified.

return output.transpose(1, 2);
} else {
#endif // CPU_CAPABILITY_AVX512 && CPUBLAS_BRGEMM_F8F8F32
std::cout << "fp8_sdpa_math_kernel" << std::endl;

Same as above.

Valentine233 (Collaborator, Author)

Thanks, modified.

return sdpa_int8_math_kernel(query, key, value,
if (dtype == at::ScalarType::Byte) {
#ifdef CPU_CAPABILITY_AVX512
if (at::native::cpublas::could_pack(dtype)) {

Do we always need packing on supported platforms? Are there any cases where packing is slower than the plain format?

Valentine233 (Collaborator, Author)

Thanks for the suggestion!
For the cases we care about, for example the landing zone models, we have confirmed that packing is better.
For the general case, we can tune need_pack in the future; a rough sketch of what that could look like follows below.
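
This is purely hypothetical and not code from this PR: need_pack_heuristic and its size threshold are illustrative assumptions about how a future need_pack tunable could sit on top of the existing could_pack() capability check.

#include <ATen/ATen.h>
#include <ATen/native/CPUBlas.h>

// Hypothetical heuristic: use the packed path only when the platform supports
// packing for this dtype and the problem is large enough for the packing
// overhead to pay off. The cut-off below is a placeholder, not a tuned value.
static bool need_pack_heuristic(const at::Tensor& query, at::ScalarType dtype) {
  if (!at::native::cpublas::could_pack(dtype)) {
    return false;
  }
  const int64_t seq_len = query.size(2);   // assumes a [B, H, S, D] layout
  const int64_t head_dim = query.size(3);
  return seq_len * head_dim >= 4096;       // illustrative threshold
}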

// CPUBLAS_BRGEMM_F8F8F32 is defined if FP8 BRGEMM is supported in PyTorch CPUBlas.
if (at::native::cpublas::could_pack(dtype)) {
at::Tensor output = at::empty_like(query, query.options()).transpose(1, 2);
std::cout << "fp8_sdpa_fused_kernel" << std::endl;

Same as above.

Valentine233 (Collaborator, Author)

Thanks, modified.

@@ -1834,6 +2424,43 @@ at::Tensor sdpa_int8_math_kernel(
return output;
}

at::Tensor fp8_sdpa_math_kernel(

Are there any tests for the ref implementation?

Valentine233 (Collaborator, Author)

The ref path is exercised when CPUBLAS_BRGEMM_F8F8F32 is not set; that macro is defined by PyTorch when FP8 BRGEMM is supported in CPUBlas. So whether the fused attention kernel or the math ref kernel is taken depends on the PyTorch version. I have validated both cases locally.
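
A minimal sketch of the compile-time selection being described (fp8_sdpa_kernel is an illustrative wrapper name, and the kernel signatures are simplified; the actual guard in the PR sits inside the AVX512-specific code path):

#include <ATen/ATen.h>
#include <ATen/native/CPUBlas.h>

// Simplified declarations of the PR's FP8 kernels (real signatures differ).
at::Tensor fp8_sdpa_fused_kernel(const at::Tensor& q, const at::Tensor& k, const at::Tensor& v);
at::Tensor fp8_sdpa_math_kernel(const at::Tensor& q, const at::Tensor& k, const at::Tensor& v);

at::Tensor fp8_sdpa_kernel(
    const at::Tensor& query, const at::Tensor& key, const at::Tensor& value) {
#if defined(CPU_CAPABILITY_AVX512) && defined(CPUBLAS_BRGEMM_F8F8F32)
  // Newer PyTorch: FP8 BRGEMM is available, so the fused flash-attention
  // kernel is compiled in and used when packing is supported.
  if (at::native::cpublas::could_pack(query.scalar_type())) {
    return fp8_sdpa_fused_kernel(query, key, value);
  }
#endif
  // Macro not defined (older PyTorch) or packing unsupported:
  // fall back to the math reference kernel.
  return fp8_sdpa_math_kernel(query, key, value);
}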

@Valentine233 requested a review from CaoE on August 25, 2025 07:04
@Valentine233 marked this pull request as ready for review on August 25, 2025 07:46
@Valentine233 requested reviews from jansel, jerryzh168, drisspg and CaoE, and removed the request for CaoE, on August 27, 2025 02:40