[DRAFT] Better explanation for round() vs round-() in decompose

jammychiou1 · jammychiou1 · commit 8a3644e7fa3c · 2025-12-26T17:46:15.000+08:00
Only the AVX2 one is updated currently. The NEON one will also be
updated if this looks good.

Signed-off-by: jammychiou1 <jammy.chiou1@gmail.com>
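
The claim proved in the comments in the diff below can also be checked by brute force over the whole input range. The following is a minimal sketch, not code from the repository: the helper names round_half_down_div, count_mismatches and inverse_eps are hypothetical, and only the parameters (B, m, s) = (4092, 1025, 22) and (1488, 11275, 24) are taken from the patch. It compares round-(f1' / B) against round(f1' * m / 2^s) with both tie-breaking rules on the right-hand side, and recomputes the check-magic constants as 1 / eps.

```c
#include <assert.h>
#include <stdint.h>

/* round-(x / b): round to nearest, ties rounded down.
 * Uses ceil(x/b - 1/2) = floor((2x + b - 1) / (2b)). */
uint32_t round_half_down_div(uint32_t x, uint32_t b)
{
  return (2 * x + b - 1) / (2 * b);
}

/* Counts inputs 0 <= x < 2^16 for which round(x * m / 2^s) differs from
 * round-(x / b). Both tie-breaking rules are tried for the approximation,
 * since the argument in the comments says the choice should not matter. */
uint32_t count_mismatches(uint32_t b, uint32_t m, unsigned s)
{
  const uint64_t half = 1ull << (s - 1);
  uint32_t bad = 0;
  for (uint32_t x = 0; x < (1u << 16); x++)
  {
    uint64_t prod = (uint64_t)x * m;
    uint32_t ties_up = (uint32_t)((prod + half) >> s);
    uint32_t ties_down = (uint32_t)((prod + half - 1) >> s);
    uint32_t ref = round_half_down_div(x, b);
    if (ties_up != ref || ties_down != ref)
    {
      bad++;
    }
  }
  return bad;
}

/* 1 / eps for eps = 1/b - m/2^s, i.e. (b * 2^s) / (2^s - b * m).
 * The assertion checks that m / 2^s really is an under-approximation. */
uint64_t inverse_eps(uint32_t b, uint32_t m, unsigned s)
{
  uint64_t den_plus = 1ull << s;
  uint64_t den_minus = (uint64_t)b * m;
  assert(den_plus > den_minus);
  return ((uint64_t)b << s) / (den_plus - den_minus);
}
```

Under these assumptions, count_mismatches(4092, 1025, 22) and count_mismatches(1488, 11275, 24) should both return 0, and inverse_eps should return 4290772992 and 1560281088 respectively, matching the two check-magic lines.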
diff --git a/dev/x86_64/src/poly_decompose_32_avx2.c b/dev/x86_64/src/poly_decompose_32_avx2.c
@@ -72,10 +72,32 @@ void mld_poly_decompose_32_avx2(int32_t *a1, int32_t *a0)
      * _mm256_mulhi_epu16() below.
      */
 
+    /* check-magic: 4290772992 == 1 / (1 / 4092 - 1025 / 2**22) */
     /*
      * Compute f1 = round-(f1' / B) ≈ round(f1' * 1025 / 2^22). This is exact
-     * for 0 <= f1' < 2^16. Note that half is rounded down since 1025 / 2^22 ≲
-     * 1 / 4092.
+     * for 0 <= f1' < 2^16.
+     *
+     * To see this, consider the (signed) error f1' * (1 / B - 1025 / 2^22)
+     * between f1' / B and the (under-)approximation f1' * 1025 / 2^22. Because
+     * eps := 1 / B - 1025 / 2^22 is 1 / 4290772992 ≈ 2^(-31.99) < 2^(-31), we
+     * have 0 <= f1' * eps < 2^16 * 2^(-31) = 1 / 2^15 < 1 / B (note that f1' is
+     * non-negative).
+     *
+     * On the other hand, 1 / B is the spacing between the integral multiples
+     * of 1 / B, which include all rounding boundaries n + 0.5 (since B is
+     * even). Hence, if f1' / B is not of the form n + 0.5, then it is at least
+     * 1 / B away from the nearest rounding boundary, so moving from f1' / B to
+     * f1' * 1025 / 2^22 does not affect the rounding result, no matter the
+     * type of rounding used on either side. In particular, we have
+     * round-(f1' / B) = round(f1' * 1025 / 2^22) as claimed.
+     *
+     * As for the remaining case where f1' / B _is_ of the form n + 0.5, because
+     * f1' * 1025 / 2^22 is slightly but strictly below f1' / B = n + 0.5 (note
+     * that f1' and thus the error f1' * eps cannot be 0 here), it is always
+     * rounded down to n. More precisely, we have round-(f1' / B) =
+     * round(f1' * 1025 / 2^22), where the round-down on the LHS is essential,
+     * and on the RHS the type of rounding again does not matter. This concludes
+     * the proof.
      *
      * round(f1' * 1025 / 2^22) is in turn computed in 2 steps as
      * round(floor(f1' * 1025 / 2^16) / 2^6). The mulhi computes f1'' =
diff --git a/dev/x86_64/src/poly_decompose_88_avx2.c b/dev/x86_64/src/poly_decompose_88_avx2.c
@@ -73,10 +73,32 @@ void mld_poly_decompose_88_avx2(int32_t *a1, int32_t *a0)
      * _mm256_mulhi_epu16() below.
      */
 
+    /* check-magic: 1560281088 == 1 / (1 / 1488 - 11275 / 2**24) */
     /*
      * Compute f1 = round-(f1' / B) ≈ round(f1' * 11275 / 2^24). This is exact
-     * for 0 <= f1' < 2^16. Note that half is rounded down since 11275 / 2^24 ≲
-     * 1 / 1488.
+     * for 0 <= f1' < 2^16.
+     *
+     * To see this, consider the (signed) error f1' * (1 / B - 11275 / 2^24)
+     * between f1' / B and the (under-)approximation f1' * 11275 / 2^24. Because
+     * eps := 1 / B - 11275 / 2^24 is 1 / 1560281088 ≈ 2^(-30.54) < 2^(-30), we
+     * have 0 <= f1' * eps < 2^16 * 2^(-30) = 1 / 2^14 < 1 / B (note that f1' is
+     * non-negative).
+     *
+     * On the other hand, 1 / B is the spacing between the integral multiples
+     * of 1 / B, which include all rounding boundaries n + 0.5 (since B is
+     * even). Hence, if f1' / B is not of the form n + 0.5, then it is at least
+     * 1 / B away from the nearest rounding boundary, so moving from f1' / B to
+     * f1' * 11275 / 2^24 does not affect the rounding result, no matter the
+     * type of rounding used on either side. In particular, we have
+     * round-(f1' / B) = round(f1' * 11275 / 2^24) as claimed.
+     *
+     * As for the remaining case where f1' / B _is_ of the form n + 0.5, because
+     * f1' * 11275 / 2^24 is slightly but strictly below f1' / B = n + 0.5 (note
+     * that f1' and thus the error f1' * eps cannot be 0 here), it is always
+     * rounded down to n. More precisely, we have round-(f1' / B) =
+     * round(f1' * 11275 / 2^24), where the round-down on the LHS is essential,
+     * and on the RHS the type of rounding again does not matter. This concludes
+     * the proof.
      *
      * round(f1' * 11275 / 2^24) is in turn computed in 2 steps as
      * round(floor(f1' * 11275 / 2^16) / 2^8). The mulhi computes f1'' =
diff --git a/mldsa/src/native/x86_64/src/poly_decompose_32_avx2.c b/mldsa/src/native/x86_64/src/poly_decompose_32_avx2.c
@@ -72,10 +72,32 @@ void mld_poly_decompose_32_avx2(int32_t *a1, int32_t *a0)
      * _mm256_mulhi_epu16() below.
      */
 
+    /* check-magic: 4290772992 == 1 / (1 / 4092 - 1025 / 2**22) */
     /*
      * Compute f1 = round-(f1' / B) ≈ round(f1' * 1025 / 2^22). This is exact
-     * for 0 <= f1' < 2^16. Note that half is rounded down since 1025 / 2^22 ≲
-     * 1 / 4092.
+     * for 0 <= f1' < 2^16.
+     *
+     * To see this, consider the (signed) error f1' * (1 / B - 1025 / 2^22)
+     * between f1' / B and the (under-)approximation f1' * 1025 / 2^22. Because
+     * eps := 1 / B - 1025 / 2^22 is 1 / 4290772992 ≈ 2^(-31.99) < 2^(-31), we
+     * have 0 <= f1' * eps < 2^16 * 2^(-31) = 1 / 2^15 < 1 / B (note that f1' is
+     * non-negative).
+     *
+     * On the other hand, 1 / B is the spacing between the integral multiples
+     * of 1 / B, which include all rounding boundaries n + 0.5 (since B is
+     * even). Hence, if f1' / B is not of the form n + 0.5, then it is at least
+     * 1 / B away from the nearest rounding boundary, so moving from f1' / B to
+     * f1' * 1025 / 2^22 does not affect the rounding result, no matter the
+     * type of rounding used on either side. In particular, we have
+     * round-(f1' / B) = round(f1' * 1025 / 2^22) as claimed.
+     *
+     * As for the remaining case where f1' / B _is_ of the form n + 0.5, because
+     * f1' * 1025 / 2^22 is slightly but strictly below f1' / B = n + 0.5 (note
+     * that f1' and thus the error f1' * eps cannot be 0 here), it is always
+     * rounded down to n. More precisely, we have round-(f1' / B) =
+     * round(f1' * 1025 / 2^22), where the round-down on the LHS is essential,
+     * and on the RHS the type of rounding again does not matter. This concludes
+     * the proof.
      *
      * round(f1' * 1025 / 2^22) is in turn computed in 2 steps as
      * round(floor(f1' * 1025 / 2^16) / 2^6). The mulhi computes f1'' =
diff --git a/mldsa/src/native/x86_64/src/poly_decompose_88_avx2.c b/mldsa/src/native/x86_64/src/poly_decompose_88_avx2.c
@@ -73,10 +73,32 @@ void mld_poly_decompose_88_avx2(int32_t *a1, int32_t *a0)
      * _mm256_mulhi_epu16() below.
      */
 
+    /* check-magic: 1560281088 == 1 / (1 / 1488 - 11275 / 2**24) */
     /*
      * Compute f1 = round-(f1' / B) ≈ round(f1' * 11275 / 2^24). This is exact
-     * for 0 <= f1' < 2^16. Note that half is rounded down since 11275 / 2^24 ≲
-     * 1 / 1488.
+     * for 0 <= f1' < 2^16.
+     *
+     * To see this, consider the (signed) error f1' * (1 / B - 11275 / 2^24)
+     * between f1' / B and the (under-)approximation f1' * 11275 / 2^24. Because
+     * eps := 1 / B - 11275 / 2^24 is 1 / 1560281088 ≈ 2^(-30.54) < 2^(-30), we
+     * have 0 <= f1' * eps < 2^16 * 2^(-30) = 1 / 2^14 < 1 / B (note that f1' is
+     * non-negative).
+     *
+     * On the other hand, 1 / B is the spacing between the integral multiples
+     * of 1 / B, which include all rounding boundaries n + 0.5 (since B is
+     * even). Hence, if f1' / B is not of the form n + 0.5, then it is at least
+     * 1 / B away from the nearest rounding boundary, so moving from f1' / B to
+     * f1' * 11275 / 2^24 does not affect the rounding result, no matter the
+     * type of rounding used on either side. In particular, we have
+     * round-(f1' / B) = round(f1' * 11275 / 2^24) as claimed.
+     *
+     * As for the remaining case where f1' / B _is_ of the form n + 0.5, because
+     * f1' * 11275 / 2^24 is slightly but strictly below f1' / B = n + 0.5 (note
+     * that f1' and thus the error f1' * eps cannot be 0 here), it is always
+     * rounded down to n. More precisely, we have round-(f1' / B) =
+     * round(f1' * 11275 / 2^24), where the round-down on the LHS is essential,
+     * and on the RHS the type of rounding again does not matter. This concludes
+     * the proof.
      *
      * round(f1' * 11275 / 2^24) is in turn computed in 2 steps as
      * round(floor(f1' * 11275 / 2^16) / 2^8). The mulhi computes f1'' =