Better explanation for Barrett division in decompose (C and AVX2)

jammychiou1 · jammychiou1 · commit d3c60705443e · 2025-12-29T11:02:52.000+08:00
Based on Hanno Becker's proposal, the new comments explain how
round-(f1' / B) can be replaced with a rounding mulhi, regardless of the
type of rounding used in the mulhi. In addition, bounding the
approximation error to be strictly less than 1 / B justifies the
exactness of the Barrett division.

To avoid excessive repetition, we prove the GAMMA2 = (Q-1)/88 case in
the C implementation, remark how the same proof can be adapted to the
GAMMA2 = (Q-1)/32 case, and finally refer to them when explaining the
AVX2 implementation.
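
As a sanity check alongside the proof, the exactness claim can also be
confirmed by brute force over the whole input range. The sketch below is
illustrative only and is not part of this change: the helper names
(round_half_down_div, count_mismatches, check_barrett_exactness) are
hypothetical, and it assumes the reference round-(f / B) can be written as
floor((f + B/2 - 1) / B) for even B and non-negative f.

```c
#include <assert.h>
#include <stdint.h>

/* Reference: round-(f / B), i.e. round to nearest with ties rounded down.
 * Assumes B is even and f >= 0, where this equals floor((f + B/2 - 1) / B). */
static uint32_t round_half_down_div(uint32_t f, uint32_t B)
{
  return (f + B / 2 - 1) / B;
}

/* Count inputs 0 <= f < 2^16 for which round(f * m / 2^k) differs from
 * round-(f / B). Both tie-breaking rules are tried on the approximation
 * side, since the proof says the choice does not matter there. */
static unsigned count_mismatches(uint32_t B, uint32_t m, unsigned k)
{
  unsigned bad = 0;
  uint32_t f;
  for (f = 0; f < (1u << 16); f++)
  {
    uint64_t prod = (uint64_t)f * m;
    uint64_t half = (uint64_t)1 << (k - 1);
    uint32_t ref = round_half_down_div(f, B);
    uint32_t up = (uint32_t)((prod + half) >> k);       /* ties round up */
    uint32_t down = (uint32_t)((prod + half - 1) >> k); /* ties round down */
    if (up != ref || down != ref)
    {
      bad++;
    }
  }
  return bad;
}

/* Hypothetical driver covering the two constants pairs used in the diff. */
void check_barrett_exactness(void)
{
  assert(count_mismatches(1488u, 11275u, 24) == 0); /* GAMMA2 = (Q-1)/88 */
  assert(count_mismatches(4092u, 1025u, 22) == 0);  /* GAMMA2 = (Q-1)/32 */
}
```

Calling check_barrett_exactness() from a test program exercises all 2^16
inputs for each parameter set; it complements the written proof rather than
replacing it.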

Signed-off-by: jammychiou1 <jammy.chiou1@gmail.com>
diff --git a/dev/x86_64/src/poly_decompose_32_avx2.c b/dev/x86_64/src/poly_decompose_32_avx2.c
@@ -74,8 +74,8 @@ void mld_poly_decompose_32_avx2(int32_t *a1, int32_t *a0)
 
     /*
      * Compute f1 = round-(f1' / B) ≈ round(f1' * 1025 / 2^22). This is exact
-     * for 0 <= f1' < 2^16. Note that half is rounded down since 1025 / 2^22 ≲
-     * 1 / 4092.
+     * for 0 <= f1' < 2^16. See mld_decompose() in mldsa/src/rounding.h for the
+     * proof.
      *
      * round(f1' * 1025 / 2^22) is in turn computed in 2 steps as
      * round(floor(f1' * 1025 / 2^16) / 2^6). The mulhi computes f1'' =
diff --git a/dev/x86_64/src/poly_decompose_88_avx2.c b/dev/x86_64/src/poly_decompose_88_avx2.c
@@ -75,8 +75,8 @@ void mld_poly_decompose_88_avx2(int32_t *a1, int32_t *a0)
 
     /*
      * Compute f1 = round-(f1' / B) ≈ round(f1' * 11275 / 2^24). This is exact
-     * for 0 <= f1' < 2^16. Note that half is rounded down since 11275 / 2^24 ≲
-     * 1 / 1488.
+     * for 0 <= f1' < 2^16. See mld_decompose() in mldsa/src/rounding.h for the
+     * proof.
      *
      * round(f1' * 11275 / 2^24) is in turn computed in 2 steps as
      * round(floor(f1' * 11275 / 2^16) / 2^8). The mulhi computes f1'' =
diff --git a/mldsa/src/native/x86_64/src/poly_decompose_32_avx2.c b/mldsa/src/native/x86_64/src/poly_decompose_32_avx2.c
@@ -74,8 +74,8 @@ void mld_poly_decompose_32_avx2(int32_t *a1, int32_t *a0)
 
     /*
      * Compute f1 = round-(f1' / B) ≈ round(f1' * 1025 / 2^22). This is exact
-     * for 0 <= f1' < 2^16. Note that half is rounded down since 1025 / 2^22 ≲
-     * 1 / 4092.
+     * for 0 <= f1' < 2^16. See mld_decompose() in mldsa/src/rounding.h for the
+     * proof.
      *
      * round(f1' * 1025 / 2^22) is in turn computed in 2 steps as
      * round(floor(f1' * 1025 / 2^16) / 2^6). The mulhi computes f1'' =
diff --git a/mldsa/src/native/x86_64/src/poly_decompose_88_avx2.c b/mldsa/src/native/x86_64/src/poly_decompose_88_avx2.c
@@ -75,8 +75,8 @@ void mld_poly_decompose_88_avx2(int32_t *a1, int32_t *a0)
 
     /*
      * Compute f1 = round-(f1' / B) ≈ round(f1' * 11275 / 2^24). This is exact
-     * for 0 <= f1' < 2^16. Note that half is rounded down since 11275 / 2^24 ≲
-     * 1 / 1488.
+     * for 0 <= f1' < 2^16. See mld_decompose() in mldsa/src/rounding.h for the
+     * proof.
      *
      * round(f1' * 11275 / 2^24) is in turn computed in 2 steps as
      * round(floor(f1' * 11275 / 2^16) / 2^8). The mulhi computes f1'' =
diff --git a/mldsa/src/rounding.h b/mldsa/src/rounding.h
@@ -115,10 +115,32 @@ __contract__(
 #if MLD_CONFIG_PARAMETER_SET == 44
   /* check-magic: 1488 == 2 * intdiv(intdiv(MLDSA_Q - 1, 88), 128) */
   /* check-magic: 11275 == floor(2**24 / 1488) */
+  /* check-magic: 1560281088 == 1 / (1 / 1488 - 11275 / 2**24) */
   /*
-   * Compute f1 = round-(f1' / B) ≈ round(f1' * 11275 / 2^24). This is exact
-   * for 0 <= f1' < 2^16. Note that half is rounded down since 11275 / 2^24 ≲
-   * 1 / 1488.
+   * Compute f1 = round-(f1' / B) ≈ round(f1' * 11275 / 2^24). This is exact for
+   * 0 <= f1' < 2^16.
+   *
+   * To see this, consider the (signed) error f1' * (1 / B - 11275 / 2^24)
+   * between f1' / B and the (under-)approximation f1' * 11275 / 2^24. Because
+   * eps := 1 / B - 11275 / 2^24 is 1 / 1560281088 ≈ 2^(-30.54) < 2^(-30), we
+   * have 0 <= f1' * eps < 2^16 * 2^(-30) = 1 / 2^14 < 1 / 2^11 < 1 / B (note
+   * that f1' is non-negative).
+   *
+   * On the other hand, 1 / B is the spacing between the integral multiples
+   * of 1 / B, which include all rounding boundaries n + 0.5 (since B is even).
+   * Hence, if f1' / B is not of the form n + 0.5, then it is at least 1 / B
+   * away from the nearest rounding boundary, so moving from f1' / B to
+   * f1' * 11275 / 2^24 does not affect the rounding result, no matter the type
+   * of rounding used on either side. In particular, we have round-(f1' / B) =
+   * round(f1' * 11275 / 2^24) as claimed.
+   *
+   * As for the remaining case where f1' / B _is_ of the form n + 0.5, because
+   * f1' * 11275 / 2^24 is slightly but strictly below f1' / B = n + 0.5 (note
+   * that f1' and thus the error f1' * eps cannot be 0 here), it is always
+   * rounded down to n. More precisely, we have round-(f1' / B) =
+   * round(f1' * 11275 / 2^24), where the round-down on the LHS is essential,
+   * and on the RHS the type of rounding again does not matter. This concludes
+   * the proof.
    */
   *a1 = (*a1 * 11275 + (1 << 23)) >> 24;
   mld_assert(*a1 >= 0 && *a1 <= 44);
@@ -128,10 +150,13 @@ __contract__(
 #else /* MLD_CONFIG_PARAMETER_SET == 44 */
   /* check-magic: 4092 == 2 * intdiv(intdiv(MLDSA_Q - 1, 32), 128) */
   /* check-magic: 1025 == floor(2**22 / 4092) */
+  /* check-magic: 4290772992 == 1 / (1 / 4092 - 1025 / 2**22) */
   /*
-   * Compute f1 = round-(f1' / B) ≈ round(f1' * 1025 / 2^22). This is exact
-   * for 0 <= f1' < 2^16. Note that half is rounded down since 1025 / 2^22 ≲
-   * 1 / 4092.
+   * Compute f1 = round-(f1' / B) ≈ round(f1' * 1025 / 2^22). This is exact for
+   * 0 <= f1' < 2^16. By the same argument as above, it suffices to show
+   * that f1' * eps < 1 / B, where eps := 1 / B - 1025 / 2^22. Indeed, we have
+   * eps = 1 / 4290772992 ≈ 2^(-31.99) < 2^(-31), therefore f1' * eps <
+   * 2^16 * 2^(-31) = 1 / 2^15 < 1 / 2^12 < 1 / B.
    */
   *a1 = (*a1 * 1025 + (1 << 21)) >> 22;
   mld_assert(*a1 >= 0 && *a1 <= 16);