Eliminate use_hint 32/88 intrinsics #940

Open
willieyz wants to merge 3 commits into main from eliminate-use_hint_32_88-intrinsics

@willieyz
Contributor

@willieyz willieyz commented Feb 3, 2026

We also tried unrolling the loops mld_poly_use_hint_88_avx2_loop and mld_poly_use_hint_32_avx2_loop in both files. However, the benchmarks showed no performance benefit from unrolling, so we decided to keep the current (non-unrolled) version.

  • bench components
    • Δ (%) = (asm − AVX2) / AVX2 × 100

| Component | Implementation | Build | ML-DSA-44 | ML-DSA-65 | ML-DSA-87 | Notes |
|---|---|---|---|---|---|---|
| mld_poly_caddq (avg) | AVX2 intrinsics | no-opt | 821 | 781 | 789 | |
| | x86_64 asm | no-opt | 847 | 786 | 787 | |
| | Δ (%) | no-opt | +3.17% | +0.64% | -0.25% | |
| mld_poly_caddq (avg) | AVX2 intrinsics | opt | 210 | 147 | 143 | |
| | x86_64 asm | opt | 220 | 153 | 155 | |
| | x86_64 asm (unroll) | opt | 273 | 154 | 156 | unroll by 4 |
| | Δ (%) | opt | +4.76% | +4.08% | +8.39% | |
| | Δ (%) (unroll) | opt | +30.00% | +4.76% | +9.09% | unroll by 4 |

  • bench
    • Δ (%) = (asm − AVX2) / AVX2 × 100

| Component | Implementation | Build | ML-DSA-44 | ML-DSA-65 | ML-DSA-87 | Notes |
|---|---|---|---|---|---|---|
| keypair cycles (avg) | AVX2 intrinsics | no-opt | 127436 | 218610 | 360739 | baseline (main) |
| | x86_64 asm | no-opt | 127459 | 217604 | 367118 | |
| | Δ (%) | no-opt | +0.02% | -0.46% | +1.77% | |
| | AVX2 intrinsics | opt | 56955 | 98362 | 157869 | baseline (main) |
| | x86_64 asm | opt | 59747 | 102961 | 165706 | |
| | x86_64 asm (unroll) | opt | 59483 | 104732 | 166654 | |
| | Δ (%) | opt | +4.90% | +4.68% | +4.96% | |
| | Δ (%) (unroll) | opt | +4.44% | +6.48% | +5.56% | unroll by 4 |
| sign cycles (avg) | AVX2 intrinsics | no-opt | 451922 | 756003 | 958151 | baseline (main) |
| | x86_64 asm | no-opt | 452833 | 752512 | 974497 | |
| | Δ (%) | no-opt | +0.20% | -0.46% | +1.71% | |
| | AVX2 intrinsics | opt | 170370 | 281545 | 347924 | baseline (main) |
| | x86_64 asm | opt | 178564 | 294843 | 362677 | |
| | x86_64 asm (unroll) | opt | 177251 | 300667 | 366158 | |
| | Δ (%) | opt | +4.81% | +4.72% | +4.24% | |
| | Δ (%) (unroll) | opt | +4.04% | +6.79% | +5.24% | unroll by 4 |
| verify cycles (avg) | AVX2 intrinsics | no-opt | 134113 | 220671 | 363234 | baseline (main) |
| | x86_64 asm | no-opt | 134633 | 220015 | 369763 | |
| | Δ (%) | no-opt | +0.39% | -0.30% | +1.80% | |
| | AVX2 intrinsics | opt | 60234 | 98904 | 156281 | baseline (main) |
| | x86_64 asm | opt | 63140 | 103682 | 164376 | |
| | x86_64 asm (unroll) | opt | 62822 | 105719 | 164028 | |
| | Δ (%) | opt | +4.82% | +4.83% | +5.18% | |
| | Δ (%) (unroll) | opt | +4.30% | +6.89% | +4.96% | unroll by 4 |
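
For readers who want to re-derive the Δ (%) rows, the formula stated above can be sketched in C as follows. The function name `delta_percent` is hypothetical and not part of the PR; the sketch only assumes that both inputs are average cycle counts taken from the same build (opt or no-opt).

```c
#include <assert.h>

/* Relative change of the asm result against the AVX2 intrinsics baseline,
 * in percent: (asm - AVX2) / AVX2 * 100.
 * A positive value means the asm version needed more cycles. */
static double delta_percent(double asm_cycles, double avx2_cycles)
{
  assert(avx2_cycles > 0.0); /* the baseline must be a positive cycle count */
  return (asm_cycles - avx2_cycles) / avx2_cycles * 100.0;
}
```

As a check against the table, `delta_percent(59747, 56955)` for the ML-DSA-44 keypair opt row evaluates to roughly 4.90, which matches the reported +4.90%.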

@oqs-bot
Contributor

oqs-bot commented Feb 3, 2026

CBMC Results (ML-DSA-87)

Full Results (175 proofs)
Proof Status Current Previous Change
**TOTAL** 2632s 2449s +7.5%
sign_verify_internal 375s 353s +6%
mld_attempt_signature_generation 248s 227s +9%
polyvecl_pointwise_acc_montgomery_c 196s 165s +19%
polyvec_matrix_expand 163s 153s +7%
rej_uniform_native 155s 139s +12%
poly_pointwise_montgomery_c 154s 128s +20%
mld_invntt_layer 125s 114s +10%
polyvec_matrix_expand_serial 109s 110s -1%
mld_ct_memcmp 89s 74s +20%
sign_signature_internal 50s 46s +9%
mld_ntt_layer 46s 44s +5%
keccak_squeezeblocks_x4 44s 42s +5%
mld_compute_t0_t1_tr_from_sk_components 24s 25s -4%
polymat_permute_bitrev_to_custom 24s 24s +0%
rej_uniform 22s 21s +5%
fqmul 20s 18s +11%
poly_chknorm_c 20s 17s +18%
poly_uniform_4x 20s 17s +18%
rej_uniform_c 19s 16s +19%
poly_uniform_eta_4x 17s 17s +0%
polyveck_add 16s 13s +23%
polyeta_unpack 15s 13s +15%
polyt0_unpack 15s 17s -12%
polyvec_matrix_pointwise_montgomery 15s 12s +25%
polyveck_power2round 15s 14s +7%
keccakf1600x4_permute_native 14s 12s +17%
mld_ntt_butterfly_block 13s 13s +0%
polyveck_chknorm 12s 6s +100%
sign_keypair_internal 12s 6s +100%
keccakf1600_permute 11s 7s +57%
sign_pk_from_sk 11s 9s +22%
poly_invntt_tomont_c 10s 9s +11%
keccak_absorb_once_x4 9s 10s -10%
mld_check_pct 9s 7s +29%
mld_sample_s1_s2_serial 9s 6s +50%
poly_decompose_c 9s 7s +29%
polyveck_reduce 9s 13s -31%
polyveck_use_hint 9s 9s +0%
keccakf1600_permute_native 8s 10s -20%
polyveck_caddq 8s 7s +14%
polyveck_invntt_tomont 8s 10s -20%
polyveck_sub 8s 7s +14%
polyvecl_ntt 8s 11s -27%
sign 8s 7s +14%
keccak_absorb 7s 7s +0%
mld_compute_pack_z 7s 7s +0%
mld_polyvecl_permute_bitrev_to_custom_native 7s 8s -12%
mld_prepare_domain_separation_prefix 7s 3s +133%
polyveck_decompose 7s 6s +17%
polyveck_ntt 7s 8s -12%
polyveck_pointwise_poly_montgomery 7s 8s -12%
mld_sample_s1_s2 6s 6s +0%
poly_add 6s 4s +50%
poly_challenge 6s 5s +20%
poly_uniform_gamma1_4x 6s 6s +0%
polyveck_shiftl 6s 6s +0%
unpack_hints 6s 4s +50%
intt_native_x86_64 5s 5s +0%
mld_h 5s 2s +150%
poly_invntt_tomont_native 5s 2s +150%
poly_ntt 5s 3s +67%
poly_use_hint_native 5s 4s +25%
polyveck_pack_t0 5s 2s +150%
polyvecl_uniform_gamma1_serial 5s 7s -29%
polyz_unpack_native 5s 4s +25%
shake256x4_squeezeblocks 5s 1s +400%
use_hint 5s 3s +67%
keccak_squeeze 4s 4s +0%
keccakf1600_extract_bytes (big endian) 4s 3s +33%
keccakf1600_xor_bytes 4s 2s +100%
mld_ct_cmask_nonzero_u8 4s 3s +33%
pack_pk 4s 4s +0%
pack_sig_c_h 4s 3s +33%
pack_sig_z 4s 3s +33%
pack_sk 4s 3s +33%
poly_caddq 4s 2s +100%
poly_caddq_c 4s 3s +33%
poly_caddq_native 4s 2s +100%
poly_chknorm_native 4s 2s +100%
poly_decompose_native 4s 5s -20%
poly_make_hint 4s 2s +100%
poly_ntt_native 4s 6s -33%
poly_reduce 4s 4s +0%
poly_sub 4s 3s +33%
poly_uniform 4s 6s -33%
poly_uniform_eta 4s 4s +0%
polyt1_pack 4s 2s +100%
polyt1_unpack 4s 3s +33%
polyveck_make_hint 4s 5s -20%
polyvecl_chknorm 4s 4s +0%
polyvecl_pointwise_acc_montgomery 4s 4s +0%
polyvecl_uniform_gamma1 4s 4s +0%
rej_eta_native 4s 5s -20%
shake128_absorb 4s 2s +100%
sign_keypair 4s 3s +33%
sign_signature 4s 5s -20%
sign_signature_pre_hash_shake256 4s 4s +0%
sign_verify_pre_hash_internal 4s 5s -20%
sys_check_capability 4s 3s +33%
unpack_sk 4s 6s -33%
caddq 3s 4s -25%
fqscale 3s 5s -40%
keccak_finalize 3s 2s +50%
keccakf1600x4_extract_bytes 3s 2s +50%
mld_ct_abs_i32 3s 2s +50%
mld_ct_cmask_neg_i32 3s 1s +200%
mld_ct_cmask_nonzero_u32 3s 4s -25%
mld_ct_get_optblocker_i64 3s 3s +0%
mld_ct_get_optblocker_u8 3s 2s +50%
mld_keccakf1600_extract_bytes 3s 1s +200%
montgomery_reduce 3s 3s +0%
poly_caddq_native_aarch64 3s 3s +0%
poly_invntt_tomont 3s 2s +50%
poly_ntt_c 3s 1s +200%
poly_pointwise_montgomery_native 3s 3s +0%
poly_power2round 3s 3s +0%
poly_shiftl 3s 2s +50%
poly_uniform_gamma1 3s 3s +0%
polyt0_pack 3s 5s -40%
polyveck_pack_eta 3s 2s +50%
polyveck_unpack_t0 3s 6s -50%
polyvecl_permute_bitrev_to_custom 3s 2s +50%
polyvecl_unpack_eta 3s 3s +0%
polyz_pack 3s 2s +50%
polyz_unpack 3s 3s +0%
polyz_unpack_c 3s 5s -40%
power2round 3s 4s -25%
rej_eta_c 3s 4s -25%
shake128_init 3s 2s +50%
shake128_release 3s 3s +0%
shake128_squeeze 3s 5s -40%
shake256 3s 4s -25%
shake256_absorb 3s 3s +0%
shake256x4_absorb_once 3s 2s +50%
sign_open 3s 6s -50%
sign_signature_pre_hash_internal 3s 5s -40%
sign_verify 3s 2s +50%
sign_verify_extmu 3s 3s +0%
sign_verify_pre_hash_shake256 3s 4s -25%
unpack_pk 3s 6s -50%
decompose 2s 4s -50%
keccak_init 2s 3s -33%
keccakf1600x4_permute 2s 2s +0%
make_hint 2s 5s -60%
mld_ct_get_optblocker_u32 2s 2s +0%
mld_ct_sel_int32 2s 2s +0%
mld_value_barrier_i64 2s 3s -33%
mld_value_barrier_u8 2s 2s +0%
ntt_native_x86_64 2s 4s -50%
poly_chknorm 2s 2s +0%
poly_decompose 2s 3s -33%
poly_pointwise_montgomery 2s 4s -50%
poly_use_hint 2s 3s -33%
poly_use_hint_c 2s 4s -50%
polyeta_pack 2s 4s -50%
polyveck_pack_w1 2s 4s -50%
polyvecl_pack_eta 2s 3s -33%
polyvecl_pointwise_acc_montgomery_native 2s 3s -33%
polyvecl_unpack_z 2s 3s -33%
polyw1_pack 2s 2s +0%
rej_eta 2s 2s +0%
shake128_finalize 2s 3s -33%
shake128x4_absorb_once 2s 4s -50%
shake128x4_squeezeblocks 2s 2s +0%
shake256_init 2s 2s +0%
shake256_release 2s 3s -33%
shake256_squeeze 2s 2s +0%
sign_signature_extmu 2s 5s -60%
unpack_sig 2s 5s -60%
keccakf1600_xor_bytes (big endian) 1s 3s -67%
keccakf1600x4_xor_bytes 1s 2s -50%
mld_value_barrier_u32 1s 2s -50%
polyveck_unpack_eta 1s 3s -67%
reduce32 1s 4s -75%
shake256_finalize 1s 3s -67%

@oqs-bot
Contributor

oqs-bot commented Feb 3, 2026

CBMC Results (ML-DSA-44)

Full Results (175 proofs)
Proof Status Current Previous Change
**TOTAL** 2187s 2055s +6.4%
sign_verify_internal 264s 254s +4%
mld_attempt_signature_generation 233s 221s +5%
polyvecl_pointwise_acc_montgomery_c 233s 208s +12%
rej_uniform_native 150s 144s +4%
poly_pointwise_montgomery_c 148s 143s +3%
mld_ct_memcmp 88s 82s +7%
mld_invntt_layer 53s 50s +6%
sign_signature_internal 48s 45s +7%
mld_ntt_layer 47s 44s +7%
keccak_squeezeblocks_x4 45s 44s +2%
poly_invntt_tomont_c 43s 39s +10%
rej_uniform 23s 20s +15%
rej_uniform_c 22s 20s +10%
fqmul 18s 19s -5%
poly_uniform_eta_4x 18s 19s -5%
polymat_permute_bitrev_to_custom 17s 16s +6%
mld_polyvecl_permute_bitrev_to_custom_native 15s 14s +7%
poly_chknorm_c 15s 12s +25%
poly_uniform_4x 15s 13s +15%
polyeta_unpack 15s 12s +25%
polyt0_unpack 15s 15s +0%
mld_ntt_butterfly_block 14s 13s +8%
polyvec_matrix_expand 14s 16s -12%
keccakf1600x4_permute_native 13s 14s -7%
mld_compute_t0_t1_tr_from_sk_components 13s 15s -13%
polyz_unpack_c 11s 12s -8%
keccakf1600_permute_native 10s 6s +67%
keccak_absorb_once_x4 9s 9s +0%
keccakf1600_permute 9s 8s +12%
mld_check_pct 9s 6s +50%
polyveck_caddq 9s 8s +12%
keccak_absorb 8s 8s +0%
shake128x4_absorb_once 8s 4s +100%
polyvec_matrix_pointwise_montgomery 7s 6s +17%
polyveck_add 7s 7s +0%
polyveck_ntt 7s 4s +75%
rej_eta_c 7s 4s +75%
sign 7s 5s +40%
decompose 6s 3s +100%
mld_sample_s1_s2_serial 6s 3s +100%
poly_uniform 6s 5s +20%
poly_uniform_eta 6s 5s +20%
poly_uniform_gamma1 6s 4s +50%
poly_use_hint_c 6s 5s +20%
polyvec_matrix_expand_serial 6s 7s -14%
polyveck_pointwise_poly_montgomery 6s 5s +20%
polyveck_reduce 6s 5s +20%
polyveck_use_hint 6s 5s +20%
polyvecl_ntt 6s 6s +0%
sign_keypair 6s 2s +200%
sign_keypair_internal 6s 4s +50%
sign_pk_from_sk 6s 6s +0%
sign_verify_pre_hash_internal 6s 4s +50%
sign_verify_pre_hash_shake256 6s 3s +100%
intt_native_x86_64 5s 3s +67%
mld_compute_pack_z 5s 6s -17%
mld_ct_cmask_nonzero_u32 5s 2s +150%
mld_ct_get_optblocker_u32 5s 1s +400%
mld_ct_sel_int32 5s 4s +25%
mld_keccakf1600_extract_bytes 5s 6s -17%
poly_chknorm_native 5s 6s -17%
poly_ntt_c 5s 2s +150%
poly_pointwise_montgomery 5s 2s +150%
poly_sub 5s 3s +67%
polyveck_chknorm 5s 3s +67%
polyveck_decompose 5s 7s -29%
polyveck_power2round 5s 6s -17%
polyveck_shiftl 5s 5s +0%
polyvecl_chknorm 5s 6s -17%
polyvecl_uniform_gamma1 5s 2s +150%
polyw1_pack 5s 2s +150%
polyz_unpack_native 5s 3s +67%
shake128x4_squeezeblocks 5s 2s +150%
shake256_init 5s 3s +67%
sign_open 5s 6s -17%
sign_signature 5s 6s -17%
sign_signature_extmu 5s 7s -29%
sign_signature_pre_hash_internal 5s 4s +25%
unpack_hints 5s 5s +0%
keccak_squeeze 4s 5s -20%
mld_ct_abs_i32 4s 2s +100%
mld_ct_cmask_nonzero_u8 4s 4s +0%
mld_h 4s 6s -33%
mld_value_barrier_u32 4s 4s +0%
ntt_native_x86_64 4s 3s +33%
poly_caddq_native 4s 2s +100%
poly_caddq_native_aarch64 4s 3s +33%
poly_decompose_c 4s 5s -20%
poly_decompose_native 4s 3s +33%
poly_invntt_tomont_native 4s 4s +0%
poly_uniform_gamma1_4x 4s 4s +0%
polyveck_invntt_tomont 4s 6s -33%
polyveck_make_hint 4s 4s +0%
polyveck_unpack_t0 4s 4s +0%
polyvecl_uniform_gamma1_serial 4s 3s +33%
polyz_unpack 4s 5s -20%
rej_eta_native 4s 4s +0%
shake128_finalize 4s 2s +100%
shake128_init 4s 2s +100%
shake128_release 4s 3s +33%
shake256x4_squeezeblocks 4s 2s +100%
sign_signature_pre_hash_shake256 4s 5s -20%
sign_verify_extmu 4s 2s +100%
unpack_sk 4s 4s +0%
use_hint 4s 2s +100%
fqscale 3s 1s +200%
keccakf1600_xor_bytes 3s 2s +50%
keccakf1600x4_extract_bytes 3s 3s +0%
make_hint 3s 3s +0%
mld_prepare_domain_separation_prefix 3s 5s -40%
mld_sample_s1_s2 3s 4s -25%
montgomery_reduce 3s 2s +50%
pack_pk 3s 3s +0%
pack_sig_c_h 3s 4s -25%
pack_sig_z 3s 2s +50%
poly_add 3s 4s -25%
poly_caddq 3s 3s +0%
poly_caddq_c 3s 3s +0%
poly_challenge 3s 3s +0%
poly_chknorm 3s 4s -25%
poly_decompose 3s 4s -25%
poly_invntt_tomont 3s 2s +50%
poly_make_hint 3s 4s -25%
poly_ntt 3s 3s +0%
poly_ntt_native 3s 2s +50%
poly_pointwise_montgomery_native 3s 3s +0%
poly_power2round 3s 5s -40%
poly_reduce 3s 4s -25%
poly_use_hint 3s 2s +50%
poly_use_hint_native 3s 3s +0%
polyveck_pack_eta 3s 3s +0%
polyveck_pack_w1 3s 4s -25%
polyveck_sub 3s 4s -25%
polyveck_unpack_eta 3s 4s -25%
polyvecl_pointwise_acc_montgomery 3s 4s -25%
polyvecl_pointwise_acc_montgomery_native 3s 3s +0%
polyvecl_unpack_eta 3s 2s +50%
reduce32 3s 3s +0%
shake128_absorb 3s 1s +200%
shake256 3s 3s +0%
shake256_release 3s 3s +0%
sign_verify 3s 3s +0%
sys_check_capability 3s 4s -25%
unpack_pk 3s 3s +0%
unpack_sig 3s 3s +0%
caddq 2s 3s -33%
keccak_finalize 2s 4s -50%
keccak_init 2s 3s -33%
keccakf1600_extract_bytes (big endian) 2s 3s -33%
keccakf1600_xor_bytes (big endian) 2s 2s +0%
keccakf1600x4_xor_bytes 2s 2s +0%
mld_ct_get_optblocker_i64 2s 2s +0%
mld_ct_get_optblocker_u8 2s 1s +100%
mld_value_barrier_i64 2s 4s -50%
mld_value_barrier_u8 2s 1s +100%
pack_sk 2s 3s -33%
poly_shiftl 2s 4s -50%
polyeta_pack 2s 3s -33%
polyt0_pack 2s 4s -50%
polyt1_pack 2s 1s +100%
polyt1_unpack 2s 4s -50%
polyveck_pack_t0 2s 3s -33%
polyvecl_pack_eta 2s 3s -33%
polyvecl_permute_bitrev_to_custom 2s 2s +0%
polyvecl_unpack_z 2s 5s -60%
polyz_pack 2s 2s +0%
power2round 2s 2s +0%
shake256_absorb 2s 2s +0%
shake256_finalize 2s 4s -50%
shake256_squeeze 2s 5s -60%
keccakf1600x4_permute 1s 3s -67%
mld_ct_cmask_neg_i32 1s 2s -50%
rej_eta 1s 1s +0%
shake128_squeeze 1s 2s -50%
shake256x4_absorb_once 1s 4s -75%

@oqs-bot
Contributor

oqs-bot commented Feb 3, 2026

CBMC Results (ML-DSA-65)

⚠️ Attention Required

Proof Status Current Previous Change
poly_uniform_4x ⚠️ 20s 13s +54%
Full Results (175 proofs)
Proof Status Current Previous Change
**TOTAL** 2582s 2286s +12.9%
polyvecl_pointwise_acc_montgomery_c 298s 226s +32%
mld_attempt_signature_generation 224s 197s +14%
sign_verify_internal 199s 177s +12%
rej_uniform_native 169s 144s +17%
polyvec_matrix_expand 166s 145s +14%
poly_pointwise_montgomery_c 165s 138s +20%
mld_invntt_layer 139s 117s +19%
mld_ct_memcmp 94s 79s +19%
polyvec_matrix_expand_serial 72s 65s +11%
sign_signature_internal 55s 50s +10%
mld_ntt_layer 51s 44s +16%
keccak_squeezeblocks_x4 45s 42s +7%
mld_compute_t0_t1_tr_from_sk_components 27s 27s +0%
rej_uniform 24s 21s +14%
polymat_permute_bitrev_to_custom 22s 18s +22%
rej_uniform_c 21s 19s +11%
fqmul 20s 18s +11%
poly_uniform_4x ⚠️ 20s 13s +54%
polyveck_decompose 20s 16s +25%
poly_uniform_eta_4x 19s 17s +12%
polyvec_matrix_pointwise_montgomery 18s 14s +29%
poly_chknorm_c 17s 16s +6%
polyt0_unpack 16s 17s -6%
mld_ntt_butterfly_block 15s 13s +15%
mld_polyvecl_permute_bitrev_to_custom_native 14s 14s +0%
polyveck_use_hint 14s 14s +0%
keccakf1600x4_permute_native 13s 14s -7%
mld_check_pct 12s 9s +33%
poly_invntt_tomont_c 12s 8s +50%
sign 11s 9s +22%
keccak_absorb_once_x4 10s 10s +0%
polyveck_add 10s 9s +11%
polyveck_caddq 10s 8s +25%
polyveck_ntt 10s 6s +67%
polyveck_reduce 10s 9s +11%
poly_decompose_c 9s 7s +29%
polyveck_invntt_tomont 9s 8s +12%
polyveck_power2round 9s 11s -18%
polyveck_sub 9s 7s +29%
polyvecl_ntt 9s 8s +12%
keccak_absorb 8s 5s +60%
keccakf1600_permute 8s 9s -11%
polyeta_unpack 8s 7s +14%
sign_keypair_internal 8s 6s +33%
keccakf1600_permute_native 7s 9s -22%
mld_compute_pack_z 7s 7s +0%
mld_sample_s1_s2_serial 7s 6s +17%
polyveck_make_hint 7s 5s +40%
polyveck_shiftl 7s 8s -12%
keccak_finalize 6s 2s +200%
polyveck_unpack_eta 6s 4s +50%
polyvecl_chknorm 6s 5s +20%
polyvecl_pointwise_acc_montgomery_native 6s 5s +20%
sign_pk_from_sk 6s 8s -25%
unpack_hints 6s 5s +20%
mld_sample_s1_s2 5s 4s +25%
pack_sk 5s 2s +150%
poly_add 5s 4s +25%
poly_caddq 5s 3s +67%
poly_caddq_native 5s 4s +25%
poly_challenge 5s 4s +25%
poly_make_hint 5s 3s +67%
poly_pointwise_montgomery 5s 5s +0%
poly_uniform 5s 4s +25%
poly_uniform_eta 5s 3s +67%
poly_use_hint_c 5s 7s -29%
polyveck_pointwise_poly_montgomery 5s 6s -17%
polyvecl_pointwise_acc_montgomery 5s 5s +0%
rej_eta 5s 4s +25%
shake256_absorb 5s 5s +0%
sign_open 5s 5s +0%
sign_signature 5s 4s +25%
sign_signature_pre_hash_internal 5s 6s -17%
sign_signature_pre_hash_shake256 5s 4s +25%
sign_verify_extmu 5s 4s +25%
unpack_sk 5s 5s +0%
decompose 4s 5s -20%
keccakf1600x4_permute 4s 1s +300%
mld_ct_get_optblocker_u8 4s 2s +100%
mld_h 4s 3s +33%
montgomery_reduce 4s 2s +100%
ntt_native_x86_64 4s 3s +33%
poly_caddq_native_aarch64 4s 6s -33%
poly_chknorm_native 4s 4s +0%
poly_invntt_tomont 4s 6s -33%
poly_ntt 4s 5s -20%
poly_power2round 4s 5s -20%
poly_use_hint_native 4s 4s +0%
polyt1_unpack 4s 6s -33%
polyveck_chknorm 4s 5s -20%
polyveck_unpack_t0 4s 3s +33%
polyvecl_permute_bitrev_to_custom 4s 3s +33%
polyvecl_uniform_gamma1 4s 5s -20%
polyvecl_unpack_eta 4s 5s -20%
polyz_unpack_c 4s 5s -20%
power2round 4s 2s +100%
rej_eta_c 4s 3s +33%
shake128_init 4s 1s +300%
shake256 4s 2s +100%
sign_keypair 4s 3s +33%
sign_verify 4s 4s +0%
sign_verify_pre_hash_internal 4s 3s +33%
unpack_pk 4s 3s +33%
use_hint 4s 2s +100%
caddq 3s 3s +0%
fqscale 3s 4s -25%
intt_native_x86_64 3s 3s +0%
keccak_init 3s 3s +0%
keccakf1600_extract_bytes (big endian) 3s 3s +0%
make_hint 3s 3s +0%
mld_ct_abs_i32 3s 4s -25%
mld_ct_cmask_nonzero_u8 3s 2s +50%
mld_ct_get_optblocker_u32 3s 4s -25%
mld_keccakf1600_extract_bytes 3s 2s +50%
mld_prepare_domain_separation_prefix 3s 5s -40%
mld_value_barrier_u32 3s 3s +0%
pack_sig_c_h 3s 2s +50%
pack_sig_z 3s 2s +50%
poly_caddq_c 3s 3s +0%
poly_chknorm 3s 3s +0%
poly_decompose_native 3s 3s +0%
poly_ntt_native 3s 3s +0%
poly_uniform_gamma1_4x 3s 6s -50%
poly_use_hint 3s 3s +0%
polyeta_pack 3s 2s +50%
polyt0_pack 3s 4s -25%
polyt1_pack 3s 2s +50%
polyveck_pack_t0 3s 3s +0%
polyveck_pack_w1 3s 6s -50%
polyvecl_uniform_gamma1_serial 3s 4s -25%
polyvecl_unpack_z 3s 2s +50%
polyz_unpack_native 3s 3s +0%
reduce32 3s 3s +0%
rej_eta_native 3s 4s -25%
shake128x4_absorb_once 3s 4s -25%
shake256_finalize 3s 5s -40%
shake256_init 3s 2s +50%
shake256_release 3s 1s +200%
shake256_squeeze 3s 2s +50%
sign_signature_extmu 3s 4s -25%
sign_verify_pre_hash_shake256 3s 6s -50%
unpack_sig 3s 4s -25%
keccak_squeeze 2s 2s +0%
keccakf1600_xor_bytes 2s 3s -33%
keccakf1600x4_extract_bytes 2s 1s +100%
keccakf1600x4_xor_bytes 2s 2s +0%
mld_ct_cmask_neg_i32 2s 2s +0%
mld_ct_cmask_nonzero_u32 2s 2s +0%
mld_ct_get_optblocker_i64 2s 1s +100%
mld_value_barrier_i64 2s 2s +0%
mld_value_barrier_u8 2s 3s -33%
pack_pk 2s 3s -33%
poly_decompose 2s 3s -33%
poly_invntt_tomont_native 2s 7s -71%
poly_ntt_c 2s 2s +0%
poly_pointwise_montgomery_native 2s 2s +0%
poly_reduce 2s 5s -60%
poly_sub 2s 5s -60%
poly_uniform_gamma1 2s 5s -60%
polyveck_pack_eta 2s 4s -50%
polyvecl_pack_eta 2s 4s -50%
polyz_unpack 2s 2s +0%
shake128_absorb 2s 2s +0%
shake128_finalize 2s 3s -33%
shake128_release 2s 3s -33%
shake128_squeeze 2s 2s +0%
shake128x4_squeezeblocks 2s 1s +100%
shake256x4_absorb_once 2s 4s -50%
shake256x4_squeezeblocks 2s 3s -33%
sys_check_capability 2s 3s -33%
keccakf1600_xor_bytes (big endian) 1s 3s -67%
mld_ct_sel_int32 1s 3s -67%
poly_shiftl 1s 2s -50%
polyw1_pack 1s 1s +0%
polyz_pack 1s 4s -75%

@willieyz willieyz force-pushed the eliminate-use_hint_32_88-intrinsics branch 9 times, most recently from 1ea9d5f to 8a19e9a on February 5, 2026 06:05
@willieyz willieyz marked this pull request as ready for review February 5, 2026 06:39
@willieyz willieyz requested a review from a team as a code owner February 5, 2026 06:39
@willieyz willieyz marked this pull request as draft February 5, 2026 07:19
@github-actions github-actions bot left a comment

Mac Mini (M1, 2020) benchmarks (opt)

Details
Benchmark suite Current: 8a19e9a Previous: 41da557 Ratio
ML-DSA-44 keypair 46205 cycles 46203 cycles 1.00
ML-DSA-44 sign 131278 cycles 131278 cycles 1
ML-DSA-44 verify 47765 cycles 47768 cycles 1.00
ML-DSA-65 keypair 81014 cycles 81024 cycles 1.00
ML-DSA-65 sign 215785 cycles 215787 cycles 1.00
ML-DSA-65 verify 80057 cycles 80052 cycles 1.00
ML-DSA-87 keypair 132158 cycles 132151 cycles 1.00
ML-DSA-87 sign 276862 cycles 276816 cycles 1.00
ML-DSA-87 verify 130418 cycles 130384 cycles 1.00

This comment was automatically generated by workflow using github-action-benchmark.

@github-actions github-actions bot left a comment

Mac Mini (M1, 2020) benchmarks (no-opt)

Details
Benchmark suite Current: 8a19e9a Previous: 41da557 Ratio
ML-DSA-44 keypair 114213 cycles 114155 cycles 1.00
ML-DSA-44 sign 418158 cycles 417994 cycles 1.00
ML-DSA-44 verify 122319 cycles 122262 cycles 1.00
ML-DSA-65 keypair 195508 cycles 195499 cycles 1.00
ML-DSA-65 sign 682497 cycles 682470 cycles 1.00
ML-DSA-65 verify 197760 cycles 197741 cycles 1.00
ML-DSA-87 keypair 322642 cycles 322656 cycles 1.00
ML-DSA-87 sign 864585 cycles 864584 cycles 1.00
ML-DSA-87 verify 328628 cycles 328653 cycles 1.00

This comment was automatically generated by workflow using github-action-benchmark.

Contributor

@oqs-bot oqs-bot left a comment

Intel Xeon 4th gen (c7i)

Details
Benchmark suite Current: 8a19e9a Previous: 41da557 Ratio
ML-DSA-44 keypair 34677 cycles 34696 cycles 1.00
ML-DSA-44 sign 120151 cycles 120195 cycles 1.00
ML-DSA-44 verify 38151 cycles 38145 cycles 1.00
ML-DSA-65 keypair 61275 cycles 60582 cycles 1.01
ML-DSA-65 sign 202094 cycles 200476 cycles 1.01
ML-DSA-65 verify 62940 cycles 62563 cycles 1.01
ML-DSA-87 keypair 93525 cycles 94602 cycles 0.99
ML-DSA-87 sign 236210 cycles 240494 cycles 0.98
ML-DSA-87 verify 95587 cycles 95761 cycles 1.00

This comment was automatically generated by workflow using github-action-benchmark.

Contributor

@oqs-bot oqs-bot left a comment

Intel Xeon 4th gen (c7i) (no-opt)

Details
Benchmark suite Current: 8a19e9a Previous: 41da557 Ratio
ML-DSA-44 keypair 93726 cycles 93889 cycles 1.00
ML-DSA-44 sign 333512 cycles 333450 cycles 1.00
ML-DSA-44 verify 99955 cycles 99851 cycles 1.00
ML-DSA-65 keypair 160065 cycles 160390 cycles 1.00
ML-DSA-65 sign 545794 cycles 545908 cycles 1.00
ML-DSA-65 verify 160881 cycles 160887 cycles 1.00
ML-DSA-87 keypair 267728 cycles 267405 cycles 1.00
ML-DSA-87 sign 707504 cycles 707235 cycles 1.00
ML-DSA-87 verify 270918 cycles 269967 cycles 1.00

This comment was automatically generated by workflow using github-action-benchmark.

@github-actions github-actions bot left a comment

Arm Cortex-A55 (Snapdragon 888) benchmarks (opt)

Details
Benchmark suite Current: 8a19e9a Previous: 41da557 Ratio
ML-DSA-44 keypair 276468 cycles 277102 cycles 1.00
ML-DSA-44 sign 818650 cycles 810656 cycles 1.01
ML-DSA-44 verify 276672 cycles 278882 cycles 0.99
ML-DSA-65 keypair 475323 cycles 478906 cycles 0.99
ML-DSA-65 sign 1367640 cycles 1360800 cycles 1.01
ML-DSA-65 verify 459822 cycles 466415 cycles 0.99
ML-DSA-87 keypair 825623 cycles 818822 cycles 1.01
ML-DSA-87 sign 1873209 cycles 1878770 cycles 1.00
ML-DSA-87 verify 800938 cycles 794467 cycles 1.01

This comment was automatically generated by workflow using github-action-benchmark.

Contributor

@oqs-bot oqs-bot left a comment

AMD EPYC 3rd gen (c6a)

Details
Benchmark suite Current: 8a19e9a Previous: 41da557 Ratio
ML-DSA-44 keypair 69035 cycles 69134 cycles 1.00
ML-DSA-44 sign 187364 cycles 187688 cycles 1.00
ML-DSA-44 verify 69341 cycles 69282 cycles 1.00
ML-DSA-65 keypair 119503 cycles 119368 cycles 1.00
ML-DSA-65 sign 303527 cycles 300862 cycles 1.01
ML-DSA-65 verify 115926 cycles 115513 cycles 1.00
ML-DSA-87 keypair 203793 cycles 203546 cycles 1.00
ML-DSA-87 sign 394456 cycles 394636 cycles 1.00
ML-DSA-87 verify 195809 cycles 195483 cycles 1.00

This comment was automatically generated by workflow using github-action-benchmark.

Contributor

@oqs-bot oqs-bot left a comment

Intel Xeon 3rd gen (c6i)

Details
Benchmark suite Current: 8a19e9a Previous: 41da557 Ratio
ML-DSA-44 keypair 57235 cycles 56751 cycles 1.01
ML-DSA-44 sign 181496 cycles 181670 cycles 1.00
ML-DSA-44 verify 61165 cycles 61146 cycles 1.00
ML-DSA-65 keypair 98680 cycles 98647 cycles 1.00
ML-DSA-65 sign 298309 cycles 298480 cycles 1.00
ML-DSA-65 verify 100528 cycles 100288 cycles 1.00
ML-DSA-87 keypair 152581 cycles 152587 cycles 1.00
ML-DSA-87 sign 355291 cycles 355235 cycles 1.00
ML-DSA-87 verify 153950 cycles 153556 cycles 1.00

This comment was automatically generated by workflow using github-action-benchmark.

Contributor

@oqs-bot oqs-bot left a comment

Graviton4

Details
Benchmark suite Current: 8a19e9a Previous: 41da557 Ratio
ML-DSA-44 keypair 68156 cycles 68132 cycles 1.00
ML-DSA-44 sign 202004 cycles 201919 cycles 1.00
ML-DSA-44 verify 70775 cycles 70781 cycles 1.00
ML-DSA-65 keypair 120970 cycles 120914 cycles 1.00
ML-DSA-65 sign 331183 cycles 331101 cycles 1.00
ML-DSA-65 verify 117884 cycles 117908 cycles 1.00
ML-DSA-87 keypair 198649 cycles 198347 cycles 1.00
ML-DSA-87 sign 427544 cycles 427112 cycles 1.00
ML-DSA-87 verify 194417 cycles 194311 cycles 1.00

This comment was automatically generated by workflow using github-action-benchmark.

Contributor

@oqs-bot oqs-bot left a comment

AMD EPYC 3rd gen (c6a) (no-opt)

Details
Benchmark suite Current: 8a19e9a Previous: 41da557 Ratio
ML-DSA-44 keypair 135070 cycles 134705 cycles 1.00
ML-DSA-44 sign 526006 cycles 524023 cycles 1.00
ML-DSA-44 verify 147853 cycles 147704 cycles 1.00
ML-DSA-65 keypair 226865 cycles 226528 cycles 1.00
ML-DSA-65 sign 860582 cycles 861852 cycles 1.00
ML-DSA-65 verify 235373 cycles 235761 cycles 1.00
ML-DSA-87 keypair 370367 cycles 371080 cycles 1.00
ML-DSA-87 sign 1079627 cycles 1079785 cycles 1.00
ML-DSA-87 verify 382615 cycles 383268 cycles 1.00

This comment was automatically generated by workflow using github-action-benchmark.

Contributor

@oqs-bot oqs-bot left a comment

AMD EPYC 4th gen (c7a)

Details
Benchmark suite Current: 8a19e9a Previous: 41da557 Ratio
ML-DSA-44 keypair 41639 cycles 42042 cycles 0.99
ML-DSA-44 sign 134495 cycles 135046 cycles 1.00
ML-DSA-44 verify 44953 cycles 45886 cycles 0.98
ML-DSA-65 keypair 72877 cycles 72408 cycles 1.01
ML-DSA-65 sign 214749 cycles 215490 cycles 1.00
ML-DSA-65 verify 73910 cycles 73252 cycles 1.01
ML-DSA-87 keypair 107778 cycles 107965 cycles 1.00
ML-DSA-87 sign 252308 cycles 254024 cycles 0.99
ML-DSA-87 verify 109196 cycles 111034 cycles 0.98

This comment was automatically generated by workflow using github-action-benchmark.

Contributor

@oqs-bot oqs-bot left a comment

Intel Xeon 3rd gen (c6i) (no-opt)

Details
Benchmark suite Current: 8a19e9a Previous: 41da557 Ratio
ML-DSA-44 keypair 157593 cycles 157623 cycles 1.00
ML-DSA-44 sign 550359 cycles 549610 cycles 1.00
ML-DSA-44 verify 169225 cycles 169078 cycles 1.00
ML-DSA-65 keypair 267977 cycles 267943 cycles 1.00
ML-DSA-65 sign 903637 cycles 902493 cycles 1.00
ML-DSA-65 verify 274125 cycles 274108 cycles 1.00
ML-DSA-87 keypair 450990 cycles 447542 cycles 1.01
ML-DSA-87 sign 1162617 cycles 1156527 cycles 1.01
ML-DSA-87 verify 460584 cycles 457749 cycles 1.01

This comment was automatically generated by workflow using github-action-benchmark.

Contributor

@oqs-bot oqs-bot left a comment

Graviton3

Details
Benchmark suite Current: 8a19e9a Previous: 41da557 Ratio
ML-DSA-44 keypair 72258 cycles 72244 cycles 1.00
ML-DSA-44 sign 211991 cycles 212021 cycles 1.00
ML-DSA-44 verify 75712 cycles 75740 cycles 1.00
ML-DSA-65 keypair 127432 cycles 127429 cycles 1.00
ML-DSA-65 sign 350175 cycles 350138 cycles 1.00
ML-DSA-65 verify 125364 cycles 125365 cycles 1.00
ML-DSA-87 keypair 208138 cycles 208164 cycles 1.00
ML-DSA-87 sign 448958 cycles 448891 cycles 1.00
ML-DSA-87 verify 205105 cycles 205092 cycles 1.00

This comment was automatically generated by workflow using github-action-benchmark.

Contributor

@oqs-bot oqs-bot left a comment

Graviton4 (no-opt)

Details
Benchmark suite Current: 8a19e9a Previous: 41da557 Ratio
ML-DSA-44 keypair 128309 cycles 128287 cycles 1.00
ML-DSA-44 sign 447743 cycles 447655 cycles 1.00
ML-DSA-44 verify 138349 cycles 144617 cycles 0.96
ML-DSA-65 keypair 220300 cycles 220134 cycles 1.00
ML-DSA-65 sign 727626 cycles 727309 cycles 1.00
ML-DSA-65 verify 223200 cycles 223042 cycles 1.00
ML-DSA-87 keypair 365101 cycles 365095 cycles 1.00
ML-DSA-87 sign 926593 cycles 926085 cycles 1.00
ML-DSA-87 verify 372803 cycles 372794 cycles 1.00

This comment was automatically generated by workflow using github-action-benchmark.

Contributor

@oqs-bot oqs-bot left a comment

AMD EPYC 4th gen (c7a) (no-opt)

Details
Benchmark suite Current: 8a19e9a Previous: 41da557 Ratio
ML-DSA-44 keypair 120283 cycles 123215 cycles 0.98
ML-DSA-44 sign 447117 cycles 449447 cycles 0.99
ML-DSA-44 verify 131120 cycles 129997 cycles 1.01
ML-DSA-65 keypair 205159 cycles 204042 cycles 1.01
ML-DSA-65 sign 729240 cycles 726667 cycles 1.00
ML-DSA-65 verify 210548 cycles 209895 cycles 1.00
ML-DSA-87 keypair 336772 cycles 336983 cycles 1.00
ML-DSA-87 sign 923968 cycles 923345 cycles 1.00
ML-DSA-87 verify 346738 cycles 346079 cycles 1.00

This comment was automatically generated by workflow using github-action-benchmark.

@willieyz willieyz force-pushed the eliminate-use_hint_32_88-intrinsics branch 8 times, most recently from 81d5192 to 377fdc8 on February 12, 2026 06:26
@willieyz
Contributor Author

Please apply the same changes as requested in #905

Hello @mkannwischer, I have applied the same changes requested in #905, including:

  • Remove all uses of # for comments; use // and /*...*/ instead
  • Remove all vzeroupper instructions
  • Use 32-bit constants instead of 64-bit ones
  • Extract the decompose32/88 and use_hint32/88 macros (referencing the AArch64 versions), and add brief comments in the same style.

Thank you for your help!
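
For context on what the decompose32/88 and use_hint32/88 macros mentioned above have to compute per coefficient, here is a scalar C sketch. It follows the scalar decompose() and use_hint() routines of the Dilithium reference implementation, not the assembly macros in this PR. The names `decompose_32`, `decompose_88`, `use_hint_32`, `use_hint_88` and `MLDSA_Q` are chosen for illustration only, and the code assumes that right-shifting a negative `int32_t` is an arithmetic shift, as on the usual compilers.

```c
#include <assert.h>
#include <stdint.h>

#define MLDSA_Q 8380417
#define GAMMA2_32 ((MLDSA_Q - 1) / 32) /* 261888 */
#define GAMMA2_88 ((MLDSA_Q - 1) / 88) /* 95232 */

/* Split a in [0, Q) into a = a1 * 2 * GAMMA2 + a0 (mod Q); returns a1. */
static int32_t decompose_32(int32_t *a0, int32_t a)
{
  int32_t a1;
  assert(a >= 0 && a < MLDSA_Q);
  a1 = (a + 127) >> 7;
  a1 = (a1 * 1025 + (1 << 21)) >> 22;
  a1 &= 15; /* a1 == 16 wraps around to 0 */
  *a0 = a - a1 * 2 * GAMMA2_32;
  *a0 -= (((MLDSA_Q - 1) / 2 - *a0) >> 31) & MLDSA_Q; /* center a0 */
  return a1;
}

static int32_t decompose_88(int32_t *a0, int32_t a)
{
  int32_t a1;
  assert(a >= 0 && a < MLDSA_Q);
  a1 = (a + 127) >> 7;
  a1 = (a1 * 11275 + (1 << 23)) >> 24;
  a1 ^= ((43 - a1) >> 31) & a1; /* a1 == 44 wraps around to 0 */
  *a0 = a - a1 * 2 * GAMMA2_88;
  *a0 -= (((MLDSA_Q - 1) / 2 - *a0) >> 31) & MLDSA_Q; /* center a0 */
  return a1;
}

/* Correct the high bits a1 by +1 or -1 (with wrap-around) when hint is set. */
static int32_t use_hint_32(int32_t a, int32_t hint)
{
  int32_t a0;
  int32_t a1 = decompose_32(&a0, a);
  if (hint == 0)
  {
    return a1;
  }
  return (a0 > 0) ? ((a1 + 1) & 15) : ((a1 - 1) & 15);
}

static int32_t use_hint_88(int32_t a, int32_t hint)
{
  int32_t a0;
  int32_t a1 = decompose_88(&a0, a);
  if (hint == 0)
  {
    return a1;
  }
  if (a0 > 0)
  {
    return (a1 == 43) ? 0 : a1 + 1;
  }
  return (a1 == 0) ? 43 : a1 - 1;
}
```

The branches on `hint` and `a0` are kept for readability; a vectorized version would normally replace them with masks so that all lanes are handled uniformly.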

@willieyz willieyz marked this pull request as ready for review February 12, 2026 06:43
Contributor

@mkannwischer mkannwischer left a comment

Thanks @willieyz.
Some comments on how this can be improved.

/* Reference:
* - @[REF_AVX2] calls poly_decompose to compute all a1, a0 before the loop.
* - Our implementation of decompose() is slightly different from that in
* @[REF_AVX2]. See poly_decompose_32_avx2.c for more information.
Contributor

You are referencing a file here that does not exist.

Contributor Author

Fixed.

I have changed the reference from poly_decompose_32_avx2.c to poly_decompose_32_avx2 (without the file extension), since we plan to remove all intrinsics and move to pure assembly implementations.
Thank you for your review and help!

/* Reference:
* - @[REF_AVX2] calls poly_decompose to compute all a1, a0 before the loop.
* - Our implementation of decompose() is slightly different from that in
* @[REF_AVX2]. See poly_decompose_88_avx2.c for more information.
Contributor

You are referencing a file here that does not exist anymore.

Contributor Author

Fixed, same as the previous one. Thank you for your help!


// delta = (a0 <= 0) ? -1 : 1
vpcmpgtd %ymm5, \a, \a
vandnps \h, \a, \a
Contributor

What's the reason for using vandnps over vpandn? vpandn seems more natural and is what was used in the intrinsics.

Contributor Author

@willieyz willieyz Feb 24, 2026

Fixed.

The reason I originally used vandnps instead of vpandn is that the draft AVX2 assembly was generated by GCC on an x86 Linux platform, and the compiler emitted vandnps, likely due to its internal optimization decisions. However, this is not appropriate for our use case.

I have replaced it with vpandn as you suggested. Thank you for your help.
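For readers following the thread, here is a minimal scalar sketch in C of what the compare plus and-not pair discussed above can compute for one 32-bit lane. The function names are hypothetical, and the final `h - 2*t` step is an assumption about how the mask becomes a signed delta; the quoted snippet shows only the compare and the and-not. Note that vandnps and vpandn produce the same bits, so the choice between them does not affect this logic.

```c
#include <assert.h>
#include <stdint.h>

/* Scalar model of one 32-bit lane; names are illustrative, not from the PR. */
static int32_t hint_delta(int32_t a0, int32_t h)
{
  assert(h == 0 || h == 1);
  /* Like vpcmpgtd against zero: all-ones where a0 > 0, else zero. */
  uint32_t gt = (a0 > 0) ? 0xFFFFFFFFu : 0u;
  /* Like vpandn: (~gt) & h, so h survives only where a0 <= 0. */
  int32_t t = (int32_t)(~gt & (uint32_t)h);
  /* Assumed follow-up step: +h where a0 > 0, -h where a0 <= 0. */
  return h - 2 * t;
}

/* UseHint for GAMMA2 = (Q-1)/32: a1 lies in [0, 15], so wrap with a mask. */
static int32_t use_hint_32_lane(int32_t a1, int32_t a0, int32_t h)
{
  return (a1 + hint_delta(a0, h)) & 15;
}
```

With this model, a set hint bit moves a1 up by one when a0 is positive and down by one otherwise, wrapping modulo 16.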


use_hint88_avx2 %ymm0, %ymm0, %ymm1, %ymm9, %ymm10, %ymm11, %ymm12

vmovaps %ymm0, (%rdi,%rax)
Contributor

Same here. Why not use vmovdqa, since we are dealing with integers?

Contributor Author

@willieyz willieyz Feb 24, 2026

Fixed.

The reason is similar to the previous one; I have fixed it according to your suggestion.
Thank you for your help!


mld_poly_use_hint_32_avx2_loop:
vmovdqa (%rsi,%rax), %ymm0
vmovdqa (%rdx,%rax), %ymm2
Contributor

You can fuse this load with the subsequent vpandn, saving an instruction.

vmovd %ecx, %xmm5
movl $22784256, %ecx /* q_bound: 31*GAMMA2 = 8118528 (stored as 22784256 due to encoding) */

movl $512, %r9d /* 512, 0 alternating (for vpmulhrsw) */
Contributor

Same here, I don't think the comment is very helpful. I can see that 512 is what is being written.
A comment explaining what 512 is used for later would be more useful.

Contributor Author

@willieyz willieyz Feb 24, 2026

Fixed, now changed to:

/* round(x * 2^9 / 2^15) => round(x / 2^6), f1 = round(f1''/ 2^6)*/

According to the comment in poly_decompose_32_avx2.c.
Thank you for your review and help.
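To illustrate why multiplying by 512 with vpmulhrsw amounts to a rounded division by 2^6, here is a small scalar model in C of one 16-bit lane. The helper names are hypothetical, and the model is deliberately restricted to non-negative inputs so that the right shifts are fully portable C; the instruction itself also accepts negative lanes.

```c
#include <assert.h>
#include <stdint.h>

/* Scalar model of one vpmulhrsw lane: ((a * b >> 14) + 1) >> 1.
 * Restricted to non-negative inputs (an assumption of this sketch). */
static int16_t mulhrs16(int16_t a, int16_t b)
{
  assert(a >= 0 && b >= 0);
  int32_t p = (int32_t)a * (int32_t)b;
  return (int16_t)(((p >> 14) + 1) >> 1);
}

/* round(x / 2^6) with halves rounded up, for non-negative x. */
static int16_t round_div64(int16_t x)
{
  assert(x >= 0);
  return (int16_t)((x + 32) >> 6);
}
```

For every non-negative 16-bit x, `mulhrs16(x, 512)` and `round_div64(x)` agree, which is the identity the new comment states.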

xorl %eax, %eax
vpxor %xmm5, %xmm5, %xmm5

movl $11275, %r8d /* 11275, 0 alternating (for vpmulhuw) */
Contributor

Same here.

Contributor Author

@willieyz willieyz Feb 24, 2026

Fixed.

For this one, I followed your previous suggestion and the check-magic description in poly_decompose_88_avx2.c.
The comment is now changed to:

/* check-magic: 11275 == floor(2^24 / 1488) */

Thank you for your review and help!
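The arithmetic behind the check-magic comment can be verified with a few lines of C. This is a sketch under stated assumptions: it uses the ML-DSA modulus Q = 8380417 and GAMMA2 = (Q-1)/88, and it assumes (as in the reference implementation) that the input has already been divided by 2^7, so the remaining divisor is 2*GAMMA2/128. The identifiers are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

enum {
  MLDSA_Q = 8380417,
  GAMMA2_88 = (MLDSA_Q - 1) / 88 /* 95232 */
};

/* Divisor left after the input was pre-scaled by 2^7 (assumption). */
static int32_t scaled_divisor_88(void)
{
  return 2 * GAMMA2_88 / 128;
}

/* Fixed-point reciprocal: floor(2^shift / d). */
static int32_t magic(int32_t d, int shift)
{
  assert(d > 0 && shift >= 0 && shift < 31);
  return (int32_t)((INT32_C(1) << shift) / d);
}
```

Here `scaled_divisor_88()` evaluates to 1488 and `magic(1488, 24)` to 11275, matching the constant in the comment.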

vmovd %ecx, %xmm4
movl $8285184, %ecx /* 87*GAMMA2 = 8285184 (wrap-around threshold) */

movl $128, %r9d /* 128, 0 alternating (for vpmulhrsw) */
Contributor

Same here.

Contributor Author

@willieyz willieyz Feb 24, 2026

Fixed, now changed to:

/* round(x * 2^7 / 2^15) => round(x / 2^8),  for f1 = round(f1''/ 2^8)*/

According to the comment in poly_decompose_88_avx2.c.
Thank you for your review and help.
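As a sanity check on how the two constants fit together, the following C sketch compares a two-step computation (high half of an unsigned multiply by 11275, then a rounded division by 2^8 as vpmulhrsw with 128 would perform) against the single expression `(x * 11275 + 2^23) >> 24`. The function names are hypothetical, and treating the input as a 16-bit value already divided by 2^7 is an assumption of this sketch rather than something the thread spells out.

```c
#include <assert.h>
#include <stdint.h>

/* Two-step variant, modelled on vpmulhuw followed by vpmulhrsw. */
static int32_t a1_two_step_88(uint16_t x)
{
  /* Like vpmulhuw: high 16 bits of the unsigned 16x16 product. */
  uint32_t hi = ((uint32_t)x * 11275u) >> 16;
  /* hi must be non-negative as a signed 16-bit lane for vpmulhrsw. */
  assert(hi <= (uint32_t)INT16_MAX);
  /* Like vpmulhrsw by 128: ((hi * 128 >> 14) + 1) >> 1. */
  return (int32_t)((((hi * 128u) >> 14) + 1u) >> 1);
}

/* Single-step variant with one rounding shift by 24 bits. */
static int32_t a1_one_step_88(uint16_t x)
{
  return (int32_t)(((uint32_t)x * 11275u + (1u << 23)) >> 24);
}
```

The two functions agree on every 16-bit input, because flooring the intermediate product to 16 bits does not change the outcome of the later rounded shift.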

vmovd %r9d, %xmm7
vpbroadcastd %xmm7, %ymm7

movl $43, %r10d /* Load 43 constant (for blend and comparison) */
Contributor

This comment is also not very useful: I can see from the instruction that this loads 43, and that does not need a comment.
Something like /* max a1 value */ may be useful.

Contributor Author

Fixed

Comment on lines 119 to 124
movl $43, %r10d /* Load 43 constant (for blend and comparison) */
vmovd %r10d, %xmm6
vpbroadcastd %xmm6, %ymm6

vmovd %ecx, %xmm3
movl $43, %ecx
Contributor

Why do we need two copies of this 43? Can they be merged into one?

Contributor Author

I have removed the duplicate and updated the vpcmpgtd in decompose88_avx2 to use ymm6. The duplicate copies were kept because GCC generated them in the AVX2 output, but after you pointed this out, I agree they should be merged. Thank you for your help!
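For context on why 43 appears both as a comparison bound and as a blend value, here is a scalar, branchy sketch in C of the hint step for GAMMA2 = (Q-1)/88 in the style of the reference implementation. It is not the vectorized code from this PR, and the names are hypothetical. In this parameter set a1 lies in [0, 43], so a step past either end wraps around.

```c
#include <assert.h>
#include <stdint.h>

enum { A1_MAX_88 = 43 }; /* largest valid a1 for GAMMA2 = (Q-1)/88 */

/* Scalar model of UseHint for one coefficient, given its decomposition. */
static int32_t use_hint_88_lane(int32_t a1, int32_t a0, int32_t h)
{
  assert(h == 0 || h == 1);
  assert(a1 >= 0 && a1 <= A1_MAX_88);
  if (h == 0) {
    return a1; /* no hint: keep the high part unchanged */
  }
  if (a0 > 0) {
    return (a1 == A1_MAX_88) ? 0 : a1 + 1; /* step up, wrapping 43 to 0 */
  }
  return (a1 == 0) ? A1_MAX_88 : a1 - 1; /* step down, wrapping 0 to 43 */
}
```

A vector implementation needs the same constant for both the equality test and the wrapped result, which is why a single broadcast register holding 43 can serve both uses.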

This commit adds poly_use_hint to bench --components for benchmarking
the performance impact of the changes to:
- poly_use_hint_32
- poly_use_hint_88

Signed-off-by: willieyz <willie.zhao@chelpis.com>
@willieyz willieyz force-pushed the eliminate-use_hint_32_88-intrinsics branch 10 times, most recently from f820653 to 91cddd6 Compare February 24, 2026 10:16
@willieyz willieyz marked this pull request as draft February 25, 2026 01:24
In this PR, we replace the AVX2 intrinsics implementation of
poly_use_hint_32 and poly_use_hint_88 with an x86_64 assembly version.
This is part of the effort to enable HOL-Light proofs.

Signed-off-by: willieyz <willie.zhao@chelpis.com>
@willieyz willieyz force-pushed the eliminate-use_hint_32_88-intrinsics branch from 5bc97db to b8c27b4 Compare February 25, 2026 02:35
This commit extracts the decompose 32/88 and use_hint 32/88 logic into macros.

Signed-off-by: willieyz <willie.zhao@chelpis.com>
@willieyz willieyz force-pushed the eliminate-use_hint_32_88-intrinsics branch 2 times, most recently from d3f2de3 to e5ad167 Compare February 25, 2026 03:49
@willieyz willieyz marked this pull request as ready for review February 25, 2026 03:56
Development

Successfully merging this pull request may close these issues.

AVX2: Replace intrinsics implementation of poly_use_hint with assembly

3 participants