Eliminate use_hint 32/88 intrinsics by willieyz · Pull Request #940 · pq-code-package/mldsa-native

willieyz · 2026-02-03T07:41:03Z

Resolves: AVX2: Replace intrinsics implementation of poly_use_hint with assembly #484
In this PR, we replace the AVX2 intrinsics implementation of poly_use_hint_32 and poly_use_hint_88 with an x86_64 assembly version. This is part of the effort to enable HOL-Light proofs.
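
For readers unfamiliar with the routine, here is a minimal scalar sketch of what use_hint computes per coefficient for GAMMA2 = (Q-1)/32. It follows the structure of the public Dilithium reference C code, not the assembly in this PR. The names `SKETCH_Q`, `SKETCH_GAMMA2`, `decompose_32` and `use_hint_32` are illustrative assumptions and do not appear in the repository.

```c
#include <stdint.h>

#define SKETCH_Q 8380417                        /* ML-DSA modulus */
#define SKETCH_GAMMA2 ((SKETCH_Q - 1) / 32)     /* 261888 (hypothetical macro name) */

/* Split a in [0, Q) into high bits a1 (returned) and low bits *a0,
 * so that a = a1 * 2*GAMMA2 + a0 (mod Q) with a1 in [0, 15]. */
static int32_t decompose_32(int32_t *a0, int32_t a)
{
  int32_t a1 = (a + 127) >> 7;
  a1 = (a1 * 1025 + (1 << 21)) >> 22;
  a1 &= 15;                                     /* wrap 16 back to 0 */
  *a0 = a - a1 * 2 * SKETCH_GAMMA2;
  if (*a0 > (SKETCH_Q - 1) / 2)                 /* center the remainder */
  {
    *a0 -= SKETCH_Q;
  }
  return a1;
}

/* Correct the high bits of a by one step when the hint bit is set.
 * The direction depends on the sign of the low part a0. */
static int32_t use_hint_32(int32_t a, int32_t hint)
{
  int32_t a0;
  int32_t a1 = decompose_32(&a0, a);
  if (hint == 0)
  {
    return a1;
  }
  return (a0 > 0) ? ((a1 + 1) & 15) : ((a1 - 1) & 15);
}
```

The vectorized versions apply the same per-coefficient logic to eight 32-bit lanes at a time, replacing the branches with compare-and-mask operations.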

We also tried unrolling the loops mld_poly_use_hint_88_avx2_loop and mld_poly_use_hint_32_avx2_loop in both files. However, the benchmark results showed that unrolling did not provide any performance benefit, so we decided to keep the current version.

bench components
- Δ (%) = (asm − AVX2) / AVX2 × 100

Component	Implementation	Build	ML-DSA-44	ML-DSA-65	ML-DSA-87	Notes
mld_poly_caddq (avg)	AVX2 intrinsics	no-opt	821	781	789
	x86_64 asm	no-opt	847	786	787
	Δ (%)	no-opt	+3.17%	+0.64%	-0.25%
mld_poly_caddq (avg)	AVX2 intrinsics	opt	210	147	143
	x86_64 asm	opt	220	153	155
	x86_64 asm (unroll)	opt	273	154	156	unroll by 4
	Δ (%)	opt	+4.76%	+4.08%	+8.39%
	Δ (%) (unroll)	opt	+30.00%	+4.76%	+9.09%	unroll by 4
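
As a quick check of how the Δ rows are derived, the formula above can be written as a small helper. This is only a sketch; `delta_percent` is a hypothetical name and is not part of the benchmark tooling.

```c
/* Relative change of the assembly cycle count against the
 * AVX2-intrinsics baseline, in percent. Positive means slower. */
static double delta_percent(double asm_cycles, double avx2_cycles)
{
  return (asm_cycles - avx2_cycles) / avx2_cycles * 100.0;
}
```

For example, the ML-DSA-44 opt row uses `delta_percent(220, 210)`, which rounds to the +4.76% shown in the table.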

bench
- Δ (%) = (asm − AVX2) / AVX2 × 100

Component	Implementation	Build	ML-DSA-44	ML-DSA-65	ML-DSA-87	Notes
keypair cycles (avg)	AVX2 intrinsics	no-opt	127436	218610	360739	baseline (main)
	x86_64 asm	no-opt	127459	217604	367118
	Δ (%)	no-opt	+0.02%	-0.46%	+1.77%
	AVX2 intrinsics	opt	56955	98362	157869	baseline (main)
	x86_64 asm	opt	59747	102961	165706
	x86_64 asm (unroll)	opt	59483	104732	166654
	Δ (%)	opt	+4.90%	+4.68%	+4.96%
	Δ (%) (unroll)	opt	+4.44%	+6.48%	+5.56%	unroll by 4
sign cycles (avg)	AVX2 intrinsics	no-opt	451922	756003	958151	baseline (main)
	x86_64 asm	no-opt	452833	752512	974497
	Δ (%)	no-opt	+0.20%	-0.46%	+1.71%
	AVX2 intrinsics	opt	170370	281545	347924	baseline (main)
	x86_64 asm	opt	178564	294843	362677
	x86_64 asm (unroll)	opt	177251	300667	366158
	Δ (%)	opt	+4.81%	+4.72%	+4.24%
	Δ (%) (unroll)	opt	+4.04%	+6.79%	+5.24%	unroll by 4
verify cycles (avg)	AVX2 intrinsics	no-opt	134113	220671	363234	baseline (main)
	x86_64 asm	no-opt	134633	220015	369763
	Δ (%)	no-opt	+0.39%	-0.30%	+1.80%
	AVX2 intrinsics	opt	60234	98904	156281	baseline (main)
	x86_64 asm	opt	63140	103682	164376
	x86_64 asm (unroll)	opt	62822	105719	164028
	Δ (%)	opt	+4.82%	+4.83%	+5.18%
	Δ (%) (unroll)	opt	+4.30%	+6.89%	+4.96%	unroll by 4

oqs-bot · 2026-02-03T07:58:29Z

CBMC Results (ML-DSA-87)

Full Results (175 proofs)

Proof	Status	Current	Previous	Change
`TOTAL`	✅	2632s	2449s	+7.5%
`sign_verify_internal`	✅	375s	353s	+6%
`mld_attempt_signature_generation`	✅	248s	227s	+9%
`polyvecl_pointwise_acc_montgomery_c`	✅	196s	165s	+19%
`polyvec_matrix_expand`	✅	163s	153s	+7%
`rej_uniform_native`	✅	155s	139s	+12%
`poly_pointwise_montgomery_c`	✅	154s	128s	+20%
`mld_invntt_layer`	✅	125s	114s	+10%
`polyvec_matrix_expand_serial`	✅	109s	110s	-1%
`mld_ct_memcmp`	✅	89s	74s	+20%
`sign_signature_internal`	✅	50s	46s	+9%
`mld_ntt_layer`	✅	46s	44s	+5%
`keccak_squeezeblocks_x4`	✅	44s	42s	+5%
`mld_compute_t0_t1_tr_from_sk_components`	✅	24s	25s	-4%
`polymat_permute_bitrev_to_custom`	✅	24s	24s	+0%
`rej_uniform`	✅	22s	21s	+5%
`fqmul`	✅	20s	18s	+11%
`poly_chknorm_c`	✅	20s	17s	+18%
`poly_uniform_4x`	✅	20s	17s	+18%
`rej_uniform_c`	✅	19s	16s	+19%
`poly_uniform_eta_4x`	✅	17s	17s	+0%
`polyveck_add`	✅	16s	13s	+23%
`polyeta_unpack`	✅	15s	13s	+15%
`polyt0_unpack`	✅	15s	17s	-12%
`polyvec_matrix_pointwise_montgomery`	✅	15s	12s	+25%
`polyveck_power2round`	✅	15s	14s	+7%
`keccakf1600x4_permute_native`	✅	14s	12s	+17%
`mld_ntt_butterfly_block`	✅	13s	13s	+0%
`polyveck_chknorm`	✅	12s	6s	+100%
`sign_keypair_internal`	✅	12s	6s	+100%
`keccakf1600_permute`	✅	11s	7s	+57%
`sign_pk_from_sk`	✅	11s	9s	+22%
`poly_invntt_tomont_c`	✅	10s	9s	+11%
`keccak_absorb_once_x4`	✅	9s	10s	-10%
`mld_check_pct`	✅	9s	7s	+29%
`mld_sample_s1_s2_serial`	✅	9s	6s	+50%
`poly_decompose_c`	✅	9s	7s	+29%
`polyveck_reduce`	✅	9s	13s	-31%
`polyveck_use_hint`	✅	9s	9s	+0%
`keccakf1600_permute_native`	✅	8s	10s	-20%
`polyveck_caddq`	✅	8s	7s	+14%
`polyveck_invntt_tomont`	✅	8s	10s	-20%
`polyveck_sub`	✅	8s	7s	+14%
`polyvecl_ntt`	✅	8s	11s	-27%
`sign`	✅	8s	7s	+14%
`keccak_absorb`	✅	7s	7s	+0%
`mld_compute_pack_z`	✅	7s	7s	+0%
`mld_polyvecl_permute_bitrev_to_custom_native`	✅	7s	8s	-12%
`mld_prepare_domain_separation_prefix`	✅	7s	3s	+133%
`polyveck_decompose`	✅	7s	6s	+17%
`polyveck_ntt`	✅	7s	8s	-12%
`polyveck_pointwise_poly_montgomery`	✅	7s	8s	-12%
`mld_sample_s1_s2`	✅	6s	6s	+0%
`poly_add`	✅	6s	4s	+50%
`poly_challenge`	✅	6s	5s	+20%
`poly_uniform_gamma1_4x`	✅	6s	6s	+0%
`polyveck_shiftl`	✅	6s	6s	+0%
`unpack_hints`	✅	6s	4s	+50%
`intt_native_x86_64`	✅	5s	5s	+0%
`mld_h`	✅	5s	2s	+150%
`poly_invntt_tomont_native`	✅	5s	2s	+150%
`poly_ntt`	✅	5s	3s	+67%
`poly_use_hint_native`	✅	5s	4s	+25%
`polyveck_pack_t0`	✅	5s	2s	+150%
`polyvecl_uniform_gamma1_serial`	✅	5s	7s	-29%
`polyz_unpack_native`	✅	5s	4s	+25%
`shake256x4_squeezeblocks`	✅	5s	1s	+400%
`use_hint`	✅	5s	3s	+67%
`keccak_squeeze`	✅	4s	4s	+0%
`keccakf1600_extract_bytes (big endian)`	✅	4s	3s	+33%
`keccakf1600_xor_bytes`	✅	4s	2s	+100%
`mld_ct_cmask_nonzero_u8`	✅	4s	3s	+33%
`pack_pk`	✅	4s	4s	+0%
`pack_sig_c_h`	✅	4s	3s	+33%
`pack_sig_z`	✅	4s	3s	+33%
`pack_sk`	✅	4s	3s	+33%
`poly_caddq`	✅	4s	2s	+100%
`poly_caddq_c`	✅	4s	3s	+33%
`poly_caddq_native`	✅	4s	2s	+100%
`poly_chknorm_native`	✅	4s	2s	+100%
`poly_decompose_native`	✅	4s	5s	-20%
`poly_make_hint`	✅	4s	2s	+100%
`poly_ntt_native`	✅	4s	6s	-33%
`poly_reduce`	✅	4s	4s	+0%
`poly_sub`	✅	4s	3s	+33%
`poly_uniform`	✅	4s	6s	-33%
`poly_uniform_eta`	✅	4s	4s	+0%
`polyt1_pack`	✅	4s	2s	+100%
`polyt1_unpack`	✅	4s	3s	+33%
`polyveck_make_hint`	✅	4s	5s	-20%
`polyvecl_chknorm`	✅	4s	4s	+0%
`polyvecl_pointwise_acc_montgomery`	✅	4s	4s	+0%
`polyvecl_uniform_gamma1`	✅	4s	4s	+0%
`rej_eta_native`	✅	4s	5s	-20%
`shake128_absorb`	✅	4s	2s	+100%
`sign_keypair`	✅	4s	3s	+33%
`sign_signature`	✅	4s	5s	-20%
`sign_signature_pre_hash_shake256`	✅	4s	4s	+0%
`sign_verify_pre_hash_internal`	✅	4s	5s	-20%
`sys_check_capability`	✅	4s	3s	+33%
`unpack_sk`	✅	4s	6s	-33%
`caddq`	✅	3s	4s	-25%
`fqscale`	✅	3s	5s	-40%
`keccak_finalize`	✅	3s	2s	+50%
`keccakf1600x4_extract_bytes`	✅	3s	2s	+50%
`mld_ct_abs_i32`	✅	3s	2s	+50%
`mld_ct_cmask_neg_i32`	✅	3s	1s	+200%
`mld_ct_cmask_nonzero_u32`	✅	3s	4s	-25%
`mld_ct_get_optblocker_i64`	✅	3s	3s	+0%
`mld_ct_get_optblocker_u8`	✅	3s	2s	+50%
`mld_keccakf1600_extract_bytes`	✅	3s	1s	+200%
`montgomery_reduce`	✅	3s	3s	+0%
`poly_caddq_native_aarch64`	✅	3s	3s	+0%
`poly_invntt_tomont`	✅	3s	2s	+50%
`poly_ntt_c`	✅	3s	1s	+200%
`poly_pointwise_montgomery_native`	✅	3s	3s	+0%
`poly_power2round`	✅	3s	3s	+0%
`poly_shiftl`	✅	3s	2s	+50%
`poly_uniform_gamma1`	✅	3s	3s	+0%
`polyt0_pack`	✅	3s	5s	-40%
`polyveck_pack_eta`	✅	3s	2s	+50%
`polyveck_unpack_t0`	✅	3s	6s	-50%
`polyvecl_permute_bitrev_to_custom`	✅	3s	2s	+50%
`polyvecl_unpack_eta`	✅	3s	3s	+0%
`polyz_pack`	✅	3s	2s	+50%
`polyz_unpack`	✅	3s	3s	+0%
`polyz_unpack_c`	✅	3s	5s	-40%
`power2round`	✅	3s	4s	-25%
`rej_eta_c`	✅	3s	4s	-25%
`shake128_init`	✅	3s	2s	+50%
`shake128_release`	✅	3s	3s	+0%
`shake128_squeeze`	✅	3s	5s	-40%
`shake256`	✅	3s	4s	-25%
`shake256_absorb`	✅	3s	3s	+0%
`shake256x4_absorb_once`	✅	3s	2s	+50%
`sign_open`	✅	3s	6s	-50%
`sign_signature_pre_hash_internal`	✅	3s	5s	-40%
`sign_verify`	✅	3s	2s	+50%
`sign_verify_extmu`	✅	3s	3s	+0%
`sign_verify_pre_hash_shake256`	✅	3s	4s	-25%
`unpack_pk`	✅	3s	6s	-50%
`decompose`	✅	2s	4s	-50%
`keccak_init`	✅	2s	3s	-33%
`keccakf1600x4_permute`	✅	2s	2s	+0%
`make_hint`	✅	2s	5s	-60%
`mld_ct_get_optblocker_u32`	✅	2s	2s	+0%
`mld_ct_sel_int32`	✅	2s	2s	+0%
`mld_value_barrier_i64`	✅	2s	3s	-33%
`mld_value_barrier_u8`	✅	2s	2s	+0%
`ntt_native_x86_64`	✅	2s	4s	-50%
`poly_chknorm`	✅	2s	2s	+0%
`poly_decompose`	✅	2s	3s	-33%
`poly_pointwise_montgomery`	✅	2s	4s	-50%
`poly_use_hint`	✅	2s	3s	-33%
`poly_use_hint_c`	✅	2s	4s	-50%
`polyeta_pack`	✅	2s	4s	-50%
`polyveck_pack_w1`	✅	2s	4s	-50%
`polyvecl_pack_eta`	✅	2s	3s	-33%
`polyvecl_pointwise_acc_montgomery_native`	✅	2s	3s	-33%
`polyvecl_unpack_z`	✅	2s	3s	-33%
`polyw1_pack`	✅	2s	2s	+0%
`rej_eta`	✅	2s	2s	+0%
`shake128_finalize`	✅	2s	3s	-33%
`shake128x4_absorb_once`	✅	2s	4s	-50%
`shake128x4_squeezeblocks`	✅	2s	2s	+0%
`shake256_init`	✅	2s	2s	+0%
`shake256_release`	✅	2s	3s	-33%
`shake256_squeeze`	✅	2s	2s	+0%
`sign_signature_extmu`	✅	2s	5s	-60%
`unpack_sig`	✅	2s	5s	-60%
`keccakf1600_xor_bytes (big endian)`	✅	1s	3s	-67%
`keccakf1600x4_xor_bytes`	✅	1s	2s	-50%
`mld_value_barrier_u32`	✅	1s	2s	-50%
`polyveck_unpack_eta`	✅	1s	3s	-67%
`reduce32`	✅	1s	4s	-75%
`shake256_finalize`	✅	1s	3s	-67%

oqs-bot · 2026-02-03T07:59:14Z

CBMC Results (ML-DSA-44)

Full Results (175 proofs)

Proof	Status	Current	Previous	Change
`TOTAL`	✅	2187s	2055s	+6.4%
`sign_verify_internal`	✅	264s	254s	+4%
`mld_attempt_signature_generation`	✅	233s	221s	+5%
`polyvecl_pointwise_acc_montgomery_c`	✅	233s	208s	+12%
`rej_uniform_native`	✅	150s	144s	+4%
`poly_pointwise_montgomery_c`	✅	148s	143s	+3%
`mld_ct_memcmp`	✅	88s	82s	+7%
`mld_invntt_layer`	✅	53s	50s	+6%
`sign_signature_internal`	✅	48s	45s	+7%
`mld_ntt_layer`	✅	47s	44s	+7%
`keccak_squeezeblocks_x4`	✅	45s	44s	+2%
`poly_invntt_tomont_c`	✅	43s	39s	+10%
`rej_uniform`	✅	23s	20s	+15%
`rej_uniform_c`	✅	22s	20s	+10%
`fqmul`	✅	18s	19s	-5%
`poly_uniform_eta_4x`	✅	18s	19s	-5%
`polymat_permute_bitrev_to_custom`	✅	17s	16s	+6%
`mld_polyvecl_permute_bitrev_to_custom_native`	✅	15s	14s	+7%
`poly_chknorm_c`	✅	15s	12s	+25%
`poly_uniform_4x`	✅	15s	13s	+15%
`polyeta_unpack`	✅	15s	12s	+25%
`polyt0_unpack`	✅	15s	15s	+0%
`mld_ntt_butterfly_block`	✅	14s	13s	+8%
`polyvec_matrix_expand`	✅	14s	16s	-12%
`keccakf1600x4_permute_native`	✅	13s	14s	-7%
`mld_compute_t0_t1_tr_from_sk_components`	✅	13s	15s	-13%
`polyz_unpack_c`	✅	11s	12s	-8%
`keccakf1600_permute_native`	✅	10s	6s	+67%
`keccak_absorb_once_x4`	✅	9s	9s	+0%
`keccakf1600_permute`	✅	9s	8s	+12%
`mld_check_pct`	✅	9s	6s	+50%
`polyveck_caddq`	✅	9s	8s	+12%
`keccak_absorb`	✅	8s	8s	+0%
`shake128x4_absorb_once`	✅	8s	4s	+100%
`polyvec_matrix_pointwise_montgomery`	✅	7s	6s	+17%
`polyveck_add`	✅	7s	7s	+0%
`polyveck_ntt`	✅	7s	4s	+75%
`rej_eta_c`	✅	7s	4s	+75%
`sign`	✅	7s	5s	+40%
`decompose`	✅	6s	3s	+100%
`mld_sample_s1_s2_serial`	✅	6s	3s	+100%
`poly_uniform`	✅	6s	5s	+20%
`poly_uniform_eta`	✅	6s	5s	+20%
`poly_uniform_gamma1`	✅	6s	4s	+50%
`poly_use_hint_c`	✅	6s	5s	+20%
`polyvec_matrix_expand_serial`	✅	6s	7s	-14%
`polyveck_pointwise_poly_montgomery`	✅	6s	5s	+20%
`polyveck_reduce`	✅	6s	5s	+20%
`polyveck_use_hint`	✅	6s	5s	+20%
`polyvecl_ntt`	✅	6s	6s	+0%
`sign_keypair`	✅	6s	2s	+200%
`sign_keypair_internal`	✅	6s	4s	+50%
`sign_pk_from_sk`	✅	6s	6s	+0%
`sign_verify_pre_hash_internal`	✅	6s	4s	+50%
`sign_verify_pre_hash_shake256`	✅	6s	3s	+100%
`intt_native_x86_64`	✅	5s	3s	+67%
`mld_compute_pack_z`	✅	5s	6s	-17%
`mld_ct_cmask_nonzero_u32`	✅	5s	2s	+150%
`mld_ct_get_optblocker_u32`	✅	5s	1s	+400%
`mld_ct_sel_int32`	✅	5s	4s	+25%
`mld_keccakf1600_extract_bytes`	✅	5s	6s	-17%
`poly_chknorm_native`	✅	5s	6s	-17%
`poly_ntt_c`	✅	5s	2s	+150%
`poly_pointwise_montgomery`	✅	5s	2s	+150%
`poly_sub`	✅	5s	3s	+67%
`polyveck_chknorm`	✅	5s	3s	+67%
`polyveck_decompose`	✅	5s	7s	-29%
`polyveck_power2round`	✅	5s	6s	-17%
`polyveck_shiftl`	✅	5s	5s	+0%
`polyvecl_chknorm`	✅	5s	6s	-17%
`polyvecl_uniform_gamma1`	✅	5s	2s	+150%
`polyw1_pack`	✅	5s	2s	+150%
`polyz_unpack_native`	✅	5s	3s	+67%
`shake128x4_squeezeblocks`	✅	5s	2s	+150%
`shake256_init`	✅	5s	3s	+67%
`sign_open`	✅	5s	6s	-17%
`sign_signature`	✅	5s	6s	-17%
`sign_signature_extmu`	✅	5s	7s	-29%
`sign_signature_pre_hash_internal`	✅	5s	4s	+25%
`unpack_hints`	✅	5s	5s	+0%
`keccak_squeeze`	✅	4s	5s	-20%
`mld_ct_abs_i32`	✅	4s	2s	+100%
`mld_ct_cmask_nonzero_u8`	✅	4s	4s	+0%
`mld_h`	✅	4s	6s	-33%
`mld_value_barrier_u32`	✅	4s	4s	+0%
`ntt_native_x86_64`	✅	4s	3s	+33%
`poly_caddq_native`	✅	4s	2s	+100%
`poly_caddq_native_aarch64`	✅	4s	3s	+33%
`poly_decompose_c`	✅	4s	5s	-20%
`poly_decompose_native`	✅	4s	3s	+33%
`poly_invntt_tomont_native`	✅	4s	4s	+0%
`poly_uniform_gamma1_4x`	✅	4s	4s	+0%
`polyveck_invntt_tomont`	✅	4s	6s	-33%
`polyveck_make_hint`	✅	4s	4s	+0%
`polyveck_unpack_t0`	✅	4s	4s	+0%
`polyvecl_uniform_gamma1_serial`	✅	4s	3s	+33%
`polyz_unpack`	✅	4s	5s	-20%
`rej_eta_native`	✅	4s	4s	+0%
`shake128_finalize`	✅	4s	2s	+100%
`shake128_init`	✅	4s	2s	+100%
`shake128_release`	✅	4s	3s	+33%
`shake256x4_squeezeblocks`	✅	4s	2s	+100%
`sign_signature_pre_hash_shake256`	✅	4s	5s	-20%
`sign_verify_extmu`	✅	4s	2s	+100%
`unpack_sk`	✅	4s	4s	+0%
`use_hint`	✅	4s	2s	+100%
`fqscale`	✅	3s	1s	+200%
`keccakf1600_xor_bytes`	✅	3s	2s	+50%
`keccakf1600x4_extract_bytes`	✅	3s	3s	+0%
`make_hint`	✅	3s	3s	+0%
`mld_prepare_domain_separation_prefix`	✅	3s	5s	-40%
`mld_sample_s1_s2`	✅	3s	4s	-25%
`montgomery_reduce`	✅	3s	2s	+50%
`pack_pk`	✅	3s	3s	+0%
`pack_sig_c_h`	✅	3s	4s	-25%
`pack_sig_z`	✅	3s	2s	+50%
`poly_add`	✅	3s	4s	-25%
`poly_caddq`	✅	3s	3s	+0%
`poly_caddq_c`	✅	3s	3s	+0%
`poly_challenge`	✅	3s	3s	+0%
`poly_chknorm`	✅	3s	4s	-25%
`poly_decompose`	✅	3s	4s	-25%
`poly_invntt_tomont`	✅	3s	2s	+50%
`poly_make_hint`	✅	3s	4s	-25%
`poly_ntt`	✅	3s	3s	+0%
`poly_ntt_native`	✅	3s	2s	+50%
`poly_pointwise_montgomery_native`	✅	3s	3s	+0%
`poly_power2round`	✅	3s	5s	-40%
`poly_reduce`	✅	3s	4s	-25%
`poly_use_hint`	✅	3s	2s	+50%
`poly_use_hint_native`	✅	3s	3s	+0%
`polyveck_pack_eta`	✅	3s	3s	+0%
`polyveck_pack_w1`	✅	3s	4s	-25%
`polyveck_sub`	✅	3s	4s	-25%
`polyveck_unpack_eta`	✅	3s	4s	-25%
`polyvecl_pointwise_acc_montgomery`	✅	3s	4s	-25%
`polyvecl_pointwise_acc_montgomery_native`	✅	3s	3s	+0%
`polyvecl_unpack_eta`	✅	3s	2s	+50%
`reduce32`	✅	3s	3s	+0%
`shake128_absorb`	✅	3s	1s	+200%
`shake256`	✅	3s	3s	+0%
`shake256_release`	✅	3s	3s	+0%
`sign_verify`	✅	3s	3s	+0%
`sys_check_capability`	✅	3s	4s	-25%
`unpack_pk`	✅	3s	3s	+0%
`unpack_sig`	✅	3s	3s	+0%
`caddq`	✅	2s	3s	-33%
`keccak_finalize`	✅	2s	4s	-50%
`keccak_init`	✅	2s	3s	-33%
`keccakf1600_extract_bytes (big endian)`	✅	2s	3s	-33%
`keccakf1600_xor_bytes (big endian)`	✅	2s	2s	+0%
`keccakf1600x4_xor_bytes`	✅	2s	2s	+0%
`mld_ct_get_optblocker_i64`	✅	2s	2s	+0%
`mld_ct_get_optblocker_u8`	✅	2s	1s	+100%
`mld_value_barrier_i64`	✅	2s	4s	-50%
`mld_value_barrier_u8`	✅	2s	1s	+100%
`pack_sk`	✅	2s	3s	-33%
`poly_shiftl`	✅	2s	4s	-50%
`polyeta_pack`	✅	2s	3s	-33%
`polyt0_pack`	✅	2s	4s	-50%
`polyt1_pack`	✅	2s	1s	+100%
`polyt1_unpack`	✅	2s	4s	-50%
`polyveck_pack_t0`	✅	2s	3s	-33%
`polyvecl_pack_eta`	✅	2s	3s	-33%
`polyvecl_permute_bitrev_to_custom`	✅	2s	2s	+0%
`polyvecl_unpack_z`	✅	2s	5s	-60%
`polyz_pack`	✅	2s	2s	+0%
`power2round`	✅	2s	2s	+0%
`shake256_absorb`	✅	2s	2s	+0%
`shake256_finalize`	✅	2s	4s	-50%
`shake256_squeeze`	✅	2s	5s	-60%
`keccakf1600x4_permute`	✅	1s	3s	-67%
`mld_ct_cmask_neg_i32`	✅	1s	2s	-50%
`rej_eta`	✅	1s	1s	+0%
`shake128_squeeze`	✅	1s	2s	-50%
`shake256x4_absorb_once`	✅	1s	4s	-75%

oqs-bot · 2026-02-03T08:00:03Z

CBMC Results (ML-DSA-65)

⚠️ Attention Required

Proof	Status	Current	Previous	Change
`poly_uniform_4x`	⚠️	20s	13s	+54%

Full Results (175 proofs)

Proof	Status	Current	Previous	Change
`TOTAL`	✅	2582s	2286s	+12.9%
`polyvecl_pointwise_acc_montgomery_c`	✅	298s	226s	+32%
`mld_attempt_signature_generation`	✅	224s	197s	+14%
`sign_verify_internal`	✅	199s	177s	+12%
`rej_uniform_native`	✅	169s	144s	+17%
`polyvec_matrix_expand`	✅	166s	145s	+14%
`poly_pointwise_montgomery_c`	✅	165s	138s	+20%
`mld_invntt_layer`	✅	139s	117s	+19%
`mld_ct_memcmp`	✅	94s	79s	+19%
`polyvec_matrix_expand_serial`	✅	72s	65s	+11%
`sign_signature_internal`	✅	55s	50s	+10%
`mld_ntt_layer`	✅	51s	44s	+16%
`keccak_squeezeblocks_x4`	✅	45s	42s	+7%
`mld_compute_t0_t1_tr_from_sk_components`	✅	27s	27s	+0%
`rej_uniform`	✅	24s	21s	+14%
`polymat_permute_bitrev_to_custom`	✅	22s	18s	+22%
`rej_uniform_c`	✅	21s	19s	+11%
`fqmul`	✅	20s	18s	+11%
`poly_uniform_4x`	⚠️	20s	13s	+54%
`polyveck_decompose`	✅	20s	16s	+25%
`poly_uniform_eta_4x`	✅	19s	17s	+12%
`polyvec_matrix_pointwise_montgomery`	✅	18s	14s	+29%
`poly_chknorm_c`	✅	17s	16s	+6%
`polyt0_unpack`	✅	16s	17s	-6%
`mld_ntt_butterfly_block`	✅	15s	13s	+15%
`mld_polyvecl_permute_bitrev_to_custom_native`	✅	14s	14s	+0%
`polyveck_use_hint`	✅	14s	14s	+0%
`keccakf1600x4_permute_native`	✅	13s	14s	-7%
`mld_check_pct`	✅	12s	9s	+33%
`poly_invntt_tomont_c`	✅	12s	8s	+50%
`sign`	✅	11s	9s	+22%
`keccak_absorb_once_x4`	✅	10s	10s	+0%
`polyveck_add`	✅	10s	9s	+11%
`polyveck_caddq`	✅	10s	8s	+25%
`polyveck_ntt`	✅	10s	6s	+67%
`polyveck_reduce`	✅	10s	9s	+11%
`poly_decompose_c`	✅	9s	7s	+29%
`polyveck_invntt_tomont`	✅	9s	8s	+12%
`polyveck_power2round`	✅	9s	11s	-18%
`polyveck_sub`	✅	9s	7s	+29%
`polyvecl_ntt`	✅	9s	8s	+12%
`keccak_absorb`	✅	8s	5s	+60%
`keccakf1600_permute`	✅	8s	9s	-11%
`polyeta_unpack`	✅	8s	7s	+14%
`sign_keypair_internal`	✅	8s	6s	+33%
`keccakf1600_permute_native`	✅	7s	9s	-22%
`mld_compute_pack_z`	✅	7s	7s	+0%
`mld_sample_s1_s2_serial`	✅	7s	6s	+17%
`polyveck_make_hint`	✅	7s	5s	+40%
`polyveck_shiftl`	✅	7s	8s	-12%
`keccak_finalize`	✅	6s	2s	+200%
`polyveck_unpack_eta`	✅	6s	4s	+50%
`polyvecl_chknorm`	✅	6s	5s	+20%
`polyvecl_pointwise_acc_montgomery_native`	✅	6s	5s	+20%
`sign_pk_from_sk`	✅	6s	8s	-25%
`unpack_hints`	✅	6s	5s	+20%
`mld_sample_s1_s2`	✅	5s	4s	+25%
`pack_sk`	✅	5s	2s	+150%
`poly_add`	✅	5s	4s	+25%
`poly_caddq`	✅	5s	3s	+67%
`poly_caddq_native`	✅	5s	4s	+25%
`poly_challenge`	✅	5s	4s	+25%
`poly_make_hint`	✅	5s	3s	+67%
`poly_pointwise_montgomery`	✅	5s	5s	+0%
`poly_uniform`	✅	5s	4s	+25%
`poly_uniform_eta`	✅	5s	3s	+67%
`poly_use_hint_c`	✅	5s	7s	-29%
`polyveck_pointwise_poly_montgomery`	✅	5s	6s	-17%
`polyvecl_pointwise_acc_montgomery`	✅	5s	5s	+0%
`rej_eta`	✅	5s	4s	+25%
`shake256_absorb`	✅	5s	5s	+0%
`sign_open`	✅	5s	5s	+0%
`sign_signature`	✅	5s	4s	+25%
`sign_signature_pre_hash_internal`	✅	5s	6s	-17%
`sign_signature_pre_hash_shake256`	✅	5s	4s	+25%
`sign_verify_extmu`	✅	5s	4s	+25%
`unpack_sk`	✅	5s	5s	+0%
`decompose`	✅	4s	5s	-20%
`keccakf1600x4_permute`	✅	4s	1s	+300%
`mld_ct_get_optblocker_u8`	✅	4s	2s	+100%
`mld_h`	✅	4s	3s	+33%
`montgomery_reduce`	✅	4s	2s	+100%
`ntt_native_x86_64`	✅	4s	3s	+33%
`poly_caddq_native_aarch64`	✅	4s	6s	-33%
`poly_chknorm_native`	✅	4s	4s	+0%
`poly_invntt_tomont`	✅	4s	6s	-33%
`poly_ntt`	✅	4s	5s	-20%
`poly_power2round`	✅	4s	5s	-20%
`poly_use_hint_native`	✅	4s	4s	+0%
`polyt1_unpack`	✅	4s	6s	-33%
`polyveck_chknorm`	✅	4s	5s	-20%
`polyveck_unpack_t0`	✅	4s	3s	+33%
`polyvecl_permute_bitrev_to_custom`	✅	4s	3s	+33%
`polyvecl_uniform_gamma1`	✅	4s	5s	-20%
`polyvecl_unpack_eta`	✅	4s	5s	-20%
`polyz_unpack_c`	✅	4s	5s	-20%
`power2round`	✅	4s	2s	+100%
`rej_eta_c`	✅	4s	3s	+33%
`shake128_init`	✅	4s	1s	+300%
`shake256`	✅	4s	2s	+100%
`sign_keypair`	✅	4s	3s	+33%
`sign_verify`	✅	4s	4s	+0%
`sign_verify_pre_hash_internal`	✅	4s	3s	+33%
`unpack_pk`	✅	4s	3s	+33%
`use_hint`	✅	4s	2s	+100%
`caddq`	✅	3s	3s	+0%
`fqscale`	✅	3s	4s	-25%
`intt_native_x86_64`	✅	3s	3s	+0%
`keccak_init`	✅	3s	3s	+0%
`keccakf1600_extract_bytes (big endian)`	✅	3s	3s	+0%
`make_hint`	✅	3s	3s	+0%
`mld_ct_abs_i32`	✅	3s	4s	-25%
`mld_ct_cmask_nonzero_u8`	✅	3s	2s	+50%
`mld_ct_get_optblocker_u32`	✅	3s	4s	-25%
`mld_keccakf1600_extract_bytes`	✅	3s	2s	+50%
`mld_prepare_domain_separation_prefix`	✅	3s	5s	-40%
`mld_value_barrier_u32`	✅	3s	3s	+0%
`pack_sig_c_h`	✅	3s	2s	+50%
`pack_sig_z`	✅	3s	2s	+50%
`poly_caddq_c`	✅	3s	3s	+0%
`poly_chknorm`	✅	3s	3s	+0%
`poly_decompose_native`	✅	3s	3s	+0%
`poly_ntt_native`	✅	3s	3s	+0%
`poly_uniform_gamma1_4x`	✅	3s	6s	-50%
`poly_use_hint`	✅	3s	3s	+0%
`polyeta_pack`	✅	3s	2s	+50%
`polyt0_pack`	✅	3s	4s	-25%
`polyt1_pack`	✅	3s	2s	+50%
`polyveck_pack_t0`	✅	3s	3s	+0%
`polyveck_pack_w1`	✅	3s	6s	-50%
`polyvecl_uniform_gamma1_serial`	✅	3s	4s	-25%
`polyvecl_unpack_z`	✅	3s	2s	+50%
`polyz_unpack_native`	✅	3s	3s	+0%
`reduce32`	✅	3s	3s	+0%
`rej_eta_native`	✅	3s	4s	-25%
`shake128x4_absorb_once`	✅	3s	4s	-25%
`shake256_finalize`	✅	3s	5s	-40%
`shake256_init`	✅	3s	2s	+50%
`shake256_release`	✅	3s	1s	+200%
`shake256_squeeze`	✅	3s	2s	+50%
`sign_signature_extmu`	✅	3s	4s	-25%
`sign_verify_pre_hash_shake256`	✅	3s	6s	-50%
`unpack_sig`	✅	3s	4s	-25%
`keccak_squeeze`	✅	2s	2s	+0%
`keccakf1600_xor_bytes`	✅	2s	3s	-33%
`keccakf1600x4_extract_bytes`	✅	2s	1s	+100%
`keccakf1600x4_xor_bytes`	✅	2s	2s	+0%
`mld_ct_cmask_neg_i32`	✅	2s	2s	+0%
`mld_ct_cmask_nonzero_u32`	✅	2s	2s	+0%
`mld_ct_get_optblocker_i64`	✅	2s	1s	+100%
`mld_value_barrier_i64`	✅	2s	2s	+0%
`mld_value_barrier_u8`	✅	2s	3s	-33%
`pack_pk`	✅	2s	3s	-33%
`poly_decompose`	✅	2s	3s	-33%
`poly_invntt_tomont_native`	✅	2s	7s	-71%
`poly_ntt_c`	✅	2s	2s	+0%
`poly_pointwise_montgomery_native`	✅	2s	2s	+0%
`poly_reduce`	✅	2s	5s	-60%
`poly_sub`	✅	2s	5s	-60%
`poly_uniform_gamma1`	✅	2s	5s	-60%
`polyveck_pack_eta`	✅	2s	4s	-50%
`polyvecl_pack_eta`	✅	2s	4s	-50%
`polyz_unpack`	✅	2s	2s	+0%
`shake128_absorb`	✅	2s	2s	+0%
`shake128_finalize`	✅	2s	3s	-33%
`shake128_release`	✅	2s	3s	-33%
`shake128_squeeze`	✅	2s	2s	+0%
`shake128x4_squeezeblocks`	✅	2s	1s	+100%
`shake256x4_absorb_once`	✅	2s	4s	-50%
`shake256x4_squeezeblocks`	✅	2s	3s	-33%
`sys_check_capability`	✅	2s	3s	-33%
`keccakf1600_xor_bytes (big endian)`	✅	1s	3s	-67%
`mld_ct_sel_int32`	✅	1s	3s	-67%
`poly_shiftl`	✅	1s	2s	-50%
`polyw1_pack`	✅	1s	1s	+0%
`polyz_pack`	✅	1s	4s	-75%

github-actions

Mac Mini (M1, 2020) benchmarks (opt)

Benchmark suite	Current: `8a19e9a`	Previous: `41da557`	Ratio
`ML-DSA-44 keypair`	`46205` cycles	`46203` cycles	`1.00`
`ML-DSA-44 sign`	`131278` cycles	`131278` cycles	`1`
`ML-DSA-44 verify`	`47765` cycles	`47768` cycles	`1.00`
`ML-DSA-65 keypair`	`81014` cycles	`81024` cycles	`1.00`
`ML-DSA-65 sign`	`215785` cycles	`215787` cycles	`1.00`
`ML-DSA-65 verify`	`80057` cycles	`80052` cycles	`1.00`
`ML-DSA-87 keypair`	`132158` cycles	`132151` cycles	`1.00`
`ML-DSA-87 sign`	`276862` cycles	`276816` cycles	`1.00`
`ML-DSA-87 verify`	`130418` cycles	`130384` cycles	`1.00`

This comment was automatically generated by workflow using github-action-benchmark.

github-actions

Mac Mini (M1, 2020) benchmarks (no-opt)

Benchmark suite	Current: `8a19e9a`	Previous: `41da557`	Ratio
`ML-DSA-44 keypair`	`114213` cycles	`114155` cycles	`1.00`
`ML-DSA-44 sign`	`418158` cycles	`417994` cycles	`1.00`
`ML-DSA-44 verify`	`122319` cycles	`122262` cycles	`1.00`
`ML-DSA-65 keypair`	`195508` cycles	`195499` cycles	`1.00`
`ML-DSA-65 sign`	`682497` cycles	`682470` cycles	`1.00`
`ML-DSA-65 verify`	`197760` cycles	`197741` cycles	`1.00`
`ML-DSA-87 keypair`	`322642` cycles	`322656` cycles	`1.00`
`ML-DSA-87 sign`	`864585` cycles	`864584` cycles	`1.00`
`ML-DSA-87 verify`	`328628` cycles	`328653` cycles	`1.00`

oqs-bot

Intel Xeon 4th gen (c7i)

Benchmark suite	Current: `8a19e9a`	Previous: `41da557`	Ratio
`ML-DSA-44 keypair`	`34677` cycles	`34696` cycles	`1.00`
`ML-DSA-44 sign`	`120151` cycles	`120195` cycles	`1.00`
`ML-DSA-44 verify`	`38151` cycles	`38145` cycles	`1.00`
`ML-DSA-65 keypair`	`61275` cycles	`60582` cycles	`1.01`
`ML-DSA-65 sign`	`202094` cycles	`200476` cycles	`1.01`
`ML-DSA-65 verify`	`62940` cycles	`62563` cycles	`1.01`
`ML-DSA-87 keypair`	`93525` cycles	`94602` cycles	`0.99`
`ML-DSA-87 sign`	`236210` cycles	`240494` cycles	`0.98`
`ML-DSA-87 verify`	`95587` cycles	`95761` cycles	`1.00`

oqs-bot

Intel Xeon 4th gen (c7i) (no-opt)

Benchmark suite	Current: `8a19e9a`	Previous: `41da557`	Ratio
`ML-DSA-44 keypair`	`93726` cycles	`93889` cycles	`1.00`
`ML-DSA-44 sign`	`333512` cycles	`333450` cycles	`1.00`
`ML-DSA-44 verify`	`99955` cycles	`99851` cycles	`1.00`
`ML-DSA-65 keypair`	`160065` cycles	`160390` cycles	`1.00`
`ML-DSA-65 sign`	`545794` cycles	`545908` cycles	`1.00`
`ML-DSA-65 verify`	`160881` cycles	`160887` cycles	`1.00`
`ML-DSA-87 keypair`	`267728` cycles	`267405` cycles	`1.00`
`ML-DSA-87 sign`	`707504` cycles	`707235` cycles	`1.00`
`ML-DSA-87 verify`	`270918` cycles	`269967` cycles	`1.00`

github-actions

Arm Cortex-A55 (Snapdragon 888) benchmarks (opt)

Benchmark suite	Current: `8a19e9a`	Previous: `41da557`	Ratio
`ML-DSA-44 keypair`	`276468` cycles	`277102` cycles	`1.00`
`ML-DSA-44 sign`	`818650` cycles	`810656` cycles	`1.01`
`ML-DSA-44 verify`	`276672` cycles	`278882` cycles	`0.99`
`ML-DSA-65 keypair`	`475323` cycles	`478906` cycles	`0.99`
`ML-DSA-65 sign`	`1367640` cycles	`1360800` cycles	`1.01`
`ML-DSA-65 verify`	`459822` cycles	`466415` cycles	`0.99`
`ML-DSA-87 keypair`	`825623` cycles	`818822` cycles	`1.01`
`ML-DSA-87 sign`	`1873209` cycles	`1878770` cycles	`1.00`
`ML-DSA-87 verify`	`800938` cycles	`794467` cycles	`1.01`

oqs-bot

AMD EPYC 3rd gen (c6a)

Benchmark suite	Current: `8a19e9a`	Previous: `41da557`	Ratio
`ML-DSA-44 keypair`	`69035` cycles	`69134` cycles	`1.00`
`ML-DSA-44 sign`	`187364` cycles	`187688` cycles	`1.00`
`ML-DSA-44 verify`	`69341` cycles	`69282` cycles	`1.00`
`ML-DSA-65 keypair`	`119503` cycles	`119368` cycles	`1.00`
`ML-DSA-65 sign`	`303527` cycles	`300862` cycles	`1.01`
`ML-DSA-65 verify`	`115926` cycles	`115513` cycles	`1.00`
`ML-DSA-87 keypair`	`203793` cycles	`203546` cycles	`1.00`
`ML-DSA-87 sign`	`394456` cycles	`394636` cycles	`1.00`
`ML-DSA-87 verify`	`195809` cycles	`195483` cycles	`1.00`

oqs-bot

Intel Xeon 3rd gen (c6i)

Benchmark suite	Current: `8a19e9a`	Previous: `41da557`	Ratio
`ML-DSA-44 keypair`	`57235` cycles	`56751` cycles	`1.01`
`ML-DSA-44 sign`	`181496` cycles	`181670` cycles	`1.00`
`ML-DSA-44 verify`	`61165` cycles	`61146` cycles	`1.00`
`ML-DSA-65 keypair`	`98680` cycles	`98647` cycles	`1.00`
`ML-DSA-65 sign`	`298309` cycles	`298480` cycles	`1.00`
`ML-DSA-65 verify`	`100528` cycles	`100288` cycles	`1.00`
`ML-DSA-87 keypair`	`152581` cycles	`152587` cycles	`1.00`
`ML-DSA-87 sign`	`355291` cycles	`355235` cycles	`1.00`
`ML-DSA-87 verify`	`153950` cycles	`153556` cycles	`1.00`

oqs-bot

Graviton4

Benchmark suite	Current: `8a19e9a`	Previous: `41da557`	Ratio
`ML-DSA-44 keypair`	`68156` cycles	`68132` cycles	`1.00`
`ML-DSA-44 sign`	`202004` cycles	`201919` cycles	`1.00`
`ML-DSA-44 verify`	`70775` cycles	`70781` cycles	`1.00`
`ML-DSA-65 keypair`	`120970` cycles	`120914` cycles	`1.00`
`ML-DSA-65 sign`	`331183` cycles	`331101` cycles	`1.00`
`ML-DSA-65 verify`	`117884` cycles	`117908` cycles	`1.00`
`ML-DSA-87 keypair`	`198649` cycles	`198347` cycles	`1.00`
`ML-DSA-87 sign`	`427544` cycles	`427112` cycles	`1.00`
`ML-DSA-87 verify`	`194417` cycles	`194311` cycles	`1.00`

oqs-bot

AMD EPYC 3rd gen (c6a) (no-opt)

Benchmark suite	Current: `8a19e9a`	Previous: `41da557`	Ratio
`ML-DSA-44 keypair`	`135070` cycles	`134705` cycles	`1.00`
`ML-DSA-44 sign`	`526006` cycles	`524023` cycles	`1.00`
`ML-DSA-44 verify`	`147853` cycles	`147704` cycles	`1.00`
`ML-DSA-65 keypair`	`226865` cycles	`226528` cycles	`1.00`
`ML-DSA-65 sign`	`860582` cycles	`861852` cycles	`1.00`
`ML-DSA-65 verify`	`235373` cycles	`235761` cycles	`1.00`
`ML-DSA-87 keypair`	`370367` cycles	`371080` cycles	`1.00`
`ML-DSA-87 sign`	`1079627` cycles	`1079785` cycles	`1.00`
`ML-DSA-87 verify`	`382615` cycles	`383268` cycles	`1.00`

oqs-bot

AMD EPYC 4th gen (c7a)

Benchmark suite	Current: `8a19e9a`	Previous: `41da557`	Ratio
`ML-DSA-44 keypair`	`41639` cycles	`42042` cycles	`0.99`
`ML-DSA-44 sign`	`134495` cycles	`135046` cycles	`1.00`
`ML-DSA-44 verify`	`44953` cycles	`45886` cycles	`0.98`
`ML-DSA-65 keypair`	`72877` cycles	`72408` cycles	`1.01`
`ML-DSA-65 sign`	`214749` cycles	`215490` cycles	`1.00`
`ML-DSA-65 verify`	`73910` cycles	`73252` cycles	`1.01`
`ML-DSA-87 keypair`	`107778` cycles	`107965` cycles	`1.00`
`ML-DSA-87 sign`	`252308` cycles	`254024` cycles	`0.99`
`ML-DSA-87 verify`	`109196` cycles	`111034` cycles	`0.98`

oqs-bot

Intel Xeon 3rd gen (c6i) (no-opt)

Benchmark suite	Current: `8a19e9a`	Previous: `41da557`	Ratio
`ML-DSA-44 keypair`	`157593` cycles	`157623` cycles	`1.00`
`ML-DSA-44 sign`	`550359` cycles	`549610` cycles	`1.00`
`ML-DSA-44 verify`	`169225` cycles	`169078` cycles	`1.00`
`ML-DSA-65 keypair`	`267977` cycles	`267943` cycles	`1.00`
`ML-DSA-65 sign`	`903637` cycles	`902493` cycles	`1.00`
`ML-DSA-65 verify`	`274125` cycles	`274108` cycles	`1.00`
`ML-DSA-87 keypair`	`450990` cycles	`447542` cycles	`1.01`
`ML-DSA-87 sign`	`1162617` cycles	`1156527` cycles	`1.01`
`ML-DSA-87 verify`	`460584` cycles	`457749` cycles	`1.01`

oqs-bot

Graviton3

Benchmark suite	Current: `8a19e9a`	Previous: `41da557`	Ratio
`ML-DSA-44 keypair`	`72258` cycles	`72244` cycles	`1.00`
`ML-DSA-44 sign`	`211991` cycles	`212021` cycles	`1.00`
`ML-DSA-44 verify`	`75712` cycles	`75740` cycles	`1.00`
`ML-DSA-65 keypair`	`127432` cycles	`127429` cycles	`1.00`
`ML-DSA-65 sign`	`350175` cycles	`350138` cycles	`1.00`
`ML-DSA-65 verify`	`125364` cycles	`125365` cycles	`1.00`
`ML-DSA-87 keypair`	`208138` cycles	`208164` cycles	`1.00`
`ML-DSA-87 sign`	`448958` cycles	`448891` cycles	`1.00`
`ML-DSA-87 verify`	`205105` cycles	`205092` cycles	`1.00`

oqs-bot

Graviton4 (no-opt)

Benchmark suite	Current: `8a19e9a`	Previous: `41da557`	Ratio
`ML-DSA-44 keypair`	`128309` cycles	`128287` cycles	`1.00`
`ML-DSA-44 sign`	`447743` cycles	`447655` cycles	`1.00`
`ML-DSA-44 verify`	`138349` cycles	`144617` cycles	`0.96`
`ML-DSA-65 keypair`	`220300` cycles	`220134` cycles	`1.00`
`ML-DSA-65 sign`	`727626` cycles	`727309` cycles	`1.00`
`ML-DSA-65 verify`	`223200` cycles	`223042` cycles	`1.00`
`ML-DSA-87 keypair`	`365101` cycles	`365095` cycles	`1.00`
`ML-DSA-87 sign`	`926593` cycles	`926085` cycles	`1.00`
`ML-DSA-87 verify`	`372803` cycles	`372794` cycles	`1.00`

oqs-bot

AMD EPYC 4th gen (c7a) (no-opt)

Benchmark suite	Current: `8a19e9a`	Previous: `41da557`	Ratio
`ML-DSA-44 keypair`	`120283` cycles	`123215` cycles	`0.98`
`ML-DSA-44 sign`	`447117` cycles	`449447` cycles	`0.99`
`ML-DSA-44 verify`	`131120` cycles	`129997` cycles	`1.01`
`ML-DSA-65 keypair`	`205159` cycles	`204042` cycles	`1.01`
`ML-DSA-65 sign`	`729240` cycles	`726667` cycles	`1.00`
`ML-DSA-65 verify`	`210548` cycles	`209895` cycles	`1.00`
`ML-DSA-87 keypair`	`336772` cycles	`336983` cycles	`1.00`
`ML-DSA-87 sign`	`923968` cycles	`923345` cycles	`1.00`
`ML-DSA-87 verify`	`346738` cycles	`346079` cycles	`1.00`

willieyz · 2026-02-12T06:43:20Z

Please apply the same changes as requested in #905

Hello @mkannwischer, I have applied the same changes requested in #905, including:

- Remove all uses of # for comments; use // and /*...*/ instead
- Remove all vzeroupper instructions
- Use 32-bit constants instead of 64-bit ones
- Extract the decompose32/88 and use_hint32/88 macros (referencing the AArch64 versions), and add brief comments in the same style

Thank you for your help!

mkannwischer

Thanks @willieyz.
Some comments on how this can be improved.

mkannwischer · 2026-02-22T02:25:11Z

dev/x86_64/src/poly_use_hint_32_avx2.S

+/* Reference:
+ * - @[REF_AVX2] calls poly_decompose to compute all a1, a0 before the loop.
+ * - Our implementation of decompose() is slightly different from that in
+ *   @[REF_AVX2]. See poly_decompose_32_avx2.c for more information.


You are referring to a file here that does not exist.

Fixed.

I have rephrased the description of the file name from poly_decompose_32_avx2.c to poly_decompose_32_avx2, since we plan to remove all intrinsics and transition to pure assembly implementations.
Thank you for your review and help!

mkannwischer · 2026-02-22T02:25:22Z

dev/x86_64/src/poly_use_hint_88_avx2.S

+/* Reference:
+ * - @[REF_AVX2] calls poly_decompose to compute all a1, a0 before the loop.
+ * - Our implementation of decompose() is slightly different from that in
+ *   @[REF_AVX2]. See poly_decompose_88_avx2.c for more information.


You are referencing a file here that does not exist anymore.

Fixed, same as the previous one. Thank you for your help!

mkannwischer · 2026-02-22T02:27:16Z

dev/x86_64/src/poly_use_hint_88_avx2.S

+
+// delta = (a0 <= 0) ? -1 : 1
+vpcmpgtd	%ymm5, \a, \a            
+vandnps	\h, \a, \a               


What's the reason for using vandnps over vpandn? vpandn seems more natural and is what was used in the intrinsics.

Fixed.

The reason I originally used vandnps instead of vpandn is that the draft AVX2 assembly was generated by GCC on an x86 Linux platform, and the compiler emitted vandnps—likely due to its internal optimization decisions. However, this is not appropriate for our use case.

I have replaced it with vpandn as you suggested. Thank you for your help.
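As a reading aid, here is a scalar sketch of what one 32-bit lane of the quoted instruction pair computes. Both vandnps and vpandn produce the same bits, `(~src1) & src2`; the difference is only that vpandn is the integer-domain form. The function names are hypothetical, and the zero comparand is an assumption based on the `vpxor %xmm5, %xmm5, %xmm5` quoted elsewhere in this review for the same file.

```c
#include <stdint.h>

/* One 32-bit lane of vpcmpgtd: all-ones if a > b (signed), else zero. */
static inline uint32_t lane_cmpgt32(int32_t a, int32_t b)
{
  return (a > b) ? UINT32_C(0xFFFFFFFF) : UINT32_C(0);
}

/* One 32-bit lane of vpandn (and, bit for bit, of vandnps): (~mask) & x. */
static inline uint32_t lane_andn32(uint32_t mask, uint32_t x)
{
  return ~mask & x;
}

/* The quoted pair, assuming the comparand register holds zero:
 * the hint h survives where a0 <= 0 and is cleared where a0 > 0. */
static inline uint32_t hint_where_nonpositive(int32_t a0, uint32_t h)
{
  return lane_andn32(lane_cmpgt32(a0, 0), h);
}
```

This only models the two instructions shown in the diff; how the macro turns the result into the final delta is not reproduced here.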

mkannwischer · 2026-02-22T02:28:28Z

dev/x86_64/src/poly_use_hint_88_avx2.S

+
+use_hint88_avx2 %ymm0, %ymm0, %ymm1, %ymm9, %ymm10, %ymm11, %ymm12
+
+vmovaps	%ymm0, (%rdi,%rax)


Same here. Why not use vmovdqa, since we are dealing with integers?

Fixed.

The reason is similar to the previous one; I have fixed it according to your suggestion.
Thank you for your help!

mkannwischer · 2026-02-22T02:34:35Z

dev/x86_64/src/poly_use_hint_32_avx2.S

+
+mld_poly_use_hint_32_avx2_loop:
+vmovdqa	(%rsi,%rax), %ymm0
+vmovdqa	(%rdx,%rax), %ymm2


You can fuse this load with the subsequent vpandn, saving an instruction.

mkannwischer · 2026-02-22T03:05:33Z

dev/x86_64/src/poly_use_hint_32_avx2.S

+vmovd	%ecx, %xmm5
+movl	$22784256, %ecx    /* q_bound: 31*GAMMA2 = 8118528 (stored as 22784256 due to encoding) */
+
+movl $512, %r9d             /* 512, 0 alternating (for vpmulhrsw) */ 


Same here, I don't think the comment is very helpful. I can see that 512 is what is being written.
A comment explaining what 512 is used for later would be more useful.

Fixed, now changed to:

/* round(x * 2^9 / 2^15) => round(x / 2^6), f1 = round(f1''/ 2^6)*/

According to the comment in poly_decompose_32_avx2.c.
Thank you for your review and help.

mkannwischer · 2026-02-22T03:05:54Z

dev/x86_64/src/poly_use_hint_88_avx2.S

+xorl	%eax, %eax
+vpxor	%xmm5, %xmm5, %xmm5
+
+movl $11275, %r8d        /* 11275, 0 alternating (for vpmulhuw) */ 


Fixed.

For this one, I followed your previous suggestion and the check-magic description in poly_decompose_88_avx2.c.
The comment has now been changed to:

/* check-magic: 11275 == floor(2^24 / 1488) */

Thank you for your review and help!
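To make the check-magic line easier to verify, the sketch below recomputes the constants and models one 16-bit lane of vpmulhuw. The reading that 1488 equals 2*GAMMA2/128 for GAMMA2 = (q-1)/88 is my interpretation of the constant (the arithmetic itself is exact); the macro and function names are hypothetical and do not come from the PR.

```c
#include <stdint.h>

#define MLDSA_Q_SKETCH 8380417                      /* ML-DSA modulus q */
#define GAMMA2_88 ((MLDSA_Q_SKETCH - 1) / 88)       /* 95232 */
#define DIVISOR_88 (2 * GAMMA2_88 / 128)            /* 1488 */
#define MAGIC_88 ((1u << 24) / (unsigned)DIVISOR_88) /* floor(2^24 / 1488) */

/* One 16-bit lane of vpmulhuw: high half of the unsigned 16x16 product,
 * i.e. floor(a * b / 2^16). */
static inline uint16_t lane_mulhi_u16(uint16_t a, uint16_t b)
{
  return (uint16_t)(((uint32_t)a * (uint32_t)b) >> 16);
}
```

Multiplying by 11275 this way divides by roughly 1488 and leaves a factor of 2^8 still to be removed, since 2^24 = 2^16 * 2^8. For instance, `lane_mulhi_u16(1488, 11275)` is 255, just under 256.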

mkannwischer · 2026-02-22T03:05:59Z

dev/x86_64/src/poly_use_hint_88_avx2.S

+vmovd	%ecx, %xmm4
+movl	$8285184, %ecx       /* 87*GAMMA2 = 8285184 (wrap-around threshold) */
+
+movl $128, %r9d        /* 128, 0 alternating (for vpmulhrsw)  */  


Fixed, now changed to:

/* round(x * 2^7 / 2^15) => round(x / 2^8), for f1 = round(f1''/ 2^8)*/

According to the comment in poly_decompose_88_avx2.c.
Thank you for your review and help.
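The two revised comments (for 512 and for 128) rely on the same property of vpmulhrsw: multiplying by 2^k yields a rounded division by 2^(15-k). A scalar sketch of one 16-bit lane, under the assumption that right-shifting a negative value is arithmetic (implementation-defined in C, but the behaviour of mainstream compilers), could look like this. The function names are hypothetical.

```c
#include <stdint.h>

/* One 16-bit lane of vpmulhrsw: (((a * b) >> 14) + 1) >> 1.
 * Assumes arithmetic right shift for negative intermediate values. */
static inline int16_t lane_mulhrs16(int16_t a, int16_t b)
{
  int32_t t = (int32_t)a * (int32_t)b;
  return (int16_t)(((t >> 14) + 1) >> 1);
}

/* b = 512 = 2^9: round(x * 2^9 / 2^15) = round(x / 2^6). */
static inline int16_t round_div64(int16_t x)
{
  return lane_mulhrs16(x, 512);
}

/* b = 128 = 2^7: round(x * 2^7 / 2^15) = round(x / 2^8). */
static inline int16_t round_div256(int16_t x)
{
  return lane_mulhrs16(x, 128);
}
```

Ties round up here: `round_div64(32)` is 1 while `round_div64(31)` is 0.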

mkannwischer · 2026-02-22T03:06:56Z

dev/x86_64/src/poly_use_hint_88_avx2.S

+vmovd %r9d, %xmm7
+vpbroadcastd %xmm7, %ymm7
+
+movl $43, %r10d                       /* Load 43 constant (for blend and comparison) */ 


This comment is also not very useful. I can see from the instruction that this loads 43; that does not need a comment.
Something like /* max a1 value */ may be useful.
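For context on why 43 is the maximum a1 value, here is a scalar sketch of the hint rule as I understand it from the ML-DSA reference semantics: with GAMMA2 = (q-1)/88 the high part a1 lies in [0, 43] and wraps at both ends, while with GAMMA2 = (q-1)/32 it lies in [0, 15] and wraps with a 4-bit mask. The function names are hypothetical and this is not the assembly from the PR.

```c
#include <stdint.h>

/* GAMMA2 = (q-1)/88: a1 is in [0, 43]; a set hint moves a1 by
 * delta = (a0 <= 0) ? -1 : 1, wrapping past either end. */
static inline int32_t use_hint_88(int32_t a1, int32_t a0, int hint)
{
  if (hint == 0)
  {
    return a1;
  }
  if (a0 > 0)
  {
    return (a1 == 43) ? 0 : a1 + 1;
  }
  return (a1 == 0) ? 43 : a1 - 1;
}

/* GAMMA2 = (q-1)/32: a1 is in [0, 15], so the wrap is a mask with 15. */
static inline int32_t use_hint_32(int32_t a1, int32_t a0, int hint)
{
  if (hint == 0)
  {
    return a1;
  }
  return (a0 > 0) ? ((a1 + 1) & 15) : ((a1 - 1) & 15);
}
```

The comparison against 43 in the 88 variant is what the broadcast constant serves; the 32 variant needs no such constant because masking handles the wrap.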

mkannwischer · 2026-02-22T03:08:21Z

dev/x86_64/src/poly_use_hint_88_avx2.S

+movl $43, %r10d                       /* Load 43 constant (for blend and comparison) */ 
+vmovd %r10d, %xmm6
+vpbroadcastd %xmm6, %ymm6
+
+vmovd	%ecx, %xmm3
+movl	$43, %ecx


Why do we need two copies of this 43? Can they be merged into one?

I have removed the duplicate and updated the vpcmpgtd in decompose88_avx2 to use ymm6. The duplicate copies were there because GCC generated them in the AVX2 output, but now that you have pointed this out, I agree they should be merged. Thank you for your help!

This commit adds poly_use_hint to bench --components for benchmarking the performance impact of the changes to: - poly_use_hint_32 - poly_use_hint_88 Signed-off-by: willieyz <willie.zhao@chelpis.com>

In this PR, we replace the AVX2 intrinsics implementation of poly_use_hint_32 and poly_use_hint_88 with a x86_64 assembly version, this is part of the effort to enable HOL-Light proofs. Signed-off-by: willieyz <willie.zhao@chelpis.com>

This commit extract the decompose 32/88 and use_hint 32/88 as a macro. Signed-off-by: willieyz <willie.zhao@chelpis.com>

willieyz force-pushed the eliminate-use_hint_32_88-intrinsics branch 9 times, most recently from 1ea9d5f to 8a19e9a Compare February 5, 2026 06:05

willieyz marked this pull request as ready for review February 5, 2026 06:39

willieyz requested a review from a team as a code owner February 5, 2026 06:39

willieyz added the benchmark label Feb 5, 2026

willieyz marked this pull request as draft February 5, 2026 07:19

github-actions bot reviewed Feb 5, 2026

oqs-bot reviewed Feb 5, 2026

github-actions bot reviewed Feb 5, 2026

oqs-bot reviewed Feb 5, 2026

willieyz force-pushed the eliminate-use_hint_32_88-intrinsics branch 8 times, most recently from 81d5192 to 377fdc8 Compare February 12, 2026 06:26

willieyz marked this pull request as ready for review February 12, 2026 06:43

mkannwischer requested changes Feb 22, 2026

Add components benchmark for poly_use_hint_32 and poly_use_hint_88

2027477

willieyz force-pushed the eliminate-use_hint_32_88-intrinsics branch 10 times, most recently from f820653 to 91cddd6 Compare February 24, 2026 10:16

willieyz marked this pull request as draft February 25, 2026 01:24

willieyz force-pushed the eliminate-use_hint_32_88-intrinsics branch from 5bc97db to b8c27b4 Compare February 25, 2026 02:35

Extract Macro: decompose32/88 and use_hint32/88

e5ad167

willieyz force-pushed the eliminate-use_hint_32_88-intrinsics branch 2 times, most recently from d3f2de3 to e5ad167 Compare February 25, 2026 03:49

willieyz marked this pull request as ready for review February 25, 2026 03:56

willieyz requested a review from mkannwischer February 25, 2026 05:53



