We improve fmpz_is_prime for input between 65 and 81 bits, more precisely for $2^{64} \le n < 3317044064679887385961981$, with the addition of a new function n_ll_is_prime.
The speedup is around 4x for random input, 5x for primes, and 4.5x for composites without small factors. fmpz_is_probabprime is also sped up nearly 2x by using the same algorithm, which is now faster than the previous probabilistic BPSW test (though it could be sped up further in the future by switching back to a more optimized BPSW).
We were already using the certified Sorenson-Webster Miller-Rabin test with 13 bases here, which, as far as I know, is the best rigorous primality test for this range. However, it was previously implemented using plain fmpz functions, and although the inner loop is in mpn_powm, which is pretty decent (GMP has specializations for two-limb moduli), we can do better.
The whole primality test is now implemented using optimized "one-and-a-half-limb arithmetic", i.e. double-limb arithmetic where we exploit the fact that we have a lot of slack in the high limb. This allows doing most of the modular arithmetic on non-canonical residues in an extremely streamlined way (even more streamlined than Montgomery arithmetic): the squaring + mod n step in modular exponentiation is just
```
sqrlo_3x2x2
mulhi_2x2_sloppy
mullo_2x2
sub_ddmmss
```
which looks quite optimal, and the scalar multiplication by the small base is just
```
mullo_2x1
```
without the need for reduction mod n. Currently the modular arithmetic is entirely ad hoc for this primality test, but this could be used elsewhere, e.g. for the Lucas test to speed up fmpz_is_probabprime.
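The flavor of this reduction can be sketched in Python with arbitrary-precision integers. The cutoff `K` and the names `precompute`/`sqr_mod_sloppy` are made up for the sketch; the real code works on fixed 64-bit limbs with the primitives listed above.

```python
# Sketch of Barrett-style "sloppy" modular squaring on non-canonical residues.
# Residues are never fully reduced: we only guarantee 0 <= r < 3*n, which for
# n < 2^81 leaves plenty of slack in the high limb of a double-limb value.
K = 192  # chosen so that r*r < 2^K for every residue r we produce

def precompute(n):
    """One-time precomputed reciprocal, floor(2^K / n)."""
    return (1 << K) // n

def sqr_mod_sloppy(x, n, ninv):
    s = x * x            # cf. the limb-level squaring (sqrlo_3x2x2)
    q = (s * ninv) >> K  # approximate quotient (cf. mulhi_2x2_sloppy)
    return s - q * n     # cf. mullo_2x2 + sub_ddmmss; 0 <= result < 3*n
```

Because `q` never overshoots `floor(s/n)` and undershoots it by at most 2, the result stays below `3*n`, so the invariant "residue fits with slack in the high limb" is preserved across squarings without ever computing a canonical reduction.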
A second improvement is to do fused Miller-Rabin tests three bases at a time. No actual SIMD; we just interleave the operations of three modular exponentiations to improve instruction level parallelism. On Zen 3 this speeds up the modular exponentiations by about 1.8x.
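The interleaving can be illustrated in Python (a sketch of the structure only; in Python there is no ILP to gain, but in C the three independent squaring chains keep the multiplier's pipeline busy):

```python
def powmod3(bases, e, n):
    """Left-to-right binary exponentiation of three bases in lockstep.

    Each bit of the exponent triggers three independent squarings (and
    optionally three independent small-base multiplications), which a
    superscalar core can overlap.
    """
    r = [1, 1, 1]
    for bit in bin(e)[2:]:
        r = [x * x % n for x in r]                     # 3 independent squarings
        if bit == '1':
            r = [x * b % n for x, b in zip(r, bases)]  # 3 small-base muls
    return r
```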
(With AVX512 I expect that a true SIMD implementation doing all 12 odd bases simultaneously should be even faster. I'm less certain about AVX2 due to the lack of 64-bit multiplication. One could probably use pairs of double to represent ~100-bit integers for this algorithm, but that would be much messier to implement.)
A third improvement is to reimplement trial division using the Granlund-Montgomery divisibility check and to reduce the number of trial primes.
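The Granlund-Montgomery check replaces each trial division by a multiplication with a precomputed inverse modulo the word size: for odd d and x below the word bound, d divides x exactly when x * d^{-1} mod 2^64 <= floor((2^64 - 1) / d). A Python sketch (helper names are mine):

```python
W = 64
MASK = (1 << W) - 1

def gm_precompute(d):
    """Precompute (d^{-1} mod 2^64, floor((2^64 - 1) / d)) for odd d."""
    return pow(d, -1, 1 << W), MASK // d

def gm_is_divisible(x, dinv, limit):
    """For x < 2^64 and odd d: d | x  <=>  (x * dinv) mod 2^64 <= limit."""
    return (x * dinv) & MASK <= limit
```

The point is that the per-prime work at test time is one multiplication and one comparison, with no division instruction at all.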
Detailed timings for fmpz_is_prime; average time in seconds per call testing $10^5$ random integers of the given bit size:
I was also curious to see how our primality test compares to primesieve ($n < 2^{64}$ only) and the relatively new implementation of the pseudosquares prime sieve by @kimwalisch for generating/counting all primes in a large interval.
For (exactly) 64-bit integers just calling n_is_prime in a loop is competitive with primesieve on intervals of width up to about $10^7$ single-threaded, but maybe up to about $10^9$ multithreaded, depending on how many cores you have (we obviously get nearly perfect linear speedup just doing completely independent primality tests in parallel).
For 65-bit integers n_ll_is_prime seems to be consistently at least 4x faster than pseudosquares_prime_sieve. The results look pretty similar for 81-bit integers. At least for the sizes I've tested there doesn't seem to be an asymptotic trend suggesting an eventual cutoff where pseudosquares_prime_sieve wins (only the hard 81-bit limit for n_ll_is_prime); if there is a cutoff, it would seem to be for intervals significantly wider than $10^{10}$.
Obviously a hybrid method where you replace independent trial divisions with partial sieving should do a bit better yet.
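Such a hybrid would look roughly like this (a sketch; `is_prime` stands in for whatever full primality test is used on the survivors):

```python
def primes_in_interval(a, b, small_primes, is_prime):
    """Primes in [a, b): sieve out multiples of small primes across the
    whole interval first, then run the (expensive) primality test only
    on the survivors, replacing per-number trial division."""
    survives = [True] * (b - a)
    for p in small_primes:
        start = (-a) % p
        if a + start == p:   # don't cross out p itself
            start += p
        for i in range(start, b - a, p):
            survives[i] = False
    return [a + i for i in range(b - a) if survives[i] and is_prime(a + i)]
```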
> We were already using the certified Sorenson-Webster Miller-Rabin test with 13 bases here which as far as I know is the best rigorous primality test for this range.
Actually https://arxiv.org/abs/1806.08697 suggests that BPSW is provably correct for $n < 2^{80}$ (at least if one can rule out numbers with four or more prime factors, e.g. with preliminary sieving). This could be about 3x faster for prime input. However, one would have to check very carefully that the specific parameters of the test match the hypotheses (this seems to be with a Fibonacci test while fmpz_is_probabprime_BPSW uses a Lucas test).
I've fixed an embarrassing bug in the trial division in the first version and extended n_ll_is_prime to do a partial primality test (trial division + base-2 sprp) for all <=128-bit integers when the certified test isn't applicable. This allows the function to be used for all <=128-bit input to handle the easy cases in fmpz_is_prime and fmpz_is_probabprime.
The improvement for fmpz_is_prime is pretty negligible between 82 and 128 bits since actually certifying candidate primes dominates everything else, but the speedup for fmpz_is_probabprime is not bad.
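For reference, the base-2 strong probable prime (sprp) test used in the partial test is the standard one (Python sketch):

```python
def is_sprp_base2(n):
    """Base-2 strong probable prime test: write n - 1 = d * 2^s with d odd;
    n passes if 2^d == 1 (mod n) or 2^(d*2^j) == -1 (mod n) for some
    0 <= j < s. Every prime passes; rare composites do too."""
    if n < 2:
        return False
    if n % 2 == 0:
        return n == 2
    d, s = n - 1, 0
    while d % 2 == 0:
        d //= 2
        s += 1
    x = pow(2, d, n)
    if x == 1 or x == n - 1:
        return True
    for _ in range(s - 1):
        x = x * x % n
        if x == n - 1:
            return True
    return False
```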
fredrik-johansson changed the title from "Faster primality test between 65 and 81 bits" to "Faster primality test between 65 and 128 bits" (Apr 3, 2026)