Optimize trial division and half-limb probable prime test in n_is_prime #2616
Merged
fredrik-johansson merged 1 commit into flintlib:main on Mar 22, 2026
Yet another update to `n_is_prime`, giving up to 1.7x speedup in the 21-32 bit range and up to 1.1x speedup in the 33-64 bit range.

To compute 2^e mod n for n < 2^32, we use fast 64-bit division with remainder (Fast single-word division and remainder, #2597) instead of Shoup mulmods. This doesn't seem to be significantly faster by itself, but it allows doing the multiplications by 2 without modular reduction when n < 2^31. Combined with making the multiplications by 2 branch-free, this gives most of the speedup.
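As a rough illustration of the unreduced, branch-free doubling, here is a minimal C sketch. It is not the FLINT code: `pow2_mod_u31` is a hypothetical name, and the plain `%` operator stands in for the precomputed-inverse division from #2597. The sketch assumes the n < 2^31 bound matters because a doubled residue then stays below 2^32, so its square still fits in 64 bits.

```c
#include <stdint.h>

/* Hypothetical helper (not a FLINT function): returns 2^e mod n
   for 1 <= n < 2^31, using left-to-right binary exponentiation. */
static uint32_t pow2_mod_u31(uint32_t e, uint32_t n)
{
    uint64_t x = 1;
    int i;

    for (i = 31; i >= 0; i--)
    {
        /* On entry x < 2n < 2^32, so x * x < 2^64 cannot overflow.
           The plain % here is a stand-in for a faster division
           with a precomputed inverse. */
        x = (x * x) % n;

        /* Multiply by 2 when bit i of e is set, without a branch
           and without reducing: x < n afterwards gives 2x < 2^32. */
        x <<= (e >> i) & 1;
    }

    /* One final reduction removes the slack left by the last doubling. */
    return (uint32_t) (x % n);
}
```

With this helper, a base-2 Fermat check for an odd n below 2^31 would be `pow2_mod_u31(n - 1, n) == 1`. A strong probable prime test would inspect intermediate squarings as well, which the sketch leaves out.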
Previously we did branchy trial division by 18 odd primes for n < 2^32 and 34 odd primes for n > 2^32, relying on the compiler to turn `!(n%19)` into something efficient. Current GCC generates suboptimal code for divisibility testing by constants, so we reimplement the trial division using Granlund-Montgomery. In addition, for n < 2^32, we keep the divisions by 3 and 5 branchy, but do the rest with branch-free 32-bit code, increasing the number of trial primes to 34. GCC turns this block of 32 trial divisions into a handful of AVX2 instructions which on average execute faster than a loop with early abort. (In the worst case for this method, i.e. if one only inputs n which are divisible by 7 but not by 2, 3 or 5, this is only 15% slower than the old code.) With AVX512 this could make sense in the 64-bit case as well.

Detailed speedups by bit size and input type: