Skip to content

lia-andrew/prime-sieve

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

1 Commit
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Prime Sieve

This project provides an extremely fast and efficient way to find all primes within a provided range.

The purpose of this project was to attempt push the boundary on how fast prime sieving can be done. I can happily claim to have achieved that purpose. I also believe that the code is clean, structured, documented, and understandable enough that it can be used to learn more about prime sieve theory for those who are interested.

Description of Algorithm

The efficiency of this prime sieve implementation is primarily achieved through a parallel, wheel-factorized, segmented implementation of the Sieve of Eratosthenes. Each key part will be discussed below.

Segmentation

Segmenting the sieve does not reduce algorithmic complexity, but massively speeds up the process due to cache locality. Optimizing for improved cache locality lets the program loop over the same block of memory without (or at least far fewer) cache misses; these cost a large amount of CPU cycles. A typical cache miss (requiring access to RAM) can easily cost 100+ CPU cycles, compared to a l1 cache hit costing ~10 CPU cycles or less.

Each segment is "compressed" into a bitset, where each bit represents an odd number (even numbers do not need to be represented). Thus, in 64 bits (= 8 bytes) we can represent 128 prime candidates. Combining this with the idea of segmentation, we can now represent a large quantity of prime candidates in a small block of memory stored very close to the CPU.

p.s. On the topic of cache locality, CPU instructions themselves take space in memory. Due to this, the fewer CPU instructions, the easier it is to store them close to the CPU. This prime sieve has been highly optimized and reduced to as little C code as possible, making it just that little bit faster.

Wheel Factorization

The concept of Wheel Factorization can be understood with a small example. Before, it was mentioned that "even numbers do not need to be represented"; why is this? By definition, an even number is divisible by 2, thus all even numbers ( besides 2) cannot be prime, also by definition. This means all represented numbers (those which are not even) will be coprime to 2. This tells us two things: we do not need to check if a prime candidate is divisible by 2 (we already know it is not), and we can skip checking 50% of all numbers. If we abstract from this idea of skipping even numbers and focus more on skipping numbers dividable by 2, the same technique can be applied to skip numbers which are dividable by a specific prime number. We can combine these "skips" to form a wheel. This project uses a (2 * 3 * 5 =) 30 wheel. This means that only numbers that are coprime to 2 and 3 and 5 are checked for primality. This now skips 73.33% of all numbers!

This might leave a question of why not use to a bigger (e.g. 2 * 3 * 5 * 7 = 210) wheel. Each factor that is added to the wheel provides diminishing returns. Skipping all numbers divisible by 2 skips 50% of all numbers, skipping all numbers divisible by 2 and 3 (wheel of 6) skips 67.66% of all numbers, and, as mentioned before, a wheel of 30 skips 73.33% of all numbers. Some other notably efficient prime sieves indeed utilize a 210 wheel, but I found a wheel of 30 to be more efficient to work with. The reasons for this are too detailed to describe here.

Parallelization (via OpenMP)

When attempting to parallelize an algorithm, the goal is to make it embarrassing. Essentially, we want to take one big task and segment it into small, independent tasks. Fortunately the first part, small tasks, is already achieved by segmenting our sieve, which was described before.

The second part, independent tasks, takes more work. The name "wheel" of the aforementioned Wheel Factorization is named as such because of its cyclic nature. If you spin a wheel fast and let go of it, it is difficult to track exactly which part of the rotation it is in (e.g. is it halfway through a full rotation? One-third?). In the case of a wheel of 30, there are exactly 8 (the amount of numbers coprime to 30) parts to a rotation. To make each segment fully independent, there cannot be a global state storing what part the wheel currently is in. Thus, the work needed to be done is to figure out for each segment's starting point, what part of the rotation the wheel is in. The math to achieve this will not be described here.

With the ability to independently process segments of the sieve, multiple segments can be processed in parallel. This allows us to have a queue of segments, with the segment at the front being the next in line after the current segment's primality checks have been exhausted. The size of this queue is set up to be the same as the number of threads being used. Using x amount of threads does not cause x amount of speedup, which can be explained with Amdahl's Law, but it is indeed an integral part of prime sieve implementation. As explained later, the user can choose how many threads to use when the program is run, but for best performance 1 should not be chosen as the algorithm is designed to be parallelized and is not an efficient implementation of a serial prime sieve.

Benchmarks

All benchmarks were performed on a Windows laptop. Its processor was a 12th Gen Intel Core i7-12700H, with a base speed of 2.30 GHz, 14 cores, 1.1MB of l1 cache, 10.0MB of l2 cache, 24.0MB of l3 cache, and 16GB of RAM.

For each benchmark, the code was compiled with GCC 15.2.0 and 14 threads were used with a segment size of 5000000.

Elapsed time was measured with Linux's time command.

  • Sieving up to 10^9 took 0.2 seconds
  • Sieving up to 10^10 took 1 second
  • Sieving up to 10^11 took 11 seconds
  • Sieving up to 10^12 took 1 minute and 50 seconds

Compared with these benchmarksa, which (to the best of my knowledge) are the fastest open-source primes sieves, this implementation is very fast for relatively small maximums (10^9 - 10^10), and is significantly faster than all of its competitors for relatively large maximums (10^11 and onwards).

a The referenced benchmarks are not my own, and I cannot confirm that they are accurate.

Getting Started

Following the upcoming steps will result in you having a local version of the source code that you can modify as well as its executables.

Prerequisites

The following software must be installed to the follow the installation steps:

  • Git - How to install Git
  • GCC / Clang / MinGW (or other C compilers, although compile-time flags and OpenMP support are not guaranteed)
  • CMake
  • Ninja (preferred) / Make

C compilers, CMake and its backends (Ninja / Make) have different methods for installation on different platforms. On Unix-like platforms, you can likely easily install them using the package manager of your choice (e.g. apt or homebrew). For Windows, I recommend installing MSYS2 and using its package manager, pacman, for installation.

Installing

  1. Start by copying this repository into a local folder:

    git clone github.com/lia-andrew/prime-sieve
    
  2. Enter the new prime-sieve directory:

    cd prime-sieve
    
  3. Build the project (OPTIONAL: use -G to specify the generator to use; I recommend Ninja):

    cmake -B build -GNinja && cmake --build build
    
  4. You now have the source code in ./src and executables in ./build/bin.

How To Use

This section will teach you about the usage of the Command Line Arguments and purpose of each executable.

Command Line Arguments

Each executable can be run with Command Line Arguments, the format for which is argument=value, where value is an integer. The usage of each Command Line Argument will be described below:

  • --min: Specify the lower bound of primes that you want consumed (this does not improve the algorithm's performance as the sieving itself will always start from 7); default = 0.
  • --max: Specify the upper bound of primes that you want consumed (this does improve the algorithm's performance, and it is recommended to specify the lowest max that satisfies your intent); default = 2 ^ 31 - 1
  • --segment OR -s: Specify the size of the segments to be used when sieving; default = 5000000
  • --thread or -t: Specify the number of threads to be used when the program is run; default = number of cores on the machine

Executables

Each executable has a separate purpose which will be listed below:

  • benchmark: there is no prime consumption thus runtime performance is maximized
  • print: consuming a prime prints it to stdout
  • count: consuming a prime increments a counter, whose value will be printed when the program terminates
  • print-count: consuming a prime increments a counter then prints both the prime and counter's value to stdout

Built With

  • OpenMP - Parallel Programming Framework
  • CMake - Build Management

Roadmap

  • Unit tests
  • Automated benchmarking
  • A more flexible implementation of prime consumption
  • A buffered writer implementation + an argument to specify output file/stream(/websocket?)
  • Platform independent cache-size measurement
  • Analyze the compiled assembly and write inlined assembly for parts that struggle to be efficiently compiled
  • CI/CD
  • Continual research and development to make what is already fast even faster

Contributing

Contributions to this project are welcome.

Please start by making an issue describing the idea for the contribution you have in mind. If you desire realize this contribution yourself, fork the repository, write and commit the idea described in the issue. When ready, make a pull request justifying the change that was made, as well as a clear comparison of a before and after change benchmark. With this benchmark, specify the specifications of the machine that was used.

If there are a large number of issues and/or pull requests, a proper CONTRIBUTING.md file will be written defining better-defined guidelines.

Authors

License

This project is licensed under the BSD-2 License - see the LICENSE.md file for details.

Keywords

C, OpenMP, CMake, Segmented Sieve of Eratosthenes, Wheel Factorization, Number Theory, Parallelized Sieve, Extreme Optimization, Performance Engineering, Cache Locality, Clean Code, Self-Documenting Code, Documentation.

Self Promotion

If this project is of use to you and/or you are intrigued by it, consider leaving a star on the repository. Doing so will encourage me to write a comprehensive article describing the theory behind my implementation and the process of optimizing it.

Releases

No releases published

Packages

 
 
 

Contributors