Improvements to Fortran solution #726

rbergen (Maintainer) · 2021-09-18T08:51:18Z

This is a placeholder/hook post, to facilitate @JohnCampbe11 suggesting improvements to the Fortran solutions submitted by @johandweber and @tjol.

JohnCampbe11 · 2021-09-18T13:55:07Z

I have reviewed the Fortran Solution 1 submitted by @johandweber and would like to suggest some changes that improve performance on my computers (i5-2300 and Ryzen 5900X).

The changes I am suggesting are:

Store the possible primes in an integer*1 array, with each entry taking only two values (prime or not prime).
Simplify the search loop, which now increments by 1 for each odd value, with an inner clearing loop for each prime found. This allows a simple IF test in the main search loop.
The use of default integers would be ok, although there is no performance penalty for 64-bit integers.

The search loop becomes:
subroutine allocate_odd_list (n)
  integer(int64) :: n

  if ( allocated(list_odds) ) deallocate (list_odds)
  allocate ( list_odds(n) )
  list_odds = all_odds_true

end subroutine allocate_odd_list

subroutine run_sieve
  implicit none
  integer(int64) :: k, k_limit, prime, next_odd

  call allocate_odd_list (num_odds)

  k_limit = int ( sqrt(dble(sieve_size)) ) / 2

  do k = 1,k_limit                        ! while prime <= sqrt(sieve_size)

    ! find next prime
    if ( list_odds(k) == 0 ) cycle
    prime = k*2+1

    ! map out all multiples of prime
    next_odd = (prime*prime)/2
    if ( next_odd > num_odds ) exit       ! search is over

    do while ( next_odd <= num_odds )
      list_odds(next_odd) = 0
      next_odd = next_odd + prime         ! ignoring even numbers
    end do

  end do

end subroutine run_sieve

My gfortran compile options are:
gfortran %1.f90 -fimplicit-none -Ofast -march=native -ffast-math -o prime

johandweber_fortran;3801;5.001;1;algorithm=base,faithful=no,bits=1

i5-2300 single thread
Compiler_VersionGCC version 11.1.0
Compiler_Options-march=sandybridge -mmmx -mpopcnt -msse -msse2 -msse3 -mssse3 -msse4.1 -msse4.2 -mavx -mno-avx2 ...
-mtune=sandybridge -Ofast -fimplicit-none -ffast-math
johncampbell_fortran; 4897 ; 5.000 ;1;algorithm=base,faithful=no,bits=2 78498 T
johncampbell_fortran; 63568 ; 5.000 ; count only test : count= 78498

ryzen 5900X single thread
Compiler_VersionGCC version 11.1.0
Compiler_Options-march=znver3 -mmmx -mpopcnt -msse -msse2 -msse3 -mssse3 -msse4.1 -msse4.2 -mavx -mavx2 ... -mtune=znver3 -Ofast -fimplicit-none -ffast-math
johncampbell_fortran; 19469 ; 5.000 ;1;algorithm=base,faithful=no,bits=2 78498 T
johncampbell_fortran; 142409 ; 5.000 ; count only test : count= 78498

prime-8bitJC.f90.txt

3 replies

tjol Sep 18, 2021

I may be misunderstanding something, but it looks like what you're proposing is switching from a 1 bit solution to an 8 bit solution.

JohnCampbe11 Sep 19, 2021

Yes, I am using an 8-bit list, although I have made some minor changes in my code, based on the following:
I have attempted to simplify the run_sieve coding.
The main loop of run_sieve loops over the count of odd values, which allows a simpler test for "is this not a prime":
( do k = 1, k_limit ; if ( list_odds(k) == 0 ) cycle )
The inner loop can be either a "DO WHILE (next_odd <= num_odds)" or "DO next_odd = next_odd, num_odds, prime". The latter is more concise.
I have removed all OO for simplicity (my bias), although surprisingly for little performance benefit.
int64 is not required for the sieve sizes in "validated_sieve_sizes", although it has no performance penalty over int32 with a 64-bit compiler; default integers would suffice.
And I would recommend using "gfortran %1.f08 -fimplicit-none -Ofast -march=native -ffast-math -o prime"
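
For illustration, here is a minimal sketch of the counted-DO form of the inner loop, wrapped in a self-contained module. It is not the attached source: the module name sieve_do_sketch, the function count_primes_upto and the choice num_odds = (sieve_size - 1) / 2 are assumptions made for this example, and the final count() is only there so the result can be checked.

```fortran
module sieve_do_sketch
  use iso_fortran_env, only: int8, int64
  implicit none
contains
  ! Sieve the odd numbers 3, 5, ... up to sieve_size and return the
  ! number of primes <= sieve_size (hypothetical helper, not from the post).
  function count_primes_upto(sieve_size) result(n_primes)
    integer(int64), intent(in) :: sieve_size
    integer(int64) :: n_primes
    integer(int8), allocatable :: list_odds(:)
    integer(int64) :: num_odds, k, k_limit, prime, next_odd

    n_primes = 0
    if (sieve_size < 2) return

    num_odds = (sieve_size - 1) / 2        ! assumption: list_odds(k) stands for 2*k+1
    allocate (list_odds(num_odds))
    list_odds = 1_int8                     ! every odd number starts as "possibly prime"

    k_limit = int(sqrt(dble(sieve_size)), int64) / 2
    do k = 1, k_limit
      if (list_odds(k) == 0) cycle         ! 2*k+1 is already known to be composite
      prime = 2*k + 1
      ! Counted DO with stride "prime", starting at the index of prime**2.
      ! The loop has zero trips when the start index exceeds num_odds.
      do next_odd = (prime*prime)/2, num_odds, prime
        list_odds(next_odd) = 0_int8
      end do
    end do

    n_primes = count(list_odds /= 0, kind=int64) + 1   ! +1 for the prime 2
  end function count_primes_upto
end module sieve_do_sketch
```

With this form the explicit "search is over" test is no longer needed, because a counted DO whose start exceeds its limit simply does not execute. Calling count_primes_upto(1000000_int64) should reproduce the 78498 reported in the result lines above.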

It could be described as a 4 bit solution, as even numbers are virtually stored, while only 1 bit of each 8-bit list entry is used.
I am not sure if the 1-bit solution would be better as the 8-bit list would fit in L3 cache, while the 1-bit list would not fit in L1 cache, with a more complex outer loop screening test.

What would be a valid multi-thread solution? Does each thread run a separate sieve (trivial) or do all threads need to work on the same sieve?

Finally, each iteration of run_prime_sieve does not produce a prime count, but only identifies the primes in "this%raw_bits(:)". Is this a valid solution?

tjol Sep 19, 2021

> It could be described as a 4 bit solution, as even numbers are virtually stored, while only 1 bit of each 8-bit list entry is used.
> I am not sure if the 1-bit solution would be better as the 8-bit list would fit in L3 cache, while the 1-bit list would not fit in L1 cache, with a more complex outer loop screening test.

The ‘bit count’ has been known to confuse people: it’s not the number of bits per prime, but more or less the number of bits addressed to access a bit of information in the sieve state. “1 bit” essentially means “We do bit manipulation”, “8 bit” means “we store every element of the sieve in one 8-bit byte”, etc.
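
To make the distinction concrete, here is a rough sketch of what clearing one sieve element looks like under each scheme. It is not taken from any of the submitted solutions; the module and procedure names are made up for this example, and both variants use an integer(int8) array only so that they can be compared side by side.

```fortran
module flag_storage_sketch
  use iso_fortran_env, only: int8, int32
  implicit none
contains
  ! "8 bit": one whole byte per sieve element, so clearing is a plain store.
  subroutine clear_flag_byte(flags, i)
    integer(int8), intent(inout) :: flags(:)
    integer(int32), intent(in) :: i            ! 1-based element index
    flags(i) = 0_int8
  end subroutine clear_flag_byte

  ! "1 bit": eight sieve elements packed into each byte, so clearing needs
  ! a word index, a bit position and a read-modify-write through ibclr.
  subroutine clear_flag_bit(words, i)
    integer(int8), intent(inout) :: words(:)
    integer(int32), intent(in) :: i            ! 1-based element index
    integer(int32) :: w, b
    w = (i - 1) / 8 + 1                        ! which byte holds element i
    b = mod(i - 1, 8)                          ! which bit inside that byte
    words(w) = ibclr(words(w), b)
  end subroutine clear_flag_bit

  ! Reading an element back from the packed form uses btest.
  logical function flag_bit_is_set(words, i)
    integer(int8), intent(in) :: words(:)
    integer(int32), intent(in) :: i
    flag_bit_is_set = btest(words((i - 1) / 8 + 1), mod(i - 1, 8))
  end function flag_bit_is_set
end module flag_storage_sketch
```

The packed form needs one eighth of the memory for the same number of elements, at the cost of the extra index arithmetic and the read-modify-write on every access.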

Yes, solutions without bit flipping are faster on machines with a large cache (like modern AMD/Intel PCs). You can see as much by testing the different variants of Fortran solution 2. However, this does change one of the fundamental properties of the solution. If you have changes to the 8-bit variant of solution 2 which increase performance, however, please let us know!

> I have removed all OO for simplicity

I don't understand, @johandweber’s solution 1 didn’t use any OO, did it? (Aside: if you were suggesting a change to an OO solution that removes the OO parts, this would almost certainly be unacceptable, because it would change a fundamental characteristic of the solution as defined in the CONTRIBUTING document by making it ‘unfaithful’.)

> What would be a valid multi-thread solution? Does each thread run a separate sieve (trivial) or do all threads need to work on the same sieve?

I think there are both variants in other languages.

> Finally, each iteration of run_prime_sieve does not produce a prime count, but only identifies the primes in "this%raw_bits(:)". Is this a valid solution?

This is what the vast majority of solutions do, including the original three.

johandweber · 2021-09-19T09:56:46Z

I just wanted to confirm that my implementation (solution_1) is not object-oriented and is therefore 'unfaithful'.

0 replies

JohnCampbe11 · 2021-09-20T09:51:13Z

I have started an OpenMP solution of the simpler kind, calling run_sieve ( sieve_size, num_odds, list_odds ) in a parallel loop and counting the number of passes.

For the Ryzen 5900X I am getting :
johncampbell_fortran; 182574 ; 5.011 ;1;algorithm=base,faithful=no,bits=2,thread=24 78498 T
johncampbell_fortran; 185380 ; 5.001 ;1;algorithm=base,faithful=no,bits=2,thread=18 78498 T
johncampbell_fortran; 185692 ; 5.002 ;1;algorithm=base,faithful=no,bits=2,thread=12 78498 T
johncampbell_fortran; 149890 ; 5.000 ;1;algorithm=base,faithful=no,bits=2,thread=8 78498 T
johncampbell_fortran; 78531 ; 5.000 ;1;algorithm=base,faithful=no,bits=2,thread=4 78498 T
johncampbell_fortran; 40484 ; 5.000 ;1;algorithm=base,faithful=no,bits=2,thread=2 78498 T
johncampbell_fortran; 20394 ; 5.000 ;1;algorithm=base,faithful=no,bits=2,thread=1 78498 T
johncampbell_fortran; 20372 ; 5.000 ;1;algorithm=base,faithful=no,bits=2 78498 T (single thread program)

For the i5-2300, I am getting:
johncampbell_fortran; 18485 ; 5.005 ;1;algorithm=base,faithful=no,bits=2,thread=4 78498 T
johncampbell_fortran; 9421 ; 5.001 ;1;algorithm=base,faithful=no,bits=2,thread=2 78498 T
johncampbell_fortran; 4830 ; 5.000 ;1;algorithm=base,faithful=no,bits=2,thread=1 78498 T
johncampbell_fortran; 4853 ; 5.000 ;1;algorithm=base,faithful=no,bits=2 78498 T (single thread program)

This comparison is interesting, as it shows that my approach to OpenMP stalls with hyper-threading, but the achieved rate of 185,692 sieves with 12 threads on the Ryzen 5900X is significantly faster than the non-threaded code.

To implement the OpenMP solution, run_sieve cannot be CONTAINed, to allow multiple PRIVATE list_odds(:).
Also, the OMP PARALLEL DO loop has to run to completion, so once the 5 seconds are up, the loop keeps cycling to completion but performs no calculations. I set the loop counter to 250,000 (after learning the expected number of passes, which is not a robust approach).
I did try counting the number of odds that are marked not prime in order to return a prime count, but this loop is the main time user, and including an IF test in it removes optimisation options for the compiler and slows the program considerably.
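
The looping scheme described above might be sketched roughly as follows. This is not the code that produced the timings: the module name omp_passes_sketch, the function timed_passes, the use of system_clock for timing and the dynamic schedule are all assumptions made for the example, and the stand-in run_sieve here keeps list_odds as a local array instead of passing it as an argument. Compiled without -fopenmp, the directives are treated as comments and the loop runs serially.

```fortran
module omp_passes_sketch
  use iso_fortran_env, only: int8, int64, real64
  implicit none
contains
  ! One complete sieve pass. list_odds is a local allocatable, so every
  ! thread calling this works on its own array. Returns the prime count.
  function run_sieve(sieve_size) result(n_primes)
    integer(int64), intent(in) :: sieve_size
    integer(int64) :: n_primes
    integer(int8), allocatable :: list_odds(:)
    integer(int64) :: num_odds, k, k_limit, prime, next_odd

    n_primes = 0
    if (sieve_size < 2) return
    num_odds = (sieve_size - 1) / 2            ! list_odds(k) stands for 2*k+1
    allocate (list_odds(num_odds))
    list_odds = 1_int8
    k_limit = int(sqrt(dble(sieve_size)), int64) / 2
    do k = 1, k_limit
      if (list_odds(k) == 0) cycle
      prime = 2*k + 1
      do next_odd = (prime*prime)/2, num_odds, prime
        list_odds(next_odd) = 0_int8
      end do
    end do
    ! Counting happens after the clearing loops, not inside them
    n_primes = count(list_odds /= 0, kind=int64) + 1
  end function run_sieve

  ! Run up to max_passes sieves for about "seconds" seconds and return
  ! the number of passes that actually did the work.
  function timed_passes(sieve_size, seconds, max_passes) result(passes)
    integer(int64), intent(in) :: sieve_size, max_passes
    real(real64), intent(in) :: seconds
    integer(int64) :: passes
    integer(int64) :: i, t_start, t_now, rate, n_primes, n_done

    call system_clock(t_start, rate)
    n_done = 0
    !$omp parallel do private(t_now, n_primes) reduction(+:n_done) schedule(dynamic)
    do i = 1, max_passes
      call system_clock(t_now)
      ! A parallel DO cannot be left early, so late iterations only cycle
      if (real(t_now - t_start, real64) / real(rate, real64) >= seconds) cycle
      n_primes = run_sieve(sieve_size)
      if (n_primes > 0) n_done = n_done + 1
    end do
    !$omp end parallel do
    passes = n_done
  end function timed_passes
end module omp_passes_sketch
```

As in the description above, max_passes has to be chosen larger than the number of passes the machine can complete in the time limit, otherwise the loop ends before the time is up. A call such as timed_passes(1000000_int64, 5.0_real64, 250000_int64) would mirror the settings mentioned in this post.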

I am not familiar with GitHub and as the OpenMP code needs further refining, I would be interested in feedback.

Is there a process for a Fortran multi-thread solution to first be reviewed and possibly accepted?
Should I post it for others to recommend possible changes?

1 reply

tjol Sep 20, 2021

Pinging @BenPalmer1983, as I know he was working on a multi-threaded Fortran solution at one point.

There is a process, documented in CONTRIBUTING.md. It would be ideal if you could create a "fork" of the Primes repository in your GitHub account, create a new branch there, and commit your version / your draft to that branch in the appropriate folder (i.e. probably PrimeFortran/solution_3). You can then create a "pull request" in this repository referring to your new branch.

Pull requests are the best way to discuss and review code on GitHub, even if it's not ready to be merged yet. You'll be able to update your branch and pull request with new changes later, and I'm sure we'll be able to help you get your version to the point where it's ready to be accepted.

I'll definitely be happy to have a look at the code, but I have to warn you that I know next to nothing about OpenMP (and not that much about Fortran, if I'm honest).
