Bring in shr_reprosum fixes from E3SM #81

billsacks · 2026-01-07T18:50:23Z

Summary

This PR brings in three related sets of fixes/cleanups to shr_reprosum_mod from @worleyph , which were made to E3SM about 3 years ago:

Update comments and coding style in shr_reprosum_mod E3SM-Project/E3SM#5534
Fix support for preventing overflow and catastrophic cancellation E3SM-Project/E3SM#5549
Eliminate potential nonreproducibility in final local sum E3SM-Project/E3SM#5560

Resolves #79

This changes answers for CAM-FV cases, and may change answers for other cases, but I haven't observed other answer changes from my so-far-limited testing.

How I brought in these changes

I brought in the changes by cherry-picking the commits that were made in E3SM. I organized these into three batches so that I could make a merge commit corresponding to each of the E3SM PRs, where the merge commit message summarizes the changes in the corresponding batch.

I then compared the final shr_reprosum_mod.F90 with the version in E3SM. These are now identical except for the following:

(1) CESM has numerous #ifdef TIMING, whereas E3SM has #ifndef EAMXX_STANDALONE. There are also some whitespace differences between the CESM and E3SM versions around these ifdefs.

(2) CESM has #include <mpif.h>, E3SM has use mpi

(3) E3SM has:

#ifdef EAMXX_STANDALONE   
   ! Declare the C function interface
   interface
      subroutine shr_reprosumx86_fix_start(arg) bind(c)
        use iso_c_binding
        integer, intent(out) :: arg
      end subroutine shr_reprosumx86_fix_start
   end interface

   interface
      subroutine shr_reprosumx86_fix_end(arg) bind(c)
        use iso_c_binding
        integer, intent(in) :: arg
      end subroutine shr_reprosumx86_fix_end
   end interface
#endif

(4) E3SM has:

! With Fujitsu always abort on NaNs or INFs in input
#ifdef CPRFJ
      abort_inf_nan = .true.
#endif

(5) E3SM has an #ifndef CPRFJ around a small block of code

Testing done

These changes have been tested extensively in E3SM.

From some limited testing in CESM, I am finding that the changes here seem to change answers for CAM-FV cases, but not for CAM-SE.

Specifically, here is the testing I ran, showing which baselines passed vs. failed. The pattern I see is that FV tests have baseline differences whereas SE tests do not.

    PASS ERS_D_Ld3.ne30pg3_t232.B1850C_LTso.derecho_intel.allactive-defaultio BASELINE cesm3_0_alpha08a:
    PASS SMS_Ld1.ne30_ne30_mg17.FC2010climo.derecho_intel.cam-outfrq1d BASELINE cesm3_0_alpha08a:
    PASS SMS_Ln9.ne30_ne30_mg17.FW2000climo.derecho_intel.cam-outfrq9s BASELINE cesm3_0_alpha08a:

    FAIL ERP_Ld3.f09_f09_mg17.FCfireHIST.derecho_intel.cam-outfrq1d BASELINE cesm3_0_alpha08a: DIFF
    FAIL ERP_Ln9.f09_f09_mg17.F2000climo.derecho_intel.cam-outfrq9s BASELINE cesm3_0_alpha08a: DIFF
    FAIL SMS_D_Ln9.f09_f09_mg17.FCHIST.derecho_intel.cam-outfrq9s_ocnemis BASELINE cesm3_0_alpha08a: DIFF
    FAIL SMS_D_Ln9.f09_f09_mg17.FCts2SD.derecho_intel.cam-outfrq9s BASELINE cesm3_0_alpha08a: DIFF
    FAIL SMS_D_Ln9.f09_f09_mg17.FWma2000climo.derecho_intel.cam-outfrq9s BASELINE cesm3_0_alpha08a: DIFF
    FAIL SMS_D_Ln9.f19_f19_mg17.FWsc2010climo.derecho_intel.cam-outfrq9s BASELINE cesm3_0_alpha08a: DIFF
    FAIL SMS_D_Ln9_P144x3.f19_f19_mg17.FC2000climo.derecho_intel.cam-outfrq9s_camchem_mam4 BASELINE cesm3_0_alpha08a: DIFF
    FAIL SMS_Ld1.f09_f09_mg17.FW2010climo.derecho_intel.cam-outfrq1d BASELINE cesm3_0_alpha08a: DIFF

All of those tests themselves passed.

The comments in the shr_reprosum_mod module describe the default scalable reproducible algorithm as first converting each floating point input to a fixed point representation. This is an inadequate description of the process. (It does adequately describe the original algorithm in which a single integer was used, but not the current version which uses a vector of integers representation for each floating point value.) Comments throughout the module are modified to increase clarity on this topic. [BFB] - Bit-For-Bit

The word 'process' is often used when not referring to an MPI task, adding confusion. Change 'process' to 'MPI task' when this is what is meant in order to avoid this. BFB

Currently both double and single quotes are used in the comments. Change double quotes to single quotes in comments in order to be consistent throughout. BFB

There is no paper describing the algorithms in this module in any detail, and the comments in the code are somewhat cryptic. A fuller description of what is going on is added to the comments at the beginning of routine shr_reprosum_calc and some comments are expanded upon with the code. BFB

The arguments used in digits() and radix() are sometimes the constant '0' and sometimes the constant '1'. For consistency, will always use '1', replacing 0_i8 and 0._r8 with 1_i8 and 1.0_r8, respectively, in these intrinsics. Also, replacing 0._r8 and 1._r8 by 0.0_r8 and 1.0_r8 through the code, so as to use one style consistently. BFB

The code currently uses a mix of 'text' and 'symbolic' relational logical operators. This commit changes them all to the symbolic equivalent, i.e. == for .eq., /= for .ne., < for .lt., > for .gt. <= for .le. and >= for .ge. Also replaced 'dble' intrinsic call by 'real(.,r8)' . (BFB)

A few different 'styles' of comments are used in the code. Change capitalization and spacing to introduce consistency. (BFB)

This brings in the changes from E3SM-Project/E3SM#5534 Here is the merge commit message from the merge of that PR into E3SM master: Update comments and coding style in shr_reprosum_mod a) Change description of default algorithm in shr_reprosum_mod module. The comments in the shr_reprosum_mod module describe the default scalable reproducible algorithm as first converting each floating point input to a fixed point representation. This is an inadequate description of the process. (It does adequately describe the original algorithm in which a single integer was used, but not the current version which uses a vector of integers representation for each floating point value.) Comments throughout the module are modified to increase clarity on this topic. b) Replace 'process' with 'MPI task' when that is what is meant. The word 'process' is often used when not referring to an MPI task, adding confusion. Change 'process' to 'MPI task' when this is what is meant in order to avoid this. c) Change double quotes to single quotes in comments. Currently both double and single quotes are used in the comments. Change double quotes to single quotes in comments in order to be consistent throughout. d) Add fuller description of algorithms to the module. There is no paper describing the algorithms in this module in any detail, and the comments in the code are somewhat cryptic. A fuller description of what is going on is added to the comments at the beginning of routine shr_reprosum_calc and some comments are expanded upon with the code. e) Use constants consistently. The arguments used in digits() and radix() are sometimes the constant '0' and sometimes the constant '1'. For consistency, will always use '1', replacing 0_i8 and 0._r8 with 1_i8 and 1.0_r8, respectively, in these intrinsics. Also, replacing 0._r8 and 1._r8 by 0.0_r8 and 1.0_r8 through the code, so as to use one style consistently. f) Update relational operators to symbolic form and replace dble intrinsic. The code currently uses a mix of 'text' and 'symbolic' relational logical operators. This commit changes them all to the symbolic equivalent, i.e. == for .eq., /= for .ne., < for .lt., > for .gt. <= for .le. and >= for .ge. Also replaced 'dble' intrinsic call by 'real(.,r8)' . g) Introduce consistent capitalization and spacing in comments. A few different 'styles' of comments are used in the code. Change capitalization and spacing to introduce consistency. [BFB]

As part of the process of converting the vector of integers representing the global sum into a floating point value, the components are first modified to all be the same sign. This eliminates the possibility of catatrophic cancellation during the next step, when each integer component is converted to a floating point value and these floating point values are then summed to generate the global sum. The algorithm works as follows. Each integer*8 component IX(i), i=0,1,... , has an associated implicit exponent LX(i) and represents the floating point value real(IX(i),r8)**LX(i). The sequence of exponents all differ by the same amount, LX(i) - LX(i+1) = max_shift, and are monotonically decreasing in (i). By construction, abs(IX(i)) < radix(1_i8)**max_shift for i = 1,2,... In consequence, if IX(i-1), IX(i), and IX(i+1) are all nonzero, then the magnitude of the value at index (i) (abs(real(IX(i),r8)**LX(i))) is strictly less than the magnitude of the value at index (i-1) (abs(real(IX(i-1),r8)**LX(i-1))) and strictly greater than the magnitude of the value at index (i+1) (abs(real(IX(i+1),r8)**LX(i+1))). If IX(i) is greater than zero and IX(i+1) is less than zero, then 1_i8 can be subtracted from IX(i) and radix(1_i8)**max_shift added to IX(i+1) so that IX(i) is now greater than or equal to zero, and IX(i+1) is greater than zero. If IX(i) is less than zero and IX(i+1) is greater than zero, then, after adding 1_i8 to IX(i) and subtracting radix(1_i8)**max_shift from IX(i+1), IX(i) is less than or equal to zero and IX(i+1) is less than zero. The algorithm begins by finding the first nonzero component (i=0,1,...), which determines the sign of the resulting sum, then uses the above process to change all components to this sign. The above description does not treat the situation when IX(i) is zero, which is where an error arises in the current algorithm. The current algorithm always treats 0_i8 as positive (as it uses the fortran 'sign' intrinsic to compare signs between components). This can cause problems if IX(i-1) is greater than zero, IX(i) is zero, and IX(i+1) is less than zero. In this case IX(i-1) and IX(i) have the same sign and neither are modified after their comparison. However, IX(i) and IX(i+1) have different signs, so 1_i8 will be subtracted from IX(i), changing it to negative, while radix(1_i8)**max_shift will be added to IX(i+1), changing it to positive, and the same sign property will not be achieved. The solution is to always treat a zero IX(i) as having a different sign from IX(i-1) so that IX(i) always becomes nonzero. As we always start the process with a nonzero IX(i), we end up with all components either greater than or equal to zero or all components less than or equal to zero. Note that while this error can prevent the resulting sequence of floating point values from being all the same sign or zero, the situation in which this error occurs does not lead to additional rounding error in the final summation and so does not decrease the accuracy. Also, the fix has been BFB with the buggy version in practice, though it probably is not guaranteed to be. Despite the innocuousness of the error, it is important to fix it as part of the resolution to issue #5276 ("Improving accuracy of shr_reprosum_calc"). Claiming that this is BFB for the purposes of testing (as it likely is BFB for E3SM testing and production simulations), so that any non-BFB behavior in the testing will be examined before the PR is accepted. BFB

The default shr_reprosum_mod algorithm, the integer vector algorithm, first converts each floating point summand into an equivalent representation using a vector of integers, sums the integer vectors, then converts the resulting sum back to a floating point representation. Each integer*8 component IX(i), i=1,... , has an associated implicit exponent LX(i) and represents the floating point value real(IX(i),r8)**LX(i). The sequence of exponents all differ by the same amount, LX(i) - LX(i+1) = max_shift, and are monotonically decreasing in (i), and each component for a given summand is strictly less than radix(1_i8)**arr_max_shift in absolute value. To allow for a larger max_shift (and shorter integer vector), the components of the integer vector are manipulated after the local single threaded sums and before the sum over OpenMP threads and MPI tasks so that abs(IX(i)) < radix(1_i8)**max_shift for all i > 0, as follows. If abs(IX(i)) is larger than or equal to (radix(1_i8)**arr_max_shift), subtract this 'overlap' from the current value and add it (appropriately shifted) to IX(i-1). By starting from the largest index (smallest implicit exponent), this removes any overlap from each component. Note that the integer vector has an IX(0) that is initially zero to capture the overlap from IX(1). The current logic stops the process at index 0. However, this can be greater than the target radix(1_i8)**max_shift, and the subsequent summation over MPI tasks and OpenMP threads can potentially overflow for the distributed sum of IX(0) if the number of OpenMP threads times the number of MPI tasks is large. By extending the length of the integer vector from 0:max_levels to -(extra_levels-1):max_levels and continuing the above process to index -(extra_levels-1), can guarantee that the integer value for all indices satisfies abs(IX(i)) < radix(1_i8)**max_shift if choose extra_levels appropriately, thus eliminating the possibility of overflow. Another weakness in the original algorithm arises from using RX_8 = real(IX(i),r8) Overlap = int(scale(RX_8,-max_shift),i8) to identify the overlap in IX(i), which may miss some of the overlap if IX(i) has too many significant digits to be represented by a floating point value. However, simply calculating the overlap using integer division Overlap = IX(i) / (i8_radix**arr_max_shift) eliminates this (slight) possibility, and also simplifies the code. This commit implements both fixes. It also augments the comments to better document why this approach is sufficient to avoid overlap during the sum over threads and MPI tasks. Note that, while these changes are not guaranteed to be BFB, the calculated value for extra_levels for all E3SM simulations in the test suites (and likely all production simulations) is almost certainly '1', and so will be identical to the existing algorithm. The change in the overlap calculation is also likely to be BFB with the current algorithm in practice for this code fragment and, even if not, is still unlikely to change the value of the final sum. Again, claiming that it is BFB for the purposes of this PR so that any non-BFB behavior in the testing will be examined before the PR is accepted. BFB

In the process of backing out manageable commits from the final proposed PR, still to be submitted, an error was introduced in the previous commit in this PR in that the new lower bound on the ilevel loops (min_level) was not introduced into all ilevel loops. This was not caught in testing as the min_level is currently identical with the previous lower bound ('0') in all E3SM testing scenarios. BFB

This brings in the changes from E3SM-Project/E3SM#5549 Here is the merge commit message from the merge of that PR into E3SM master: Fix support for preventing overflow and catastrophic cancellation The default shr_reprosum_mod algorithm, the integer vector algorithm, first converts each floating point summand into an equivalent representation using a vector of integers, sums the integer vectors, then converts the resulting sum back to a floating point representation. The algorithm includes logic to preclude the possibility of integer overflow in the intermediate sums and to eliminate the possibility of catastrophic cancellation in the final summation. Upon close inspection it has been determined that this logic is not sufficient to achieve these goals in all situations. Though it is very unlikely for these "edge cases" to occur during E3SM simulations, it is desirable to correct the problems anyway. a) Fix logic to make all integers in integer expansion the same sign. As part of the process of converting the vector of integers representing the global sum into a floating point value, the components are first modified to all be the same sign. This eliminates the possibility of catatrophic cancellation during the next step, when each integer component is converted to a floating point value and these floating point values are then summed to generate the global sum. The algorithm works as follows. Each integer*8 component IX(i), i=0,1,... , has an associated implicit exponent LX(i) and represents the floating point value real(IX(i),r8)**LX(i). The sequence of exponents all differ by the same amount, LX(i) - LX(i+1) = max_shift, and are monotonically decreasing in (i). By construction, abs(IX(i)) < radix(1_i8)**max_shift for i = 1,2,... In consequence, if IX(i-1), IX(i), and IX(i+1) are all nonzero, then the magnitude of the value at index (i) (abs(real(IX(i),r8)**LX(i))) is strictly less than the magnitude of the value at index (i-1) (abs(real(IX(i-1),r8)**LX(i-1))) and strictly greater than the magnitude of the value at index (i+1) (abs(real(IX(i+1),r8)**LX(i+1))). If IX(i) is greater than zero and IX(i+1) is less than zero, then 1_i8 can be subtracted from IX(i) and radix(1_i8)**max_shift added to IX(i+1) so that IX(i) is now greater than or equal to zero, and IX(i+1) is greater than zero. If IX(i) is less than zero and IX(i+1) is greater than zero, then, after adding 1_i8 to IX(i) and subtracting radix(1_i8)**max_shift from IX(i+1), IX(i) is less than or equal to zero and IX(i+1) is less than zero. The algorithm begins by finding the first nonzero component (i=0,1,...), which determines the sign of the resulting sum, then uses the above process to change all components to this sign. The above description does not treat the situation when IX(i) is zero, which is where an error arises in the current algorithm. The current algorithm always treats 0_i8 as positive (as it uses the fortran 'sign' intrinsic to compare signs between components). This can cause problems if IX(i-1) is greater than zero, IX(i) is zero, and IX(i+1) is less than zero. In this case IX(i-1) and IX(i) have the same sign and neither are modified after their comparison. However, IX(i) and IX(i+1) have different signs, so 1_i8 will be subtracted from IX(i), changing it to negative, while radix(1_i8)**max_shift will be added to IX(i+1), changing it to positive, and the same sign property will not be achieved. The solution is to always treat a zero IX(i) as having a different sign from IX(i-1) so that IX(i) always becomes nonzero. As we always start the process with a nonzero IX(i), we end up with all components either greater than or equal to zero or all components less than or equal to zero. Note that while this error can prevent the resulting sequence of floating point values from being all the same sign or zero, the situation in which this error occurs does not lead to additional rounding error in the final summation and so does not decrease the accuracy. Also, the fix has been BFB with the buggy version in practice, though it probably is not guaranteed to be. Despite the innocuousness of the error, it is important to fix it as part of the resolution to issue E3SM-Project/E3SM#5276 ("Improving accuracy of shr_reprosum_calc"). Claiming that this is BFB, for the purposes of testing (as it likely is BFB for E3SM testing and production simulations) so that any non-BFB behavior in the testing will be examined before the PR is merged into master. b) Fix logic to prevent overflow in intermediate sums To reiterate, each integer*8 component IX(i), i=1,... , has an associated implicit exponent LX(i) and represents the floating point value real(IX(i),r8)**LX(i). The sequence of exponents all differ by the same amount, LX(i) - LX(i+1) = max_shift, and are monotonically decreasing in (i), and each component for a given summand is strictly less than radix(1_i8)**max_shift in absolute value. To allow for a larger max_shift (and shorter integer vector), the components of the integer vector are manipulated after the local single threaded sums and before the sum over OpenMP threads and MPI tasks so that abs(IX(i)) < radix(1_i8)**max_shift for all i > 0, as follows. If abs(IX(i)) is larger than or equal to (radix(1_i8)**arr_max_shift), subtract this 'overlap' from the current value and add it (appropriately shifted) to IX(i-1). By starting from the largest index (smallest implicit exponent), this removes any overlap from each component. Note that the integer vector has an IX(0) that is initially zero to capture the overlap from IX(1). The current logic stops the process at index 0. However, this can be greater than the target radix(1_i8)**max_shift, and the subsequent summation over MPI tasks and OpenMP threads can potentially overflow for the distributed sum of IX(0) if the number of OpenMP threads times the number of MPI tasks is large. By extending the length of the integer vector from 0:max_levels to -(extra_levels-1):max_levels and continuing the above process to index -(extra_levels-1), can guarantee that the integer value for all indices satisfies abs(IX(i)) < radix(1_i8)**max_shift if choose extra_levels appropriately, thus eliminating the possibility of overflow. Another weakness in the original algorithm arises from using RX_8 = real(IX(i),r8) Overlap = int(scale(RX_8,-max_shift),i8) to identify the overlap in IX(i), which may miss some of the overlap if IX(i) has too many significant digits to be represented by a floating point value. However, simply calculating the overlap using integer division Overlap = IX(i) / (i8_radix**max_shift) eliminates this (slight) possibility, and also simplifies the code. This commit implements both fixes. It also augments the comments to better document why this approach is sufficient to avoid overlap during the sum over threads and MPI tasks. Note that, while these changes are not guaranteed to be BFB, the calculated value for extra_levels for all E3SM simulations in the test suites (and likely all production simulations) is almost certainly '1', and so will be identical to the existing algorithm. The change in the overlap calculation is also likely to be BFB with the current algorithm in practice for this code fragment and, even if not, is still unlikely to change the value of the final sum. Again, claiming that it is BFB for the purposes of this PR so that any non-BFB behavior in the testing will be examined before the PR is accepted. Fixes E3SM-Project/E3SM#5537 [BFB]

The reprosum algorithm converts each floating point summand into a vector of integers representation. These vectors are summed locally first, and then globally (using an mpi_allreduce). All of this is exact and reproducible. The final step is to convert the final sum, in the vector of integers representation, back into a floating point value. For this last step, each component integer in the vector representation is converted back into a floating point value (call it a_{i} for the ith element of the vector), and then these values are summed locally, smallest to largest, to get the final value. The integer representation is first massaged so that all a_{i} have the same sign and the range of values is usually not overlapping (least significant binary digit in abs(a_{i}) represents a value larger than that represented by abs(a_{i+1}) in most cases, but sometimes is equal to). When constructing the final floating point value, round-off error can be introduced in the least significant digit of the final sum. This level of inaccuracy has no practical impact on overall model accuracy, but it can potentially lead to loss of reproducibility with respect to the numbers of MPI tasks and OpenMP threads. The issue is that the set of floating point values generated from the vector of integers will vary with the number of components in the integer vector and the implicit exponents associated with each component, which are a functions of, among things, the number of MPI tasks and number of OpenMP threads. Any round-off error will be sensitive to the specific floating point values being summed, even though the sum would be invariant with respect to exact arithmetic. This nonreproducibility has been observed when changing an 8-byte integer vector representation to a 4-byte integer vector representation (which also changes the vector length and assocated implicit exponents), but only in stand-alone contrived examples and never in E3SM. However, we can not rule out nonreproducibility occurring in practice (without a more detailed analysis of the range of sums likely to occur in E3SM simulations). The solution implemented here is to make the set of floating values generated from the integer vector independent of the size of the vector and definition of implicit exponents. Effectively, a new integer vector is generated from the original in which each component integer has a fixed number of significant digits (e.g., digits(1.0_r8)), and the floating point values are generated using this vector. As number, and value, of the significant digits is unchanged between the two representations, the implied value is unchanged, but any rounding error in the summation from using the new representation will be reproducible. Since digits(1.0_r8) would be too many digts for a 4-byte integer representation of the vector, and since we want to continue to support this option, each new floating point value is actually contructed from some number of floating point values with possibly fewer significant digits, but in such a way that the summation of these is exact (no rounding error possible), and is identical to the floating point value that would be generated if the integer vector had been modified as indicated in the previous paragraph. Note that another possibility to address this issue is by further decreasing the likelihood of rounding errors by calculating this final sum of floating point values using the DDPDD algorithm locally. (DDPDD is an alternative algorithm already available in the reprosum code.) DDPDD uses a two-floating-point-value representation of the intermediate values in the summation to approximately double the precision, i.e., for double precision values, summing in quad precision. In this case, round-off error is isolated to the least significant digit in the quad precision value, and it is even less likely that this will impact the least significant digit in the resulting double precision value (though not impossible). This 'local' DDPDD approach was implemented, though it is not included in this commit as the proposed algorithm is preferred as it is always reproducible, and not just almost always. However, comparing the two approaches in E3SM experiments, they are BFB, implying that the proposed algorithm improves accuracy as well. In similar experiments, the propsoed is also BFB with using the existing alternative 'distributed' DDPDD algorithm. As implied, both the proposed algorithm and the two DDPDD approaches are not BFB with the current default integer vector algorithm in E3SM simulations. The differences arise from differing round-off in the least significant digit, but this does accumulate over time. For example, for a 5 day run on Chrysalis using --compset WCYCL1850 --res ne30pg2_EC30to60E2r2 --pecount XS (675 ATM processes) there are 3669 reprosum calls, all in ATM, and using the proposed algorithm (or 'local' DDPDD or 'distributed' DDPDD) for the final sum changes the result 200 times (so 5.5% of the time). Looking at the actual difference (printing out the mantissas in the final sums in octal), it is always +/- 1 in the least significant binary digit. But this is still enough to perturb the model output, e.g. (from the atm.log) nstep, te 241 0.26199582846629281E+10 0.26199596541672482E+10 0.75730462729090879E-04 0.98528834650276956E+05 vs. nstep, te 241 0.26199694926722941E+10 0.26199707204570022E+10 0.67893797509715502E-04 0.98528681087863835E+05 Note that the computational cost of the proposed algorithm for the conversion of the integer vector into a floating point value is a little more expensive than that of the original algorithm, but the performance difference is in the noise as most of the cost of shr_reprosum_calc is in the common aspects of the current and proposed algorithms, i.e., determination of global max/min and number of summands, conversion of each floating point summand into the vector of integers representation, and the local and global sums of the integer vectors. For example, performance comparisons between the new and old approaches for calls to shr_reprosum_cale are inconclusive as to which is faster. Also note that the 'local' DDPDD algorithm is slightly more expensive than the proposed algorithm, but the difference is again in the noise. [Non-BFB]

The r8 values generated from the integer vector representation and to be summed locally to generate the global sum are held in the vector summand_vector. The integer*8 integer vector, i8_gsum_level, is dimensioned as (-(extra_levels-1):max_level), so is length (max_level+extral_levels). Exactly digits(1.0_r8) digits are extracted from the integer vector representation for each r8 value, so at most ((max_level+extral_levels)*(digits(1_i8)))/digits(1.0_r8) + 1 r8 values are generated, which is bounded from above by (max_level+extral_levels)*(1 + (digits(1_i8)/digits(1.0_r8))) and this is how summand_vector should be dimensioned. [BFB]

This brings in the changes from E3SM-Project/E3SM#5560 Here is the merge commit message from the merge of that PR into E3SM master: Eliminate potential nonreproducibility in final local sum The default reprosum algorithm converts each floating point summand into a vector of integers representation. These vectors are summed locally first, and then globally (using an mpi_allreduce). All of this is exact and reproducible. The final step is to convert the final sum, in the vector of integers representation, back into a floating point value. This final sum begins by converting each element of the integer vector into a floating point value, and then adding these, smallest to largest. Round off can occur in the least significant digit of the final result, and whether round off occurs is sensitive to the particular floating point summands. The numbers of MPI tasks and OpenMP threads can affect the length of the integer vector, and thus the floating point summands. In consequence, reproducibility with respect to degree of parallelism is not guaranteed. (The situation in which reproducibility is lost has never been observed in E3SM simulations, but it cannot be ruled out.) This PR modifies the final summation so that the same set of floating point summands is generated independent of the number of MPI tasks and number of OpenMP threads, establishing reproducibility. Fixes E3SM-Project/E3SM#5276 [non-BFB]

billsacks · 2026-01-07T18:53:04Z

I'm assigning the following reviewers here:

@fvitt - because it sounds like you are the person most familiar with this code on the CESM side
@briandobbins - to give your approval for bringing this into CESM
@gold2718 - to give your approval from the NorESM side

Also FYI: @dabail10 @mvertens @whlipscomb @cacraigucar

fvitt

I trust that the baseline differences are roundoff level. There is very detailed explanation which is nice to see. Thanks.

dabail10 · 2026-01-07T21:34:52Z

I would also like to tag @apcraig as we do use the code originally from Pat Worley in CICE.

worleyph and others added 16 commits January 6, 2026 18:48

Replace 'process' with 'MPI task' when that is what is meant

56023ff

The word 'process' is often used when not referring to an MPI task, adding confusion. Change 'process' to 'MPI task' when this is what is meant in order to avoid this. BFB

Change double quotes to single quotes in comments

c4bd08b

Currently both double and single quotes are used in the comments. Change double quotes to single quotes in comments in order to be consistent throughout. BFB

Introduce consistent capitalization and spacing in comments

2536a5f

A few different 'styles' of comments are used in the code. Change capitalization and spacing to introduce consistency. (BFB)

Add TIMING ifdefs around some new timing calls

ee957a8

billsacks requested review from briandobbins, fvitt and gold2718 January 7, 2026 18:50

fvitt approved these changes Jan 7, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Bring in shr_reprosum fixes from E3SM #81

Bring in shr_reprosum fixes from E3SM #81

Uh oh!

billsacks commented Jan 7, 2026

Uh oh!

billsacks commented Jan 7, 2026

Uh oh!

fvitt left a comment

Uh oh!

dabail10 commented Jan 7, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Bring in shr_reprosum fixes from E3SM #81

Are you sure you want to change the base?

Bring in shr_reprosum fixes from E3SM #81

Uh oh!

Conversation

billsacks commented Jan 7, 2026

Summary

How I brought in these changes

Testing done

Uh oh!

billsacks commented Jan 7, 2026

Uh oh!

fvitt left a comment

Choose a reason for hiding this comment

Uh oh!

dabail10 commented Jan 7, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

dabail10 commented Jan 7, 2026 •

edited

Loading