
Conversation


@gold2718 gold2718 commented Dec 6, 2020

Initial implementation of weak scaling physics infrastructure

  • Implement scalable physics grid with no load balancing
  • Implement scalable gmean algorithm that does not need global data
  • Modify SE/dyn_grid.F90 and related files for weak scaling interface
  • Add repro sum variables if weak-scaling fix is on
  • Add memory-use reporting for phys_grid_init and gmean timing
  • Add cam_repro_sum to test namelists to ensure BFB compatibility
  • Add checks to phys_debug_lat and phys_debug_lon
  • Note: The repro_sum version of gmean is not scalable
  • Note: create_native_mapping_files contains non-scalable memory use

fixes #127
fixes #259

Steve Goldhaber added 3 commits December 6, 2020 14:01
Implement scalable physics grid with no load balancing
Implement scalable gmean algorithm that does not need global data
Set up directory for weak scaling version of SE dycore interface
Modify SE/dyn_grid.F90 for weak scaling interface
Add repro sum variables if weak-scaling fix is on
Added memory use for phys_grid_init and gmean timing
Add cam_repro_sum to test namelists to ensure BFB compatibility
Add checks to phys_debug_lat and phys_debug_lon
Add src/physics/weak_scaling to TR8.sh
Note that the repro_sum version of gmean is not scalable
Note that create_native_mapping_files contains non-scalable memory use
@gold2718 gold2718 added the enhancement New feature or request label Dec 6, 2020
@gold2718 gold2718 added this to the CESM2.3 milestone Dec 6, 2020
@gold2718 gold2718 self-assigned this Dec 6, 2020
@cacraigucar cacraigucar changed the title Initial implementation of weak scaling physics infrastructure cam6_3_007?: Initial implementation of weak scaling physics infrastructure Dec 7, 2020
@cacraigucar cacraigucar left a comment

First round of reviews

bld/configure Outdated
my @dirs = split ',', $usr_src;
while ( my $dir = shift @dirs ) {
print $fh "$dir\n";
print $fh "$dir xx\n";
Collaborator

This looks like a typo.

Comment on lines 412 to 413
The value is the maximum number of double-precision fields on a task
for global calculations.
Collaborator

I didn't really understand this explanation until I read the following in the ChangeLog:
Maximum number of global fields to sum at one time

@fvitt fvitt left a comment

What is the purpose of the "-weak_scaling_fix" configure option?

Comment on lines 6 to 10
One-line Summary: Weak scaling infrastructure and bug fixes
Github PR URL:

Purpose of changes:
#127: Fix weak scaling issues in physics decomp
Collaborator

Can you elaborate on the purpose of these changes? I gather the infrastructure changes are to improve the performance scaling of CAM physics with the number of MPI tasks.

Collaborator Author

The ChangeLog_template says to copy the issue number and title text. The explanation is in the issue: #127

#259: _Fillvalue in cam history files (remove fill value for coordinates)

Describe any changes made to build system:
- Add weak scaling directory to override several files (see below)
Collaborator

I find the directory name "weak_scaling" to be confusing. Would something like "improved_scaling" be better?

Collaborator Author

Well, the project was to solve the weak scaling issue, but it also addresses strong scaling issues.
@cacraigucar, any thoughts?

Collaborator

How ingrained is the name "weak scaling"? I know it has been used for quite some time.

On the other hand, now is the time to give it a name change, if that is appropriate. Since it sounds like this is more than weak scaling, I do like @fvitt's suggestion of "improved_scaling" as it covers both the weak and strong scaling issues.

Collaborator

This project is replacing the dynamics-physics coupler with a new version that has better weak scaling properties. The current coupler code lives in physics/cam/phys_grid.F90. This would be a good time to move the new implementation into its own directory, say dyn_phys_coupling, at the same level as the dynamics and physics subdirectories under src/. There doesn't seem to be any good logic behind having the coupling code in the directory meant for physics parameterizations.

Collaborator Author

@brian-eaton, phys_grid.F90 does a lot more than dynamics-physics coupling. It is also where the cam_grid information for history output from physics code is defined. With the ionosphere PR, it also becomes the center for regridding from physics to the ionosphere and for other uses, such as inline regridding of emissions data onto the physics grid.
Also, while I agree that physics/cam is not a great place, that will be reformed as the physics becomes CCPP-ized and moves into its own repository.

Finally, only two of the files are about the new infrastructure, the others are temporary until all dycores use this interface.

How about new_phys_grid or physics_infrastructure? improved_scaling is okay as well. Consensus? Feedback on the suggestion to move it up a level (to src instead of in src/physics)?

Collaborator

I like "src/physics/infrastructure" or "src/physics_infrastructure"

Collaborator

Because it's coupling between dynamics and physics, or physics and ionosphere, I like having the directory at the same level as those 3 rather than inside physics. Calling the directory "infrastructure" would be OK. MPAS has a directory called "framework" which is similar. We have a lot of infrastructure type stuff in control/ and utils/ as well which could eventually migrate. Anything we do at this point will not be entirely consistent with the historical evolution of our directory structure, but that's OK.

@cacraigucar cacraigucar left a comment

Round 2 comments

Comment on lines 2062 to 2063
! write(6, *) 'XXG ',iam,') ',ie,i,j,elem(ie)%spherep(i,j)%lat,dbuf2(indx,ie)*deg2rad
! call shr_sys_flush(6)
Collaborator

Should these lines be deleted?

Collaborator Author

No idea. Brian left them in when he upgraded the implementation. @brian-eaton, any love left for these debug lines?

Collaborator

This output only occurs before endrun is called. It is more sanity check than debug code, so it seems like good info to put in the log file. Rather than commenting them out, the hardcoded unit 6 should be changed to iulog, the XXG changed to something like 'file lat not correct on task', and text added describing the other output data.
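
A minimal sketch of that suggestion, using the variables visible in the snippet quoted above (the message wording and the use of iam are assumptions, not the actual fix):

   write(iulog, *) 'dyn_grid: file lat not correct on task ', iam, ': ie,i,j = ', ie, i, j, &
        ' elem lat = ', elem(ie)%spherep(i,j)%lat, ' file lat (rad) = ', dbuf2(indx,ie)*deg2rad
   call shr_sys_flush(iulog)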

Comment on lines 2088 to 2089
! write(6, *) 'XXG ',iam,') ',ie,i,j,elem(ie)%spherep(i,j)%lon,dbuf2(indx,ie)*deg2rad
! call shr_sys_flush(6)
Collaborator

Should these lines be deleted?

close(unitn)
call freeunit(unitn)
! Check inputs
if (phys_debug_lat /= uninit_r8) then
Collaborator

I found this statement confusing at first glance. I tried to figure out why you were adding the r8 to the unit number. It was only on closer inspection that I saw the slight difference in characters. While I know the name uninit_r8 has been around for a while in this routine, I would suggest renaming it to uninitr8 so that it doesn't look like a real number being promoted to type r8. This is mainly because it immediately follows the use of unitn. Maybe I'm the only one who will be tripped up by this.

Comment on lines 5 to 6
! Perform mixed layer global calculations for energy conservation checks.
!
Collaborator

What is meant by "mixed layer"? Is this exclusively for energy conservation checks? Are other parts of the code using this utility to calculate global means?

Collaborator

The comments in src/utils/gmean_mod.F90 indicate that the code originated from SOM, in which case the mixed layer refers to the ocean. Just a case of old comments not being updated.

Doing a global search on 'call gmean' indicates that the primary purpose of the code is for diagnostics and the check_energy code. There's also some use in chemistry. The use of gmean in the check_energy code would cause answers to depend on task count if the means were not calculated in a task independent way. Since CAM is currently giving task count independent answers I have to assume that src/utils/gmean_mod.F90 is working as designed. But what's not clear is whether a call to gmean is relying on the non-scalable calculation to achieve this result. The current implementation is that a scalable algorithm is used first (shr_reprosum_calc called from gmean_fixed_repro), and the scalable algorithm makes an error estimate which is subsequently used to decide whether the non-scalable (but always task count independent) calculation should be used. However, as far as I can tell the check of the error estimate is not done because shr_reprosum_reldiffmax defaults to -1.0. So it appears that the current scalable code is being used and is producing task count independent results. @gold2718, did you come to a different conclusion after examining the current gmean_mod.F90?

Collaborator Author

@brian-eaton, I had not noticed the settings of shr_reprosum_reldiffmax and shr_reprosum_recompute.
Not only is shr_reprosum_reldiffmax set to -1.0 (which means do not make an estimate), but shr_reprosum_recompute is set to .false., meaning that even if the reprosum fails, CAM uses the answer anyway!

It looks to me as though our global means are reproducible, but we have no idea whether they are correct. Ouch!

Unless you know this is not happening, I am going to throw in some tests to see how we are doing.

Collaborator

I believe the default settings in shr_reprosum_mod came from Pat Worley. I assume that when Pat and other DOE folks implemented and tested this code they found that the algorithm in shr_reprosum_int worked well enough that there was no need to go to the expense of producing the error estimate in the default mode of running. The error estimate is made by redoing the sum using mpi_allreduce (so not reproducible, but an independent estimate) and comparing that with the results from shr_reprosum_int.

The thing that makes shr_reprosum_int susceptible to error is that the reproducibility is obtained by converting the floating point values to integer representations, and doing the mpi_allreduce on the integers. Then you convert back to floating point. I'm not familiar with the algorithmic details, but depending on the properties of the floats I imagine there could be pathologies when converting to integers. My guess as to why this scheme works well for this application is that the sums involve values that are all positive, e.g., energy of a column or the mass of a constituent in a column.
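
For readers unfamiliar with the idea being described, here is a toy illustration of a fixed-point reproducible sum. This is not the shr_reprosum_int algorithm; it only shows why summing scaled integers makes the result independent of the MPI reduction order. The scale factor, kinds, and routine name are arbitrary assumptions.

subroutine toy_fixed_point_sum(local_val, global_sum, mpicom)
   ! Toy illustration only: convert to a scaled integer, sum the integers
   ! (integer addition is exact and order independent), then convert back.
   ! Assumes every summand fits within the chosen fixed-point range.
   use mpi
   implicit none
   integer, parameter :: r8 = selected_real_kind(12)
   integer, parameter :: i8 = selected_int_kind(18)
   real(r8), intent(in)  :: local_val   ! this task's contribution
   real(r8), intent(out) :: global_sum  ! order-independent global sum
   integer,  intent(in)  :: mpicom
   real(r8), parameter   :: scale = 2.0_r8**40
   integer(i8) :: ival, isum
   integer     :: ierr

   ival = int(local_val * scale, kind=i8)
   call MPI_Allreduce(ival, isum, 1, MPI_INTEGER8, MPI_SUM, mpicom, ierr)
   global_sum = real(isum, r8) / scale
end subroutine toy_fixed_point_sum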

Collaborator Author

The tests I have been running are telling me that shr_reprosum is pretty good if given positive-definite fields, even with a fairly high dynamic range.
However, it is easy to break if you give it inputs that gyrate around zero and end up with a small result (compared to the size of the gyrations). In those situations, it computes incorrect answers and is not even reproducible.
Since we are currently using gmean on non-negative fields and are not checking whether or not shr_reprosum is working, I propose the following steps to resolve this issue:

  • Replace the unused safety net (the backup algorithm, which is never reached because we do not check the error estimate) with a call to endrun
  • Always use shr_reprosum (i.e., remove the new cam_repro_sum namelist variable I introduced).
  • Turn on the shr_reprosum check for CAM regression tests. This is to prevent a future use of gmean on oscillating fields.

Thoughts? @brian-eaton? @fvitt?

Collaborator

Steve, what value did you use for reprosum_diffmax? And what were the means computed by the reprosum calc and allreduce code for, say, the worst case?

Francis, I wouldn't expect there to be any significant error for non-negative fields. But if you want to check, you could add the optional arg rel_diff to the shr_reprosum_calc calls. To get the rel_diff values you also need to set the reprosum_diffmax namelist variable to some positive value.
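
A rough sketch of what such a check might look like in gmean. The shr_reprosum_calc argument list below is schematic, and the shape of rel_diff, the tolerance, and the message text are assumptions to be verified against shr_reprosum_mod:

   real(r8) :: rel_diff(2, nflds)            ! assumed: (1,:) relative, (2,:) absolute diffs
   real(r8), parameter :: tol = 1.0e-14_r8   ! tolerance value is an assumption

   ! Note: per the comment above, reprosum_diffmax must also be set to a
   ! positive value for rel_diff to be filled in.
   call shr_reprosum_calc(arr_local, arr_gsum, ncols, ncols_max, nflds, &
        commid=mpicom, rel_diff=rel_diff)
   if (any(rel_diff(1, :) > tol)) then
      write(iulog, *) 'gmean: reprosum relative difference exceeds ', tol
      call endrun('gmean: reproducible sum may be inaccurate')
   end if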

Collaborator Author

First, some of the baseline diffs were as high as 5.E-01, although for not-so-small absolute values it still reached 1.7E-01.
I used reprosum_diffmax=1.0e-14

I do not see either the value computed or the backup value computed by par_xsum. The output looks something like:

 par_xsum: difference in fixed and floating point sums  exceeds tolerance in  71  fields.
   Maximum relative diff: (rel)   0.9694656488549618  (abs)    2.8876456781290472E-11
   Maximum absolute diff: (rel)   6.9519501266768202E-11  (abs)    6.9519501266768202E-11

par_xsum does not use allreduce; it has some other reproducible algorithm.

Collaborator

I'm not quite understanding the par_xsum output. I think to pursue those errors you'd need to set up par_xsum the same way we discussed setting up gmean so that when the error tolerance is exceeded we print the means from both estimates and call endrun.

Did you see any failures related to gmean calls? 1.e-14 is a pretty tight tolerance.

Collaborator Author

That output is from the shr_reprosum_calc calls made by par_xsum. The (abs) numbers are just abs(fast - allreduce) while the (rel) numbers are abs(fast - allreduce) / abs(allreduce) (unless abs(allreduce) is too small in which case (rel) is just (abs)).

  • "Maximum relative diff" is the case (out of the 71 listed above) which had the largest (rel) value.
  • "Maximum absolute diff" is the case (out of the 71 listed above) which had the largest (abs) value.

To get an idea of the (absolute) value of the allreduce method, I would just look at (abs) / (rel), which gives 2.9786E-11 for the "Maximum relative diff" case (the absolute diff case did not compute a relative value since the absolute value was too small: abs(allreduce) <= abs(fast - allreduce)).

I did not see any gmean errors which is expected because they are all summing non-negative values.

Collaborator

It's hard to interpret relative errors when the values are so close to zero. To be sure I'd still want to see the means computed by both methods. But I think there's no need to pursue this. When Pat Worley updated par_xsum to use the shr_reprosum_mod code he did plenty of testing. Just for the record, here is part of the relevant ChangeLog history:


Tag name: cam3_6_37
Originator(s): pworley
Date: Mon Apr 27 2009
One-line Summary: more accurate reproducible sum; additional OpenMP
parallelism.

Purpose of changes:

  1. Modify repro_sum fixed-point algorithm to improve accuracy.
    In certain situations, numerical cancellation can degrade accuracy
    of original repro_sum algorithm as compared to floating-point-based
    algorithms. New algorithm is nearly exact, employing variable
    precision as needed. This results in roundoff level differences.

The ChangeLog file contains a lot more details about the introduction of and improvements to the scalable reproducible sums code, going back to July 2008.

end if
! Normalize
do fld_ind = 1, nflds
arr_gmean(fld_ind) = arr_gmean(fld_ind) / (4.0 * pi)
Collaborator

Any reason to not multiply by inv4pi rather than divide by (4.0*pi)?

Collaborator Author

This is hopefully temporary. This code is BFB with the baselines; multiplying by inv4pi was not.
In order to get this code in, we decided to go with the BFB version and introduce the faster, scalable version in the future.
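
A tiny standalone illustration of why the two forms need not be BFB (the constants here are arbitrary; the point is only that the two expressions perform different sequences of rounded operations):

program bfb_demo
   implicit none
   integer, parameter :: r8 = selected_real_kind(12)
   real(r8), parameter :: pi = 3.14159265358979323846_r8
   real(r8), parameter :: inv4pi = 1.0_r8 / (4.0_r8 * pi)
   real(r8) :: x

   x = 1.234567890123456_r8
   ! x / (4.0_r8*pi) and x * inv4pi round differently, so the difference
   ! printed below can be nonzero in the last bit for some values of x.
   print *, x / (4.0_r8 * pi), x * inv4pi, (x / (4.0_r8 * pi)) - (x * inv4pi)
end program bfb_demo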

Comment on lines 518 to 528
! Non reproducible algorithm
do fld_ind = 1, nflds
   ! Get weighted sum from physics
   call weighted_sum_p(arr(:, :, fld_ind), wsum(fld_ind))
end do
call MPI_allreduce(wsum, arr_gmean, nflds, MPI_REAL8, MPI_SUM, &
                   mpicom, ierr)
! Normalize
do fld_ind = 1, nflds
   arr_gmean(fld_ind) = arr_gmean(fld_ind) * inv4pi
end do
Collaborator

Is this reproducible if PE layout does not change?

Collaborator Author

It should be, although the MPI standard does not guarantee that the MPI_allreduce call will always return the same sum (the network could end up doing the sum in a different order if you end up on different nodes).
All my tests have at most roundoff differences with a PE change -- I did not try repeatedly running one layout to see if it ever varies.

Collaborator

The potential for simulation results that are not reproducible does not seem right.

Collaborator

After spending more time looking at both the gmean_mod.F90 in src/utils/ and the new one in weak_scaling/ I find that I don't understand what problem the new version is trying to solve. The issue comments indicate that a scalable version is the goal, but the current code already provides both scalable and non-scalable versions. CAM is currently set up to use the scalable version which is also reproducible.

If I'm missing what's going on and the new code is improving gmean_mod, then I think it should be implemented in shr_reprosum_mod.F90. The code there originally came from CAM, but was moved so that all components had access to scalable reproducible sums. I don't think we should be making improvements to those algorithms in a CAM specific location.

!========================================================================
!

subroutine test_gmean(max_diff)
Collaborator

Do you have a stand-alone unit test for gmean?

Collaborator Author

Not currently. I tested it by inserting calls in the init code.
We will have an easier-to-use unit test system in the new infrastructure.

end if
end function bits_to_string

!!XXgoldyXX: v remove once new algorithm is in place
Collaborator

Is the new algorithm in place? Is this stuff ready to remove?

Collaborator Author

No. See note above about getting this PR pushed out to allow MPAS-A to begin converting.
This will be done later (or maybe removed entirely if we decide to abandon the notion of BFB runs for high-resolution configurations).

integer :: max_nflds = -1 ! # fields for disp. decomp
type(column_redist_t) :: column_reorder

private :: test_gmean
Collaborator

How is this used if it is private?

Collaborator Author

With code insertion (see the note above about how it was used).
I left the code in so that we can use it when we have an easier-to-set-up unit test system.

Comment on lines 43 to 46
!!XXgoldyXX: v remove once new algorithm is in place
private :: gmean_float_repro
private :: gmean_fixed_repro
!!XXgoldyXX: ^ remove once new algorithm is in place
Collaborator

Ready to remove?

Collaborator Author

Nope.

fld_dyn(blk_ind(1), 2, k, ie) = fldv(icol, k, lchnk-begchunk+1)
end do
end do
deallocate(fld_dyn)
Collaborator

This section just looks wrong. You allocate, set a default value, fill the field, and then promptly deallocate. It looks like the deallocate should not be there. Not sure how this is working, because there is a subsequent reshape of fld_dyn right after this.

Collaborator

I agree. This deallocate is misplaced.

Collaborator

Which makes me think this code isn't being tested by our regression tests. I know people often use the interpolated output with SE rather than deal with native grid output, so interpolated output should be added to an SE test.

Collaborator Author

I have added tests for interpolated output (from an SE run) and also zonal mean output (from an FV run). I did this by adding a new file to existing tests (see ChangeLog).

Comment on lines 174 to 175
! Initialize the gmean module (must be after phys_grid_init)
call gmean_init(pcnst) !!XXgoldyXX: Is that a good value to use?
Collaborator

Right now, this is the only line that is different from the standard cam_comp.F90 routine. Instead of copying this routine in its entirety just to add this one line, couldn't we hide it with an #ifdef for weak_scaling? I believe this is one of the limited times when an #ifdef is valid. Alternatively, there could be an empty gmean_init routine for non-weak-scaling cases. Curious why you took the approach of copying the entire cam_comp.F90 to insert this one line.

Collaborator Author

There is no #ifdef for weak scaling, but an empty routine could be added to the non-weak-scaling gmean_mod.
As for reasons, I suppose at the time I did not know what the full impact would be.
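
A sketch of the no-op stub mentioned here, for the non-weak-scaling gmean_mod (the routine name and argument follow the call shown in the snippet above; the dummy-argument name is an assumption):

subroutine gmean_init(nflds)
   ! Stub so a single cam_comp.F90 can call gmean_init unconditionally;
   ! this version of gmean_mod needs no setup.
   integer, intent(in) :: nflds
end subroutine gmean_init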

fi

##setup CESM work directory
## If this is a Nag test, run the r8 and git tests
Collaborator

To confirm, the r8 and git test results will end up in the aux_cam_nag_XXXX.log file? Since we don't typically look in this file for test results, if they end up there we will need to augment the wiki page.

@fvitt fvitt left a comment

I see a bunch of XXgoldyXX comments in the code. Will these be cleaned up?


implicit none
private
save
Collaborator

Is this needed?

Collaborator Author

Nope (implied at module scope).

Collaborator

Steve, I thought the save attribute was only implied for module variables that have an initializer. No?

Collaborator Author

That used to be true, but no compiler ever actually treated module variables as non-saved, so the 2008 standard made it official: all data objects at module scope have the save attribute.
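
A small example of the point being made (the module and variable names here are hypothetical):

module save_demo_mod
   implicit none
   private
   save   ! redundant under Fortran 2008: module variables are implicitly
          ! saved whether or not they have an initializer

   integer         :: ncalls = 0  ! saved in any standard (has an initializer)
   integer, public :: max_nflds   ! also saved under F2008, even without one
end module save_demo_mod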

Comment on lines 85 to 95
!!XXgoldyXX: v try to remove?
public :: get_horiz_grid_dim_d
public :: dyn_grid_get_colndx ! get element block/column and MPI process indices
public :: get_dyn_grid_parm
!!XXgoldyXX: ^ try to remove?
public :: dyn_grid_get_elem_coords ! get coords of a specified block element

!!XXgoldyXX: v MUST BE DELETED
public :: get_horiz_grid_d
public :: get_dyn_grid_parm_real1d
!!XXgoldyXX: ^ MUST BE DELETED
Collaborator

Will these be removed?

Collaborator Author

Yeah, I could swear I did that, but obviously it did not make its way to the PR. I will clean up this file.

Collaborator Author

There is still a block of comments, but hopefully it is a bit clearer how these interfaces need to be treated down the road.

Collaborator

I see that there are still a few calls to get_horiz_grid_d in places where there is no conditional logic to restrict the calls to rectangular grids, i.e., in src/ionosphere/waccmx/ionosphere_interface.F90 and in src/physics/carma/models/meteor_impact/carma_model_mod.F90. Are these physics configurations not supported for the SE dycore?

Collaborator Author

The ionosphere code currently only runs on FV. The call to get_horiz_grid_d is removed in #264 which also enables SE support by moving the coupling to the physics grid.

On the other hand, CARMA is a hot mess. I do not know if anyone has tried it with SE but I would be surprised if it ever ran on an unstructured dycore. There has been talk (from ACOM folks) of reimplementing CARMA but I have not heard from them for a few months.

Collaborator

I'm aware of the CARMA resurrection efforts, which is why I wanted to keep it on our radar. The use of get_horiz_grid_d in the CARMA code is minimal, and it could easily be updated to just use info in the local chunks plus shr_reprosum_calc to compute the area of interest.
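
A rough sketch of that idea, assuming the usual phys_grid chunk accessors (get_ncols_p, get_area_all_p) and a schematic shr_reprosum_calc call; the names, argument order, and units are assumptions to be checked against the actual interfaces:

   real(r8), allocatable :: area(:, :)   ! (columns on this task, 1 field)
   real(r8) :: global_area(1)
   integer  :: lchnk, ncol, icol, ntot

   ntot = 0
   do lchnk = begchunk, endchunk
      ntot = ntot + get_ncols_p(lchnk)
   end do
   allocate(area(ntot, 1))
   icol = 0
   do lchnk = begchunk, endchunk
      ncol = get_ncols_p(lchnk)
      call get_area_all_p(lchnk, ncol, area(icol+1:icol+ncol, 1))
      icol = icol + ncol
   end do
   ! Reproducible, scalable global sum of the column areas owned by this task
   call shr_reprosum_calc(area, global_area, ntot, ntot, 1, commid=mpicom)
   deallocate(area)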

Comment on lines 315 to 317
!!XXgoldyXX: v debug only
character(len=256) :: errmsg
!!XXgoldyXX: ^ debug only
Collaborator

Should this be removed?

Collaborator Author

Yes, thanks for noticing.

@cacraigucar

@gold2718 Have you found the updated file(s) that you thought was(were) committed? It would be good to get them updated ASAP.

@nusbaume nusbaume left a comment

My first round of reviews. Hope they are helpful (and don't show too much of my own ignorance)!


!========================================================================

subroutine get_rlat_all_p(lcid, rlatdim, rlats)
Collaborator

It seems that most of the get_XXX_all_p subroutines here and below are missing any sort of phys_grid_initialized() check. I assume they should probably have one, just to be safe?

Collaborator Author

That check happens in the chunk_info_to_index_p call.

allocate(areaA(ngcols_d))
allocate(clat(ngcols_d),clon(ngcols_d))
call get_horiz_grid_d(ngcols_d, clat_d_out=clat, clon_d_out=clon, area_d_out=areaA)
call get_horiz_grid_int(ngcols_d, clat_d_out=clat, clon_d_out=clon, area_d_out=areaA)
Collaborator

Are the native mapping files still used? It seems like this functionality has been superseded by the ESMF remapping tools. Maybe this whole conditional can go away. @PeterHjortLauritzen what do you think?

owners = (igcol * 0) -1 ! Kill compiler warnings
col = -1 ! Kill compiler warnings
lbk = -1 ! Kill compiler warnings
call endrun('dyn_grid_get_colndx: not implemented for unstructured grids')
Collaborator

This routine is called by the sat_hist module. It looks like the old code supported the GLL grid, but not the FVM grid. Now the routine doesn't support either. Does this mean that it's no longer possible to have sathist output with SE?

@brian-eaton brian-eaton left a comment

A few more nits.

Remove weak-scaling-fix as an option
Move weak scale fix code to infrastructure directory
Add interpolated output and zonal mean processing to tested history options
Use num_global_phys_cols all the time in place of ngcols_p
Always use get_grid_dims in place of get_horiz_grid_dim_d
Always call gmean_init (Allows using only one cam_comp module)
Add checks for repro sum failures
Fix PLB test to use one build (same config), just test load balance option
cime_config/testdefs/testmods_dirs/cam/waccmx_weimer/user_nl_cpl
- Add shr_reprosum difference check
doc/ChangeLog
- Waxed poetic about modifying code
Collaborator

Is this poetry?

Collaborator Author

Everybody's a critic :-)

Comment on lines 323 to 325
<entry id="weak_scale_fix" valid_values="0,1" value="0">
Use weak scaling version of physics grid and d_p coupling: 0=off, 1=on.
</entry>
Collaborator

Is this configure option used? It seems that the src/infrastructure code is always used when the dycore is SE.

Collaborator Author

Oops, I removed it in the last round and forgot to take it out here. Thanks for noticing.

doc/ChangeLog Outdated
Comment on lines 17 to 18
- phys_global_max_fields: Maximum number of global fields to sum at one time
- cam_repro_sum: Flag to enable reproducible sum
Collaborator

These have been removed.

Collaborator Author

Thanks for that (and all your other comments -- even the nits :-)

@fvitt fvitt left a comment

weak_scale_fix needs to be removed from config_files/definition.xml

@nusbaume nusbaume left a comment

Looks good to me now! Just one grammar fix and a request for additional error checks.

CAM_CONFIG_OPTS = self._case1.get_value("CAM_CONFIG_OPTS")
self._case.set_value("CAM_CONFIG_OPTS",CAM_CONFIG_OPTS.replace(' -nadv_tt 5 -cppdefs -DTRACER_CHECK',''))
comments = "Setting phys_loadbalance to value that's set in the testmod."
comments = "Using phys_loadbalance to value that's set in the testmod."
Collaborator

I think grammatically this new comment should be:

"Using phys_loadbalance value that's set in the testmod."

Collaborator Author

No, I just want to express how much I 'value' the 'set' belonging to 'that' in the testmod :-)


!========================================================================

subroutine get_rlat_all_p(lcid, rlatdim, rlats)
Collaborator

I believe all of the get_XXX_all_p subroutines still need a phys_grid_initialized() check, as they do not call chunk_info_to_index_p.
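
A sketch of the requested guard for these accessors (the argument intents and the error-message text are assumptions):

subroutine get_rlat_all_p(lcid, rlatdim, rlats)
   integer,  intent(in)  :: lcid            ! local chunk id
   integer,  intent(in)  :: rlatdim         ! declared size of the output array
   real(r8), intent(out) :: rlats(rlatdim)  ! chunk latitudes (radians)

   if (.not. phys_grid_initialized()) then
      call endrun('get_rlat_all_p: phys_grid not initialized')
   end if
   ! ... existing body that fills rlats ...
end subroutine get_rlat_all_p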

Remove unused weak-scale-fix keyword from definition.xml
Make sure phys_grid is initialized in access functions in phys_grid
Fix comments in plb.py
Update ChangeLog with test results
@gold2718 gold2718 dismissed cacraigucar’s stale review January 1, 2021 22:15

Change request fulfilled, reviewer not available for re-review

@gold2718 gold2718 changed the title cam6_3_007?: Initial implementation of weak scaling physics infrastructure cam6_3_007: Initial implementation of weak scaling physics infrastructure Jan 1, 2021
@gold2718 gold2718 merged commit e3630ff into ESCOMP:cam_development Jan 1, 2021
@gold2718 gold2718 deleted the weak_scaling branch June 13, 2021 17:28