
Conversation

@angus-g
Collaborator

angus-g commented Sep 22, 2025

1. Summary:
As mentioned in willaguiar/OM3-8km-tidal-tunning#3, this is an 8km global RYF configuration, based on the 8km RYF beta. The immediate aim is to use this with global tides as a high-resolution model for tuning drag parameterisations.

This has run into some issues with initialisation-time generation of regridding weights, and needs a bit of thought about optimisation/tuning because it's quite expensive!

2. Issues Addressed:

3. Dependencies (e.g. on payu, model or om3-scripts)

This change requires changes to (note required version where true):

  • payu:
  • access-om3:
  • om3-scripts:

4. Ad-hoc Testing

What ad-hoc testing was done? How are you convinced this change is correct (plots are good)?

5. CI Testing

  • !test repro has been run

6. Reproducibility

Is this reproducible with the previous commit? (If not, why not?)

  • Yes
  • No - !test repro commit has been run.

7. Documentation

The docs folder has been updated with output from running the model?

  • Yes
  • N/A

A PR has been created for updating the documentation?

  • Yes:
  • N/A

8. Formatting

Changes to MOM_input have been copied from model output in docs/MOM_parameter_docs.short?

  • Yes
  • N/A

9. Merge Strategy

  • Merge commit
  • Rebase and merge
  • Squash

@chrisb13
Collaborator

Thanks @angus-g !

Just to be clear, when you wrote

based on the 8km RYF beta.

That's a typo and you meant the 25km: release-MC_25km_jra_ryf-1.0-beta?

@angus-g
Collaborator Author

angus-g commented Sep 23, 2025

Yes, of course! (maybe this is a recursive configuration 🤔)

@chrisb13
Collaborator

chrisb13 commented Oct 7, 2025

Meeting summary with @manodeep @chrisb13 @dougiesquire @ezhilsabareesh8 @minghangli-uni @angus-g (today).

The problem statement is two-fold:

  1. Creating the re-gridding weights takes 2-3 hours (the patch weights are the slow ones).
  2. The model runs super slowly (currently ~5 model days in a 5-hour submission).
    From Claire's ISF config:

Expected cost is 20kSU/month, walltime 5:30:00/month
@angus-g suspects it has ~9 times more points than OM3 25km ((25/8)² ≈ 9.8), but it's still super slow!

@angus-g needs:

  • To first order, 1 year (for the tide runs).
  • So if initialisation time can be reduced, that would really help.

Questions/comments:

  • @dougiesquire: what would we expect from global OM2 1/10th degree?
  • @minghangli-uni thinks (the current) 600 cores for the mediator are too many.
  • Currently, we have a draft write-up of how to do load balancing (PR; rendered version). It covers how to set up the runs; we'll shortly add how to analyse the outputs.

Possible issues to address:

  1. Re-gridding file creation scales badly.
  2. Same re-grid file is used in two different places (gives slightly different answers).
  3. Load balancing.

@angus-g already has a workaround for the second issue (atmosphere grid same as mediator grid). @dougiesquire suggested it would be good to double-check this workflow and propagate it across our configs.

Note from @minghangli-uni: load balancing between components can only occur when the first two items have been resolved. (Single components can be looked at.) The run sequence could also be looked at independently (it gave on the order of a ~25% improvement in the panan), but this was already in the git commit that Angus has been working from.

Next actions:

  • @dougiesquire to look back into the early decision to match the ice/ocean and data model grids. This was done to avoid regridding twice but, as Angus reports, it isn't achieving that anyway due to the mask and no-mask mesh variants. Instead, what we should probably do is match the data and data model grids. This is not high priority for Angus' work because he has a workaround, as above;
  • @minghangli-uni (and others) is currently doing load-balancing work for the 25km, and will work on the panan-antarctic and this global 8km after that;
  • @minghangli-uni can share his work to date on parallelising the writing of the re-grid weights. @angus-g is happy to pick this up;
  • @micaeljtoliveira (software transformation team) where capacity allows, may look into why the re-gridding routine doesn't scale well (although this becomes less important if the offline approach is solved).

@chrisb13
Collaborator

As discussed, I've now set up a meeting for our next chat: Tue 28/10/2025 15:00 - 16:00

Relevant people should have received a calendar invite but get in touch with me if you didn't get it or would like to come -- all welcome!

@chrisb13
Collaborator

  • @minghangli-uni can share his work to date on parallelising the writing of the re-grid weights.

Have you had a chance to pass this work on, @minghangli-uni? Is it on GitHub somewhere?

@aidanheerdegen
Member

  1. 2-3 hours is taken to create re-gridding weights (patch weights are the slow ones).

Is this offline? If so, for the 0.1° OM2 we needed to use 8 CPUs and 400 GB of memory:

https://github.com/COSIMA/initial_conditions_WOA/blob/master/01/make_ic

@chrisb13
Collaborator

Is this offline?

Online, I'm afraid, although @minghangli-uni and @angus-g are working on a semi-offline version.

@chrisb13
Collaborator

chrisb13 commented Oct 28, 2025

Meeting at 3pm today with @AndyHoggANU @adele-morrison @aekiss @minghangli-uni @angus-g

From last time's possible solutions:

  • Re-gridding file creation scales badly.

Angus is still interested in following up on this based on @minghangli-uni's previous work (Minghang will share it next week). @chrisb13 and other ocean team people are available to help if needed.

  • Same re-grid file is used in two different places (gives slightly different answers).

Angus hasn't had a chance to implement this yet; it's a configuration change. He will update the PR once it's in.

  • Load balancing.

@minghangli-uni and others have been working on some optimisation docs here, which will be relevant:
#806 (see rendered version)

For compiler flags, @edoyango wrote a little here:
#573
i.e. specifically here

Side note: @adele-morrison and Claire are keen for help on why changing the time-step / increasing the cores is leading to seg faults:
#863

@angus-g
Collaborator Author

angus-g commented Nov 11, 2025

Finally got back to actually attempting to run this with the mask consistency. Of the total runtime for a single day, most is spent in the MOM6 tracer advection.

                                      hits          tmin          tmax          tavg          tstd  tfrac grain pemin pemax                        
Total runtime                            1  15261.402669  15261.407049  15261.404403      0.000906  1.000     0     0  4447                        
Ocean Initialization                     2     51.860114     57.299862     56.449770      0.193983  0.004    11     0  4447                        
Ocean                                  191  13382.310719  13382.419540  13382.372271      0.015021  0.877     1     0  4447                        
Ocean dynamics                         384    106.303673    111.784372    108.666483      0.740531  0.007    11     0  4447                        
Ocean thermodynamics and tracers        72  13225.714495  13260.840877  13251.913411      5.615316  0.868    11     0  4447                        
Ocean Other                            622     12.236470     41.623930     20.637757      5.366829  0.001    11     0  4447

I actually turned off all diagnostic output from MOM6 (via the diag_table) just to be sure this wasn't adversely affected by IO. I'm not sure where to get any information from NUOPC about the timing of the separate components, if that indeed exists? So we're still at a rather glacial 4h/day!

@AndyHoggANU
Collaborator

What is the tracer timestep?

@chrisb13
Collaborator

I'm not sure where to get any information from NUOPC about the timing of the separate components, if that indeed exists?

I believe it does. @minghangli-uni can advise. As I alluded at the talk today, here's the draft of the optimisation work that @minghangli-uni is leading.

@minghangli-uni
Collaborator

@angus-g, can you paste the following into your payu config.yaml, run it for 1 model day, and then share the run path so I can take a look?

      env:
        ESMF_RUNTIME_PROFILE: "on"
        ESMF_RUNTIME_TRACE: "on"
        ESMF_RUNTIME_PROFILE_OUTPUT: "SUMMARY BINARY"
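
(With those settings, and assuming payu exports the env: block into the model's runtime environment as it is used here, the aggregated timings should appear as ESMF_Profile.summary alongside the rest of the run output, e.g.

$ less <run archive>/output000/ESMF_Profile.summary

where <run archive> is the archive path shared further down.)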

@minghangli-uni
Collaborator

And this draft, ACCESS-NRI/om3-scripts#92, is for generating the routehandles (including regridding weights) offline. This is not complete yet, but @angus-g you could have a look. I've started looking into this again and have found another way that might be simpler. Will report if there's progress.

@angus-g
Collaborator Author

angus-g commented Nov 11, 2025

What is the tracer timestep?
DT = 450, DT_THERM = 3600

Will run again with trace and profile, thanks @minghangli-uni

@angus-g
Collaborator Author

angus-g commented Nov 11, 2025

@angus-g, can you paste the following into your payu config.yaml, run it for 1 model day, and then share the run path so I can take a look?

      env:
        ESMF_RUNTIME_PROFILE: "on"
        ESMF_RUNTIME_TRACE: "on"
        ESMF_RUNTIME_PROFILE_OUTPUT: "SUMMARY BINARY"

Alright, this is available at /scratch/x77/ahg157/access-om3/archive/20250826-global-8km-angus-g/global-8km-3758ae6d/output000 (but I can put the summary/trace files on a different project if you don't have access to x77). Just skimming the summary file, it does indeed seem like the ocean component is by far the biggest issue at the moment, although there is probably significant load imbalance hiding under that.

I didn't actually look into the MOM6 timer breakdown closely enough, but the scary lines are:

(Ocean tracer halo updates)         119097    510.743365  13520.936651   3542.063623   3413.432901  0.225    41     0  4447
(Ocean diffuse tracer)                  24  13664.700767  13694.169944  13689.317091      4.104975  0.871    31     0  4447

So the halo updates are the slow (and massively imbalanced) part of everything at the moment. It seems like those associated with the diffusion (rather than advection) are the issue, but then again there are a lot more halo updates in the diffusion code.

@minghangli-uni
Collaborator

Alright, this is available at /scratch/x77/ahg157/access-om3/archive/20250826-global-8km-angus-g/global-8km-3758ae6d/output000 (but I can put it the summary/trace files on a different project if you don't have access to x77)

I can access x77. Thanks @angus-g

So the halo updates are the slow (and massively imbalanced) part of everything at the moment. It seems like those associated with the diffusion (rather than advection) are the issue, but then again there's a lot more halo updates in the diffusion code.

Just to check whether this is tied to the processor layout: I saw you tried four different layouts,

#ncpus: 1664 # 16 nodes
ncpus: 4992 # 48 nodes
#ncpus: 3328 # 32 nodes
#ncpus: 2912 # 28 nodes

Do you see the same halo-update behaviour in the other three configs as well?

@angus-g
Collaborator Author

angus-g commented Nov 11, 2025

Do you see the same halo-update behaviour in the other three configs as well?

That's a good question -- they were my initial tests before I realised the weight generation was the initialisation issue. So I've only run a proper segment on 48 nodes. Definitely worth checking out!

@minghangli-uni
Collaborator

minghangli-uni commented Nov 12, 2025

Hi @angus-g, as you might've seen, the file /scratch/x77/ahg157/access-om3/archive/20250826-global-8km-angus-g/global-8km-3758ae6d/output000/ESMF_Profile.summary contains the timing summary for each model component.

I’ve also generated an interactive flame graph to visualise how the components interact and where time is spent:
/g/data/tm70/ml0072/COMMON/git_repos/access-experiment-generator/global-8km-profiling-scaling/global-8km-3758ae6d_all_comp/postprocessing_global-8km-3758ae6d/output000/global-8km-3758ae6d_flamegraph.html

You can open it locally on gadi with:

$ python3 -m http.server 1111

and then navigate to http://localhost:1111 in your browser. You can zoom in/out to check the start, end, and duration of each phase.
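
(If you're viewing from your own machine rather than from a session on gadi, forwarding the port over SSH should also work, e.g.

$ ssh -L 1111:localhost:1111 gadi.nci.org.au

then run python3 -m http.server 1111 in the flame-graph directory on gadi and open http://localhost:1111 locally. The hostname above is assumed to be the standard gadi login node.)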

From what I’ve seen, most of the runtime is dominated by the tracer steps.

Initialisation, on the other hand, doesn't seem to be a major bottleneck for this config; the evidence is clear both in the ESMF_Profile.summary and in the flame graph itself:

      [ESM0001] IPDv02p5                                                                4992   4992   1        21.2653     21.2620     4024    21.2681     2994   
        [MED] IPDv03p7                                                                  544    544    2        21.1778     21.1551     369     21.2081     1      
          MED: (med_map_mod: RouteHandles_init)                                         544    544    1        19.9193     19.8989     478     19.9390     431    

@dougiesquire
Collaborator

  • @dougiesquire to look back into the early decision to match the ice/ocean and data model grids. This was done to avoid regridding twice but, as Angus reports, it isn't achieving that anyway due to the mask and no-mask mesh variants. Instead, what we should probably do is match the data and data model grids. This is not high priority for Angus' work because he has a workaround, as above;

Just a note that this has become potentially important for @anton-seaice's CICE C-grid work, so he is going to take a look at this.

@chrisb13
Collaborator

chrisb13 commented Dec 9, 2025

Thanks @dougiesquire, is there an issue or PR that's related to this (aside from the C-grid one)?

@dougiesquire
Collaborator

Thanks @dougiesquire, is there an issue or PR that's related to this (aside from the C-grid one)?

#968

@chrisb13
Collaborator

chrisb13 commented Dec 9, 2025

Great, thanks. I had a feeling it might be that one but couldn't find it again.

@angus-g
Collaborator Author

angus-g commented Dec 11, 2025

So I think I've got a handle on at least one of the issues bedevilling the initialisation. It was telling that almost all of the time was spent in tracer diffusion: it's due to the CHECK_DIFFUSIVE_CFL parameter. This will perform a max_across_PEs (i.e. MPI_Allreduce) over the tracer diffusion CFL, which is obviously a super expensive thing to be doing. I guess DT_TRACER_ADVECT could be extended to push this out a bit.

I tried a different track, running 6 model hours of the dynamic part of the ocean in 30 s by setting CHECK_DIFFUSIVE_CFL = False (which required MAX_TR_DIFFUSION_CFL = 1.0 and disabling the bad surface value check). The rest of the model (I guess initialisation + restart generation) was an extra 1800 s on top, but currently PARALLEL_RESTARTFILES = False. There's probably a bunch of similar tuning like that to be done anyway.

This probably needs a bit deeper investigation: clearly by disabling the CFL check, we end up with bad surface values. At the same time, this may just be one of those initialisation-only things? Ideally we can keep the parameter off and not have to pay for a global collective every tracer timestep...
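
For concreteness, a minimal MOM_input-style sketch of the knobs named above (the values are just the ones quoted in this comment, not a tuned recommendation, and the bad-surface-value check isn't shown):

! turn off the check that triggers a global max_across_PEs every tracer step
CHECK_DIFFUSIVE_CFL = False
MAX_TR_DIFFUSION_CFL = 1.0     ! required once the check is off
! plausible follow-up for the ~1800 s of initialisation/restart time noted above
PARALLEL_RESTARTFILES = True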

@minghangli-uni
Collaborator

minghangli-uni commented Dec 11, 2025

Setting an additional parameter MAX_TR_DIFFUSION_CFL = 2 could resolve this problem. More details can be found in #732

Edit: for the 25 km configs, the slowdown only occurs during the first few months without MAX_TR_DIFFUSION_CFL = 2. After that, performance returns to normal.
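
For reference, the suggested alternative as a one-line MOM_input sketch (value from this comment and #732, with everything else left as in the existing config):

MAX_TR_DIFFUSION_CFL = 2.0     ! locally cap the diffusive CFL, bounding the number of diffusion sub-steps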

@angus-g
Collaborator Author

angus-g commented Dec 11, 2025

Thanks @minghangli-uni, that also works. I guess the load imbalance was showing up in the allreduce. I think that unblocks the performance enough to try a longer run segment.

@access-hive-bot

This pull request has been mentioned on ACCESS Hive Community Forum. There might be relevant details there:

https://forum.access-hive.org.au/t/cosima-twg-announce/401/79

@chrisb13
Collaborator

@angus-g, just checking in on the global 8km work. We were chatting at the last TWG on Wednesday about how it would be helpful if we had a try at running the config / took a closer look (part of this workplan is that users can run it). Do you have any updates on your side? Is the angus-g:angus-g/global-8km branch up to date/the one to try?

@angus-g
Collaborator Author

angus-g commented Jan 21, 2026

Go for it! I ran 2 months in a bit over 4 hours, so it's much more tractable.

Is the angus-g:angus-g/global-8km branch up to date/the one to try?

That's right, let me know if you have any issues, it's probably a bit all over the place...
