
Conversation

@angus-g
Collaborator

angus-g commented Sep 22, 2025

1. Summary:
As mentioned in willaguiar/OM3-8km-tidal-tunning#3, this is an 8km global RYF configuration, based on the 8km RYF beta. The immediate aim is to use this with global tides as a high-resolution model for tuning drag parameterisations.

This has run into some issues with initialisation-time generation of regridding weights, and needs a bit of thought about optimisation/tuning because it's quite expensive!

2. Issues Addressed:

3. Dependencies (e.g. on payu, model or om3-scripts)

This change requires changes to (note required version where true):

  • payu:
  • access-om3:
  • om3-scripts:

4. Ad-hoc Testing

What ad-hoc testing was done? How are you convinced this change is correct (plots are good)?

5. CI Testing

  • !test repro has been run

6. Reproducibility

Is this reproducible with the previous commit? (If not, why not?)

  • Yes
  • No - !test repro commit has been run.

7. Documentation

The docs folder has been updated with output from running the model?

  • Yes
  • N/A

A PR has been created for updating the documentation?

  • Yes:
  • N/A

8. Formatting

Changes to MOM_input have been copied from model output in docs/MOM_parameter_docs.short?

  • Yes
  • N/A

9. Merge Strategy

  • Merge commit
  • Rebase and merge
  • Squash

@chrisb13
Collaborator

Thanks @angus-g !

Just to be clear, when you wrote

based on the 8km RYF beta.

That's a typo and you meant the 25km: release-MC_25km_jra_ryf-1.0-beta?

@angus-g
Collaborator Author

angus-g commented Sep 23, 2025

Yes, of course! (maybe this is a recursive configuration 🤔)

@chrisb13
Collaborator

chrisb13 commented Oct 7, 2025

Meeting summary with @manodeep @chrisb13 @dougiesquire @ezhilsabareesh8 @minghangli-uni @angus-g (today).

The problem statement is two-fold:

  1. Creating the re-gridding weights takes 2-3 hours (the patch weights are the slow ones).
  2. The model runs super slowly (currently ~5 model days in a 5-hour submission).
    From Claire's ISF config:

Expected cost is 20kSU/month, walltime 5:30:00/month
@angus-g suspects it has ~9 times more points than OM3 25km ((25/8)² ≈ 9.8), but it's still super slow!

@angus-g needs:

  • To first order, 1 year (for the tide runs).
  • So if initialisation time can be reduced, that would really help.

Questions/comments:

  • @dougiesquire: what would we expect from global OM2 1/10th degree?
  • @minghangli-uni thinks (the current) 600 cores for the mediator are too many.
  • Currently, we have a draft write-up of how to do load balancing (PR; rendered version). It covers how to set up the runs; we'll shortly add how to analyse the outputs.

Possible issues to address:

  1. Re-gridding file creation scales badly.
  2. Same re-grid file is used in two different places (gives slightly different answers).
  3. Load balancing.

@angus-g already has a workaround for the second issue (atmosphere grid same as mediator grid). @dougiesquire suggested it would be good to double-check this workflow and propagate it across our configs.

Note from @minghangli-uni: load balancing between components can only occur when the first two items have been resolved. (Single components can be looked at.) The run sequence could also be looked at independently (it gave on the order of a ~25% improvement in the panan), but this was already in the git commit that Angus has been working from.

Next actions:

  • @dougiesquire to look back into the early decision to match the ice/ocean and data model grids. This was done to avoid regridding twice but, as Angus reports, it isn't achieving that anyway due to the mask and no-mask mesh variants. Instead, what we should probably do is match the data and data model grids. This is not high priority for Angus' work because he has a workaround, as above;
  • @minghangli-uni (and others) is currently doing load-balancing work for the 25km, and will work on the panan-antarctic and this global 8km after that;
  • @minghangli-uni can share his work to date on parallelising the writing of the re-grid weights. @angus-g is happy to pick this up;
  • @micaeljtoliveira (software transformation team) where capacity allows, may look into why the re-gridding routine doesn't scale well (although this becomes less important if the offline approach is solved).

@chrisb13
Collaborator

As discussed, I've now set up a meeting for our next chat: Tue 28/10/2025 15:00 - 16:00

Relevant people should have received a calendar invite but get in touch with me if you didn't get it or would like to come -- all welcome!

@chrisb13
Collaborator

  • @minghangli-uni can share his work to date on parallelising the writing of the re-grid weights.

Have you had a chance to pass this work on, @minghangli-uni? Is it on GitHub somewhere?

@aidanheerdegen
Member

  1. 2-3 hours is taken to create re-gridding weights (patch weights are the slow ones).

Is this offline? If so, for the 0.1° OM2 we needed to use 8 CPUs and 400 GB of memory:

https://github.com/COSIMA/initial_conditions_WOA/blob/master/01/make_ic

@chrisb13
Collaborator

Is this offline?

Online, I'm afraid, although @minghangli-uni and @angus-g are working on a semi-offline version.

@chrisb13
Collaborator

chrisb13 commented Oct 28, 2025

Meeting at 3pm today with @AndyHoggANU @adele-morrison @aekiss @minghangli-uni @angus-g

From last time's possible solutions:

  • Re-gridding file creation scales badly.

Angus is still interested in following up on this based on @minghangli-uni's previous work (Minghang will share it next week). @chrisb13 and other ocean team people are available to help if needed.

  • Same re-grid file is used in two different places (gives slightly different answers).

Angus hasn't had a chance to implement this yet; it's a configuration change. He will update the PR once it's in.

  • Load balancing.

@minghangli-uni and others have been working on some optimisation docs here, which will be relevant:
#806 (see rendered version)

For compiler flags, @edoyango wrote a little here:
#573
i.e. specifically here

Side note: @adele-morrison and Claire are keen for help on why changing the time-step / increasing the cores is leading to seg faults:
#863

@angus-g
Collaborator Author

angus-g commented Nov 11, 2025

Finally got back to actually attempting to run this with the mask consistency. Of the total runtime for a single day, most is spent in the MOM6 tracer advection.

                                      hits          tmin          tmax          tavg          tstd  tfrac grain pemin pemax                        
Total runtime                            1  15261.402669  15261.407049  15261.404403      0.000906  1.000     0     0  4447                        
Ocean Initialization                     2     51.860114     57.299862     56.449770      0.193983  0.004    11     0  4447                        
Ocean                                  191  13382.310719  13382.419540  13382.372271      0.015021  0.877     1     0  4447                        
Ocean dynamics                         384    106.303673    111.784372    108.666483      0.740531  0.007    11     0  4447                        
Ocean thermodynamics and tracers        72  13225.714495  13260.840877  13251.913411      5.615316  0.868    11     0  4447                        
Ocean Other                            622     12.236470     41.623930     20.637757      5.366829  0.001    11     0  4447

I actually turned off all diagnostic output from MOM6 (via the diag_table) just to be sure this wasn't adversely affected by IO. I'm not sure where to get any information from NUOPC about the timing of the separate components, if that indeed exists? So we're still at a rather glacial 4h/day!

@AndyHoggANU
Collaborator

What is the tracer timestep?

@chrisb13
Collaborator

I'm not sure where to get any information from NUOPC about the timing of the separate components, if that indeed exists?

I believe it does. @minghangli-uni can advise. As I alluded at the talk today, here's the draft of the optimisation work that @minghangli-uni is leading.

@minghangli-uni
Collaborator

@angus-g, can you paste the following into your payu config.yaml, run it for 1 model day, and then share the run path so I can take a look?

      env:
        ESMF_RUNTIME_PROFILE: "on"
        ESMF_RUNTIME_TRACE: "on"
        ESMF_RUNTIME_PROFILE_OUTPUT: "SUMMARY BINARY"
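
(With those settings, and assuming payu exports the env: block into the model's runtime environment as it is used here, the aggregated timings should appear as ESMF_Profile.summary alongside the rest of the run output, e.g.

$ less <run archive>/output000/ESMF_Profile.summary

where <run archive> is the archive path shared further down.)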

@minghangli-uni
Collaborator

And this draft, ACCESS-NRI/om3-scripts#92, is for generating the routehandles (including regridding weights) offline. This is not complete yet, but @angus-g you could have a look. I've started looking into this again and have found another way that might be simpler. Will report if there's progress.

@angus-g
Collaborator Author

angus-g commented Nov 11, 2025

What is the tracer timestep?
DT = 450, DT_THERM = 3600

Will run again with trace and profile, thanks @minghangli-uni

@angus-g
Collaborator Author

angus-g commented Nov 11, 2025

@angus-g, can you paste the following into your payu config.yaml, run it for 1 model day, and then share the run path so I can take a look?

      env:
        ESMF_RUNTIME_PROFILE: "on"
        ESMF_RUNTIME_TRACE: "on"
        ESMF_RUNTIME_PROFILE_OUTPUT: "SUMMARY BINARY"

Alright, this is available at /scratch/x77/ahg157/access-om3/archive/20250826-global-8km-angus-g/global-8km-3758ae6d/output000 (but I can put the summary/trace files on a different project if you don't have access to x77). Just skimming the summary file, it does indeed seem like the ocean component is by far the biggest issue at the moment, although there is probably significant load imbalance hiding under that.

I didn't actually look into the MOM6 timer breakdown closely enough, but the scary lines are:

(Ocean tracer halo updates)         119097    510.743365  13520.936651   3542.063623   3413.432901  0.225    41     0  4447
(Ocean diffuse tracer)                  24  13664.700767  13694.169944  13689.317091      4.104975  0.871    31     0  4447

So the halo updates are the slow (and massively imbalanced) part of everything at the moment. It seems like those associated with the diffusion (rather than advection) are the issue, but then again there are a lot more halo updates in the diffusion code.

@minghangli-uni
Collaborator

Alright, this is available at /scratch/x77/ahg157/access-om3/archive/20250826-global-8km-angus-g/global-8km-3758ae6d/output000 (but I can put it the summary/trace files on a different project if you don't have access to x77)

I can access x77. Thanks @angus-g

So the halo updates are the slow (and massively imbalanced) part of everything at the moment. It seems like those associated with the diffusion (rather than advection) are the issue, but then again there's a lot more halo updates in the diffusion code.

Just to check whether this is tied to the processor layout: I saw you tried four different layouts,

#ncpus: 1664 # 16 nodes
ncpus: 4992 # 48 nodes
#ncpus: 3328 # 32 nodes
#ncpus: 2912 # 28 nodes

Do you see the same halo-update behaviour in the other three configs as well?

@angus-g
Collaborator Author

angus-g commented Nov 11, 2025

Do you see the same halo-update behaviour in the other three configs as well?

That's a good question -- they were my initial tests before I realised the weight generation was the initialisation issue. So I've only run a proper segment on 48 nodes. Definitely worth checking out!

@minghangli-uni
Collaborator

minghangli-uni commented Nov 12, 2025

Hi @angus-g, as you might've seen, the file /scratch/x77/ahg157/access-om3/archive/20250826-global-8km-angus-g/global-8km-3758ae6d/output000/ESMF_Profile.summary contains the timing summary for each model component.

I’ve also generated an interactive flame graph to visualise how the components interact and where time is spent:
/g/data/tm70/ml0072/COMMON/git_repos/access-experiment-generator/global-8km-profiling-scaling/global-8km-3758ae6d_all_comp/postprocessing_global-8km-3758ae6d/output000/global-8km-3758ae6d_flamegraph.html

You can open it locally on gadi with:

$ python3 -m http.server 1111

and then navigate to http://localhost:1111 in your browser. You can zoom in/out to check the start, end, and duration of each phase.
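
(If you're viewing from your own machine rather than from a session on gadi, forwarding the port over SSH should also work, e.g.

$ ssh -L 1111:localhost:1111 gadi.nci.org.au

then run python3 -m http.server 1111 in the flame-graph directory on gadi and open http://localhost:1111 locally. The hostname above is assumed to be the standard gadi login node.)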

From what I’ve seen, most of the runtime is dominated by the tracer steps.

Initialisation, on the other hand, doesn't seem to be a major bottleneck for this config; the evidence is clear both in the ESMF_Profile.summary and in the flame graph itself:

      [ESM0001] IPDv02p5                                                                4992   4992   1        21.2653     21.2620     4024    21.2681     2994   
        [MED] IPDv03p7                                                                  544    544    2        21.1778     21.1551     369     21.2081     1      
          MED: (med_map_mod: RouteHandles_init)                                         544    544    1        19.9193     19.8989     478     19.9390     431    

@dougiesquire
Collaborator

  • @dougiesquire to look back into the early decision to match the ice/ocean and data model grids. This was done to avoid regridding twice but, as Angus reports, it isn't achieving that anyway due to the mask and no-mask mesh variants. Instead, what we should probably do is match the data and data model grids. This is not high priority for Angus' work because he has a workaround, as above;

Just a note that this has become potentially important for @anton-seaice's CICE C-grid work, so he is going to take a look at this.

@chrisb13
Collaborator

chrisb13 commented Dec 9, 2025

Thanks @dougiesquire, is there an issue or PR that's related to this (aside from the C-grid one)?

@dougiesquire
Collaborator

Thanks @dougiesquire, is there an issue or PR that's related to this (aside from the C-grid one)?

#968

@chrisb13
Collaborator

chrisb13 commented Dec 9, 2025

Great, thanks. I had a feeling it might be that one but couldn't find it again.

@angus-g
Collaborator Author

angus-g commented Dec 11, 2025

So I think I've got a handle on at least one of the issues bedevilling the initialisation. It was telling that almost all of the time was spent in tracer diffusion: it's due to the CHECK_DIFFUSIVE_CFL parameter. This will perform a max_across_PEs (i.e. MPI_Allreduce) over the tracer diffusion CFL, which is obviously a super expensive thing to be doing. I guess DT_TRACER_ADVECT could be extended to push this out a bit.

I tried a different track, running 6 model hours of the dynamic part of the ocean in 30 s by setting CHECK_DIFFUSIVE_CFL = False (which required MAX_TR_DIFFUSION_CFL = 1.0 and disabling the bad surface value check). The rest of the model (I guess initialisation + restart generation) was an extra 1800 s on top, but currently PARALLEL_RESTARTFILES = False. There's probably a bunch of similar tuning like that to be done anyway.

This probably needs a bit deeper investigation: clearly by disabling the CFL check, we end up with bad surface values. At the same time, this may just be one of those initialisation-only things? Ideally we can keep the parameter off and not have to pay for a global collective every tracer timestep...
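
For concreteness, a minimal MOM_input-style sketch of the knobs named above (the values are just the ones quoted in this comment, not a tuned recommendation, and the bad-surface-value check isn't shown):

! turn off the check that triggers a global max_across_PEs every tracer step
CHECK_DIFFUSIVE_CFL = False
MAX_TR_DIFFUSION_CFL = 1.0     ! required once the check is off
! plausible follow-up for the ~1800 s of initialisation/restart time noted above
PARALLEL_RESTARTFILES = True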

@minghangli-uni
Collaborator

minghangli-uni commented Dec 11, 2025

Setting an additional parameter MAX_TR_DIFFUSION_CFL = 2 could resolve this problem. More details can be found in #732

Edit: for the 25 km configs, the slowdown only occurs during the first few months without MAX_TR_DIFFUSION_CFL = 2. After that, performance returns to normal.
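
For reference, the suggested alternative as a one-line MOM_input sketch (value from this comment and #732, with everything else left as in the existing config):

MAX_TR_DIFFUSION_CFL = 2.0     ! locally cap the diffusive CFL, bounding the number of diffusion sub-steps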

@angus-g
Collaborator Author

angus-g commented Dec 11, 2025

Thanks @minghangli-uni, that also works. I guess the load imbalance was showing up in the allreduce. I think that unblocks the performance enough to try a longer run segment.

@access-hive-bot

This pull request has been mentioned on ACCESS Hive Community Forum. There might be relevant details there:

https://forum.access-hive.org.au/t/cosima-twg-announce/401/79

@chrisb13
Collaborator

@angus-g, just checking in on the global 8km work. We were chatting at the last TWG on Wednesday about how it would be helpful if we had a try at running the config / took a closer look (part of this workplan is that users can run it). Do you have any updates on your side? Is the angus-g:angus-g/global-8km branch up to date/the one to try?

@angus-g
Collaborator Author

angus-g commented Jan 21, 2026

Go for it! I ran 2 months in a bit over 4 hours, so it's much more tractable.

Is the angus-g:angus-g/global-8km branch up to date/the one to try?

That's right, let me know if you have any issues, it's probably a bit all over the place...
