8km RYF configuration #781
base: 780-dev-MC_8km_jra_ryf
Conversation
Yes, of course! (maybe this is a recursive configuration 🤔)
Meeting summary with @manodeep @chrisb13 @dougiesquire @ezhilsabareesh8 @minghangli-uni @angus-g (today). The problem statement is two-fold:
@angus-g needs:
Questions/comments:
Possible approaches to helping:
- @angus-g already has a workaround for the second issue (atmosphere grid the same as the mediator grid). @dougiesquire suggested it would be good to double-check this workflow and propagate it across our configs.
- Note from @minghangli-uni: load balancing between components can only occur once the first two items have been resolved. (Single components can be looked at.)
- The run sequence could also be looked at independently (it gave an improvement of order ~25% in the panan), but this was already in the git commit that Angus has been working off.
Next actions:
As discussed, I've now set up a meeting for our next chat: Tue 28/10/2025, 15:00-16:00. Relevant people should have received a calendar invite, but get in touch with me if you didn't get it or would like to come -- all welcome!
Have you had a chance to pass this work on, @minghangli-uni? Is it on GitHub somewhere?
Is this offline? If so, for the 0.1° OM2 we needed to use 8 CPUs and 400 GB of memory: https://github.com/COSIMA/initial_conditions_WOA/blob/master/01/make_ic
Online, I'm afraid, although @minghangli-uni and @angus-g are working on a semi-offline version.
Meeting at 3pm today with @AndyHoggANU @adele-morrison @aekiss @minghangli-uni @angus-g. From last time's possible solutions:
- Angus is still interested in following up on this based on @minghangli-uni's previous work (Minghang will share next week). @chrisb13 and other ocean team people are available to help if needed.
- Angus hasn't had a chance to implement this yet; it's a configuration change. He will update the PR once it's in.
- @minghangli-uni and others have been working on some optimisation docs here which will be relevant:
- For compiler flags, @edoyango wrote a little here:
- Side note: @adele-morrison and Claire are keen for help on why the time-step / increasing the cores is leading to seg faults:
Finally got back to actually attempting to run this with the mask consistency. Of the total runtime for a single day, most of it is spent in the MOM6 tracer advection. I actually turned off all diagnostic output from MOM6 (via the …)
What is the tracer timestep?
I believe it does. @minghangli-uni can advise. As I alluded to at the talk today, here's the draft of the optimisation work that @minghangli-uni is leading.
@angus-g, can you paste the following into your payu config.yaml, run it for 1 model day, and then share the run path so I can take a look.

```yaml
env:
  ESMF_RUNTIME_PROFILE: "on"
  ESMF_RUNTIME_TRACE: "on"
  ESMF_RUNTIME_PROFILE_OUTPUT: "SUMMARY BINARY"
```
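For reference, here is the same env block with comments on what each variable does, as I understand the ESMF runtime options (the notes are mine, not from this thread; double-check against the ESMF documentation):

```yaml
env:
  ESMF_RUNTIME_PROFILE: "on"                      # enable ESMF's built-in timing of component phases
  ESMF_RUNTIME_TRACE: "on"                        # write a runtime trace (event stream) for detailed post-processing
  ESMF_RUNTIME_PROFILE_OUTPUT: "SUMMARY BINARY"   # SUMMARY aggregates timings across PETs into one text summary;
                                                  # BINARY keeps machine-readable output for profiling tools (e.g. flame graphs)
```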
And this draft, ACCESS-NRI/om3-scripts#92, is to generate the routehandles (including regridding weights) offline. It's not complete yet, but @angus-g you could have a look. I've started looking into this again and have found another way that might be simpler. Will report if there's progress.
Will run again with trace and profile, thanks @minghangli-uni
Alright, this is available at … I didn't actually look into the MOM6 timer breakdown closely enough, but the scary lines are:
…
So the halo updates are the slow (and massively imbalanced) part of everything at the moment. It seems like those associated with the diffusion (rather than advection) are the issue, but then again there are a lot more halo updates in the diffusion code.
I can access x77. Thanks @angus-g
Just to check whether this is tied to the processor layout: I saw you tried four different layouts; do you see the same halo-update behaviour in the other three configs as well?
That's a good question -- they were my initial tests before I realised the weight generation was the initialisation issue. So I've only run a proper segment on 48 nodes. Definitely worth checking out!
Hi @angus-g,
As you might've seen, the file … I've also generated an interactive flame graph to visualise how the components interact and where time is spent. You can open it locally on gadi with … and then navigate to …
From what I've seen, most of the runtime is dominated by the tracer steps. Initialisation, on the other hand, doesn't seem to be a major bottleneck for this config -- the evidence is clear both from the …
Just a note that this has become potentially important for @anton-seaice's CICE C-grid work, so he is going to take a look at this.
Thanks @dougiesquire, is there an issue or PR related to this (aside from the C-grid one)?
Great, thanks -- I had a feeling it might be that one but couldn't find it again.
So I think I've got a handle on at least one of the issues bedevilling the initialisation. It was telling that almost all of the time was spent in tracer diffusion: it's due to the …
I tried a different track, running 6 hours of the dynamic part of the ocean in 30s by setting …
This probably needs a bit deeper investigation: clearly, by disabling the CFL check, we end up with bad surface values. At the same time, this may just be one of those initialisation-only things? Ideally we can keep the parameter off and not have to pay for a global collective every tracer timestep...
Setting an additional parameter …
Edit: for the 25 km configs, the slowdown only occurs during the first few months without …
Thanks @minghangli-uni, that also works. I guess the load imbalance was showing up in the allreduce. I think that unblocks the performance enough to try a longer run segment.
Force-pushed from ce9748b to 653094a.
This pull request has been mentioned on the ACCESS Hive Community Forum. There might be relevant details there: https://forum.access-hive.org.au/t/cosima-twg-announce/401/79
@angus-g, just checking in on the global 8km work. We were chatting at the last TWG on Wednesday about how it would be helpful if we had a try at running the config / taking a closer look (part of this workplan is that users can run it). Do you have any updates on your side? Is the …
Go for it! I ran 2 months in a bit over 4 hours, so it's much more tractable.
That's right -- let me know if you have any issues; it's probably a bit all over the place...
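For a rough sense of throughput, taking the quoted "2 months in a bit over 4 hours" at face value (this back-of-envelope conversion is mine, not from the thread):

$$\frac{2\ \text{model months}}{\sim 4\ \text{h walltime}} \approx 12\ \text{model months per walltime day} \approx 1\ \text{model year per walltime day}$$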
1. Summary:
As mentioned in willaguiar/OM3-8km-tidal-tunning#3, this is an 8km global RYF configuration, based on the 8km RYF beta. The immediate aim is to use this with global tides as a high-resolution model for tuning drag parameterisations.
This has run into some issues with initialisation-time generation of regridding weights, and needs a bit of thought about optimisation/tuning because it's quite expensive!
2. Issues Addressed:
3. Dependencies (e.g. on payu, model or om3-scripts)
This change requires changes to (note required version where true):
4. Ad-hoc Testing
What ad-hoc testing was done? How are you convinced this change is correct (plots are good)?
5. CI Testing
`!test repro` has been run
6. Reproducibility
Is this reproducible with the previous commit? (If not, why not?)
`!test repro commit` has been run.
7. Documentation
The docs folder has been updated with output from running the model?
A PR has been created for updating the documentation?
8. Formatting
Changes to MOM_input have been copied from model output in docs/MOM_parameter_doc.short?
9. Merge Strategy