claireyung:dev-MC_4km_jra_ryf+regionalpanan. PR #3.#1077
|
@dougiesquire, I think I've now done what we discussed today. Can you have a go at resolving the conflicts? If needed, I imagine we can ask for Claire's help resolving any trickier bits. |
|
I tried to generate repro checksums using Claire's branch (claireyung:dev-MC_4km_jra_ryf+regionalpanan) so that I could confirm answers remain unchanged after rebasing. However, that configuration does not run reliably: it crashes at initialisation roughly every third time it is run. Error log (access-om2.err) This will need to be resolved before we can release this configuration. (Note, I did eventually manage to generate the repro checksums and pushed them to Claire's branch for reference) (Note note, the configuration is also not restart reproducible) |
|
Thanks @dougiesquire. Just to be clear I'm assuming this was with the discussed update to ESMF
? I imagine it's unlikely to be important but I think you've also included the latest 7 commits from the merge commit we talked about yesterday? i.e. If you think these could be playing a role I think it would be a fairer comparison (e.g. what Claire was using) trying from: On the same thread, the comments @AndyHoggANU made yesterday (from @claireyung I gather) made it sound like it wasn't quite as crash prone as your experience but I would think that would be the ice-shelf case, from a restart. Hence, is it possible to repeat the tests you did for this branch of Claire's? Ideally using a re-start of her choosing. (Similar to your comment, this would give us a baseline before thinking about this PR.) Finally, I suppose there is the approach from the other direction, perhaps this problem might be improved by trying an updated build and config? |
Nope, it was just running the branch as is. I'm testing with updated ESMF at the moment and will update here. Yup, I'll try a few other things today. Again, will update here. |
|
Just a note that updating to ESMF 8.8 does not help the issue. Using Still testing a few things. |
|
A touch better than a third I guess.
Well, I guess you got one less fail but it doesn't fix the issue you mean? Just to mention I had a brief chat to @aidanheerdegen about this who mentioned a re-sub script that's been used in the past to circumvent this kind of thing:
I guess it's not happening in the global 8km (afaik), so it's presumably something about the regional setup? |
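For reference, a re-sub wrapper of the kind mentioned above could look something like the sketch below. This is not the script referred to in the comment, just a minimal illustration: the function name `run_with_resubmit` and the `max_attempts` parameter are hypothetical, and it assumes the model is launched by a command that blocks until the run finishes and returns a non-zero exit status on failure. It only works around an intermittent crash and does nothing to fix the underlying cause.

```python
import subprocess
import sys


def run_with_resubmit(cmd, max_attempts=3):
    """Run `cmd` repeatedly until it exits with status 0.

    `cmd` is a list of arguments as accepted by subprocess.run.
    Returns the number of attempts that were needed, or raises
    RuntimeError if every attempt failed.
    """
    for attempt in range(1, max_attempts + 1):
        result = subprocess.run(cmd)
        if result.returncode == 0:
            return attempt
        # Report the failure and fall through to the next attempt.
        print(
            f"attempt {attempt} failed with exit code {result.returncode}",
            file=sys.stderr,
        )
    raise RuntimeError(f"command failed {max_attempts} times: {cmd}")
```

Capping the number of attempts matters here: with a failure rate of about one in three, a handful of retries is usually enough, and an uncapped loop would hide a configuration that has stopped working altogether.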
|
The error is caused by the runoff remapping weights - or, at least, the error does not occur when the weights are not applied. I don't know exactly what the issue with them is yet, but should be an easy fix. |
Hmmm maybe not... I can't see anything funny in the remapping weights. I also can run the configuration including runoff remapping without issue when UCX is removed ( |
Interesting. I'm unfamiliar with UCX though, do we need it? i.e. could we make this the new default? |
|
I also can run the configuration including runoff remapping without issue with |
|
@angus-g I'm curious if you have any thoughts. I'm feeling a bit out of my depth. The summary is:
Have you experienced intermittent issues like this with the global 8km config? |
|
Oh and
|
|
I did run into the same error in an ocean-only build in October. My solution was similar, I just disabled ucc: |
|
Thanks @angus-g. That's interesting that you've seen something similar in an ocean-only build. Do you mean using the solo driver or using the |
Yeah, that's where I saw it. Just reproduced (with no MPI flags to mitigate): mom6.err mom6.out |
|
This pull request has been mentioned on ACCESS Hive Community Forum. There might be relevant details there: https://forum.access-hive.org.au/t/cosima-twg-announce/401/80 |
c6ab817 to
e767347
This commit squashes 19 commits made during the original development of this configuration. See #713 for the original commits. The first lines of the original commit messages are as follows:
* Add regional panantarctic configuration (1/12th degree/4km setup)
* Update regional panantarctic configuration and ensure it runs
* Get rid of some old information in the README
* Add MOM_override params for unwanted upstream changes I overlooked
* 2025-08-25 22:10:29: Run 0
* payu archive: documentation of MOM6 run-time configuration
* Change to new executable with updated MOM6 code
* Try jra v1-6
* Update ocn_dt_cpl to reflect real coupling timestep
* Revert MOM_input to OM3-25km config, i.e. USE_PSURF_IN_EOS = False is now default True, MAX_P_SURF = 0 is now default -1, USE_RIGID_SEA_ICE = True is now default False, SEA_ICE_RIGID_MASS = 100.0 is now default 1000. This was done by removing the lines in MOM_input that had set these parameters to be not default.
* Move where diag_table sits
* 2025-08-29 22:03:14: Run 0
* payu archive: documentation of MOM6 run-time configuration
* Add author details for contributors who weren't already on the CITATIONS.cff
* Updating to latest released executable
* Merge MOM_input and MOM_override to be one file, using the MOM_parameter.short file in https://github.com/claireyung/access-om3-configs/tree/a1642770156249411fb2ad47d15d559e97a22c9f
* Set THICKNESSDIFFUSE to be False
* 2025-10-17 14:44:05: Run 0
* Update docs based on files produced in 44cea06
--------
Co-authored-by: Edward Yang <yang.e@wehi.edu.au>
Co-authored-by: Helen Macdonald <179985228+helenmacdonald@users.noreply.github.com>
Co-authored-by: Dougie Squire <42455466+dougiesquire@users.noreply.github.com>
e767347 to
d7fb023
|
!test repro commit |
|
❌ The Bitwise Reproducibility Check Failed ❌ When comparing:
🔧 The new checksums will be committed to this PR, if they differ from what is on this branch.
Further information
The experiment can be found on Gadi at
The checksums generated by this
The checksums compared against are found here https://github.com/ACCESS-NRI/access-om3-configs/tree/d60cb09f95af27f0512ef6c6575ca8047be37b48/testing/checksum
Test summary: |
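To illustrate what a bitwise reproducibility check amounts to, here is a minimal sketch. It assumes the checksums can be read into a flat mapping from field name to checksum string; the real format of the files under `testing/checksum` may differ, and the function names `diff_checksums` and `is_bitwise_reproducible` are hypothetical rather than part of the actual test harness.

```python
def diff_checksums(old, new):
    """Return {field: (old_value, new_value)} for every entry that differs.

    A field present in only one of the two mappings is reported with
    None standing in for the missing side.
    """
    fields = sorted(set(old) | set(new))
    return {
        field: (old.get(field), new.get(field))
        for field in fields
        if old.get(field) != new.get(field)
    }


def is_bitwise_reproducible(old, new):
    """True only if both runs produced identical checksums for all fields."""
    return not diff_checksums(old, new)
```

A check like this fails if even a single field differs, which is why it is useful for confirming that a rebase has left the answers unchanged.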
|
I've squashed and rebased this onto the latest While rebasing and resolving conflicts, I've temporarily reverted all answer-changing changes that have been made to the base since Claire branched. This is to give myself peace of mind that I haven't accidentally changed something of Claire's while rebasing. The checksums that have just been committed match those I created using Claire's original branch. The plan is now to add in those changes that I reverted, which will change answers relative to Claire's original branch:
|
SW pen update requires a new input and will be done in a separate commit
|
!test repro commit |
|
❌ The Bitwise Reproducibility Check Failed ❌ When comparing:
🔧 The new checksums will be committed to this PR, if they differ from what is on this branch.
Further information
The experiment can be found on Gadi at
The checksums generated by this
The checksums compared against are found here https://github.com/ACCESS-NRI/access-om3-configs/tree/d60cb09f95af27f0512ef6c6575ca8047be37b48/testing/checksum
Test summary: |
|
Thanks for pulling this together @dougiesquire. As suggested, we've focused on using this comparison. We've briefly looked at this, when thinking about the merge (e.g. We haven't found anything that looks amiss technically so we think the merge can proceed. We have some small comments related to the science. Observations from @helenmacdonald and I:
USE_RIVER_HEAT_CONTENT = True ! [Boolean] default = False
! If true, use the fluxes%runoff_Hflx field to set the heat carried by runoff,
! instead of using SST*CP*liq_runoff.
USE_CALVING_HEAT_CONTENT = True ! [Boolean] default = False
! If true, use the fluxes%calving_Hflx field to set the heat carried by runoff,
! instead of using SST*CP*froz_runoff.
@claireyung had false for both so was using the
|
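As a rough illustration of the two options described in the parameter comments above, the sketch below picks between a supplied heat flux field and the default SST*CP*liq_runoff estimate. It is not MOM6 code: the function name and arguments are hypothetical, scalars stand in for 2D fields, and the units noted in the comments are assumptions. The calving case would follow the same pattern with froz_runoff and the calving heat flux.

```python
def runoff_heat_flux(sst, cp, liq_runoff, runoff_hflx=None,
                     use_river_heat_content=False):
    """Heat carried by liquid runoff under the two settings.

    Assumed units (not taken from the thread): sst in degC,
    cp in J kg-1 K-1, liq_runoff in kg m-2 s-1, fluxes in W m-2.
    """
    if use_river_heat_content:
        # Flag set to True: use the supplied heat flux field directly.
        if runoff_hflx is None:
            raise ValueError("runoff_hflx is required when the flag is True")
        return runoff_hflx
    # Flag left at its default of False: runoff enters at the local SST.
    return sst * cp * liq_runoff
```

With the flag at False the runoff is assumed to arrive at the local sea surface temperature, so the two settings only agree when the supplied flux field happens to match that estimate.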
helenmacdonald
left a comment
Thanks @dougiesquire! @chrisb13 and I went through it and are happy for it to be merged. Chris has noted a few items that we need to be mindful of to check that the answer changes are not bad for the science output.
Links to where these changes were originally discussed (including answers to some of your questions): |
Thanks for checking. I think it is safe to be removed. I believe the pr version contains various updates from @helenmacdonald. If the above turns out to be incorrect, we can just add it back in again. |
Also snuck in a minor syntactical change that will get squashed away anyway
9aff818
into
dev-MC_4km_jra_ryf+regionalpanan
This commit squashes:
- 19 commits made during the original development of this configuration*
- 1 commit adding repro checksums
- 1 commit removing the file `panatarctic_instructions.md`. This has been moved to #573

*See #713 for the original 19 commits. The first lines of the original 19 commit messages are as follows:
- Add regional panantarctic configuration (1/12th degree/4km setup)
- Update regional panantarctic configuration and ensure it runs
- Get rid of some old information in the README
- Add MOM_override params for unwanted upstream changes I overlooked
- 2025-08-25 22:10:29: Run 0
- payu archive: documentation of MOM6 run-time configuration
- Change to new executable with updated MOM6 code
- Try jra v1-6
- Update ocn_dt_cpl to reflect real coupling timestep
- Revert MOM_input to OM3-25km config, i.e. USE_PSURF_IN_EOS = False is now default True, MAX_P_SURF = 0 is now default -1, USE_RIGID_SEA_ICE = True is now default False, SEA_ICE_RIGID_MASS = 100.0 is now default 1000. This was done by removing the lines in MOM_input that had set these parameters to be not default.
- Move where diag_table sits
- 2025-08-29 22:03:14: Run 0
- payu archive: documentation of MOM6 run-time configuration
- Add author details for contributors who weren't already on the CITATIONS.cff
- Updating to latest released executable
- Merge MOM_input and MOM_override to be one file, using the MOM_parameter.short file in https://github.com/claireyung/access-om3-configs/tree/a1642770156249411fb2ad47d15d559e97a22c9f
- Set THICKNESSDIFFUSE to be False
- 2025-10-17 14:44:05: Run 0
- Update docs based on files produced in 44cea06
--------
Co-authored-by: Edward Yang <yang.e@wehi.edu.au>
Co-authored-by: Helen Macdonald <179985228+helenmacdonald@users.noreply.github.com>
Co-authored-by: Dougie Squire <42455466+dougiesquire@users.noreply.github.com>
Co-authored-by: access-bot <113399144+access-bot@users.noreply.github.com>
A new PR for the alpha release of the regional panan. This PR allows us to add @claireyung's commits on top of the latest dev-MC_25km_jra_ryf.
ADDED BY DOUGIE: This PR now includes the same commits as #713 but rebased onto an updated dev-MC_4km_jra_ryf+regionalpanan branch.
We plan to do a squash merge.
Previous PRs on this:
And discussion: