
claireyung:dev-MC_4km_jra_ryf+regionalpanan. PR #3. #1077

Merged
dougiesquire merged 5 commits into dev-MC_4km_jra_ryf+regionalpanan from 880-dev-MC_4km_jra_ryf+regionalpanan_temp
Feb 2, 2026

Conversation

@chrisb13
Collaborator

@chrisb13 chrisb13 commented Jan 21, 2026

A new PR for the alpha release of the regional panan. This PR allows us to add @claireyung's commits on top of the latest dev-MC_25km_jra_ryf.

ADDED BY DOUGIE: This PR now includes the same commits as #713 but rebased onto an updated dev-MC_4km_jra_ryf+regionalpanan branch

We plan to do a squash merge.

Previous PRs on this:

And discussion:

@chrisb13
Collaborator Author

@dougiesquire, I think I've now done what we discussed today. Can you have a go at resolving the conflicts?

If needed, I imagine we can ask for Claire's help resolving any trickier bits.

@chrisb13 chrisb13 changed the title from "claireyung:dev-MC_4km_jra_ryf+regionalpanan PR #3." to "claireyung:dev-MC_4km_jra_ryf+regionalpanan. PR #3." on Jan 21, 2026
@dougiesquire
Collaborator

dougiesquire commented Jan 21, 2026

I tried to generate repro checksums using Claire's branch (claireyung:dev-MC_4km_jra_ryf+regionalpanan) so that I could confirm answers remain unchanged after rebasing. However, that configuration does not run reliably: it crashes at initialisation roughly every third time it is run.

Error log (access-om2.err)
[gadi-cpu-spr-0089:1532201:0:1532201] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x1dc4d000)
BFD: Dwarf Error: Invalid abstract instance DIE ref.
[the line above is repeated 25 times in the log]
==== backtrace (tid:1532201) ====
0 0x0000000000012990 __funlockfile()  :0
1 0x00000000000de9d2 __memcpy_avx512_unaligned_erms()  :0
2 0x00000000000650a1 ucp_proto_rndv_am_bcopy_pack.lto_priv.0()  :0
3 0x000000000004c15d uct_dc_mlx5_ep_am_bcopy()  ???:0
4 0x000000000006c947 ucp_proto_rndv_am_bcopy_progress.lto_priv.0()  :0
5 0x000000000006ac39 ucp_proto_rndv_send_start()  ???:0
6 0x000000000006b607 ucp_proto_rndv_handle_rtr()  ???:0
7 0x00000000000567d2 uct_dc_mlx5_iface_progress_ll.lto_priv.0()  :0
8 0x00000000000483ca ucp_worker_progress()  ???:0
9 0x00000000000ad26d mca_pml_ucx_send_nbr()  /jobfs/129486601.gadi-pbs/0/openmpi/4.1.7/source/openmpi-4.1.7/ompi/mca/pml/ucx/pml_ucx.c:928
10 0x00000000000ad26d mca_pml_ucx_send_nbr()  /jobfs/129486601.gadi-pbs/0/openmpi/4.1.7/source/openmpi-4.1.7/ompi/mca/pml/ucx/pml_ucx.c:928
11 0x00000000000ad26d mca_pml_ucx_send()  /jobfs/129486601.gadi-pbs/0/openmpi/4.1.7/source/openmpi-4.1.7/ompi/mca/pml/ucx/pml_ucx.c:949
12 0x0000000000208143 PMPI_Send()  /jobfs/129486601.gadi-pbs/0/openmpi/4.1.7/build/gcc/ompi/psend.c:81
13 0x00000000000394ce pio_read_darray_nc_serial()  ???:0
14 0x000000000003686a PIOc_read_darray()  ???:0
15 0x0000000000526ffd get_nodeCoords_from_ESMFMesh_file()  /scratch/tm70/tm70_ci/tmp/spack-stage/spack-stage-esmf-8.7.0-uhxpnqokooatmylfhvhl2dor362cg6j5/spack-src/src/Infrastructure/Mesh/src/ESMCI_ESMFMesh_Util.C:780
16 0x00000000007a65c4 ESMCI_mesh_create_from_ESMFMesh_file()  /scratch/tm70/tm70_ci/tmp/spack-stage/spack-stage-esmf-8.7.0-uhxpnqokooatmylfhvhl2dor362cg6j5/spack-src/src/Infrastructure/Mesh/src/ESMCI_Mesh_FileIO.C:545
17 0x00000000007a5bba ESMCI_mesh_create_from_file()  /scratch/tm70/tm70_ci/tmp/spack-stage/spack-stage-esmf-8.7.0-uhxpnqokooatmylfhvhl2dor362cg6j5/spack-src/src/Infrastructure/Mesh/src/ESMCI_Mesh_FileIO.C:193
18 0x0000000000758e02 ESMCI::MeshCap::meshcreatefromfilenew()  /scratch/tm70/tm70_ci/tmp/spack-stage/spack-stage-esmf-8.7.0-uhxpnqokooatmylfhvhl2dor362cg6j5/spack-src/src/Infrastructure/Mesh/src/ESMCI_MeshCap.C:2592
19 0x00000000007a5916 c_esmc_meshcreatefromfile_()  /scratch/tm70/tm70_ci/tmp/spack-stage/spack-stage-esmf-8.7.0-uhxpnqokooatmylfhvhl2dor362cg6j5/spack-src/src/Infrastructure/Mesh/interface/ESMCI_Mesh_F.C:941
20 0x0000000000e42e2d c_esmc_meshcreatefromfile_.t19878p.t19880p.t19881p.t19883p.t19885p.t19887p.t19889p.t19891p.t11p.t11p.t19892p.t3v.t3v()  /scratch/tm70/tm70_ci/tmp/spack-stage/spack-stage-esmf-8.7.0-uhxpnqokooatmylfhvhl2dor362cg6j5/spack-src/src/Infrastructure/Mesh/interface/ESMF_Mesh.F90:0
21 0x000000000222f25c dshr_strdata_mod_mp_shr_strdata_init_()  ???:0
22 0x000000000222df4c dshr_strdata_mod_mp_shr_strdata_init_from_config_()  ???:0
23 0x00000000022129ca atm_comp_nuopc_mp_initializerealize_()  ???:0
24 0x00000000005396d8 ESMCI::FTable::callVFuncPtr()  /scratch/tm70/tm70_ci/tmp/spack-stage/spack-stage-esmf-8.7.0-uhxpnqokooatmylfhvhl2dor362cg6j5/spack-src/src/Superstructure/Component/src/ESMCI_FTable.C:2187
25 0x000000000053911c ESMCI_FTableCallEntryPointVMHop()  /scratch/tm70/tm70_ci/tmp/spack-stage/spack-stage-esmf-8.7.0-uhxpnqokooatmylfhvhl2dor362cg6j5/spack-src/src/Superstructure/Component/src/ESMCI_FTable.C:844
26 0x00000000008787e4 ESMCI::VM::enter()  /scratch/tm70/tm70_ci/tmp/spack-stage/spack-stage-esmf-8.7.0-uhxpnqokooatmylfhvhl2dor362cg6j5/spack-src/src/Infrastructure/VM/src/ESMCI_VM.C:1216
27 0x0000000000539db0 c_esmc_ftablecallentrypointvm_()  /scratch/tm70/tm70_ci/tmp/spack-stage/spack-stage-esmf-8.7.0-uhxpnqokooatmylfhvhl2dor362cg6j5/spack-src/src/Superstructure/Component/src/ESMCI_FTable.C:1001
28 0x0000000000a78978 c_esmc_ftablecallentrypointvm_.t16698p.t16703p.t16707p.t16711p.t16713p.t16715p.t16717p.t16719p.t16720p.t16721p.t16722p.t16723p()  /scratch/tm70/tm70_ci/tmp/spack-stage/spack-stage-esmf-8.7.0-uhxpnqokooatmylfhvhl2dor362cg6j5/spack-src/src/Superstructure/Component/src/ESMF_Comp.F90:0
29 0x0000000000cf3cb9 esmf_gridcompmod_mp_esmf_gridcompinitialize_()  /scratch/tm70/tm70_ci/tmp/spack-stage/spack-stage-esmf-8.7.0-uhxpnqokooatmylfhvhl2dor362cg6j5/spack-src/src/Superstructure/Component/src/ESMF_GridComp.F90:1443
30 0x000000000107f5f7 nuopc_driver_mp_loopmodelcompss_()  /scratch/tm70/tm70_ci/tmp/spack-stage/spack-stage-esmf-8.7.0-uhxpnqokooatmylfhvhl2dor362cg6j5/spack-src/src/addon/NUOPC/src/NUOPC_Driver.F90:2890
31 0x00000000010752ce nuopc_driver_mp_initializeipdv02p3_()  /scratch/tm70/tm70_ci/tmp/spack-stage/spack-stage-esmf-8.7.0-uhxpnqokooatmylfhvhl2dor362cg6j5/spack-src/src/addon/NUOPC/src/NUOPC_Driver.F90:1982
32 0x00000000005396d8 ESMCI::FTable::callVFuncPtr()  /scratch/tm70/tm70_ci/tmp/spack-stage/spack-stage-esmf-8.7.0-uhxpnqokooatmylfhvhl2dor362cg6j5/spack-src/src/Superstructure/Component/src/ESMCI_FTable.C:2187
33 0x000000000053911c ESMCI_FTableCallEntryPointVMHop()  /scratch/tm70/tm70_ci/tmp/spack-stage/spack-stage-esmf-8.7.0-uhxpnqokooatmylfhvhl2dor362cg6j5/spack-src/src/Superstructure/Component/src/ESMCI_FTable.C:844
34 0x00000000008787e4 ESMCI::VM::enter()  /scratch/tm70/tm70_ci/tmp/spack-stage/spack-stage-esmf-8.7.0-uhxpnqokooatmylfhvhl2dor362cg6j5/spack-src/src/Infrastructure/VM/src/ESMCI_VM.C:1216
35 0x0000000000539db0 c_esmc_ftablecallentrypointvm_()  /scratch/tm70/tm70_ci/tmp/spack-stage/spack-stage-esmf-8.7.0-uhxpnqokooatmylfhvhl2dor362cg6j5/spack-src/src/Superstructure/Component/src/ESMCI_FTable.C:1001
36 0x0000000000a78978 c_esmc_ftablecallentrypointvm_.t16698p.t16703p.t16707p.t16711p.t16713p.t16715p.t16717p.t16719p.t16720p.t16721p.t16722p.t16723p()  /scratch/tm70/tm70_ci/tmp/spack-stage/spack-stage-esmf-8.7.0-uhxpnqokooatmylfhvhl2dor362cg6j5/spack-src/src/Superstructure/Component/src/ESMF_Comp.F90:0
37 0x0000000000cf3cb9 esmf_gridcompmod_mp_esmf_gridcompinitialize_()  /scratch/tm70/tm70_ci/tmp/spack-stage/spack-stage-esmf-8.7.0-uhxpnqokooatmylfhvhl2dor362cg6j5/spack-src/src/Superstructure/Component/src/ESMF_GridComp.F90:1443
38 0x000000000107f5f7 nuopc_driver_mp_loopmodelcompss_()  /scratch/tm70/tm70_ci/tmp/spack-stage/spack-stage-esmf-8.7.0-uhxpnqokooatmylfhvhl2dor362cg6j5/spack-src/src/addon/NUOPC/src/NUOPC_Driver.F90:2890
39 0x00000000010753cf nuopc_driver_mp_initializeipdv02p3_()  /scratch/tm70/tm70_ci/tmp/spack-stage/spack-stage-esmf-8.7.0-uhxpnqokooatmylfhvhl2dor362cg6j5/spack-src/src/addon/NUOPC/src/NUOPC_Driver.F90:1987
40 0x00000000010656e3 nuopc_driver_mp_initializegeneric_()  /scratch/tm70/tm70_ci/tmp/spack-stage/spack-stage-esmf-8.7.0-uhxpnqokooatmylfhvhl2dor362cg6j5/spack-src/src/addon/NUOPC/src/NUOPC_Driver.F90:487
41 0x00000000005396d8 ESMCI::FTable::callVFuncPtr()  /scratch/tm70/tm70_ci/tmp/spack-stage/spack-stage-esmf-8.7.0-uhxpnqokooatmylfhvhl2dor362cg6j5/spack-src/src/Superstructure/Component/src/ESMCI_FTable.C:2187
42 0x000000000053911c ESMCI_FTableCallEntryPointVMHop()  /scratch/tm70/tm70_ci/tmp/spack-stage/spack-stage-esmf-8.7.0-uhxpnqokooatmylfhvhl2dor362cg6j5/spack-src/src/Superstructure/Component/src/ESMCI_FTable.C:844
43 0x00000000008787e4 ESMCI::VM::enter()  /scratch/tm70/tm70_ci/tmp/spack-stage/spack-stage-esmf-8.7.0-uhxpnqokooatmylfhvhl2dor362cg6j5/spack-src/src/Infrastructure/VM/src/ESMCI_VM.C:1216
44 0x0000000000539db0 c_esmc_ftablecallentrypointvm_()  /scratch/tm70/tm70_ci/tmp/spack-stage/spack-stage-esmf-8.7.0-uhxpnqokooatmylfhvhl2dor362cg6j5/spack-src/src/Superstructure/Component/src/ESMCI_FTable.C:1001
45 0x0000000000a78978 c_esmc_ftablecallentrypointvm_.t16698p.t16703p.t16707p.t16711p.t16713p.t16715p.t16717p.t16719p.t16720p.t16721p.t16722p.t16723p()  /scratch/tm70/tm70_ci/tmp/spack-stage/spack-stage-esmf-8.7.0-uhxpnqokooatmylfhvhl2dor362cg6j5/spack-src/src/Superstructure/Component/src/ESMF_Comp.F90:0
46 0x0000000000cf3cb9 esmf_gridcompmod_mp_esmf_gridcompinitialize_()  /scratch/tm70/tm70_ci/tmp/spack-stage/spack-stage-esmf-8.7.0-uhxpnqokooatmylfhvhl2dor362cg6j5/spack-src/src/Superstructure/Component/src/ESMF_GridComp.F90:1443
47 0x0000000000670e0d MAIN__()  ???:0
48 0x000000000067080d main()  ???:0
49 0x000000000003a7e5 __libc_start_main()  ???:0
50 0x000000000067072e _start()  ???:0
=================================
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
forrtl: error (78): process killed (SIGTERM)
Image              PC                Routine            Line        Source
libpthread-2.28.s  0000154ACB3A7990  Unknown               Unknown  Unknown
libuct_ib.so.0.0.  0000154AC03C3892  Unknown               Unknown  Unknown
libucp.so.0.0.0    0000154AC6C8C3CA  ucp_worker_progre     Unknown  Unknown
libucc.so.1.0.0    0000154AC7FB9A27  ucc_context_progr     Unknown  Unknown
libmpi.so.40.30.7  0000154ACC2FD6A7  mca_coll_ucc_prog     Unknown  Unknown
libopen-pal.so.40  0000154AC7114923  opal_progress         Unknown  Unknown
libopen-pal.so.40  0000154AC7114AD5  ompi_sync_wait_mt     Unknown  Unknown
libmpi.so.40.30.7  0000154ACC37F6CA  ompi_request_defa     Unknown  Unknown
libmpi.so.40.30.7  0000154ACC2B7ECA  mca_coll_basic_al     Unknown  Unknown
libmpi.so.40.30.7  0000154ACC3612AB  MPI_Alltoallw         Unknown  Unknown
libpioc.so         0000154ACE40DBDB  pio_swapm             Unknown  Unknown
libpioc.so         0000154ACE411B76  rearrange_io2comp     Unknown  Unknown
libpioc.so         0000154ACE42B8BE  PIOc_read_darray      Unknown  Unknown
libesmf.so         0000154ACEDC4FFD  _Z33get_nodeCoord     Unknown  Unknown
libesmf.so         0000154ACF0445C4  _Z36ESMCI_mesh_cr     Unknown  Unknown
libesmf.so         0000154ACF043BBA  _Z27ESMCI_mesh_cr     Unknown  Unknown
libesmf.so         0000154ACEFF6E02  _ZN5ESMCI7MeshCap     Unknown  Unknown
libesmf.so         0000154ACF043916  c_esmc_meshcreate     Unknown  Unknown
libesmf.so         0000154ACF6E0E2D  esmf_meshmod_mp_e     Unknown  Unknown
access-om3-MOM6-C  000000000222F25C  shr_strdata_init          438  dshr_strdata_mod.F90
access-om3-MOM6-C  000000000222DF4C  shr_strdata_init_         234  dshr_strdata_mod.F90
access-om3-MOM6-C  00000000022129CA  initializerealize         453  atm_comp_nuopc.F90
libesmf.so         0000154ACEDD76D8  _ZN5ESMCI6FTable1     Unknown  Unknown
libesmf.so         0000154ACEDD711C  ESMCI_FTableCallE     Unknown  Unknown
libesmf.so         0000154ACF1167E4  _ZN5ESMCI2VM5ente     Unknown  Unknown
libesmf.so         0000154ACEDD7DB0  c_esmc_ftablecall     Unknown  Unknown
libesmf.so         0000154ACF316978  esmf_compmod_mp_e     Unknown  Unknown
libesmf.so         0000154ACF591CB9  esmf_gridcompmod_     Unknown  Unknown
libesmf.so         0000154ACF91D5F7  nuopc_driver_mp_l     Unknown  Unknown
libesmf.so         0000154ACF9132CE  nuopc_driver_mp_i     Unknown  Unknown
libesmf.so         0000154ACEDD76D8  _ZN5ESMCI6FTable1     Unknown  Unknown
libesmf.so         0000154ACEDD711C  ESMCI_FTableCallE     Unknown  Unknown
libesmf.so         0000154ACF1167E4  _ZN5ESMCI2VM5ente     Unknown  Unknown
libesmf.so         0000154ACEDD7DB0  c_esmc_ftablecall     Unknown  Unknown
libesmf.so         0000154ACF316978  esmf_compmod_mp_e     Unknown  Unknown
libesmf.so         0000154ACF591CB9  esmf_gridcompmod_     Unknown  Unknown
libesmf.so         0000154ACF91D5F7  nuopc_driver_mp_l     Unknown  Unknown
libesmf.so         0000154ACF9133CF  nuopc_driver_mp_i     Unknown  Unknown
libesmf.so         0000154ACF9036E3  nuopc_driver_mp_i     Unknown  Unknown
libesmf.so         0000154ACEDD76D8  _ZN5ESMCI6FTable1     Unknown  Unknown
libesmf.so         0000154ACEDD711C  ESMCI_FTableCallE     Unknown  Unknown
libesmf.so         0000154ACF1167E4  _ZN5ESMCI2VM5ente     Unknown  Unknown
libesmf.so         0000154ACEDD7DB0  c_esmc_ftablecall     Unknown  Unknown
libesmf.so         0000154ACF316978  esmf_compmod_mp_e     Unknown  Unknown
libesmf.so         0000154ACF591CB9  esmf_gridcompmod_     Unknown  Unknown
access-om3-MOM6-C  0000000000670E0D  esmapp                    128  esmApp.F90
access-om3-MOM6-C  000000000067080D  Unknown               Unknown  Unknown
libc-2.28.so       0000154ACAFF87E5  __libc_start_main     Unknown  Unknown
access-om3-MOM6-C  000000000067072E  Unknown               Unknown  Unknown

This will need to be resolved before we can release this configuration.

(Note, I did eventually manage to generate the repro checksums and pushed them to Claire's branch for reference)

(Note note, the configuration is also not restart reproducible)

@chrisb13
Collaborator Author

chrisb13 commented Jan 21, 2026

Thanks @dougiesquire. Just to be clear, I'm assuming this was with the discussed update to ESMF

ESMF to 8.8.0 to include esmf-org/esmf@0a84bc6

source.

?

I imagine it's unlikely to be important, but I think you've also included the latest 7 commits from the merge commit we talked about yesterday? If you think these could be playing a role, it would be a fairer comparison (i.e. what Claire was using) to try from c6ab8175e87fd2d9f6e10d20d994b4b3c35846b5.

On the same thread, the comments @AndyHoggANU made yesterday (relaying @claireyung, I gather) made it sound like it wasn't quite as crash-prone as your experience, but I would think that would be the ice-shelf case, started from a restart. Hence, is it possible to repeat the tests you did for this branch of Claire's, ideally using a restart of her choosing? (Similar to your comment, this would give us a baseline before thinking about this PR.)

Finally, I suppose there is the approach from the other direction: perhaps this problem might be improved by trying an updated build and config?

@dougiesquire
Collaborator

Thanks @dougiesquire. Just to be clear, I'm assuming this was with the discussed update to ESMF

ESMF to 8.8.0 to include esmf-org/esmf@0a84bc6

source.

Nope, it was just running the branch as is. I'm testing with updated ESMF at the moment and will update here.

Yup, I'll try a few other things today. Again, will update here.

@dougiesquire
Collaborator

dougiesquire commented Jan 23, 2026

Just a note that updating to ESMF 8.8 does not help the issue.

  • Using access-om3/2025.08.001 (ESMF 8.7.0): 5/20 1-day runs failed
  • Using access-om3/pr183-5 (ESMF 8.8.1): 4/20 1-day runs failed

Still testing a few things.

@chrisb13
Collaborator Author

A touch better than a third I guess.

Just a note that updating to ESMF 8.8 does not help the issue.

Well, I guess you got one less fail, but you mean it doesn't fix the issue?

Just to mention, I had a brief chat with @aidanheerdegen about this; he mentioned a re-sub script that's been used in the past to circumvent this kind of thing:

I guess it's not happening in the global 8k (afaik), so it's presumably something about the regional setup?

@dougiesquire
Collaborator

dougiesquire commented Jan 27, 2026

The error is caused by the runoff remapping weights - or, at least, the error does not occur when the weights are not applied. I don't know exactly what the issue with them is yet, but it should be an easy fix.

@dougiesquire
Collaborator

but it should be an easy fix.

Hmmm maybe not... I can't see anything funny in the remapping weights. I can also run the configuration, including runoff remapping, without issue when UCX is removed (--mca pml ob1 --mca coll ^ucc,hcoll), which maybe implies MPI weirdness, but I'm rapidly getting out of my depth.
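
For anyone who wants to poke at the weights themselves, this is roughly the kind of check I mean (a minimal xarray sketch; it assumes an ESMF/SCRIP-style weights file with `S`, `col` and `row` variables and `n_a`/`n_b` dimensions, and the filename is a placeholder rather than the actual runoff weights file):

```python
# Minimal sanity check of an ESMF/SCRIP-style remapping weights file.
# Assumes variables S (weights), col (1-based source indices), row (1-based
# destination indices) and dimensions n_a / n_b giving the grid sizes.
# The filename below is a placeholder, not the actual runoff weights file.
import numpy as np
import xarray as xr

ds = xr.open_dataset("runoff_remap_weights.nc")  # placeholder path

S = ds["S"].values
col = ds["col"].values.astype(int)
row = ds["row"].values.astype(int)
n_a = ds.sizes["n_a"]
n_b = ds.sizes["n_b"]

print("non-finite weights:", np.count_nonzero(~np.isfinite(S)))
print("negative weights:  ", np.count_nonzero(S < 0))
print("col out of range:  ", np.count_nonzero((col < 1) | (col > n_a)))
print("row out of range:  ", np.count_nonzero((row < 1) | (row > n_b)))

# Sum of weights reaching each destination cell, as a rough consistency check
valid = (row >= 1) & (row <= n_b)
row_sums = np.bincount(row[valid] - 1, weights=S[valid], minlength=n_b)
print("row-sum range:", row_sums.min(), row_sums.max())
```

Nothing in those checks jumped out for this file, which is partly why I'm leaning towards this being MPI weirdness rather than bad weights.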

@chrisb13
Collaborator Author

I can also run the configuration, including runoff remapping, without issue when UCX is removed (--mca pml ob1 --mca coll ^ucc,hcoll)

Interesting. I'm unfamiliar with UCX, though; do we need it? i.e. could we make this the new default?

@dougiesquire
Collaborator

I can also run the configuration, including runoff remapping, without issue using --mca pml ucx -x UCX_RNDV_THRESH=inf

@dougiesquire
Collaborator

@angus-g I'm curious if you have any thoughts. I'm feeling a bit out of my depth.

The summary is:

  • Claire's panan config fails ~20% of the time with the error in this comment
  • Updating to ESMF 8.8 doesn't help
  • The error seems to go away if our custom runoff remapping is disabled
  • The error seems to go away using --mca pml ob1 --mca coll ^hcoll or --mca pml ucx -x UCX_RNDV_THRESH=inf

Have you experienced intermittent issues like this with the global 8km config?

@dougiesquire
Collaborator

Oh and

  • After updating to openmpi 5.0.8, ~50% of runs crash immediately and ~50% hang indefinitely

@angus-g
Collaborator

angus-g commented Jan 27, 2026

I did run into the same error in an ocean-only build in October. My solution was similar: I just disabled ucc (-mca coll ^ucc). I think I also hit it in some other configs, but I didn't investigate or figure out specifically what fixed anything! I have never had any decent luck with openmpi 5.x, so I would've only used 4.1.7... I'll see if I can re-reproduce the issue with the ocean-only build, which might help to track it down.

@dougiesquire
Collaborator

Thanks @angus-g. That's interesting that you've seen something similar in an ocean-only build. Do you mean using the solo driver or using the MOM6 ACCESS-OM3 exe? For this panan configuration, -mca coll ^ucc doesn't stop the crashes.

@angus-g
Collaborator

angus-g commented Jan 28, 2026

Do you mean using the solo driver

Yeah, that's where I saw it. Just reproduced (with no MPI flags to mitigate):

mom6.err
[gadi-cpu-clx-2353.gadi.nci.org.au:440492] Error: /jobfs/129486601.gadi-pbs/0/openmpi/4.1.7/source/openmpi-4.1.7/ompi/mca/coll/ucc/coll_ucc_bcast.c:64 - coll_ucc_req_wait() ucc_collective_test failed: Unhandled error
[gadi-cpu-clx-2353.gadi.nci.org.au:440504] Error: /jobfs/129486601.gadi-pbs/0/openmpi/4.1.7/source/openmpi-4.1.7/ompi/mca/coll/ucc/coll_ucc_bcast.c:64 - coll_ucc_req_wait() ucc_collective_test failed: Unhandled error
[gadi-cpu-clx-2353.gadi.nci.org.au:440517] Error: /jobfs/129486601.gadi-pbs/0/openmpi/4.1.7/source/openmpi-4.1.7/ompi/mca/coll/ucc/coll_ucc_bcast.c:64 - coll_ucc_req_wait() ucc_collective_test failed: Unhandled error
[gadi-cpu-clx-2354.gadi.nci.org.au:566075] Error: /jobfs/129486601.gadi-pbs/0/openmpi/4.1.7/source/openmpi-4.1.7/ompi/mca/coll/ucc/coll_ucc_bcast.c:64 - coll_ucc_req_wait() ucc_collective_test failed: Unhandled error
[gadi-cpu-clx-2356.gadi.nci.org.au:2168173] Error: /jobfs/129486601.gadi-pbs/0/openmpi/4.1.7/source/openmpi-4.1.7/ompi/mca/coll/ucc/coll_ucc_bcast.c:64 - coll_ucc_req_wait() ucc_collective_test failed: Unhandled error
[gadi-cpu-clx-2355.gadi.nci.org.au:623656] Error: /jobfs/129486601.gadi-pbs/0/openmpi/4.1.7/source/openmpi-4.1.7/ompi/mca/coll/ucc/coll_ucc_bcast.c:64 - coll_ucc_req_wait() ucc_collective_test failed: Unhandled error
[gadi-cpu-clx-2354.gadi.nci.org.au:566087] Error: /jobfs/129486601.gadi-pbs/0/openmpi/4.1.7/source/openmpi-4.1.7/ompi/mca/coll/ucc/coll_ucc_bcast.c:64 - coll_ucc_req_wait() ucc_collective_test failed: Unhandled error
[gadi-cpu-clx-2371.gadi.nci.org.au:4165575] Error: /jobfs/129486601.gadi-pbs/0/openmpi/4.1.7/source/openmpi-4.1.7/ompi/mca/coll/ucc/coll_ucc_bcast.c:64 - coll_ucc_req_wait() ucc_collective_test failed: Unhandled error
[gadi-cpu-clx-2356.gadi.nci.org.au:2168185] Error: /jobfs/129486601.gadi-pbs/0/openmpi/4.1.7/source/openmpi-4.1.7/ompi/mca/coll/ucc/coll_ucc_bcast.c:64 - coll_ucc_req_wait() ucc_collective_test failed: Unhandled error
[gadi-cpu-clx-2355.gadi.nci.org.au:623665] Error: /jobfs/129486601.gadi-pbs/0/openmpi/4.1.7/source/openmpi-4.1.7/ompi/mca/coll/ucc/coll_ucc_bcast.c:64 - coll_ucc_req_wait() ucc_collective_test failed: Unhandled error
[gadi-cpu-clx-2354.gadi.nci.org.au:566099] Error: /jobfs/129486601.gadi-pbs/0/openmpi/4.1.7/source/openmpi-4.1.7/ompi/mca/coll/ucc/coll_ucc_bcast.c:64 - coll_ucc_req_wait() ucc_collective_test failed: Unhandled error
[gadi-cpu-clx-2371.gadi.nci.org.au:4165584] Error: /jobfs/129486601.gadi-pbs/0/openmpi/4.1.7/source/openmpi-4.1.7/ompi/mca/coll/ucc/coll_ucc_bcast.c:64 - coll_ucc_req_wait() ucc_collective_test failed: Unhandled error
[gadi-cpu-clx-2371.gadi.nci.org.au:4165589] Error: /jobfs/129486601.gadi-pbs/0/openmpi/4.1.7/source/openmpi-4.1.7/ompi/mca/coll/ucc/coll_ucc_bcast.c:64 - coll_ucc_req_wait() ucc_collective_test failed: Unhandled error
[gadi-cpu-clx-2355.gadi.nci.org.au:623675] Error: /jobfs/129486601.gadi-pbs/0/openmpi/4.1.7/source/openmpi-4.1.7/ompi/mca/coll/ucc/coll_ucc_bcast.c:64 - coll_ucc_req_wait() ucc_collective_test failed: Unhandled error
[gadi-cpu-clx-2371.gadi.nci.org.au:4165599] Error: /jobfs/129486601.gadi-pbs/0/openmpi/4.1.7/source/openmpi-4.1.7/ompi/mca/coll/ucc/coll_ucc_bcast.c:64 - coll_ucc_req_wait() ucc_collective_test failed: Unhandled error
mom6.out
[1769553598.552098] [gadi-cpu-clx-2353:440484:0]     tl_ucp_coll.c:137  TL_UCP ERROR failure in recv completion Message truncated
[1769553598.552098] [gadi-cpu-clx-2353:440485:0]     tl_ucp_coll.c:137  TL_UCP ERROR failure in recv completion Message truncated
[1769553598.552098] [gadi-cpu-clx-2353:440492:0]     tl_ucp_coll.c:137  TL_UCP ERROR failure in recv completion Message truncated
[1769553598.552136] [gadi-cpu-clx-2353:440492:0]    ucc_schedule.h:198  UCC  ERROR failure in task 0x4a513c0, Unhandled error
[1769553598.552090] [gadi-cpu-clx-2353:440517:0]     tl_ucp_coll.c:137  TL_UCP ERROR failure in recv completion Message truncated
[1769553598.552130] [gadi-cpu-clx-2353:440517:0]    ucc_schedule.h:198  UCC  ERROR failure in task 0x42e0580, Unhandled error
[1769553598.552118] [gadi-cpu-clx-2353:440495:0]     tl_ucp_coll.c:137  TL_UCP ERROR failure in recv completion Message truncated
[1769553598.552118] [gadi-cpu-clx-2353:440497:0]     tl_ucp_coll.c:137  TL_UCP ERROR failure in recv completion Message truncated
[1769553598.552120] [gadi-cpu-clx-2353:440504:0]     tl_ucp_coll.c:137  TL_UCP ERROR failure in recv completion Message truncated
[1769553598.552150] [gadi-cpu-clx-2353:440504:0]    ucc_schedule.h:198  UCC  ERROR failure in task 0x528a700, Unhandled error
[1769553598.552174] [gadi-cpu-clx-2353:440506:0]     tl_ucp_coll.c:137  TL_UCP ERROR failure in recv completion Message truncated
[1769553598.552175] [gadi-cpu-clx-2353:440508:0]     tl_ucp_coll.c:137  TL_UCP ERROR failure in recv completion Message truncated
[1769553598.552196] [gadi-cpu-clx-2353:440519:0]     tl_ucp_coll.c:137  TL_UCP ERROR failure in recv completion Message truncated
[1769553598.552196] [gadi-cpu-clx-2353:440520:0]     tl_ucp_coll.c:137  TL_UCP ERROR failure in recv completion Message truncated
[1769553598.552049] [gadi-cpu-clx-2355:623675:0]     tl_ucp_coll.c:137  TL_UCP ERROR failure in recv completion Message truncated
[1769553598.552132] [gadi-cpu-clx-2355:623649:0]     tl_ucp_coll.c:137  TL_UCP ERROR failure in recv completion Message truncated
[1769553598.552094] [gadi-cpu-clx-2371:4165573:0]     tl_ucp_coll.c:137  TL_UCP ERROR failure in recv completion Message truncated
[1769553598.552107] [gadi-cpu-clx-2354:566070:0]     tl_ucp_coll.c:137  TL_UCP ERROR failure in recv completion Message truncated
[1769553598.552094] [gadi-cpu-clx-2371:4165569:0]     tl_ucp_coll.c:137  TL_UCP ERROR failure in recv completion Message truncated
[1769553598.552107] [gadi-cpu-clx-2354:566072:0]     tl_ucp_coll.c:137  TL_UCP ERROR failure in recv completion Message truncated
[1769553598.552094] [gadi-cpu-clx-2371:4165575:0]     tl_ucp_coll.c:137  TL_UCP ERROR failure in recv completion Message truncated
[1769553598.552107] [gadi-cpu-clx-2354:566075:0]     tl_ucp_coll.c:137  TL_UCP ERROR failure in recv completion Message truncated
[1769553598.552145] [gadi-cpu-clx-2354:566075:0]    ucc_schedule.h:198  UCC  ERROR failure in task 0x4c51b40, Unhandled error
[1769553598.552165] [gadi-cpu-clx-2356:2168168:0]     tl_ucp_coll.c:137  TL_UCP ERROR failure in recv completion Message truncated
[1769553598.552144] [gadi-cpu-clx-2371:4165584:0]     tl_ucp_coll.c:137  TL_UCP ERROR failure in recv completion Message truncated
[1769553598.552140] [gadi-cpu-clx-2371:4165580:0]     tl_ucp_coll.c:137  TL_UCP ERROR failure in recv completion Message truncated
[1769553598.552145] [gadi-cpu-clx-2354:566082:0]     tl_ucp_coll.c:137  TL_UCP ERROR failure in recv completion Message truncated
[1769553598.552165] [gadi-cpu-clx-2356:2168169:0]     tl_ucp_coll.c:137  TL_UCP ERROR failure in recv completion Message truncated
[1769553598.552144] [gadi-cpu-clx-2371:4165589:0]     tl_ucp_coll.c:137  TL_UCP ERROR failure in recv completion Message truncated
[1769553598.552132] [gadi-cpu-clx-2355:623650:0]     tl_ucp_coll.c:137  TL_UCP ERROR failure in recv completion Message truncated
[1769553598.552146] [gadi-cpu-clx-2354:566081:0]     tl_ucp_coll.c:137  TL_UCP ERROR failure in recv completion Message truncated
[1769553598.552165] [gadi-cpu-clx-2356:2168173:0]     tl_ucp_coll.c:137  TL_UCP ERROR failure in recv completion Message truncated
[1769553598.552207] [gadi-cpu-clx-2356:2168173:0]    ucc_schedule.h:198  UCC  ERROR failure in task 0x50c2180, Unhandled error
[1769553598.552095] [gadi-cpu-clx-2371:4165599:0]     tl_ucp_coll.c:137  TL_UCP ERROR failure in recv completion Message truncated
[1769553598.552132] [gadi-cpu-clx-2355:623656:0]     tl_ucp_coll.c:137  TL_UCP ERROR failure in recv completion Message truncated
[1769553598.552168] [gadi-cpu-clx-2355:623656:0]    ucc_schedule.h:198  UCC  ERROR failure in task 0x455d640, Unhandled error
[1769553598.552107] [gadi-cpu-clx-2354:566087:0]     tl_ucp_coll.c:137  TL_UCP ERROR failure in recv completion Message truncated
[1769553598.552166] [gadi-cpu-clx-2354:566087:0]    ucc_schedule.h:198  UCC  ERROR failure in task 0x52d5200, Unhandled error
[1769553598.552208] [gadi-cpu-clx-2371:4165597:0]     tl_ucp_coll.c:137  TL_UCP ERROR failure in recv completion Message truncated
[1769553598.552211] [gadi-cpu-clx-2356:2168180:0]     tl_ucp_coll.c:137  TL_UCP ERROR failure in recv completion Message truncated
[1769553598.552208] [gadi-cpu-clx-2371:4165591:0]     tl_ucp_coll.c:137  TL_UCP ERROR failure in recv completion Message truncated
[1769553598.552177] [gadi-cpu-clx-2355:623661:0]     tl_ucp_coll.c:137  TL_UCP ERROR failure in recv completion Message truncated
[1769553598.552168] [gadi-cpu-clx-2354:566094:0]     tl_ucp_coll.c:137  TL_UCP ERROR failure in recv completion Message truncated
[1769553598.552211] [gadi-cpu-clx-2356:2168182:0]     tl_ucp_coll.c:137  TL_UCP ERROR failure in recv completion Message truncated
[1769553598.552213] [gadi-cpu-clx-2371:4165607:0]     tl_ucp_coll.c:137  TL_UCP ERROR failure in recv completion Message truncated
[1769553598.552175] [gadi-cpu-clx-2355:623660:0]     tl_ucp_coll.c:137  TL_UCP ERROR failure in recv completion Message truncated
[1769553598.552168] [gadi-cpu-clx-2354:566093:0]     tl_ucp_coll.c:137  TL_UCP ERROR failure in recv completion Message truncated
[1769553598.552165] [gadi-cpu-clx-2356:2168185:0]     tl_ucp_coll.c:137  TL_UCP ERROR failure in recv completion Message truncated
[1769553598.552220] [gadi-cpu-clx-2356:2168185:0]    ucc_schedule.h:198  UCC  ERROR failure in task 0x424ae00, Unhandled error
[1769553598.552132] [gadi-cpu-clx-2355:623665:0]     tl_ucp_coll.c:137  TL_UCP ERROR failure in recv completion Message truncated
[1769553598.552183] [gadi-cpu-clx-2355:623665:0]    ucc_schedule.h:198  UCC  ERROR failure in task 0x39a9280, Unhandled error
[1769553598.552118] [gadi-cpu-clx-2354:566099:0]     tl_ucp_coll.c:137  TL_UCP ERROR failure in recv completion Message truncated
[1769553598.552165] [gadi-cpu-clx-2354:566099:0]    ucc_schedule.h:198  UCC  ERROR failure in task 0x434fd80, Unhandled error
[1769553598.552350] [gadi-cpu-clx-2371:4165575:0]    ucc_schedule.h:198  UCC  ERROR failure in task 0x3a39d40, Unhandled error
[1769553598.552248] [gadi-cpu-clx-2356:2168192:0]     tl_ucp_coll.c:137  TL_UCP ERROR failure in recv completion Message truncated
[1769553598.552194] [gadi-cpu-clx-2355:623669:0]     tl_ucp_coll.c:137  TL_UCP ERROR failure in recv completion Message truncated
[1769553598.552248] [gadi-cpu-clx-2356:2168194:0]     tl_ucp_coll.c:137  TL_UCP ERROR failure in recv completion Message truncated
[1769553598.552359] [gadi-cpu-clx-2371:4165584:0]    ucc_schedule.h:198  UCC  ERROR failure in task 0x44748c0, Unhandled error
[1769553598.552192] [gadi-cpu-clx-2355:623670:0]     tl_ucp_coll.c:137  TL_UCP ERROR failure in recv completion Message truncated
[1769553598.552248] [gadi-cpu-clx-2356:2168203:0]     tl_ucp_coll.c:137  TL_UCP ERROR failure in recv completion Message truncated
[1769553598.552248] [gadi-cpu-clx-2356:2168206:0]     tl_ucp_coll.c:137  TL_UCP ERROR failure in recv completion Message truncated
[1769553598.552358] [gadi-cpu-clx-2371:4165599:0]    ucc_schedule.h:198  UCC  ERROR failure in task 0x4d2f500, Unhandled error
[1769553598.552195] [gadi-cpu-clx-2355:623681:0]     tl_ucp_coll.c:137  TL_UCP ERROR failure in recv completion Message truncated
[1769553598.552365] [gadi-cpu-clx-2371:4165589:0]    ucc_schedule.h:198  UCC  ERROR failure in task 0x3c27bc0, Unhandled error
[1769553598.552196] [gadi-cpu-clx-2355:623682:0]     tl_ucp_coll.c:137  TL_UCP ERROR failure in recv completion Message truncated
[1769553598.552140] [gadi-cpu-clx-2355:623675:0]    ucc_schedule.h:198  UCC  ERROR failure in task 0x50a8000, Unhandled error
[1769553598.552730] [gadi-cpu-clx-2371:4165608:0]     tl_ucp_coll.c:137  TL_UCP ERROR failure in recv completion Message truncated

@access-hive-bot

This pull request has been mentioned on ACCESS Hive Community Forum. There might be relevant details there:

https://forum.access-hive.org.au/t/cosima-twg-announce/401/80

@dougiesquire dougiesquire force-pushed the 880-dev-MC_4km_jra_ryf+regionalpanan_temp branch from c6ab817 to e767347 on January 29, 2026 05:37
This commit squashes 19 commits made during the original development of this configuration. See #713 for the original commits.

The first lines of the original commit messages are as follows:

* Add regional panantarctic configuration (1/12th degree/4km setup)

* Update regional panantarctic configuration and ensure it runs

* Get rid of some old information in the README

* Add MOM_override params for unwanted upstream changes I overlooked

* 2025-08-25 22:10:29: Run 0

* payu archive: documentation of MOM6 run-time configuration

* Change to new executable with updated MOM6 code

* Try jra v1-6

* Update ocn_dt_cpl to reflect real coupling timestep

* Revert MOM_input to OM3-25km config, i.e. USE_PSURF_IN_EOS = False is now default True, MAX_P_SURF = 0 is now default -1, USE_RIGID_SEA_ICE = True is now default False, SEA_ICE_RIGID_MASS = 100.0 is now default 1000. This was done by removing the lines in MOM_input that had set these parameters to non-default values.

* Move where diag_table sits

* 2025-08-29 22:03:14: Run 0

* payu archive: documentation of MOM6 run-time configuration

* Add author details for contributors who weren't already on the CITATIONS.cff

* Updating to latest released executable

* Merge MOM_input and MOM_override to be one file, using the MOM_parameter.short file in https://github.com/claireyung/access-om3-configs/tree/a1642770156249411fb2ad47d15d559e97a22c9f

* Set THICKNESSDIFFUSE to be False

* 2025-10-17 14:44:05: Run 0

* Update docs based on files produced in 44cea06

--------

Co-authored-by: Edward Yang <yang.e@wehi.edu.au>
Co-authored-by: Helen Macdonald <179985228+helenmacdonald@users.noreply.github.com>
Co-authored-by: Dougie Squire <42455466+dougiesquire@users.noreply.github.com>
@dougiesquire dougiesquire force-pushed the 880-dev-MC_4km_jra_ryf+regionalpanan_temp branch from e767347 to d7fb023 on January 29, 2026 05:42
@dougiesquire
Collaborator

!test repro commit

@github-actions

❌ The Bitwise Reproducibility Check Failed ❌

When comparing:

  • 880-dev-MC_4km_jra_ryf+regionalpanan_temp (checksums created using commit d7fb023), against
  • dev-MC_4km_jra_ryf+regionalpanan (checksums in commit d60cb09)

🔧 The new checksums will be committed to this PR, if they differ from what is on this branch.

Further information

The experiment can be found on Gadi at /scratch/tm70/repro-ci/experiments/access-om3-configs/d7fb023e70b256cb93a1a614999c6f130cfdbc55, and the test results at https://github.com/ACCESS-NRI/access-om3-configs/runs/61832389511.

The checksums generated by this !test command are found in the testing/checksum directory of https://github.com/ACCESS-NRI/access-om3-configs/actions/runs/21467135764/artifacts/5299033792.

The checksums compared against are found here https://github.com/ACCESS-NRI/access-om3-configs/tree/d60cb09f95af27f0512ef6c6575ca8047be37b48/testing/checksum

Test summary:
test_repro_historical

@dougiesquire
Collaborator

I've squashed and rebased this onto the latest dev-MC_4km_jra_ryf+regionalpanan branch, which is currently just a copy of the latest dev-MC_25km_jra_ryf branch.

While rebasing and resolving conflicts, I've temporarily reverted all answer-changing changes that have been made to the base since Claire branched. This is to give myself peace of mind that I haven't accidentally changed something of Claire's while rebasing.

The checksums that have just been committed match those I created using Claire's original branch.

The plan is now to add in those changes that I reverted, which will change answers relative to Claire's original branch:

SW pen update requires a new input and will be done in a separate commit
@dougiesquire
Collaborator

!test repro commit

@dougiesquire
Collaborator

@chrisb13, I suggest we squash merge this now. I'll submit new PRs to add CHL-informed SW pen and update the MPI flags.

Here's the difference between Claire's original branch and this one.

@github-actions

❌ The Bitwise Reproducibility Check Failed ❌

When comparing:

  • 880-dev-MC_4km_jra_ryf+regionalpanan_temp (checksums created using commit 64a09e0), against
  • dev-MC_4km_jra_ryf+regionalpanan (checksums in commit d60cb09)

🔧 The new checksums will be committed to this PR, if they differ from what is on this branch.

Further information

The experiment can be found on Gadi at /scratch/tm70/repro-ci/experiments/access-om3-configs/64a09e012c71c38c75afecc28c105b60e86f0dc9, and the test results at https://github.com/ACCESS-NRI/access-om3-configs/runs/61955140745.

The checksums generated by this !test command are found in the testing/checksum directory of https://github.com/ACCESS-NRI/access-om3-configs/actions/runs/21503437635/artifacts/5313522455.

The checksums compared against are found here https://github.com/ACCESS-NRI/access-om3-configs/tree/d60cb09f95af27f0512ef6c6575ca8047be37b48/testing/checksum

Test summary:
test_repro_historical

@dougiesquire dougiesquire marked this pull request as ready for review January 30, 2026 04:25
@chrisb13
Collaborator Author

chrisb13 commented Feb 2, 2026

Thanks for pulling this together @dougiesquire.

As suggested, we've focused on using this comparison. We've briefly looked at this when thinking about the merge (e.g. DT = 600.0 after the merge?).

We haven't found anything that looks amiss technically, so we think the merge can proceed. We have some small comments related to the science. Observations from @helenmacdonald and me:

  1. Claire had VELOCITY_TOLERANCE = 3.0E+08 ! [m s-1] default = 3.0E+08 whereas we have a much smaller value: VELOCITY_TOLERANCE = 1.0E-04. Is this likely to affect stability? The default seems very large to me and ours quite small -- this was presumably discussed elsewhere?

source.

  2. More of a down-the-road science issue. We noticed one of the answer-changing changes was this:
USE_RIVER_HEAT_CONTENT = True   !   [Boolean] default = False
                                ! If true, use the fluxes%runoff_Hflx field to set the heat carried by runoff,
                                ! instead of using SST*CP*liq_runoff.
USE_CALVING_HEAT_CONTENT = True !   [Boolean] default = False
                                ! If true, use the fluxes%calving_Hflx field to set the heat carried by runoff,
                                ! instead of using SST*CP*froz_runoff.

source

@claireyung had false for both so was using the SST*CP*liq_runoff or SST*CP*froz_runoff. How are fluxes%runoff_Hflx and fluxes%calving_Hflx calculated? I can imagine that changing these to a prescribed approach could have large impacts on the southern ocean state. Perhaps this is related to Anton's recent tweaks?

  3. Chris and Helen noted that the docs are still in a separate PR but that can be addressed once Helen is back from leave.

helenmacdonald previously approved these changes Feb 2, 2026
Contributor

@helenmacdonald helenmacdonald left a comment


Thanks @dougiesquire! @chrisb13 and I went through it and are happy for it to be merged. Chris has noted a few items that we need to be mindful of to check that the answer changes are not bad for the science output.

@dougiesquire
Collaborator

  1. Claire had VELOCITY_TOLERANCE = 3.0E+08 ! [m s-1] default = 3.0E+08 whereas we have a much smaller value: VELOCITY_TOLERANCE = 1.0E-04. Is this likely to affect stability? The default seems very large to me and ours quite small -- this was presumably discussed elsewhere?

source.

2. More of a down the road science issue. We noticed one of the answer changing changes was this:
USE_RIVER_HEAT_CONTENT = True   !   [Boolean] default = False
                                ! If true, use the fluxes%runoff_Hflx field to set the heat carried by runoff,
                                ! instead of using SST*CP*liq_runoff.
USE_CALVING_HEAT_CONTENT = True !   [Boolean] default = False
                                ! If true, use the fluxes%calving_Hflx field to set the heat carried by runoff,
                                ! instead of using SST*CP*froz_runoff.

source

@claireyung had false for both so was using the SST*CP*liq_runoff or SST*CP*froz_runoff. How are fluxes%runoff_Hflx and fluxes%calving_Hflx calculated? I can imagine that changing these to a prescribed approach could have large impacts on the southern ocean state. Perhaps this is related to Anton's recent tweaks?

Links to where these changes were originally discussed (including answers to some of your questions):

  1. What should the 100km mom config inherit from 25km / OM4-05? #771 (comment)

  2. Enthalpy in MOM6 #631
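
For a rough sense of scale of the SST*CP*runoff fallback that Claire's config was using (a back-of-the-envelope sketch only; the cp, SST and runoff numbers below are illustrative assumptions, not values taken from the config):

```python
# Back-of-the-envelope scale of the SST*CP*runoff fallback heat flux.
# All numbers below are illustrative assumptions, not values from the config.
cp = 3992.0          # J kg-1 K-1, roughly MOM6's default seawater heat capacity
sst = 1.5            # degC, a plausible high-latitude surface temperature
liq_runoff = 1.0e-5  # kg m-2 s-1, an illustrative local runoff rate

heat_flux = sst * cp * liq_runoff  # W m-2 carried into the ocean with the runoff
print(f"implied runoff heat flux: {heat_flux:.3f} W m-2")  # ~0.06 W m-2
```

Switching the two flags to True replaces this SST-based estimate with whatever fluxes%runoff_Hflx and fluxes%calving_Hflx carry, so the impact on the Southern Ocean state should scale with how far those fields sit from the local SST*CP*runoff values.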

@dougiesquire
Collaborator

3. Chris and Helen noted that the docs are still in a separate PR but that can be addressed once Helen is back from leave.

As I just mentioned in Zulip, this PR includes panatarctic_instructions.md because it was there in Claire's original branch. @chrisb13 do you want this removed?

@chrisb13
Collaborator Author

chrisb13 commented Feb 2, 2026

  3. Chris and Helen noted that the docs are still in a separate PR but that can be addressed once Helen is back from leave.

As I just mentioned in Zulip, this PR includes panatarctic_instructions.md because it was there in Claire's original branch. @chrisb13 do you want this removed?

Thanks for checking.

I think it is safe to be removed. I believe the PR version contains various updates from @helenmacdonald.

If the above turns out to be incorrect, we can just add it back in again.

Also snuck in a minor syntactical change that will get squashed away anyway
@dougiesquire dougiesquire merged commit 9aff818 into dev-MC_4km_jra_ryf+regionalpanan Feb 2, 2026
12 checks passed
@dougiesquire dougiesquire deleted the 880-dev-MC_4km_jra_ryf+regionalpanan_temp branch February 2, 2026 05:24
dougiesquire added a commit that referenced this pull request Feb 2, 2026
This commit squashes:
- 19 commits made during the original development of this configuration*
- 1 commit adding repro checksums
- 1 commit removing the file `panatarctic_instructions.md`. This has been moved to #573

*See #713 for the original 19 commits. The first lines of the original 19 commit messages are as follows:

- Add regional panantarctic configuration (1/12th degree/4km setup)

- Update regional panantarctic configuration and ensure it runs

- Get rid of some old information in the README

- Add MOM_override params for unwanted upstream changes I overlooked

- 2025-08-25 22:10:29: Run 0

- payu archive: documentation of MOM6 run-time configuration

- Change to new executable with updated MOM6 code

- Try jra v1-6

- Update ocn_dt_cpl to reflect real coupling timestep

- Revert MOM_input to OM3-25km config, i.e. USE_PSURF_IN_EOS = False is now default True, MAX_P_SURF = 0 is now default -1, USE_RIGID_SEA_ICE = True is now default False, SEA_ICE_RIGID_MASS = 100.0 is now default 1000. This was done by removing the lines in MOM_input that had set these parameters to non-default values.

- Move where diag_table sits

- 2025-08-29 22:03:14: Run 0

- payu archive: documentation of MOM6 run-time configuration

- Add author details for contributors who weren't already on the CITATIONS.cff

- Updating to latest released executable

- Merge MOM_input and MOM_override to be one file, using the MOM_parameter.short file in https://github.com/claireyung/access-om3-configs/tree/a1642770156249411fb2ad47d15d559e97a22c9f

- Set THICKNESSDIFFUSE to be False

- 2025-10-17 14:44:05: Run 0

- Update docs based on files produced in 44cea06

--------

Co-authored-by: Edward Yang <yang.e@wehi.edu.au>
Co-authored-by: Helen Macdonald <179985228+helenmacdonald@users.noreply.github.com>
Co-authored-by: Dougie Squire <42455466+dougiesquire@users.noreply.github.com>
Co-authored-by: access-bot <113399144+access-bot@users.noreply.github.com>