
Debugging non-reproducibility of ACCESS-NRI 0.3.1 -> 0.4.0#173

Closed
dougiesquire wants to merge 9 commits into dev-025deg_jra55do_ryf from 266-025deg_jra55do_ryf


@dougiesquire
Collaborator

DO NOT MERGE

Companion PR to ACCESS-NRI/ACCESS-OM3#47 to debug and document where we lost reproducibility when we updated from 0.3.1 to 0.4.0.

@dougiesquire
Collaborator Author

!test repro

@github-actions

✅ The Bitwise Reproducibility Check Succeeded ✅

When comparing:

  • 266-025deg_jra55do_ryf (checksums created using commit dcdacd5), against
  • dev-025deg_jra55do_ryf (checksums in commit 90a3e99)
Further information

The experiment can be found on Gadi at /scratch/tm70/repro-ci/experiments/access-om3-configs/dcdacd53427156984c278888aafaea89928c9cea, and the test results at https://github.com/ACCESS-NRI/access-om3-configs/runs/37157098873.

The checksums generated by this !test command are found in the testing/checksum directory of https://github.com/ACCESS-NRI/access-om3-configs/actions/runs/13305855537/artifacts/2584899472.

The checksums compared against are found here https://github.com/ACCESS-NRI/access-om3-configs/tree/90a3e99186d6c8548b4892bbde46b08067299949/testing/checksum
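For anyone wanting to see which entries differ rather than just pass/fail, here is a minimal sketch of a field-by-field comparison. It assumes the two checksum files have already been loaded into nested dicts/lists (for example with `json.load`); the layout and field names in the sample data are hypothetical and are not taken from the actual files in testing/checksum.

```python
from typing import Any

_MISSING = object()  # sentinel so a missing key is distinguishable from a None value


def flatten(obj: Any, prefix: str = "") -> dict:
    """Flatten nested dicts/lists into {"a/b/0": value} pairs."""
    if isinstance(obj, dict):
        items = obj.items()
    elif isinstance(obj, list):
        items = enumerate(obj)
    else:
        return {prefix: obj}
    flat = {}
    for key, value in items:
        path = f"{prefix}/{key}" if prefix else str(key)
        flat.update(flatten(value, path))
    return flat


def diff_checksums(new: Any, ref: Any) -> list:
    """Return the sorted paths whose values differ or exist on only one side."""
    new_flat, ref_flat = flatten(new), flatten(ref)
    all_paths = new_flat.keys() | ref_flat.keys()
    return sorted(
        p for p in all_paths
        if new_flat.get(p, _MISSING) != ref_flat.get(p, _MISSING)
    )


# Hypothetical sample data, for illustration only.
ref = {"output": {"field_a": ["abc"], "field_b": ["def"]}}
new = {"output": {"field_a": ["abc"], "field_b": ["xyz"]}}

print(diff_checksums(new, ref))
```

An empty list corresponds to a bitwise-reproducible result; any listed path points at an entry worth inspecting.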

@dougiesquire
Collaborator Author

!test repro

@github-actions

✅ The Bitwise Reproducibility Check Succeeded ✅

When comparing:

  • 266-025deg_jra55do_ryf (checksums created using commit 9f772ef), against
  • dev-025deg_jra55do_ryf (checksums in commit 90a3e99)
Further information

The experiment can be found on Gadi at /scratch/tm70/repro-ci/experiments/access-om3-configs/9f772ef6e376c76ea4214da5a2fe836bf0c9827a, and the test results at https://github.com/ACCESS-NRI/access-om3-configs/runs/37187627970.

The checksums generated by this !test command are found in the testing/checksum directory of https://github.com/ACCESS-NRI/access-om3-configs/actions/runs/13315120832/artifacts/2588122329.

The checksums compared against are found here https://github.com/ACCESS-NRI/access-om3-configs/tree/90a3e99186d6c8548b4892bbde46b08067299949/testing/checksum

@dougiesquire
Collaborator Author

!test repro

@github-actions

✅ The Bitwise Reproducibility Check Succeeded ✅

When comparing:

  • 266-025deg_jra55do_ryf (checksums created using commit 586757c), against
  • dev-025deg_jra55do_ryf (checksums in commit 90a3e99)
Further information

The experiment can be found on Gadi at /scratch/tm70/repro-ci/experiments/access-om3-configs/586757cc0e443d0388f5a20516a285f10d22e992, and the test results at https://github.com/ACCESS-NRI/access-om3-configs/runs/37198094002.

The checksums generated by this !test command are found in the testing/checksum directory of https://github.com/ACCESS-NRI/access-om3-configs/actions/runs/13318356254/artifacts/2589278831.

The checksums compared against are found here https://github.com/ACCESS-NRI/access-om3-configs/tree/90a3e99186d6c8548b4892bbde46b08067299949/testing/checksum

@dougiesquire
Collaborator Author

!test repro

@github-actions

❌ The Bitwise Reproducibility Check Failed ❌

When comparing:

  • 266-025deg_jra55do_ryf (checksums created using commit 618844f), against
  • dev-025deg_jra55do_ryf (checksums in commit 90a3e99)
Further information

The experiment can be found on Gadi at /scratch/tm70/repro-ci/experiments/access-om3-configs/618844fb50d6953f16eb09f2c0859c9787fe22bc, and the test results at https://github.com/ACCESS-NRI/access-om3-configs/runs/37204948671.

The checksums generated by this !test command are found in the testing/checksum directory of https://github.com/ACCESS-NRI/access-om3-configs/actions/runs/13320693848/artifacts/2589972299.

The checksums compared against are found here https://github.com/ACCESS-NRI/access-om3-configs/tree/90a3e99186d6c8548b4892bbde46b08067299949/testing/checksum

@dougiesquire
Collaborator Author

!test repro

@github-actions

❌ The Bitwise Reproducibility Check Failed ❌

When comparing:

  • 266-025deg_jra55do_ryf (checksums created using commit 4afb68c), against
  • dev-025deg_jra55do_ryf (checksums in commit 90a3e99)
Further information

The experiment can be found on Gadi at /scratch/tm70/repro-ci/experiments/access-om3-configs/4afb68ca888c020e4bf490b92b1828a7a5c5ee84, and the test results at https://github.com/ACCESS-NRI/access-om3-configs/runs/37205673809.

The checksums generated by this !test command are found in the testing/checksum directory of https://github.com/ACCESS-NRI/access-om3-configs/actions/runs/13320987417/artifacts/2590046482.

The checksums compared against are found here https://github.com/ACCESS-NRI/access-om3-configs/tree/90a3e99186d6c8548b4892bbde46b08067299949/testing/checksum

…versions used in 0.4.0

This also includes updating ESMF from 8.5.0 to 8.7.0. Unfortunately, the new versions of CMEPS/CDEPS require updating ESMF, and the old versions don't work with the updated ESMF. This makes it very difficult to test just updating ESMF.
@dougiesquire
Collaborator Author

!test repro

@github-actions

❌ The Bitwise Reproducibility Check Failed ❌

When comparing:

  • 266-025deg_jra55do_ryf (checksums created using commit e075dad), against
  • dev-025deg_jra55do_ryf (checksums in commit 90a3e99)
Further information

The experiment can be found on Gadi at /scratch/tm70/repro-ci/experiments/access-om3-configs/e075dad51205bef7b5a7886f0cef5ea206d4bc1e, and the test results at https://github.com/ACCESS-NRI/access-om3-configs/runs/37219985531.

The checksums generated by this !test command are found in the testing/checksum directory of https://github.com/ACCESS-NRI/access-om3-configs/actions/runs/13326066832/artifacts/2591456025.

The checksums compared against are found here https://github.com/ACCESS-NRI/access-om3-configs/tree/90a3e99186d6c8548b4892bbde46b08067299949/testing/checksum

@dougiesquire
Collaborator Author

Summarising what this PR shows:

MOM

  • We can preserve answers across the MOM update with the right MOM parameter settings (see 9f772ef and repro test)

CESM-share

CICE

  • We do not preserve answers across the CICE6 update, even without using the new MOM supergrid functionality (see 618844f and repro test)
  • Using the new MOM supergrid functionality causes ACCESS-OM3 to crash within three hours due to velocity truncations in MOM (see 4afb68c and repro test)

CMEPS/CDEPS

  • Updating CMEPS and CDEPS requires also updating ESMF. We do not preserve answers across this update (see e075dad and repro test)

I'll open a separate issue/PR with suggestions for how to set the MOM6 parameters for the update to 0.4.0.

I'm a little worried that using the new MOM supergrid functionality in CICE causes MOM velocity truncations. Should we dig into this a little before we commit to using it in ACCESS-OM3? @anton-seaice, @chrisb13?

@aekiss
Contributor

aekiss commented Feb 16, 2025

Thanks for exploring this and laying it out so clearly.

Can we conclude that there's a problem with the supergrid implementation in CICE (grid_format = "mom_nc") that somehow produces fluxes that crash MOM?

Is it expected that the CICE and ESMF updates break reproducibility?

@dougiesquire
Collaborator Author

dougiesquire commented Feb 17, 2025

> Can we conclude that there's a problem with the supergrid implementation in CICE (grid_format = "mom_nc") that somehow produces fluxes that crash MOM?

Possibly... I think it's unclear at this stage, but we are going to revert to the old grid while we investigate.

> Is it expected that the CICE and ESMF updates break reproducibility?

I'll defer to @anton-seaice re CICE.

Regarding the ESMF update from 8.5.0 to 8.7.0, I'd say no. Looking at the changelog, there's only one reported bfb change between these two releases (in 8.6.0) and that should only be observed when not using strict floating point compiler options. We use -fp-model precise so I think we should see bfb reproducibility. I guess that suggests that it's the changes to CMEPS/CDEPS that change answers. Unfortunately it's hard to test this explicitly since the 0.3.1 versions of CMEPS/CDEPS don't compile with ESMF 8.7.0, and the 0.4.0 versions don't compile with 8.5.0... I'll try building 0.3.1 with ESMF 8.6.0 and see if that learns us anything.
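As background on why strict floating point options matter for bfb results: floating point addition is not associative, so anything that reorders operations can change the last bits of an answer. The snippet below is only a generic illustration of that mechanism in Python, not a reproduction of the specific change noted in the ESMF changelog.

```python
# Same three numbers, summed in two different orders.
a, b, c = 0.1, 0.2, 0.3

left = (a + b) + c   # evaluate left to right
right = a + (b + c)  # same maths, different grouping

# The results agree to about 16 significant digits but are not bitwise equal.
print(left == right)
print(left.hex())
print(right.hex())
```

A checksum over model output is sensitive to exactly this kind of last-bit difference, which is why a regrouped sum is enough to fail a bitwise repro test.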

@anton-seaice
Collaborator

> I'll defer to @anton-seaice re CICE.

It's not immediately clear the answers should have changed.

ACCESS-NRI/CICE@12dd204...e68e05b

There are some updates in there which are not bit-for-bit, but none look like they should impact our configurations.

@anton-seaice
Collaborator

> I'm a little worried that using the new MOM supergrid functionality in CICE causes MOM velocity truncations. Should we dig into this a little before we commit to using it in ACCESS-OM3? @anton-seaice, @chrisb13?

Yes, let's drop the commit for now and make a new issue to investigate it.

@dougiesquire
Collaborator Author

!test repro

@github-actions

✅ The Bitwise Reproducibility Check Succeeded ✅

When comparing:

  • 266-025deg_jra55do_ryf (checksums created using commit 47ac794), against
  • dev-025deg_jra55do_ryf (checksums in commit 90a3e99)
Further information

The experiment can be found on Gadi at /scratch/tm70/repro-ci/experiments/access-om3-configs/47ac794bd038d514f4f10ed4f1ce551dde230fee, and the test results at https://github.com/ACCESS-NRI/access-om3-configs/runs/37311687707.

The checksums generated by this !test command are found in the testing/checksum directory of https://github.com/ACCESS-NRI/access-om3-configs/actions/runs/13361350132/artifacts/2600532870.

The checksums compared against are found here https://github.com/ACCESS-NRI/access-om3-configs/tree/90a3e99186d6c8548b4892bbde46b08067299949/testing/checksum

@dougiesquire
Collaborator Author

dougiesquire commented Feb 17, 2025

> I'll try building 0.3.1 with ESMF 8.6.0 and see if that learns us anything.

This passed repro tests (see 47ac794 and repro test) which suggests that the answer changes that arise from the CMEPS/CDEPS/ESMF updates in ACCESS-OM3 0.4.0 come from CMEPS/CDEPS rather than ESMF (since the ESMF changelog reports full bfb reproducibility between ESMF 8.6.0 and 8.7.0).

CMEPS changes: ESCOMP/CMEPS@ffb5737...959e9a0
CDEPS changes: ESCOMP/CDEPS@3c70fc8...8197f05
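To keep the outcomes in one place, here is a small sketch that tabulates the repro results described in the comments of this PR and splits them by outcome. The commit hashes and pass/fail values are the ones reported above; the short descriptions are paraphrases, and the helper name `split_by_outcome` is just for illustration.

```python
# (commit, what was changed, bitwise identical to dev-025deg_jra55do_ryf?)
RESULTS = [
    ("9f772ef", "MOM update with adjusted MOM parameter settings", True),
    ("618844f", "CICE6 update, without the new MOM supergrid functionality", False),
    ("4afb68c", "new MOM supergrid functionality in CICE (model crashes)", False),
    ("e075dad", "CMEPS/CDEPS update, which also requires ESMF 8.5.0 -> 8.7.0", False),
    ("47ac794", "0.3.1 built with ESMF 8.6.0", True),
]


def split_by_outcome(results):
    """Return (commits that preserved answers, commits that changed answers)."""
    preserved = [commit for commit, _, ok in results if ok]
    changed = [commit for commit, _, ok in results if not ok]
    return preserved, changed


preserved, changed = split_by_outcome(RESULTS)

for commit, change, ok in RESULTS:
    print(f"{commit}  {'PASS' if ok else 'FAIL'}  {change}")
```

Read together, 47ac794 passing and e075dad failing is what points at CMEPS/CDEPS rather than ESMF, given the changelog reports no bfb changes between 8.6.0 and 8.7.0.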

@dougiesquire
Collaborator Author

!test repro

@github-actions

✅ The Bitwise Reproducibility Check Succeeded ✅

When comparing:

  • 266-025deg_jra55do_ryf (checksums created using commit 2f52959), against
  • dev-025deg_jra55do_ryf (checksums in commit 90a3e99)
Further information

The experiment can be found on Gadi at /scratch/tm70/repro-ci/experiments/access-om3-configs/2f529596b6c6b3691f72dcc7198d77a8970c7683, and the test results at https://github.com/ACCESS-NRI/access-om3-configs/runs/37680474084.

The checksums generated by this !test command are found in the testing/checksum directory of https://github.com/ACCESS-NRI/access-om3-configs/actions/runs/13487396308/artifacts/2638085455.

The checksums compared against are found here https://github.com/ACCESS-NRI/access-om3-configs/tree/90a3e99186d6c8548b4892bbde46b08067299949/testing/checksum

@dougiesquire
Collaborator Author

!test repro

@github-actions

❌ The Bitwise Reproducibility Check Failed ❌

When comparing:

  • 266-025deg_jra55do_ryf (checksums created using commit 4c6aef7), against
  • dev-025deg_jra55do_ryf (checksums in commit 90a3e99)
Further information

The experiment can be found on Gadi at /scratch/tm70/repro-ci/experiments/access-om3-configs/4c6aef7ced9ca29edc35cf398cc152e5363f97d9, and the test results at https://github.com/ACCESS-NRI/access-om3-configs/runs/37681364167.

The checksums generated by this !test command are found in the testing/checksum directory of https://github.com/ACCESS-NRI/access-om3-configs/actions/runs/13487779720/artifacts/2638168448.

The checksums compared against are found here https://github.com/ACCESS-NRI/access-om3-configs/tree/90a3e99186d6c8548b4892bbde46b08067299949/testing/checksum

@dougiesquire
Collaborator Author

Closing as I think we've learnt what we wanted to about the loss of historical repro when we updated from 0.3.1 to 0.4.0.

