Additional PC2 optimisations for NG-ARCH by tinyendian · Pull Request #53 · MetOffice/lfric_apps

tinyendian · 2025-12-16T04:00:08Z

PR Summary

Sci/Tech Reviewer: @MetBenjaminWent
Code Reviewer: @EdHone

Code Quality Checklist

(Some checks are automatically carried out via the CI pipeline)

I have performed a self-review of my own code
My code follows the project's style guidelines
style guidelines
Comments have been included that aid understanding and enhance the
readability of the code
My changes generate no new warnings

Testing

I have tested this change locally, using the LFRic Apps rose-stem suite
If any tests fail (rose-stem or CI) the reason is understood and
acceptable (eg. kgo changes)
I have added tests to cover new functionality as appropriate (eg. system
tests, unit tests, etc.)
Any new tests have been assigned an appropriate amount of compute resource
and have tests been allocated to an appropriate testing group (i.e. the
developer tests are for jobs which use a small amount of compute resource
and complete in a matter of minutes)

I can only run a subset of Rose stem tests on Monsoon due to various difficulties, but the ex1a_omp_developer succeed, and the run_lfric_atm_scm_coma9_toga-BiP2x2-50000x50000_ex1a_gnu_fast-debug-64bit and run_lfric_atm_scm_comorph_dev_toga-BiP2x2-50000x50000_ex1a_gnu_fast-debug-64bit tests no longer fail their KGO tests (these failures were caused by PSyclone dropping the !DIR$ IVDEP compiler directives)
Successful builds with the meto-ex1a and esnz-cascade optimisation platforms on the ESNZ Cascade HPC

trac.log

Unable to run the full test suite on Monsoon

Security Considerations

I have reviewed my changes for potential security issues
Sensitive data is properly handled (if applicable)
Authentication and authorisation are properly implemented (if applicable)

Performance Impact

Performance of the code has been considered and, if applicable, suitable
performance measurements have been conducted

AI Assistance and Attribution

Some of the content of this change has been produced with the assistance
of Generative AI tool name (e.g., Met Office Github Copilot Enterprise,
Github Copilot Personal, ChatGPT GPT-4, etc) and I have followed the
Simulation Systems AI policy
(including attribution labels)

Documentation

Where appropriate I have updated documentation related to this change and
confirmed that it builds correctly

PSyclone Approval

If you have edited any psyclone related code (eg. PsyKAl-lite, Kernal
inteface, optimisation scripts, LFRic data structure code) then please
contact the
tooscollabdevteam@metoffice.gov.uk

Note: The email address does not work, unfortunately

Sci/Tech Review

I understand this area of code and the changes being added
The proposed changes correspond to the pull request description
Documentation is sufficient (do documentation papers need updating)
Sufficient testing has been completed

Please alert the code reviewer via a tag when you have approved the SR

Code Review

All dependencies have been resolved
Related Issues have been properly linked and addressed
CLA compliance has been confirmed
Code quality standards have been met
Tests are adequate and have passed
Documentation is complete and accurate
Security considerations have been addressed
Performance impact is acceptable

tinyendian · 2025-12-16T07:03:34Z

Closes #47

MetBenjaminWent · 2025-12-16T14:02:43Z

Are you okay Assigning me this when it's ready for SR? Thanks!

tinyendian · 2025-12-16T21:45:31Z

Thanks @MetBenjaminWent - the PR should be ready, but my fork became detached from the lfric_apps repo when its visibility was changed to public yesterday. I am currently trying to get this resolved via GitHub support, but there seem to be issues with support entitlement for my GitHub user, in which case I'll probably need to refork and recreate the branch and pull request. I'll wait for a day and see what happens.

tinyendian · 2025-12-18T01:57:17Z

SR notes for @MetBenjaminWent:

I'm not sure what is required for the "Security Considerations" section in the PR Summary, so I left it empty for now.
I could not contact toolscollabdevteam@metoffice.gov.uk for PSyclone script approval, the email address doesn't seem to work (and the PR Summary template has a typo)
The PSyclone transmute script for pc_bm_initiate uses a custom CompilerDirective class to insert !DIR$ NOFISSION and !DIR$ IVDEP directives into the subroutine; the latter fixes the KGO comparison failures with CCE that we previously observed. This class is a workaround for a couple of issues - PSyclone v3.2 introduced --keep-comments and --keep-directives flags, but its FortranWriter introduces spaces in the output (e.g., ! DIR$ NOFISSION), which breaks directives. Using these flags would also require per-file PSyclone flag support in the LFRic build system to avoid using them globally, which might break other scripts, and the PSyclone script would have to modify or remove the existing OpenMP statements in the Fortran code. The fparser package has a Directive class, but it is currently broken due to a missing method. The best way forward seemed to be using a custom CompilerDirective class. Note that the PSyclone and fparser issues are already known and should be resolved in the next releases.

tinyendian · 2025-12-18T01:59:11Z

@MetBenjaminWent I don't seem to have permission to reassign the PR, you or @jennyhickson would need to do this, thanks.

MetBenjaminWent · 2025-12-19T14:23:07Z

Initial runs with the branch on EXs are sadly showing KGO failures with GNU and CCE.
We need to identify the root cause of these before it can be approved for trunk.

I'm trying the files one by one to see if one of them in particular are causing the failures.

I'll attach the trac output below once the runs are finished.

MetBenjaminWent · 2025-12-19T16:06:55Z

It seems something about pc2_bm_initiate is causing the KGO failure in this ticket sadly.

tinyendian · 2025-12-22T06:07:01Z

Thanks for running the tests, @MetBenjaminWent. Did you attach the output from the failed tests somewhere? I'll look into it after the Christmas break, hopefully I'll be able to get them to work on Monsoon to reproduce the problem (some test executables don't run there).

MetBenjaminWent · 2026-01-06T10:21:19Z

KGO difference:

< Inner product checksum rho =         48D7FC40
< Inner product checksum theta =         5392A678
< Inner product checksum u =         6A97B83A
< Inner product checksum mr1 =         41CCED01
< Inner product checksum mr2 =         39C8DCE2
< Inner product checksum mr3 =         37ADC3D4
< Inner product checksum mr4 =         39774AE9
---
> Inner product checksum rho =         48D7FBE8
> Inner product checksum theta =         5392A5AE
> Inner product checksum u =         6A97C717
> Inner product checksum mr1 =         41CCEEC7
> Inner product checksum mr2 =         39D1449A
> Inner product checksum mr3 =         37AE7933
> Inner product checksum mr4 =         3973CE88

Only additional file added in this run (The rest were not passed to PSyclone):
pc2_bm_initiate

I'm having a look at where the directives where added relative to the original, that's the only thing I can think of, some kind of ordering issue still.

MetBenjaminWent · 2026-01-06T10:42:53Z

In regards to possible differences, potentially an ordering of this NOFUSION and the OMP?

In the original, where the OMP was around the J loop at L536, where we have moved it down to the I loop in the original file at L657, and a pre-processed and PSycloned file at L275, the replacement !DIR$ NOFISSION is now before the OMP section, as opposed to inside it. I'm not sure if that has had a effect, I can have a look at the listing files.

Same again with original file L705 and generated L317.

Otherwise I cannot spot anything.

I'll see if the KGOs hold. If so we could reach out to the Code Owner to see if they will accept the change.

MetBenjaminWent

We are seeing KGO issues in pc2_bm_initiate, until we can pinpoint the reason, or gain a approval from the CO, this PSyclone script cannot be accepted.

MetBenjaminWent · 2026-01-06T17:00:21Z

Updated KGO does not hold with CCE, indicating a possible race condition of some kind

tinyendian · 2026-01-06T20:50:29Z

Thanks for running these tests @MetBenjaminWent. As I mentioned previously, I would need to know which ones are failing to be able to resolve this. Monsoon doesn't seem to be supported as a regular test platform in the LFRic Apps Rose stem setup, I ported a bunch of workflow patches every time I created a new branch, and some of tests won't run at all. Hopefully I'll be able to get the failing ones to work once I know which ones they are.

MetBenjaminWent · 2026-01-07T09:58:58Z

Apologies, it was primarily these tests:
https://github.com/MetOffice/lfric_apps/blob/main/rose-stem/site/meto/groups/groups_lfric_atm.cylc

    "ex1a_omp_C48_cce": [
        "lfric_atm_nwp_gal9_noukca_1T-C48_MG_ex1a_cce_fast-debug-32bit",
        "lfric_atm_nwp_gal9_noukca_2T-C48_MG_ex1a_cce_fast-debug-32bit",
        "lfric_atm_nwp_gal9_noukca_4T-C48_MG_ex1a_cce_fast-debug-32bit",
.......
    "ex1a_omp_gnu": [
.......
        "lfric_atm_nwp_gal9_noukca_1T-C48_MG_ex1a_gnu_fast-debug-32bit",
        "lfric_atm_nwp_gal9_noukca_2T-C48_MG_ex1a_gnu_fast-debug-32bit",
        "lfric_atm_nwp_gal9_noukca_4T-C48_MG_ex1a_gnu_fast-debug-32bit",
.......

But any of the KGO threaded teats from the same group are also failing, but I would use copies of these as the yard stick.

So thats thenwp_gal9_noukcaconfiguration.

Weak scaled MPI ranks against OMP threads as noted in here:
https://github.com/MetOffice/lfric_apps/blob/main/rose-stem/site/common/lfric_atm/tasks_lfric_atm.cylc

tinyendian · 2026-01-08T04:17:49Z

Thanks @MetBenjaminWent, that is curious - I use the ex1a_omp_developer test group that includes ex1a_omp_C48_cce and ex1a_omp_gnu as my main developer tests. Most of these run on Monsoon and I don't see any KGO failures at all. I do get runtime crashes with the lfric_atm_clim_gal9_chem_xT-C12 tests both for CCE and GNU (not sure what causes these), and KGO seems to be turned off for most of the GNU tests in rose-stem/site/common/lfric_atm/tasks_lfric_atm.cylc (only few GNU tests seem to have KGO checksums in rose-stem/site/meto/kgos/lfric_atm/ex1a).

Seen that I get no KGO comparison errors for tests like lfric_atm_nwp_gal9_noukca_xT-C48_MG_ex1a_cce_fast-debug-32bit on Monsoon, maybe the failures are caused by differences in the build setup? I had to apply several workarounds to rose-stem/site/meto/common/suite_config_ex1a.cylc to be able to run the tests:

{% set ex1a_base = 'module use /projects/metoff/spackadmin.mon/releases/2025.10.1/ngms' %}
[...]
 {% set ex1a_compiler_cce =  'module switch PrgEnv-cray PrgEnv-cray/8.4.0 ; ' ~
                             'module load cpe/23.05 ; '~
                             'module switch cce cce/15.0.0 ; '~
                             'module load lfric-cray/15.0.0/3.0+ || true ; module load py-pyyaml/6.0.2 py-psyclone/3.1.0 || true ; '~
                             'export FFLAGS="-I/home/users/wolfgang.hayek.ext/xios/xios_r2701_prod_O3_spack/inc $FFLAGS"; '~
                             'export LDFLAGS="-L/home/users/wolfgang.hayek.ext/xios/xios_r2701_prod_O3_spack/lib $LDFLAGS"; '~
                             'export PYTHONPATH=/projects/metoff/spackadmin.mon/releases/2025.10.1/install/linux-sles15-zen3/gcc-12.2.0/py-graphviz-0.13.2-trph5f36s33gx26rnf3kbvdp7dh32guw/lib/python3.10/site-packages:$PYTHONPATH ; ' %}

I had to downgrade PSyclone v3.2.0 to v3.1.0 due to build errors (PSyclone crashes), and I needed to rebuild XIOS since the Spack-built version crashes at runtime. There were also some Python packages missing (pyyaml and graphviz).

Which PSyclone version do you use, v3.2.0? If that's the case, I'll investigate why builds with v3.2.0 don't work for me.

MetBenjaminWent · 2026-01-08T09:36:48Z

On the EXs, it looks like the PSyclone version is psyclone/3.2.2

I think there's been a few important changes to PSyclone. However I'm currently not sure that this is the root cause of the issues.

I think for the rest of the PR, it might be worth breaking work related to pc_bm_initiate out into it's own PR?

tinyendian · 2026-01-09T03:55:09Z

Thanks, that resolved the mystery - after switching from PSyclone v3.1.0 to PSyclone v3.2.2, I can now reproduce the KGO comparison failures on Monsoon, and some of the tests even fail to run at all, reporting algorithm failures.

The error mechanism turned out to be the following:

PSyclone v3.2.2 applies additional checks for race conditions compared to PSyclone v3.1.0, finds more false dependencies, and thus refuses to add OpenMP worksharing loop directives (omp do) to some of the loops in pc2_bm_initiate.f90.
Unfortunately, these failures do not trigger an exception in the transmute script as in previous PSyclone releases, the script simply silently stops applying OpenMP transformations half-way through and without any error messages, and the build system carries on. I can only get warnings about these (false) race conditions when I set the verbose=True parameter in the apply method of the OpenMP loop transformation, and they are added to the Fortran output as comments rather than to stderr and do not trigger an exception.
The result is that only omp parallel directives are added to the pc2_bm_initiate.f90 Fortran code without any omp do directives inside the region. OpenMP threads will thus execute the code block redundantly, causing race conditions that lead to the observed runtime and KGO comparison failures.

Maybe it was a deliberate choice that PSyclone no longer triggers exceptions for race condition warnings? As you suggested, pc2_bm_initiate was affected, but there may be other broken transmute scripts. I'll investigate, and I'll need to work out how to best handle the false dependencies and how to get PSyclone to trigger exceptions if something goes wrong.

There is no rush with this pull request; I'll leave it open for now and will implement the fixes.

MetBenjaminWent · 2026-01-09T09:26:51Z

Good spot, yeah the silent failing certainly seems incorrect, especially in this instance, likely a output warning followed by it continuing would be better?

No worries, I'll drop this off my radar for now. If I miss it when it goes back into review, feel free to drop me an email too to prompt me again, thanks!

tinyendian · 2026-02-05T02:56:40Z

@MetBenjaminWent I fixed the issue that broke KGO comparison by adding additional variables to the list of false dependency variables and adding function match_call_args to transmute_functions.py to scan loop bodies for their occurrence. The branch is ready for review, but I can't reassign it to you myself. There are once again conflicts with CONTRIBUTORS.md and transmute_functions.py; I'll resolve these after the reviews if this doesn't need to be done beforehand to avoid multiple iterations (the main branch is moving fast).

Note that I found more silent failures related to PSyclone v3.2.0 in multiple transmute scripts that are currently on the main branch and created an issue for that which is referenced above this comment.

jennyhickson · 2026-02-18T13:50:08Z

I haven't removed this from the Spring 2026 Milestone yet, but wanted to check on the status of this as it has been in review for some time, but hasn't made it to code review yet. We're approaching the final stages for the upcoming release so if you can either change the milestone or let me know if there are urgent reasons for this to make it. Thanks

tinyendian · 2026-02-18T20:36:48Z

Hi @jennyhickson; I fixed the problems caused by the recent PSyclone upgrade to v3.2, and I pinged @MetBenjaminWent for re-review via email as I don't have the required permissions to reassign PRs. The ticket is not urgent, but I would be grateful if it could be reviewed soon (before the next release would be ideal) - the ESNZ part of the NG-ARCH project finished at the end of last year. Thanks!

MetBenjaminWent · 2026-02-20T16:08:03Z

https://cylchub/services/cylc-review/taskjobs?user=benjamin.went&suite=cr_47_additional_pc2_optimisations%2Frun1&no_fuzzy_time=0&cycles=&tasks=&task_status=expired&task_status=failed&task_status=preparing&task_status=running&task_status=submitted&task_status=submit-failed&task_status=succeeded&task_status=waiting&job_status=&order=time_desc&per_page=300

I'll link the trac log instead when its ready, but for internal reviewers, internal testing of the OMP are all passing and holding KGOs.

MetBenjaminWent

Changes are allowing KGOs to now hold, scripts make sense relative to their intent.

Approved (Assuming branch will be updated for trunk for CR).

tinyendian · 2026-02-23T21:54:08Z

Thanks for the review @MetBenjaminWent - @EdHone both CONTRIBUTORS.md and transmute_functions.py are moving fast; I can easily update this branch to resolve the conflicts, but I'm not sure what the best time for that is as these files will likely see further conflicts. I have a second pull request #267 that will add my name to CONTRIBUTORS.md if it gets onto trunk first.

EdHone

Very happy with the change, just some copywrite statements to fix and it would be nice to see some issues created to remove some of the workaround code when relevant dependencies have been updated.

We also need to resolve some conflicts so updating the branch to a mergeable state is also necessary

In terms of testing the gravity_wave unit tests failed to build with a psyclone error - I'm not sure how it could be related to your change so I will suggest we keep going and see if it is still an error when the branch is updated to the head of main

...ications/lfric_atm/optimisation/esnz-cascade/transmute/large_scale_cloud/pc2_bl_forced_cu.py

applications/lfric_atm/optimisation/esnz-cascade/transmute/large_scale_cloud/pc2_bm_initiate.py

...ations/lfric_atm/optimisation/esnz-cascade/transmute/large_scale_cloud/pc2_initiation_ctl.py

applications/lfric_atm/optimisation/meto-ex1a/transmute/large_scale_cloud/pc2_bl_forced_cu.py

applications/lfric_atm/optimisation/meto-ex1a/transmute/large_scale_cloud/pc2_bm_initiate.py

applications/lfric_atm/optimisation/meto-ex1a/transmute/large_scale_cloud/pc2_initiation_ctl.py

applications/ngarch/optimisation/esnz-cascade/transmute/large_scale_cloud/pc2_bm_initiate.py

applications/ngarch/optimisation/esnz-cascade/transmute/large_scale_cloud/pc2_initiation_ctl.py

applications/ngarch/optimisation/esnz-cascade/transmute/large_scale_cloud/pc2_bl_forced_cu.py

EdHone

Thanks Wolfgang - I'm happy with these changes now. If you get the chance to answer my questions about psyclone issues and additional issues for removing the workaround code that would be great

…ptimisations

github-actions bot added the cla-required The CLA has not yet been signed by the author of this PR - added by GA label Dec 16, 2025

tinyendian force-pushed the 47_additional_pc2_optimisations branch from f8b1263 to 849fd80 Compare December 16, 2025 05:23

github-actions bot added cla-signed The CLA has been signed as part of this PR - added by GA and removed cla-required The CLA has not yet been signed by the author of this PR - added by GA labels Dec 16, 2025

tinyendian mentioned this pull request Dec 16, 2025

Additional PC2 optimisations for NG-ARCH #47

Open

jennyhickson requested a review from MetBenjaminWent December 16, 2025 11:21

MetBenjaminWent requested changes Jan 6, 2026

View reviewed changes

MetBenjaminWent assigned tinyendian Jan 6, 2026

yaswant added this to the Spring 2026 milestone Jan 12, 2026

tinyendian added 4 commits February 2, 2026 04:07

Add new PC2 transmute scripts

e37c15c

Replicate original parallel loop schedules

1097daa

Add optimisation scripts to ngarch miniapp and ESNZ site

15b3e06

Add PC2 transmute scripts to ngarch build setup

ed464c2

Add contributor details

a40c354

tinyendian force-pushed the 47_additional_pc2_optimisations branch from efcaa4f to a40c354 Compare February 2, 2026 04:13

Ignore additional false dependencies

993d122

tinyendian mentioned this pull request Feb 4, 2026

Silent lfric_atm transmute script errors with PSyclone v3.2.0 MetOffice/lfric_core#263

Open

jennyhickson requested a review from MetBenjaminWent February 18, 2026 13:47

MetBenjaminWent approved these changes Feb 20, 2026

View reviewed changes

MetBenjaminWent requested a review from EdHone February 20, 2026 16:09

EdHone requested changes Feb 24, 2026

View reviewed changes

Update copyright statements to latest version

02f874a

tinyendian requested review from a team, MatthewHambley, christophermaynard, hiker and iboutle as code owners February 25, 2026 04:47

tinyendian requested review from t00sa and removed request for a team February 25, 2026 04:47

EdHone approved these changes Feb 25, 2026

View reviewed changes

Merge remote-tracking branch 'upstream/main' into 47_additional_pc2_o…

ce54631

…ptimisations

EdHone merged commit 0669da2 into MetOffice:main Feb 25, 2026
3 checks passed

Conversation

tinyendian commented Dec 16, 2025 • edited by EdHone Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

PR Summary

Code Quality Checklist

Testing

trac.log

Security Considerations

Performance Impact

AI Assistance and Attribution

Documentation

PSyclone Approval

Sci/Tech Review

Code Review

Uh oh!

tinyendian commented Dec 16, 2025

Uh oh!

MetBenjaminWent commented Dec 16, 2025

Uh oh!

tinyendian commented Dec 16, 2025

Uh oh!

tinyendian commented Dec 18, 2025

Uh oh!

tinyendian commented Dec 18, 2025

Uh oh!

MetBenjaminWent commented Dec 19, 2025

Uh oh!

MetBenjaminWent commented Dec 19, 2025

Uh oh!

tinyendian commented Dec 22, 2025

Uh oh!

MetBenjaminWent commented Jan 6, 2026

Uh oh!

MetBenjaminWent commented Jan 6, 2026

Uh oh!

MetBenjaminWent left a comment

Choose a reason for hiding this comment

Uh oh!

MetBenjaminWent commented Jan 6, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

tinyendian commented Jan 6, 2026

Uh oh!

MetBenjaminWent commented Jan 7, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

tinyendian commented Jan 8, 2026

Uh oh!

MetBenjaminWent commented Jan 8, 2026

Uh oh!

tinyendian commented Jan 9, 2026

Uh oh!

MetBenjaminWent commented Jan 9, 2026

Uh oh!

tinyendian commented Feb 5, 2026

Uh oh!

jennyhickson commented Feb 18, 2026

Uh oh!

tinyendian commented Feb 18, 2026

Uh oh!

MetBenjaminWent commented Feb 20, 2026

Uh oh!

MetBenjaminWent left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

tinyendian commented Feb 23, 2026

Uh oh!

EdHone left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

tinyendian commented Dec 16, 2025 •

edited by EdHone

Loading

MetBenjaminWent commented Jan 6, 2026 •

edited

Loading

MetBenjaminWent commented Jan 7, 2026 •

edited

Loading

MetBenjaminWent left a comment •

edited

Loading

EdHone left a comment •

edited

Loading