Skip to content

Conversation

@tinyendian
Copy link

@tinyendian tinyendian commented Dec 16, 2025

PR Summary

Sci/Tech Reviewer: @MetBenjaminWent
Code Reviewer: @EdHone

Code Quality Checklist

(Some checks are automatically carried out via the CI pipeline)

  • I have performed a self-review of my own code
  • My code follows the project's style guidelines
    style guidelines
  • Comments have been included that aid understanding and enhance the
    readability of the code
  • My changes generate no new warnings

Testing

  • I have tested this change locally, using the LFRic Apps rose-stem suite
  • If any tests fail (rose-stem or CI) the reason is understood and
    acceptable (eg. kgo changes)
  • I have added tests to cover new functionality as appropriate (eg. system
    tests, unit tests, etc.)
  • Any new tests have been assigned an appropriate amount of compute resource
    and have tests been allocated to an appropriate testing group (i.e. the
    developer tests are for jobs which use a small amount of compute resource
    and complete in a matter of minutes)
  • I can only run a subset of Rose stem tests on Monsoon due to various difficulties, but the ex1a_omp_developer succeed, and the run_lfric_atm_scm_coma9_toga-BiP2x2-50000x50000_ex1a_gnu_fast-debug-64bit and run_lfric_atm_scm_comorph_dev_toga-BiP2x2-50000x50000_ex1a_gnu_fast-debug-64bit tests no longer fail their KGO tests (these failures were caused by PSyclone dropping the !DIR$ IVDEP compiler directives)
  • Successful builds with the meto-ex1a and esnz-cascade optimisation platforms on the ESNZ Cascade HPC

trac.log

  • Unable to run the full test suite on Monsoon

Security Considerations

  • I have reviewed my changes for potential security issues
  • Sensitive data is properly handled (if applicable)
  • Authentication and authorisation are properly implemented (if applicable)

Performance Impact

  • Performance of the code has been considered and, if applicable, suitable
    performance measurements have been conducted

AI Assistance and Attribution

  • Some of the content of this change has been produced with the assistance
    of Generative AI tool name (e.g., Met Office Github Copilot Enterprise,
    Github Copilot Personal, ChatGPT GPT-4, etc) and I have followed the
    Simulation Systems AI policy
    (including attribution labels)

Documentation

  • Where appropriate I have updated documentation related to this change and
    confirmed that it builds correctly

PSyclone Approval

  • If you have edited any psyclone related code (eg. PsyKAl-lite, Kernal
    inteface, optimisation scripts, LFRic data structure code) then please
    contact the
    [email protected]

Note: The email address does not work, unfortunately

Sci/Tech Review

  • I understand this area of code and the changes being added
  • The proposed changes correspond to the pull request description
  • Documentation is sufficient (do documentation papers need updating)
  • Sufficient testing has been completed

Please alert the code reviewer via a tag when you have approved the SR

Code Review

  • All dependencies have been resolved
  • Related Issues have been properly linked and addressed
  • CLA compliance has been confirmed
  • Code quality standards have been met
  • Tests are adequate and have passed
  • Documentation is complete and accurate
  • Security considerations have been addressed
  • Performance impact is acceptable

@github-actions github-actions bot added the cla-required CLA signature is required for this PR. label Dec 16, 2025
@tinyendian tinyendian force-pushed the 47_additional_pc2_optimisations branch from f8b1263 to 849fd80 Compare December 16, 2025 05:23
@github-actions github-actions bot added cla-signed This contributor has signed the CLA. and removed cla-required CLA signature is required for this PR. labels Dec 16, 2025
@tinyendian
Copy link
Author

Closes #47

@MetBenjaminWent
Copy link
Contributor

Are you okay Assigning me this when it's ready for SR? Thanks!

@tinyendian
Copy link
Author

Thanks @MetBenjaminWent - the PR should be ready, but my fork became detached from the lfric_apps repo when its visibility was changed to public yesterday. I am currently trying to get this resolved via GitHub support, but there seem to be issues with support entitlement for my GitHub user, in which case I'll probably need to refork and recreate the branch and pull request. I'll wait for a day and see what happens.

@tinyendian
Copy link
Author

SR notes for @MetBenjaminWent:

  • I'm not sure what is required for the "Security Considerations" section in the PR Summary, so I left it empty for now.
  • I could not contact [email protected] for PSyclone script approval, the email address doesn't seem to work (and the PR Summary template has a typo)
  • The PSyclone transmute script for pc_bm_initiate uses a custom CompilerDirective class to insert !DIR$ NOFISSION and !DIR$ IVDEP directives into the subroutine; the latter fixes the KGO comparison failures with CCE that we previously observed. This class is a workaround for a couple of issues - PSyclone v3.2 introduced --keep-comments and --keep-directives flags, but its FortranWriter introduces spaces in the output (e.g., ! DIR$ NOFISSION), which breaks directives. Using these flags would also require per-file PSyclone flag support in the LFRic build system to avoid using them globally, which might break other scripts, and the PSyclone script would have to modify or remove the existing OpenMP statements in the Fortran code. The fparser package has a Directive class, but it is currently broken due to a missing method. The best way forward seemed to be using a custom CompilerDirective class. Note that the PSyclone and fparser issues are already known and should be resolved in the next releases.

@tinyendian
Copy link
Author

@MetBenjaminWent I don't seem to have permission to reassign the PR, you or @jennyhickson would need to do this, thanks.

@MetBenjaminWent
Copy link
Contributor

Initial runs with the branch on EXs are sadly showing KGO failures with GNU and CCE.
We need to identify the root cause of these before it can be approved for trunk.

I'm trying the files one by one to see if one of them in particular are causing the failures.

I'll attach the trac output below once the runs are finished.

@MetBenjaminWent
Copy link
Contributor

It seems something about pc2_bm_initiate is causing the KGO failure in this ticket sadly.

@tinyendian
Copy link
Author

Thanks for running the tests, @MetBenjaminWent. Did you attach the output from the failed tests somewhere? I'll look into it after the Christmas break, hopefully I'll be able to get them to work on Monsoon to reproduce the problem (some test executables don't run there).

@MetBenjaminWent
Copy link
Contributor

KGO difference:

< Inner product checksum rho =         48D7FC40
< Inner product checksum theta =         5392A678
< Inner product checksum u =         6A97B83A
< Inner product checksum mr1 =         41CCED01
< Inner product checksum mr2 =         39C8DCE2
< Inner product checksum mr3 =         37ADC3D4
< Inner product checksum mr4 =         39774AE9
---
> Inner product checksum rho =         48D7FBE8
> Inner product checksum theta =         5392A5AE
> Inner product checksum u =         6A97C717
> Inner product checksum mr1 =         41CCEEC7
> Inner product checksum mr2 =         39D1449A
> Inner product checksum mr3 =         37AE7933
> Inner product checksum mr4 =         3973CE88

Only additional file added in this run (The rest were not passed to PSyclone):
pc2_bm_initiate

I'm having a look at where the directives where added relative to the original, that's the only thing I can think of, some kind of ordering issue still.

@MetBenjaminWent
Copy link
Contributor

In regards to possible differences, potentially an ordering of this NOFUSION and the OMP?

In the original, where the OMP was around the J loop at L536, where we have moved it down to the I loop in the original file at L657, and a pre-processed and PSycloned file at L275, the replacement !DIR$ NOFISSION is now before the OMP section, as opposed to inside it. I'm not sure if that has had a effect, I can have a look at the listing files.

Same again with original file L705 and generated L317.

Otherwise I cannot spot anything.

I'll see if the KGOs hold. If so we could reach out to the Code Owner to see if they will accept the change.

Copy link
Contributor

@MetBenjaminWent MetBenjaminWent left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We are seeing KGO issues in pc2_bm_initiate, until we can pinpoint the reason, or gain a approval from the CO, this PSyclone script cannot be accepted.

@MetBenjaminWent
Copy link
Contributor

MetBenjaminWent commented Jan 6, 2026

Updated KGO does not hold with CCE, indicating a possible race condition of some kind

@tinyendian
Copy link
Author

Thanks for running these tests @MetBenjaminWent. As I mentioned previously, I would need to know which ones are failing to be able to resolve this. Monsoon doesn't seem to be supported as a regular test platform in the LFRic Apps Rose stem setup, I ported a bunch of workflow patches every time I created a new branch, and some of tests won't run at all. Hopefully I'll be able to get the failing ones to work once I know which ones they are.

@MetBenjaminWent
Copy link
Contributor

MetBenjaminWent commented Jan 7, 2026

Apologies, it was primarily these tests:
https://github.com/MetOffice/lfric_apps/blob/main/rose-stem/site/meto/groups/groups_lfric_atm.cylc

    "ex1a_omp_C48_cce": [
        "lfric_atm_nwp_gal9_noukca_1T-C48_MG_ex1a_cce_fast-debug-32bit",
        "lfric_atm_nwp_gal9_noukca_2T-C48_MG_ex1a_cce_fast-debug-32bit",
        "lfric_atm_nwp_gal9_noukca_4T-C48_MG_ex1a_cce_fast-debug-32bit",
.......
    "ex1a_omp_gnu": [
.......
        "lfric_atm_nwp_gal9_noukca_1T-C48_MG_ex1a_gnu_fast-debug-32bit",
        "lfric_atm_nwp_gal9_noukca_2T-C48_MG_ex1a_gnu_fast-debug-32bit",
        "lfric_atm_nwp_gal9_noukca_4T-C48_MG_ex1a_gnu_fast-debug-32bit",
.......

But any of the KGO threaded teats from the same group are also failing, but I would use copies of these as the yard stick.

So thats thenwp_gal9_noukcaconfiguration.

Weak scaled MPI ranks against OMP threads as noted in here:
https://github.com/MetOffice/lfric_apps/blob/main/rose-stem/site/common/lfric_atm/tasks_lfric_atm.cylc

@tinyendian
Copy link
Author

Thanks @MetBenjaminWent, that is curious - I use the ex1a_omp_developer test group that includes ex1a_omp_C48_cce and ex1a_omp_gnu as my main developer tests. Most of these run on Monsoon and I don't see any KGO failures at all. I do get runtime crashes with the lfric_atm_clim_gal9_chem_xT-C12 tests both for CCE and GNU (not sure what causes these), and KGO seems to be turned off for most of the GNU tests in rose-stem/site/common/lfric_atm/tasks_lfric_atm.cylc (only few GNU tests seem to have KGO checksums in rose-stem/site/meto/kgos/lfric_atm/ex1a).

Seen that I get no KGO comparison errors for tests like lfric_atm_nwp_gal9_noukca_xT-C48_MG_ex1a_cce_fast-debug-32bit on Monsoon, maybe the failures are caused by differences in the build setup? I had to apply several workarounds to rose-stem/site/meto/common/suite_config_ex1a.cylc to be able to run the tests:

{% set ex1a_base = 'module use /projects/metoff/spackadmin.mon/releases/2025.10.1/ngms' %}
[...]
 {% set ex1a_compiler_cce =  'module switch PrgEnv-cray PrgEnv-cray/8.4.0 ; ' ~
                             'module load cpe/23.05 ; '~
                             'module switch cce cce/15.0.0 ; '~
                             'module load lfric-cray/15.0.0/3.0+ || true ; module load py-pyyaml/6.0.2 py-psyclone/3.1.0 || true ; '~
                             'export FFLAGS="-I/home/users/wolfgang.hayek.ext/xios/xios_r2701_prod_O3_spack/inc $FFLAGS"; '~
                             'export LDFLAGS="-L/home/users/wolfgang.hayek.ext/xios/xios_r2701_prod_O3_spack/lib $LDFLAGS"; '~
                             'export PYTHONPATH=/projects/metoff/spackadmin.mon/releases/2025.10.1/install/linux-sles15-zen3/gcc-12.2.0/py-graphviz-0.13.2-trph5f36s33gx26rnf3kbvdp7dh32guw/lib/python3.10/site-packages:$PYTHONPATH ; ' %}

I had to downgrade PSyclone v3.2.0 to v3.1.0 due to build errors (PSyclone crashes), and I needed to rebuild XIOS since the Spack-built version crashes at runtime. There were also some Python packages missing (pyyaml and graphviz).

Which PSyclone version do you use, v3.2.0? If that's the case, I'll investigate why builds with v3.2.0 don't work for me.

@MetBenjaminWent
Copy link
Contributor

On the EXs, it looks like the PSyclone version is psyclone/3.2.2

I think there's been a few important changes to PSyclone. However I'm currently not sure that this is the root cause of the issues.

I think for the rest of the PR, it might be worth breaking work related to pc_bm_initiate out into it's own PR?

@tinyendian
Copy link
Author

Thanks, that resolved the mystery - after switching from PSyclone v3.1.0 to PSyclone v3.2.2, I can now reproduce the KGO comparison failures on Monsoon, and some of the tests even fail to run at all, reporting algorithm failures.

The error mechanism turned out to be the following:

  • PSyclone v3.2.2 applies additional checks for race conditions compared to PSyclone v3.1.0, finds more false dependencies, and thus refuses to add OpenMP worksharing loop directives (omp do) to some of the loops in pc2_bm_initiate.f90.
  • Unfortunately, these failures do not trigger an exception in the transmute script as in previous PSyclone releases, the script simply silently stops applying OpenMP transformations half-way through and without any error messages, and the build system carries on. I can only get warnings about these (false) race conditions when I set the verbose=True parameter in the apply method of the OpenMP loop transformation, and they are added to the Fortran output as comments rather than to stderr and do not trigger an exception.
  • The result is that only omp parallel directives are added to the pc2_bm_initiate.f90 Fortran code without any omp do directives inside the region. OpenMP threads will thus execute the code block redundantly, causing race conditions that lead to the observed runtime and KGO comparison failures.

Maybe it was a deliberate choice that PSyclone no longer triggers exceptions for race condition warnings? As you suggested, pc2_bm_initiate was affected, but there may be other broken transmute scripts. I'll investigate, and I'll need to work out how to best handle the false dependencies and how to get PSyclone to trigger exceptions if something goes wrong.

There is no rush with this pull request; I'll leave it open for now and will implement the fixes.

@MetBenjaminWent
Copy link
Contributor

Good spot, yeah the silent failing certainly seems incorrect, especially in this instance, likely a output warning followed by it continuing would be better?

No worries, I'll drop this off my radar for now. If I miss it when it goes back into review, feel free to drop me an email too to prompt me again, thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

cla-signed This contributor has signed the CLA.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants