Additional PC2 optimisations for NG-ARCH #53
|
Closes #47 |
|
Are you okay assigning this to me when it's ready for SR? Thanks! |
|
Thanks @MetBenjaminWent - the PR should be ready, but my fork became detached from the |
|
SR notes for @MetBenjaminWent:
|
|
@MetBenjaminWent I don't seem to have permission to reassign the PR, you or @jennyhickson would need to do this, thanks. |
|
Initial runs with the branch on the EXs are sadly showing KGO failures with GNU and CCE. I'm trying the files one by one to see if one of them in particular is causing the failures. I'll attach the trac output below once the runs are finished. |
|
It seems something about pc2_bm_initiate is causing the KGO failure in this ticket sadly. |
|
Thanks for running the tests, @MetBenjaminWent. Did you attach the output from the failed tests somewhere? I'll look into it after the Christmas break, hopefully I'll be able to get them to work on Monsoon to reproduce the problem (some test executables don't run there). |
|
KGO difference: Only additional file added in this run (the rest were not passed to PSyclone): I'm having a look at where the directives were added relative to the original; that's the only thing I can think of, possibly still some kind of ordering issue. |
|
Regarding possible differences: could it be the ordering of the NOFUSION directive and the OMP directive? In the original, the OMP directive was around the J loop at L536, whereas we have moved it down to the I loop (original file L657, pre-processed and PSycloned file L275). The same applies at original file L705 and generated L317. Otherwise I cannot spot anything. I'll see if the KGOs hold. If so, we could reach out to the Code Owner to see if they will accept the change. |
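To illustrate why the loop level that carries the OMP directive can matter, here is a minimal Python sketch (not taken from pc2_bm_initiate; the function names and data are hypothetical). It mimics parallel execution by changing the iteration order: a loop whose iterations are independent gives the same answer in any order, while a loop with a loop-carried dependence does not, which is the kind of behaviour that would show up as a KGO difference.

```python
# Illustrative only: iteration order stands in for thread scheduling.
# None of these names come from the LFRic or UM sources.

def independent_update(values, order):
    """Each iteration writes only its own element, so order is irrelevant."""
    out = list(values)
    for i in order:
        out[i] = 2.0 * values[i]
    return out


def dependent_update(values, order):
    """Each iteration reads its left neighbour from the array being updated,
    so the result depends on which iteration ran first."""
    out = list(values)
    for i in order:
        if i > 0:
            out[i] = out[i] + out[i - 1]
    return out


data = [1.0, 2.0, 3.0, 4.0]
forward = range(len(data))
backward = range(len(data) - 1, -1, -1)

independent_matches = (
    independent_update(data, forward) == independent_update(data, backward)
)
dependent_matches = (
    dependent_update(data, forward) == dependent_update(data, backward)
)
print(independent_matches, dependent_matches)
```

If the I loop (or the J loop) in the real routine behaves like `dependent_update`, placing the directive on that loop would be unsafe even though the serial result is unchanged.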
MetBenjaminWent
left a comment
We are seeing KGO issues in pc2_bm_initiate. Until we can pinpoint the reason, or gain approval from the CO, this PSyclone script cannot be accepted.
|
Updated KGO does not hold with CCE, indicating a possible race condition of some kind |
|
Thanks for running these tests @MetBenjaminWent. As I mentioned previously, I would need to know which ones are failing to be able to resolve this. Monsoon doesn't seem to be supported as a regular test platform in the LFRic Apps Rose stem setup; I have ported a bunch of workflow patches every time I created a new branch, and some of the tests won't run at all. Hopefully I'll be able to get the failing ones to work once I know which ones they are. |
|
Apologies, it was primarily these tests: Any of the KGO threaded tests from the same group are also failing, but I would use copies of these as the yardstick. So that's the weak-scaled MPI ranks against OMP threads, as noted in here: |
|
Thanks @MetBenjaminWent, that is curious - I use the Seen that I get no KGO comparison errors for tests like I had to downgrade PSyclone v3.2.0 to v3.1.0 due to build errors (PSyclone crashes), and I needed to rebuild XIOS since the Spack-built version crashes at runtime. There were also some Python packages missing (pyyaml and graphviz). Which PSyclone version do you use, v3.2.0? If that's the case, I'll investigate why builds with v3.2.0 don't work for me. |
|
On the EXs, it looks like the PSyclone version is I think there's been a few important changes to PSyclone. However I'm currently not sure that this is the root cause of the issues. I think for the rest of the PR, it might be worth breaking work related to |
|
Thanks, that resolved the mystery - after switching from PSyclone v3.1.0 to PSyclone v3.2.2, I can now reproduce the KGO comparison failures on Monsoon, and some of the tests even fail to run at all, reporting algorithm failures. The error mechanism turned out to be the following:
Maybe it was a deliberate choice that PSyclone no longer triggers exceptions for race condition warnings? As you suggested, There is no rush with this pull request; I'll leave it open for now and will implement the fixes. |
|
Good spot, yeah the silent failing certainly seems incorrect, especially in this instance. Perhaps an output warning, followed by it continuing, would be better? No worries, I'll drop this off my radar for now. If I miss it when it goes back into review, feel free to drop me an email too to prompt me again, thanks! |
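The two policies discussed above (raise on a suspected race condition, or warn and continue) can be sketched with Python's standard `warnings` module. This is only an illustration under stated assumptions: `RaceConditionWarning`, `parallelise_loop` and `apply_all` are hypothetical names and do not correspond to the PSyclone API or to the optimisation script in this PR.

```python
import warnings


class RaceConditionWarning(UserWarning):
    """Hypothetical warning category for a suspected data race."""


def parallelise_loop(loop_name, has_dependence):
    """Hypothetical stand-in for one transformation step.

    Returns True if the loop was parallelised, False if it was left serial.
    """
    if has_dependence:
        warnings.warn(
            f"possible race condition in loop '{loop_name}'; left serial",
            RaceConditionWarning,
        )
        return False
    return True


def apply_all(loops, strict):
    """Apply the step to every loop.

    strict=True  : a suspected race condition aborts the script.
    strict=False : the warning is recorded and processing continues.
    """
    applied = []
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("error" if strict else "always", RaceConditionWarning)
        for name, has_dependence in loops:
            if parallelise_loop(name, has_dependence):
                applied.append(name)
    return applied, [str(w.message) for w in caught]


loops = [("loop_a", False), ("loop_b", True)]

applied, messages = apply_all(loops, strict=False)
for message in messages:
    print("WARNING:", message)

try:
    apply_all(loops, strict=True)
    strict_raised = False
except RaceConditionWarning:
    strict_raised = True
```

Either policy keeps the problem visible: the non-strict path reports every skipped loop, and the strict path stops the build before a questionable transformation reaches the KGO tests.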
PR Summary
Sci/Tech Reviewer: @MetBenjaminWent
Code Reviewer: @EdHone
Code Quality Checklist
(Some checks are automatically carried out via the CI pipeline)
style guidelines
readability of the code
Testing
acceptable (eg. kgo changes)
tests, unit tests, etc.)
and have tests been allocated to an appropriate testing group (i.e. the
developer tests are for jobs which use a small amount of compute resource
and complete in a matter of minutes)
ex1a_omp_developer succeed, and the run_lfric_atm_scm_coma9_toga-BiP2x2-50000x50000_ex1a_gnu_fast-debug-64bit and run_lfric_atm_scm_comorph_dev_toga-BiP2x2-50000x50000_ex1a_gnu_fast-debug-64bit tests no longer fail their KGO tests (these failures were caused by PSyclone dropping the !DIR$ IVDEP compiler directives)
meto-ex1a and esnz-cascade optimisation platforms on the ESNZ Cascade HPC
trac.log
Security Considerations
Performance Impact
performance measurements have been conducted
AI Assistance and Attribution
of Generative AI tool name (e.g., Met Office Github Copilot Enterprise,
Github Copilot Personal, ChatGPT GPT-4, etc) and I have followed the
Simulation Systems AI policy
(including attribution labels)
Documentation
confirmed that it builds correctly
PSyclone Approval
interface, optimisation scripts, LFRic data structure code) then please
contact the
[email protected]
Note: The email address does not work, unfortunately
Sci/Tech Review
Please alert the code reviewer via a tag when you have approved the SR
Code Review