Frontier Benchmarking (#453) #881
Codecov Report
All modified and coverable lines are covered by tests ✅
Additional details and impacted files
@@ Coverage Diff @@
## master #881 +/- ##
==========================================
+ Coverage 44.03% 44.15% +0.11%
==========================================
Files 68 68
Lines 18395 18347 -48
Branches 2227 2227
==========================================
Hits 8101 8101
+ Misses 8991 8943 -48
Partials 1303 1303
|
|
Reduced the job duration to 3 hrs to see whether it would yield the same error regardless of duration. |
|
I did |
|
This benchmark test will never pass in its current state because the Frontier files for benchmarking do not exist on the master branch, hence this error:
(cd pr && bash .github/workflows/frontier/submit-bench.sh .github/workflows/frontier/bench.sh gpu) &
(cd master && bash .github/workflows/frontier/submit-bench.sh .github/workflows/frontier/bench.sh gpu) &
wait %1 && wait %2
shell: /usr/bin/bash -e {0}
env:
  ACTIONS_RUNNER_FORCE_ACTIONS_NODE_VERSION: node16
  ACTIONS_ALLOW_USE_UNSECURE_NODE_VERSION: true
bash: .github/workflows/frontier/submit-bench.sh: No such file or directory
Submitted batch job 3531713
|
Once it looks like everything is working as well as one can expect, we can merge in the minimal files (
|
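A minimal sketch of a guard for the failure in the log above: check that both checkouts contain the Frontier scripts before launching the paired jobs. The paths and the `gpu` argument are copied from the log; the function names `missing_scripts` and `run_paired_bench` are hypothetical and not part of the MFC workflow.

```shell
#!/usr/bin/env bash
# Sketch only: function names are hypothetical; paths mirror the CI log.

SUBMIT=.github/workflows/frontier/submit-bench.sh
BENCH=.github/workflows/frontier/bench.sh

# Print each checkout directory that lacks the submit or bench script.
missing_scripts() {
    local dir
    for dir in "$@"; do
        if [ ! -f "$dir/$SUBMIT" ] || [ ! -f "$dir/$BENCH" ]; then
            echo "$dir"
        fi
    done
}

# Launch both submissions in the background, but only if nothing is missing.
run_paired_bench() {
    local missing
    missing=$(missing_scripts pr master)
    if [ -n "$missing" ]; then
        # Unquoted on purpose so several directories print on one line.
        echo "missing Frontier scripts in:" $missing >&2
        return 1
    fi
    (cd pr && bash "$SUBMIT" "$BENCH" gpu) &
    local pr_pid=$!
    (cd master && bash "$SUBMIT" "$BENCH" gpu) &
    local master_pid=$!
    # Same shape as `wait %1 && wait %2` in the workflow, using PIDs instead.
    wait "$pr_pid" && wait "$master_pid"
}
```

With this guard, a master checkout that lacks the scripts produces one explicit message rather than a `No such file or directory` error next to a successfully submitted PR job.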
aight, either I or someone else has to test it out manually by cloning master & pr, adding the bash files to each, then benchmarking on Frontier as a slurm/interactive job to make sure nothing gets corrupted in the process.
|
I verified that this works on my end. The IBM case still gives NaNs though... |
Thanks much, and I wonder what the deal is with the IBM case ngl. Any specific error messages or such? If the issue persists, we can just exclude that case somehow. Also, NaNs I guess won't fail the test as can be seen on my recent PR when I assigned null to IBM grind/exec #895 (comment) Edit: lmk, if you suspect anything that might have caused that. |
|
Well, the NaN issue was supposed to be fixed by #892 but it appears that that's not the case |
|
status? |
|
@sbryngelson done on my end tbh and nothing to add |
what's going on here? |
Any ideas @anandrdbz ? |
|
I'll look into it, last time I checked 2D_ibm was working, perhaps there were multiple issues causing NaNs |
|
I just ran 2D_ibm and 2D_ibm_multiphase to completion on an interactive node @wilfonba, is there another example case file that's failing?
It's the IBM case in the benchmarking cases (what this PR is about) |
|
Not sure when this was done, but the ibm case file in benchmarks does not actually have ib = T. In fact, it's just running a single-fluid hypoelastic case.
|
Anyways, the reason this particular case fails has nothing to do with IBM, since ib is not set. I think the problem size on Frontier is larger than on Phoenix because it uses 8 GPUs, while the time step is hardcoded. I ran the same case file on a single GCD on Frontier and it worked. I also reduced dt by a factor of 2 on 8 ranks and that also runs. But I guess there's not much point debugging this, since the case file needs an overhaul to include an actual IBM case.
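One possible reading of the factor-of-2 observation, sketched under assumptions this thread does not confirm: the benchmark keeps the physical domain fixed and grows the 3D grid with the rank count, and stability is governed by a CFL-type limit. The symbols below ($C$, $u$, $c$, $N$) are introduced only for this illustration.

```latex
% Assumed CFL-type limit for an explicit scheme (C: CFL number,
% u: flow velocity, c: sound speed):
\Delta t \le C \, \frac{\Delta x}{\max\left(|u| + c\right)}

% Assumed weak scaling in 3D on a fixed domain: 8 ranks give 8x the cells,
% i.e. 2x the cells per direction, so the spacing halves:
N_{8} = 8\,N_{1}
\;\Longrightarrow\;
\Delta x_{8} = \frac{\Delta x_{1}}{\sqrt[3]{8}} = \frac{\Delta x_{1}}{2}
\;\Longrightarrow\;
\Delta t_{8} \le \frac{\Delta t_{1}}{2}
```

If these assumptions hold, a dt that is just stable on one GCD would be about twice too large on 8 ranks, which fits the report that halving dt lets the case run.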
|
waiting for CI to run, then will merge
Co-authored-by: mohdsaid497566 <[email protected]>
Co-authored-by: Spencer Bryngelson <[email protected]>
Co-authored-by: Spencer Bryngelson <[email protected]>
Co-authored-by: wilfonba <[email protected]>
Description
Added one GPU benchmarking case by submitting SLURM jobs on Frontier - duplicate implementation of Phoenix. (#453)
Manual Benchmarking:
Cloning
Copying Bash Scripts into master
Submit Benchmark Jobs
Process Benchmark Results (once the SLURM jobs are done)
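The commands for these steps are not preserved here, so the following is only a dry-run sketch that prints a plausible plan without executing anything. The repository URL and the submit commands come from the CI log in this thread. `PR_REMOTE`, `PR_BRANCH`, and the function name `plan_manual_bench` are hypothetical placeholders, and the result-processing step is deliberately left out.

```shell
#!/usr/bin/env bash
# Dry-run sketch: prints the planned commands, runs none of them.
# PR_REMOTE and PR_BRANCH are hypothetical placeholders to be filled in.

REPO_URL=https://github.com/MFlowCode/MFC
PR_REMOTE=${PR_REMOTE:-"<fork-url>"}
PR_BRANCH=${PR_BRANCH:-"<pr-branch>"}
FRONTIER_DIR=.github/workflows/frontier

plan_manual_bench() {
    # 1. Cloning: one checkout per side of the comparison.
    echo "git clone $REPO_URL master"
    echo "git clone --branch $PR_BRANCH $PR_REMOTE pr"
    # 2. Copying the bash scripts into master, which does not have them yet.
    echo "mkdir -p master/$FRONTIER_DIR"
    echo "cp pr/$FRONTIER_DIR/submit-bench.sh pr/$FRONTIER_DIR/bench.sh master/$FRONTIER_DIR/"
    # 3. Submitting the benchmark jobs with the same commands CI uses.
    local dir
    for dir in pr master; do
        echo "(cd $dir && bash $FRONTIER_DIR/submit-bench.sh $FRONTIER_DIR/bench.sh gpu)"
    done
    # 4. Processing the results is omitted: those commands are not shown in this thread.
}

plan_manual_bench
```

Printing the plan first makes it easy to review the exact commands before running them by hand on a Frontier login node.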