@Malmahrouqi3
Collaborator

@Malmahrouqi3 Malmahrouqi3 commented Jun 11, 2025

Description

Added one GPU benchmarking case by submitting SLURM jobs on Frontier - duplicate implementation of Phoenix. (#453)

Manual benchmarking:

Cloning

git clone --depth 1 https://github.com/MFlowCode/MFC.git master
git clone https://github.com/Malmahrouqi3/MFC-mo2.git pr --branch frontier-CI2

Copying Bash Scripts into master

rm -rf master/.github/workflows/*
cp -r pr/.github/workflows/* master/.github/workflows/

Submit Benchmark Jobs

bash pr/.github/workflows/frontier/submit-bench.sh pr/.github/workflows/frontier/bench.sh gpu
bash master/.github/workflows/frontier/submit-bench.sh master/.github/workflows/frontier/bench.sh gpu

Process Benchmark Results (once the SLURM jobs are done)

cd pr && . ./mfc.sh load -c f -m g
./mfc.sh bench_diff ../master/bench-gpu.yaml ../pr/bench-gpu.yaml
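
The "once the SLURM jobs are done" step could be automated with a small polling helper. This is only a sketch: `wait_for_jobs` is a hypothetical name that is not part of MFC, the job IDs in the example are placeholders, and it assumes `squeue` is on the PATH.

```shell
# Sketch: block until the given SLURM job IDs have left the queue.
# wait_for_jobs is a hypothetical helper, not part of MFC.
wait_for_jobs() {
    local interval="$1"
    shift
    local id
    for id in "$@"; do
        # `squeue -h -j ID` prints one line per matching job without a header.
        # Empty output (or an error for an unknown ID) means the job is gone.
        while [ -n "$(squeue -h -j "$id" 2>/dev/null)" ]; do
            sleep "$interval"
        done
        echo "job $id finished"
    done
}

# Example with placeholder job IDs, polling every 60 seconds:
# wait_for_jobs 60 1234567 1234568
```

Leaving the queue does not mean the job succeeded, so the resulting bench-gpu.yaml files should still be checked before running bench_diff.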

@codecov

codecov bot commented Jun 12, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 44.15%. Comparing base (47053f8) to head (e7e7de8).
Report is 2 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff             @@
##           master     #881      +/-   ##
==========================================
+ Coverage   44.03%   44.15%   +0.11%     
==========================================
  Files          68       68              
  Lines       18395    18347      -48     
  Branches     2227     2227              
==========================================
  Hits         8101     8101              
+ Misses       8991     8943      -48     
  Partials     1303     1303              

@Malmahrouqi3
Collaborator Author

Reduced the job duration to 3 hrs to see whether it would yield the same error regardless of duration.

@sbryngelson sbryngelson requested a review from Copilot June 21, 2025 16:44
@Malmahrouqi3
Collaborator Author

I ran dos2unix on all files in the frontier directory. Anyway, thanks; I will wait and see whether the test passes now.
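
A quick way to confirm that no CRLF line endings remain after a dos2unix pass is to search for carriage returns. This is a minimal sketch: `find_crlf` is a hypothetical helper name, and the directory in the example is taken from the paths used in this PR.

```shell
# Sketch: list text files under a directory that still contain carriage returns.
# find_crlf is a hypothetical helper, not part of MFC.
find_crlf() {
    local dir="$1"
    local cr
    # Command substitution strips trailing newlines only, so the CR survives.
    cr=$(printf '\r')
    # -r: recurse, -l: print file names only, -I: skip binary files.
    # grep exits non-zero when nothing matches; treat that as success here.
    grep -rlI "$cr" "$dir" || true
}

# Example:
# find_crlf .github/workflows/frontier
```

Empty output means no file in the directory contains a carriage return.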

@sbryngelson
Member

This benchmark test will never pass in its current state because the Frontier benchmarking files do not exist on the master branch, hence this error:

  (cd pr     && bash .github/workflows/frontier/submit-bench.sh .github/workflows/frontier/bench.sh gpu) &
  (cd master && bash .github/workflows/frontier/submit-bench.sh .github/workflows/frontier/bench.sh gpu) &
  wait %1 && wait %2
  shell: /usr/bin/bash -e {0}
  env:
    ACTIONS_RUNNER_FORCE_ACTIONS_NODE_VERSION: node16
    ACTIONS_ALLOW_USE_UNSECURE_NODE_VERSION: true
bash: .github/workflows/frontier/submit-bench.sh: No such file or directory
Submitted batch job 3531713

Once everything looks like it is working as well as one can expect, we can merge in the minimal files (.github/workflows/*) and then create a new PR that tests it properly.
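
Until the scripts exist on master, a guard around the submission would make the failure explicit instead of letting one side die with "No such file or directory". This is a sketch only: `submit_bench_if_present` is a hypothetical function and not part of the workflow in this PR. The script paths are the ones shown in the log above.

```shell
# Sketch: submit a benchmark only when the checkout actually ships the script.
# submit_bench_if_present is a hypothetical helper, not part of MFC.
submit_bench_if_present() {
    local checkout="$1"
    local device="$2"
    local dir=".github/workflows/frontier"
    if [ ! -f "$checkout/$dir/submit-bench.sh" ]; then
        echo "skip: $checkout has no $dir/submit-bench.sh" >&2
        return 1
    fi
    # Run from inside the checkout, mirroring the workflow step.
    (cd "$checkout" && bash "$dir/submit-bench.sh" "$dir/bench.sh" "$device")
}

# Example:
# submit_bench_if_present pr gpu && submit_bench_if_present master gpu
```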

@Malmahrouqi3
Collaborator Author

Aight, someone (me or anyone else) has to test it manually: clone master and pr, add the bash files to each, then benchmark on Frontier as a SLURM or interactive job to make sure nothing gets corrupted in the process.

@wilfonba
Collaborator

I verified that this works on my end. The IBM case still gives NaNs though...

@Malmahrouqi3
Collaborator Author

Malmahrouqi3 commented Jun 25, 2025

> I verified that this works on my end. The IBM case still gives NaNs though...

Thanks much, and I wonder what the deal is with the IBM case ngl. Any specific error messages or such? If the issue persists, we can just exclude that case somehow. Also, I guess NaNs won't fail the test, as can be seen on my recent PR where I assigned null to the IBM grind/exec: #895 (comment)

Edit: lmk, if you suspect anything that might have caused that.

@wilfonba
Collaborator

Well, the NaN issue was supposed to be fixed by #892 but it appears that that's not the case

@sbryngelson
Member

status?

@Malmahrouqi3
Collaborator Author

@sbryngelson done on my end tbh and nothing to add

@sbryngelson
Member

> Well, the NaN issue was supposed to be fixed by #892 but it appears that that's not the case

what's going on here?

@wilfonba
Collaborator

> > Well, the NaN issue was supposed to be fixed by #892 but it appears that that's not the case
>
> what's going on here?

Any ideas @anandrdbz ?

@anandrdbz
Contributor

anandrdbz commented Jun 30, 2025

I'll look into it. Last time I checked, 2D_ibm was working; perhaps there were multiple issues causing NaNs.

@anandrdbz
Contributor

anandrdbz commented Jun 30, 2025

I just ran 2D_ibm and 2D_ibm_multiphase to completion on an interactive node @wilfonba. Is there another example case file that's failing?

@sbryngelson
Member

> I just ran 2D_ibm and 2D_ibm_multiphase to completion on an interactive node @wilfonba, is there another example case file that's failing?

It's the IBM case in the benchmarking cases (what this PR is about)

@anandrdbz
Contributor

Not sure when this was done, but the ibm case file in benchmarks does not actually have ib = T. In fact, it's just running a single-fluid hypoelastic case.

@anandrdbz
Contributor

anandrdbz commented Jun 30, 2025

Anyway, I believe the reason this particular case fails has nothing to do with IBM, since ib is not set. I think the reason is that the problem size on Frontier is larger than on Phoenix, because Frontier uses 8 GPUs while the time step is hardcoded. I ran the same case file on a single GCD on Frontier and it worked. I also reduced dt by a factor of 2 on 8 ranks, and that also runs.

But I guess there's not much point debugging this since there needs to be an overhaul of the case file to include an actual IBM case
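
One way to read this explanation is as a CFL-type time-step restriction. This is an assumption; the thread does not confirm the mechanism. It also assumes that the larger problem size means a finer grid on the same physical domain. The symbols \lambda_{\max} (largest wave speed) and r (grid refinement factor per direction) are introduced here for illustration only.

```latex
\[
\mathrm{CFL} = \frac{\lambda_{\max}\,\Delta t}{\Delta x},
\qquad
\Delta x \to \frac{\Delta x}{r}
\;\Longrightarrow\;
\mathrm{CFL} \to r\,\mathrm{CFL}
\quad (\text{fixed } \Delta t).
\]
```

Under these assumptions, a hardcoded \Delta t that is stable on the smaller grid can exceed the stability limit once the grid is refined, and halving \Delta t restores the original CFL number whenever r \le 2. That matches the observation that the case runs on a single GCD and on 8 ranks with dt halved, but it does not prove the cause.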

@sbryngelson
Member

Waiting for CI to run; then I will merge.

@sbryngelson sbryngelson self-requested a review July 3, 2025 13:07
@sbryngelson sbryngelson merged commit 6f58eec into MFlowCode:master Jul 3, 2025
25 of 31 checks passed
prathi-wind pushed a commit to prathi-wind/MFC-prathi that referenced this pull request Jul 13, 2025
Co-authored-by: mohdsaid497566 <[email protected]>
Co-authored-by: Spencer Bryngelson <[email protected]>
Co-authored-by: Spencer Bryngelson <[email protected]>
Co-authored-by: wilfonba <[email protected]>