Resolve nvhpc 25 3 #1020
Conversation
PR Reviewer Guide 🔍
Here are some key observations to aid the review process:
h-gpu nvhpc/25.9
h-gpu CUDA_HOME="/apps/compilers/cuda/12.8.1"
h-all HPC_OMPI_DIR="/apps/mpi/cuda/12.8.1/nvhpc/25.3/openmpi/5.0.7"
h-all HPC_OMPI_BIN="/apps/mpi/cuda/12.8.1/nvhpc/25.3/openmpi/5.0.7/bin"
h-all OMPI_MCA_pml=ob1 OMPI_MCA_coll_hcoll_enable=0
h-gpu PATH="/apps/mpi/cuda/12.8.1/nvhpc/25.3/openmpi/5.0.7/bin:${PATH}"
Suggestion: Change the NVHPC compiler version from nvhpc/25.9 to nvhpc/25.3 to match the version used for the OpenMPI libraries, preventing potential ABI compatibility issues. [possible issue, importance: 9]
Before:
h-gpu nvhpc/25.9
h-gpu CUDA_HOME="/apps/compilers/cuda/12.8.1"
h-all HPC_OMPI_DIR="/apps/mpi/cuda/12.8.1/nvhpc/25.3/openmpi/5.0.7"
h-all HPC_OMPI_BIN="/apps/mpi/cuda/12.8.1/nvhpc/25.3/openmpi/5.0.7/bin"
h-all OMPI_MCA_pml=ob1 OMPI_MCA_coll_hcoll_enable=0
h-gpu PATH="/apps/mpi/cuda/12.8.1/nvhpc/25.3/openmpi/5.0.7/bin:${PATH}"

Suggested:
h-gpu nvhpc/25.3
h-gpu CUDA_HOME="/apps/compilers/cuda/12.8.1"
h-all HPC_OMPI_DIR="/apps/mpi/cuda/12.8.1/nvhpc/25.3/openmpi/5.0.7"
h-all HPC_OMPI_BIN="/apps/mpi/cuda/12.8.1/nvhpc/25.3/openmpi/5.0.7/bin"
h-all OMPI_MCA_pml=ob1 OMPI_MCA_coll_hcoll_enable=0
h-gpu PATH="/apps/mpi/cuda/12.8.1/nvhpc/25.3/openmpi/5.0.7/bin:${PATH}"
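As a quick, purely illustrative sanity check (not part of the PR), the compiler/MPI pairing can be confirmed on a node by inspecting what the wrapper compilers resolve to; the module name and paths below are taken from the suggestion above.

```bash
# Illustrative check that the loaded NVHPC release matches the compiler the
# OpenMPI 5.0.7 tree was built with (paths/module names from the diff above).
module load nvhpc/25.3
export PATH="/apps/mpi/cuda/12.8.1/nvhpc/25.3/openmpi/5.0.7/bin:${PATH}"

which mpirun mpicc              # should resolve into the nvhpc/25.3 OpenMPI tree
mpicc --version                 # prints the underlying NVHPC compiler version
ompi_info | grep -i compiler    # shows the compiler Open MPI itself was built with
```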
(set -x; ${profiler} \
    mpirun -np ${nodes*tasks_per_node} \
           --bind-to none \
           "${target.get_install_binpath(case)}")
Suggestion: To improve performance on multi-GPU nodes, change the mpirun binding from --bind-to none to --bind-to socket to ensure better process-to-GPU affinity. [general, importance: 7]
Before:
(set -x; ${profiler} \
    mpirun -np ${nodes*tasks_per_node} \
           --bind-to none \
           "${target.get_install_binpath(case)}")

Suggested:
(set -x; ${profiler} \
    mpirun -np ${nodes*tasks_per_node} \
           --bind-to socket \
           --report-bindings \
           "${target.get_install_binpath(case)}")
Codecov Report
✅ All modified and coverable lines are covered by tests.

Additional details and impacted files:

@@           Coverage Diff           @@
##           master    #1020   +/-   ##
=======================================
  Coverage   41.60%   41.60%
=======================================
  Files          70       70
  Lines       20783    20783
  Branches     2616     2616
=======================================
  Hits         8647     8647
  Misses      10499    10499
  Partials     1637     1637

☔ View full report in Codecov by Sentry.
User description
Description
This branch started as an attempt to resolve issues with NVHPC/25.9 (hence the name) and gradually turned into setting up the build environment for NVHPC/25.9 on HiPerGator. After much testing, it looks like the primary issue with building MFC on HiPerGator was a driver mismatch, which is resolved by the new load module. I added the `h` flag for the HiPerGator system and a new mako script for batching.

There are a couple of issues with this code, mostly in the MPI changes. Looking at the modules file, we set two environment variables: `OMPI_MCA_pml=ob1` and `OMPI_MCA_coll_hcoll_enable=0` (see the sketch of what these control below). Looking […]
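For context (an illustration, not part of the diff beyond the two variables named above): `OMPI_MCA_pml=ob1` forces Open MPI's ob1 point-to-point messaging layer, and `OMPI_MCA_coll_hcoll_enable=0` disables the HCOLL collectives component. A minimal way to exercise the same settings by hand, assuming the OpenMPI 5.0.7 install from the modules file is on `PATH`:

```bash
# Equivalent to the two variables set in toolchain/modules; shown only to
# illustrate what they control. `./a.out` is a placeholder binary.
export OMPI_MCA_pml=ob1                # select the ob1 point-to-point messaging layer
export OMPI_MCA_coll_hcoll_enable=0    # disable the HCOLL collectives component

# The same settings can also be passed per invocation instead of via the environment:
mpirun --mca pml ob1 --mca coll_hcoll_enable 0 -np 4 ./a.out
```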
Type of change
Please delete options that are not relevant.
Scope
If you cannot check the above box, please split your PR into multiple PRs that each have a common goal.
How Has This Been Tested?
I was able to successfully run a job on the B200s interactively by getting on a node and first calling `source ./mfc.sh load -m g -c h`. I was also able to successfully batch a job with the mako file via `./mfc.sh /path/to/case.py --engine batch --gpu --mpi --computer hipergator` (the two workflows are summarized below).

Test Configuration:
This was run using Nvidia B200s on the University of Florida HiPerGator system.
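For reference, the two test workflows described above as shell commands (the case path is a placeholder from the description, not a file in the repository):

```bash
# Interactive run on a B200 node: load the GPU configuration for the HiPerGator ('h') system.
source ./mfc.sh load -m g -c h

# Batch submission through the new hipergator.mako template.
./mfc.sh /path/to/case.py --engine batch --gpu --mpi --computer hipergator
```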
Checklist
I ran `./mfc.sh format` before committing my code.

PR Type
Enhancement
Description
Add HiPerGator system support with NVHPC/25.9 compiler configuration
Create batch job template for HiPerGator SLURM scheduler
Configure MPI environment variables for OpenMPI compatibility
Set CUDA and compiler paths for B200 GPU nodes
Diagram Walkthrough
File Walkthrough
hipergator.mako (toolchain/templates/hipergator.mako)
HiPerGator SLURM batch job template; partition […]

modules (toolchain/modules)
HiPerGator module configuration with NVHPC/25.9; `h` system identifier for HiPerGator
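The template itself is not reproduced in this thread. Purely as an illustration of the kind of SLURM header a batch template for GPU nodes tends to emit (partition name, resources, and time limit below are placeholders, not values taken from hipergator.mako):

```bash
#!/bin/bash
#SBATCH --job-name=mfc_case            # placeholder job name
#SBATCH --partition=<gpu-partition>    # placeholder; the real partition comes from the template
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --gpus-per-node=1
#SBATCH --time=01:00:00

# Load the MFC environment for the HiPerGator ('h') system in GPU mode, as in the
# PR description, then run the case.
source ./mfc.sh load -m g -c h
```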