
Conversation


@danieljvickers danieljvickers commented Oct 31, 2025

User description

Description

This branch started as an attempt to resolve issues with NVHPC/25.3 (hence the name) and slowly turned into setting up the build environment for NVHPC/25.9 on HiPerGator. After much testing, it looks like the primary issue with building MFC on HiPerGator was a driver mismatch, which is resolved by the new load module. I added the h flag for the HiPerGator system and a new Mako script for batch submission.

There are a couple of issues with this code, mostly in the MPI changes. In the modules file we set two environment variables: OMPI_MCA_pml=ob1 and OMPI_MCA_coll_hcoll_enable=0.
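
For context, a minimal sketch of what those two settings do when exported in a shell, with comments based on the standard OpenMPI MCA parameters of the same names (an illustration, not part of the PR diff):

# Select the ob1 point-to-point messaging layer instead of the default (e.g. UCX).
export OMPI_MCA_pml=ob1
# Disable the HCOLL hierarchical-collectives component.
export OMPI_MCA_coll_hcoll_enable=0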

Type of change

Please delete options that are not relevant.

  • New feature (non-breaking change which adds functionality)

Scope

  • This PR comprises a set of related changes with a common goal

If you cannot check the above box, please split your PR into multiple PRs that each have a common goal.

How Has This Been Tested?

I was able to successfully run a job on the B200s interactively by getting on a node and first calling source ./mfc.sh load -m g -c h. I was also able to successfully submit a batch job with the Mako template via: ./mfc.sh /path/to/case.py --engine batch --gpu --mpi --computer hipergator
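
Written out, the two commands from above (flags exactly as quoted; the case path is a placeholder):

# Interactive GPU run on a B200 node: load the GPU build environment with the HiPerGator (h) config.
source ./mfc.sh load -m g -c h
# Batch submission through the new hipergator.mako template.
./mfc.sh /path/to/case.py --engine batch --gpu --mpi --computer hipergator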

Test Configuration:

This was run using Nvidia B200s on the University of Florida HiPerGator system.

Checklist

  • I ran ./mfc.sh format before committing my code
  • New and existing tests pass locally with my changes, including with GPU capability enabled (both NVIDIA hardware with NVHPC compilers and AMD hardware with CRAY compilers) and disabled
  • This PR does not introduce any repeated code (it follows the DRY principle)
  • I cannot think of a way to condense this code and reduce any introduced additional line count

PR Type

Enhancement


Description

  • Add HiPerGator system support with NVHPC/25.9 compiler configuration

  • Create batch job template for HiPerGator SLURM scheduler

  • Configure MPI environment variables for OpenMPI compatibility

  • Set CUDA and compiler paths for B200 GPU nodes


Diagram Walkthrough

flowchart LR
  A["HiPerGator System"] --> B["Module Configuration"]
  A --> C["Batch Template"]
  B --> D["NVHPC/25.9 Compiler"]
  B --> E["OpenMPI 5.0.7 Setup"]
  B --> F["CUDA 12.8.1 Paths"]
  C --> G["SLURM Job Script"]
  C --> H["GPU Resource Allocation"]

File Walkthrough

Relevant files
Enhancement
hipergator.mako
HiPerGator SLURM batch job template                                           

toolchain/templates/hipergator.mako

  • New Mako template for HiPerGator batch job submission
  • Configures SLURM directives for multi-node GPU jobs with B200
    partition
  • Supports both GPU and CPU-only execution modes
  • Handles MPI job launching with proper process binding (see the preamble sketch below)
+61/-0   
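
A minimal sketch of the kind of SLURM preamble this template produces, pieced together from the directives quoted in the reviewer comments below. Only the --cpus-per-task, --gpus-per-task, and --gpu-bind lines are confirmed by those quotes; the nodes/ntasks lines are assumptions for illustration.

## Assumed: node and rank counts come from the same template variables
## used in the quoted mpirun -np line further down.
#SBATCH --nodes=${nodes}
#SBATCH --ntasks-per-node=${tasks_per_node}
## Quoted in the review below: currently hard-coded to 7 CPUs per task.
#SBATCH --cpus-per-task=7
% if gpu:
#SBATCH --gpus-per-task=1
#SBATCH --gpu-bind=closest
% endif
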
Configuration changes
modules
HiPerGator module configuration with NVHPC/25.9                   

toolchain/modules

  • Add h system identifier for HiPerGator
  • Configure NVHPC/25.9 compiler and CUDA 12.8.1 paths
  • Set OpenMPI 5.0.7 environment variables and MCA parameters
  • Define GPU compute capability (CC=100) for B200 architecture
  • Add MPI directory paths and binary locations (see the assembled stanza below)
+9/-0     
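
Collected from the snippets quoted in the reviewer comments below, the new stanza reads roughly as follows (an assembled view; it may not include all nine added lines):

h-gpu nvhpc/25.9
h-gpu CUDA_HOME="/apps/compilers/cuda/12.8.1"
h-all HPC_OMPI_DIR="/apps/mpi/cuda/12.8.1/nvhpc/25.3/openmpi/5.0.7"
h-all HPC_OMPI_BIN="/apps/mpi/cuda/12.8.1/nvhpc/25.3/openmpi/5.0.7/bin"
h-all OMPI_MCA_pml=ob1 OMPI_MCA_coll_hcoll_enable=0
h-gpu PATH="/apps/mpi/cuda/12.8.1/nvhpc/25.3/openmpi/5.0.7/bin:${PATH}"
h-gpu MFC_CUDA_CC=100 NVHPC_CUDA_HOME="/apps/compilers/cuda/12.8.1"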

@danieljvickers danieljvickers requested a review from a team as a code owner October 31, 2025 15:16
@qodo-merge-pro
Contributor

PR Reviewer Guide 🔍

Here are some key observations to aid the review process:

⏱️ Estimated effort to review: 2 🔵🔵⚪⚪⚪
🧪 No relevant tests
🔒 No security concerns identified
⚡ Recommended focus areas for review

Possible Issue

The OpenMPI paths under the new 'h' system point to 'nvhpc/25.3' while the description mentions NVHPC/25.9; verify version compatibility and that binaries/libraries match the loaded compiler to avoid ABI/runtime mismatches.

h-all HPC_OMPI_DIR="/apps/mpi/cuda/12.8.1/nvhpc/25.3/openmpi/5.0.7"
h-all HPC_OMPI_BIN="/apps/mpi/cuda/12.8.1/nvhpc/25.3/openmpi/5.0.7/bin"
h-all OMPI_MCA_pml=ob1 OMPI_MCA_coll_hcoll_enable=0
h-gpu PATH="/apps/mpi/cuda/12.8.1/nvhpc/25.3/openmpi/5.0.7/bin:${PATH}"
h-gpu MFC_CUDA_CC=100 NVHPC_CUDA_HOME="/apps/compilers/cuda/12.8.1"
MPI Launch Options

Using '--bind-to none' with mpirun may cause suboptimal performance on multi-core nodes; consider appropriate binding/mapping (e.g., '--bind-to core --map-by ppr:...:node') or making it configurable per job.

mpirun -np ${nodes*tasks_per_node}            \
       --bind-to none                         \
       "${target.get_install_binpath(case)}")
Resource Mismatch

'#SBATCH --cpus-per-task=7' is hard-coded; ensure it aligns with tasks_per_node and GPU binding (one GPU per task). Mismatches can cause oversubscription or idle resources.

#SBATCH --cpus-per-task=7
% if gpu:
#SBATCH --gpus-per-task=1
#SBATCH --gpu-bind=closest
% endif
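
One way to keep this consistent, sketched with a hypothetical cores_per_node template variable that does not exist in the current toolchain and would need to be plumbed through:

## Hypothetical: derive the per-task CPU count instead of hard-coding 7,
## so it always matches tasks_per_node for the chosen partition.
#SBATCH --cpus-per-task=${cores_per_node // tasks_per_node}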

Comment on lines +93 to +98
h-gpu nvhpc/25.9
h-gpu CUDA_HOME="/apps/compilers/cuda/12.8.1"
h-all HPC_OMPI_DIR="/apps/mpi/cuda/12.8.1/nvhpc/25.3/openmpi/5.0.7"
h-all HPC_OMPI_BIN="/apps/mpi/cuda/12.8.1/nvhpc/25.3/openmpi/5.0.7/bin"
h-all OMPI_MCA_pml=ob1 OMPI_MCA_coll_hcoll_enable=0
h-gpu PATH="/apps/mpi/cuda/12.8.1/nvhpc/25.3/openmpi/5.0.7/bin:${PATH}"

Suggestion: Change the NVHPC compiler version from nvhpc/25.9 to nvhpc/25.3 to match the version used for the OpenMPI libraries, preventing potential ABI compatibility issues. [possible issue, importance: 9]

Suggested change
-h-gpu nvhpc/25.9
+h-gpu nvhpc/25.3
 h-gpu CUDA_HOME="/apps/compilers/cuda/12.8.1"
 h-all HPC_OMPI_DIR="/apps/mpi/cuda/12.8.1/nvhpc/25.3/openmpi/5.0.7"
 h-all HPC_OMPI_BIN="/apps/mpi/cuda/12.8.1/nvhpc/25.3/openmpi/5.0.7/bin"
 h-all OMPI_MCA_pml=ob1 OMPI_MCA_coll_hcoll_enable=0
 h-gpu PATH="/apps/mpi/cuda/12.8.1/nvhpc/25.3/openmpi/5.0.7/bin:${PATH}"

Comment on lines +50 to +53
(set -x; ${profiler} \
mpirun -np ${nodes*tasks_per_node} \
--bind-to none \
"${target.get_install_binpath(case)}")

Suggestion: To improve performance on multi-GPU nodes, change the mpirun binding from --bind-to none to --bind-to socket to ensure better process-to-GPU affinity. [general, importance: 7]

Suggested change
 (set -x; ${profiler} \
 mpirun -np ${nodes*tasks_per_node} \
---bind-to none \
+--bind-to socket \
+--report-bindings \
 "${target.get_install_binpath(case)}")

@wilfonba wilfonba requested a review from sbryngelson as a code owner October 31, 2025 15:49
@sbryngelson sbryngelson merged commit 2da0daf into MFlowCode:master Oct 31, 2025
16 of 27 checks passed

codecov bot commented Oct 31, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 41.60%. Comparing base (4f8fb91) to head (3a258de).
⚠️ Report is 2 commits behind head on master.

Additional details and impacted files
@@           Coverage Diff           @@
##           master    #1020   +/-   ##
=======================================
  Coverage   41.60%   41.60%           
=======================================
  Files          70       70           
  Lines       20783    20783           
  Branches     2616     2616           
=======================================
  Hits         8647     8647           
  Misses      10499    10499           
  Partials     1637     1637           

☔ View full report in Codecov by Sentry.

@danieljvickers danieljvickers deleted the resolve-nvhpc-25-3 branch October 31, 2025 19:00
