
Conversation


@mrodrig6 mrodrig6 commented Dec 9, 2025

User description

Description

Brown's Oscar supercomputer recently updated its OS and modules.
MFC also now runs on Python 3.13.
MFC's modules file and Oscar mako template need updating to run interactively and in batch mode.

Fixes #(issue) [optional]

Type of change

  • [x] Something else

Scope

  • [x] This PR comprises a set of related changes with a common goal

How Has This Been Tested?

Please describe the tests that you ran to verify your changes.
Provide instructions so we can reproduce.
Please also list any relevant details for your test configuration.

  • [x] Ran simulations in interactive and batch mode on Oscar

Test Configuration:

  • What computers and compilers did you use to test this:

Checklist

  • [x] I have added comments for the new code
    They run to completion and demonstrate "interesting physics"
  • [x] I ran ./mfc.sh format before committing my code
  • [x] New and existing tests pass locally with my changes, including with GPU capability enabled (both NVIDIA hardware with NVHPC compilers and AMD hardware with CRAY compilers) and disabled
  • [x] This PR does not introduce any repeated code (it follows the DRY principle)
  • [x] I cannot think of a way to condense this code and reduce any introduced additional line count

PR Type

Enhancement


Description

  • Updated Oscar supercomputer batch job configuration for GPU support

  • Fixed environment variable naming and module loading syntax

  • Added Python 3.13.10s support to Oscar CPU modules

  • Corrected MPI command formatting and GPU resource allocation


Diagram Walkthrough

flowchart LR
  A["Oscar Mako Template"] -->|GPU binding updates| B["Batch Job Config"]
  A -->|Variable naming fix| C["Module Loading"]
  D["Module Configuration"] -->|Python 3.13.10s| E["Oscar CPU Modules"]
  D -->|GPU resource spec| F["Oscar GPU Modules"]

File Walkthrough

Relevant files
Configuration changes
oscar.mako
Oscar batch job GPU and module loading fixes                         

toolchain/templates/oscar.mako

  • Replaced --gpus-per-node and --mem with --gres=gpu:v100-16 for proper
    GPU resource allocation
  • Updated GPU binding from --gpu-bind=closest to
    --gpu-bind=verbose,closest for better diagnostics
  • Fixed environment variable from MFC_ROOTDIR to MFC_ROOT_DIR for
    consistency
  • Reformatted MPI command arguments for improved readability and removed
    unnecessary argument passing (see the sketch below)
+5/-7     
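Taken together, these entries amount to a handful of batch-header directives. Below is a minimal sketch of the GPU-related header they describe, assembled from the diff hunks quoted later in this thread; the exact ordering, the surrounding lines of oscar.mako, and the spelling of the gpu_enabled guard are assumptions, and note that --tasks-per-node is flagged further down as likely needing to be --ntasks-per-node.

#SBATCH --nodes=${nodes}
#SBATCH --tasks-per-node=${tasks_per_node}
% if gpu_enabled:
#SBATCH --gpu-bind=verbose,closest
#SBATCH --gres=gpu:v100-16:${tasks_per_node}
% endif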
Dependencies
modules
Add Python 3.13.10s to Oscar modules                                         

toolchain/modules

  • Added Python 3.13.10s to Oscar CPU module configuration
  • Updated module dependencies for Oscar supercomputer compatibility
+1/-1     


CodeAnt-AI Description

Update Oscar job template and module list for correct GPU allocation, MPI execution, and Python 3.13

What Changed

  • Batch job script now requests GPUs via SLURM gres for V100 (gpu:v100-16) and enables verbose GPU binding so GPU resources are allocated as expected
  • Fixed environment variable name so jobs run from the correct MFC root directory, avoiding wrong working-directory failures
  • MPI batch runs now invoke the built program under mpirun correctly (previously arguments were mis-assembled), preventing silent no-op MPI launches
  • Oscar module set adds python/3.13.10s to the CPU profile so interactive and batch jobs use Python 3.13 (a verification sketch follows this list)
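A minimal verification sketch for these changes, assuming shell access on Oscar with its Lmod and Slurm tooling; the module and gres names are taken from this PR and may need adjusting to the cluster's actual inventory:

module avail python/3.13                                   # is a python/3.13.10s modulefile present?
scontrol show job "$SLURM_JOB_ID" | grep -iE 'gres|tres'   # inside a batch job: were the requested GPUs granted?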

Impact

✅ Correct GPU allocation on Oscar (V100 requests honored)
✅ Fewer batch job failures due to wrong working directory
✅ Jobs using MPI actually run the intended executable

💡 Usage Guide

Checking Your Pull Request

Every time you make a pull request, our system automatically looks through it. We check for security issues, mistakes in how you're setting up your infrastructure, and common code problems. We do this to make sure your changes are solid and won't cause any trouble later.

Talking to CodeAnt AI

Got a question or need a hand with something in your pull request? You can easily get in touch with CodeAnt AI right here. Just type the following in a comment on your pull request, and replace "Your question here" with whatever you want to ask:

@codeant-ai ask: Your question here

This lets you have a chat with CodeAnt AI about your pull request, making it easier to understand and improve your code.

Example

@codeant-ai ask: Can you suggest a safer alternative to storing this secret?

Preserve Org Learnings with CodeAnt

You can record team preferences so CodeAnt AI applies them in future reviews. Reply directly to the specific CodeAnt AI suggestion (in the same thread) and replace "Your feedback here" with your input:

@codeant-ai: Your feedback here

This helps CodeAnt AI learn and adapt to your team's coding style and standards.

Example

@codeant-ai: Do not flag unused imports.

Retrigger review

Ask CodeAnt AI to review the PR again, by typing:

@codeant-ai: review

Check Your Repository Health

To analyze the health of your code repository, visit our dashboard at https://app.codeant.ai. This tool helps you identify potential issues and areas for improvement in your codebase, ensuring your repository maintains high standards of code health.

Summary by CodeRabbit

  • New Features

    • Python 3.13.10s module is now available in the Brown Oscar module group.
  • Improvements

    • Batch job configuration updated: tasks-per-node option standardized and GPU resource specification refined (explicit GPU type and improved GPU binding).
  • Refactor

    • Job template formatting and configuration references cleaned up for clearer, more maintainable submission scripts.


@codeant-ai

codeant-ai bot commented Dec 9, 2025

CodeAnt AI is reviewing your PR.



@coderabbitai
Contributor

coderabbitai bot commented Dec 9, 2025

Note

Other AI code review bot(s) detected

CodeRabbit has detected other AI code review bot(s) in this pull request and will avoid duplicating their findings in the review comments. This may lead to a less comprehensive review.

Walkthrough

Adds python/3.13.10s to the Brown Oscar o-cpu module set, adjusts SBATCH GPU directives and task options in the Oscar template, renames MFC_ROOTDIR to MFC_ROOT_DIR, and reformats the MPI profiling invocation.

Changes

Module configuration: toolchain/modules
Added python/3.13.10s to the Brown Oscar group o-cpu module list.

Oscar template: toolchain/templates/oscar.mako
Replaced --ntasks-per-node with --tasks-per-node; removed the static --mem=64G and --gpus-per-node directives; added --gpu-bind=verbose,closest and --gres=gpu:v100-16:${tasks_per_node}; renamed ${MFC_ROOTDIR} to ${MFC_ROOT_DIR}; collapsed the multi-line MPI profiling invocation into a single quoted, aligned command line and adjusted the target binary path expression.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

  • Attention areas:
    • Validate --gres=gpu:v100-16:${tasks_per_node} matches cluster GPU naming and capacity.
    • Verify no remaining references to MFC_ROOTDIR elsewhere (a check sketch follows this list).
    • Confirm MPI single-line reformat preserves argument quoting and environment interpolation.
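A quick-check sketch for the first two attention areas, assuming a checkout of the MFC repository and shell access on Oscar:

grep -rn "MFC_ROOTDIR" toolchain/ || echo "no stale MFC_ROOTDIR references"
sinfo -o "%N %G" | grep -i "v100-16"                       # does any node advertise a gpu:v100-16 gres?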

Poem

🐰 I hopped through templates, tidy and spry,
Added Python, taught GPUs to reply,
A var renamed, a command aligned,
Oscar's scripts now tiptoe, neat and kind. 🥕

Pre-merge checks and finishing touches

✅ Passed checks (3 passed)
Title check: ✅ Passed. The title accurately summarizes the main changes: updates to Oscar mako template and module configurations.
Description check: ✅ Passed. The description covers key changes, testing performed, and relevant checklist items, though test configuration details are minimal.
Docstring Coverage: ✅ Passed. No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.


Comment @coderabbitai help to get the list of available commands and usage tips.

@qodo-code-review
Contributor

PR Reviewer Guide 🔍

Here are some key observations to aid the review process:

⏱️ Estimated effort to review: 2 🔵🔵⚪⚪⚪
🧪 No relevant tests
🔒 No security concerns identified
⚡ Recommended focus areas for review

Possible Issue

Using --gres=gpu:v100-16 with ${tasks_per_node} assumes one GPU per task; if tasks_per_node differs from GPUs per node or if the node type/GPU model varies, this may misallocate GPUs. Validate against Oscar’s Slurm gres names and desired GPU count per node.

#SBATCH --gpu-bind=verbose,closest
#SBATCH --gres=gpu:v100-16:${tasks_per_node}
% endif
MPI Args Removed

The mpirun invocation no longer forwards extra arguments (previously from ARG('--')); this may break workloads relying on custom MPI/runtime flags or application args. Confirm that no consumers depend on these passthrough args.

(set -x; ${profiler}                              \
    mpirun -np ${nodes*tasks_per_node}            \
           "${target.get_install_binpath(case)}")
Module Compatibility

Adding python/3.13.10s under o-cpu may conflict with toolchain expectations or site module names; ensure this exact module exists on Oscar and is compatible with hpcx-mpi and other dependencies.

o-cpu hpcx-mpi python/3.13.10s
o-gpu nvhpc cuda/12.3.0 cmake/3.26.3

@codeant-ai codeant-ai bot added the size:XS This PR changes 0-9 lines, ignoring generated files label Dec 9, 2025
Comment on lines +44 to 46
(set -x; ${profiler} \
mpirun -np ${nodes*tasks_per_node} \
"${target.get_install_binpath(case)}")

Suggestion: To correctly profile the parallel application instead of the MPI launcher, move the ${profiler} command to be executed by mpirun on each rank. [possible issue, importance: 9]

Suggested change
(set -x; ${profiler} \
mpirun -np ${nodes*tasks_per_node} \
"${target.get_install_binpath(case)}")
(set -x; mpirun -np ${nodes*tasks_per_node} \
${profiler} "${target.get_install_binpath(case)}")

Contributor

@cubic-dev-ai cubic-dev-ai bot left a comment


No issues found across 2 files

@codeant-ai

codeant-ai bot commented Dec 9, 2025

Nitpicks 🔍

🔒 No security issues identified
⚡ Recommended areas for review

  • Possible Typo / Invalid Module Name
    The added module string python/3.13.10s looks unusual compared to other entries (which use python/3.x.y). This may be a typo or a non-existent module name on Oscar; if the module does not exist the loader will fail. Verify the exact module name available on the cluster and update accordingly.

  • Compatibility risk
    Moving to Python 3.13 may introduce incompatibilities with existing MFC dependencies or other modules on Oscar. Confirm that required libraries (hdf5, anaconda, MPI wrappers) are available/compatible for Python 3.13 and that automated tests cover both interactive and batch launches with this version.

  • GPU resource specification
    The SBATCH GPU lines are hard-coded and use a nonstandard-looking gres token ("gpu:v100-16:${tasks_per_node}") and a combined gpu-bind option. These may be unsupported or incompatible with different Oscar/Slurm versions; they should be parameterized or use the standard --gpus-per-node / --gres=gpu:<type>:<count> syntax and verified against the cluster configuration. A comparison sketch follows this list.

  • Environment variable name change
    The script now cds to "${MFC_ROOT_DIR}" (underscore) instead of the previous "${MFC_ROOTDIR}". If the environment or other scripts still set the old variable name, this will fail. Confirm the runtime environment defines MFC_ROOT_DIR.
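For comparison, the two standard Slurm forms mentioned above, shown as a sketch only since valid gres type names are site-specific:

# Count only; the scheduler picks the GPU type:
#SBATCH --gpus-per-node=${tasks_per_node}
# Explicit type and count (the form this PR uses):
#SBATCH --gres=gpu:v100-16:${tasks_per_node}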

o Brown Oscar
o-all python
o-cpu hpcx-mpi
o-cpu hpcx-mpi python/3.13.10s

Suggestion: Typo in the Python module name: the added token python/3.13.10s has an extra trailing "s" which will make the module name invalid/unavailable to the module loader and cause this entry to fail at load time; remove the stray "s" so the version matches the rest of the file. [possible bug]

Severity Level: Critical 🚨

Suggested change
o-cpu hpcx-mpi python/3.13.10s
o-cpu hpcx-mpi python/3.13.10
Why it matters? ⭐

The added token "python/3.13.10s" very likely contains a typo — all other entries use numeric version tokens (e.g., python/3.12.5). The stray "s" will most likely make the module name not match the expected modulefile and could break module loading. Changing it to python/3.13.10 is a straightforward, low-risk fix that improves correctness.

Prompt for AI Agent 🤖
This is a comment left during a code review.

**Path:** toolchain/modules
**Line:** 57:57
**Comment:**
	*Possible Bug: Typo in the Python module name: the added token `python/3.13.10s` has an extra trailing "s" which will make the module name invalid/unavailable to the module loader and cause this entry to fail at load time; remove the stray "s" so the version matches the rest of the file.

Validate the correctness of the flagged issue. If correct, How can I resolve this? If you propose a fix, implement it and please make it concise.

#SBATCH --mem=64G
#SBATCH --gpu-bind=closest
#SBATCH --gpu-bind=verbose,closest
#SBATCH --gres=gpu:v100-16:${tasks_per_node}

Suggestion: Hardcoded GPU resource/type may not exist on all clusters and can cause sbatch to reject the job; request the GPU count generically (or parameterize the GPU type) instead of forcing "v100-16". [possible bug]

Severity Level: Critical 🚨

Suggested change
#SBATCH --gres=gpu:v100-16:${tasks_per_node}
#SBATCH --gres=gpu:${tasks_per_node}
Why it matters? ⭐

This is a valid and practical suggestion: hardcoding a specific GPU model (v100-16) can indeed cause sbatch to reject jobs on clusters that don't expose that exact resource name. Replacing the line with a generic gres request (or better, parameterizing the GPU model with a template variable) is safer and fixes a real portability issue visible in the diff.
The PR already gates these lines with % if gpu_enabled so removing the explicit model is a low-risk change.

Prompt for AI Agent 🤖
This is a comment left during a code review.

**Path:** toolchain/templates/oscar.mako
**Line:** 19:19
**Comment:**
	*Possible Bug: Hardcoded GPU resource/type may not exist on all clusters and can cause sbatch to reject the job; request the GPU count generically (or parameterize the GPU type) instead of forcing "v100-16".

Validate the correctness of the flagged issue. If correct, How can I resolve this? If you propose a fix, implement it and please make it concise.


ok ":) Loading modules:\n"
cd "${MFC_ROOTDIR}"
cd "${MFC_ROOT_DIR}"

Suggestion: The added unconditional "cd ${MFC_ROOT_DIR}" will fail silently if MFC_ROOT_DIR is unset or not a directory, causing the subsequent source of mfc.sh to run in the wrong directory and break the script; check existence and directory-ness before changing directory and exit with a clear error if invalid. [resource leak]

Severity Level: Minor ⚠️

Suggested change
cd "${MFC_ROOT_DIR}"
if [ -z "${MFC_ROOT_DIR}" ] || [ ! -d "${MFC_ROOT_DIR}" ]; then
echo "MFC_ROOT_DIR is not set or is not a directory" >&2
exit 1
fi
Why it matters? ⭐

Adding a guard to ensure MFC_ROOT_DIR is set and is a directory is sensible: as written the script will attempt to cd and then source ./mfc.sh relative to the current dir, which will silently (or noisily) fail if the variable is unset or points nowhere. The proposed check makes the failure explicit and fails fast, which is desirable for a job script that must set up environment.

Prompt for AI Agent 🤖
This is a comment left during a code review.

**Path:** toolchain/templates/oscar.mako
**Line:** 33:33
**Comment:**
	*Resource Leak: The added unconditional "cd ${MFC_ROOT_DIR}" will fail silently if `MFC_ROOT_DIR` is unset or not a directory, causing the subsequent source of `mfc.sh` to run in the wrong directory and break the script; check existence and directory-ness before changing directory and exit with a clear error if invalid.

Validate the correctness of the flagged issue. If correct, How can I resolve this? If you propose a fix, implement it and please make it concise.

@codeant-ai

codeant-ai bot commented Dec 9, 2025

CodeAnt AI finished reviewing your PR.

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

♻️ Duplicate comments (1)
toolchain/templates/oscar.mako (1)

33-33: Add validation for MFC_ROOT_DIR before using it.

The script unconditionally uses ${MFC_ROOT_DIR} without checking if it's set or valid. If the variable is unset or points to a non-existent directory, the subsequent source ./mfc.sh will fail in an unexpected directory, causing cryptic errors.

As flagged in past review comments, add a guard to fail fast with a clear error:

+if [ -z "${MFC_ROOT_DIR}" ] || [ ! -d "${MFC_ROOT_DIR}" ]; then
+    echo "ERROR: MFC_ROOT_DIR is not set or is not a directory" >&2
+    exit 1
+fi
 cd "${MFC_ROOT_DIR}"
🧹 Nitpick comments (1)
toolchain/templates/oscar.mako (1)

18-19: Consider parameterizing the GPU type for flexibility.

The hardcoded v100-16 GPU type is specific to Brown's Oscar cluster. While acceptable for an Oscar-specific template, parameterizing the GPU model (e.g., via a template variable) would improve maintainability if Oscar's GPU inventory changes or if this template is adapted for other clusters.

Past review comments flagged this concern. If the v100-16 designation is subject to change on Oscar, consider:

-#SBATCH --gres=gpu:v100-16:${tasks_per_node}
+#SBATCH --gres=gpu:${gpu_type if gpu_type else 'v100-16'}:${tasks_per_node}
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 8df72f8 and 0056072.

📒 Files selected for processing (1)
  • toolchain/templates/oscar.mako (4 hunks)
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-11-24T21:50:46.909Z
Learnt from: CR
Repo: MFlowCode/MFC PR: 0
File: .cursor/rules/mfc-agent-rules.mdc:0-0
Timestamp: 2025-11-24T21:50:46.909Z
Learning: Draft a step-by-step plan before making changes; build after each step using `./mfc.sh build -t pre_process simulation -j $(nproc)`

Applied to files:

  • toolchain/templates/oscar.mako
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Build & Publish
🔇 Additional comments (1)
toolchain/templates/oscar.mako (1)

44-46: Verify profiler placement—launcher-level vs. rank-level profiling.

The current code applies ${profiler} to the mpirun launcher process. Whether this is correct depends on the profiler type:

  • Launcher-level profilers (system-wide or MPI-aware profilers): should wrap mpirun (current placement)
  • Rank-level profilers (per-process profilers): should be applied to each rank inside mpirun

Verify the intended profiler behavior in MFC documentation or by checking how the profiler variable is defined and used elsewhere in the codebase. If rank-level profiling is intended, the profiler should be moved inside the mpirun command.
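A sketch of the two placements, reusing the ${profiler} and target-path expressions from the template for illustration only:

# Launcher-level: the profiler wraps the mpirun process itself (current placement)
${profiler} mpirun -np ${nodes*tasks_per_node} "${target.get_install_binpath(case)}"
# Rank-level: every MPI rank runs under the profiler
mpirun -np ${nodes*tasks_per_node} ${profiler} "${target.get_install_binpath(case)}"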

% if engine == 'batch':
#SBATCH --nodes=${nodes}
#SBATCH --ntasks-per-node=${tasks_per_node}
#SBATCH --tasks-per-node=${tasks_per_node}

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Check all SBATCH-related options in oscar.mako and other templates

echo "=== oscar.mako SBATCH directives ==="
rg -n 'SBATCH' toolchain/templates/oscar.mako | head -20

echo ""
echo "=== All templates SBATCH directives with tasks-per-node or ntasks-per-node ==="
rg -n --type=mako '(tasks-per-node|ntasks-per-node)' toolchain/templates/

echo ""
echo "=== Full oscar.mako file (if small enough) ==="
wc -l toolchain/templates/oscar.mako

Repository: MFlowCode/MFC

Length of output: 170


🌐 Web query:

SLURM --tasks-per-node vs --ntasks-per-node option documentation

💡 Result:

Short answer: The canonical SLURM option is --ntasks-per-node (request N tasks on each node). --ntasks (or -n) sets the total number of tasks and takes precedence if both are given; --ntasks-per-node is treated as a maximum per-node count when used with --ntasks. Some local docs/sites may show --tasks-per-node as an informal/alternate wording, but official SLURM documentation and manpages use --ntasks-per-node. [1][2][3]

References

  • SLURM srun/sbatch docs (describes --ntasks, --ntasks-per-node and precedence with --ntasks). [1]
  • SLURM sbatch manpage (same text for --ntasks-per-node). [2]
  • Example/local doc showing --tasks-per-node wording (not the official option name). [3]

[1] https://slurm.schedmd.com/srun.html
[2] https://slurm.schedmd.com/sbatch.html
[3] https://www.nrel.gov/hpc/announcements/posts/node-use-efficiency


Revert the SBATCH option to --ntasks-per-node; --tasks-per-node is invalid.

According to official SLURM documentation, the canonical option is --ntasks-per-node, not --tasks-per-node. The latter is not a valid SLURM option and will cause batch script failures. Restore line 7 to use --ntasks-per-node=${tasks_per_node}.
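A minimal header using the canonical option, per the sbatch documentation cited above:

#SBATCH --nodes=${nodes}
# With --ntasks also given it takes precedence, and --ntasks-per-node acts as a per-node maximum:
#SBATCH --ntasks-per-node=${tasks_per_node}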

🤖 Prompt for AI Agents
In toolchain/templates/oscar.mako around line 7, the SBATCH option used is
--tasks-per-node which is invalid for SLURM; change it back to the canonical
--ntasks-per-node=${tasks_per_node} so batch scripts use the correct option and
won’t fail.

@sbryngelson sbryngelson merged commit c91edb9 into MFlowCode:master Dec 10, 2025
21 checks passed

Labels

Review effort 2/5 · size:XS (This PR changes 0-9 lines, ignoring generated files)
