
Conversation


@mrodrig6 mrodrig6 commented Dec 9, 2025

User description

Description

Brown's Oscar supercomputer recently updated its OS and modules.
MFC also now runs on Python 3.13.
MFC's modules file and Oscar mako template need updating to run interactively and in batch mode.

Fixes #(issue) [optional]

Type of change

  • [x] Something else

Scope

  • [x] This PR comprises a set of related changes with a common goal

How Has This Been Tested?

Please describe the tests that you ran to verify your changes.
Provide instructions so we can reproduce.
Please also list any relevant details for your test configuration.

  • [x] Ran simulations in interactive and batch mode on Oscar

Test Configuration:

  • What computers and compilers did you use to test this:

Checklist

  • [x] I have added comments for the new code
    They run to completion and demonstrate "interesting physics"
  • [x] I ran ./mfc.sh format before committing my code
  • [x] New and existing tests pass locally with my changes, including with GPU capability enabled (both NVIDIA hardware with NVHPC compilers and AMD hardware with CRAY compilers) and disabled
  • [x] This PR does not introduce any repeated code (it follows the DRY principle)
  • [x] I cannot think of a way to condense this code and reduce any introduced additional line count

PR Type

Enhancement


Description

  • Updated Oscar supercomputer batch job configuration for GPU support

  • Fixed environment variable naming and module loading syntax

  • Added Python 3.13.10s support to Oscar CPU modules

  • Corrected MPI command formatting and GPU resource allocation


Diagram Walkthrough

flowchart LR
  A["Oscar Mako Template"] -->|GPU binding updates| B["Batch Job Config"]
  A -->|Variable naming fix| C["Module Loading"]
  D["Module Configuration"] -->|Python 3.13.10s| E["Oscar CPU Modules"]
  D -->|GPU resource spec| F["Oscar GPU Modules"]

File Walkthrough

Relevant files
Configuration changes
oscar.mako
Oscar batch job GPU and module loading fixes                         

toolchain/templates/oscar.mako

  • Replaced --gpus-per-node and --mem with --gres=gpu:v100-16 for proper
    GPU resource allocation
  • Updated GPU binding from --gpu-bind=closest to
    --gpu-bind=verbose,closest for better diagnostics
  • Fixed environment variable from MFC_ROOTDIR to MFC_ROOT_DIR for
    consistency
  • Reformatted MPI command arguments for improved readability and removed
    unnecessary argument passing (see the sketch below)
+5/-7     
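Taken together, these entries amount to a handful of batch-header directives. Below is a minimal sketch of the GPU-related header they describe, assembled from the diff hunks quoted later in this thread; the exact ordering, the surrounding lines of oscar.mako, and the spelling of the gpu_enabled guard are assumptions, and note that --tasks-per-node is flagged further down as likely needing to be --ntasks-per-node.

#SBATCH --nodes=${nodes}
#SBATCH --tasks-per-node=${tasks_per_node}
% if gpu_enabled:
#SBATCH --gpu-bind=verbose,closest
#SBATCH --gres=gpu:v100-16:${tasks_per_node}
% endif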
Dependencies
modules
Add Python 3.13.10s to Oscar modules                                         

toolchain/modules

  • Added Python 3.13.10s to Oscar CPU module configuration
  • Updated module dependencies for Oscar supercomputer compatibility
+1/-1     


CodeAnt-AI Description

Update Oscar job template and module list for correct GPU allocation, MPI execution, and Python 3.13

What Changed

  • Batch job script now requests GPUs via SLURM gres for V100 (gpu:v100-16) and enables verbose GPU binding so GPU resources are allocated as expected
  • Fixed environment variable name so jobs run from the correct MFC root directory, avoiding wrong working-directory failures
  • MPI batch runs now invoke the built program under mpirun correctly (previously arguments were mis-assembled), preventing silent no-op MPI launches
  • Oscar module set adds python/3.13.10s to the CPU profile so interactive and batch jobs use Python 3.13 (a verification sketch follows this list)
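A minimal verification sketch for these changes, assuming shell access on Oscar with its Lmod and Slurm tooling; the module and gres names are taken from this PR and may need adjusting to the cluster's actual inventory:

module avail python/3.13                                   # is a python/3.13.10s modulefile present?
scontrol show job "$SLURM_JOB_ID" | grep -iE 'gres|tres'   # inside a batch job: were the requested GPUs granted?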

Impact

✅ Correct GPU allocation on Oscar (V100 requests honored)
✅ Fewer batch job failures due to wrong working directory
✅ Jobs using MPI actually run the intended executable

💡 Usage Guide

Checking Your Pull Request

Every time you make a pull request, our system automatically looks through it. We check for security issues, mistakes in how you're setting up your infrastructure, and common code problems. We do this to make sure your changes are solid and won't cause any trouble later.

Talking to CodeAnt AI

Got a question or need a hand with something in your pull request? You can easily get in touch with CodeAnt AI right here. Just type the following in a comment on your pull request, and replace "Your question here" with whatever you want to ask:

@codeant-ai ask: Your question here

This lets you have a chat with CodeAnt AI about your pull request, making it easier to understand and improve your code.

Example

@codeant-ai ask: Can you suggest a safer alternative to storing this secret?

Preserve Org Learnings with CodeAnt

You can record team preferences so CodeAnt AI applies them in future reviews. Reply directly to the specific CodeAnt AI suggestion (in the same thread) and replace "Your feedback here" with your input:

@codeant-ai: Your feedback here

This helps CodeAnt AI learn and adapt to your team's coding style and standards.

Example

@codeant-ai: Do not flag unused imports.

Retrigger review

Ask CodeAnt AI to review the PR again, by typing:

@codeant-ai: review

Check Your Repository Health

To analyze the health of your code repository, visit our dashboard at https://app.codeant.ai. This tool helps you identify potential issues and areas for improvement in your codebase, ensuring your repository maintains high standards of code health.

Summary by CodeRabbit

  • New Features

    • Python 3.13.10s module is now available in the Brown Oscar module group.
  • Improvements

    • Batch job configuration updated: tasks-per-node option standardized and GPU resource specification refined (explicit GPU type and improved GPU binding).
  • Refactor

    • Job template formatting and configuration references cleaned up for clearer, more maintainable submission scripts.


@codeant-ai

codeant-ai bot commented Dec 9, 2025

CodeAnt AI is reviewing your PR.



@coderabbitai
Contributor

coderabbitai bot commented Dec 9, 2025

Note

Other AI code review bot(s) detected

CodeRabbit has detected other AI code review bot(s) in this pull request and will avoid duplicating their findings in the review comments. This may lead to a less comprehensive review.

Walkthrough

Adds python/3.13.10s to the Brown Oscar o-cpu module set, adjusts SBATCH GPU directives and task options in the Oscar template, renames MFC_ROOTDIR to MFC_ROOT_DIR, and reformats the MPI profiling invocation.

Changes

Module configuration: toolchain/modules
Added python/3.13.10s to the Brown Oscar group o-cpu module list.

Oscar template: toolchain/templates/oscar.mako
Replaced --ntasks-per-node with --tasks-per-node; removed the static --mem=64G and --gpus-per-node directives; added --gpu-bind=verbose,closest and --gres=gpu:v100-16:${tasks_per_node}; renamed ${MFC_ROOTDIR} to ${MFC_ROOT_DIR}; collapsed the multi-line MPI profiling invocation into a single quoted, aligned command line and adjusted the target binary path expression.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

  • Attention areas:
    • Validate --gres=gpu:v100-16:${tasks_per_node} matches cluster GPU naming and capacity.
    • Verify no remaining references to MFC_ROOTDIR elsewhere (a check sketch follows this list).
    • Confirm MPI single-line reformat preserves argument quoting and environment interpolation.
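A quick-check sketch for the first two attention areas, assuming a checkout of the MFC repository and shell access on Oscar:

grep -rn "MFC_ROOTDIR" toolchain/ || echo "no stale MFC_ROOTDIR references"
sinfo -o "%N %G" | grep -i "v100-16"                       # does any node advertise a gpu:v100-16 gres?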

Poem

🐰 I hopped through templates, tidy and spry,
Added Python, taught GPUs to reply,
A var renamed, a command aligned,
Oscar's scripts now tiptoe, neat and kind. 🥕

Pre-merge checks and finishing touches

✅ Passed checks (3 passed)
Title check: ✅ Passed. The title accurately summarizes the main changes: updates to Oscar mako template and module configurations.
Description check: ✅ Passed. The description covers key changes, testing performed, and relevant checklist items, though test configuration details are minimal.
Docstring Coverage: ✅ Passed. No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.


Comment @coderabbitai help to get the list of available commands and usage tips.

@qodo-code-review
Contributor

PR Reviewer Guide 🔍

Here are some key observations to aid the review process:

⏱️ Estimated effort to review: 2 🔵🔵⚪⚪⚪
🧪 No relevant tests
🔒 No security concerns identified
⚡ Recommended focus areas for review

Possible Issue

Using --gres=gpu:v100-16 with ${tasks_per_node} assumes one GPU per task; if tasks_per_node differs from GPUs per node or if the node type/GPU model varies, this may misallocate GPUs. Validate against Oscar’s Slurm gres names and desired GPU count per node.

#SBATCH --gpu-bind=verbose,closest
#SBATCH --gres=gpu:v100-16:${tasks_per_node}
% endif
MPI Args Removed

The mpirun invocation no longer forwards extra arguments (previously from ARG('--')); this may break workloads relying on custom MPI/runtime flags or application args. Confirm that no consumers depend on these passthrough args.

(set -x; ${profiler}                              \
    mpirun -np ${nodes*tasks_per_node}            \
           "${target.get_install_binpath(case)}")
Module Compatibility

Adding python/3.13.10s under o-cpu may conflict with toolchain expectations or site module names; ensure this exact module exists on Oscar and is compatible with hpcx-mpi and other dependencies.

o-cpu hpcx-mpi python/3.13.10s
o-gpu nvhpc cuda/12.3.0 cmake/3.26.3

@codeant-ai codeant-ai bot added the size:XS This PR changes 0-9 lines, ignoring generated files label Dec 9, 2025
Comment on lines +44 to 46
(set -x; ${profiler} \
mpirun -np ${nodes*tasks_per_node} \
"${target.get_install_binpath(case)}")

Suggestion: To correctly profile the parallel application instead of the MPI launcher, move the ${profiler} command to be executed by mpirun on each rank. [possible issue, importance: 9]

Suggested change
(set -x; ${profiler} \
mpirun -np ${nodes*tasks_per_node} \
"${target.get_install_binpath(case)}")
(set -x; mpirun -np ${nodes*tasks_per_node} \
${profiler} "${target.get_install_binpath(case)}")

Contributor

@cubic-dev-ai cubic-dev-ai bot left a comment


No issues found across 2 files

@codeant-ai

codeant-ai bot commented Dec 9, 2025

Nitpicks 🔍

🔒 No security issues identified
⚡ Recommended areas for review

  • Possible Typo / Invalid Module Name
    The added module string python/3.13.10s looks unusual compared to other entries (which use python/3.x.y). This may be a typo or a non-existent module name on Oscar; if the module does not exist the loader will fail. Verify the exact module name available on the cluster and update accordingly.

  • Compatibility risk
    Moving to Python 3.13 may introduce incompatibilities with existing MFC dependencies or other modules on Oscar. Confirm that required libraries (hdf5, anaconda, MPI wrappers) are available/compatible for Python 3.13 and that automated tests cover both interactive and batch launches with this version.

  • GPU resource specification
    The SBATCH GPU lines are hard-coded and use a nonstandard-looking gres token ("gpu:v100-16:${tasks_per_node}") and a combined gpu-bind option. These may be unsupported or incompatible with different Oscar/Slurm versions; they should be parameterized or use the standard --gpus-per-node / --gres=gpu:<type>:<count> syntax and verified against the cluster configuration. A comparison sketch follows this list.

  • Environment variable name change
    The script now cds to "${MFC_ROOT_DIR}" (underscore) instead of the previous "${MFC_ROOTDIR}". If the environment or other scripts still set the old variable name, this will fail. Confirm the runtime environment defines MFC_ROOT_DIR.
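For comparison, the two standard Slurm forms mentioned above, shown as a sketch only since valid gres type names are site-specific:

# Count only; the scheduler picks the GPU type:
#SBATCH --gpus-per-node=${tasks_per_node}
# Explicit type and count (the form this PR uses):
#SBATCH --gres=gpu:v100-16:${tasks_per_node}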

o Brown Oscar
o-all python
o-cpu hpcx-mpi
o-cpu hpcx-mpi python/3.13.10s

Suggestion: Typo in the Python module name: the added token python/3.13.10s has an extra trailing "s" which will make the module name invalid/unavailable to the module loader and cause this entry to fail at load time; remove the stray "s" so the version matches the rest of the file. [possible bug]

Severity Level: Critical 🚨

Suggested change
o-cpu hpcx-mpi python/3.13.10s
o-cpu hpcx-mpi python/3.13.10
Why it matters? ⭐

The added token "python/3.13.10s" very likely contains a typo — all other entries use numeric version tokens (e.g., python/3.12.5). The stray "s" will most likely make the module name not match the expected modulefile and could break module loading. Changing it to python/3.13.10 is a straightforward, low-risk fix that improves correctness.

Prompt for AI Agent 🤖
This is a comment left during a code review.

**Path:** toolchain/modules
**Line:** 57:57
**Comment:**
	*Possible Bug: Typo in the Python module name: the added token `python/3.13.10s` has an extra trailing "s" which will make the module name invalid/unavailable to the module loader and cause this entry to fail at load time; remove the stray "s" so the version matches the rest of the file.

Validate the correctness of the flagged issue. If correct, How can I resolve this? If you propose a fix, implement it and please make it concise.

#SBATCH --mem=64G
#SBATCH --gpu-bind=closest
#SBATCH --gpu-bind=verbose,closest
#SBATCH --gres=gpu:v100-16:${tasks_per_node}

Suggestion: Hardcoded GPU resource/type may not exist on all clusters and can cause sbatch to reject the job; request the GPU count generically (or parameterize the GPU type) instead of forcing "v100-16". [possible bug]

Severity Level: Critical 🚨

Suggested change
#SBATCH --gres=gpu:v100-16:${tasks_per_node}
#SBATCH --gres=gpu:${tasks_per_node}
Why it matters? ⭐

This is a valid and practical suggestion: hardcoding a specific GPU model (v100-16) can indeed cause sbatch to reject jobs on clusters that don't expose that exact resource name. Replacing the line with a generic gres request (or better, parameterizing the GPU model with a template variable) is safer and fixes a real portability issue visible in the diff.
The PR already gates these lines with % if gpu_enabled so removing the explicit model is a low-risk change.

Prompt for AI Agent 🤖
This is a comment left during a code review.

**Path:** toolchain/templates/oscar.mako
**Line:** 19:19
**Comment:**
	*Possible Bug: Hardcoded GPU resource/type may not exist on all clusters and can cause sbatch to reject the job; request the GPU count generically (or parameterize the GPU type) instead of forcing "v100-16".

Validate the correctness of the flagged issue. If correct, How can I resolve this? If you propose a fix, implement it and please make it concise.


ok ":) Loading modules:\n"
cd "${MFC_ROOTDIR}"
cd "${MFC_ROOT_DIR}"

Suggestion: The added unconditional "cd ${MFC_ROOT_DIR}" will fail silently if MFC_ROOT_DIR is unset or not a directory, causing the subsequent source of mfc.sh to run in the wrong directory and break the script; check existence and directory-ness before changing directory and exit with a clear error if invalid. [resource leak]

Severity Level: Minor ⚠️

Suggested change
cd "${MFC_ROOT_DIR}"
if [ -z "${MFC_ROOT_DIR}" ] || [ ! -d "${MFC_ROOT_DIR}" ]; then
echo "MFC_ROOT_DIR is not set or is not a directory" >&2
exit 1
fi
Why it matters? ⭐

Adding a guard to ensure MFC_ROOT_DIR is set and is a directory is sensible: as written the script will attempt to cd and then source ./mfc.sh relative to the current dir, which will silently (or noisily) fail if the variable is unset or points nowhere. The proposed check makes the failure explicit and fails fast, which is desirable for a job script that must set up environment.

Prompt for AI Agent 🤖
This is a comment left during a code review.

**Path:** toolchain/templates/oscar.mako
**Line:** 33:33
**Comment:**
	*Resource Leak: The added unconditional "cd ${MFC_ROOT_DIR}" will fail silently if `MFC_ROOT_DIR` is unset or not a directory, causing the subsequent source of `mfc.sh` to run in the wrong directory and break the script; check existence and directory-ness before changing directory and exit with a clear error if invalid.

Validate the correctness of the flagged issue. If correct, How can I resolve this? If you propose a fix, implement it and please make it concise.

@codeant-ai

codeant-ai bot commented Dec 9, 2025

CodeAnt AI finished reviewing your PR.

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

♻️ Duplicate comments (1)
toolchain/templates/oscar.mako (1)

33-33: Add validation for MFC_ROOT_DIR before using it.

The script unconditionally uses ${MFC_ROOT_DIR} without checking if it's set or valid. If the variable is unset or points to a non-existent directory, the subsequent source ./mfc.sh will fail in an unexpected directory, causing cryptic errors.

As flagged in past review comments, add a guard to fail fast with a clear error:

+if [ -z "${MFC_ROOT_DIR}" ] || [ ! -d "${MFC_ROOT_DIR}" ]; then
+    echo "ERROR: MFC_ROOT_DIR is not set or is not a directory" >&2
+    exit 1
+fi
 cd "${MFC_ROOT_DIR}"
🧹 Nitpick comments (1)
toolchain/templates/oscar.mako (1)

18-19: Consider parameterizing the GPU type for flexibility.

The hardcoded v100-16 GPU type is specific to Brown's Oscar cluster. While acceptable for an Oscar-specific template, parameterizing the GPU model (e.g., via a template variable) would improve maintainability if Oscar's GPU inventory changes or if this template is adapted for other clusters.

Past review comments flagged this concern. If the v100-16 designation is subject to change on Oscar, consider:

-#SBATCH --gres=gpu:v100-16:${tasks_per_node}
+#SBATCH --gres=gpu:${gpu_type if gpu_type else 'v100-16'}:${tasks_per_node}
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 8df72f8 and 0056072.

📒 Files selected for processing (1)
  • toolchain/templates/oscar.mako (4 hunks)
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-11-24T21:50:46.909Z
Learnt from: CR
Repo: MFlowCode/MFC PR: 0
File: .cursor/rules/mfc-agent-rules.mdc:0-0
Timestamp: 2025-11-24T21:50:46.909Z
Learning: Draft a step-by-step plan before making changes; build after each step using `./mfc.sh build -t pre_process simulation -j $(nproc)`

Applied to files:

  • toolchain/templates/oscar.mako
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Build & Publish
🔇 Additional comments (1)
toolchain/templates/oscar.mako (1)

44-46: Verify profiler placement—launcher-level vs. rank-level profiling.

The current code applies ${profiler} to the mpirun launcher process. Whether this is correct depends on the profiler type:

  • Launcher-level profilers (system-wide or MPI-aware profilers): should wrap mpirun (current placement)
  • Rank-level profilers (per-process profilers): should be applied to each rank inside mpirun

Verify the intended profiler behavior in MFC documentation or by checking how the profiler variable is defined and used elsewhere in the codebase. If rank-level profiling is intended, the profiler should be moved inside the mpirun command.
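A sketch of the two placements, reusing the ${profiler} and target-path expressions from the template for illustration only:

# Launcher-level: the profiler wraps the mpirun process itself (current placement)
${profiler} mpirun -np ${nodes*tasks_per_node} "${target.get_install_binpath(case)}"
# Rank-level: every MPI rank runs under the profiler
mpirun -np ${nodes*tasks_per_node} ${profiler} "${target.get_install_binpath(case)}"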

% if engine == 'batch':
#SBATCH --nodes=${nodes}
#SBATCH --ntasks-per-node=${tasks_per_node}
#SBATCH --tasks-per-node=${tasks_per_node}

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Check all SBATCH-related options in oscar.mako and other templates

echo "=== oscar.mako SBATCH directives ==="
rg -n 'SBATCH' toolchain/templates/oscar.mako | head -20

echo ""
echo "=== All templates SBATCH directives with tasks-per-node or ntasks-per-node ==="
rg -n --type=mako '(tasks-per-node|ntasks-per-node)' toolchain/templates/

echo ""
echo "=== Full oscar.mako file (if small enough) ==="
wc -l toolchain/templates/oscar.mako

Repository: MFlowCode/MFC

Length of output: 170


🌐 Web query:

SLURM --tasks-per-node vs --ntasks-per-node option documentation

💡 Result:

Short answer: The canonical SLURM option is --ntasks-per-node (request N tasks on each node). --ntasks (or -n) sets the total number of tasks and takes precedence if both are given; --ntasks-per-node is treated as a maximum per-node count when used with --ntasks. Some local docs/sites may show --tasks-per-node as an informal/alternate wording, but official SLURM documentation and manpages use --ntasks-per-node. [1][2][3]

References

  • SLURM srun/sbatch docs (describes --ntasks, --ntasks-per-node and precedence with --ntasks). [1]
  • SLURM sbatch manpage (same text for --ntasks-per-node). [2]
  • Example/local doc showing --tasks-per-node wording (not the official option name). [3]

[1] https://slurm.schedmd.com/srun.html
[2] https://slurm.schedmd.com/sbatch.html
[3] https://www.nrel.gov/hpc/announcements/posts/node-use-efficiency


Revert the SBATCH option to --ntasks-per-node; --tasks-per-node is invalid.

According to official SLURM documentation, the canonical option is --ntasks-per-node, not --tasks-per-node. The latter is not a valid SLURM option and will cause batch script failures. Restore line 7 to use --ntasks-per-node=${tasks_per_node}.
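A minimal header using the canonical option, per the sbatch documentation cited above:

#SBATCH --nodes=${nodes}
# With --ntasks also given it takes precedence, and --ntasks-per-node acts as a per-node maximum:
#SBATCH --ntasks-per-node=${tasks_per_node}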

🤖 Prompt for AI Agents
In toolchain/templates/oscar.mako around line 7, the SBATCH option used is
--tasks-per-node which is invalid for SLURM; change it back to the canonical
--ntasks-per-node=${tasks_per_node} so batch scripts use the correct option and
won’t fail.

@sbryngelson sbryngelson merged commit c91edb9 into MFlowCode:master Dec 10, 2025
21 checks passed

Labels

Review effort 2/5 · size:XS (This PR changes 0-9 lines, ignoring generated files)
