Oscar Mako and module update #1082
Conversation
CodeAnt AI is reviewing your PR.
Note: CodeRabbit has detected other AI code review bot(s) in this pull request and will avoid duplicating their findings in the review comments. This may lead to a less comprehensive review.
Walkthrough
Estimated code review effort: 🎯 2 (Simple) | ⏱️ ~10 minutes
Pre-merge checks: ✅ 3 passed
PR Reviewer Guide 🔍
Here are some key observations to aid the review process:
```
(set -x; ${profiler} \
    mpirun -np ${nodes*tasks_per_node} \
        "${target.get_install_binpath(case)}")
```
Suggestion: To correctly profile the parallel application instead of the MPI launcher, move the ${profiler} command to be executed by mpirun on each rank. [possible issue, importance: 9]
Suggested change:
```diff
-(set -x; ${profiler} \
-    mpirun -np ${nodes*tasks_per_node} \
-        "${target.get_install_binpath(case)}")
+(set -x; mpirun -np ${nodes*tasks_per_node} \
+    ${profiler} "${target.get_install_binpath(case)}")
```
No issues found across 2 files
Nitpicks 🔍
```
o Brown Oscar
o-all python
o-cpu hpcx-mpi
o-cpu hpcx-mpi python/3.13.10s
```
Suggestion: Typo in the Python module name: the added token python/3.13.10s has an extra trailing "s" which will make the module name invalid/unavailable to the module loader and cause this entry to fail at load time; remove the stray "s" so the version matches the rest of the file. [possible bug]
Severity Level: Critical 🚨
Suggested change:
```diff
-o-cpu hpcx-mpi python/3.13.10s
+o-cpu hpcx-mpi python/3.13.10
```
Why it matters? ⭐
The added token "python/3.13.10s" very likely contains a typo — all other entries use numeric version tokens (e.g., python/3.12.5). The stray "s" will most likely make the module name not match the expected modulefile and could break module loading. Changing it to python/3.13.10 is a straightforward, low-risk fix that improves correctness.
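As a quick way to confirm the corrected name resolves on the cluster, here is a small sketch assuming Oscar exposes software through the standard Lmod/Environment Modules `module` command (the module names follow the `toolchain/modules` entries shown above):

```bash
# List modulefiles matching the Python 3.13 prefix; a "3.13.10s" entry should not exist
module avail python/3.13

# The corrected name should load cleanly, whereas python/3.13.10s would fail to resolve
module load python/3.13.10
python3 --version
```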
Prompt for AI Agent 🤖
This is a comment left during a code review.
**Path:** toolchain/modules
**Line:** 57:57
**Comment:**
Possible Bug: Typo in the Python module name: the added token `python/3.13.10s` has an extra trailing "s" which will make the module name invalid/unavailable to the module loader and cause this entry to fail at load time; remove the stray "s" so the version matches the rest of the file.
Validate the correctness of the flagged issue. If correct, How can I resolve this? If you propose a fix, implement it and please make it concise.

```
#SBATCH --mem=64G
#SBATCH --gpu-bind=closest
#SBATCH --gpu-bind=verbose,closest
#SBATCH --gres=gpu:v100-16:${tasks_per_node}
```
Suggestion: Hardcoded GPU resource/type may not exist on all clusters and can cause sbatch to reject the job; request the GPU count generically (or parameterize the GPU type) instead of forcing "v100-16". [possible bug]
Severity Level: Critical 🚨
Suggested change:
```diff
-#SBATCH --gres=gpu:v100-16:${tasks_per_node}
+#SBATCH --gres=gpu:${tasks_per_node}
```
Why it matters? ⭐
This is a valid and practical suggestion: hardcoding a specific GPU model (v100-16) can indeed cause sbatch to reject jobs on clusters that don't expose that exact resource name. Replacing the line with a generic gres request (or better, parameterizing the GPU model with a template variable) is safer and fixes a real portability issue visible in the diff.
The PR already gates these lines with `% if gpu_enabled`, so removing the explicit model is a low-risk change.
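To confirm what the target cluster actually advertises before hardcoding a GPU type, standard SLURM queries can be used (a sketch; `<gpu-node-name>` is a placeholder and the exact GRES strings are cluster-specific):

```bash
# Show the generic resources (GRES) each partition advertises; a v100-16 type
# must appear here for --gres=gpu:v100-16:N to be accepted by sbatch
sinfo -o "%P %G" | sort -u

# Inspect one GPU node's configured Gres string in detail
scontrol show node <gpu-node-name> | grep -i gres
```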
Prompt for AI Agent 🤖
This is a comment left during a code review.
**Path:** toolchain/templates/oscar.mako
**Line:** 19:19
**Comment:**
Possible Bug: Hardcoded GPU resource/type may not exist on all clusters and can cause sbatch to reject the job; request the GPU count generically (or parameterize the GPU type) instead of forcing "v100-16".
Validate the correctness of the flagged issue. If correct, How can I resolve this? If you propose a fix, implement it and please make it concise.

```diff
 ok ":) Loading modules:\n"
-cd "${MFC_ROOTDIR}"
+cd "${MFC_ROOT_DIR}"
```
Suggestion: The added unconditional "cd ${MFC_ROOT_DIR}" will fail silently if MFC_ROOT_DIR is unset or not a directory, causing the subsequent source of mfc.sh to run in the wrong directory and break the script; check existence and directory-ness before changing directory and exit with a clear error if invalid. [resource leak]
Severity Level: Minor
Suggested change:
```diff
-cd "${MFC_ROOT_DIR}"
+if [ -z "${MFC_ROOT_DIR}" ] || [ ! -d "${MFC_ROOT_DIR}" ]; then
+  echo "MFC_ROOT_DIR is not set or is not a directory" >&2
+  exit 1
+fi
```
Why it matters? ⭐
Adding a guard to ensure MFC_ROOT_DIR is set and is a directory is sensible: as written the script will attempt to cd and then source ./mfc.sh relative to the current dir, which will silently (or noisily) fail if the variable is unset or points nowhere. The proposed check makes the failure explicit and fails fast, which is desirable for a job script that must set up environment.
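If a shorter guard is preferred, bash's parameter-expansion check gives the same fail-fast behaviour in one line (a sketch, not the reviewer's exact suggestion):

```bash
# ${MFC_ROOT_DIR:?...} aborts the script with the given message if the variable
# is unset or empty; `|| exit 1` stops it if the path is not a usable directory
cd "${MFC_ROOT_DIR:?MFC_ROOT_DIR is not set}" || exit 1
```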
Prompt for AI Agent 🤖
This is a comment left during a code review.
**Path:** toolchain/templates/oscar.mako
**Line:** 33:33
**Comment:**
Resource Leak: The added unconditional "cd ${MFC_ROOT_DIR}" will fail silently if `MFC_ROOT_DIR` is unset or not a directory, causing the subsequent source of `mfc.sh` to run in the wrong directory and break the script; check existence and directory-ness before changing directory and exit with a clear error if invalid.
Validate the correctness of the flagged issue. If correct, How can I resolve this? If you propose a fix, implement it and please make it concise.

CodeAnt AI finished reviewing your PR.
Actionable comments posted: 1
♻️ Duplicate comments (1)
toolchain/templates/oscar.mako (1)
33-33: Add validation for `MFC_ROOT_DIR` before using it.

The script unconditionally uses `${MFC_ROOT_DIR}` without checking if it's set or valid. If the variable is unset or points to a non-existent directory, the subsequent `source ./mfc.sh` will fail in an unexpected directory, causing cryptic errors. As flagged in past review comments, add a guard to fail fast with a clear error:

```diff
+if [ -z "${MFC_ROOT_DIR}" ] || [ ! -d "${MFC_ROOT_DIR}" ]; then
+  echo "ERROR: MFC_ROOT_DIR is not set or is not a directory" >&2
+  exit 1
+fi
 cd "${MFC_ROOT_DIR}"
```
🧹 Nitpick comments (1)
toolchain/templates/oscar.mako (1)
18-19: Consider parameterizing the GPU type for flexibility.

The hardcoded `v100-16` GPU type is specific to Brown's Oscar cluster. While acceptable for an Oscar-specific template, parameterizing the GPU model (e.g., via a template variable) would improve maintainability if Oscar's GPU inventory changes or if this template is adapted for other clusters. Past review comments flagged this concern. If the v100-16 designation is subject to change on Oscar, consider:

```diff
-#SBATCH --gres=gpu:v100-16:${tasks_per_node}
+#SBATCH --gres=gpu:${gpu_type if gpu_type else 'v100-16'}:${tasks_per_node}
```
📜 Review details
📒 Files selected for processing (1)
toolchain/templates/oscar.mako (4 hunks)
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-11-24T21:50:46.909Z
Learnt from: CR
Repo: MFlowCode/MFC PR: 0
File: .cursor/rules/mfc-agent-rules.mdc:0-0
Timestamp: 2025-11-24T21:50:46.909Z
Learning: Draft a step-by-step plan before making changes; build after each step using `./mfc.sh build -t pre_process simulation -j $(nproc)`
Applied to files:
toolchain/templates/oscar.mako
🔇 Additional comments (1)
toolchain/templates/oscar.mako (1)
44-46: Verify profiler placement: launcher-level vs. rank-level profiling.

The current code applies `${profiler}` to the `mpirun` launcher process. Whether this is correct depends on the profiler type:

- Launcher-level profilers (system-wide or MPI-aware profilers): should wrap `mpirun` (current placement)
- Rank-level profilers (per-process profilers): should be applied to each rank inside `mpirun`

Verify the intended profiler behavior in MFC documentation or by checking how the `profiler` variable is defined and used elsewhere in the codebase. If rank-level profiling is intended, the profiler should be moved inside the mpirun command.
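To make the distinction concrete, here is a minimal sketch; the `nsys` profiler, rank count, and `./simulation` binary are illustrative assumptions rather than what MFC necessarily uses, and the per-rank output naming relies on an Open MPI environment variable:

```bash
# Launcher-level: the profiler wraps mpirun, so only the launcher process is observed
# (this matches the placement currently generated by the template)
nsys profile -o launcher_report mpirun -np 8 ./simulation

# Rank-level: mpirun starts the profiler once per rank, so every MPI process is profiled
# (%q{OMPI_COMM_WORLD_RANK} gives each rank its own report file)
mpirun -np 8 nsys profile -o rank_report.%q{OMPI_COMM_WORLD_RANK} ./simulation
```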
```diff
 % if engine == 'batch':
 #SBATCH --nodes=${nodes}
-#SBATCH --ntasks-per-node=${tasks_per_node}
+#SBATCH --tasks-per-node=${tasks_per_node}
```
🧩 Analysis chain
🏁 Script executed:

```bash
#!/bin/bash
# Check all SBATCH-related options in oscar.mako and other templates
echo "=== oscar.mako SBATCH directives ==="
rg -n 'SBATCH' toolchain/templates/oscar.mako | head -20
echo ""
echo "=== All templates SBATCH directives with tasks-per-node or ntasks-per-node ==="
rg -n --type=mako '(tasks-per-node|ntasks-per-node)' toolchain/templates/
echo ""
echo "=== Full oscar.mako file (if small enough) ==="
wc -l toolchain/templates/oscar.mako
```

Repository: MFlowCode/MFC
Length of output: 170
🌐 Web query:
SLURM --tasks-per-node vs --ntasks-per-node option documentation
💡 Result:
Short answer: The canonical SLURM option is --ntasks-per-node (request N tasks on each node). --ntasks (or -n) sets the total number of tasks and takes precedence if both are given; --ntasks-per-node is treated as a maximum per-node count when used with --ntasks. Some local docs/sites may show --tasks-per-node as an informal/alternate wording, but official SLURM documentation and manpages use --ntasks-per-node. [1][2][3]
References
- SLURM srun/sbatch docs (describes --ntasks, --ntasks-per-node and precedence with --ntasks). [1]
- SLURM sbatch manpage (same text for --ntasks-per-node). [2]
- Example/local doc showing --tasks-per-node wording (not the official option name). [3]
[1] https://slurm.schedmd.com/srun.html
[2] https://slurm.schedmd.com/sbatch.html
[3] https://www.nrel.gov/hpc/announcements/posts/node-use-efficiency
Revert the SBATCH option to --ntasks-per-node; --tasks-per-node is invalid.
According to official SLURM documentation, the canonical option is --ntasks-per-node, not --tasks-per-node. The latter is not a valid SLURM option and will cause batch script failures. Restore line 7 to use --ntasks-per-node=${tasks_per_node}.
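A quick way to catch this class of error before a job ever queues (a sketch; `oscar_job.sh` stands in for whatever script the template generates):

```bash
# sbatch rejects unrecognized options at submission time; --test-only validates the
# script and reports an estimated start time without actually submitting the job
sbatch --test-only oscar_job.sh
```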
🤖 Prompt for AI Agents
In toolchain/templates/oscar.mako around line 7, the SBATCH option used is
--tasks-per-node which is invalid for SLURM; change it back to the canonical
--ntasks-per-node=${tasks_per_node} so batch scripts use the correct option and
won’t fail.
User description
Description
Brown Oscar supercomputer updated their OS and modules recently.
MFC also now runs on Python 3.13.
MFC's modules and Oscar mako file need an update to run interactively and in batch.
Fixes #(issue) [optional]
Type of change
Scope
How Has This Been Tested?
Please describe the tests that you ran to verify your changes.
Provide instructions so we can reproduce.
Please also list any relevant details for your test configuration
Test Configuration:
Checklist
They run to completion and demonstrate "interesting physics"
I ran ./mfc.sh format before committing my code

PR Type
Enhancement
Description
Updated Oscar supercomputer batch job configuration for GPU support
Fixed environment variable naming and module loading syntax
Added Python 3.13.10s support to Oscar CPU modules
Corrected MPI command formatting and GPU resource allocation
Diagram Walkthrough
File Walkthrough
oscar.mako
Oscar batch job GPU and module loading fixes (toolchain/templates/oscar.mako); a sketch of the resulting header follows the walkthrough.
- Replaced --gpus-per-node and --mem with --gres=gpu:v100-16 for proper GPU resource allocation
- Changed --gpu-bind=closest to --gpu-bind=verbose,closest for better diagnostics
- Renamed MFC_ROOTDIR to MFC_ROOT_DIR for consistency
- Removed unnecessary argument passing

modules
Add Python 3.13.10s to Oscar modules (toolchain/modules)
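To illustrate what the updated template would emit, a hedged sketch of an Oscar batch header for a 2-node, 4-task-per-node GPU run (values and directive order are illustrative, not taken verbatim from oscar.mako):

```bash
#!/bin/bash
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=4          # canonical SLURM spelling (see the review note above)
#SBATCH --gpu-bind=verbose,closest   # verbose binding for easier diagnostics
#SBATCH --gres=gpu:v100-16:4         # four V100s per node, one per task, on Oscar's v100-16 nodes
```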
CodeAnt-AI Description
Update Oscar job template and module list for correct GPU allocation, MPI execution, and Python 3.13
What Changed
Impact
✅ Correct GPU allocation on Oscar (V100 requests honored)
✅ Fewer batch job failures due to wrong working directory
✅ Jobs using MPI actually run the intended executable
Summary by CodeRabbit
New Features
Improvements
Refactor