Skip to content
Draft
Show file tree
Hide file tree
Changes from 3 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
5 changes: 3 additions & 2 deletions toolchain/bootstrap/modules.sh
Original file line number Diff line number Diff line change
Expand Up @@ -39,12 +39,13 @@ done
if [ -v $u_c ]; then
log "Select a system:"
log "$G""ORNL$W: Ascent (a) | Frontier (f) | Summit (s) | Wombat (w)"
log "$B""LLNL $W: Tuolumne (t)"
log "$C""ACCESS$W: Bridges2 (b) | Expanse (e) | Delta (d) | DeltaAI (dai)"
log "$Y""Gatech$W: Phoenix (p)"
log "$R""Caltech$W: Richardson (r)"
log "$BR""Brown$W: Oscar (o)"
log "$BR""Brown$W: Oscar (o)"
log "$B""DoD$W: Carpenter Cray (cc) | Carpenter GNU (c) | Nautilus (n)"
log_n "($G""a$W/$G""f$W/$G""s$W/$G""w$W/$C""b$W/$C""e$CR/$C""d/$C""dai$CR/$Y""p$CR/$R""r$CR/$B""cc$CR/$B""c$CR/$B""n$CR/$BR""o"$CR"): "
log_n "($G""a$W/$G""f$W/$G""s$W/$G""w$W/$B""t$W/$C""b$W/$C""e$CR/$C""d/$C""dai$CR/$Y""p$CR/$R""r$CR/$B""cc$CR/$B""c$CR/$B""n$CR/$BR""o"$CR"): "
read u_c
log
fi
Expand Down
6 changes: 6 additions & 0 deletions toolchain/modules
Original file line number Diff line number Diff line change
Expand Up @@ -45,6 +45,12 @@ f-all cpe/25.03 rocm/6.3.1
f-all cray-fftw cray-hdf5 python cmake
f-gpu python craype-accel-amd-gfx90a rocprofiler-compute/3.0.0

t OLCF Tuolumne
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggestion: Correct the inconsistent facility label for the 'Tuolumne' system. It is labeled OLCF in toolchain/modules but LLNL in toolchain/bootstrap/modules.sh; they should be consistent. [general, importance: 7]

Suggested change
t OLCF Tuolumne
t LLNL Tuolumne

t-all cpe/25.03 rocm/6.3.1
t-all cray-fftw cray-hdf5 cray-python cmake
t-gpu craype-accel-amd-gfx942
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggestion: Typo in module name: craype-accel-amd-gfx942 is almost certainly misspelled and will not match the real module name (typically craype-accel-amd-gfx90a), causing module load failures on Tuolumne GPU nodes. [possible bug]

Severity Level: Critical 🚨

Suggested change
t-gpu craype-accel-amd-gfx942
t-gpu craype-accel-amd-gfx90a
Why it matters? ⭐

This is a real, likely-typo bug. Other entries (e.g. f-gpu) use craype-accel-amd-gfx90a which matches known Cray module names; gfx942 does not follow that pattern and will likely fail to load on Tuolumne GPU nodes. Replacing with gfx90a aligns with the rest of the file and fixes a concrete runtime failure.

Prompt for AI Agent 🤖
This is a comment left during a code review.

**Path:** toolchain/modules
**Line:** 51:51
**Comment:**
	*Possible Bug: Typo in module name: `craype-accel-amd-gfx942` is almost certainly misspelled and will not match the real module name (typically `craype-accel-amd-gfx90a`), causing module load failures on Tuolumne GPU nodes.

Validate the correctness of the flagged issue. If correct, How can I resolve this? If you propose a fix, implement it and please make it concise.

t-gpu HSA_XNACK=1

d NCSA Delta
d-all python/3.11.6
d-cpu gcc/11.4.0 openmpi
Expand Down
63 changes: 63 additions & 0 deletions toolchain/templates/tuo.mako
Original file line number Diff line number Diff line change
@@ -0,0 +1,63 @@
#!/usr/bin/env bash

<%namespace name="helpers" file="helpers.mako"/>

% if engine == 'batch':
# flux: -N ${nodes}
# flux: -n ${tasks_per_node*nodes}
# flux: --job-name="${name}"
# flux: --output="${name}.out"
# flux: --error="${name}.err"
# flux: --time=${walltime}
# flux: --exclusive
# flux:--setattr=thp=always
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggestion: Flux directive formatting error: the directive line is missing a space after the colon ("# flux:--setattr=...") so the Flux scheduler may not recognize it as a directive; add a space after "flux:" so it reads "# flux: --setattr=...". [possible bug]

Severity Level: Critical 🚨

Suggested change
# flux:--setattr=thp=always
# flux: --setattr=thp=always
Why it matters? ⭐

Flux job-file directives are typically written as "# flux: --option=..." (a space after the colon).
Without the space the scheduler parser may not recognize the line as a directive, so adding the space fixes a probable real parsing bug rather than being purely cosmetic.
The PR already contains this exact line at L13; changing it to "# flux: --setattr=thp=always" is the correct single-line fix.

Prompt for AI Agent 🤖
This is a comment left during a code review.

**Path:** toolchain/templates/tuo.mako
**Line:** 13:13
**Comment:**
	*Possible Bug: Flux directive formatting error: the directive line is missing a space after the colon ("# flux:--setattr=...") so the Flux scheduler may not recognize it as a directive; add a space after "flux:" so it reads "# flux: --setattr=...".

Validate the correctness of the flagged issue. If correct, How can I resolve this? If you propose a fix, implement it and please make it concise.

Copy link

Copilot AI Dec 5, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missing space after flux: in the batch directive. Should be # flux: --setattr=thp=always for consistency with other flux directives in this file.

Copilot uses AI. Check for mistakes.
# flux: --coral2-hugepages=512GB
Comment on lines +13 to +14
Copy link

Copilot AI Dec 5, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Duplicate directives: Lines 13-14 are repeated unconditionally here and then again conditionally at lines 22-23 when unified is true. If these directives are always needed, they should only appear once (lines 13-14). If they should only apply when unified is true, move them inside the if unified: block and remove lines 13-14. The current code will result in these directives appearing twice when unified is enabled.

Suggested change
# flux:--setattr=thp=always
# flux: --coral2-hugepages=512GB

Copilot uses AI. Check for mistakes.
% if account:
# flux: --bank=${account}
% endif
% if partition:
# flux: --queue=${partition}
% endif
% if unified:
# flux:--setattr=thp=always
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggestion: Same formatting issue inside the unified block: the Flux attribute line is missing a space after the colon, which can cause the scheduler to ignore the directive; add the space to ensure the directive is parsed. [possible bug]

Severity Level: Critical 🚨

Suggested change
# flux:--setattr=thp=always
# flux: --setattr=thp=always
Why it matters? ⭐

Same issue as the previous suggestion but in the unified block (L21-L24). The missing space can prevent Flux from parsing the directive.
Fixing it to "# flux: --setattr=thp=always" is correct and necessary for reliable scheduler behavior when the unified block is rendered.

Prompt for AI Agent 🤖
This is a comment left during a code review.

**Path:** toolchain/templates/tuo.mako
**Line:** 22:22
**Comment:**
	*Possible Bug: Same formatting issue inside the `unified` block: the Flux attribute line is missing a space after the colon, which can cause the scheduler to ignore the directive; add the space to ensure the directive is parsed.

Validate the correctness of the flagged issue. If correct, How can I resolve this? If you propose a fix, implement it and please make it concise.

Copy link

Copilot AI Dec 5, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missing space after flux: in the batch directive. Should be # flux: --setattr=thp=always for consistency with other flux directives in this file.

Suggested change
# flux:--setattr=thp=always
# flux: --setattr=thp=always

Copilot uses AI. Check for mistakes.
# flux: --coral2-hugepages=512GB
% endif
Comment on lines +12 to +24
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggestion: Fix a typo in the # flux:--setattr directive by adding a space. Also, remove the duplicated setattr and coral2-hugepages directives, keeping them only within the if unified: block. [possible issue, importance: 8]

Suggested change
# flux: --exclusive
# flux:--setattr=thp=always
# flux: --coral2-hugepages=512GB
% if account:
# flux: --bank=${account}
% endif
% if partition:
# flux: --queue=${partition}
% endif
% if unified:
# flux:--setattr=thp=always
# flux: --coral2-hugepages=512GB
% endif
# flux: --exclusive
% if account:
# flux: --bank=${account}
% endif
% if partition:
# flux: --queue=${partition}
% endif
% if unified:
# flux: --setattr=thp=always
# flux: --coral2-hugepages=512GB
% endif

% endif

${helpers.template_prologue()}

ok ":) Loading modules:\n"
cd "${MFC_ROOT_DIR}"
% if engine == 'batch':
. ./mfc.sh load -c t -m ${'g' if gpu else 'c'}
Copy link

Copilot AI Dec 5, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Incorrect GPU check in module loading command. The gpu variable is a string ('no', 'acc', or 'mp'), not a boolean. Using if gpu will be True even when gpu == 'no' because 'no' is a truthy string. This should be ${'g' if gpu != 'no' else 'c'} to match the pattern used in other templates like frontier.mako (line 37).

Suggested change
. ./mfc.sh load -c t -m ${'g' if gpu else 'c'}
. ./mfc.sh load -c t -m ${'g' if gpu != 'no' else 'c'}

Copilot uses AI. Check for mistakes.
% endif
cd - > /dev/null
echo

% if gpu:
export MPICH_GPU_SUPPORT_ENABLED=1
% else:
export MPICH_GPU_SUPPORT_ENABLED=0
Comment on lines +37 to +40
Copy link

Copilot AI Dec 5, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Incorrect GPU check. The gpu variable is a string ('no', 'acc', or 'mp'), not a boolean. Using if gpu: will be True even when gpu == 'no' because 'no' is a truthy string. This should be if gpu != 'no': to match the pattern used in other templates like frontier.mako (line 42).

Copilot uses AI. Check for mistakes.
% endif

% for target in targets:
${helpers.run_prologue(target)}

% if not mpi:
(set -x; ${profiler} "${target.get_install_binpath(case)}")
% else:
(set -x; flux run \
--nodes=${nodes} --ntasks=${tasks_per_node * nodes} \
--exclusive \
% if gpu:
--gpus-per-task 1 \
Comment on lines +52 to +53
Copy link

Copilot AI Dec 5, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Incorrect GPU check. The gpu variable is a string ('no', 'acc', or 'mp'), not a boolean. Using if gpu: will be True even when gpu == 'no' because 'no' is a truthy string. This should be if gpu != 'no': to match the pattern used in other templates like frontier.mako (line 69).

Copilot uses AI. Check for mistakes.
% endif
${profiler} "${target.get_install_binpath(case)}")
% endif

${helpers.run_epilogue(target)}

echo
% endfor

${helpers.template_epilogue()}
Loading