Skip to content
Draft
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
5 changes: 3 additions & 2 deletions toolchain/bootstrap/modules.sh
Original file line number Diff line number Diff line change
Expand Up @@ -39,12 +39,13 @@ done
if [ -v $u_c ]; then
log "Select a system:"
log "$G""ORNL$W: Ascent (a) | Frontier (f) | Summit (s) | Wombat (w)"
log "$B""LLNL $W: Tuolumne (t)"
log "$C""ACCESS$W: Bridges2 (b) | Expanse (e) | Delta (d) | DeltaAI (dai)"
log "$Y""Gatech$W: Phoenix (p)"
log "$R""Caltech$W: Richardson (r)"
log "$BR""Brown$W: Oscar (o)"
log "$BR""Brown$W: Oscar (o)"
log "$B""DoD$W: Carpenter Cray (cc) | Carpenter GNU (c) | Nautilus (n)"
log_n "($G""a$W/$G""f$W/$G""s$W/$G""w$W/$C""b$W/$C""e$CR/$C""d/$C""dai$CR/$Y""p$CR/$R""r$CR/$B""cc$CR/$B""c$CR/$B""n$CR/$BR""o"$CR"): "
log_n "($G""a$W/$G""f$W/$G""s$W/$G""w$W/$B""t$W/$C""b$W/$C""e$CR/$C""d/$C""dai$CR/$Y""p$CR/$R""r$CR/$B""cc$CR/$B""c$CR/$B""n$CR/$BR""o"$CR"): "
read u_c
log
fi
Expand Down
6 changes: 6 additions & 0 deletions toolchain/modules
Original file line number Diff line number Diff line change
Expand Up @@ -45,6 +45,12 @@ f-all cpe/25.03 rocm/6.3.1
f-all cray-fftw cray-hdf5 python cmake
f-gpu python craype-accel-amd-gfx90a rocprofiler-compute/3.0.0

t OLCF Tuolumne
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggestion: Correct the inconsistent facility label for the 'Tuolumne' system. It is labeled OLCF in toolchain/modules but LLNL in toolchain/bootstrap/modules.sh; they should be consistent. [general, importance: 7]

Suggested change
t OLCF Tuolumne
t LLNL Tuolumne

t-all cpe/25.03 rocm/6.3.1
t-all cray-fftw cray-hdf5 cray-python cmake
t-gpu craype-accel-amd-gfx942
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggestion: Typo in module name: craype-accel-amd-gfx942 is almost certainly misspelled and will not match the real module name (typically craype-accel-amd-gfx90a), causing module load failures on Tuolumne GPU nodes. [possible bug]

Severity Level: Critical 🚨

Suggested change
t-gpu craype-accel-amd-gfx942
t-gpu craype-accel-amd-gfx90a
Why it matters? ⭐

This is a real, likely-typo bug. Other entries (e.g. f-gpu) use craype-accel-amd-gfx90a which matches known Cray module names; gfx942 does not follow that pattern and will likely fail to load on Tuolumne GPU nodes. Replacing with gfx90a aligns with the rest of the file and fixes a concrete runtime failure.

Prompt for AI Agent 🤖
This is a comment left during a code review.

**Path:** toolchain/modules
**Line:** 51:51
**Comment:**
	*Possible Bug: Typo in module name: `craype-accel-amd-gfx942` is almost certainly misspelled and will not match the real module name (typically `craype-accel-amd-gfx90a`), causing module load failures on Tuolumne GPU nodes.

Validate the correctness of the flagged issue. If correct, How can I resolve this? If you propose a fix, implement it and please make it concise.

t-gpu HSA_XNACK=1

d NCSA Delta
d-all python/3.11.6
d-cpu gcc/11.4.0 openmpi
Expand Down
63 changes: 63 additions & 0 deletions toolchain/templates/tuo.mako
Original file line number Diff line number Diff line change
@@ -0,0 +1,63 @@
#!/usr/bin/env bash

<%namespace name="helpers" file="helpers.mako"/>

% if engine == 'batch':
# flux: -N ${nodes}
# flux: -n ${tasks_per_node*nodes}
# flux: --job-name="${name}"
# flux: --output="${name}.out"
# flux: --error="${name}.err"
# flux: --time=${walltime}
# flux: --exclusive
# flux:--setattr=thp=always
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggestion: Flux directive formatting error: the directive line is missing a space after the colon ("# flux:--setattr=...") so the Flux scheduler may not recognize it as a directive; add a space after "flux:" so it reads "# flux: --setattr=...". [possible bug]

Severity Level: Critical 🚨

Suggested change
# flux:--setattr=thp=always
# flux: --setattr=thp=always
Why it matters? ⭐

Flux job-file directives are typically written as "# flux: --option=..." (a space after the colon).
Without the space the scheduler parser may not recognize the line as a directive, so adding the space fixes a probable real parsing bug rather than being purely cosmetic.
The PR already contains this exact line at L13; changing it to "# flux: --setattr=thp=always" is the correct single-line fix.

Prompt for AI Agent 🤖
This is a comment left during a code review.

**Path:** toolchain/templates/tuo.mako
**Line:** 13:13
**Comment:**
	*Possible Bug: Flux directive formatting error: the directive line is missing a space after the colon ("# flux:--setattr=...") so the Flux scheduler may not recognize it as a directive; add a space after "flux:" so it reads "# flux: --setattr=...".

Validate the correctness of the flagged issue. If correct, How can I resolve this? If you propose a fix, implement it and please make it concise.

Copy link

Copilot AI Dec 5, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missing space after flux: in the batch directive. Should be # flux: --setattr=thp=always for consistency with other flux directives in this file.

Copilot uses AI. Check for mistakes.
# flux: --coral2-hugepages=512GB
Comment on lines +13 to +14
Copy link

Copilot AI Dec 5, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Duplicate directives: Lines 13-14 are repeated unconditionally here and then again conditionally at lines 22-23 when unified is true. If these directives are always needed, they should only appear once (lines 13-14). If they should only apply when unified is true, move them inside the if unified: block and remove lines 13-14. The current code will result in these directives appearing twice when unified is enabled.

Suggested change
# flux:--setattr=thp=always
# flux: --coral2-hugepages=512GB

Copilot uses AI. Check for mistakes.
% if account:
# flux: --bank=${account}
% endif
% if partition:
# flux: --queue=${partition}
% endif
% if unified:
# flux:--setattr=thp=always
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggestion: Same formatting issue inside the unified block: the Flux attribute line is missing a space after the colon, which can cause the scheduler to ignore the directive; add the space to ensure the directive is parsed. [possible bug]

Severity Level: Critical 🚨

Suggested change
# flux:--setattr=thp=always
# flux: --setattr=thp=always
Why it matters? ⭐

Same issue as the previous suggestion but in the unified block (L21-L24). The missing space can prevent Flux from parsing the directive.
Fixing it to "# flux: --setattr=thp=always" is correct and necessary for reliable scheduler behavior when the unified block is rendered.

Prompt for AI Agent 🤖
This is a comment left during a code review.

**Path:** toolchain/templates/tuo.mako
**Line:** 22:22
**Comment:**
	*Possible Bug: Same formatting issue inside the `unified` block: the Flux attribute line is missing a space after the colon, which can cause the scheduler to ignore the directive; add the space to ensure the directive is parsed.

Validate the correctness of the flagged issue. If correct, How can I resolve this? If you propose a fix, implement it and please make it concise.

Copy link

Copilot AI Dec 5, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missing space after flux: in the batch directive. Should be # flux: --setattr=thp=always for consistency with other flux directives in this file.

Suggested change
# flux:--setattr=thp=always
# flux: --setattr=thp=always

Copilot uses AI. Check for mistakes.
# flux: --coral2-hugepages=512GB
% endif
Comment on lines +12 to +24
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggestion: Fix a typo in the # flux:--setattr directive by adding a space. Also, remove the duplicated setattr and coral2-hugepages directives, keeping them only within the if unified: block. [possible issue, importance: 8]

Suggested change
# flux: --exclusive
# flux:--setattr=thp=always
# flux: --coral2-hugepages=512GB
% if account:
# flux: --bank=${account}
% endif
% if partition:
# flux: --queue=${partition}
% endif
% if unified:
# flux:--setattr=thp=always
# flux: --coral2-hugepages=512GB
% endif
# flux: --exclusive
% if account:
# flux: --bank=${account}
% endif
% if partition:
# flux: --queue=${partition}
% endif
% if unified:
# flux: --setattr=thp=always
# flux: --coral2-hugepages=512GB
% endif

% endif

${helpers.template_prologue()}

ok ":) Loading modules:\n"
cd "${MFC_ROOT_DIR}"
% if engine == 'batch':
. ./mfc.sh load -c t -m ${'g' if gpu else 'c'}
Copy link

Copilot AI Dec 5, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Incorrect GPU check in module loading command. The gpu variable is a string ('no', 'acc', or 'mp'), not a boolean. Using if gpu will be True even when gpu == 'no' because 'no' is a truthy string. This should be ${'g' if gpu != 'no' else 'c'} to match the pattern used in other templates like frontier.mako (line 37).

Suggested change
. ./mfc.sh load -c t -m ${'g' if gpu else 'c'}
. ./mfc.sh load -c t -m ${'g' if gpu != 'no' else 'c'}

Copilot uses AI. Check for mistakes.
% endif
cd - > /dev/null
echo

% if gpu:
export MPICH_GPU_SUPPORT_ENABLED=1
% else:
export MPICH_GPU_SUPPORT_ENABLED=0
Comment on lines +37 to +40
Copy link

Copilot AI Dec 5, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Incorrect GPU check. The gpu variable is a string ('no', 'acc', or 'mp'), not a boolean. Using if gpu: will be True even when gpu == 'no' because 'no' is a truthy string. This should be if gpu != 'no': to match the pattern used in other templates like frontier.mako (line 42).

Copilot uses AI. Check for mistakes.
% endif

% for target in targets:
${helpers.run_prologue(target)}

% if not mpi:
(set -x; ${profiler} "${target.get_install_binpath(case)}")
% else:
(set -x; flux run \
--nodes=${nodes} --ntasks=${tasks_per_node * nodes} \
--exclusive \
% if gpu:
--gpus-per-task 1 \
Comment on lines +52 to +53
Copy link

Copilot AI Dec 5, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Incorrect GPU check. The gpu variable is a string ('no', 'acc', or 'mp'), not a boolean. Using if gpu: will be True even when gpu == 'no' because 'no' is a truthy string. This should be if gpu != 'no': to match the pattern used in other templates like frontier.mako (line 69).

Copilot uses AI. Check for mistakes.
% endif
${profiler} "${target.get_install_binpath(case)}")
% endif

${helpers.run_epilogue(target)}

echo
% endfor

${helpers.template_epilogue()}
Loading