Problem
Compute nodes may not be able to run sbatch/srun to submit jobs back to the slurmctld (head node). This is a known class of SLURM configuration problems that can affect workflows requiring job-from-job submission (e.g., Snakemake, Nextflow, or manual sbatch calls from within a running job).
Key SLURM Config Areas to Investigate
Authentication
- Is the `munge` daemon running and healthy on all compute nodes?
- Consider adding `auth/jwt` as an `AuthAltTypes` fallback in `slurm.conf`; this can help when munge has transient issues
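A local `munge -n | unmunge` only shows that the daemon works on one node. Decoding the credential on the head node is a stronger check that the keys match. The following is a minimal sketch, assuming passwordless SSH from the compute node to the head node; the function name `check_munge_remote` and the hostname `head-node` are hypothetical.

```shell
# Encode a credential locally and decode it on a remote host.
# This succeeds only if both hosts share the same munge key and their
# clocks are close enough for the credential to be accepted.
# Return codes: 0 = decoded, 1 = rejected or SSH failed, 2 = munge missing here.
check_munge_remote() {
    local remote_host="$1"
    if ! command -v munge >/dev/null 2>&1; then
        echo "munge not found on $(uname -n)" >&2
        return 2
    fi
    if munge -n | ssh "$remote_host" unmunge >/dev/null; then
        echo "credential from $(uname -n) decoded on ${remote_host}"
        return 0
    fi
    echo "credential from $(uname -n) was not accepted on ${remote_host}" >&2
    return 1
}
```

From a compute node this could be run as `check_munge_remote head-node`.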
Network / Firewall
- Port 6817 (slurmctld default) must be reachable from compute nodes
- `SrunPortRange` may need explicit configuration so that `srun` listens on a known port range that firewalls can allow, rather than on ephemeral ports
- Check for any firewall rules or security groups blocking traffic between compute nodes and the head node
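Reachability of the controller port can be probed without extra tools. This sketch relies on bash's built-in `/dev/tcp` redirection and the coreutils `timeout` command; `port_open` and `head-node` are hypothetical names.

```shell
# Return 0 if a TCP connection to host:port succeeds within the timeout.
# Uses bash's /dev/tcp redirection, so nc or nmap is not required.
port_open() {
    local host="$1" port="$2" timeout_s="${3:-3}"
    # $0 and $1 inside the inner shell are the host and port passed below.
    timeout "$timeout_s" bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$host" "$port" 2>/dev/null
}
```

For example, `port_open head-node 6817 && echo "slurmctld port reachable"` run from a compute node. A failure does not say whether the cause is a firewall, a security group, or slurmctld not listening, so it only narrows the search.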
Timeouts
- `MessageTimeout` defaults to 10 seconds, which may be too low under load
- Alpine's working config uses `MessageTimeout=90`, so it is worth matching or at least increasing
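The value the running cluster actually uses can be read from `scontrol show config`. The sketch below parses that output from stdin; it assumes the line looks like `MessageTimeout = 10 sec`, and the function names are hypothetical.

```shell
# Extract the numeric MessageTimeout (seconds) from `scontrol show config`
# output on stdin. Assumes a line of the form "MessageTimeout = 10 sec".
get_message_timeout() {
    awk -F'=' '/^MessageTimeout[ \t]*=/ { gsub(/[^0-9]/, "", $2); print $2; exit }'
}

# Report whether the value read from stdin is below a threshold.
# The default of 90 is the value this issue cites from Alpine's config.
# Return codes: 0 = at or above threshold, 1 = below, 2 = not found.
check_message_timeout() {
    local threshold="${1:-90}" current
    current="$(get_message_timeout)"
    if [ -z "$current" ]; then
        echo "MessageTimeout not found in input" >&2
        return 2
    fi
    if [ "$current" -lt "$threshold" ]; then
        echo "MessageTimeout=${current}s is below ${threshold}s"
        return 1
    fi
    echo "MessageTimeout=${current}s"
}
```

Typical use would be `scontrol show config | check_message_timeout 90`.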
Config Consistency
- `slurm.conf` must be identical across all nodes (head node + compute nodes)
- A mismatch can cause silent failures or authentication errors
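One way to check consistency is to compare file digests between a compute node and the head node. This is a sketch only: the path `/etc/slurm/slurm.conf`, the hostname `head-node`, and both function names are assumptions, and it presumes passwordless SSH to the head node.

```shell
# Print the SHA-256 digest of a file (digest only, no filename).
conf_digest() {
    sha256sum "$1" | awk '{ print $1 }'
}

# Compare the local slurm.conf with the head node's copy over SSH.
# Return codes: 0 = identical, 1 = different or unreadable remotely,
# 2 = local file unreadable.
compare_with_head() {
    local head_node="${1:-head-node}" conf_path="${2:-/etc/slurm/slurm.conf}"
    local local_digest remote_digest
    [ -r "$conf_path" ] || { echo "cannot read ${conf_path}" >&2; return 2; }
    local_digest="$(conf_digest "$conf_path")"
    remote_digest="$(ssh "$head_node" sha256sum "$conf_path" | awk '{ print $1 }')"
    if [ -n "$remote_digest" ] && [ "$local_digest" = "$remote_digest" ]; then
        echo "slurm.conf matches ${head_node}"
    else
        echo "slurm.conf differs from ${head_node} (or could not be read)" >&2
        return 1
    fi
}
```

A byte-for-byte digest also flags differences in comments or whitespace, so a mismatch is a prompt to diff the files rather than proof of a functional difference.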
Diagnostic Commands (run from a compute node)
# Check munge status
systemctl status munge
# Test munge communication
munge -n | unmunge
# Test slurmctld connectivity
scontrol ping
# Try submitting a trivial job
sbatch --wrap="hostname"
# Check SLURM logs for errors
journalctl -u slurmd -n 50
Reference
The Alpine cluster's SLURM config is a working example of multi-node job submission. Its slurm.conf settings for MessageTimeout, SrunPortRange, and AuthAltTypes can serve as a baseline.
Acceptance Criteria
- Confirm compute nodes can reach slurmctld on port 6817
- Confirm munge is running and keys match across nodes
- Validate `slurm.conf` consistency
- Test `sbatch` from within a running job on a compute node
- Document findings and any config changes needed
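The job-from-job criterion can be exercised with a small batch script that submits a second job from inside the first. The sketch below writes such a script to a path of your choosing; the `#SBATCH` values and the function name `write_nested_job` are illustrative placeholders, and a real cluster may also need partition or account options.

```shell
# Write a batch script whose only purpose is to submit a child job from
# inside a running job. All #SBATCH values are illustrative placeholders.
write_nested_job() {
    local script_path="$1"
    cat > "$script_path" <<'EOF'
#!/bin/bash
#SBATCH --job-name=nested-submit-test
#SBATCH --output=nested-submit-%j.out
#SBATCH --time=00:05:00

echo "Outer job ${SLURM_JOB_ID} running on $(hostname)"
# Confirm the controller answers from this compute node.
scontrol ping
# The actual test: submit a child job from within the running job.
sbatch --wrap="hostname"
EOF
}
```

Running `write_nested_job nested.sh && sbatch nested.sh` submits the outer job. The outer job's output file should then show whether the inner `sbatch` was accepted or which error it returned.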