Commit 9a11603: Document basic submission scripts and troubleshooting
1 parent cd23e21 commit 9a11603

File changed: docs/src/submit_scripts.md (175 additions, 0 deletions)

# Job Submission Scripts for HPC Clusters

This page provides concrete examples and best practices for running calibrations on HPC clusters using ClimaCalibrate.jl. The examples assume basic familiarity with either the Slurm or PBS job scheduler.

## Overview

ClimaCalibrate.jl supports two main approaches for running calibrations on HPC clusters:

1. **WorkerBackend**: Uses Julia's distributed computing capabilities, with workers managed by the job scheduler
2. **HPC Backends**: Submit individual model runs directly as separate jobs to the scheduler

The choice between these approaches depends on your cluster's resource allocation policies and your model's computational requirements.
For more information, see the Backends page.

## WorkerBackend on a Slurm cluster

When using `WorkerBackend` on a Slurm cluster, allocate resources at the top level, since Slurm allows nested resource allocations. Each worker will inherit one task from the Slurm allocation.

```bash
#!/bin/bash
#SBATCH --job-name=slurm_calibration
#SBATCH --output=calibration_%j.out
#SBATCH --time=12:00:00
#SBATCH --ntasks=5
#SBATCH --cpus-per-task=4
#SBATCH --gpus-per-task=1
#SBATCH --mem=8G

# Set environment variables for CliMA
export CLIMACOMMS_DEVICE="CUDA"
export CLIMACOMMS_CONTEXT="SINGLETON"

# Load required modules
module load climacommon

# Build and run the Julia code
julia --project=calibration -e 'using Pkg; Pkg.instantiate(;verbose=true)'
julia --project=calibration calibration_script.jl
```

**Key points:**
- `--ntasks=5`: Requests 5 tasks; each worker gets one task
- `--cpus-per-task=4`: Each worker gets 4 CPU cores
- `--gpus-per-task=1`: Each worker gets 1 GPU
- `%j` in the output file name interpolates the job ID
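
To make the "one worker per task" model concrete, the sketch below launches one Julia worker per task of an allocation like the one above. It uses the standalone SlurmClusterManager.jl package purely as an illustration of the mechanism; it is an assumption that this mirrors how `WorkerBackend` sets up its workers, so treat it as a sketch rather than ClimaCalibrate's internals.

```julia
# Illustrative sketch (assumes the SlurmClusterManager.jl package, which is not
# part of ClimaCalibrate): start one Julia worker per task of the current Slurm
# allocation, i.e. 5 workers for the --ntasks=5 script above.
using Distributed
using SlurmClusterManager

addprocs(SlurmManager())  # reads SLURM_NTASKS and launches workers via srun

# Each worker occupies one task of the allocation (4 CPUs and 1 GPU here).
@everywhere println("Worker $(myid()) running on $(gethostname())")
```
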
## WorkerBackend on a PBS cluster

Since PBS does not support nested resource allocations, request minimal resources for the top-level script. Each worker will acquire its own resource allocation through the `PBSManager`.

```bash
#!/bin/bash
#PBS -N pbs_calibration
#PBS -l walltime=12:00:00
#PBS -l select=1:ncpus=1:mem=2GB

# Set environment variables for CliMA
export CLIMACOMMS_DEVICE="CUDA"
export CLIMACOMMS_CONTEXT="SINGLETON"

# Load required modules
module load climacommon

# Build and run the Julia code
julia --project=calibration -e 'using Pkg; Pkg.instantiate(;verbose=true)'
julia --project=calibration calibration_script.jl
```

**Key points:**
- Requests only 1 CPU core for the main script
- Workers will be launched as separate PBS jobs with their own resource allocations
- Output goes to the PBS default file `pbs_calibration.o<jobid>`, which already includes the job ID (`${PBS_JOBID}`-style variables are not expanded inside `#PBS` directives)
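
For reference, launching workers through the `PBSManager` might look roughly like the sketch below. The constructor and its argument are assumptions made for illustration; consult the ClimaCalibrate.jl API documentation for the actual interface.

```julia
# Hypothetical sketch: the PBSManager constructor and its argument are
# assumptions, not the documented ClimaCalibrate.jl API. Each worker is
# submitted as a separate PBS job with its own resource allocation.
using Distributed
import ClimaCalibrate

addprocs(ClimaCalibrate.PBSManager(5))  # e.g. 5 workers, one PBS job each
```
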
## HPC Backend Approach

HPC backends directly submit individual forward model runs as separate jobs to the scheduler. This approach is ideal when:
- Your forward model requires multiple CPU cores or GPUs
- You need fine-grained control over resource allocation per model run
- Your cluster doesn't support nested allocations

Since each model run receives its own independent resource allocation, only minimal resources are needed for the top-level calibration script.
For a Slurm cluster, here is a minimal submission script:
```bash
#!/bin/bash
#SBATCH --job-name=slurm_calibration
#SBATCH --output=calibration_%j.out
#SBATCH --time=12:00:00
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1

# Load required modules
module load climacommon

# Build and run the Julia code
julia --project=calibration -e 'using Pkg; Pkg.instantiate(;verbose=true)'
julia --project=calibration calibration_script.jl
```
For a PBS cluster, the script in the WorkerBackend section can be reused since it already specifies a minimal resource allocation.

## Resource Configuration

### CPU-Only Jobs

For CPU-only forward models:

```julia
hpc_kwargs = Dict(
    :time => 30,
    :ntasks => 1,
    :cpus_per_task => 8,
    :mem => "16G"
)
```

### GPU Jobs

For GPU-accelerated forward models:

```julia
hpc_kwargs = Dict(
    :time => 60,
    :ntasks => 1,
    :cpus_per_task => 4,
    :gpus_per_task => 1,
    :mem => "32G"
)
```

### Multi-Node Jobs

For models requiring multiple nodes:

```julia
hpc_kwargs = Dict(
    :time => 120,
    :ntasks => 16,
    :cpus_per_task => 4,
    :nodes => 4,
    :mem => "64G"
)
```
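
To make the mapping from `hpc_kwargs` to scheduler flags concrete, here is a small illustrative helper. It is not part of ClimaCalibrate, which performs the actual job submission itself; it only shows how the dictionary keys above correspond to Slurm `#SBATCH` directives.

```julia
# Illustrative only: show how hpc_kwargs entries correspond to Slurm directives.
# ClimaCalibrate performs the actual job submission; this helper just makes the
# key-to-flag mapping explicit.
function slurm_directives(hpc_kwargs::Dict{Symbol, <:Any})
    return ["#SBATCH --$(replace(String(k), "_" => "-"))=$(v)" for (k, v) in hpc_kwargs]
end

hpc_kwargs = Dict(:time => 60, :ntasks => 1, :cpus_per_task => 4, :gpus_per_task => 1, :mem => "32G")
foreach(println, slurm_directives(hpc_kwargs))
# Prints lines such as "#SBATCH --cpus-per-task=4" and "#SBATCH --mem=32G"
# (Dict iteration order is unspecified, so the line order may vary).
```
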
## Environment Variables

Set these environment variables in your submission script:

- `CLIMACOMMS_DEVICE`: Set to `"CUDA"` for GPU runs or `"CPU"` for CPU-only runs
- `CLIMACOMMS_CONTEXT`: Set to `"SINGLETON"` for `WorkerBackend`. The context is automatically set to `"MPI"` for HPC backends
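
For single-process experimentation, the same variables can also be set from inside Julia before ClimaComms constructs its device and context; the shell exports above remain the more robust option, since every process the scheduler launches inherits them.

```julia
# Julia-side equivalent of the shell exports above; must run before any
# ClimaComms device or context is created.
ENV["CLIMACOMMS_DEVICE"] = "CUDA"        # or "CPU" for CPU-only runs
ENV["CLIMACOMMS_CONTEXT"] = "SINGLETON"  # WorkerBackend; HPC backends use "MPI"
```
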
## Troubleshooting

### Common Issues

1. **Worker Timeout**: Increase `ENV["JULIA_WORKER_TIMEOUT"]` in your Julia session if workers are timing out (see the snippet after this list)
2. **Memory Issues**: Monitor memory usage and adjust `--mem` (Slurm) or `-l mem` (PBS) accordingly
3. **GPU Allocation**: Ensure `--gpus-per-task` (Slurm) or `-l select` (PBS) is set correctly
4. **Module Conflicts**: Use `module purge` and ensure your `MODULEPATH` is set before loading required modules
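
For the worker-timeout case (item 1 above), the timeout is read by Distributed.jl from the environment and is given in seconds, so it can be raised like this before the workers are launched:

```julia
# Value is in seconds (Distributed.jl's default is 60); set this before
# workers are added so slow-starting jobs are not dropped.
ENV["JULIA_WORKER_TIMEOUT"] = "300"
```
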
### Debugging Commands

```bash
# Check job status (Slurm)
squeue -u $USER

# Check job status (PBS)
qstat -u $USER

# View job logs
tail -f calibration_<jobid>.out

# Check resource usage
seff <jobid>      # Slurm
qstat -f <jobid>  # PBS
```
