Commit 4dca5b7: Add comprehensive Slurm support documentation and user guide
Co-authored-by: transientlunatic <4365778+transientlunatic@users.noreply.github.com>

docs/SLURM_SUPPORT.md (243 additions, 0 deletions)
# Slurm Scheduler Support

This document describes the Slurm scheduler support added to asimov.

## Overview

Asimov now has first-class support for the Slurm scheduler, allowing you to use asimov on HPC clusters that use Slurm instead of HTCondor. This support includes:

- **Automatic scheduler detection**: Asimov automatically detects which scheduler is available during `asimov init`
- **Scheduler abstraction**: All pipelines use a unified scheduler API that works with both HTCondor and Slurm
- **DAG translation**: HTCondor DAG files are automatically converted to Slurm batch scripts
- **Monitor daemon**: Periodic monitoring via system cron instead of HTCondor cron
## Installation

To use asimov with Slurm, install the optional Slurm dependencies:

```bash
pip install "asimov[slurm]"
```

Or, if you are installing from source:

```bash
pip install -e ".[slurm]"
```

This installs `python-crontab`, which is used by the monitor daemon.
## Getting Started

### Creating a New Project

When you run `asimov init`, asimov automatically detects whether Slurm is available:

```bash
mkdir my-project
cd my-project
asimov init "My Project"
```

Asimov checks for the `sbatch` and `squeue` commands. If both are found, it configures the project to use Slurm.
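The detection logic can be sketched with Python's `shutil.which`; the function name, the fallback order, and the `which` parameter (added so the sketch is testable) are illustrative, not asimov's actual code:

```python
import shutil


def detect_scheduler(which=shutil.which):
    """Guess the available scheduler, in the spirit of `asimov init`.

    Returns "slurm" if sbatch and squeue are both on PATH, "htcondor"
    if condor_submit is, and None if neither scheduler is found.
    """
    if which("sbatch") and which("squeue"):
        return "slurm"
    if which("condor_submit"):
        return "htcondor"
    return None
```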
### Manual Configuration

You can also configure the scheduler manually in `.asimov/asimov.conf`:

```ini
[scheduler]
type = slurm

[slurm]
user = your_username
partition = compute    # Optional: specific partition
cron_minute = */15     # Optional: monitor frequency (default: every 15 minutes)
```
## Using Asimov with Slurm

Once configured, all asimov commands work the same way:

```bash
# Start the monitor daemon (creates a cron job)
asimov start

# Stop the monitor daemon (removes the cron job)
asimov stop

# Build and submit jobs
asimov manage build
asimov manage submit

# Monitor jobs
asimov monitor
```
## How It Works

### Job Submission

When you submit jobs with Slurm, asimov:

1. Creates a Slurm batch script from your job description
2. Submits the script using `sbatch`
3. Returns the Slurm job ID for tracking
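Step 1 can be sketched as a small renderer that turns a job description into `#SBATCH` directives. The helper below is hypothetical (asimov's actual template will differ); the directive names are standard Slurm options:

```python
def render_batch_script(job):
    """Render a minimal sbatch script from a job-description dict.

    Only directives with a value are emitted; the executable line
    comes last, as in any batch script.
    """
    directives = {
        "--job-name": job.get("name"),
        "--output": job.get("output"),
        "--error": job.get("error"),
        "--cpus-per-task": job.get("cpus"),
        "--mem": job.get("memory"),
        "--partition": job.get("partition"),
    }
    lines = ["#!/bin/bash"]
    lines += [f"#SBATCH {flag}={value}"
              for flag, value in directives.items() if value is not None]
    lines.append(f"{job['executable']} {job.get('arguments', '')}".strip())
    return "\n".join(lines)
```

Submitting the rendered script with `sbatch` prints a job ID, which asimov then records for tracking.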
### DAG Translation

Pipelines like bilby, bayeswave, and lalinference generate HTCondor DAG files. When using Slurm, asimov automatically:

1. Parses the HTCondor DAG file
2. Identifies job dependencies (PARENT-CHILD relationships)
3. Converts to a Slurm batch script with `--dependency` flags
4. Submits the workflow using `sbatch`

This allows existing pipelines to work seamlessly with Slurm without modification.
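The dependency-extraction step (2) can be illustrated with a small parser. This is a sketch that only handles plain `PARENT ... CHILD ...` lines, not the full DAG grammar, and the function name is made up for this example:

```python
import re


def dag_edges(dag_text):
    """Extract (parent, child) pairs from the PARENT/CHILD lines of a DAG."""
    edges = []
    for line in dag_text.splitlines():
        match = re.match(r"\s*PARENT\s+(.+?)\s+CHILD\s+(.+)", line, re.IGNORECASE)
        if match:
            parents = match.group(1).split()
            children = match.group(2).split()
            edges += [(p, c) for p in parents for c in children]
    return edges
```

Each child job would then be submitted with something like `sbatch --dependency=afterok:<parent-job-id>`.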
### Monitor Daemon

With HTCondor, `asimov start` submits a recurring job via HTCondor's cron functionality.

With Slurm, `asimov start`:

- Creates a system cron job that runs `asimov monitor --chain` periodically
- Uses `python-crontab` to manage the cron job automatically
- Falls back to manual cron setup if `python-crontab` is not available
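The installed crontab entry can be sketched as a small helper. The function name is hypothetical; the entry format follows the manual cron example shown under Troubleshooting, with the schedule field coming from `cron_minute` in the configuration:

```python
def monitor_cron_line(project_dir, minute="*/15"):
    """Build the crontab line that the monitor daemon relies on (illustrative).

    Output and errors are appended to log files inside the project's
    .asimov/ directory.
    """
    command = (
        f"cd {project_dir} && asimov monitor --chain "
        f">> .asimov/asimov_cron.out 2>> .asimov/asimov_cron.err"
    )
    return f"{minute} * * * * {command}"
```

When `python-crontab` is available, asimov writes this entry to the user's crontab itself; otherwise you add it by hand with `crontab -e`.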
## Switching Between Schedulers

To switch from HTCondor to Slurm (or vice versa):

1. Update the `[scheduler]` section in `.asimov/asimov.conf`:

```ini
# Switch from HTCondor to Slurm
[scheduler]
type = slurm
```

2. Stop any running monitor daemon:

```bash
asimov stop
```

3. Start the monitor with the new scheduler:

```bash
asimov start
```

All existing job data remains compatible; only new jobs will use the new scheduler.
## Limitations

- **DAG complexity**: Very complex DAG files that use advanced HTCondor features may not translate perfectly. Simple DAGs with job dependencies work well.
- **Job status mapping**: Slurm job states are mapped to HTCondor-like status codes for compatibility, but some nuances may be lost.
- **Resource specifications**: Some HTCondor-specific resource requirements may not have direct Slurm equivalents.
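The status mapping can be pictured as a lookup table. The codes below are HTCondor's standard JobStatus values (1 = Idle, 2 = Running, 3 = Removed, 4 = Completed, 5 = Held), but the exact table asimov uses is an assumption here:

```python
# Hypothetical Slurm-state -> HTCondor JobStatus mapping (illustrative only).
SLURM_TO_CONDOR_STATUS = {
    "PENDING":   1,  # Idle
    "RUNNING":   2,  # Running
    "CANCELLED": 3,  # Removed
    "COMPLETED": 4,  # Completed
    "FAILED":    4,  # also "Completed"; the exit code carries the failure
    "SUSPENDED": 5,  # closest analogue to Held
}


def condor_status(slurm_state):
    """Map a Slurm state name to an HTCondor-like status code (default: Idle)."""
    return SLURM_TO_CONDOR_STATUS.get(slurm_state.upper(), 1)
```

This is the kind of nuance the limitation above refers to: Slurm distinguishes `FAILED` from `COMPLETED` as separate states, while HTCondor folds both into status 4.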
## Troubleshooting

### Scheduler not detected

If asimov doesn't detect Slurm automatically:

1. Verify Slurm is installed: `which sbatch squeue`
2. Configure the scheduler manually in `.asimov/asimov.conf`

### Cron job not created

If `asimov start` fails to create a cron job:

1. Install python-crontab: `pip install python-crontab`
2. Or add the entry to your crontab manually (`crontab -e`):

```cron
*/15 * * * * cd /path/to/project && asimov monitor --chain >> .asimov/asimov_cron.out 2>> .asimov/asimov_cron.err
```

### Job submission fails

If job submission fails:

1. Check that Slurm is working: `sinfo`
2. Verify that the partition exists: `sinfo -o "%P"`
3. Check the job logs in the `.asimov/` directory
## Developer Information

### Scheduler API

The scheduler abstraction is defined in `asimov/scheduler.py`:

```python
from asimov.scheduler import JobDescription, get_scheduler

# Get a scheduler instance
scheduler = get_scheduler("slurm", partition="compute")

# Submit a job
job = JobDescription(
    executable="/bin/echo",
    output="out.log",
    error="err.log",
    log="job.log",
    cpus=4,
    memory="8GB",
)
cluster_id = scheduler.submit(job)

# Submit a DAG
cluster_id = scheduler.submit_dag("workflow.dag", batch_name="my-analysis")

# Query jobs
jobs = scheduler.query_all_jobs()

# Delete a job
scheduler.delete(cluster_id)
```
### Pipeline Integration

Pipelines access the scheduler via `self.scheduler`:

```python
from asimov.pipeline import Pipeline

class MyPipeline(Pipeline):
    def submit_dag(self):
        # The scheduler is configured automatically
        cluster_id = self.scheduler.submit_dag(
            dag_file=self.dag_file,
            batch_name=self.production.name,
        )
        return cluster_id
```
## Testing

Comprehensive tests are included:

```bash
# Run unit tests
python -m unittest tests.test_scheduler

# Run integration tests (requires Slurm)
# See .github/workflows/slurm-tests.yml
```
## Contributing

When adding new scheduler features:

1. Add the feature to the base `Scheduler` class
2. Implement it for both `HTCondor` and `Slurm`
3. Add tests in `tests/test_scheduler.py`
4. Update the documentation
## References

- [Slurm Documentation](https://slurm.schedmd.com/)
- [HTCondor Documentation](https://htcondor.readthedocs.io/)
- [Asimov Documentation](https://asimov.docs.ligo.org/)
