| 1 | +# Slurm Scheduler Support |
| 2 | + |
| 3 | +This document describes the Slurm scheduler support added to asimov. |
| 4 | + |
| 5 | +## Overview |
| 6 | + |
| 7 | +Asimov now has first-class support for the Slurm scheduler, allowing you to use asimov on HPC clusters that use Slurm instead of HTCondor. This support includes: |
| 8 | + |
| 9 | +- **Automatic scheduler detection**: Asimov automatically detects which scheduler is available during `asimov init` |
| 10 | +- **Scheduler abstraction**: All pipelines use a unified scheduler API that works with both HTCondor and Slurm |
| 11 | +- **DAG translation**: HTCondor DAG files are automatically converted to Slurm batch scripts |
| 12 | +- **Monitor daemon**: Periodic monitoring via system cron instead of HTCondor cron |
| 13 | + |
| 14 | +## Installation |
| 15 | + |
| 16 | +To use asimov with Slurm, install the optional Slurm dependencies: |
| 17 | + |
| 18 | +```bash |
| 19 | +pip install "asimov[slurm]" |
| 20 | +``` |
| 21 | + |
| 22 | +Or if you're installing from source: |
| 23 | + |
| 24 | +```bash |
| 25 | +pip install -e ".[slurm]" |
| 26 | +``` |
| 27 | + |
| 28 | +This installs `python-crontab`, which asimov uses to manage the monitor daemon's cron job. |
| 29 | + |
| 30 | +## Getting Started |
| 31 | + |
| 32 | +### Creating a New Project |
| 33 | + |
| 34 | +When you run `asimov init`, asimov automatically detects whether Slurm is available: |
| 35 | + |
| 36 | +```bash |
| 37 | +mkdir my-project |
| 38 | +cd my-project |
| 39 | +asimov init "My Project" |
| 40 | +``` |
| 41 | + |
| 42 | +Asimov checks for `sbatch` and `squeue` commands. If found, it configures the project to use Slurm. |
| 43 | + |
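The detection step described above can be sketched with the standard library alone. This is a minimal illustration, not asimov's actual implementation: the function name `detect_scheduler`, the use of `condor_submit` as the HTCondor probe, and the order of precedence are all assumptions.

```python
import shutil


def detect_scheduler(which=shutil.which):
    """Return "slurm", "htcondor", or None based on commands found on PATH.

    `which` is injectable so the logic can be exercised without a cluster.
    """
    # Slurm: both commands named in the text must be present.
    if which("sbatch") and which("squeue"):
        return "slurm"
    # Assumption: HTCondor is recognised by its standard submit command.
    if which("condor_submit"):
        return "htcondor"
    return None


if __name__ == "__main__":
    print(detect_scheduler())
```

If this returns `None` on a machine where Slurm is installed, the commands are probably not on `PATH` for the current shell, and manual configuration is the simpler route.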
| 44 | +### Manual Configuration |
| 45 | + |
| 46 | +You can also manually configure the scheduler in `.asimov/asimov.conf`: |
| 47 | + |
| 48 | +```ini |
| 49 | +[scheduler] |
| 50 | +type = slurm |
| 51 | + |
| 52 | +[slurm] |
| 53 | +user = your_username |
| | +# Optional: specific partition |
| 54 | +partition = compute |
| | +# Optional: monitor frequency (default: every 15 minutes) |
| 55 | +cron_minute = */15 |
| 56 | +``` |
| 57 | + |
| 58 | +## Using Asimov with Slurm |
| 59 | + |
| 60 | +Once configured, all asimov commands work the same way: |
| 61 | + |
| 62 | +```bash |
| 63 | +# Start the monitor daemon (creates a cron job) |
| 64 | +asimov start |
| 65 | + |
| 66 | +# Stop the monitor daemon (removes the cron job) |
| 67 | +asimov stop |
| 68 | + |
| 69 | +# Build and submit jobs |
| 70 | +asimov manage build |
| 71 | +asimov manage submit |
| 72 | + |
| 73 | +# Monitor jobs |
| 74 | +asimov monitor |
| 75 | +``` |
| 76 | + |
| 77 | +## How It Works |
| 78 | + |
| 79 | +### Job Submission |
| 80 | + |
| 81 | +When you submit jobs with Slurm, asimov: |
| 82 | + |
| 83 | +1. Creates a Slurm batch script from your job description |
| 84 | +2. Submits the script using `sbatch` |
| 85 | +3. Returns the Slurm job ID for tracking |
| 86 | + |
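The three submission steps above can be sketched as follows. This is a hedged illustration rather than asimov's code: `render_batch_script`, `parse_job_id`, and `submit` are hypothetical names, and the defaults are arbitrary. The `#SBATCH` options and the `Submitted batch job <id>` output line are standard Slurm behaviour.

```python
import re
import subprocess


def render_batch_script(executable, arguments="", output="out.log", error="err.log",
                        cpus=1, memory="4G", partition=None, job_name="asimov-job"):
    """Step 1: turn a job description into the text of a Slurm batch script."""
    lines = [
        "#!/bin/bash",
        f"#SBATCH --job-name={job_name}",
        f"#SBATCH --output={output}",
        f"#SBATCH --error={error}",
        f"#SBATCH --cpus-per-task={cpus}",
        f"#SBATCH --mem={memory}",
    ]
    if partition:
        lines.append(f"#SBATCH --partition={partition}")
    lines += ["", f"{executable} {arguments}".strip(), ""]
    return "\n".join(lines)


def parse_job_id(sbatch_stdout):
    """Step 3: extract the numeric job ID from sbatch's confirmation line."""
    match = re.search(r"Submitted batch job (\d+)", sbatch_stdout)
    if match is None:
        raise ValueError(f"could not find a job ID in: {sbatch_stdout!r}")
    return int(match.group(1))


def submit(script_path):
    """Step 2: hand the script to sbatch (requires a working Slurm install)."""
    result = subprocess.run(["sbatch", str(script_path)],
                            capture_output=True, text=True, check=True)
    return parse_job_id(result.stdout)
```

Keeping rendering and parsing separate from the `sbatch` call means both can be checked on a machine without Slurm.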
| 87 | +### DAG Translation |
| 88 | + |
| 89 | +Pipelines like bilby, bayeswave, and lalinference generate HTCondor DAG files. When using Slurm, asimov automatically: |
| 90 | + |
| 91 | +1. Parses the HTCondor DAG file |
| 92 | +2. Identifies job dependencies (PARENT-CHILD relationships) |
| 93 | +3. Converts to a Slurm batch script with `--dependency` flags |
| 94 | +4. Submits the workflow using `sbatch` |
| 95 | + |
| 96 | +This allows existing pipelines to work seamlessly with Slurm without modification. |
| 97 | + |
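To make the translation concrete, here is a minimal sketch that handles only `JOB` and `PARENT ... CHILD` lines and emits one `sbatch` command per node. It is not asimov's translator. The function names are hypothetical, and it assumes each DAG node has already been converted to a batch script named `<node>.sh`. The `--parsable` and `--dependency=afterok:` options are standard `sbatch` flags.

```python
from graphlib import TopologicalSorter  # Python 3.9+


def parse_dag(text):
    """Return {node: set_of_parent_nodes} from JOB and PARENT ... CHILD lines."""
    parents = {}
    for line in text.splitlines():
        tokens = line.split()
        if not tokens:
            continue
        keyword = tokens[0].upper()
        if keyword == "JOB":
            parents.setdefault(tokens[1], set())
        elif keyword == "PARENT":
            split = [t.upper() for t in tokens].index("CHILD")
            parent_names = tokens[1:split]
            for name in parent_names:
                parents.setdefault(name, set())
            for child in tokens[split + 1:]:
                parents.setdefault(child, set()).update(parent_names)
    return parents


def to_sbatch_commands(parents):
    """Emit shell lines that submit every node after the nodes it depends on."""
    order = list(TopologicalSorter(parents).static_order())
    # Shell-safe variable names, since DAG node names may contain dots or dashes.
    var = {node: f"JOB_{i}" for i, node in enumerate(order)}
    commands = []
    for node in order:
        dependency = ""
        if parents[node]:
            ids = ":".join(f"${{{var[p]}}}" for p in sorted(parents[node]))
            dependency = f" --dependency=afterok:{ids}"
        commands.append(f"{var[node]}=$(sbatch --parsable{dependency} {node}.sh)")
    return commands
```

With `afterok`, a child starts only if every parent exits successfully, which is the closest match to the default behaviour of a DAG edge. Other HTCondor DAG keywords (for example `RETRY` or `SCRIPT`) are ignored by this sketch.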
| 98 | +### Monitor Daemon |
| 99 | + |
| 100 | +With HTCondor, `asimov start` submits a recurring job via HTCondor's cron functionality. |
| 101 | + |
| 102 | +With Slurm, `asimov start`: |
| 103 | +- Creates a system cron job that runs `asimov monitor --chain` periodically |
| 104 | +- Uses `python-crontab` to manage the cron job automatically |
| 105 | +- Falls back to manual cron setup if `python-crontab` is not available |
| 106 | + |
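The behaviour above, including the fallback, might look roughly like the sketch below. The function names and the `asimov-monitor` comment tag are assumptions made for illustration. The `CronTab` calls (`new`, `setall`, `remove_all`, `write`) are part of `python-crontab`.

```python
import shlex


def monitor_command(project_dir):
    """Shell command the cron job runs; paths are relative to the project."""
    return (
        f"cd {shlex.quote(str(project_dir))} && asimov monitor --chain "
        ">> .asimov/asimov_cron.out 2>> .asimov/asimov_cron.err"
    )


def cron_line(project_dir, cron_minute="*/15"):
    """Full crontab entry, suitable for pasting into `crontab -e`."""
    return f"{cron_minute} * * * * {monitor_command(project_dir)}"


def install_cron_job(project_dir, cron_minute="*/15", comment="asimov-monitor"):
    """Install the entry via python-crontab, or print it for manual setup."""
    try:
        from crontab import CronTab  # provided by the python-crontab package
    except ImportError:
        print("python-crontab not found; add this line with `crontab -e`:")
        print(cron_line(project_dir, cron_minute))
        return False
    cron = CronTab(user=True)
    cron.remove_all(comment=comment)  # avoid duplicate entries on restart
    job = cron.new(command=monitor_command(project_dir), comment=comment)
    job.setall(f"{cron_minute} * * * *")
    cron.write()
    return True
```

Tagging the entry with a comment lets a later stop step remove exactly that job without touching the rest of the user's crontab.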
| 107 | +## Switching Between Schedulers |
| 108 | + |
| 109 | +To switch from HTCondor to Slurm (or vice versa): |
| 110 | + |
| 111 | +1. Update the `[scheduler]` section in `.asimov/asimov.conf`: |
| 112 | + |
| 113 | +```ini |
| 114 | +# Switch from HTCondor to Slurm |
| 115 | +[scheduler] |
| 116 | +type = slurm |
| 117 | +``` |
| 118 | + |
| 119 | +2. Stop any running monitor daemon: |
| 120 | + |
| 121 | +```bash |
| 122 | +asimov stop |
| 123 | +``` |
| 124 | + |
| 125 | +3. Start the monitor with the new scheduler: |
| 126 | + |
| 127 | +```bash |
| 128 | +asimov start |
| 129 | +``` |
| 130 | + |
| 131 | +All existing job data remains compatible; only new jobs will use the new scheduler. |
| 132 | + |
| 133 | +## Limitations |
| 134 | + |
| 135 | +- **DAG complexity**: Very complex DAG files with advanced HTCondor features may not translate perfectly. Simple DAGs with job dependencies work well. |
| 136 | +- **Job status mapping**: Slurm job states are mapped to HTCondor-like status codes for compatibility, but some nuances may be lost. |
| 137 | +- **Resource specifications**: Some HTCondor-specific resource requirements may not have direct Slurm equivalents. |
| 138 | + |
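To illustrate the job status limitation, the sketch below maps Slurm state names onto HTCondor `JobStatus` integers. The particular mapping is an assumption chosen for illustration and may differ from what asimov does. The Slurm state names and the HTCondor codes (1 idle, 2 running, 3 removed, 4 completed, 5 held, 7 suspended) are the standard ones.

```python
# HTCondor JobStatus codes.
IDLE, RUNNING, REMOVED, COMPLETED, HELD, SUSPENDED = 1, 2, 3, 4, 5, 7

# Hypothetical mapping: several distinct Slurm failure states collapse to HELD.
SLURM_TO_CONDOR = {
    "PENDING": IDLE,
    "CONFIGURING": IDLE,
    "RUNNING": RUNNING,
    "COMPLETING": RUNNING,
    "SUSPENDED": SUSPENDED,
    "COMPLETED": COMPLETED,
    "CANCELLED": REMOVED,
    "FAILED": HELD,
    "TIMEOUT": HELD,
    "NODE_FAIL": HELD,
    "OUT_OF_MEMORY": HELD,
}


def condor_status(slurm_state):
    """Map a Slurm state string to an HTCondor-like code, or None if unknown."""
    words = slurm_state.split()
    if not words:
        return None
    # sacct can report e.g. "CANCELLED by 1000" or "CANCELLED+".
    key = words[0].rstrip("+").upper()
    return SLURM_TO_CONDOR.get(key)
```

In this mapping `FAILED`, `TIMEOUT`, `NODE_FAIL`, and `OUT_OF_MEMORY` all become the same code, so the reason for a failure has to be recovered from Slurm itself (for example with `sacct`).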
| 139 | +## Troubleshooting |
| 140 | + |
| 141 | +### Scheduler not detected |
| 142 | + |
| 143 | +If asimov doesn't detect Slurm automatically: |
| 144 | + |
| 145 | +1. Verify Slurm is installed: `which sbatch squeue` |
| 146 | +2. Manually configure in `.asimov/asimov.conf` |
| 147 | + |
| 148 | +### Cron job not created |
| 149 | + |
| 150 | +If `asimov start` fails to create a cron job: |
| 151 | + |
| 152 | +1. Install python-crontab: `pip install python-crontab` |
| 153 | +2. Or manually add to crontab: `crontab -e` |
| 154 | + |
| 155 | +```cron |
| 156 | +*/15 * * * * cd /path/to/project && asimov monitor --chain >> .asimov/asimov_cron.out 2>> .asimov/asimov_cron.err |
| 157 | +``` |
| 158 | + |
| 159 | +### Job submission fails |
| 160 | + |
| 161 | +If job submission fails: |
| 162 | + |
| 163 | +1. Check Slurm is working: `sinfo` |
| 164 | +2. Verify partition exists: `sinfo -o "%P"` |
| 165 | +3. Check the job logs in the `.asimov/` directory |
| 166 | + |
| 167 | +## Developer Information |
| 168 | + |
| 169 | +### Scheduler API |
| 170 | + |
| 171 | +The scheduler abstraction is defined in `asimov/scheduler.py`: |
| 172 | + |
| 173 | +```python |
| 174 | +from asimov.scheduler import get_scheduler |
| 175 | + |
| 176 | +# Get a scheduler instance |
| 177 | +scheduler = get_scheduler("slurm", partition="compute") |
| 178 | + |
| 179 | +# Submit a job |
| 180 | +from asimov.scheduler import JobDescription |
| 181 | +job = JobDescription( |
| 182 | + executable="/bin/echo", |
| 183 | + output="out.log", |
| 184 | + error="err.log", |
| 185 | + log="job.log", |
| 186 | + cpus=4, |
| 187 | + memory="8GB" |
| 188 | +) |
| 189 | +cluster_id = scheduler.submit(job) |
| 190 | + |
| 191 | +# Submit a DAG |
| 192 | +cluster_id = scheduler.submit_dag("workflow.dag", batch_name="my-analysis") |
| 193 | + |
| 194 | +# Query jobs |
| 195 | +jobs = scheduler.query_all_jobs() |
| 196 | + |
| 197 | +# Delete a job |
| 198 | +scheduler.delete(cluster_id) |
| 199 | +``` |
| 200 | + |
| 201 | +### Pipeline Integration |
| 202 | + |
| 203 | +Pipelines access the scheduler via `self.scheduler`: |
| 204 | + |
| 205 | +```python |
| 206 | +from asimov.pipeline import Pipeline |
| 207 | + |
| 208 | +class MyPipeline(Pipeline): |
| 209 | + def submit_dag(self): |
| 210 | + # Scheduler is automatically configured |
| 211 | + cluster_id = self.scheduler.submit_dag( |
| 212 | + dag_file=self.dag_file, |
| 213 | + batch_name=f"{self.production.name}" |
| 214 | + ) |
| 215 | + return cluster_id |
| 216 | +``` |
| 217 | + |
| 218 | +## Testing |
| 219 | + |
| 220 | +Unit tests for the scheduler abstraction live in `tests/test_scheduler.py`: |
| 221 | + |
| 222 | +```bash |
| 223 | +# Run unit tests |
| 224 | +python -m unittest tests.test_scheduler |
| 225 | + |
| 226 | +# Run integration tests (requires Slurm) |
| 227 | +# See .github/workflows/slurm-tests.yml |
| 228 | +``` |
| 229 | + |
| 230 | +## Contributing |
| 231 | + |
| 232 | +When adding new scheduler features: |
| 233 | + |
| 234 | +1. Add the feature to the base `Scheduler` class |
| 235 | +2. Implement for both `HTCondor` and `Slurm` |
| 236 | +3. Add tests in `tests/test_scheduler.py` |
| 237 | +4. Update documentation |
| 238 | + |
| 239 | +## References |
| 240 | + |
| 241 | +- [Slurm Documentation](https://slurm.schedmd.com/) |
| 242 | +- [HTCondor Documentation](https://htcondor.readthedocs.io/) |
| 243 | +- [Asimov Documentation](https://asimov.docs.ligo.org/) |