# SLURM Job Generator

Automated generation of optimized SLURM submission scripts for running bayesDREAM on HPC clusters, with specific support for Berzelius (NSC, Sweden).
- Overview
- Quick Start
- Installation
- Usage Guide
- Generated Scripts
- Submitting Jobs
- Monitoring Jobs
- Advanced Usage
- Troubleshooting
- API Reference
## Overview

The SLURM Job Generator (`bayesDREAM.slurm_jobgen`) automates the creation of SLURM submission scripts for large-scale bayesDREAM analyses. It:
- Analyzes your dataset (features, cells, sparsity, technical groups)
- Estimates memory requirements (RAM and VRAM) using `memory_calculator.py`
- Selects optimal resources (GPU fat/thin nodes or CPU partition)
- Estimates wall-clock time based on dataset size and complexity
- Generates SLURM scripts with job dependencies and array parallelization
- Creates documentation (a README.md with monitoring commands)
Without the generator, you must:
- Manually calculate memory needs for each step
- Guess appropriate time limits
- Write SLURM scripts from scratch
- Set up job dependencies manually
- Risk over-allocating (wasted resources) or under-allocating (job failure)
With the generator:
- One function call generates all scripts
- Automatic resource optimization
- Built-in job dependencies
- Job arrays for parallelization
- Throttling to prevent cluster overload
## Quick Start

```python
import pandas as pd
from bayesDREAM.slurm_jobgen import SlurmJobGenerator

# Load data
meta = pd.read_csv('meta.csv')
counts = pd.read_csv('counts.csv', index_col=0)  # genes × cells

# Save to a cluster-accessible location
data_path = "/proj/berzelius-aiics-real/users/x_learo/data/run1"
meta.to_csv(f"{data_path}/meta.csv", index=False)
counts.to_csv(f"{data_path}/counts.csv")

# Create generator
gen = SlurmJobGenerator(
    meta=meta,
    counts=counts,
    cis_genes=['GFI1B', 'TET2', 'MYB', 'NFE2'],
    output_dir='./slurm_jobs',
    label='perturb_seq_batch1',
    # Paths (adjust for your cluster)
    python_env='/proj/.../mambaforge/envs/pyroenv/bin/python',
    bayesdream_path='/proj/.../bayesDREAM',
    data_path=data_path,
)

# Generate all scripts
gen.generate_all_scripts()
```

```bash
# Transfer to Berzelius
scp -r slurm_jobs/ berzelius:/proj/.../

# On Berzelius
cd slurm_jobs
bash submit_all.sh

# Monitor
squeue -u $USER
tail -f logs/tech_*.out
```

## Installation

The SLURM job generator is included in bayesDREAM; no additional installation is required.
Requirements:
- bayesDREAM installed with dependencies
- Access to an HPC cluster (Berzelius or a similar SLURM-based system)
- A Python environment on the cluster with bayesDREAM installed
## Usage Guide

```python
from bayesDREAM.slurm_jobgen import SlurmJobGenerator

generator = SlurmJobGenerator(
    # Required
    meta=meta_dataframe,          # Cell metadata (cell, guide, target, cell_line)
    counts=counts_matrix,         # Gene expression (genes × cells)
    cis_genes=['GFI1B', 'TET2'],  # List of cis genes to fit
    output_dir='./slurm_jobs',    # Where to write scripts
    label='my_experiment',        # Unique identifier

    # Cluster paths
    python_env='/path/to/pyroenv/bin/python',
    bayesdream_path='/path/to/bayesDREAM',
    data_path='/path/to/data',    # Where meta.csv and counts.csv are saved

    # Optional (see sections below)
    low_moi=True,
    use_all_cells_technical=False,
    partition_preference='auto',
    max_concurrent_jobs=50,
    time_multiplier=1.0,
)

# Generate scripts
generator.generate_all_scripts()
```

The generator supports different experimental designs:
### Low MOI (default)

One guide per cell, with a clear NTC population:

```python
gen = SlurmJobGenerator(
    ...,
    low_moi=True,  # Default
    use_all_cells_technical=False,
)
```

Job structure:
- fit_technical: 1 job (all NTC cells)
- fit_cis: N jobs (1 per cis gene)
- fit_trans: N jobs (1 per cis gene)

Total jobs: 1 + 2N
### High MOI, technical fit on all cells

Multiple guides per cell, technical effects independent of perturbations:

```python
gen = SlurmJobGenerator(
    ...,
    low_moi=False,
    use_all_cells_technical=True,  # Fit technical on ALL cells
)
```

Job structure:
- fit_technical: 1 job (all cells, fit once for all cis genes)
- fit_cis: N jobs (1 per cis gene)
- fit_trans: N jobs (1 per cis gene)

Total jobs: 1 + 2N

Benefit: the same total number of jobs as low MOI, but fit_technical uses all of the data.
When to use:
- Technical variation is batch- or lane-specific
- Technical effects don't correlate with perturbation effects
- You want to maximize statistical power for the technical correction
When NOT to use:
- Technical groups (e.g., CRISPRi vs CRISPRa) correlate with cis gene expression
- See the warning in the `fit_technical` documentation
### High MOI, NTC-based technical fit per gene

Multiple guides per cell, but NTC-based technical correction is fit per gene:

```python
gen = SlurmJobGenerator(
    ...,
    low_moi=False,
    use_all_cells_technical=False,  # Fit technical on NTC cells per gene
)
```

Job structure:
- fit_technical: N jobs (NTC cells, 1 per cis gene)
- fit_cis: N jobs (1 per cis gene)
- fit_trans: N jobs (1 per cis gene)

Total jobs: 3N
When to use:
- High MOI, but you want a conservative NTC-only technical correction
- Technical effects may vary by cis gene
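The job totals for the three designs follow directly from the structures above. A tiny illustrative sketch (not part of the package; the function name is hypothetical):

```python
def total_jobs(n_cis_genes: int, low_moi: bool, use_all_cells_technical: bool) -> int:
    """Total SLURM jobs for the three experimental designs described above."""
    if low_moi or use_all_cells_technical:
        # One shared fit_technical job, plus fit_cis and fit_trans per gene
        return 1 + 2 * n_cis_genes
    # NTC-per-gene technical fit: three jobs per cis gene
    return 3 * n_cis_genes

print(total_jobs(4, low_moi=True, use_all_cells_technical=False))   # 9
print(total_jobs(4, low_moi=False, use_all_cells_technical=False))  # 12
```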
### Berzelius node types

The generator automatically selects optimal resources for Berzelius:

| Node Type | Constraint | GPUs per Node | VRAM per GPU | RAM per GPU | Best For |
|---|---|---|---|---|---|
| Fat | `-C fat` | 8 | 10 GB | 128 GB | Standard fitting |
| Thin | `-C thin` | 8 | 5 GB | 64 GB | Small datasets |
| CPU | `--partition=berzelius-cpu` | 0 | 0 | 7.76 GB/core | Extremely large datasets, fit_cis |

Note: a full node has 8 GPUs. The generator will use up to 8 GPUs before switching to CPU.
The generator analyzes memory requirements and selects resources as follows.

For fit_technical and fit_trans (same logic):

```
If VRAM ≤ 5 GB and RAM ≤ 64 GB:
    → 1 thin GPU ✓ (most efficient for small datasets)
Elif VRAM ≤ 10 GB and RAM ≤ 128 GB:
    → 1 fat GPU ✓ (most common case)
Elif VRAM ≤ 20 GB and RAM ≤ 256 GB:
    → 2 fat GPUs
Elif VRAM ≤ 40 GB and RAM ≤ 512 GB:
    → 4 fat GPUs
Elif VRAM ≤ 80 GB and RAM ≤ 1024 GB:
    → 8 fat GPUs (full node)
Else:
    → CPU partition (dataset too large for 8 GPUs)
```

For fit_cis, the default is the CPU partition (it doesn't need a GPU for most datasets).

Goal: use thin nodes for small datasets, scale up to a full 8-GPU node, and fall back to CPU only beyond that.
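The tiered selection above can be sketched in a few lines. This is an illustrative re-implementation, not the package's code (the real logic lives in `_recommend_resources()`, and the dictionary keys here are made up):

```python
def pick_resources(vram_gb: float, ram_gb: float) -> dict:
    """Illustrative version of the tiered VRAM/RAM selection described above."""
    tiers = [
        (5,  64,   {"constraint": "thin", "gpus": 1}),
        (10, 128,  {"constraint": "fat",  "gpus": 1}),
        (20, 256,  {"constraint": "fat",  "gpus": 2}),
        (40, 512,  {"constraint": "fat",  "gpus": 4}),
        (80, 1024, {"constraint": "fat",  "gpus": 8}),
    ]
    for max_vram, max_ram, resources in tiers:
        if vram_gb <= max_vram and ram_gb <= max_ram:
            return resources
    return {"partition": "berzelius-cpu", "gpus": 0}  # too large for 8 GPUs

print(pick_resources(7.1, 5.7))  # 1 fat GPU: VRAM > 5 GB rules out thin nodes
```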
To force specific resources:

```python
# Force CPU for all steps
gen = SlurmJobGenerator(
    ...,
    partition_preference='cpu',
)

# Force fat nodes
gen = SlurmJobGenerator(
    ...,
    partition_preference='fat',
)

# Auto-select (default, recommended)
gen = SlurmJobGenerator(
    ...,
    partition_preference='auto',
)
```

### Time estimation

The generator automatically estimates wall-clock time based on:
- Dataset size (T × N)
- Guide type (AutoIAFNormal vs AutoNormal)
- Step complexity
| Step | Base Time | Scales With |
|---|---|---|
| fit_technical | 1.5 hours | T × N (×2 if AutoNormal) |
| fit_cis | 0.5 hours/gene | N |
| fit_trans | 3.0 hours/gene | T × N |

All estimates include a 1.5× safety margin to reduce timeout risk.
Small dataset (5K genes, 10K cells):
- fit_technical: ~0.5 hours
- fit_cis: ~0.3 hours/gene
- fit_trans: ~0.8 hours/gene
Large dataset (50K genes, 100K cells):
- fit_technical: ~9 hours (if AutoNormal)
- fit_cis: ~1.5 hours/gene
- fit_trans: ~20 hours/gene
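The shape of the estimate (base time × dataset scale × guide penalty × safety margin × user multiplier) can be sketched as follows. Only the 2× AutoNormal penalty and the 1.5× margin come from the text above; the scale factor and its reference point are assumptions of this sketch:

```python
def estimate_hours(base_hours: float, scale: float,
                   auto_normal: bool = False, time_multiplier: float = 1.0) -> float:
    """Sketch of the wall-clock estimate: base time for the step, scaled by a
    dataset-size factor (e.g. T*N relative to a reference dataset), doubled
    for AutoNormal, with a 1.5x safety margin and the user multiplier."""
    hours = base_hours * scale
    if auto_normal:
        hours *= 2.0
    return hours * 1.5 * time_multiplier

# fit_technical at the reference scale with AutoNormal:
print(estimate_hours(1.5, 1.0, auto_normal=True))  # 4.5
```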
To scale the time estimates:

```python
# Conservative (2× longer)
gen = SlurmJobGenerator(
    ...,
    time_multiplier=2.0,
)

# Aggressive (if you know your data converges fast)
gen = SlurmJobGenerator(
    ...,
    time_multiplier=0.7,
)

# Default (recommended)
gen = SlurmJobGenerator(
    ...,
    time_multiplier=1.0,
)
```

Dataset size matters more than intuition suggests: a 50K-gene dataset takes roughly 10× longer than a 5K-gene one. The generator's estimates are based on empirical scaling laws.
## Generated Scripts

```
slurm_jobs/
├── 01_fit_technical.sh    # Technical fitting
├── 02_fit_cis.sh          # Cis effect fitting (job array)
├── 03_fit_trans.sh        # Trans effect fitting (job array)
├── submit_all.sh          # Master submission script
├── README.md              # Complete documentation
└── logs/                  # Created automatically
    ├── tech_*.out
    ├── cis_*_*.out
    └── trans_*_*.out
```
01_fit_technical.sh, a single job (low MOI, or high MOI with use_all_cells_technical=True):

```bash
#!/bin/bash
#SBATCH --job-name=perturb_seq_batch1_tech
#SBATCH --time=02:15:00
#SBATCH --partition=berzelius
#SBATCH -C fat
#SBATCH --gpus=1
#SBATCH --mem=11G
# Run fit_technical on NTC cells (or all cells if high MOI)
```

As a job array (high MOI with NTC per gene):

```bash
#SBATCH --array=0-3%50  # 4 cis genes, max 50 concurrent
```

02_fit_cis.sh, a job array with 1 task per cis gene:

```bash
#!/bin/bash
#SBATCH --job-name=perturb_seq_batch1_cis
#SBATCH --array=0-3%50  # 4 cis genes
#SBATCH --time=00:30:00
#SBATCH --partition=berzelius-cpu
#SBATCH --cpus-per-task=1
#SBATCH --mem=9G
# Array: CIS_GENES=(GFI1B TET2 MYB NFE2)
# CIS_GENE=${CIS_GENES[$SLURM_ARRAY_TASK_ID]}
```

03_fit_trans.sh, a job array with 1 task per cis gene:

```bash
#!/bin/bash
#SBATCH --job-name=perturb_seq_batch1_trans
#SBATCH --array=0-3%50  # 4 cis genes
#SBATCH --time=03:00:00
#SBATCH --partition=berzelius
#SBATCH -C fat
#SBATCH --gpus=1
#SBATCH --mem=18G
```

submit_all.sh, the master script with dependencies:

```bash
#!/bin/bash
# Submit fit_technical
TECH_JOB=$(sbatch --parsable 01_fit_technical.sh)

# Submit fit_cis (depends on technical)
CIS_JOB=$(sbatch --parsable --dependency=afterok:$TECH_JOB 02_fit_cis.sh)

# Submit fit_trans (depends on cis)
TRANS_JOB=$(sbatch --parsable --dependency=afterok:$CIS_JOB 03_fit_trans.sh)

echo "Jobs submitted: $TECH_JOB, $CIS_JOB, $TRANS_JOB"
```

The auto-generated README.md includes:
- Dataset characteristics
- Memory and time estimates
- Resource allocation rationale
- Complete Python code that will be executed for each step
- Expected log output with examples
- Usage instructions
- Monitoring commands
- Troubleshooting guide
Key sections:
- "What Will Be Run": shows the exact Python code for fit_technical, fit_cis, and fit_trans with all parameters
- "Log Output": shows what you'll see in `logs/tech_*.out`, `logs/cis_*.out`, and `logs/trans_*.out`
- Includes device selection (cuda/cpu), niters, nsamples, and all other parameters
## Submitting Jobs

```bash
cd slurm_jobs
bash submit_all.sh
```

What happens:
- Submits fit_technical
- Queues fit_cis with a dependency on technical
- Queues fit_trans with a dependency on cis
- Prints job IDs

Output:

```
Submitting bayesDREAM pipeline jobs...
Label: perturb_seq_batch1
Output directory: ./slurm_jobs
Submitting fit_technical...
  Job ID: 12345
Submitting fit_cis (depends on 12345)...
  Job ID: 12346
Submitting fit_trans (depends on 12346)...
  Job ID: 12347
All jobs submitted successfully!
```
To submit jobs one at a time:

```bash
# 1. Technical
sbatch 01_fit_technical.sh
# Note job ID (e.g., 12345)

# 2. Cis (after technical completes)
sbatch --dependency=afterok:12345 02_fit_cis.sh
# Note job ID (e.g., 12346)

# 3. Trans (after cis completes)
sbatch --dependency=afterok:12346 03_fit_trans.sh
```

When to use:
- Testing individual steps
- Re-running specific steps
- Custom dependency logic
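If you prefer driving submission from Python, the dependency chain in submit_all.sh can be reproduced with `subprocess`. The helpers below are hypothetical (not part of bayesDREAM), and `submit` only works on a machine where `sbatch` is available:

```python
import subprocess
from typing import Optional

def sbatch_cmd(script: str, dependency: Optional[str] = None) -> list:
    """Build the sbatch command; --parsable makes sbatch print only the job
    ID, so its output can feed the next job's --dependency flag."""
    cmd = ["sbatch", "--parsable"]
    if dependency is not None:
        cmd.append(f"--dependency=afterok:{dependency}")
    cmd.append(script)
    return cmd

def submit(script: str, dependency: Optional[str] = None) -> str:
    """Run sbatch and return the new job ID (requires a SLURM cluster)."""
    out = subprocess.run(sbatch_cmd(script, dependency),
                         capture_output=True, text=True, check=True)
    return out.stdout.strip()

print(sbatch_cmd("02_fit_cis.sh", dependency="12345"))
# ['sbatch', '--parsable', '--dependency=afterok:12345', '02_fit_cis.sh']
```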
## Monitoring Jobs

```bash
# Your jobs
squeue -u $USER

# Specific job
squeue -j 12345

# All jobs for this experiment
squeue --name=perturb_seq_batch1_tech
squeue --name=perturb_seq_batch1_cis
squeue --name=perturb_seq_batch1_trans
```

Output:

```
JOBID    PARTITION  NAME                      USER   ST  TIME  NODES
12345    berzelius  perturb_seq_batch1_tech   learo  R   0:45  1
12346    berzelius  perturb_seq_batch1_cis    learo  PD  0:00  - (Dependency)
12347    berzelius  perturb_seq_batch1_trans  learo  PD  0:00  - (Dependency)
```

Job states:
- PD: Pending (waiting for resources or a dependency)
- R: Running
- CG: Completing
- CD: Completed
- F: Failed
```bash
# Detailed info
sacct -j 12345 --format=JobID,JobName,State,Elapsed,MaxRSS,MaxVMSize

# Efficiency report
seff 12345
```

To follow logs:

```bash
# Technical fit
tail -f logs/tech_12345.out

# Cis fit (array job)
tail -f logs/cis_12346_0.out  # First array task
tail -f logs/cis_12346_*.out  # All tasks (messy)

# Trans fit
tail -f logs/trans_12347_0.out
```

To find failures:

```bash
# Find failed jobs
sacct -S 2025-01-17 -u $USER | grep FAILED

# Check error logs
grep -i error logs/*.err
grep -i "out of memory" logs/*.err
```

To cancel jobs:

```bash
# Cancel a specific job
scancel 12345

# Cancel all jobs for the experiment
scancel --name=perturb_seq_batch1_tech
scancel --name=perturb_seq_batch1_cis
scancel --name=perturb_seq_batch1_trans

# Cancel all your jobs
scancel -u $USER
```

## Advanced Usage

Control the maximum number of concurrent array tasks:

```python
gen = SlurmJobGenerator(
    ...,
    max_concurrent_jobs=20,  # Limit to 20 at a time
)
```

Generated SLURM directive:

```bash
#SBATCH --array=0-99%20  # 100 jobs, max 20 concurrent
```

When to use:
- The cluster has queue limits
- You want to be considerate to other users
- You are testing on a subset before a full run
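The throttled directive is just string formatting over the task count and the concurrency cap; a minimal sketch (the helper name is hypothetical):

```python
def array_directive(n_tasks: int, max_concurrent: int) -> str:
    """Build an #SBATCH --array line: task indices 0..n-1, throttled with %."""
    return f"#SBATCH --array=0-{n_tasks - 1}%{max_concurrent}"

print(array_directive(100, 20))  # #SBATCH --array=0-99%20
print(array_directive(4, 50))    # #SBATCH --array=0-3%50
```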
Generate scripts for different configurations:

```python
# Experiment 1: Full dataset
gen1 = SlurmJobGenerator(
    meta=meta,
    counts=counts,
    cis_genes=['GFI1B', 'TET2', 'MYB', 'NFE2'],
    output_dir='./slurm_jobs_full',
    label='full_dataset',
)
gen1.generate_all_scripts()

# Experiment 2: Test on 2 genes
gen2 = SlurmJobGenerator(
    meta=meta,
    counts=counts,
    cis_genes=['GFI1B', 'TET2'],  # Subset
    output_dir='./slurm_jobs_test',
    label='test_run',
    time_multiplier=0.5,  # Faster for testing
)
gen2.generate_all_scripts()
```

Specify a custom environment:

```python
gen = SlurmJobGenerator(
    ...,
    python_env='/proj/.../custom_env/bin/python',
    bayesdream_path='/proj/.../custom_bayesdream',
)
```

For very sparse data (>90% zeros), provide a hint:

```python
gen = SlurmJobGenerator(
    ...,
    sparsity=0.95,  # 95% zeros
)
```

The generator will use this for a more accurate memory estimate.
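A quick way to compute the hint yourself, sketched for a dense matrix or DataFrame (for a scipy sparse matrix, the `nnz` attribute gives it more cheaply, as noted in the comment):

```python
import numpy as np

def fraction_zeros(counts) -> float:
    """Fraction of zero entries in a dense counts matrix, suitable as the
    sparsity hint. For scipy sparse: 1 - counts.nnz / np.prod(counts.shape)."""
    arr = np.asarray(counts)
    return float((arr == 0).mean())

counts = np.array([[0, 3, 0],
                   [0, 0, 1]])
print(round(fraction_zeros(counts), 3))  # 0.667
```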
For non-negbinom distributions:

```python
# Multinomial (splicing)
gen = SlurmJobGenerator(
    ...,
    distribution='multinomial',
)

# Binomial (exon skipping)
gen = SlurmJobGenerator(
    ...,
    distribution='binomial',
)
```

## Troubleshooting

### Out of memory

Symptoms:

```
slurmstepd: error: Detected 1 oom-kill event(s) in StepId=12345
Job killed due to memory usage
```
Solutions:

1. Check actual memory usage:
   ```bash
   seff 12345  # Shows peak memory
   ```
2. Increase the memory request: edit the SLURM script (e.g., `#SBATCH --mem=20G`, up from 11G), or regenerate with `partition_preference='fat'` for more RAM.
3. Use more GPUs: each GPU on a fat node provides 128 GB RAM, so 2 GPUs = 256 GB RAM.
4. Switch to CPU:
   ```python
   gen = SlurmJobGenerator(..., partition_preference='cpu')
   ```
5. Reduce the dataset size: filter low-count features, or subset cells for testing.
### Timeout

Symptoms:

```
Job reached time limit and was terminated
State: TIMEOUT
```

Solutions:

1. Check the actual runtime:
   ```bash
   sacct -j 12345 --format=JobID,Elapsed,State
   ```
2. Increase the time limit: edit the SLURM script (e.g., `#SBATCH --time=06:00:00`), or regenerate with `time_multiplier=2.0`.
3. Check convergence: view the logs to see whether fitting was still making progress; you may need more iterations (edit the Python code in the script).
### Jobs stuck pending

Check the reason:

```bash
squeue -j 12345 -o "%.18i %.9P %.30j %.8u %.8T %.10M %.9l %.6D %.20R"
```

Common reasons:
- (Dependency): waiting for the parent job → normal
- (Resources): waiting for nodes → be patient
- (Priority): other jobs have higher priority → be patient
- (QOSMaxGpuLimit): too many GPUs requested → reduce the request

Solutions:
- Wait (usually just needs patience)
- Reduce resource requests
- Check cluster status: `sinfo`
### Data files not found

Symptoms:

```
FileNotFoundError: [Errno 2] No such file or directory: '/proj/.../data/meta.csv'
```

Solutions:

1. Verify the data path:
   ```bash
   ls /proj/.../data/meta.csv
   ls /proj/.../data/counts.csv
   ```
2. Check permissions:
   ```bash
   ls -lh /proj/.../data/
   ```
3. Regenerate the scripts with the correct path:
   ```python
   gen = SlurmJobGenerator(..., data_path='/correct/path/to/data')
   ```
### Dependencies not working

Symptoms:
- Cis jobs start before technical completes
- Trans jobs start before cis completes

Solutions:

1. Check that your SLURM version supports `--parsable`:
   ```bash
   sbatch --help | grep parsable
   ```
2. Submit manually with explicit IDs:
   ```bash
   TECH_JOB=$(sbatch --parsable 01_fit_technical.sh)
   echo $TECH_JOB  # Should print a job ID
   sbatch --dependency=afterok:$TECH_JOB 02_fit_cis.sh
   ```
3. Check the dependency status:
   ```bash
   squeue -j 12346 -o "%.18i %.30E"  # Shows the dependency
   ```
### Which guide is being used?

Question: How do I know which guide is being used?

Answer: check the generator output:

```
[INFO] AutoIAFNormal estimated at 2.5 GB VRAM
[INFO] Will use AutoIAFNormal with niters=50,000
```

or

```
[INFO] AutoIAFNormal would require 146.0 GB VRAM
[INFO] Will use AutoNormal (mean-field) with niters=100,000
```

The threshold is 20 GB VRAM:
- If IAF < 20 GB → AutoIAFNormal (faster convergence)
- If IAF ≥ 20 GB → AutoNormal (memory-efficient, but needs more iterations)
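The selection rule itself is a one-liner; a sketch of the threshold logic described above (the function name is illustrative, not part of the package):

```python
def choose_guide(iaf_vram_gb: float, threshold_gb: float = 20.0) -> str:
    """AutoIAFNormal if its VRAM estimate fits under the threshold,
    otherwise fall back to mean-field AutoNormal."""
    return "AutoIAFNormal" if iaf_vram_gb < threshold_gb else "AutoNormal"

print(choose_guide(2.5))    # AutoIAFNormal
print(choose_guide(146.0))  # AutoNormal
```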
### A job array task failed

Symptoms:

```
Array job 12346 has 1/4 tasks failed
```

Find which task failed:

```bash
sacct -j 12346 --format=JobID,State | grep FAILED
```

Output:

```
12346_2    FAILED
```

Check the logs for that task:

```bash
cat logs/cis_12346_2.err
cat logs/cis_12346_2.out
```

Rerun the specific task:

```bash
# Get the cis gene for task 2
CIS_GENES=(GFI1B TET2 MYB NFE2)
echo ${CIS_GENES[2]}  # MYB

# Edit the script to run just that gene, or resubmit only that task:
sbatch --array=2 02_fit_cis.sh  # Just task 2
```

## API Reference

```python
class SlurmJobGenerator:
    """
    Generate SLURM job submission scripts for the bayesDREAM pipeline on Berzelius.
    """
    def __init__(
        self,
        meta: pd.DataFrame,
        counts,  # pd.DataFrame or sparse matrix
        gene_meta: Optional[pd.DataFrame] = None,
        cis_genes: Optional[List[str]] = None,
        output_dir: str = './slurm_jobs',
        label: str = 'bayesdream_run',
        low_moi: bool = True,
        use_all_cells_technical: bool = False,
        distribution: str = 'negbinom',
        sparsity: Optional[float] = None,
        n_groups: Optional[int] = None,
        max_concurrent_jobs: int = 50,
        time_multiplier: float = 1.0,
        partition_preference: str = 'auto',
        python_env: str = '/proj/.../pyroenv/bin/python',
        bayesdream_path: str = '/proj/.../bayesDREAM',
        data_path: Optional[str] = None,
        nsamples: int = 1000,
    ):
```

Required:
- `meta` (pd.DataFrame): cell metadata with columns `cell`, `guide`, `target`, `cell_line`
- `counts` (pd.DataFrame or sparse): gene expression counts (features × cells)
- `cis_genes` (list of str): list of cis genes to fit
Output:
- `output_dir` (str): directory to write SLURM scripts (default: `'./slurm_jobs'`)
- `label` (str): unique identifier for this run (default: `'bayesdream_run'`)
Experiment design:
- `low_moi` (bool): low MOI mode vs high MOI (default: `True`)
- `use_all_cells_technical` (bool): use all cells for the technical fit (default: `False`)
Data characteristics:
- `distribution` (str): `'negbinom'`, `'multinomial'`, `'binomial'`, `'normal'`, or `'studentt'` (default: `'negbinom'`)
- `sparsity` (float): fraction of zeros (default: auto-detect)
- `n_groups` (int): number of technical groups (default: auto-detect)
- `gene_meta` (pd.DataFrame): gene metadata (optional)
Resource allocation:
- `partition_preference` (str): `'auto'`, `'fat'`, `'thin'`, or `'cpu'` (default: `'auto'`)
- `max_concurrent_jobs` (int): maximum concurrent array tasks (default: `50`)
- `time_multiplier` (float): scale the time estimates (default: `1.0`)
Cluster configuration:
- `python_env` (str): path to the Python executable with bayesDREAM
- `bayesdream_path` (str): path to the bayesDREAM repository
- `data_path` (str): path to the saved data (meta.csv, counts.csv)
Fitting parameters:
- `nsamples` (int): number of posterior samples (default: `1000`)
### Methods

`generate_all_scripts(cis_genes: Optional[List[str]] = None)`

Generate all SLURM scripts.

```python
gen.generate_all_scripts()
```

`estimate_memory_requirements() -> Dict[str, float]`

Estimate RAM and VRAM for each step.

```python
memory = gen.estimate_memory_requirements()
print(f"Technical: {memory['fit_technical_ram_gb']:.1f} GB RAM")
```

Returns:

```python
{
    'fit_technical_ram_gb': 5.7,
    'fit_technical_vram_gb': 7.1,
    'fit_cis_ram_gb': 4.3,
    'fit_cis_vram_gb': 4.2,
    'fit_trans_ram_gb': 12.6,
    'fit_trans_vram_gb': 9.1,
    'min_ram_gb': 12.6,
    'min_vram_gb': 9.1,
    'recommended_ram_gb': 19.0,
    'recommended_vram_gb': 16.0,
    'resources': {...},
}
```

`estimate_time_requirements(memory: Dict) -> Dict[str, str]`

Estimate wall-clock time.

```python
times = gen.estimate_time_requirements(memory)
print(f"Technical: {times['fit_technical']}")  # "02:15:00"
```

### Example: low MOI run

```python
from bayesDREAM.slurm_jobgen import SlurmJobGenerator
import pandas as pd

# Load data
meta = pd.read_csv('meta.csv')
counts = pd.read_csv('counts.csv', index_col=0)

# Save to the cluster
data_path = "/proj/.../data/run1"
meta.to_csv(f"{data_path}/meta.csv", index=False)
counts.to_csv(f"{data_path}/counts.csv")

# Generate scripts
gen = SlurmJobGenerator(
    meta=meta,
    counts=counts,
    cis_genes=['GFI1B', 'TET2', 'MYB', 'NFE2'],
    output_dir='./slurm_jobs',
    label='perturb_seq_lowmoi',
    low_moi=True,
    python_env='/proj/.../pyroenv/bin/python',
    bayesdream_path='/proj/.../bayesDREAM',
    data_path=data_path,
)
gen.generate_all_scripts()
```

### Example: high MOI run

```python
gen = SlurmJobGenerator(
    meta=meta,
    counts=counts,
    cis_genes=['GFI1B', 'TET2'],
    output_dir='./slurm_jobs_highmoi',
    label='perturb_seq_highmoi',
    low_moi=False,
    use_all_cells_technical=True,  # Key difference
    python_env='/proj/.../pyroenv/bin/python',
    bayesdream_path='/proj/.../bayesDREAM',
    data_path=data_path,
)
gen.generate_all_scripts()
```

### Example: large dataset on CPU

```python
gen = SlurmJobGenerator(
    meta=meta,
    counts=counts_large,  # 100K genes × 200K cells
    cis_genes=['GFI1B'],
    output_dir='./slurm_jobs_cpu',
    label='perturb_seq_large',
    partition_preference='cpu',  # Force CPU
    time_multiplier=3.0,         # CPU is slower
    python_env='/proj/.../pyroenv/bin/python',
    bayesdream_path='/proj/.../bayesDREAM',
    data_path=data_path,
)
gen.generate_all_scripts()
```

### Example: test run

```python
# Test on 2 genes with conservative settings
gen = SlurmJobGenerator(
    meta=meta,
    counts=counts,
    cis_genes=['GFI1B', 'TET2'],  # Just 2 genes
    output_dir='./slurm_jobs_test',
    label='test_run',
    time_multiplier=2.0,    # Extra time for safety
    max_concurrent_jobs=2,  # Don't overwhelm the cluster
    python_env='/proj/.../pyroenv/bin/python',
    bayesdream_path='/proj/.../bayesDREAM',
    data_path=data_path,
)
gen.generate_all_scripts()
```

### Adapting to other clusters

While this generator is optimized for Berzelius, it can be adapted for other SLURM clusters:
1. Update the resource specs in `_recommend_resources()`:
   ```python
   # Example for a different cluster:
   # - GPU nodes: 40GB VRAM, 256GB RAM
   # - CPU nodes: 8GB RAM per core
   ```
2. Update module loading in the generated scripts:
   ```bash
   # Replace:
   module load Anaconda/2021.05-nsc1
   # With your cluster's modules, e.g.:
   module load python/3.9
   module load cuda/11.8
   ```
3. Update the partition names:
   ```python
   # Change partition names to match your cluster
   'partition': 'gpu',    # Instead of 'berzelius'
   'constraint': 'v100',  # Instead of 'fat'
   ```
## Support

For questions about:
- bayesDREAM code: See repository documentation
- Berzelius cluster: Contact NSC support
- SLURM issues: Consult your cluster's documentation