# SLURM Job Generator

Automated generation of optimized SLURM submission scripts for running bayesDREAM on HPC clusters, with specific support for Berzelius (NSC, Sweden).
- Overview
- Quick Start
- Installation
- Usage Guide
- Generated Scripts
- Submitting Jobs
- Monitoring Jobs
- Advanced Usage
- Troubleshooting
- API Reference
## Overview

The SLURM Job Generator (`bayesDREAM.slurm_jobgen`) automates the creation of SLURM submission scripts for large-scale bayesDREAM analyses. It:
- Analyzes your dataset (features, cells, sparsity, technical groups)
- Estimates memory requirements (RAM and VRAM) using `memory_calculator.py`
- Selects optimal resources (GPU fat/thin nodes or CPU partition)
- Estimates wall-clock time based on dataset size and complexity
- Generates SLURM scripts with job dependencies and array parallelization
- Creates documentation (a README.md with monitoring commands)
Without the generator, you must:
- Manually calculate memory needs for each step
- Guess appropriate time limits
- Write SLURM scripts from scratch
- Set up job dependencies manually
- Risk over-allocating (wasted resources) or under-allocating (job failure)
With the generator:
- One function call generates all scripts
- Automatic resource optimization
- Built-in job dependencies
- Job arrays for parallelization
- Throttling to prevent cluster overload
## Quick Start

```python
import pandas as pd
from bayesDREAM.slurm_jobgen import SlurmJobGenerator

# Load data
meta = pd.read_csv('meta.csv')
counts = pd.read_csv('counts.csv', index_col=0)  # genes × cells

# Save to a cluster-accessible location
data_path = "/proj/berzelius-aiics-real/users/x_learo/data/run1"
meta.to_csv(f"{data_path}/meta.csv", index=False)
counts.to_csv(f"{data_path}/counts.csv")

# Create generator
gen = SlurmJobGenerator(
    meta=meta,
    counts=counts,
    cis_genes=['GFI1B', 'TET2', 'MYB', 'NFE2'],
    output_dir='./slurm_jobs',
    label='perturb_seq_batch1',
    # Paths (adjust for your cluster)
    python_env='/proj/.../mambaforge/envs/pyroenv/bin/python',
    bayesdream_path='/proj/.../bayesDREAM',
    data_path=data_path,
)

# Generate all scripts
gen.generate_all_scripts()
```

```bash
# Transfer to Berzelius
scp -r slurm_jobs/ berzelius:/proj/.../

# On Berzelius
cd slurm_jobs
bash submit_all.sh

# Monitor
squeue -u $USER
tail -f logs/tech_*.out
```

## Installation

The SLURM job generator is included in bayesDREAM; no additional installation is required.
Requirements:
- bayesDREAM installed with dependencies
- Access to an HPC cluster (Berzelius or a similar SLURM-based system)
- A Python environment on the cluster with bayesDREAM installed
## Usage Guide

```python
from bayesDREAM.slurm_jobgen import SlurmJobGenerator

generator = SlurmJobGenerator(
    # Required
    meta=meta_dataframe,          # Cell metadata (cell, guide, target, cell_line)
    counts=counts_matrix,         # Gene expression (genes × cells)
    cis_genes=['GFI1B', 'TET2'],  # List of cis genes to fit
    output_dir='./slurm_jobs',    # Where to write scripts
    label='my_experiment',        # Unique identifier

    # Cluster paths
    python_env='/path/to/pyroenv/bin/python',
    bayesdream_path='/path/to/bayesDREAM',
    data_path='/path/to/data',    # Where meta.csv and counts.csv are saved

    # Optional (see sections below)
    low_moi=True,
    use_all_cells_technical=False,
    partition_preference='auto',
    max_concurrent_jobs=50,
    time_multiplier=1.0,
)

# Generate scripts
generator.generate_all_scripts()
```

The generator supports different experimental designs:
### Low MOI (default)

One guide per cell, with a clear NTC population:

```python
gen = SlurmJobGenerator(
    ...,
    low_moi=True,  # Default
    use_all_cells_technical=False,
)
```

Job structure:
- fit_technical: 1 job (all NTC cells)
- fit_cis: N jobs (1 per cis gene)
- fit_trans: N jobs (1 per cis gene)

Total jobs: 1 + 2N
### High MOI, technical fit on all cells

Multiple guides per cell, technical effects independent of perturbations:

```python
gen = SlurmJobGenerator(
    ...,
    low_moi=False,
    use_all_cells_technical=True,  # Fit technical on ALL cells
)
```

Job structure:
- fit_technical: 1 job (all cells, fit once for all cis genes)
- fit_cis: N jobs (1 per cis gene)
- fit_trans: N jobs (1 per cis gene)

Total jobs: 1 + 2N

Benefit: the same total number of jobs as low MOI, but fit_technical uses all of the data.
When to use:
- Technical variation is batch- or lane-specific
- Technical effects don't correlate with perturbation effects
- You want to maximize statistical power for the technical correction
When NOT to use:
- Technical groups (e.g., CRISPRi vs CRISPRa) correlate with cis gene expression
- See the warning in the `fit_technical` documentation
### High MOI, NTC-based technical fit per gene

Multiple guides per cell, but NTC-based technical correction is fit per gene:

```python
gen = SlurmJobGenerator(
    ...,
    low_moi=False,
    use_all_cells_technical=False,  # Fit technical on NTC cells per gene
)
```

Job structure:
- fit_technical: N jobs (NTC cells, 1 per cis gene)
- fit_cis: N jobs (1 per cis gene)
- fit_trans: N jobs (1 per cis gene)

Total jobs: 3N
When to use:
- High MOI, but you want a conservative NTC-only technical correction
- Technical effects may vary by cis gene
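The job totals for the three designs follow directly from the structures above. A tiny illustrative sketch (not part of the package; the function name is hypothetical):

```python
def total_jobs(n_cis_genes: int, low_moi: bool, use_all_cells_technical: bool) -> int:
    """Total SLURM jobs for the three experimental designs described above."""
    if low_moi or use_all_cells_technical:
        # One shared fit_technical job, plus fit_cis and fit_trans per gene
        return 1 + 2 * n_cis_genes
    # NTC-per-gene technical fit: three jobs per cis gene
    return 3 * n_cis_genes

print(total_jobs(4, low_moi=True, use_all_cells_technical=False))   # 9
print(total_jobs(4, low_moi=False, use_all_cells_technical=False))  # 12
```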
### Berzelius node types

The generator automatically selects optimal resources for Berzelius:

| Node Type | Constraint | GPUs per Node | VRAM per GPU | RAM per GPU | Best For |
|---|---|---|---|---|---|
| Fat | `-C fat` | 8 | 10 GB | 128 GB | Standard fitting |
| Thin | `-C thin` | 8 | 5 GB | 64 GB | Small datasets |
| CPU | `--partition=berzelius-cpu` | 0 | 0 | 7.76 GB/core | Extremely large datasets, fit_cis |

Note: a full node has 8 GPUs. The generator will use up to 8 GPUs before switching to CPU.
The generator analyzes memory requirements and selects resources as follows.

For fit_technical and fit_trans (same logic):

```
If VRAM ≤ 5 GB and RAM ≤ 64 GB:
    → 1 thin GPU ✓ (most efficient for small datasets)
Elif VRAM ≤ 10 GB and RAM ≤ 128 GB:
    → 1 fat GPU ✓ (most common case)
Elif VRAM ≤ 20 GB and RAM ≤ 256 GB:
    → 2 fat GPUs
Elif VRAM ≤ 40 GB and RAM ≤ 512 GB:
    → 4 fat GPUs
Elif VRAM ≤ 80 GB and RAM ≤ 1024 GB:
    → 8 fat GPUs (full node)
Else:
    → CPU partition (dataset too large for 8 GPUs)
```

For fit_cis, the default is the CPU partition (it doesn't need a GPU for most datasets).

Goal: use thin nodes for small datasets, scale up to a full 8-GPU node, and fall back to CPU only beyond that.
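The tiered selection above can be sketched in a few lines. This is an illustrative re-implementation, not the package's code (the real logic lives in `_recommend_resources()`, and the dictionary keys here are made up):

```python
def pick_resources(vram_gb: float, ram_gb: float) -> dict:
    """Illustrative version of the tiered VRAM/RAM selection described above."""
    tiers = [
        (5,  64,   {"constraint": "thin", "gpus": 1}),
        (10, 128,  {"constraint": "fat",  "gpus": 1}),
        (20, 256,  {"constraint": "fat",  "gpus": 2}),
        (40, 512,  {"constraint": "fat",  "gpus": 4}),
        (80, 1024, {"constraint": "fat",  "gpus": 8}),
    ]
    for max_vram, max_ram, resources in tiers:
        if vram_gb <= max_vram and ram_gb <= max_ram:
            return resources
    return {"partition": "berzelius-cpu", "gpus": 0}  # too large for 8 GPUs

print(pick_resources(7.1, 5.7))  # 1 fat GPU: VRAM > 5 GB rules out thin nodes
```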
To force specific resources:

```python
# Force CPU for all steps
gen = SlurmJobGenerator(
    ...,
    partition_preference='cpu',
)

# Force fat nodes
gen = SlurmJobGenerator(
    ...,
    partition_preference='fat',
)

# Auto-select (default, recommended)
gen = SlurmJobGenerator(
    ...,
    partition_preference='auto',
)
```

### Time estimation

The generator automatically estimates wall-clock time based on:
- Dataset size (T × N)
- Guide type (AutoIAFNormal vs AutoNormal)
- Step complexity
| Step | Base Time | Scales With |
|---|---|---|
| fit_technical | 1.5 hours | T × N (×2 if AutoNormal) |
| fit_cis | 0.5 hours/gene | N |
| fit_trans | 3.0 hours/gene | T × N |

All estimates include a 1.5× safety margin to reduce timeout risk.
Small dataset (5K genes, 10K cells):
- fit_technical: ~0.5 hours
- fit_cis: ~0.3 hours/gene
- fit_trans: ~0.8 hours/gene
Large dataset (50K genes, 100K cells):
- fit_technical: ~9 hours (if AutoNormal)
- fit_cis: ~1.5 hours/gene
- fit_trans: ~20 hours/gene
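The shape of the estimate (base time × dataset scale × guide penalty × safety margin × user multiplier) can be sketched as follows. Only the 2× AutoNormal penalty and the 1.5× margin come from the text above; the scale factor and its reference point are assumptions of this sketch:

```python
def estimate_hours(base_hours: float, scale: float,
                   auto_normal: bool = False, time_multiplier: float = 1.0) -> float:
    """Sketch of the wall-clock estimate: base time for the step, scaled by a
    dataset-size factor (e.g. T*N relative to a reference dataset), doubled
    for AutoNormal, with a 1.5x safety margin and the user multiplier."""
    hours = base_hours * scale
    if auto_normal:
        hours *= 2.0
    return hours * 1.5 * time_multiplier

# fit_technical at the reference scale with AutoNormal:
print(estimate_hours(1.5, 1.0, auto_normal=True))  # 4.5
```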
To scale the time estimates:

```python
# Conservative (2× longer)
gen = SlurmJobGenerator(
    ...,
    time_multiplier=2.0,
)

# Aggressive (if you know your data converges fast)
gen = SlurmJobGenerator(
    ...,
    time_multiplier=0.7,
)

# Default (recommended)
gen = SlurmJobGenerator(
    ...,
    time_multiplier=1.0,
)
```

Dataset size matters more than intuition suggests: a 50K-gene dataset takes roughly 10× longer than a 5K-gene one. The generator's estimates are based on empirical scaling laws.
## Generated Scripts

```
slurm_jobs/
├── 01_fit_technical.sh    # Technical fitting
├── 02_fit_cis.sh          # Cis effect fitting (job array)
├── 03_fit_trans.sh        # Trans effect fitting (job array)
├── submit_all.sh          # Master submission script
├── README.md              # Complete documentation
└── logs/                  # Created automatically
    ├── tech_*.out
    ├── cis_*_*.out
    └── trans_*_*.out
```
01_fit_technical.sh, a single job (low MOI, or high MOI with use_all_cells_technical=True):

```bash
#!/bin/bash
#SBATCH --job-name=perturb_seq_batch1_tech
#SBATCH --time=02:15:00
#SBATCH --partition=berzelius
#SBATCH -C fat
#SBATCH --gpus=1
#SBATCH --mem=11G
# Run fit_technical on NTC cells (or all cells if high MOI)
```

As a job array (high MOI with NTC per gene):

```bash
#SBATCH --array=0-3%50  # 4 cis genes, max 50 concurrent
```

02_fit_cis.sh, a job array with 1 task per cis gene:

```bash
#!/bin/bash
#SBATCH --job-name=perturb_seq_batch1_cis
#SBATCH --array=0-3%50  # 4 cis genes
#SBATCH --time=00:30:00
#SBATCH --partition=berzelius-cpu
#SBATCH --cpus-per-task=1
#SBATCH --mem=9G
# Array: CIS_GENES=(GFI1B TET2 MYB NFE2)
# CIS_GENE=${CIS_GENES[$SLURM_ARRAY_TASK_ID]}
```

03_fit_trans.sh, a job array with 1 task per cis gene:

```bash
#!/bin/bash
#SBATCH --job-name=perturb_seq_batch1_trans
#SBATCH --array=0-3%50  # 4 cis genes
#SBATCH --time=03:00:00
#SBATCH --partition=berzelius
#SBATCH -C fat
#SBATCH --gpus=1
#SBATCH --mem=18G
```

submit_all.sh, the master script with dependencies:

```bash
#!/bin/bash
# Submit fit_technical
TECH_JOB=$(sbatch --parsable 01_fit_technical.sh)

# Submit fit_cis (depends on technical)
CIS_JOB=$(sbatch --parsable --dependency=afterok:$TECH_JOB 02_fit_cis.sh)

# Submit fit_trans (depends on cis)
TRANS_JOB=$(sbatch --parsable --dependency=afterok:$CIS_JOB 03_fit_trans.sh)

echo "Jobs submitted: $TECH_JOB, $CIS_JOB, $TRANS_JOB"
```

The auto-generated README.md includes:
- Dataset characteristics
- Memory and time estimates
- Resource allocation rationale
- Complete Python code that will be executed for each step
- Expected log output with examples
- Usage instructions
- Monitoring commands
- Troubleshooting guide
Key sections:
- "What Will Be Run": shows the exact Python code for fit_technical, fit_cis, and fit_trans with all parameters
- "Log Output": shows what you'll see in `logs/tech_*.out`, `logs/cis_*.out`, and `logs/trans_*.out`
- Includes device selection (cuda/cpu), niters, nsamples, and all other parameters
## Submitting Jobs

```bash
cd slurm_jobs
bash submit_all.sh
```

What happens:
- Submits fit_technical
- Queues fit_cis with a dependency on technical
- Queues fit_trans with a dependency on cis
- Prints job IDs

Output:

```
Submitting bayesDREAM pipeline jobs...
Label: perturb_seq_batch1
Output directory: ./slurm_jobs
Submitting fit_technical...
  Job ID: 12345
Submitting fit_cis (depends on 12345)...
  Job ID: 12346
Submitting fit_trans (depends on 12346)...
  Job ID: 12347
All jobs submitted successfully!
```
To submit jobs one at a time:

```bash
# 1. Technical
sbatch 01_fit_technical.sh
# Note job ID (e.g., 12345)

# 2. Cis (after technical completes)
sbatch --dependency=afterok:12345 02_fit_cis.sh
# Note job ID (e.g., 12346)

# 3. Trans (after cis completes)
sbatch --dependency=afterok:12346 03_fit_trans.sh
```

When to use:
- Testing individual steps
- Re-running specific steps
- Custom dependency logic
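If you prefer driving submission from Python, the dependency chain in submit_all.sh can be reproduced with `subprocess`. The helpers below are hypothetical (not part of bayesDREAM), and `submit` only works on a machine where `sbatch` is available:

```python
import subprocess
from typing import Optional

def sbatch_cmd(script: str, dependency: Optional[str] = None) -> list:
    """Build the sbatch command; --parsable makes sbatch print only the job
    ID, so its output can feed the next job's --dependency flag."""
    cmd = ["sbatch", "--parsable"]
    if dependency is not None:
        cmd.append(f"--dependency=afterok:{dependency}")
    cmd.append(script)
    return cmd

def submit(script: str, dependency: Optional[str] = None) -> str:
    """Run sbatch and return the new job ID (requires a SLURM cluster)."""
    out = subprocess.run(sbatch_cmd(script, dependency),
                         capture_output=True, text=True, check=True)
    return out.stdout.strip()

print(sbatch_cmd("02_fit_cis.sh", dependency="12345"))
# ['sbatch', '--parsable', '--dependency=afterok:12345', '02_fit_cis.sh']
```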
## Monitoring Jobs

```bash
# Your jobs
squeue -u $USER

# Specific job
squeue -j 12345

# All jobs for this experiment
squeue --name=perturb_seq_batch1_tech
squeue --name=perturb_seq_batch1_cis
squeue --name=perturb_seq_batch1_trans
```

Output:

```
JOBID    PARTITION  NAME                      USER   ST  TIME  NODES
12345    berzelius  perturb_seq_batch1_tech   learo  R   0:45  1
12346    berzelius  perturb_seq_batch1_cis    learo  PD  0:00  - (Dependency)
12347    berzelius  perturb_seq_batch1_trans  learo  PD  0:00  - (Dependency)
```

Job states:
- PD: Pending (waiting for resources or a dependency)
- R: Running
- CG: Completing
- CD: Completed
- F: Failed
```bash
# Detailed info
sacct -j 12345 --format=JobID,JobName,State,Elapsed,MaxRSS,MaxVMSize

# Efficiency report
seff 12345
```

To follow logs:

```bash
# Technical fit
tail -f logs/tech_12345.out

# Cis fit (array job)
tail -f logs/cis_12346_0.out  # First array task
tail -f logs/cis_12346_*.out  # All tasks (messy)

# Trans fit
tail -f logs/trans_12347_0.out
```

To find failures:

```bash
# Find failed jobs
sacct -S 2025-01-17 -u $USER | grep FAILED

# Check error logs
grep -i error logs/*.err
grep -i "out of memory" logs/*.err
```

To cancel jobs:

```bash
# Cancel a specific job
scancel 12345

# Cancel all jobs for the experiment
scancel --name=perturb_seq_batch1_tech
scancel --name=perturb_seq_batch1_cis
scancel --name=perturb_seq_batch1_trans

# Cancel all your jobs
scancel -u $USER
```

## Advanced Usage

Control the maximum number of concurrent array tasks:

```python
gen = SlurmJobGenerator(
    ...,
    max_concurrent_jobs=20,  # Limit to 20 at a time
)
```

Generated SLURM directive:

```bash
#SBATCH --array=0-99%20  # 100 jobs, max 20 concurrent
```

When to use:
- The cluster has queue limits
- You want to be considerate to other users
- You are testing on a subset before a full run
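The throttled directive is just string formatting over the task count and the concurrency cap; a minimal sketch (the helper name is hypothetical):

```python
def array_directive(n_tasks: int, max_concurrent: int) -> str:
    """Build an #SBATCH --array line: task indices 0..n-1, throttled with %."""
    return f"#SBATCH --array=0-{n_tasks - 1}%{max_concurrent}"

print(array_directive(100, 20))  # #SBATCH --array=0-99%20
print(array_directive(4, 50))    # #SBATCH --array=0-3%50
```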
Generate scripts for different configurations:

```python
# Experiment 1: Full dataset
gen1 = SlurmJobGenerator(
    meta=meta,
    counts=counts,
    cis_genes=['GFI1B', 'TET2', 'MYB', 'NFE2'],
    output_dir='./slurm_jobs_full',
    label='full_dataset',
)
gen1.generate_all_scripts()

# Experiment 2: Test on 2 genes
gen2 = SlurmJobGenerator(
    meta=meta,
    counts=counts,
    cis_genes=['GFI1B', 'TET2'],  # Subset
    output_dir='./slurm_jobs_test',
    label='test_run',
    time_multiplier=0.5,  # Faster for testing
)
gen2.generate_all_scripts()
```

Specify a custom environment:

```python
gen = SlurmJobGenerator(
    ...,
    python_env='/proj/.../custom_env/bin/python',
    bayesdream_path='/proj/.../custom_bayesdream',
)
```

For very sparse data (>90% zeros), provide a hint:

```python
gen = SlurmJobGenerator(
    ...,
    sparsity=0.95,  # 95% zeros
)
```

The generator will use this for a more accurate memory estimate.
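A quick way to compute the hint yourself, sketched for a dense matrix or DataFrame (for a scipy sparse matrix, the `nnz` attribute gives it more cheaply, as noted in the comment):

```python
import numpy as np

def fraction_zeros(counts) -> float:
    """Fraction of zero entries in a dense counts matrix, suitable as the
    sparsity hint. For scipy sparse: 1 - counts.nnz / np.prod(counts.shape)."""
    arr = np.asarray(counts)
    return float((arr == 0).mean())

counts = np.array([[0, 3, 0],
                   [0, 0, 1]])
print(round(fraction_zeros(counts), 3))  # 0.667
```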
For non-negbinom distributions:

```python
# Multinomial (splicing)
gen = SlurmJobGenerator(
    ...,
    distribution='multinomial',
)

# Binomial (exon skipping)
gen = SlurmJobGenerator(
    ...,
    distribution='binomial',
)
```

## Troubleshooting

### Out of memory

Symptoms:

```
slurmstepd: error: Detected 1 oom-kill event(s) in StepId=12345
Job killed due to memory usage
```
Solutions:

1. Check actual memory usage:
   ```bash
   seff 12345  # Shows peak memory
   ```
2. Increase the memory request: edit the SLURM script (e.g., `#SBATCH --mem=20G`, up from 11G), or regenerate with `partition_preference='fat'` for more RAM.
3. Use more GPUs: each GPU on a fat node provides 128 GB RAM, so 2 GPUs = 256 GB RAM.
4. Switch to CPU:
   ```python
   gen = SlurmJobGenerator(..., partition_preference='cpu')
   ```
5. Reduce the dataset size: filter low-count features, or subset cells for testing.
### Timeout

Symptoms:

```
Job reached time limit and was terminated
State: TIMEOUT
```

Solutions:

1. Check the actual runtime:
   ```bash
   sacct -j 12345 --format=JobID,Elapsed,State
   ```
2. Increase the time limit: edit the SLURM script (e.g., `#SBATCH --time=06:00:00`), or regenerate with `time_multiplier=2.0`.
3. Check convergence: view the logs to see whether fitting was still making progress; you may need more iterations (edit the Python code in the script).
### Jobs stuck pending

Check the reason:

```bash
squeue -j 12345 -o "%.18i %.9P %.30j %.8u %.8T %.10M %.9l %.6D %.20R"
```

Common reasons:
- (Dependency): waiting for the parent job → normal
- (Resources): waiting for nodes → be patient
- (Priority): other jobs have higher priority → be patient
- (QOSMaxGpuLimit): too many GPUs requested → reduce the request

Solutions:
- Wait (usually just needs patience)
- Reduce resource requests
- Check cluster status: `sinfo`
### Data files not found

Symptoms:

```
FileNotFoundError: [Errno 2] No such file or directory: '/proj/.../data/meta.csv'
```

Solutions:

1. Verify the data path:
   ```bash
   ls /proj/.../data/meta.csv
   ls /proj/.../data/counts.csv
   ```
2. Check permissions:
   ```bash
   ls -lh /proj/.../data/
   ```
3. Regenerate the scripts with the correct path:
   ```python
   gen = SlurmJobGenerator(..., data_path='/correct/path/to/data')
   ```
### Dependencies not working

Symptoms:
- Cis jobs start before technical completes
- Trans jobs start before cis completes

Solutions:

1. Check that your SLURM version supports `--parsable`:
   ```bash
   sbatch --help | grep parsable
   ```
2. Submit manually with explicit IDs:
   ```bash
   TECH_JOB=$(sbatch --parsable 01_fit_technical.sh)
   echo $TECH_JOB  # Should print a job ID
   sbatch --dependency=afterok:$TECH_JOB 02_fit_cis.sh
   ```
3. Check the dependency status:
   ```bash
   squeue -j 12346 -o "%.18i %.30E"  # Shows the dependency
   ```
### Which guide is being used?

Question: How do I know which guide is being used?

Answer: check the generator output:

```
[INFO] AutoIAFNormal estimated at 2.5 GB VRAM
[INFO] Will use AutoIAFNormal with niters=50,000
```

or

```
[INFO] AutoIAFNormal would require 146.0 GB VRAM
[INFO] Will use AutoNormal (mean-field) with niters=100,000
```

The threshold is 20 GB VRAM:
- If IAF < 20 GB → AutoIAFNormal (faster convergence)
- If IAF ≥ 20 GB → AutoNormal (memory-efficient, but needs more iterations)
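The selection rule itself is a one-liner; a sketch of the threshold logic described above (the function name is illustrative, not part of the package):

```python
def choose_guide(iaf_vram_gb: float, threshold_gb: float = 20.0) -> str:
    """AutoIAFNormal if its VRAM estimate fits under the threshold,
    otherwise fall back to mean-field AutoNormal."""
    return "AutoIAFNormal" if iaf_vram_gb < threshold_gb else "AutoNormal"

print(choose_guide(2.5))    # AutoIAFNormal
print(choose_guide(146.0))  # AutoNormal
```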
### A job array task failed

Symptoms:

```
Array job 12346 has 1/4 tasks failed
```

Find which task failed:

```bash
sacct -j 12346 --format=JobID,State | grep FAILED
```

Output:

```
12346_2    FAILED
```

Check the logs for that task:

```bash
cat logs/cis_12346_2.err
cat logs/cis_12346_2.out
```

Rerun the specific task:

```bash
# Get the cis gene for task 2
CIS_GENES=(GFI1B TET2 MYB NFE2)
echo ${CIS_GENES[2]}  # MYB

# Edit the script to run just that gene, or resubmit only that task:
sbatch --array=2 02_fit_cis.sh  # Just task 2
```

## API Reference

```python
class SlurmJobGenerator:
    """
    Generate SLURM job submission scripts for the bayesDREAM pipeline on Berzelius.
    """
    def __init__(
        self,
        meta: pd.DataFrame,
        counts,  # pd.DataFrame or sparse matrix
        gene_meta: Optional[pd.DataFrame] = None,
        cis_genes: Optional[List[str]] = None,
        output_dir: str = './slurm_jobs',
        label: str = 'bayesdream_run',
        low_moi: bool = True,
        use_all_cells_technical: bool = False,
        distribution: str = 'negbinom',
        sparsity: Optional[float] = None,
        n_groups: Optional[int] = None,
        max_concurrent_jobs: int = 50,
        time_multiplier: float = 1.0,
        partition_preference: str = 'auto',
        python_env: str = '/proj/.../pyroenv/bin/python',
        bayesdream_path: str = '/proj/.../bayesDREAM',
        data_path: Optional[str] = None,
        nsamples: int = 1000,
    ):
```

Required:
- `meta` (pd.DataFrame): cell metadata with columns `cell`, `guide`, `target`, `cell_line`
- `counts` (pd.DataFrame or sparse): gene expression counts (features × cells)
- `cis_genes` (list of str): list of cis genes to fit
Output:
- `output_dir` (str): directory to write SLURM scripts (default: `'./slurm_jobs'`)
- `label` (str): unique identifier for this run (default: `'bayesdream_run'`)
Experiment design:
- `low_moi` (bool): low MOI mode vs high MOI (default: `True`)
- `use_all_cells_technical` (bool): use all cells for the technical fit (default: `False`)
Data characteristics:
- `distribution` (str): `'negbinom'`, `'multinomial'`, `'binomial'`, `'normal'`, or `'studentt'` (default: `'negbinom'`)
- `sparsity` (float): fraction of zeros (default: auto-detect)
- `n_groups` (int): number of technical groups (default: auto-detect)
- `gene_meta` (pd.DataFrame): gene metadata (optional)
Resource allocation:
- `partition_preference` (str): `'auto'`, `'fat'`, `'thin'`, or `'cpu'` (default: `'auto'`)
- `max_concurrent_jobs` (int): maximum concurrent array tasks (default: `50`)
- `time_multiplier` (float): scale the time estimates (default: `1.0`)
Cluster configuration:
- `python_env` (str): path to the Python executable with bayesDREAM
- `bayesdream_path` (str): path to the bayesDREAM repository
- `data_path` (str): path to the saved data (meta.csv, counts.csv)
Fitting parameters:
- `nsamples` (int): number of posterior samples (default: `1000`)
### Methods

`generate_all_scripts(cis_genes: Optional[List[str]] = None)`

Generate all SLURM scripts.

```python
gen.generate_all_scripts()
```

`estimate_memory_requirements() -> Dict[str, float]`

Estimate RAM and VRAM for each step.

```python
memory = gen.estimate_memory_requirements()
print(f"Technical: {memory['fit_technical_ram_gb']:.1f} GB RAM")
```

Returns:

```python
{
    'fit_technical_ram_gb': 5.7,
    'fit_technical_vram_gb': 7.1,
    'fit_cis_ram_gb': 4.3,
    'fit_cis_vram_gb': 4.2,
    'fit_trans_ram_gb': 12.6,
    'fit_trans_vram_gb': 9.1,
    'min_ram_gb': 12.6,
    'min_vram_gb': 9.1,
    'recommended_ram_gb': 19.0,
    'recommended_vram_gb': 16.0,
    'resources': {...},
}
```

`estimate_time_requirements(memory: Dict) -> Dict[str, str]`

Estimate wall-clock time.

```python
times = gen.estimate_time_requirements(memory)
print(f"Technical: {times['fit_technical']}")  # "02:15:00"
```

### Example: low MOI run

```python
from bayesDREAM.slurm_jobgen import SlurmJobGenerator
import pandas as pd

# Load data
meta = pd.read_csv('meta.csv')
counts = pd.read_csv('counts.csv', index_col=0)

# Save to the cluster
data_path = "/proj/.../data/run1"
meta.to_csv(f"{data_path}/meta.csv", index=False)
counts.to_csv(f"{data_path}/counts.csv")

# Generate scripts
gen = SlurmJobGenerator(
    meta=meta,
    counts=counts,
    cis_genes=['GFI1B', 'TET2', 'MYB', 'NFE2'],
    output_dir='./slurm_jobs',
    label='perturb_seq_lowmoi',
    low_moi=True,
    python_env='/proj/.../pyroenv/bin/python',
    bayesdream_path='/proj/.../bayesDREAM',
    data_path=data_path,
)
gen.generate_all_scripts()
```

### Example: high MOI run

```python
gen = SlurmJobGenerator(
    meta=meta,
    counts=counts,
    cis_genes=['GFI1B', 'TET2'],
    output_dir='./slurm_jobs_highmoi',
    label='perturb_seq_highmoi',
    low_moi=False,
    use_all_cells_technical=True,  # Key difference
    python_env='/proj/.../pyroenv/bin/python',
    bayesdream_path='/proj/.../bayesDREAM',
    data_path=data_path,
)
gen.generate_all_scripts()
```

### Example: large dataset on CPU

```python
gen = SlurmJobGenerator(
    meta=meta,
    counts=counts_large,  # 100K genes × 200K cells
    cis_genes=['GFI1B'],
    output_dir='./slurm_jobs_cpu',
    label='perturb_seq_large',
    partition_preference='cpu',  # Force CPU
    time_multiplier=3.0,         # CPU is slower
    python_env='/proj/.../pyroenv/bin/python',
    bayesdream_path='/proj/.../bayesDREAM',
    data_path=data_path,
)
gen.generate_all_scripts()
```

### Example: test run

```python
# Test on 2 genes with conservative settings
gen = SlurmJobGenerator(
    meta=meta,
    counts=counts,
    cis_genes=['GFI1B', 'TET2'],  # Just 2 genes
    output_dir='./slurm_jobs_test',
    label='test_run',
    time_multiplier=2.0,    # Extra time for safety
    max_concurrent_jobs=2,  # Don't overwhelm the cluster
    python_env='/proj/.../pyroenv/bin/python',
    bayesdream_path='/proj/.../bayesDREAM',
    data_path=data_path,
)
gen.generate_all_scripts()
```

### Adapting to other clusters

While this generator is optimized for Berzelius, it can be adapted for other SLURM clusters:
1. Update the resource specs in `_recommend_resources()`:
   ```python
   # Example for a different cluster:
   # - GPU nodes: 40GB VRAM, 256GB RAM
   # - CPU nodes: 8GB RAM per core
   ```
2. Update module loading in the generated scripts:
   ```bash
   # Replace:
   module load Anaconda/2021.05-nsc1
   # With your cluster's modules, e.g.:
   module load python/3.9
   module load cuda/11.8
   ```
3. Update the partition names:
   ```python
   # Change partition names to match your cluster
   'partition': 'gpu',    # Instead of 'berzelius'
   'constraint': 'v100',  # Instead of 'fat'
   ```
## Support

For questions about:
- bayesDREAM code: See repository documentation
- Berzelius cluster: Contact NSC support
- SLURM issues: Consult your cluster's documentation