stephenkocsis14/bioinformatics-pipelines

Bioinformatics Pipelines

A suite of production-ready bioinformatics pipelines for HPC SLURM environments. Each pipeline auto-detects samples from the directory structure, supports single-end (SE) and paired-end (PE) sequencing, uses SLURM job arrays for parallelism, and saves all intermediate outputs and figures.

Pipelines

| Pipeline     | Directory             | Description                                                    |
| ------------ | --------------------- | -------------------------------------------------------------- |
| Bulk RNA-seq | pipelines/bulk_rnaseq/ | Differential expression analysis with DESeq2, edgeR, limma-voom |
| scRNA-seq    | pipelines/scrna_seq/   | Single-cell analysis with Seurat, Scanpy, Cell Ranger           |
| ChIP-seq     | pipelines/chipseq/     | Peak calling and differential binding with MACS2, DiffBind      |
| ATAC-seq     | pipelines/atacseq/     | Chromatin accessibility with MACS2, footprinting, DiffBind      |
| WGS          | pipelines/wgs/         | Whole genome variant calling with GATK, DeepVariant, Mutect2    |
| WES          | pipelines/wes/         | Whole exome variant calling with interval restriction + exome QC |

Project Structure

bioinformatics-pipelines/
├── lib/                            # Shared utilities (sourced by all pipelines)
│   ├── utils.sh                    # Logging, sample detection, FASTQ detection
│   ├── genome_refs.sh              # Human GRCh38 + Mouse GRCm39 reference paths
│   ├── slurm_utils.sh              # Job array submission, dependency chaining
│   ├── parse_config.R              # R: reads config.sh into named list
│   └── parse_config.py             # Python: reads config.sh into dict
├── templates/
│   ├── slurm_header.sh             # SBATCH directive template
│   ├── config_template.sh          # Annotated config template
│   └── sample_sheet_template.tsv   # Sample sheet template
└── pipelines/
    ├── bulk_rnaseq/
    ├── scrna_seq/
    ├── chipseq/
    ├── atacseq/
    ├── wgs/
    └── wes/
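The helpers in lib/utils.sh handle logging and sample detection. A minimal sketch of what such utilities might look like (the function names log_info and detect_samples are illustrative, not necessarily those used in this repository):

```shell
# Illustrative sketch of shared utilities like those described for lib/utils.sh.
# Function names here are hypothetical.

log_info() {
    # Timestamped log line to stderr, keeping stdout clean for data
    echo "[$(date +'%Y-%m-%d %H:%M:%S')] INFO: $*" >&2
}

detect_samples() {
    # One sample per subdirectory of the given fastq/ root
    local fastq_dir="$1"
    local d
    for d in "$fastq_dir"/*/; do
        basename "$d"
    done | sort
}
```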

Each pipeline follows the same structure:

pipelines/{name}/
├── config.sh                # Pipeline configuration
├── sample_sheet.tsv         # Sample metadata
├── scripts/                 # Numbered analysis scripts + run_all.sh
└── docs/README.md           # Pipeline documentation

Quick Start

  1. Configure: Edit pipelines/{name}/config.sh — set paths, genome, modules, SLURM account
  2. Prepare inputs: Place FASTQs in fastq/{sample_id}/ directories; fill out sample_sheet.tsv
  3. Run: bash pipelines/{name}/scripts/run_all.sh
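As a rough illustration of step 1, a filled-in config.sh might look like the following. These variable names are hypothetical; consult the actual pipelines/{name}/config.sh for the ones this repository uses.

```shell
# Hypothetical excerpt of pipelines/{name}/config.sh; real variable names may differ.
GENOME="GRCh38"                # or GRCm39 for mouse
FASTQ_DIR="fastq"              # root holding per-sample subdirectories
SLURM_ACCOUNT="my_lab"         # account passed to sbatch
SLURM_PARTITION="general"      # cluster partition to submit to
MODULES="fastqc trimmomatic"   # environment modules to load
PAIRED_END=true                # set to false for single-end data
```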

SLURM Job Array Strategy

  • Per-sample steps (QC, trim, align, dedup) run as job arrays (--array=1-N)
  • Aggregate steps (MultiQC, DE, visualization) run as single jobs with --dependency=afterok:{JOB_ID}
  • run_all.sh orchestrates the full dependency chain automatically
  • Individual steps: sbatch scripts/02_trim.sh sample_list.txt
  • Single sample debug: SLURM_ARRAY_TASK_ID=3 bash scripts/02_trim.sh sample_list.txt
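The array + dependency pattern above can be sketched as a pair of helpers. The sbatch flags (--parsable, --array, --dependency=afterok) are standard SLURM; the function names and the 03_multiqc.sh script are hypothetical, not necessarily what run_all.sh does internally.

```shell
# Sketch of the job-array + dependency-chaining pattern (function names hypothetical).

submit_per_sample() {
    # Submit a per-sample script as a job array sized to the sample list.
    # --parsable makes sbatch print only the job ID, for chaining.
    local script="$1" list="$2" n
    n=$(wc -l < "$list")
    sbatch --parsable --array=1-"$n" "$script" "$list"
}

submit_aggregate() {
    # Run an aggregate step only after every array task exits successfully
    local script="$1" dep_job="$2"
    sbatch --parsable --dependency=afterok:"$dep_job" "$script"
}

# Usage (on a SLURM cluster):
#   trim_job=$(submit_per_sample scripts/02_trim.sh sample_list.txt)
#   submit_aggregate scripts/03_multiqc.sh "$trim_job"   # script name hypothetical
```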

Input Format

FASTQs must be organized by sample:

fastq/
├── sample_A/
│   ├── sample_A_R1.fastq.gz
│   └── sample_A_R2.fastq.gz    # (PE only)
└── sample_B/
    ├── sample_B_R1.fastq.gz
    └── sample_B_R2.fastq.gz

Naming conventions supported: _R1/_R2 or _1/_2 suffixes.
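Supporting both suffix styles amounts to trying each candidate filename in turn. A minimal sketch (the function name find_r1 is illustrative; the repository's FASTQ detection in lib/utils.sh may work differently):

```shell
# Sketch of R1 detection supporting both _R1 and _1 suffixes (function name hypothetical).

find_r1() {
    # Print the R1 FASTQ for a sample directory, whichever suffix style exists
    local dir="$1" sample f
    sample=$(basename "$dir")
    for f in "$dir/${sample}_R1.fastq.gz" "$dir/${sample}_1.fastq.gz"; do
        if [ -e "$f" ]; then
            echo "$f"
            return 0
        fi
    done
    return 1   # no R1 file found under either convention
}
```

The R2 mate can be located the same way by substituting _R2/_2.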

Supported Genomes

  • Human: GRCh38 (hg38) — all indexes, annotations, known sites
  • Mouse: GRCm39 (mm39) — all indexes, annotations, known sites

Requirements

  • SLURM cluster with module system
  • Standard bioinformatics HPC modules (FastQC, Trimmomatic, STAR, BWA, GATK, etc.)
  • R 4.x with Bioconductor packages
  • Python 3.x with scanpy, scvelo, etc.

See individual pipeline docs/README.md for specific requirements.
