CRISPR Pipeline

A comprehensive pipeline for single-cell Perturb-Seq analysis that enables robust processing and analysis of CRISPR screening data at single-cell resolution.

Prerequisites

Nextflow and Singularity must be installed before running the pipeline:

Nextflow (version > 24)

Workflow manager for executing the pipeline:

conda install bioconda::nextflow

Singularity

Container platform that must be available in your execution environment.
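
To confirm both tools are available before launching, check their versions (standard version flags for both tools):

    nextflow -version       # should report version 24 or later
    singularity --version   # confirms the container runtime is installed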

Nextflow Tower Integration

Nextflow Tower (now Seqera Platform) provides web-based monitoring and management of pipeline executions.

To enable Nextflow Tower, you need a TOWER_ACCESS_TOKEN.

To obtain your token:

  1. Create/login to your account at cloud.tower.nf
  2. Navigate to Settings > Your tokens
  3. Click "Add token" and generate a new token
  4. Set as environment variable: export TOWER_ACCESS_TOKEN=your_token_here
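
With the token exported, runs report to Tower via Nextflow's standard -with-tower flag; a minimal sketch (the samplesheet name is illustrative):

    export TOWER_ACCESS_TOKEN=your_token_here
    nextflow run main.nf -profile local --input samplesheet.tsv --outdir ./outputs/ -with-tower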

Pipeline Installation

To install the pipeline:

git clone https://github.com/pinellolab/CRISPR_Pipeline.git

Input Requirements

File Descriptions

FASTQ Files

  • {sample}_R1.fastq.gz: Contains cell barcode and UMI sequences
  • {sample}_R2.fastq.gz: Contains transcript sequences

Configuration Files (see example_data/)

  • rna_seqspec.yml: Defines RNA sequencing structure and parameters
  • guide_seqspec.yml: Specifies guide RNA detection parameters
  • hash_seqspec.yml: Defines cell hashing structure (required if using cell hashing)
  • barcode_onlist.txt: List of valid cell barcodes

Metadata Files (see example_data/)

  • guide_metadata.tsv: Contains guide RNA information and annotations
  • hash_metadata.tsv: Cell hashing sample information (required if using cell hashing)
  • pairs_to_test.csv: Defines perturbation pairs for comparison analysis (required if testing predefined pairs; a hypothetical sketch follows below)
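
Purely as an illustration, a predefined-pairs file could be created like this; the column names here are assumptions borrowed from the output schema, not the confirmed input format (see the documentation for the real schema):

    # Hypothetical sketch only: column names are assumptions; check the documentation for the real schema
    printf 'gene_id,intended_target_name\nENSG00000141510,TP53_enh_1\n' > pairs_to_test.csv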

For detailed specifications, see our documentation.

Running the Pipeline

Pipeline Configuration

Before running the pipeline, customize the configuration files for your environment:

1. Data and Analysis Parameters (nextflow.config)

Update the pipeline-specific parameters in the params section, for example:

// Input data paths
    input = null

    // TO-DO, pipeline parameters
    ENABLE_DATA_HASHING = false // enable for datasets containing cell hashing data
    ENABLE_SCRUBLET = false // Scrublet can be challenging on datasets with hundreds of thousands of cells (e.g. 300k+)
    use_igvf_reference = true // download the reference from IGVF for gene annotation; when true, this overrides the reference transcriptome and GTF download path below
    is_10x3v3 = true // for 10x 3' v3 chemistry, enables barcode translation; without it, guides and transcriptomes from the same cell point to different barcodes and the overlap between modalities will be very small
    DUAL_GUIDE = false // set true for dual-guide systems, e.g. Replogle et al. 2022
    REFERENCE_transcriptome = 'human' // used to download the kallisto human index
    REFERENCE_gtf_download_path = 'https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_46/gencode.v46.annotation.gtf.gz' // used when building a custom reference from a remote GTF
    REFERENCE_gtf_local_path = '/path/to/gencode_gtf.gtf.gz' // used when providing your own local GTF file

    QC_min_genes_per_cell = 500 // filter out low-quality cells (fewer than this many genes with more than one read)
    QC_min_cells_per_gene = 0.05 // fraction of cells in which a gene must be detected to be included in the inference steps (e.g. 0.05)
    QC_pct_mito = 20 // discard cells whose mitochondrial reads exceed this percentage of their total counts

    Multiplicity_of_infection = 'low' // 'high' or 'low'

    GUIDE_ASSIGNMENT_method = 'sceptre'
    GUIDE_ASSIGNMENT_capture_method = 'CROP-seq'
    GUIDE_ASSIGNMENT_cleanser_probability_threshold = 1
    GUIDE_ASSIGNMENT_SCEPTRE_probability_threshold = 'default'
    GUIDE_ASSIGNMENT_SCEPTRE_n_em_rep = 'default'

    INFERENCE_method = 'default' // 'sceptre' or 'perturbo'; 'default' runs sceptre and perturbo in cis, plus perturbo in trans (per element and per guide)
    INFERENCE_target_guide_pairing_strategy = 'default'

    INFERENCE_predefined_pairs_to_test = "path/to/file.csv"
    INFERENCE_max_target_distance_bp = 1000000

    INFERENCE_SCEPTRE_side = 'both'
    INFERENCE_SCEPTRE_grna_integration_strategy = 'union'
    INFERENCE_SCEPTRE_resampling_approximation = 'skew_normal'
    INFERENCE_SCEPTRE_control_group = 'default'
    INFERENCE_SCEPTRE_resampling_mechanism = 'default'
    INFERENCE_SCEPTRE_formula_object = 'default'

    NETWORK_custom_central_nodes = 'undefined'
    NETWORK_central_nodes_num = 1
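
Any of these parameters can also be overridden on the command line with double-dash flags instead of editing nextflow.config; for example (parameter names taken from the block above):

    nextflow run main.nf -profile local \
        --input samplesheet.tsv \
        --QC_min_genes_per_cell 300 \
        --Multiplicity_of_infection high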

2. Compute Environment Configuration

Choose and configure your compute profile by updating the relevant sections:

🖥️ Local
// Resource limits (adjust based on your machine)
max_cpus = 8           // Number of CPU cores available
max_memory = '32.GB'   // RAM available for the pipeline

// Run with: nextflow run main.nf -profile local
🏢 SLURM Cluster
// Resource limits (adjust based on cluster specs)
max_cpus = 128
max_memory = '512.GB'

// Update SLURM partitions in profiles section:
slurm {
    process {
        queue = 'short,normal,long'  // Replace with your partition names
    }
}

// Run with: nextflow run main.nf -profile slurm
☁️ Google Cloud Platform
// Update GCP settings
google_bucket = 'gs://your-bucket-name'
google_project = 'your-gcp-project-id'
google_region = 'us-central1'  // Choose your preferred region

// Resource limits
max_cpus = 128
max_memory = '512.GB'

// Run with (see more in GCP_user_notebook.ipynb): 
// export GOOGLE_APPLICATION_CREDENTIALS="/path/to/your/pipeline-service-key.json"
// nextflow run main.nf -profile google 

3. Container Configuration

The pipeline uses pre-built containers. Update if you have custom versions:

containers {
    base     = 'sjiang9/conda-docker:0.2'      // main analysis tools
    cleanser = 'sjiang9/cleanser:0.3'          // guide assignment
    sceptre  = 'sjiang9/sceptre-igvf:0.1'      // guide assignment / perturbation inference
    perturbo = 'loganblaine/perturbo:latest'   // perturbation inference
    aria2    = 'biasofpriene/aria2c'           // fast download of reference files
}
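
Singularity converts these Docker images on first use, so you can optionally pre-pull an image to warm the cache before a run; for example, with one of the images above:

    # Pre-cache a container image (optional); avoids conversion time on the first run
    singularity pull docker://sjiang9/cleanser:0.3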

🎯 Resource Sizing Guidelines

Recommended Starting Values:

| Environment           | max_cpus | max_memory | Notes                      |
|-----------------------|----------|------------|----------------------------|
| Local (development)   | 4-8      | 16-32 GB   | For testing small datasets |
| Local (full analysis) | 8-16     | 64-128 GB  | For complete runs          |
| SLURM cluster         | 64-128   | 256-512 GB | Adjust based on node specs |
| Google Cloud          | 128+     | 512 GB+    | Can scale dynamically      |

🔧 Testing Your Configuration

  1. Validate syntax:

    nextflow config -profile local  # Test local profile
    nextflow config -profile slurm  # Test SLURM profile
  2. Test with small dataset:

    # Start with a subset of your data
    # Make all scripts executable (required for pipeline execution)
    chmod +x bin/*
    # Run the pipeline
    nextflow run main.nf -profile local --input small_test.tsv --outdir ./Outputs

💡 Pro Tips

  • Start conservative: Begin with lower resource limits and increase as needed
  • Profile-specific limits: The pipeline automatically scales resources based on retry attempts
  • Development workflow: Use local profile for code testing, cluster/cloud for production runs

🚨 Common Issues

  • Memory errors: Increase max_memory if you see out-of-memory failures
  • Queue timeouts: Adjust SLURM partition names to match your cluster
  • Permission errors: Ensure your Google Cloud service account has proper permissions
  • Container issues: Verify Singularity is available on your system
  • Missing files: Double-check paths in nextflow.config and actual files in example_data

Output Description

The output files are generated in the pipeline_outputs and pipeline_dashboard directories.

Generated Files

Within the pipeline_outputs directory, you will find:

  • inference_mudata.h5mu - MuData format output
  • per_element_output.tsv - Per-element analysis
  • per_guide_output.tsv - Per-guide analysis

Structure:

📁 pipeline_outputs/

This folder contains the tabular results produced by the CRISPR inference pipeline.
All result files are tsv.gz (tab-separated, gzip-compressed) unless noted otherwise.


🧬 Cis analysis

Files

| File | Description |
|------|-------------|
| cis_per_guide_results.tsv.gz | Inference results for cis guide-gene pairs, with guides tested independently. |
| cis_per_element_results.tsv.gz | Inference results for cis element-gene pairs, with guides grouped by intended target element. |

Schema

cis_per_guide_results.tsv.gz

| Column | Description |
|--------|-------------|
| gene_id | Ensembl gene ID |
| guide_id | Guide identifier (guide name) |
| sceptre_log2_fc | SCEPTRE effect size estimate (log2 fold change) |
| sceptre_p_value | SCEPTRE (uncorrected) p-value of differential expression |
| perturbo_log2_fc | PerTurbo effect size estimate (log2 fold change) |
| perturbo_p_value | PerTurbo (uncorrected) posterior probability of differential expression |

cis_per_element_results.tsv.gz

| Column | Description |
|--------|-------------|
| gene_id | Ensembl gene ID |
| intended_target_name | Intended target element name |
| sceptre_log2_fc | SCEPTRE effect size estimate (log2 fold change) |
| sceptre_p_value | SCEPTRE (uncorrected) p-value of differential expression |
| perturbo_log2_fc | PerTurbo effect size estimate (log2 fold change) |
| perturbo_p_value | PerTurbo (uncorrected) posterior probability of differential expression |

🌐 Trans analysis

Files

| File | Description |
|------|-------------|
| trans_per_guide_results.tsv.gz | PerTurbo inference results for all guide-gene pairs, with guides tested independently. |
| trans_per_element_results.tsv.gz | PerTurbo inference results for all element-gene pairs, with guides grouped by intended target element. |

Schema

trans_per_guide_results.tsv.gz

| Column | Description |
|--------|-------------|
| gene_id | Ensembl gene ID |
| guide_id | Guide identifier (guide name) |
| log2_fc | PerTurbo effect size (log2 fold change) |
| p_value | PerTurbo (uncorrected) posterior probability of differential expression |

trans_per_element_results.tsv.gz

| Column | Description |
|--------|-------------|
| gene_id | Ensembl gene ID |
| intended_target_name | Intended target element name |
| log2_fc | PerTurbo effect size (log2 fold change) |
| p_value | PerTurbo (uncorrected) posterior probability of differential expression |

Other files

   ├── per_element_output.tsv.gz : per-element analysis results
   ├── per_guide_output.tsv.gz : per-guide analysis results
   ├── perturbo_per_element_output.tsv : PerTurbo per-element results
   ├── perturbo_per_guide_output.tsv : PerTurbo per-guide results
   └── inference_mudata.h5mu : combined results in MuData format
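
The MuData output can be inspected directly in Python; a minimal sketch, assuming the mudata package is installed:

    # Summarize the combined results (requires: pip install mudata)
    python -c 'import mudata as md; print(md.read_h5mu("pipeline_outputs/inference_mudata.h5mu"))'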

For details, see our documentation.

Generated Figures

The pipeline produces several figures:

Within the pipeline_dashboard directory, you will find:

  1. Evaluation Output:

    • network_plot.png: Gene interaction networks visualization.
    • volcano_plot.png: Volcano plot of gRNA-gene pair effects.
    • IGV files (.bedgraph and .bedpe): Genome browser visualization files.
  2. Analysis Figures:

    • knee_plot_scRNA.png: Knee plot of UMI counts vs. barcode index.
    • scatterplot_scrna.png: Scatterplot of total counts vs. genes detected, colored by mitochondrial content.
    • violin_plot.png: Distribution of gene counts, total counts, and mitochondrial content.
    • scRNA_barcodes_UMI_thresholds.png: Number of scRNA barcodes using different Total UMI thresholds.
    • guides_per_cell_histogram.png: Histogram of guides per cell.
    • cells_per_guide_histogram.png: Histogram of cells per guide.
    • guides_UMI_thresholds.png: Simulates the final number of cells with assigned guides under different minimum UMI thresholds (at least one guide above the threshold value); use it to check whether the number of cells with assigned guides matches your expected cell count.
    • cells_per_htp_barplot.png: Number of cells across different HTOs.
    • umap_hto.png: UMAP clustering of cells based on HTOs (the dimensions reflect the distribution of HTOs in each cell).
    • umap_hto_singlets.png: UMAP clustering of cells based on HTOs, with multiplets removed.
  3. seqSpec Plots:

    • seqSpec_check_plots.png: Per-position nucleotide frequencies along Read 1 and Read 2; use these plots to verify that the expected read parts show their expected sequence signatures.

Structure:

📁 pipeline_dashboard/
  ├── 📄 dashboard.html                         
  │
  ├── 📁 evaluation_output/                      
  │   ├── 🖼️ network_plot.png                   
  │   ├── 🖼️ volcano_plot.png                  
  │   ├── 📄 igv.bedgraph                     
  │   └── 📄 igv.bedpe                         
  │
  ├── 📁 figures/
  │   ├── 🖼️ knee_plot_scRNA.png                
  │   ├── 🖼️ scatterplot_scrna.png              
  │   ├── 🖼️ violin_plot.png                    
  │   ├── 🖼️ scRNA_barcodes_UMI_thresholds.png  
  │   ├── 🖼️ guides_per_cell_histogram.png      
  │   ├── 🖼️ cells_per_guide_histogram.png      
  │   ├── 🖼️ guides_UMI_thresholds.png          
  │   ├── 🖼️ cells_per_htp_barplot.png          
  │   ├── 🖼️ umap_hto.png                       
  │   └── 🖼️ umap_hto_singlets.png              
  │
  ├── 📁 guide_seqSpec_plots/
  │   └── 🖼️ seqSpec_check_plots.png            
  │
  └── 📁 hashing_seqSpec_plots/
      └── 🖼️ seqSpec_check_plots.png             

Pipeline Testing Guide

To ensure proper pipeline functionality, we provide two extensively validated datasets for testing purposes.

Available Test Datasets

1. TF_Perturb_Seq_Pilot Dataset (Gary-Hon Lab)

The TF_Perturb_Seq_Pilot dataset was generated by the Gary-Hon Lab and is available through the IGVF Data Portal under Analysis Set ID: IGVFDS4389OUWU. To access the fastq files, you need to:

  1. First, register for an account on the IGVF Data Portal to obtain your access credentials.

  2. Once you have your credentials, you can use our provided Python script to download all necessary FASTQ files:

    cd example_data
    python download_fastq.py \
        --sample per-sample_file.tsv \
        --access-key YOUR_ACCESS_KEY \
        --secret-key YOUR_SECRET_KEY

    💡 Note: You'll need to replace YOUR_ACCESS_KEY and YOUR_SECRET_KEY with the credentials from your IGVF portal account. These credentials can be found in your IGVF portal profile settings.

All other required input files for running the pipeline with this dataset are already included in the repository under the example_data directory.

2. Gasperini et al. Dataset

This dataset comes from a large-scale CRISPR screen study published in Cell (Gasperini et al., 2019: "A Genome-wide Framework for Mapping Gene Regulation via Cellular Genetic Screens") and provides an excellent resource for testing the pipeline. The full dataset, including raw sequencing data and processed files, is publicly available through GEO under accession number GSE120861.

Step-by-Step Testing Instructions

  1. Environment Setup

    # Clone and enter the repository
    git clone https://github.com/pinellolab/CRISPR_Pipeline.git
    cd CRISPR_Pipeline
  2. Choose Your Dataset and Follow the Corresponding Instructions:

    Option A: TF_Perturb_Seq_Pilot Dataset

    # Run with LOCAL
    nextflow run main.nf \
       -profile local \
       --input samplesheet.tsv \
       --outdir ./outputs/
    
    # Run with SLURM
    nextflow run main.nf \
       -profile slurm \
       --input samplesheet.tsv \
       --outdir ./outputs/
    
    # Run with GCP
    nextflow run main.nf \
       -profile google \
       --input samplesheet.tsv \
       --outdir gs://igvf-pertub-seq-pipeline-data/scratch/ # Path to your GCP bucket

    Option B: Gasperini Dataset

    1. Set up the configuration files:

      # Copy configuration files and example data
      cp example_gasperini/nextflow.config nextflow.config
      cp -r example_gasperini/example_data/* example_data/
    2. Obtain sequencing data:

      • Download a subset of the Gasperini dataset to your own server.
      • Place the resulting files in the example_data/fastq_files directory:

      # Run these commands from inside example_data/fastq_files
      cd example_data/fastq_files
      NTHREADS=16

      # Get the 10x Genomics bamtofastq utility
      wget https://github.com/10XGenomics/bamtofastq/releases/download/v1.4.1/bamtofastq_linux && chmod +x bamtofastq_linux

      # Convert the guide-capture and scRNA-seq BAMs back to FASTQ
      wget https://sra-pub-src-1.s3.amazonaws.com/SRR7967488/pilot_highmoi_screen.1_CGTTACCG.grna.bam.1 && mv pilot_highmoi_screen.1_CGTTACCG.grna.bam.1 pilot_highmoi_screen.1_CGTTACCG.grna.bam
      ./bamtofastq_linux --nthreads="$NTHREADS" pilot_highmoi_screen.1_CGTTACCG.grna.bam bam_pilot_guide_1

      wget https://sra-pub-src-1.s3.amazonaws.com/SRR7967482/pilot_highmoi_screen.1_SI_GA_G1.bam.1 && mv pilot_highmoi_screen.1_SI_GA_G1.bam.1 pilot_highmoi_screen.1_SI_GA_G1.bam
      ./bamtofastq_linux --nthreads="$NTHREADS" pilot_highmoi_screen.1_SI_GA_G1.bam bam_pilot_scrna_1


      Now you should see the bam_pilot_guide_1 and bam_pilot_scrna_1 directories inside the example_data/fastq_files directory. Inside bam_pilot_guide_1 and bam_pilot_scrna_1, there are multiple sets of FASTQ files.

    3. Prepare the whitelist:

      # Extract the compressed whitelist file into example_data/yaml_files/
      unzip example_data/yaml_files/3M-february-2018.txt.zip -d example_data/yaml_files

      You should now see 3M-february-2018.txt inside the example_data/yaml_files/ directory.

    4. Launch the pipeline:

      # Run with LOCAL
      nextflow run main.nf \
         -profile local \
         --input samplesheet.tsv \
         --outdir ./outputs/
      
      # Run with SLURM
      nextflow run main.nf \
         -profile slurm \
         --input samplesheet.tsv \
         --outdir ./outputs/
      
      # Run with GCP
      nextflow run main.nf \
         -profile google \
         --input samplesheet.tsv \
         --outdir gs://igvf-pertub-seq-pipeline-data/scratch/ # Path to your GCP bucket

Expected Outputs

The pipeline generates two directories upon completion:

  • pipeline_outputs: Contains all analysis results
  • pipeline_dashboard: Houses interactive visualization reports

Troubleshooting

If you encounter any issues during testing:

  1. Review log files and intermediate results in the work/ directory
  2. Verify that all input files meet the required format specifications
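
  3. If a run fails partway, resume it from cached results rather than starting over (the -resume flag is standard Nextflow):

    nextflow run main.nf -profile local --input samplesheet.tsv --outdir ./outputs/ -resume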

For additional support or questions, please open an issue on our GitHub repository.

Credits

We thank everyone who provided extensive assistance in the development of this pipeline.

Contributions and Support

If you would like to contribute to this pipeline, please see the contributing guidelines.

For further information or help, don't hesitate to get in touch on the Slack #fg-crispr channel.

Citations

An extensive list of references for the tools used by the pipeline can be found in the CITATIONS.md file.

You can cite the nf-core publication as follows:

The nf-core framework for community-curated bioinformatics pipelines.

Philip Ewels, Alexander Peltzer, Sven Fillinger, Harshil Patel, Johannes Alneberg, Andreas Wilm, Maxime Ulysse Garcia, Paolo Di Tommaso & Sven Nahnsen.

Nat Biotechnol. 2020 Feb 13. doi: 10.1038/s41587-020-0439-x.
