| 1 | +(rnaseq-nf-page)= |
| 2 | + |
| 3 | +# Getting started with rnaseq-nf |
| 4 | + |
| 5 | +[`rnaseq-nf`](https://github.com/nextflow-io/rnaseq-nf) is a basic Nextflow pipeline for RNA-Seq analysis that performs quality control, transcript quantification, and result aggregation. The pipeline processes paired-end FASTQ files, generates quality control reports with [FastQC](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/), quantifies transcripts with [Salmon](https://salmon.readthedocs.io/en/latest/salmon.html), and produces a unified report with [MultiQC](https://seqera.io/multiqc/). |
| 6 | + |
| 7 | +This tutorial describes the architecture of the [`rnaseq-nf`](https://github.com/nextflow-io/rnaseq-nf) pipeline and provides instructions on how to run it. |
| 8 | + |
| 9 | +## Pipeline architecture |
| 10 | + |
| 11 | +The pipeline is organized into modular workflows and processes that coordinate data flow from input files through analysis steps to final outputs. |
| 12 | + |
| 13 | +### Entry workflow |
| 14 | + |
| 15 | +The [entry workflow](https://github.com/nextflow-io/rnaseq-nf/blob/master/main.nf) orchestrates the entire pipeline by coordinating input parameters and data flow: |
| 16 | + |
| 17 | +```{mermaid} |
| 18 | +flowchart TB |
| 19 | + subgraph " " |
| 20 | + subgraph params |
| 21 | + v0["transcriptome"] |
| 22 | + v1["reads"] |
| 23 | + v5["multiqc"] |
| 24 | + v2["outdir"] |
| 25 | + end |
| 26 | + v4([RNASEQ]) |
| 27 | + v6([MULTIQC]) |
| 28 | + v0 --> v4 |
| 29 | + v1 --> v4 |
| 30 | + v4 --> v6 |
| 31 | + v5 --> v6 |
| 32 | + end |
| 33 | +``` |
| 34 | + |
| 35 | +Data flow: |
| 36 | + |
| 37 | +- The `transcriptome` and `reads` parameters are passed to the `RNASEQ` subworkflow, which performs indexing, quality control, and quantification. |
| 38 | + |
| 39 | +- The outputs from `RNASEQ`, along with the MultiQC configuration (`multiqc`), are passed to the `MULTIQC` module, which aggregates results into a unified HTML report. |
| 40 | + |
| 41 | +- The `outdir` parameter defines where all results are published. |
| 42 | + |
| 43 | +### `RNASEQ` |
| 44 | + |
| 45 | +The [`RNASEQ`](https://github.com/nextflow-io/rnaseq-nf/blob/master/modules/rnaseq.nf) subworkflow coordinates three processes. `INDEX` and `FASTQC` do not depend on each other and can run in parallel, while `QUANT` waits for the index produced by `INDEX`: |
| 46 | + |
| 47 | +```{mermaid} |
| 48 | +flowchart TB |
| 49 | + subgraph RNASEQ |
| 50 | + subgraph take |
| 51 | + v0["read_pairs_ch"] |
| 52 | + v1["transcriptome"] |
| 53 | + end |
| 54 | + v2([INDEX]) |
| 55 | + v4([FASTQC]) |
| 56 | + v6([QUANT]) |
| 57 | + subgraph emit |
| 58 | + v8["fastqc"] |
| 59 | + v9["quant"] |
| 60 | + end |
| 61 | + v1 --> v2 |
| 62 | + v0 --> v4 |
| 63 | + v0 --> v6 |
| 64 | + v2 --> v6 |
| 65 | + v4 --> v8 |
| 66 | + v6 --> v9 |
| 67 | + end |
| 68 | +``` |
| 69 | + |
| 70 | +Inputs (`take:`): |
| 71 | + |
| 72 | +- `read_pairs_ch`: A channel of paired-end read files |
| 73 | +- `transcriptome`: A reference transcriptome file |
| 74 | + |
| 75 | +Data flow (`main:`): |
| 76 | + |
| 77 | +- [`INDEX`](https://github.com/nextflow-io/rnaseq-nf/blob/master/modules/index/main.nf) creates a Salmon index from the `transcriptome` input (runs once). |
| 78 | + |
| 79 | +- [`FASTQC`](https://github.com/nextflow-io/rnaseq-nf/blob/master/modules/fastqc/main.nf) analyzes the samples in the `read_pairs_ch` channel in parallel (runs independently for each sample). |
| 80 | + |
| 81 | +- [`QUANT`](https://github.com/nextflow-io/rnaseq-nf/blob/master/modules/quant/main.nf) quantifies transcripts using the index from `INDEX` and the samples in the `read_pairs_ch` channel (runs for each sample after `INDEX` completes). |
| 82 | + |
| 83 | +Outputs (`emit:`): |
| 84 | + |
| 85 | +- `fastqc`: The results from `FASTQC` |
| 86 | + |
| 87 | +- `quant`: The results from `QUANT` |
| 88 | + |
| 89 | +### `MULTIQC` |
| 90 | + |
| 91 | +The [`MULTIQC`](https://github.com/nextflow-io/rnaseq-nf/blob/master/modules/multiqc/main.nf) process aggregates all quality control and quantification outputs into a comprehensive HTML report. |
| 92 | + |
| 93 | +Inputs: |
| 94 | + |
| 95 | +- Input files: All collected outputs from the `RNASEQ` subworkflow (FastQC reports and Salmon quantification files). |
| 96 | +- `config`: MultiQC configuration files and branding (logo, styling). |
| 97 | + |
| 98 | +Process execution: |
| 99 | + |
| 100 | +- `MULTIQC` scans all input files, extracts metrics and statistics, and generates a unified report. |
| 101 | + |
| 102 | +Outputs: |
| 103 | + |
| 104 | +- `multiqc_report.html`: A single consolidated HTML report providing an overview of: |
| 105 | + - General stats |
| 106 | + - Salmon fragment length distribution |
| 107 | + - FastQC quality control |
| 108 | + - Software versions |
| 109 | + |
| 110 | +## Pipeline parameters |
| 111 | + |
| 112 | +You can customize the pipeline with command-line parameters that set the input data, the output location, and the MultiQC configuration directory. |
| 113 | + |
| 114 | +The following parameters are available: |
| 115 | + |
| 116 | +- `--reads`: Path to paired-end FASTQ files (default: `data/ggal/ggal_gut_{1,2}.fq`). |
| 117 | + |
| 118 | +- `--transcriptome`: Path to reference transcriptome FASTA (default: `data/ggal/ggal_1_48850000_49020000.Ggal71.500bpflank.fa`). |
| 119 | + |
| 120 | +- `--outdir`: Output directory for results (default: `results`). |
| 121 | + |
| 122 | +- `--multiqc`: Path to MultiQC configuration directory (default: `multiqc`). |
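
As a hedged illustration, the same parameters can be collected in a YAML file and passed with Nextflow's `-params-file` option instead of being repeated on the command line. The file name `params.yaml` and the paths inside it are placeholders, not files shipped with the pipeline.

```shell
# Write the parameters to a YAML file (file name and paths are placeholders).
cat > params.yaml <<'EOF'
reads: "/path/to/reads/*_{1,2}.fastq.gz"
transcriptome: "/path/to/transcriptome.fa"
outdir: "my_results"
EOF

# Run the pipeline with the parameters file, but only if Nextflow is installed.
if command -v nextflow >/dev/null 2>&1; then
  nextflow run nextflow-io/rnaseq-nf -params-file params.yaml -profile docker
else
  echo "nextflow not found on PATH; skipping run" >&2
fi
```

Parameters left out of the file keep the defaults listed above.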
| 123 | + |
| 124 | +## Configuration profiles |
| 125 | + |
| 126 | +Configuration profiles customize how and where the pipeline runs. Select a profile with the `-profile` option, or pass several profiles as a comma-separated list. Profiles are defined in the [`nextflow.config`](https://github.com/nextflow-io/rnaseq-nf/blob/master/nextflow.config) file in the pipeline's base directory. |
| 127 | + |
| 128 | +<h3>Software profiles</h3> |
| 129 | + |
| 130 | +Software profiles specify how software dependencies for processes should be provisioned: |
| 131 | + |
| 132 | +- `conda`: Provision a Conda environment for each process based on its required Conda packages |
| 133 | +- `docker`: Use a Docker container which contains all required dependencies |
| 134 | +- `singularity`: Use a Singularity container which contains all required dependencies |
| 135 | +- `wave`: Provision a Wave container for each process based on its required Conda packages |
| 136 | + |
| 137 | +:::{note} |
| 138 | +The respective container runtime or package manager must be installed to use these profiles. |
| 139 | +::: |
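
Because each software profile depends on a tool being installed, a wrapper script might check what is available before choosing one. The sketch below is one hypothetical way to do that. The function name `pick_software_profile` and the preference order (Docker, then Singularity, then Conda) are assumptions for illustration, not part of the pipeline, and the check only confirms that a command exists on `PATH`, not that the runtime is working.

```shell
# Hypothetical helper: print the first software profile whose tool is on PATH.
# The preference order is an assumption made for illustration only.
pick_software_profile() {
  for tool in docker singularity conda; do
    if command -v "$tool" >/dev/null 2>&1; then
      # These profile names are the same as the command names.
      echo "$tool"
      return 0
    fi
  done
  echo "no supported container runtime or package manager found" >&2
  return 1
}

# Example usage (uncomment to run the pipeline with the detected profile):
# nextflow run nextflow-io/rnaseq-nf -profile "$(pick_software_profile)"
```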
| 140 | + |
| 141 | +<h3>Execution profiles</h3> |
| 142 | + |
| 143 | +Execution profiles specify the compute and storage environment used by the pipeline: |
| 144 | + |
| 145 | +- `slurm`: Run on a SLURM HPC cluster |
| 146 | +- `batch`: Run on AWS Batch |
| 147 | +- `google-batch`: Run on Google Cloud Batch |
| 148 | +- `azure-batch`: Run on Azure Batch |
| 149 | + |
| 150 | +:::{note} |
| 151 | +Depending on your environment, you may need to configure underlying infrastructure such as resource pools, storage, and credentials. |
| 152 | +::: |
| 153 | + |
| 154 | +## Test data |
| 155 | + |
| 156 | +The pipeline includes test data in the [`data/ggal/`](https://github.com/nextflow-io/rnaseq-nf/tree/master/data/ggal) directory for demonstration and validation purposes: |
| 157 | + |
| 158 | +- Paired-end FASTQ files from four tissue samples (gut, liver, lung, spleen): |
| 159 | + - `ggal_gut_{1,2}.fq` |
| 160 | + - `ggal_liver_{1,2}.fq` |
| 161 | + - `ggal_lung_{1,2}.fq` |
| 162 | + - `ggal_spleen_{1,2}.fq` |
| 163 | + |
| 164 | +- Reference transcriptome: |
| 165 | + - `ggal_1_48850000_49020000.Ggal71.500bpflank.fa` |
| 166 | + |
| 167 | +By default, only the `gut` sample is processed. You can use the `all-reads` profile to process all four tissue samples. |
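
The `{1,2}` in the file names stands for the two mates of each read pair. As a rough illustration of how such files group into samples, the following sketch lists the prefix shared by each `_1.fq`/`_2.fq` pair in a directory. The function name `list_read_pairs` is hypothetical, and this is not how Nextflow itself matches the files.

```shell
# Hypothetical helper: print the shared prefix of every
# <prefix>_1.fq / <prefix>_2.fq pair found in a directory.
list_read_pairs() {
  dir="${1:-data/ggal}"
  for first in "$dir"/*_1.fq; do
    # Skip the unexpanded pattern when no file matches.
    [ -e "$first" ] || continue
    prefix="${first%_1.fq}"
    # Report the pair only when the second mate exists too.
    if [ -e "${prefix}_2.fq" ]; then
      basename "$prefix"
    fi
  done
}
```

Run against a local copy of the test data directory, this should print one line per tissue sample, such as `ggal_gut`.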
| 168 | + |
| 169 | +## Quick start |
| 170 | + |
| 171 | +The [`rnaseq-nf`](https://github.com/nextflow-io/rnaseq-nf) pipeline runs out of the box. This section provides examples for running the pipeline with different configurations. |
| 172 | + |
| 173 | +### Basic execution |
| 174 | + |
| 175 | +Run the pipeline with default parameters using Docker: |
| 176 | + |
| 177 | +```bash |
| 178 | +nextflow run nextflow-io/rnaseq-nf -profile docker |
| 179 | +``` |
| 180 | + |
| 181 | +### Configuring individual parameters |
| 182 | + |
| 183 | +Override default parameters to use custom input files and output locations: |
| 184 | + |
| 185 | +```bash |
| 186 | +nextflow run nextflow-io/rnaseq-nf \ |
| 187 | + --reads '/path/to/reads/*_{1,2}.fastq.gz' \ |
| 188 | + --transcriptome '/path/to/transcriptome.fa' \ |
| 189 | + --outdir 'my_results' \ |
| 190 | + -profile docker |
| 191 | +``` |
| 192 | + |
| 193 | +### Using profiles |
| 194 | + |
| 195 | +Specify configuration profiles to customize runtime environments and data sources: |
| 196 | + |
| 197 | +```bash |
| 198 | +# Use Conda to provision software dependencies |
| 199 | +nextflow run nextflow-io/rnaseq-nf -profile conda |
| 200 | + |
| 201 | +# Run on a SLURM cluster |
| 202 | +nextflow run nextflow-io/rnaseq-nf -profile slurm |
| 203 | + |
| 204 | +# Combine multiple profiles: process all reads using Docker |
| 205 | +nextflow run nextflow-io/rnaseq-nf -profile all-reads,docker |
| 206 | +``` |
| 207 | + |
| 208 | +:::{tip} |
| 209 | +See [Configuration profiles](#configuration-profiles) for more information about profiles. |
| 210 | +::: |
| 211 | + |
| 212 | +## Expected outputs |
| 213 | + |
| 214 | +By default, the [`rnaseq-nf`](https://github.com/nextflow-io/rnaseq-nf) pipeline publishes the following outputs to the `results` directory (set with `--outdir`): |
| 215 | + |
| 216 | +``` |
| 217 | +results/ |
| 218 | +├── fastqc_<SAMPLE_ID>_logs/ # FastQC quality reports per sample |
| 219 | +│ ├── <SAMPLE_ID>_1_fastqc.html |
| 220 | +│ ├── <SAMPLE_ID>_1_fastqc.zip |
| 221 | +│ ├── <SAMPLE_ID>_2_fastqc.html |
| 222 | +│ └── <SAMPLE_ID>_2_fastqc.zip |
| 223 | +└── multiqc_report.html # Aggregated QC and Salmon report |
| 224 | +``` |
| 225 | + |
| 226 | +The MultiQC report (`multiqc_report.html`) can be viewed in a web browser. |
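
To confirm that a run produced the files described above, a short check like the following can be used. It is a sketch only: the function name `check_outputs` is hypothetical, and it assumes the directory layout shown in the tree, with `results` as the default output directory.

```shell
# Hypothetical helper: verify that the expected outputs exist in an output directory.
check_outputs() {
  outdir="${1:-results}"
  exit_code=0

  # The aggregated MultiQC report sits at the top level of the output directory.
  if [ -f "$outdir/multiqc_report.html" ]; then
    echo "found:   $outdir/multiqc_report.html"
  else
    echo "missing: $outdir/multiqc_report.html"
    exit_code=1
  fi

  # Each sample should have its own fastqc_<SAMPLE_ID>_logs directory.
  found_fastqc=0
  for logs in "$outdir"/fastqc_*_logs; do
    [ -d "$logs" ] || continue
    found_fastqc=1
    echo "found:   $logs"
  done
  if [ "$found_fastqc" -eq 0 ]; then
    echo "missing: $outdir/fastqc_<SAMPLE_ID>_logs"
    exit_code=1
  fi

  return "$exit_code"
}
```

If the pipeline was run with a custom `--outdir`, pass that directory as the argument, for example `check_outputs my_results`.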