nf-rnaseq is a bioinformatics pipeline that can be used to process RNA sequencing data.
The pipeline is built using the Nextflow Workflow Management System (WfMS), and uses Docker/Singularity containers for enhanced reproducibility. Being written in the Nextflow DSL2 language, this pipeline supports the definition of one container per process, thus making it easy to maintain and update software dependencies.
- Read QC (FastQC)
- Pseudo-alignment and transcript-level quantification (Salmon)
- Gene-level summarization (tximport)
- Summarize QC Reports (MultiQC)
- Install Nextflow (>=22.10.6). If the Conda package manager is already installed on your system, you can use the following commands:
  # Install nextflow
  conda install nextflow
  # Update to the latest version
  nextflow self-update
- Install any of Docker, Singularity, Conda
- Download the pipeline
  git clone https://github.com/alebarberis/nf-rnaseq
- Start running your own analysis
  nextflow run nf-rnaseq --input <DIR> --outdir <DIR> --salmon_index <DIR> --transcriptome <FILE> -profile <conda/docker/singularity>
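Before launching a run on a folder of paired-end reads, it can help to confirm that every first-in-pair file has a mate. The sketch below assumes files are named <sample><suffix>.<extension> (for example sampleA_1.fq.gz and sampleA_2.fq.gz, matching the defaults of --suffix1, --suffix2 and --filext in the usage statement). find_unpaired is a hypothetical helper, not part of nf-rnaseq, and the way the pipeline itself combines suffix and extension is an assumption here.

```shell
# Sketch only: print first-in-pair FASTQ files that have no matching mate.
# Arguments: folder, then optional extension, first suffix, second suffix.
# The function body runs in a subshell so its variables stay private.
find_unpaired() (
  dir="$1"
  filext="${2:-fq.gz}"
  suffix1="${3:-_1}"
  suffix2="${4:-_2}"
  for r1 in "$dir"/*"$suffix1.$filext"; do
    [ -e "$r1" ] || continue              # the glob matched nothing
    sample="${r1%"$suffix1.$filext"}"     # strip suffix and extension
    [ -e "$sample$suffix2.$filext" ] || basename "$r1"
  done
)
```

Calling find_unpaired /path/to/samples prints one file name per unpaired read file and prints nothing when all pairs are complete.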
nextflow run main.nf --help
=========================================
nf-rnaseq: a nextflow rna-seq pipeline
=========================================
version 0.0.9000
Usage:
The typical command for running the pipeline is as follows:
nextflow run main.nf --input /path/to/samples --transcriptome /path/to/transcriptome --salmon_index /path/to/index [OPTIONS]
Mandatory arguments:
  --input           DIRPATH     Folder containing FASTQ files, or file with
                                4 columns: id, read_1, read_2, library_type
  --transcriptome   FILEPATH    FASTA file containing the transcriptome (can be a gzip file)
  --salmon_index    DIRPATH     Folder containing the index on the transcriptome. If empty
                                a new index will be automatically generated
  --modules         STRING      The pipeline modules to run (default: 'fastqc,quant,multiqc').
                                Available modules are: fastqc, quant, multiqc
Optional arguments:
  --filext          STRING      Extension of input files (default: fq.gz)
  --suffix1         STRING      Suffix of first file in paired reads (default: _1)
  --suffix2         STRING      Suffix of second file in paired reads (default: _2)
  --concatenate     BOOLEAN     Whether to concatenate input files when multiple files
                                per sample id are found (e.g., files from different
                                lanes)
  --prefix          STRING      Regular expression used to identify groups of multiple
                                files to concatenate (e.g., --prefix LANE(\d+)_)
  --species         STRING      Species of the samples (e.g., --species hsapiens).
                                This parameter is used to create the output sub-folders
                                and to download genome/transcriptome data (if required)
  --refdir          DIRPATH     Folder with reference transcriptome and (optional) genome
  --decoys          [FILEPATH]  File containing a set of decoy sequences. If the parameter is
                                provided without value (i.e., --decoys), a set of decoys
                                is attempted to be computed from the transcriptome and genome
                                files
  --genome          FILEPATH    FASTA file containing the genome (can be a gzip file)
  --gtf             FILEPATH    Gene Transfer Format file (can be used to generate a genemap)
  --genemap         [FILEPATH]  File containing a mapping of transcripts to genes. If the
                                parameter is provided without a value (i.e., --genemap),
                                and a GTF file is provided in input, a mapping is attempted
  --salmon_libtype  STRING      Library type, used for salmon quantification (default: 'A')
  --multiqc_config  FILEPATH    Config yaml file for MultiQC
  --outdir          DIRPATH     Output directory (default: ./results)
  --cachedir        DIRPATH     Provide a centralised cache directory for containers (default: ./work)
  --verbose                     Whether to report extra information on progress
  --help                        Print this usage statement
  --max_cpus        STRING      Maximum amount of allowed cpus (default: 7)
  --max_memory      STRING      Maximum amount of allowed memory (default: '30.GB')
  --max_time        STRING      Maximum amount of execution time (default: '48.h')

Nextflow has different configuration sources:
- Parameters specified on the command line or using a file
- Configuration files (Nextflow looks for configuration files in multiple locations)
- Values defined within the pipeline script itself
The sources are ranked in order to decide which settings to apply and avoid conflicts (see Nextflow's documentation for further information).
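As an illustration of that ranking, parameters can be kept in a file and passed with Nextflow's -params-file option, while a value given on the command line takes precedence over the same key in the file. The sketch below is a minimal example under stated assumptions: all paths are placeholders, params.yaml is a hypothetical file name, write_params is a hypothetical helper, and the keys simply mirror the --input, --transcriptome, --salmon_index and --outdir options from the usage statement.

```shell
# Sketch only: write a YAML parameters file (placeholder paths throughout).
write_params() {
  cat > "$1" <<'EOF'
input: /path/to/samples
transcriptome: /path/to/transcriptome
salmon_index: /path/to/index
outdir: ./results
EOF
}

write_params params.yaml

# --outdir on the command line outranks the outdir key in params.yaml,
# so this run would write to ./results_test rather than ./results.
# The command is only printed here; drop the leading echo to launch it.
echo nextflow run nf-rnaseq \
  -params-file params.yaml \
  --outdir ./results_test \
  -profile docker
```

Keeping the stable settings in the file and overriding single values on the command line makes it easy to repeat a run with one change.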
nf-rnaseq keeps its main configuration file (nextflow.config) in the workflow project directory. Other configuration files are located in the /config directory: for example, /config/modules.config contains DSL2 per-module options and publishing paths, while /config/resources.config contains resource settings (a file adapted from the base.config of the nf-core rnaseq pipeline).
A container can be seen as a minimal virtual environment or, in simpler words, as a software package containing all the tools needed for a specific task, such as the processing of raw RNA-seq data. The main advantage of containerized software is that the same analysis can be run on different machines with identical versions of the computational tools, which improves reproducibility. Containers can run on any platform that supports a container runtime.
Being written in the Nextflow DSL2 language, this pipeline supports the definition of one container per process, thus making it easy to maintain and update software dependencies. The current default configuration includes profiles (i.e., sets of configuration attributes) for Conda environments and for Docker and Singularity containers. A profile can be activated when launching the pipeline by using the -profile command line option.
nextflow run nf-rnaseq --input <DIR> --outdir <DIR> --salmon_index <DIR> --transcriptome <FILE> -profile <conda/docker/singularity>

The default parameters of the pipeline and the defined profiles can be changed by modifying the /nextflow.config configuration file.
To modify the default configuration of a specific module, you need to identify the module in the /config/modules.config configuration file (if missing, create a new entry in the process configuration scope) and then simply add/change the container definition. For example, you can change the container for the FASTQC module as reported below.
For Conda, first check on Conda the available versions of FastQC. For example, let's select version 0.11.9. Then, you can modify the configuration accordingly:
process {
    withName: FASTQC {
        conda = 'bioconda::fastqc=0.11.9'
    }
}

For Docker, first check on Quay.io the available versions of FastQC. For example, let's select version 0.11.9. Then, you can modify the configuration accordingly:
process {
    withName: FASTQC {
        container = 'quay.io/biocontainers/fastqc:0.11.9--hdfd78af_1'
    }
}

For Singularity, first check on the Galaxy project website the available versions of FastQC. The Galaxy project provides all bioinformatics software from the BioContainers initiative as prebuilt Singularity images. For example, let's select version 0.11.9. Then, you can modify the configuration accordingly:
process {
    withName: FASTQC {
        container = 'https://depot.galaxyproject.org/singularity/fastqc:0.11.9--hdfd78af_1'
    }
}

The maximum amount of available memory, CPUs, or allowed execution time can be modified by changing the related parameters (max_memory, max_cpus, and max_time, respectively) in the /nextflow.config configuration file. Process-specific resource requirements can be changed by modifying the /config/resources.config configuration file.
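As an alternative to editing the files in the repository, the same limits can be supplied at launch time. The sketch below is a minimal example under stated assumptions: the ceilings are passed through the --max_cpus, --max_memory and --max_time options listed in the usage statement, and a process-specific request is placed in an extra configuration file passed with Nextflow's -c option. custom.config is a hypothetical file name, write_custom_config is a hypothetical helper, the values are examples only, and whether the per-process request is honoured as written depends on how the pipeline's own resource settings are defined.

```shell
# Sketch only: write an extra configuration file for one process.
write_custom_config() {
  cat > "$1" <<'EOF'
// Example values only: request fewer resources for the FASTQC process
process {
    withName: FASTQC {
        cpus = 2
        memory = '4.GB'
    }
}
EOF
}

write_custom_config custom.config

# Global ceilings go on the command line; -c adds the extra configuration file.
# The command is only printed here; drop the leading echo to launch it.
echo nextflow run nf-rnaseq \
  --input /path/to/samples \
  --transcriptome /path/to/transcriptome \
  --salmon_index /path/to/index \
  --outdir ./results \
  --max_cpus 4 --max_memory 16.GB --max_time 24.h \
  -c custom.config \
  -profile docker
```

This keeps local resource tweaks out of the cloned repository, so pulling a newer version of the pipeline does not overwrite them.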
The design and implementation of this pipeline were based on the nf-core rnaseq pipeline developed by the nf-core project.
The pipeline was written and is maintained by Alessandro Barberis (@alebarberis).
Firstly, I would like to express my sincere gratitude to Prostate Cancer UK for their generous funding, which made it possible for me to develop the first version of this pipeline. Secondly, I would like to acknowledge and thank Professor Valentine Macaulay (head of the IGF Group) and Professor Francesca Buffa (head of the Computational Biology and Integrative Genomics group) for the invaluable discussions and unwavering support provided throughout the project. And finally, I would like to thank the Nuffield Department of Surgical Sciences for their consistent assistance.