scbirlab/nf-ont-call-variants is a Nextflow pipeline to call variants from Nanopore FASTQ files from bacterial clones relative to a wildtype control.
The pipeline broadly recapitulates, where possible, the GATK best practices for germline short variant calling, with some changes for bacterial genomes and long-read sequencing.
For each sample:
- Quality-trim reads using cutadapt.
- Map to genome FASTA using minimap2.
- Call variants with Clair3.
Then merge the resulting GVCFs using GATK CombineGVCFs. With the combined variant calls:
- Annotate variant effects using snpEff.
- Filter out variants that are identical across all samples (this is why it is important to include a wild-type control).
- Write to output TSV.
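The filtering step above can be illustrated with a small shell sketch. It is not the pipeline's actual implementation: the table layout (four variant columns followed by one genotype column per sample), the sample names, and the file names are assumptions made for illustration. A row is dropped when every sample carries the same genotype, which is why a wild-type control is needed to reveal clone-specific variants.

```bash
# Hypothetical genotype table (not the pipeline's real output format):
# columns 1-4 describe the variant, columns 5+ hold one genotype per sample.
workdir=$(mktemp -d)
{
    printf 'chrom\tpos\tref\talt\tmut1\tmut2\twt\n'
    printf 'chr1\t100\tA\tG\t1/1\t1/1\t1/1\n'   # identical in all samples: dropped
    printf 'chr1\t250\tC\tT\t1/1\t0/0\t0/0\n'   # only mut1 differs: kept
    printf 'chr1\t900\tG\tA\t0/0\t1/1\t0/0\n'   # only mut2 differs: kept
} > "$workdir/variants.tsv"

# Keep the header, then keep rows where any sample differs from the first sample.
awk -F'\t' '
    NR == 1 { print; next }
    {
        differs = 0
        for (i = 6; i <= NF; i++) if ($i != $5) differs = 1
        if (differs) print
    }
' "$workdir/variants.tsv" > "$workdir/variants.filtered.tsv"

cat "$workdir/variants.filtered.tsv"
```

In this toy table, only the two variants that distinguish a clone from the wild type survive the filter.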
Other steps:
- Get FASTQ quality metrics with fastqc.
- Generate alignment statistics and plots with samtools stats and mosdepth.
- Also map to the genome FASTA using bowtie2, because minimap2 logs are not compatible with multiqc. This makes at least some alignment metrics available in the report.
- Compile the logs of processing steps into an HTML report with multiqc.
You need to have Nextflow and either Conda, Singularity, or Docker installed on your system.
If you're at the Crick or your shared cluster has it already installed, try:
module load Nextflow Singularity
Otherwise, if it's your first time using Nextflow on your system and you have Conda installed, you can install it using conda:
conda install -c bioconda nextflow
You may need to set the NXF_HOME environment variable. For example,
mkdir -p ~/.nextflow
export NXF_HOME=~/.nextflow
To make this a permanent change, you can do something like the following:
mkdir -p ~/.nextflow
echo "export NXF_HOME=~/.nextflow" >> ~/.bash_profile
source ~/.bash_profile
Make a sample sheet (see below) and, optionally,
a nextflow.config file in the directory where you want the
pipeline to run. Then run Nextflow.
nextflow run scbirlab/nf-ont-call-variants
Each time you run the pipeline after the first time, Nextflow will use a
locally-cached version which will not be automatically updated. If you want
to ensure that you're using the very latest version of the pipeline, use
the -latest flag.
nextflow run scbirlab/nf-ont-call-variants -latest
If you want to run a particular tagged version of the pipeline, such as v0.0.2, you can do so using
nextflow run scbirlab/nf-ont-call-variants -r v0.0.2
For help, use nextflow run scbirlab/nf-ont-call-variants --help.
The first time you run the pipeline for a project, the software dependencies
in environment.yml will be installed. This may take several minutes.
The following parameters are required:
- sample_sheet: path to a CSV with information about the samples and FASTQ files to be processed
The following parameters have default values which can be overridden if necessary.
- inputs = "inputs": The folder containing your inputs (i.e. sequencing reads). It's likely you'll want to change this one.
- trim_qual = 10: For cutadapt, the minimum Phred score for trimming 3' calls
- min_length = 10: For cutadapt, the minimum trimmed length of a read. Shorter reads will be discarded
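To make the effect of min_length concrete, here is a minimal awk sketch that drops FASTQ records shorter than a threshold. It only mimics the length filter: it is not how the pipeline invokes cutadapt, and it does no quality trimming. The read names and sequences are made up for illustration.

```bash
min_length=10   # same value as the pipeline default

# Two made-up reads: read1 has 12 bases, read2 has 4.
fastq=$(mktemp)
printf '@read1\nACGTACGTACGT\n+\nIIIIIIIIIIII\n' >  "$fastq"
printf '@read2\nACGT\n+\nIIII\n'                 >> "$fastq"

# A FASTQ record is 4 lines: header, sequence, separator, qualities.
# Print the record only if its sequence is at least min_length bases long.
kept=$(awk -v min_len="$min_length" '
    NR % 4 == 1 { header = $0 }
    NR % 4 == 2 { seq = $0 }
    NR % 4 == 3 { sep = $0 }
    NR % 4 == 0 { if (length(seq) >= min_len) print header "\n" seq "\n" sep "\n" $0 }
' "$fastq")

printf '%s\n' "$kept"
```

Here read2 falls below the threshold and is discarded, while read1 is kept unchanged.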
The following options do not need to be changed, but can be overridden if you decide you need to:
- gatk_image = "docker://broadinstitute/gatk:latest": Which GATK4 image to use
- snpeff_url = "https://snpeff.blob.core.windows.net/versions/snpEff_latest_core.zip": Where to download snpEff from
- clair3_image = "docker://hkubal/clair3:latest": Which Clair3 image to use
- rerio_url = "https://github.com/nanoporetech/rerio.git": Where to find the Rerio repository
- clair3_model = "r1041_e82_400bps_sup_v500": Which basecalling model to use with Clair3
The parameters can be provided either in the nextflow.config file or on the nextflow run command.
Here is an example of the nextflow.config file:
params {
    sample_sheet = "/path/to/sample-sheet.csv"
    inputs = "/path/to/inputs"
}
Alternatively, you can provide the parameters on the command line:
nextflow run scbirlab/nf-ont-call-variants \
    --sample_sheet /path/to/sample-sheet.csv \
    --inputs /path/to/inputs
The sample sheet is a CSV file providing information about which FASTQ files belong to which sample.
The file must have a header with the column names below, and one line per sample to be processed.
- sample_id: the unique name of the sample. The wildtype must be named so that it is alphabetically last
- reads: path (relative to the inputs option above) to compressed FASTQ files derived from Nanopore sequencing
- genome_accession: NCBI genome accession number of the reference, starting with "GCF_" or "GCA_". You can look it up on the NCBI website.
You can also add extra columns for annotation, e.g. strain_name, for ease of reference later.
Here is an example of the sample sheet:
| sample_id | reads | genome_accession |
|---|---|---|
| wt | raw_reads_wt_*.fastq.gz | GCF_000015005.1 |
| mut1 | raw_reads_mut_*.fastq.gz | GCF_000015005.1 |
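As a sketch, the example above could be written to a CSV file and sanity-checked from the shell before launching the pipeline. These checks are not part of the pipeline. The temporary file location is arbitrary, and treating "alphabetically last" as plain byte-order sorting (LC_ALL=C) is an assumption, not documented behaviour.

```bash
sheet_dir=$(mktemp -d)
sheet="$sheet_dir/sample-sheet.csv"

# Same rows as the example table, as comma-separated values.
cat > "$sheet" <<'EOF'
sample_id,reads,genome_accession
wt,raw_reads_wt_*.fastq.gz,GCF_000015005.1
mut1,raw_reads_mut_*.fastq.gz,GCF_000015005.1
EOF

# Check 1: the header names the three required columns.
header=$(head -n 1 "$sheet")

# Check 2: which sample_id sorts last? It should be the wildtype.
last_id=$(tail -n +2 "$sheet" | cut -d, -f1 | LC_ALL=C sort | tail -n 1)

# Check 3: list accessions that do not start with GCF_ or GCA_ (empty if all are fine).
bad_accessions=$(tail -n +2 "$sheet" | cut -d, -f3 | grep -Ev '^GC[FA]_' || true)

echo "header: $header"
echo "last sample_id: $last_id"
echo "bad accessions: ${bad_accessions:-none}"
```

If the last sample_id reported is not your wildtype, rename the samples before running the pipeline.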
Outputs are saved in the same directory as sample_sheet. They are organised under three directories:
- processed: FASTQ files and logs resulting from alignments
- tables: tables, plots, and VCF files corresponding to variant calls
- multiqc: HTML report on processing steps
If you run into problems not covered here, please add an issue to the issue tracker.
Here are the help pages of the software used by this pipeline.