Important
You may use FFPErase, the underlying content, and any output therefrom for personal use, academic research and noncommercial purposes only. See LICENSE for more details.
Tool for pre-processing and classifying FFPE artifact mutations, using Nextflow.
Models for SNV and indel artifacts are available at: https://huggingface.co/papaemmelab/ffperase
If --model is not provided, the pipeline downloads the corresponding model for the mutation type from Hugging Face 🤗.
You need Nextflow installed.
nextflow run papaemmelab/nf-ffperase --help
Give it a try with a test run if you have Docker available:
nextflow run papaemmelab/nf-ffperase -r main -profile test,cloud
Default value: --step full. It runs both the preprocess and classify steps.
See this example:
nextflow run papaemmelab/nf-ffperase \
-r main \
--step full \
--vcf {snvs.vcf} \
--bam {tumor.bam} \
--reference {grch37.fasta} \
--bed {grch37.genome.bed} \
--outdir {results} \
--coverage {100} \
--medianInsert {250} \
--model {trained_models/snvs.pkl} \
--modelName {name} \
--mutationType {snvs or indels}
nf-ffperase has two steps to classify variants: preprocess and classify.
- ✏️ preprocess takes a VCF, a BAM, the median coverage and a reference FASTA as input and annotates mutations for classification. This step uses hileup and GATK's Picard to calculate the necessary metrics.
- 🔮 classify takes preprocessed mutations and a model as input and generates a boolean classification for each mutation as artifact or real. [True: Artifact, False: Real]
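As a rough illustration of how the boolean output could be summarized, the sketch below tallies artifact and real calls from a tab-separated table. The column name `is_artifact` and the example rows are hypothetical; the actual output layout of the classify step is not described here, so adjust the column name to whatever the pipeline produces.

```python
import csv
import io
from collections import Counter


def tally_calls(tsv_text, call_col="is_artifact"):
    """Count artifact vs. real calls in a tab-separated table.

    `call_col` is a hypothetical column name holding True/False values,
    following the README convention True = artifact, False = real.
    """
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    counts = Counter()
    for row in reader:
        value = row[call_col].strip().lower()
        if value == "true":
            counts["artifact"] += 1
        elif value == "false":
            counts["real"] += 1
        else:
            raise ValueError(f"unexpected classification value: {row[call_col]!r}")
    return dict(counts)


# Made-up example rows, for illustration only.
example = "chrom\tpos\tis_artifact\n1\t100\tTrue\n1\t200\tFalse\n2\t300\tTrue\n"
print(tally_calls(example))
```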
--step preprocess runs the following processes:
- Pileup: piles up reads at each mutation to calculate Variant Allele Frequency (VAF).
- Picard: runs its CollectSequencingArtifactMetrics command to calculate estimated error rates at the base-change and trinucleotide levels. More details on this calculation can be found in the Picard documentation for that command. The user can either run this during preprocess or pass in a directory containing the following output files, which are used to compute the necessary features:
  - *.bait_bias_detail_metrics
  - *.pre_adapter_detail_metrics
- Annotation: uses the pileup and Picard outputs to estimate the following features:
  - Variant Allele Frequency (VAF)
  - Average Base Quality (AVG_BQ)
  - Average Mapping Quality (AVG_MQ)
  - Number of Variant Reads
  - Number of distinct Variant Alleles (>= 2% VAF)
  - Strand Bias Fisher Score
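To make two of these features concrete, here is a minimal sketch of how a VAF and a Fisher-based strand bias score could be computed from read counts. This is not the pipeline's implementation: the function names are made up for illustration, and the Phred scaling of the Fisher p-value is an assumption, since the exact definition used by nf-ffperase is not stated here.

```python
from math import comb, log10


def vaf(variant_reads, total_reads):
    """Variant allele frequency: fraction of reads supporting the variant."""
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    return variant_reads / total_reads


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of seeing x in the top-left cell.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    observed = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum all tables that are at most as likely as the observed one
    # (small tolerance so ties are not lost to floating-point error).
    p = sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= observed * (1 + 1e-9))
    return min(1.0, p)


def strand_bias_phred(ref_fwd, ref_rev, alt_fwd, alt_rev):
    """Phred-scaled Fisher p-value (assumed scaling); higher means more bias."""
    p = fisher_exact_two_sided(ref_fwd, ref_rev, alt_fwd, alt_rev)
    return max(0.0, -10 * log10(max(p, 1e-300)))


print(vaf(5, 100))
print(strand_bias_phred(50, 50, 5, 5))   # variant reads balanced across strands
print(strand_bias_phred(50, 50, 10, 0))  # variant reads on one strand only
```

A variant whose supporting reads all come from one strand gets a higher score than one with balanced support, which is the kind of signal a strand bias feature is meant to capture.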
nextflow run papaemmelab/nf-ffperase \
-r main \
--step preprocess \
--vcf {snvs.vcf} \
--bam {tumor.bam} \
--reference {grch37.fasta} \
--bed {grch37.genome.bed} \
--outdir {results} \
--coverage {100} \
--medianInsert {250}
Output is the features file, located at: {outdir}/preprocess/features.tsv.
The --splitPileup option sets the number of mutations included in each pileup split (default: 1000). The --splitReads option sets the number of reads included in each Picard split (default: 7,500,000). If resources are available, decreasing these values increases the number of split jobs, which can speed up the pileup and Picard processes. Changing them also affects how much memory each job requires, so the Nextflow config may need updating.
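The effect of these options on the number of jobs can be estimated with simple arithmetic. The sketch below assumes the inputs are divided into fixed-size chunks, which is an assumption about the pipeline's behavior; the helper name and the example totals are hypothetical.

```python
def n_splits(total_items, split_size):
    """Number of fixed-size chunks needed to cover total_items (at least 1)."""
    if split_size <= 0:
        raise ValueError("split_size must be positive")
    # Ceiling division using integers only.
    return max(1, -(-total_items // split_size))


# Hypothetical sample: 25,000 mutations and 600,000,000 reads.
print(n_splits(25_000, 1000))            # pileup jobs with the default --splitPileup
print(n_splits(25_000, 500))             # halving the split size doubles the job count
print(n_splits(600_000_000, 7_500_000))  # Picard jobs with the default --splitReads
```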
--step classify takes a model type and the corresponding model as input and classifies preprocessed mutations based on their likelihood of being artifactual. The input features should come directly from the preprocess step, located in the output directory: {outdir}/preprocess/features.tsv.
See this example:
nextflow run papaemmelab/nf-ffperase \
-r main \
--step classify \
--features {results/preprocess/features.tsv} \
--outdir {results} \
--model {trained_models/snvs.pkl} \
--modelName {name}
--step train takes preprocessed mutations with a boolean label column (0: real, 1: artifact), a model name, a mutation type, and an optional pretrained model as input, and trains a new classifier.
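Before training, it can be useful to confirm that the label column contains only 0 and 1. The following is a minimal sketch of such a check, not part of the pipeline; the function name, the `label` column name, and the example rows are hypothetical.

```python
import csv
import io


def check_labels(tsv_text, label_col):
    """Return counts of 0 (real) and 1 (artifact) labels, or raise on bad input."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    if label_col not in (reader.fieldnames or []):
        raise KeyError(f"label column {label_col!r} not found")
    counts = {0: 0, 1: 0}
    # Data rows start on line 2 because line 1 is the header.
    for line_no, row in enumerate(reader, start=2):
        value = row[label_col].strip()
        if value not in ("0", "1"):
            raise ValueError(f"line {line_no}: label must be 0 or 1, got {value!r}")
        counts[int(value)] += 1
    return counts


# Made-up feature rows with a hypothetical `label` column.
example = "VAF\tAVG_BQ\tAVG_MQ\tlabel\n0.05\t32.1\t60\t1\n0.40\t35.8\t60\t0\n0.03\t30.2\t58\t1\n"
print(check_labels(example, "label"))
```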
See this example:
nextflow run papaemmelab/nf-ffperase \
-r main \
--step train \
--features {results/preprocess/features.tsv} \
--labelCol {column name} \
--modelName {name} \
--outdir {results} \
--mutationType {snvs or indels} \
--modelPath {trained_models/snvs.pkl} (optional)
Contributions are welcome and greatly appreciated. Check our contributing guidelines!