papaemmelab/nf-ffperase

🧽 nf-ffperase

Important

You may use FFPErase, the underlying content, and any output therefrom for personal use, academic research and noncommercial purposes only. See LICENSE for more details.

Tool for pre-processing and classifying FFPE artifact mutations, using nextflow.

🤖 Trained Models

Models for SNV and Indels artifacts are available at: https://huggingface.co/papaemmelab/ffperase

If --model is not provided, the pipeline downloads the model matching the mutation type from Hugging Face 🤗.

🚀 Run Pipeline

You need Nextflow installed.

nextflow run papaemmelab/nf-ffperase --help

Give it a try with a test run if you have docker available:

nextflow run papaemmelab/nf-ffperase -r main -profile test,cloud

1. ⚡️ Full pipeline

Default: --step full, which runs both the preprocess and classify steps.

See this example:

nextflow run papaemmelab/nf-ffperase \
    -r main \
    --step full \
    --vcf {snvs.vcf} \
    --bam {tumor.bam} \
    --reference {grch37.fasta} \
    --bed {grch37.genome.bed} \
    --outdir {results} \
    --coverage {100} \
    --medianInsert {250} \
    --model {trained_models/snvs.pkl} \
    --modelName {name} \
    --mutationType {snvs or indels}
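
Before launching a long run, it can help to confirm that every input path exists. The following is a minimal sketch and not part of nf-ffperase; the function name `check_inputs` and the file names in the usage comment are hypothetical.

```shell
# Hypothetical pre-flight helper (not part of nf-ffperase):
# returns 0 if every path given is an existing regular file, 1 otherwise.
check_inputs() {
    rc=0
    for path in "$@"; do
        if [ ! -f "$path" ]; then
            echo "missing input: $path" >&2
            rc=1
        fi
    done
    return "$rc"
}

# Example usage (file names are placeholders):
# check_inputs snvs.vcf tumor.bam grch37.fasta grch37.genome.bed \
#     && nextflow run papaemmelab/nf-ffperase -r main --step full ...
```

Chaining with `&&` means the pipeline only starts when all listed files are present.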

nf-ffperase classifies variants in two steps, preprocess and classify:

  1. ✏️ preprocess takes a VCF, a BAM, the median coverage and a reference fasta as input and annotates mutations for classification. This step uses pileup and GATK's Picard to calculate the necessary metrics.

  2. 🔮 classify takes preprocessed mutations and a model as input and generates a boolean classification, artifact or real, for each mutation. [True: Artifact, False: Real]

2. ✏️ Preprocessing Variants

--step preprocess runs the following processes:

  • Pileup: piles up reads at the mutation sites to calculate the Variant Allele Frequency (VAF).
  • Picard: runs its CollectSequencingArtifactMetrics command to estimate error rates at the base-change and trinucleotide levels. More details on this calculation can be found in the Picard documentation. You can run this during preprocess, or instead pass in a directory containing the following output files, which are used to compute the necessary features:
    • *.bait_bias_detail_metrics
    • *.pre_adapter_detail_metrics
  • Annotation: uses the pileup and Picard outputs to estimate the following features:
    • Variant Allele Frequency (VAF)
    • Average Base Quality (AVG_BQ)
    • Average Mapping Quality (AVG_MQ)
    • Number of Variant Reads
    • Number of distinct Variant Alleles (>= 2% VAF)
    • Strand Bias Fisher Score
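
For orientation, VAF is conventionally the fraction of reads at a site that support the variant allele, and the distinct-allele count tallies the alleles whose VAF reaches 2%. The sketch below illustrates both with awk. The function names and read counts are made up for illustration, and the pipeline's own annotation code may differ in details such as read filtering.

```shell
# Illustrative only, not the pipeline's code.
# VAF = variant-supporting reads / total reads at the site.
# Usage: vaf ALT_READS TOTAL_DEPTH
vaf() {
    awk -v alt="$1" -v depth="$2" 'BEGIN {
        if (depth <= 0) { print "NA"; exit }   # avoid division by zero
        printf "%.4f\n", alt / depth
    }'
}

# Count the alleles whose read count is at least 2% of the depth.
# Usage: n_variant_alleles TOTAL_DEPTH COUNT [COUNT...]
n_variant_alleles() {
    depth=$1
    shift
    printf '%s\n' "$@" |
        awk -v depth="$depth" 'depth > 0 && $1 * 100 >= 2 * depth { n++ }
                               END { print n + 0 }'
}

vaf 12 100                  # 12 variant reads out of 100 → 0.1200
n_variant_alleles 100 12 1 3   # alleles with 12 and 3 reads pass → 2
```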

Example

nextflow run papaemmelab/nf-ffperase \
    -r main \
    --step preprocess \
    --vcf {snvs.vcf} \
    --bam {tumor.bam} \
    --reference {grch37.fasta} \
    --bed {grch37.genome.bed} \
    --outdir {results} \
    --coverage {100} \
    --medianInsert {250}

The output features are written to {outdir}/preprocess/features.tsv.
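
To see which columns the features file contains, one option is to list its header fields with their positions. This is a small sketch, assuming the file is tab-separated with a single header line; the helper name `list_columns` is hypothetical.

```shell
# List the column names of a tab-separated file with their 1-based index.
# Assumes the first line of the file is the header.
# Usage: list_columns FILE
list_columns() {
    head -n 1 "$1" | tr '\t' '\n' | awk '{ printf "%d\t%s\n", NR, $0 }'
}

# Example usage:
# list_columns results/preprocess/features.tsv
```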

⚡️ Optional Speed Improvements

--splitPileup sets the number of mutations included in each pileup split (default: 1000). --splitReads sets the number of reads included in each Picard split (default: 7,500,000). If resources are available, decreasing these values increases the number of split jobs, so the pileup and Picard processes run with more parallelism. Changing them also affects how much memory each job requires, so the Nextflow config may need updating.
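
As a rough guide to how these options scale, the number of split jobs can be estimated by ceiling division. This sketch assumes the pipeline creates one job per chunk and rounds up; the helper name `n_splits` and the mutation and read counts are illustrative.

```shell
# Ceiling division: number of chunks of size SPLIT needed to cover TOTAL items.
# Usage: n_splits TOTAL SPLIT
n_splits() {
    total=$1
    split=$2
    echo $(( (total + split - 1) / split ))
}

n_splits 2500 1000          # 2,500 mutations, default --splitPileup → 3
n_splits 2500 500           # halving the split size → 5
n_splits 30000000 7500000   # 30M reads, default --splitReads → 4
```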

3. 🔮 Classifying Artifacts

--step classify takes a model type and the corresponding model as input and classifies preprocessed mutations based on their likelihood of being artifactual. The input features should come directly from the preprocess step, located in its output directory at {outdir}/preprocess/features.tsv.

See this example:

nextflow run papaemmelab/nf-ffperase \
    -r main \
    --step classify \
    --features {results/preprocess/features.tsv} \
    --outdir {results} \
    --model {trained_models/snvs.pkl} \
    --modelName {name}

4. 🧠 Training/Retraining

--step train trains a new classifier. It takes preprocessed mutations that include a boolean label column (0: real, 1: artifact), a model name, a mutation type, and optionally a pretrained model to retrain.

See this example:

nextflow run papaemmelab/nf-ffperase \
    -r main \
    --step train \
    --features {results/preprocess/features.tsv} \
    --labelCol {column name} \
    --modelName {name} \
    --outdir {results} \
    --mutationType {snvs or indels} \
    --modelPath {trained_models/snvs.pkl}  # optional: pretrained model to retrain
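
Since training expects a 0/1 label column, a quick sanity check on the features file before starting can catch labeling mistakes early. The sketch below is not part of nf-ffperase; it assumes a tab-separated file with a header line, and the function name `check_labels` is hypothetical.

```shell
# Hypothetical check (not part of nf-ffperase): verify that the column named
# LABEL exists in a tab-separated FILE and contains only 0 or 1.
# Exits 0 when all labels are valid, 1 otherwise.
# Usage: check_labels FILE LABEL
check_labels() {
    awk -F '\t' -v label="$2" '
        NR == 1 {
            # Locate the label column by name in the header line.
            for (i = 1; i <= NF; i++) if ($i == label) col = i
            if (!col) {
                print "column not found: " label > "/dev/stderr"
                bad = 1
                exit
            }
            next
        }
        $col != "0" && $col != "1" {
            print "line " NR ": unexpected label \"" $col "\"" > "/dev/stderr"
            bad = 1
        }
        END { exit bad + 0 }
    ' "$1"
}

# Example usage (column name is a placeholder):
# check_labels results/preprocess/features.tsv is_artifact
```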

Contributing

Contributions are welcome and greatly appreciated. Check our contributing guidelines!

