🧽 nf-ffperase

Important

You may use FFPErase, the underlying content, and any output therefrom for personal use, academic research and noncommercial purposes only. See LICENSE for more details.

Tool for preprocessing and classifying FFPE artifact mutations, built with Nextflow.

🤖 Trained Models

Models for SNV and indel artifacts are available at: https://huggingface.co/papaemmelab/ffperase

If --model is not provided, the pipeline downloads the model matching the mutation type from Hugging Face 🤗.

🚀 Run Pipeline

You need Nextflow installed.

nextflow run papaemmelab/nf-ffperase --help

Give it a try with a test run if you have Docker available:

nextflow run papaemmelab/nf-ffperase -r main -profile test,cloud

1. ⚡️ Full pipeline

--step full is the default. It runs both the preprocess and classify steps.

See this example:

nextflow run papaemmelab/nf-ffperase \
    -r main \
    --step full \
    --vcf {snvs.vcf} \
    --bam {tumor.bam} \
    --reference {grch37.fasta} \
    --bed {grch37.genome.bed} \
    --outdir {results} \
    --coverage {100} \
    --medianInsert {250} \
    --model {trained_models/snvs.pkl} \
    --modelName {name} \
    --mutationType {snvs or indels}

nf-ffperase classifies variants in two steps, preprocess and classify:

✏️ preprocess takes a VCF, a BAM, the median coverage and a reference FASTA as input and annotates mutations for classification. This step uses pileup and GATK's Picard to calculate the necessary metrics.
🔮 classify takes preprocessed mutations and a model as input and generates a boolean classification, artifact or real, for each mutation. [True: Artifact, False: Real]

2. ✏️ Preprocessing Variants

--step preprocess runs the following processes:

Pileup: piles up reads at each mutation to calculate Variant Allele Frequency (VAF).
Picard: runs the CollectSequencingArtifactMetrics command to estimate error rates at the base-change and trinucleotide levels. More details on this calculation can be found in the Picard documentation for that command. You can run this during preprocess, or instead pass in a directory containing the following output files, which are used to compute the necessary features:
- *.bait_bias_detail_metrics
- *.pre_adapter_detail_metrics
Annotation: uses the pileup and Picard output to estimate the following features:
- Variant Allele Frequency (VAF)
- Average Base Quality (AVG_BQ)
- Average Mapping Quality (AVG_MQ)
- Number of Variant Reads
- Number of distinct Variant Alleles (>= 2% VAF)
- Strand Bias Fisher Score
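
As a rough illustration of two of the features listed above, the sketch below computes a VAF and a strand-bias score from read counts. The function names, the per-strand count inputs and the Phred scaling of the Fisher p-value (borrowed from the convention of GATK's FS annotation) are assumptions for illustration; the pipeline's own annotation code may compute these differently.

```python
from math import comb, log10


def vaf(alt_reads: int, total_reads: int) -> float:
    """Fraction of reads at a site that support the variant allele."""
    return alt_reads / total_reads if total_reads else 0.0


def fisher_exact_two_sided(a: int, b: int, c: int, d: int) -> float:
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Hypergeometric probability of every table sharing the observed margins.
    probs = [comb(row1, x) * comb(row2, col1 - x) / denom for x in range(lo, hi + 1)]
    observed = probs[a - lo]
    # Sum the tables that are at most as likely as the observed one.
    p = sum(q for q in probs if q <= observed * (1 + 1e-7))
    return min(p, 1.0)


def strand_bias_score(ref_fwd: int, ref_rev: int, alt_fwd: int, alt_rev: int) -> float:
    """Phred-scaled Fisher p-value (assumed scaling): higher means more strand bias."""
    p = fisher_exact_two_sided(ref_fwd, ref_rev, alt_fwd, alt_rev)
    return -10 * log10(max(p, 1e-300))
```

With balanced counts such as strand_bias_score(10, 10, 10, 10) the score is essentially zero, while variant reads seen on one strand only push it up.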

Example

nextflow run papaemmelab/nf-ffperase \
    -r main \
    --step preprocess \
    --vcf {snvs.vcf} \
    --bam {tumor.bam} \
    --reference {grch37.fasta} \
    --bed {grch37.genome.bed} \
    --outdir {results} \
    --coverage {100} \
    --medianInsert {250}

The output is a table of features, written to {outdir}/preprocess/features.tsv.
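
For a quick sanity check of the feature table before classifying, something like the following could be used. It assumes only that features.tsv is tab-separated with a header row; the helper name summarize_features is hypothetical and not part of the pipeline.

```python
import csv
from pathlib import Path


def summarize_features(path):
    """Return (column names, number of data rows) for a tab-separated file."""
    with Path(path).open(newline="") as handle:
        reader = csv.DictReader(handle, delimiter="\t")
        n_rows = sum(1 for _ in reader)
        # fieldnames is None for an empty file, so fall back to an empty list.
        columns = list(reader.fieldnames or [])
    return columns, n_rows
```

Calling summarize_features("results/preprocess/features.tsv") returns the header columns and the number of annotated mutations, which should match the number of variants you expected from the input VCF.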

⚡️ Optional Speed Improvements

--splitPileup sets the number of mutations included in each pileup split (default: 1000). --splitReads sets the number of reads included in each Picard split (default: 7,500,000). If resources are available, decreasing these values increases the number of split jobs, which can speed up the pileup and Picard processes. Changing them also affects how much memory each job needs, so the Nextflow config may need updating.
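
To get a feel for how these options translate into job counts, the arithmetic can be sketched as a ceiling division. This is illustrative only and not pipeline code: the function name and the example counts are made up, and it assumes each split becomes one job.

```python
DEFAULT_SPLIT_PILEUP = 1000       # mutations per pileup split (documented default)
DEFAULT_SPLIT_READS = 7_500_000   # reads per Picard split (documented default)


def n_splits(total_items: int, per_split: int) -> int:
    """Number of chunks needed to cover total_items with per_split items each."""
    if per_split <= 0:
        raise ValueError("per_split must be positive")
    # Integer ceiling division, avoiding float rounding on large read counts.
    return -(-total_items // per_split)


# Hypothetical sample: 25,000 mutations and 600 million reads.
pileup_jobs = n_splits(25_000, DEFAULT_SPLIT_PILEUP)
picard_jobs = n_splits(600_000_000, DEFAULT_SPLIT_READS)
```

Halving --splitPileup to 500 would double the pileup job count for the same sample, with each job handling fewer mutations.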

3. 🔮 Classifying Artifacts

--step classify takes a model type and the corresponding model, and classifies preprocessed mutations by their likelihood of being artifactual. The input features should come directly from the preprocess step, located in its output directory: {outdir}/preprocess/features.tsv.

See this example:

nextflow run papaemmelab/nf-ffperase \
    -r main \
    --step classify \
    --features {results/preprocess/features.tsv} \
    --outdir {results} \
    --model {trained_models/snvs.pkl} \
    --modelName {name}

4. 🧠 Training/Retraining

--step train trains a new classifier. It takes preprocessed mutations with a boolean label column (0: real, 1: artifact), a model name, a mutation type, and an optional pretrained model.

See this example:

nextflow run papaemmelab/nf-ffperase \
    -r main \
    --step train \
    --features {results/preprocess/features.tsv} \
    --labelCol {column name} \
    --modelName {name} \
    --outdir {results} \
    --mutationType {snvs or indels} \
    --modelPath {trained_models/snvs.pkl}  # optional
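
The label column has to exist in the features table before training. One possible way to add it, given a set of mutations already known to be artifacts, is sketched below. The helper add_label_column, the default label name is_artifact and the key columns used in the usage note are hypothetical; match them to the actual header of your features.tsv and pass the label name through --labelCol.

```python
import csv


def add_label_column(features_tsv, labeled_tsv, artifact_keys, key_cols,
                     label_col="is_artifact"):
    """Copy a features table, appending a 0/1 label column (0: real, 1: artifact).

    artifact_keys is a set of tuples holding the values of key_cols for every
    mutation known to be an artifact. Returns the number of rows written.
    """
    with open(features_tsv, newline="") as src, \
            open(labeled_tsv, "w", newline="") as dst:
        reader = csv.DictReader(src, delimiter="\t")
        fieldnames = list(reader.fieldnames or []) + [label_col]
        writer = csv.DictWriter(dst, fieldnames=fieldnames, delimiter="\t",
                                lineterminator="\n")
        writer.writeheader()
        n_rows = 0
        for row in reader:
            key = tuple(row[col] for col in key_cols)
            row[label_col] = 1 if key in artifact_keys else 0
            writer.writerow(row)
            n_rows += 1
    return n_rows
```

For example, add_label_column("features.tsv", "labeled.tsv", {("1", "100")}, key_cols=("CHROM", "POS")) would mark the mutation at chromosome 1, position 100 as an artifact, provided the table really has CHROM and POS columns.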

Contributing

Contributions are welcome and greatly appreciated. Check our contributing guidelines!
