| 1 | +# Paired-End Variant and Ploidy-Aware Genotype Calling |
| 2 | + |
| 3 | +This workflow performs quality control and mapping of paired-end reads, |
| 4 | +followed by germline variant and genotype calling for organisms of any ploidy. |
| 5 | + |
| 6 | +It takes a collection of Illumina paired-end FASTQ files, a reference genome |
| 7 | +in FASTA format, a gene annotation in GTF format, and a ploidy parameter, and |
| 8 | +produces annotated variants both as VCF and as a tab-separated table. |
| 9 | + |
| 10 | +Reads are first quality- and adapter-trimmed with fastp. Trimmed reads |
| 11 | +are then mapped to the reference genome using BWA-MEM. The resulting |
| 12 | +alignments are filtered with Samtools view to retain only properly paired |
| 13 | +reads, and PCR duplicates are removed using Picard MarkDuplicates. QC metrics |
| 14 | +from fastp, Samtools stats, and MarkDuplicates are aggregated into a single |
| 15 | +MultiQC report. |
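
To illustrate the "properly paired" filter, here is a minimal Python sketch of the underlying flag test. It assumes the workflow's Samtools view step is equivalent to requiring SAM flag bit 0x2 (as `samtools view -f 2` would); the exact parameters used by the workflow are not stated here, and the function names are hypothetical.

```python
# SAM flag bit defined in the SAM specification.
FLAG_PROPER_PAIR = 0x2  # each segment properly aligned according to the aligner


def is_properly_paired(flag: int) -> bool:
    """Return True when the proper-pair bit (0x2) is set in a SAM flag."""
    return bool(flag & FLAG_PROPER_PAIR)


def keep_properly_paired(sam_lines):
    """Yield header lines and alignment lines whose flag has bit 0x2 set.

    Assumption: input is an iterable of SAM text lines (not binary BAM).
    """
    for line in sam_lines:
        if line.startswith("@"):
            # Header lines are passed through unchanged.
            yield line
            continue
        flag = int(line.split("\t")[1])  # FLAG is the second SAM column
        if is_properly_paired(flag):
            yield line
```

For example, a read with flag 99 (paired, proper pair, mate reverse, first in pair) is kept, while a read with flag 73 (paired, mate unmapped, first in pair) is dropped.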
| 16 | + |
| 17 | +Variant and genotype calling is performed with FreeBayes, which operates in |
| 18 | +haplotype-based mode on the duplicate-free BAM. |
| 19 | +The ploidy assumed for calling is configurable and defaults to 2 (diploid). |
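
As a rough sketch of how the ploidy parameter reaches the caller, the snippet below builds a FreeBayes command line using its `--fasta-reference` and `--ploidy` options. The helper name and file names are hypothetical, and the actual Galaxy tool wrapper may pass additional options not shown here.

```python
def freebayes_command(reference_fasta: str, bam: str, ploidy: int = 2) -> list:
    """Build an argument list for a FreeBayes call at the given ploidy.

    Assumption: only the reference and ploidy options are set; FreeBayes
    writes the VCF to standard output unless told otherwise.
    """
    if ploidy < 1:
        raise ValueError("ploidy must be a positive integer")
    return [
        "freebayes",
        "--fasta-reference", reference_fasta,
        "--ploidy", str(ploidy),
        bam,
    ]


# Diploid default versus an explicitly tetraploid organism.
diploid = freebayes_command("reference.fasta", "sample.dedup.bam")
tetraploid = freebayes_command("reference.fasta", "sample.dedup.bam", ploidy=4)
```

The resulting list can be handed to `subprocess.run` on a system where FreeBayes is installed.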
| 20 | + |
| 21 | +The initial VCF output is normalised and left-aligned with bcftools norm, |
| 22 | +splitting multi-allelic sites into individual biallelic records. |
| 23 | +Variants are then functionally annotated using SnpEff, with a custom SnpEff |
| 24 | +database built on-the-fly from the provided reference FASTA and GTF annotation. |
| 25 | +Annotation is restricted to coding and splicing effects (downstream, |
| 26 | +intergenic, intronic, UTR, and upstream effects are excluded). The annotated |
| 27 | +VCF is subsequently parsed with SnpSift Extract Fields into a flat tabular |
| 28 | +format, and per-sample tables are merged into a single file. |
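
The final merge step can be pictured with the following minimal sketch, which concatenates per-sample tab-separated tables under a single header. How the workflow actually labels rows by sample is not specified here, so the leading `SAMPLE` column and the function name are assumptions made for illustration.

```python
import csv
import io


def merge_tables(tables: dict) -> str:
    """Merge per-sample TSV texts (each with a header row) into one TSV.

    `tables` maps a sample name to its TSV content. A hypothetical SAMPLE
    column is prepended so that rows remain attributable after merging.
    """
    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    header = None
    for sample, text in tables.items():
        rows = list(csv.reader(io.StringIO(text), delimiter="\t"))
        if not rows:
            continue  # skip empty tables
        if header is None:
            header = rows[0]
            writer.writerow(["SAMPLE"] + header)
        elif rows[0] != header:
            raise ValueError(f"header mismatch in table for sample {sample!r}")
        for row in rows[1:]:
            writer.writerow([sample] + row)
    return out.getvalue()


merged = merge_tables({
    "A": "CHROM\tPOS\nchr1\t10\n",
    "B": "CHROM\tPOS\nchr2\t20\n",
})
```

Checking that every table shares the same header guards against silently misaligned columns when samples were processed with different field lists.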
| 29 | + |
| 30 | +## Inputs |
| 31 | + |
| 32 | +Paired Collection: a list:paired dataset collection of Illumina paired-end |
| 33 | +reads in fastqsanger or fastqsanger.gz format. |
| 34 | + |
| 35 | +Reference Genome FASTA: the reference genome sequence to use for mapping |
| 36 | +and variant calling. |
| 37 | + |
| 38 | +Annotation GTF: a GTF gene annotation file corresponding to the reference |
| 39 | +genome, used to build the SnpEff database. |
| 40 | + |
| 41 | +Set Ploidy for FreeBayes Variant Calling: an integer specifying the ploidy |
| 42 | +of the organism (default: 2). |
| 43 | + |
| 44 | + |
| 45 | +## Outputs |
| 46 | + |
| 47 | +Fastp HTML report: per-sample HTML quality control report from fastp. |
| 48 | + |
| 49 | +Preprocessing and mapping MultiQC report: aggregated HTML QC report |
| 50 | +combining fastp, Samtools stats, and Picard MarkDuplicates metrics across |
| 51 | +all samples. |
| 52 | + |
| 53 | +SnpEff annotated variants (VCF): annotated variants in VCF format, tagged VariantsAsVCF. |
| 54 | + |
| 55 | +SnpEff HTML summary report: HTML summary statistics from SnpEff describing the |
| 56 | +distribution of variant effects across functional categories. |
| 57 | + |
| 58 | +Annotated variants table: a merged, tab-separated table of annotated variants |
| 59 | +across all samples, tagged VariantsAsTSV. Columns include CHROM, POS, |
| 60 | +FILTER, REF, ALT, DP, AF, DP4, SB, and per-effect fields for |
| 61 | +impact, functional class, effect type, gene name, codon change, amino acid |
| 62 | +change, and transcript ID. |