Commit 5fe4a6d

Merge pull request #1109 from SaimMomin12/wf/variant-calling-diploid
Add Variant Calling Workflow for diploid systems (BRC)
2 parents 2343162 + 9b17e0a commit 5fe4a6d

9 files changed: +1055 -0 lines changed

Lines changed: 13 additions & 0 deletions
@@ -0,0 +1,13 @@
+version: 1.2
+workflows:
+  - name: main
+    subclass: Galaxy
+    publish: true
+    primaryDescriptorPath: /generic-genotype+variant-calling-wgs-pe.ga
+    testParameterFiles:
+      - /generic-genotype+variant-calling-wgs-pe-test.yml
+    authors:
+      - name: Saim Momin
+        orcid: 0009-0003-9935-828X
+      - name: Wolfgang Maier
+        orcid: 0000-0002-9464-6640
Lines changed: 5 additions & 0 deletions
@@ -0,0 +1,5 @@
+# Changelog
+
+## [0.1] 2026-02-18
+
+First release.
Lines changed: 62 additions & 0 deletions
@@ -0,0 +1,62 @@
+# Paired-End Variant and Ploidy-Aware Genotype Calling
+
+This workflow performs quality control and mapping of paired-end reads, then
+germline variant and genotype calling for organisms of any given ploidy.
+
+It takes a collection of Illumina paired-end FASTQ files, a reference genome
+in FASTA format, a gene annotation in GTF format, and a ploidy parameter, and
+produces annotated variants both as VCF and as a tab-separated table.
+
+Reads are first quality- and adapter-trimmed with fastp. Trimmed reads
+are then mapped to the reference genome using BWA-MEM. The resulting
+alignments are filtered with Samtools view to retain only properly paired
+reads, and PCR duplicates are removed using Picard MarkDuplicates. QC metrics
+from fastp, Samtools stats, and MarkDuplicates are aggregated into a single
+MultiQC report.
+
+Variant and genotype calling is performed with FreeBayes, which operates in
+haplotype-based mode on the duplicate-free BAM.
+The ploidy assumed for calling is configurable and defaults to 2 (diploid).
+
+The initial VCF output is normalised and left-aligned with bcftools norm,
+splitting multi-allelic sites into individual biallelic records.
+Variants are then functionally annotated using SnpEff, with a custom SnpEff
+database built on-the-fly from the provided reference FASTA and GTF annotation.
+Annotation is restricted to coding and splicing effects (downstream,
+intergenic, intronic, UTR, and upstream effects are excluded). The annotated
+VCF is subsequently parsed with SnpSift Extract Fields into a flat tabular
+format, and per-sample tables are merged into a single file.
+
+## Inputs
+
+Paired Collection: a list:paired dataset collection of Illumina paired-end
+reads in fastqsanger or fastqsanger.gz format.
+
+Reference Genome FASTA: the reference genome sequence to use for mapping
+and variant calling.
+
+Annotation GTF: a GTF gene annotation file corresponding to the reference
+genome, used to build the SnpEff database.
+
+Set Ploidy for FreeBayes Variant Calling: an integer specifying the ploidy
+of the organism (default: 2).
+
+
+## Outputs
+
+Fastp HTML report: per-sample HTML quality control report from fastp.
+
+Preprocessing and mapping MultiQC report: aggregated HTML QC report
+combining fastp, Samtools stats, and Picard MarkDuplicates metrics across
+all samples.
+
+SnpEff annotated variants (VCF): annotated variants in VCF format, tagged VariantsasVCF.
+
+SnpEff HTML summary report: HTML summary statistics from SnpEff describing the
+distribution of variant effects across functional categories.
+
+Annotated variants table: a merged, tab-separated table of annotated variants
+across all samples, tagged VariantsAsTSV. Columns include CHROM, POS,
+FILTER, REF, ALT, DP, AF, DP4, SB, and per-effect fields for
+impact, functional class, effect type, gene name, codon change, amino acid
+change, and transcript ID.
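
The README's normalisation step splits multi-allelic sites into biallelic records. As a rough illustration of that idea only, the Python sketch below splits a comma-separated ALT field into one record per alternate allele. The `split_multiallelic` function and the dict-based record layout are hypothetical and are not part of the workflow. Real `bcftools norm` also left-aligns indels and rewrites per-allele INFO/FORMAT values and genotypes, which this sketch deliberately omits.

```python
from typing import Dict, List


def split_multiallelic(record: Dict[str, object]) -> List[Dict[str, object]]:
    """Return one copy of `record` per alternate allele.

    `record` is a hypothetical, simplified VCF row with at least the keys
    CHROM, POS, REF and ALT, where ALT may list several alleles separated
    by commas (the VCF convention for multi-allelic sites).
    """
    alts = str(record["ALT"]).split(",")
    # Copy every field unchanged and replace ALT with a single allele.
    return [{**record, "ALT": alt} for alt in alts]


if __name__ == "__main__":
    site = {"CHROM": "chr1", "POS": 751, "REF": "C", "ALT": "G,T"}
    for biallelic in split_multiallelic(site):
        print(biallelic)
```

A record that is already biallelic passes through as a single-element list, so the function can be applied uniformly to every row.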
Lines changed: 64 additions & 0 deletions
@@ -0,0 +1,64 @@
+- doc: Test outline for generic-genotype+variant-calling-wgs-pe.ga
+  job:
+    Paired Collection:
+      class: Collection
+      collection_type: list:paired
+      elements:
+        - class: Collection
+          type: paired
+          identifier: sample1
+          elements:
+            - class: File
+              identifier: forward
+              path: test-data/sample1_R1.fastq.gz
+              filetype: fastqsanger.gz
+            - class: File
+              identifier: reverse
+              path: test-data/sample1_R2.fastq.gz
+              filetype: fastqsanger.gz
+    Reference Genome fasta:
+      class: File
+      path: test-data/reference.fasta
+      filetype: fasta
+    Annotation GTF:
+      class: File
+      path: test-data/annotation.gtf
+      filetype: gtf
+    Set Ploidy for FreeBayes Variant Calling: 2
+  outputs:
+    fastp HTML report:
+      element_tests:
+        sample1:
+          asserts:
+            has_text:
+              text: "<tr><td class='col1'>total reads:</td><td class='col2'>600</td></tr>"
+
+    Preprocessing and mapping reports:
+      asserts:
+        has_text:
+          text: "MultiQC Report"
+
+    SnpEff variants:
+      asserts:
+        - that: has_text
+          text: "##fileformat=VCFv4.2"
+        - that: has_text
+          text: "##contig=<ID=chr1,length=2000>"
+        - that: has_n_lines
+          n: 71
+
+    SnpEff eff reports:
+      asserts:
+        - that: has_text
+          text: "<td valign=top> <b> Number of variants processed <br> (i.e. after filter and non-variants) </b> </td>"
+        - that: has_text
+          text: "<td> 2 </td>"
+
+    Annotated Variants:
+      asserts:
+        has_n_lines:
+          n: 3
+        has_n_columns:
+          n: 19
+        has_text:
+          text: "chr1\t751\t.\tC\tG\t60\t1.0\t0\t60\t0.0\t4.31318\tLOW\tSILENT\tSYNONYMOUS_CODING\tSynGene1\tccC/ccG\tP184\tTRANS1"
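
The assertions on the tabular output (substring match, line count, column count) can be approximated in a few lines of Python. This is a minimal sketch of the idea, not Galaxy's implementation: the helper names simply mirror the assertion names, the toy table is invented for illustration, and checking the column count on the first line only is an assumption.

```python
def has_text(content: str, text: str) -> bool:
    """True if `text` occurs anywhere in `content` (plain substring match)."""
    return text in content


def has_n_lines(content: str, n: int) -> bool:
    """True if `content` consists of exactly `n` lines."""
    return len(content.splitlines()) == n


def has_n_columns(content: str, n: int, sep: str = "\t") -> bool:
    """True if the first line has exactly `n` `sep`-separated fields.

    Assumption: only the first line is inspected.
    """
    lines = content.splitlines()
    return bool(lines) and len(lines[0].split(sep)) == n


if __name__ == "__main__":
    # Hypothetical three-column table standing in for the real 19-column output.
    toy_table = "CHROM\tPOS\tREF\nchr1\t751\tC\nchr1\t900\tA\n"
    print(has_text(toy_table, "chr1\t751"))
    print(has_n_lines(toy_table, 3))
    print(has_n_columns(toy_table, 3))
```

Because `has_text` is a substring match, an expected string may cover fewer fields than the full row, which is consistent with a column-count assertion larger than the number of fields in the expected text.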
