Skip to content

Commit 7cc4da7

Browse files
authored
Merge pull request #10 from madscort/dev
Dev
2 parents b8978bd + 06e168c commit 7cc4da7

Some content is hidden

Large Commits have some content hidden by default. Use the searchbox below for content that may be hidden.

77 files changed

+3085
-290
lines changed

.gitignore

Lines changed: 2 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -176,3 +176,5 @@ DoBSeqWF.Rproj
176176
.rpack/
177177
nxf-tmp.*
178178
cramtable.tsv
179+
pooltable.tsv
180+
vcftable.tsv

README.md

Lines changed: 141 additions & 16 deletions
Original file line numberDiff line numberDiff line change
@@ -1,40 +1,32 @@
11
# DoBSeq Nextflow Pipeline
22

3-
Usage:
3+
## General usage outside NGC-HPC:
44

5-
1. Install nextflow [manually](https://www.nextflow.io/docs/latest/getstarted.html) or using conda:
5+
### 1. Install nextflow [manually](https://www.nextflow.io/docs/latest/getstarted.html) or using conda:
66

77
```Bash
88
conda install nextflow
99
```
1010

11-
2. In a clean folder - clone this repository:
12-
13-
- With internet access:
11+
### 2. In a clean folder - clone this repository:
1412

1513
```Bash
1614
git clone https://github.com/madscort/DoBSeqWF.git .
1715
```
1816

19-
- Without internet access on NGC-HPC:
20-
21-
```Bash
22-
git clone /ngc/projects/icope_staging_r/git/predisposed/.git .
23-
```
24-
25-
3. Run pipeline without any data (dry-run):
17+
### 3. Run pipeline without any data (dry-run):
2618

2719
```Bash
2820
nextflow run main.nf -profile (standard/esrum/ngc),test -stub
2921
```
3022

31-
4. Run pipeline with tiny test data:
23+
### 4. Run pipeline with tiny test data:
3224

3325
```Bash
34-
nextflow run main.nf -profile (standard/esrum/ngc),test
26+
nextflow run main.nf -profile (standard/esrum),test
3527
```
3628

37-
5. Run pipeline with input data (see test data for file contents):
29+
### 5. Run pipeline with input data (see test data for file contents):
3830

3931
```Bash
4032
nextflow run main.nf \
@@ -46,18 +38,148 @@ nextflow run main.nf \
4638
--ploidy <integer>
4739
```
4840

49-
Workflow repository contents:
41+
## Usage on NGC-HPC
42+
43+
### 0. Create qsub helper script
44+
45+
Create a wrapper script for qsub, so you don't have to keep track of working directory, group etc. again.
46+
First do `mkdir ~/bin`. Then save the following script as a file named `~/bin/myqsub` and make it executable by `chmod +x ~/bin/myqsub`.
47+
48+
```
49+
#!/bin/bash
50+
51+
qsub -W group_list=icope_staging_r -A icope_staging_r -d $(pwd) "$@"
52+
```
53+
54+
Add ~/bin to your path. You can have this done on log-in by appending the following line to your `~/.bashrc`:
55+
56+
```
57+
export PATH="$PATH:$HOME/bin"
58+
```
59+
60+
### 1. In a clean folder - clone this repository from the local NGC-HPC folder:
61+
62+
```Bash
63+
git clone /ngc/projects/icope_staging_r/git/predisposed/.git .
64+
```
65+
66+
### 2. Pipeline preview with test data:
67+
68+
```Bash
69+
bash next.pbs -params-file test_config.json -stub
70+
```
71+
72+
### 3. Run pipeline with test data locally (all heavy jobs are still submitted with qsub):
73+
74+
```Bash
75+
bash next.pbs -params-file test_config.json
76+
```
77+
78+
### 4. Submit pipeline with test data:
79+
80+
```Bash
81+
myqsub next.pbs -F -params-file test_config.json
82+
```
83+
84+
## Running the pipeline with _real_ data
85+
86+
### 1. Organization
87+
While the pipeline is still under development, it make sense to create new clones for each pipeline run, to keep track of possible changes done while running it. I propose this folder structure:
88+
89+
```Bash
90+
predisposed
91+
├── git/DoBSeqWF # Temporary local workflow repository
92+
├── resources # Reference genome and target files.
93+
├── data/ # Raw data for each batch
94+
│ ├── <batch_id_I>/
95+
│ │ └── *.fq.gz
96+
│ ├── <batch_id_II>/
97+
│ │ └── *.fq.gz
98+
│ └── <batch_id_III>/
99+
│ └── *.fq.gz
100+
│ └── ...
101+
└── processed_data/ # Processed data for each batch
102+
├── <batch_id_I>/
103+
│ ├── DoBSeqWF/ # Clone repository here
104+
│ │ ├── config.json # Configuration file
105+
│ │ ├── pooltable.tsv # Pool table (create with helper script)
106+
│ │ └── decodetable.tsv # Decode table (we need a convention for this)
107+
│ └── results
108+
│ ├── cram/ # CRAM files for each pool
109+
│ ├── logs/ # Log files for each process
110+
│ ├── variants/ # VCF files for each pool
111+
│ ├── variant_tables/ # TSV files converted from pool VCFs
112+
│ └── pinpoint_variants/
113+
│ ├── all_pins/ # All pinpointables for each sample in individual vcfs (*note)
114+
│ ├── unique_pins/ # All unique pinpointables for each sample in individual vcfs (*note)
115+
│ ├── *_merged.vcf.gz # All pinpointables for all samples in a single vcf without sample information
116+
│ ├── summary.tsv # Variant counts for each sample
117+
│ └── lookup.tsv # Variant to sample lookup table
118+
├── <batch_id_II>/
119+
│ ├── DoBSeqWF/
120+
│ └── results/
121+
├── <batch_id_III>/
122+
│ ├── DoBSeqWF/
123+
└── results/
124+
└── ...
125+
```
126+
(*note) Each pinpointable variant can be represented by the horizontal or the vertical pools. In order not to loose any information, there are, _for now_, 6 vcf files for each sample. Four with representations from either dimension named {sample}\_{pool}\_{unique/all}\_pins.vcf.gz and 2 with all pins merged named {sample}\_{unique/all}.vcf.gz.
127+
128+
### 2. Prepare folders
129+
130+
```Bash
131+
cd /ngc/projects2/dp_00005/data/predisposed/
132+
mkdir -p data/<batch_id> processed_data/<batch_id>
133+
mv /ssi/fastq/data /ngc/projects2/dp_00005/data/predisposed/data/<batch_id>/
134+
```
135+
136+
### 2. Clone pipeline repository
137+
138+
```Bash
139+
cd processed_data/<batch_id>
140+
git clone /ngc/projects2/dp_00005/data/predisposed/git/DoBSeqWF
141+
```
142+
143+
### 3. Create pooltable.tsv:
144+
145+
```Bash
146+
cd DoBSeqWF
147+
bash assets/helper_scripts/create_pooltable.sh ../data/predisposed/data/<batch_id>/
148+
```
149+
150+
### 4. Adjust configuration file:
151+
152+
Fill out config.json with the correct paths and parameters. Decodetable is not needed for mapping only. Look into nextflow.config for possible parameters to set in the conifg.json.
153+
154+
### 5. Run pipeline
155+
156+
```Bash
157+
myqsub next.pbs -F -params-file config.json
158+
```
159+
160+
### 6. Monitor progress
161+
162+
```Bash
163+
tail nextflow.log
164+
```
165+
166+
If the pipeline fails - it is likely due to resource constraints. Adjust as needed in the conf/profiles.config file under NGC, and rerun the PBS script. Be aware that any direct edits of the workflow scripts, ie. modules and subworkflows, can lead to complete re-run of the pipeline.
167+
168+
169+
# Workflow repository contents:
50170

51171
```Bash
52172
DoBSeqWF
53173
├── LICENSE
174+
├── VERSION
54175
├── README.md
55176
├── assets
56177
│ ├── data
57178
│ │ ├── reference_genomes
58179
│ │ │ └── small
59180
│ │ │ └── small_reference.*
60181
│ │ └── test_data
182+
│ │ ├── coordtable.tsv
61183
│ │ ├── decodetable.tsv
62184
│ │ ├── pools
63185
│ │ │ └── *.fq.gz
@@ -76,6 +198,9 @@ DoBSeqWF
76198
├── main.nf # Main workflow
77199
├── modules/
78200
│ └── <module>.nf # Module scripts
201+
├── subworkflows/
202+
│ └── <subworkflow>.nf # Module scripts
203+
├── next.pbs # Helper script for running on NGC-HPC
79204
└── nextflow.config # Workflow parameters
80205
```
81206

VERSION

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1 @@
1+
0.1.0
Lines changed: 28 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,28 @@
1+
##fileformat=VCFv4.2
2+
##FILTER=<ID=LowQual,Description="Low quality">
3+
##FORMAT=<ID=AD,Number=R,Type=Integer,Description="Allelic depths for the ref and alt alleles in the order listed">
4+
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth (reads with MQ=255 or with bad mates are filtered)">
5+
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
6+
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
7+
##FORMAT=<ID=PL,Number=G,Type=Integer,Description="Normalized, Phred-scaled likelihoods for genotypes as defined in the VCF specification">
8+
##GATKCommandLine=<ID=HaplotypeCaller,CommandLine="HaplotypeCaller --max-alternate-alleles 3 --sample-ploidy 2 --max-num-haplotypes-in-population 1000 --output B0_H1.GATK.vcf.gz --max-reads-per-alignment-start 0 --intervals target_calling.bed --input B0_H1.iq.bam --reference small_reference.fna --disable-read-filter NotDuplicateReadFilter --use-posteriors-to-calculate-qual false --dont-use-dragstr-priors false --use-new-qual-calculator true --annotate-with-num-discovered-alleles false --heterozygosity 0.001 --indel-heterozygosity 1.25E-4 --heterozygosity-stdev 0.01 --standard-min-confidence-threshold-for-calling 30.0 --max-genotype-count 1024 --num-reference-samples-if-no-call 0 --genotype-assignment-method USE_PLS_TO_ASSIGN --contamination-fraction-to-filter 0.0 --output-mode EMIT_VARIANTS_ONLY --all-site-pls false --flow-likelihood-parallel-threads 0 --flow-likelihood-optimized-comp false --flow-use-t0-tag false --flow-probability-threshold 0.003 --flow-remove-non-single-base-pair-indels false --flow-remove-one-zero-probs false --flow-quantization-bins 121 --flow-fill-empty-bins-value 0.001 --flow-symmetric-indel-probs false --flow-report-insertion-or-deletion false --flow-disallow-probs-larger-than-call false --flow-lump-probs false --flow-retain-max-n-probs-base-format false --flow-probability-scaling-factor 10 --flow-order-cycle-length 4 --flow-number-of-uncertain-flows-to-clip 0 --flow-nucleotide-of-first-uncertain-flow T --keep-boundary-flows false --gvcf-gq-bands 1 --gvcf-gq-bands 2 --gvcf-gq-bands 3 --gvcf-gq-bands 4 --gvcf-gq-bands 5 --gvcf-gq-bands 6 --gvcf-gq-bands 7 --gvcf-gq-bands 8 --gvcf-gq-bands 9 --gvcf-gq-bands 10 --gvcf-gq-bands 11 --gvcf-gq-bands 12 --gvcf-gq-bands 13 --gvcf-gq-bands 14 --gvcf-gq-bands 15 --gvcf-gq-bands 16 --gvcf-gq-bands 17 --gvcf-gq-bands 18 --gvcf-gq-bands 19 --gvcf-gq-bands 20 --gvcf-gq-bands 21 --gvcf-gq-bands 22 --gvcf-gq-bands 23 --gvcf-gq-bands 24 --gvcf-gq-bands 25 --gvcf-gq-bands 26 --gvcf-gq-bands 27 --gvcf-gq-bands 28 --gvcf-gq-bands 29 --gvcf-gq-bands 30 --gvcf-gq-bands 31 --gvcf-gq-bands 32 --gvcf-gq-bands 33 --gvcf-gq-bands 34 --gvcf-gq-bands 35 --gvcf-gq-bands 36 --gvcf-gq-bands 37 --gvcf-gq-bands 38 --gvcf-gq-bands 39 --gvcf-gq-bands 40 --gvcf-gq-bands 41 --gvcf-gq-bands 42 --gvcf-gq-bands 43 --gvcf-gq-bands 44 --gvcf-gq-bands 45 --gvcf-gq-bands 46 --gvcf-gq-bands 47 --gvcf-gq-bands 48 --gvcf-gq-bands 49 --gvcf-gq-bands 50 --gvcf-gq-bands 51 --gvcf-gq-bands 52 --gvcf-gq-bands 53 --gvcf-gq-bands 54 --gvcf-gq-bands 55 --gvcf-gq-bands 56 --gvcf-gq-bands 57 --gvcf-gq-bands 58 --gvcf-gq-bands 59 --gvcf-gq-bands 60 --gvcf-gq-bands 70 --gvcf-gq-bands 80 --gvcf-gq-bands 90 --gvcf-gq-bands 99 --floor-blocks false --indel-size-to-eliminate-in-ref-model 10 --disable-optimizations false --dragen-mode false --flow-mode NONE --apply-bqd false --apply-frd false --disable-spanning-event-genotyping false --transform-dragen-mapping-quality false --mapping-quality-threshold-for-genotyping 20 --max-effective-depth-adjustment-for-frd 0 --just-determine-active-regions false --dont-genotype false --do-not-run-physical-phasing false --do-not-correct-overlapping-quality false --use-filtered-reads-for-annotations false --use-flow-aligner-for-stepwise-hc-filtering false --adaptive-pruning false --do-not-recover-dangling-branches false --recover-dangling-heads false --kmer-size 10 --kmer-size 25 --dont-increase-kmer-sizes-for-cycles false --allow-non-unique-kmers-in-ref false --num-pruning-samples 1 --min-dangling-branch-length 4 --recover-all-dangling-branches false --min-pruning 2 --adaptive-pruning-initial-error-rate 0.001 --pruning-lod-threshold 2.302585092994046 --pruning-seeding-lod-threshold 9.210340371976184 --max-unpruned-variants 100 --linked-de-bruijn-graph false --disable-artificial-haplotype-recovery false --enable-legacy-graph-cycle-detection false --debug-assembly false --debug-graph-transformations false --capture-assembly-failure-bam false --num-matching-bases-in-dangling-end-to-recover -1 --error-correction-log-odds -Infinity --error-correct-reads false --kmer-length-for-read-error-correction 25 --min-observations-for-kmer-to-be-solid 20 --likelihood-calculation-engine PairHMM --base-quality-score-threshold 18 --dragstr-het-hom-ratio 2 --dont-use-dragstr-pair-hmm-scores false --pair-hmm-gap-continuation-penalty 10 --expected-mismatch-rate-for-read-disqualification 0.02 --pair-hmm-implementation FASTEST_AVAILABLE --pcr-indel-model CONSERVATIVE --phred-scaled-global-read-mismapping-rate 45 --disable-symmetric-hmm-normalizing false --disable-cap-base-qualities-to-map-quality false --enable-dynamic-read-disqualification-for-genotyping false --dynamic-read-disqualification-threshold 1.0 --native-pair-hmm-threads 4 --native-pair-hmm-use-double-precision false --flow-hmm-engine-min-indel-adjust 6 --flow-hmm-engine-flat-insertion-penatly 45 --flow-hmm-engine-flat-deletion-penatly 45 --pileup-detection false --pileup-detection-enable-indel-pileup-calling false --num-artificial-haplotypes-to-add-per-allele 5 --artifical-haplotype-filtering-kmer-size 10 --pileup-detection-snp-alt-threshold 0.1 --pileup-detection-indel-alt-threshold 0.5 --pileup-detection-absolute-alt-depth 0.0 --pileup-detection-snp-adjacent-to-assembled-indel-range 5 --pileup-detection-bad-read-tolerance 0.0 --pileup-detection-proper-pair-read-badness true --pileup-detection-edit-distance-read-badness-threshold 0.08 --pileup-detection-chimeric-read-badness true --pileup-detection-template-mean-badness-threshold 0.0 --pileup-detection-template-std-badness-threshold 0.0 --bam-writer-type CALLED_HAPLOTYPES --dont-use-soft-clipped-bases false --override-fragment-softclip-check false --min-base-quality-score 10 --smith-waterman JAVA --emit-ref-confidence NONE --max-mnp-distance 0 --force-call-filtered-alleles false --reference-model-deletion-quality 30 --soft-clip-low-quality-ends false --allele-informative-reads-overlap-margin 2 --smith-waterman-dangling-end-match-value 25 --smith-waterman-dangling-end-mismatch-penalty -50 --smith-waterman-dangling-end-gap-open-penalty -110 --smith-waterman-dangling-end-gap-extend-penalty -6 --smith-waterman-haplotype-to-reference-match-value 200 --smith-waterman-haplotype-to-reference-mismatch-penalty -150 --smith-waterman-haplotype-to-reference-gap-open-penalty -260 --smith-waterman-haplotype-to-reference-gap-extend-penalty -11 --smith-waterman-read-to-haplotype-match-value 10 --smith-waterman-read-to-haplotype-mismatch-penalty -15 --smith-waterman-read-to-haplotype-gap-open-penalty -30 --smith-waterman-read-to-haplotype-gap-extend-penalty -5 --flow-assembly-collapse-hmer-size 0 --flow-assembly-collapse-partial-mode false --flow-filter-alleles false --flow-filter-alleles-qual-threshold 30.0 --flow-filter-alleles-sor-threshold 3.0 --flow-filter-lone-alleles false --flow-filter-alleles-debug-graphs false --min-assembly-region-size 50 --max-assembly-region-size 300 --active-probability-threshold 0.002 --max-prob-propagation-distance 50 --force-active false --assembly-region-padding 100 --padding-around-indels 75 --padding-around-snps 20 --padding-around-strs 75 --max-extension-into-assembly-region-padding-legacy 25 --enable-legacy-assembly-region-trimming false --interval-set-rule UNION --interval-padding 0 --interval-exclusion-padding 0 --interval-merging-rule ALL --read-validation-stringency SILENT --seconds-between-progress-updates 10.0 --disable-sequence-dictionary-validation false --create-output-bam-index true --create-output-bam-md5 false --create-output-variant-index true --create-output-variant-md5 false --max-variants-per-shard 0 --lenient false --add-output-sam-program-record true --add-output-vcf-command-line true --cloud-prefetch-buffer 40 --cloud-index-prefetch-buffer -1 --disable-bam-index-caching false --sites-only-vcf-output false --help false --version false --showHidden false --verbosity INFO --QUIET false --use-jdk-deflater false --use-jdk-inflater false --gcs-max-retries 20 --gcs-project-for-requester-pays --disable-tool-default-read-filters false --minimum-mapping-quality 20 --disable-tool-default-annotations false --enable-all-annotations false --allow-old-rms-mapping-quality-annotation-data false",Version="4.3.0.0",Date="15 April 2024 at 13.29.09 CEST">
9+
##INFO=<ID=AC,Number=A,Type=Integer,Description="Allele count in genotypes, for each ALT allele, in the same order as listed">
10+
##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency, for each ALT allele, in the same order as listed">
11+
##INFO=<ID=AN,Number=1,Type=Integer,Description="Total number of alleles in called genotypes">
12+
##INFO=<ID=BaseQRankSum,Number=1,Type=Float,Description="Z-score from Wilcoxon rank sum test of Alt Vs. Ref base qualities">
13+
##INFO=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth; some reads may have been filtered">
14+
##INFO=<ID=ExcessHet,Number=1,Type=Float,Description="Phred-scaled p-value for exact test of excess heterozygosity">
15+
##INFO=<ID=FS,Number=1,Type=Float,Description="Phred-scaled p-value using Fisher's exact test to detect strand bias">
16+
##INFO=<ID=InbreedingCoeff,Number=1,Type=Float,Description="Inbreeding coefficient as estimated from the genotype likelihoods per-sample when compared against the Hardy-Weinberg expectation">
17+
##INFO=<ID=MLEAC,Number=A,Type=Integer,Description="Maximum likelihood expectation (MLE) for the allele counts (not necessarily the same as the AC), for each ALT allele, in the same order as listed">
18+
##INFO=<ID=MLEAF,Number=A,Type=Float,Description="Maximum likelihood expectation (MLE) for the allele frequency (not necessarily the same as the AF), for each ALT allele, in the same order as listed">
19+
##INFO=<ID=MQ,Number=1,Type=Float,Description="RMS Mapping Quality">
20+
##INFO=<ID=MQRankSum,Number=1,Type=Float,Description="Z-score From Wilcoxon rank sum test of Alt vs. Ref read mapping qualities">
21+
##INFO=<ID=QD,Number=1,Type=Float,Description="Variant Confidence/Quality by Depth">
22+
##INFO=<ID=ReadPosRankSum,Number=1,Type=Float,Description="Z-score from Wilcoxon rank sum test of Alt vs. Ref read position bias">
23+
##INFO=<ID=SOR,Number=1,Type=Float,Description="Symmetric Odds Ratio of 2x2 contingency table to detect strand bias">
24+
##contig=<ID=small_ref,length=1980>
25+
##source=HaplotypeCaller
26+
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT B0_H1
27+
small_ref 280 . T C 2001.64 . AC=1;AF=0.500;AN=2;BaseQRankSum=0.000;DP=102;ExcessHet=0.0000;FS=3.235;MLEAC=1;MLEAF=0.500;MQ=60.00;MQRankSum=0.000;QD=20.22;ReadPosRankSum=-1.398;SOR=1.806 GT:AD:DP:GQ:PL 0/1:47,52:99:99:2009,0,1806
28+
small_ref 655 . C T 3301.64 . AC=1;AF=0.500;AN=2;BaseQRankSum=0.000;DP=181;ExcessHet=0.0000;FS=1.851;MLEAC=1;MLEAF=0.500;MQ=60.00;MQRankSum=0.000;QD=18.65;ReadPosRankSum=1.852;SOR=0.643 GT:AD:DP:GQ:PL 0/1:91,86:177:99:3309,0,3533
276 Bytes
Binary file not shown.

0 commit comments

Comments
 (0)