RasmussenLab
diff --git a/‎.gitignore‎
Lines changed: 2 additions & 0 deletions b/‎.gitignore‎
Lines changed: 2 additions & 0 deletions
diff --git a/‎README.md‎
Lines changed: 141 additions & 16 deletions b/‎README.md‎
Lines changed: 141 additions & 16 deletions
diff --git a/‎VERSION‎
Lines changed: 1 addition & 0 deletions b/‎VERSION‎
Lines changed: 1 addition & 0 deletions
diff --git a/‎assets/data/test_data/known_sites/g1000.vcf‎
Lines changed: 28 additions & 0 deletions b/‎assets/data/test_data/known_sites/g1000.vcf‎
Lines changed: 28 additions & 0 deletions
diff --git a/‎assets/data/test_data/known_sites/g1000.vcf.idx‎
276 Bytes b/‎assets/data/test_data/known_sites/g1000.vcf.idx‎
276 Bytes
@@ -176,3 +176,5 @@ DoBSeqWF.Rproj
 .rpack/
 nxf-tmp.*
 cramtable.tsv
+pooltable.tsv
+vcftable.tsv
@@ -1,40 +1,32 @@
 # DoBSeq Nextflow Pipeline
 
-Usage:
+## General usage outside NGC-HPC:
 
-1. Install nextflow [manually](https://www.nextflow.io/docs/latest/getstarted.html) or using conda:
+### 1. Install nextflow [manually](https://www.nextflow.io/docs/latest/getstarted.html) or using conda:
 
 ```Bash
 conda install nextflow
 ```
 
-2. In a clean folder - clone this repository:
-
-- With internet access:
+### 2. In a clean folder - clone this repository:
 
 ```Bash
 git clone https://github.com/madscort/DoBSeqWF.git .
 ```
 
-- Without internet access on NGC-HPC:
-
-```Bash
-git clone /ngc/projects/icope_staging_r/git/predisposed/.git .
-```
-
-3. Run pipeline without any data (dry-run):
+### 3. Run pipeline without any data (dry-run):
 
 ```Bash
 nextflow run main.nf -profile (standard/esrum/ngc),test -stub
 ```
 
-4. Run pipeline with tiny test data:
+### 4. Run pipeline with tiny test data:
 
 ```Bash
-nextflow run main.nf -profile (standard/esrum/ngc),test
+nextflow run main.nf -profile (standard/esrum),test
 ```
 
-5. Run pipeline with input data (see test data for file contents):
+### 5. Run pipeline with input data (see test data for file contents):
 
 ```Bash
 nextflow run main.nf                                       \
@@ -46,18 +38,148 @@ nextflow run main.nf                                       \
   --ploidy <integer>
 ```
 
-Workflow repository contents:
+## Usage on NGC-HPC
+
+### 0. Create qsub helper script
+
+Create a wrapper script for qsub, so you don't have to keep track of working directory, group etc. again.
+First do `mkdir ~/bin`. Then save the following script as a file named `~/bin/myqsub` and make it executable by `chmod +x ~/bin/myqsub`.
+
+```
+#!/bin/bash
+
+qsub -W group_list=icope_staging_r -A icope_staging_r -d $(pwd) "$@"
+```
+
+Add ~/bin to your path. You can have this done on log-in by appending the following line to your `~/.bashrc`:
+
+```
+export PATH="$PATH:$HOME/bin"
+```
+
+### 1. In a clean folder - clone this repository from the local NGC-HPC folder:
+
+```Bash
+git clone /ngc/projects/icope_staging_r/git/predisposed/.git .
+```
+
+### 2. Pipeline preview with test data:
+
+```Bash
+bash next.pbs -params-file test_config.json -stub
+```
+
+### 3. Run pipeline with test data locally (all heavy jobs are still submitted with qsub):
+
+```Bash
+bash next.pbs -params-file test_config.json
+```
+
+### 4. Submit pipeline with test data:
+
+```Bash
+myqsub next.pbs -F -params-file test_config.json
+```
+
+## Running the pipeline with _real_ data
+
+### 1. Organization
+While the pipeline is still under development, it make sense to create new clones for each pipeline run, to keep track of possible changes done while running it. I propose this folder structure:
+
+```Bash
+predisposed
+├── git/DoBSeqWF                            # Temporary local workflow repository
+├── resources                               # Reference genome and target files.
+├── data/                                   # Raw data for each batch
+│   ├── <batch_id_I>/
+│   │   └── *.fq.gz
+│   ├── <batch_id_II>/
+│   │   └── *.fq.gz
+│   └── <batch_id_III>/
+│       └── *.fq.gz
+│   └── ...
+└── processed_data/                         # Processed data for each batch
+    ├── <batch_id_I>/
+    │   ├── DoBSeqWF/                       # Clone repository here
+    │   │   ├── config.json                 # Configuration file
+    │   │   ├── pooltable.tsv               # Pool table (create with helper script)
+    │   │   └── decodetable.tsv             # Decode table (we need a convention for this)
+    │   └── results
+    │       ├── cram/                       # CRAM files for each pool
+    │       ├── logs/                       # Log files for each process
+    │       ├── variants/                   # VCF files for each pool
+    │       ├── variant_tables/             # TSV files converted from pool VCFs
+    │       └── pinpoint_variants/
+    │           ├── all_pins/               # All pinpointables for each sample in individual vcfs (*note)
+    │           ├── unique_pins/            # All unique pinpointables for each sample in individual vcfs (*note)
+    │           ├── *_merged.vcf.gz         # All pinpointables for all samples in a single vcf without sample information
+    │           ├── summary.tsv             # Variant counts for each sample
+    │           └── lookup.tsv              # Variant to sample lookup table
+    ├── <batch_id_II>/
+    │   ├── DoBSeqWF/
+    │   └── results/
+    ├── <batch_id_III>/
+    │   ├── DoBSeqWF/
+        └── results/
+    └── ...
+```
+(*note) Each pinpointable variant can be represented by the horizontal or the vertical pools. In order not to loose any information, there are, _for now_, 6 vcf files for each sample. Four with representations from either dimension named {sample}\_{pool}\_{unique/all}\_pins.vcf.gz and 2 with all pins merged named {sample}\_{unique/all}.vcf.gz.
+
+### 2. Prepare folders
+
+```Bash
+cd /ngc/projects2/dp_00005/data/predisposed/
+mkdir -p data/<batch_id> processed_data/<batch_id>
+mv /ssi/fastq/data /ngc/projects2/dp_00005/data/predisposed/data/<batch_id>/
+```
+
+### 2. Clone pipeline repository
+
+```Bash
+cd processed_data/<batch_id>
+git clone /ngc/projects2/dp_00005/data/predisposed/git/DoBSeqWF
+```
+
+### 3. Create pooltable.tsv:
+
+```Bash
+cd DoBSeqWF
+bash assets/helper_scripts/create_pooltable.sh ../data/predisposed/data/<batch_id>/
+```
+
+### 4. Adjust configuration file:
+
+Fill out config.json with the correct paths and parameters. Decodetable is not needed for mapping only. Look into nextflow.config for possible parameters to set in the conifg.json.
+
+### 5. Run pipeline
+
+```Bash
+myqsub next.pbs -F -params-file config.json
+```
+
+### 6. Monitor progress
+
+```Bash
+tail nextflow.log
+```
+
+If the pipeline fails - it is likely due to resource constraints. Adjust as needed in the conf/profiles.config file under NGC, and rerun the PBS script. Be aware that any direct edits of the workflow scripts, ie. modules and subworkflows, can lead to complete re-run of the pipeline.
+
+
+# Workflow repository contents:
 
 ```Bash
 DoBSeqWF                                    
 ├── LICENSE
+├── VERSION
 ├── README.md
 ├── assets
 │   ├── data
 │   │   ├── reference_genomes
 │   │   │   └── small
 │   │   │       └── small_reference.*
 │   │   └── test_data
+│   │       ├── coordtable.tsv
 │   │       ├── decodetable.tsv
 │   │       ├── pools
 │   │       │   └── *.fq.gz
@@ -76,6 +198,9 @@ DoBSeqWF
 ├── main.nf                               # Main workflow
 ├── modules/
 │   └── <module>.nf                       # Module scripts
+├── subworkflows/
+│   └── <subworkflow>.nf                  # Module scripts
+├── next.pbs                              # Helper script for running on NGC-HPC
 └── nextflow.config                       # Workflow parameters
 ```
 
@@ -0,0 +1 @@
+0.1.0
@@ -0,0 +1,28 @@
+##fileformat=VCFv4.2
+##FILTER=<ID=LowQual,Description="Low quality">
+##FORMAT=<ID=AD,Number=R,Type=Integer,Description="Allelic depths for the ref and alt alleles in the order listed">
+##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth (reads with MQ=255 or with bad mates are filtered)">
+##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
+##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
+##FORMAT=<ID=PL,Number=G,Type=Integer,Description="Normalized, Phred-scaled likelihoods for genotypes as defined in the VCF specification">
+##GATKCommandLine=<ID=HaplotypeCaller,CommandLine="HaplotypeCaller --max-alternate-alleles 3 --sample-ploidy 2 --max-num-haplotypes-in-population 1000 --output B0_H1.GATK.vcf.gz --max-reads-per-alignment-start 0 --intervals target_calling.bed --input B0_H1.iq.bam --reference small_reference.fna --disable-read-filter NotDuplicateReadFilter --use-posteriors-to-calculate-qual false --dont-use-dragstr-priors false --use-new-qual-calculator true --annotate-with-num-discovered-alleles false --heterozygosity 0.001 --indel-heterozygosity 1.25E-4 --heterozygosity-stdev 0.01 --standard-min-confidence-threshold-for-calling 30.0 --max-genotype-count 1024 --num-reference-samples-if-no-call 0 --genotype-assignment-method USE_PLS_TO_ASSIGN --contamination-fraction-to-filter 0.0 --output-mode EMIT_VARIANTS_ONLY --all-site-pls false --flow-likelihood-parallel-threads 0 --flow-likelihood-optimized-comp false --flow-use-t0-tag false --flow-probability-threshold 0.003 --flow-remove-non-single-base-pair-indels false --flow-remove-one-zero-probs false --flow-quantization-bins 121 --flow-fill-empty-bins-value 0.001 --flow-symmetric-indel-probs false --flow-report-insertion-or-deletion false --flow-disallow-probs-larger-than-call false --flow-lump-probs false --flow-retain-max-n-probs-base-format false --flow-probability-scaling-factor 10 --flow-order-cycle-length 4 --flow-number-of-uncertain-flows-to-clip 0 --flow-nucleotide-of-first-uncertain-flow T --keep-boundary-flows false --gvcf-gq-bands 1 --gvcf-gq-bands 2 --gvcf-gq-bands 3 --gvcf-gq-bands 4 --gvcf-gq-bands 5 --gvcf-gq-bands 6 --gvcf-gq-bands 7 --gvcf-gq-bands 8 --gvcf-gq-bands 9 --gvcf-gq-bands 10 --gvcf-gq-bands 11 --gvcf-gq-bands 12 --gvcf-gq-bands 13 --gvcf-gq-bands 14 --gvcf-gq-bands 15 --gvcf-gq-bands 16 --gvcf-gq-bands 17 --gvcf-gq-bands 18 --gvcf-gq-bands 19 --gvcf-gq-bands 20 --gvcf-gq-bands 21 --gvcf-gq-bands 22 --gvcf-gq-bands 23 --gvcf-gq-bands 24 --gvcf-gq-bands 25 --gvcf-gq-bands 26 --gvcf-gq-bands 27 --gvcf-gq-bands 28 --gvcf-gq-bands 29 --gvcf-gq-bands 30 --gvcf-gq-bands 31 --gvcf-gq-bands 32 --gvcf-gq-bands 33 --gvcf-gq-bands 34 --gvcf-gq-bands 35 --gvcf-gq-bands 36 --gvcf-gq-bands 37 --gvcf-gq-bands 38 --gvcf-gq-bands 39 --gvcf-gq-bands 40 --gvcf-gq-bands 41 --gvcf-gq-bands 42 --gvcf-gq-bands 43 --gvcf-gq-bands 44 --gvcf-gq-bands 45 --gvcf-gq-bands 46 --gvcf-gq-bands 47 --gvcf-gq-bands 48 --gvcf-gq-bands 49 --gvcf-gq-bands 50 --gvcf-gq-bands 51 --gvcf-gq-bands 52 --gvcf-gq-bands 53 --gvcf-gq-bands 54 --gvcf-gq-bands 55 --gvcf-gq-bands 56 --gvcf-gq-bands 57 --gvcf-gq-bands 58 --gvcf-gq-bands 59 --gvcf-gq-bands 60 --gvcf-gq-bands 70 --gvcf-gq-bands 80 --gvcf-gq-bands 90 --gvcf-gq-bands 99 --floor-blocks false --indel-size-to-eliminate-in-ref-model 10 --disable-optimizations false --dragen-mode false --flow-mode NONE --apply-bqd false --apply-frd false --disable-spanning-event-genotyping false --transform-dragen-mapping-quality false --mapping-quality-threshold-for-genotyping 20 --max-effective-depth-adjustment-for-frd 0 --just-determine-active-regions false --dont-genotype false --do-not-run-physical-phasing false --do-not-correct-overlapping-quality false --use-filtered-reads-for-annotations false --use-flow-aligner-for-stepwise-hc-filtering false --adaptive-pruning false --do-not-recover-dangling-branches false --recover-dangling-heads false --kmer-size 10 --kmer-size 25 --dont-increase-kmer-sizes-for-cycles false --allow-non-unique-kmers-in-ref false --num-pruning-samples 1 --min-dangling-branch-length 4 --recover-all-dangling-branches false --min-pruning 2 --adaptive-pruning-initial-error-rate 0.001 --pruning-lod-threshold 2.302585092994046 --pruning-seeding-lod-threshold 9.210340371976184 --max-unpruned-variants 100 --linked-de-bruijn-graph false --disable-artificial-haplotype-recovery false --enable-legacy-graph-cycle-detection false --debug-assembly false --debug-graph-transformations false --capture-assembly-failure-bam false --num-matching-bases-in-dangling-end-to-recover -1 --error-correction-log-odds -Infinity --error-correct-reads false --kmer-length-for-read-error-correction 25 --min-observations-for-kmer-to-be-solid 20 --likelihood-calculation-engine PairHMM --base-quality-score-threshold 18 --dragstr-het-hom-ratio 2 --dont-use-dragstr-pair-hmm-scores false --pair-hmm-gap-continuation-penalty 10 --expected-mismatch-rate-for-read-disqualification 0.02 --pair-hmm-implementation FASTEST_AVAILABLE --pcr-indel-model CONSERVATIVE --phred-scaled-global-read-mismapping-rate 45 --disable-symmetric-hmm-normalizing false --disable-cap-base-qualities-to-map-quality false --enable-dynamic-read-disqualification-for-genotyping false --dynamic-read-disqualification-threshold 1.0 --native-pair-hmm-threads 4 --native-pair-hmm-use-double-precision false --flow-hmm-engine-min-indel-adjust 6 --flow-hmm-engine-flat-insertion-penatly 45 --flow-hmm-engine-flat-deletion-penatly 45 --pileup-detection false --pileup-detection-enable-indel-pileup-calling false --num-artificial-haplotypes-to-add-per-allele 5 --artifical-haplotype-filtering-kmer-size 10 --pileup-detection-snp-alt-threshold 0.1 --pileup-detection-indel-alt-threshold 0.5 --pileup-detection-absolute-alt-depth 0.0 --pileup-detection-snp-adjacent-to-assembled-indel-range 5 --pileup-detection-bad-read-tolerance 0.0 --pileup-detection-proper-pair-read-badness true --pileup-detection-edit-distance-read-badness-threshold 0.08 --pileup-detection-chimeric-read-badness true --pileup-detection-template-mean-badness-threshold 0.0 --pileup-detection-template-std-badness-threshold 0.0 --bam-writer-type CALLED_HAPLOTYPES --dont-use-soft-clipped-bases false --override-fragment-softclip-check false --min-base-quality-score 10 --smith-waterman JAVA --emit-ref-confidence NONE --max-mnp-distance 0 --force-call-filtered-alleles false --reference-model-deletion-quality 30 --soft-clip-low-quality-ends false --allele-informative-reads-overlap-margin 2 --smith-waterman-dangling-end-match-value 25 --smith-waterman-dangling-end-mismatch-penalty -50 --smith-waterman-dangling-end-gap-open-penalty -110 --smith-waterman-dangling-end-gap-extend-penalty -6 --smith-waterman-haplotype-to-reference-match-value 200 --smith-waterman-haplotype-to-reference-mismatch-penalty -150 --smith-waterman-haplotype-to-reference-gap-open-penalty -260 --smith-waterman-haplotype-to-reference-gap-extend-penalty -11 --smith-waterman-read-to-haplotype-match-value 10 --smith-waterman-read-to-haplotype-mismatch-penalty -15 --smith-waterman-read-to-haplotype-gap-open-penalty -30 --smith-waterman-read-to-haplotype-gap-extend-penalty -5 --flow-assembly-collapse-hmer-size 0 --flow-assembly-collapse-partial-mode false --flow-filter-alleles false --flow-filter-alleles-qual-threshold 30.0 --flow-filter-alleles-sor-threshold 3.0 --flow-filter-lone-alleles false --flow-filter-alleles-debug-graphs false --min-assembly-region-size 50 --max-assembly-region-size 300 --active-probability-threshold 0.002 --max-prob-propagation-distance 50 --force-active false --assembly-region-padding 100 --padding-around-indels 75 --padding-around-snps 20 --padding-around-strs 75 --max-extension-into-assembly-region-padding-legacy 25 --enable-legacy-assembly-region-trimming false --interval-set-rule UNION --interval-padding 0 --interval-exclusion-padding 0 --interval-merging-rule ALL --read-validation-stringency SILENT --seconds-between-progress-updates 10.0 --disable-sequence-dictionary-validation false --create-output-bam-index true --create-output-bam-md5 false --create-output-variant-index true --create-output-variant-md5 false --max-variants-per-shard 0 --lenient false --add-output-sam-program-record true --add-output-vcf-command-line true --cloud-prefetch-buffer 40 --cloud-index-prefetch-buffer -1 --disable-bam-index-caching false --sites-only-vcf-output false --help false --version false --showHidden false --verbosity INFO --QUIET false --use-jdk-deflater false --use-jdk-inflater false --gcs-max-retries 20 --gcs-project-for-requester-pays  --disable-tool-default-read-filters false --minimum-mapping-quality 20 --disable-tool-default-annotations false --enable-all-annotations false --allow-old-rms-mapping-quality-annotation-data false",Version="4.3.0.0",Date="15 April 2024 at 13.29.09 CEST">
+##INFO=<ID=AC,Number=A,Type=Integer,Description="Allele count in genotypes, for each ALT allele, in the same order as listed">
+##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency, for each ALT allele, in the same order as listed">
+##INFO=<ID=AN,Number=1,Type=Integer,Description="Total number of alleles in called genotypes">
+##INFO=<ID=BaseQRankSum,Number=1,Type=Float,Description="Z-score from Wilcoxon rank sum test of Alt Vs. Ref base qualities">
+##INFO=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth; some reads may have been filtered">
+##INFO=<ID=ExcessHet,Number=1,Type=Float,Description="Phred-scaled p-value for exact test of excess heterozygosity">
+##INFO=<ID=FS,Number=1,Type=Float,Description="Phred-scaled p-value using Fisher's exact test to detect strand bias">
+##INFO=<ID=InbreedingCoeff,Number=1,Type=Float,Description="Inbreeding coefficient as estimated from the genotype likelihoods per-sample when compared against the Hardy-Weinberg expectation">
+##INFO=<ID=MLEAC,Number=A,Type=Integer,Description="Maximum likelihood expectation (MLE) for the allele counts (not necessarily the same as the AC), for each ALT allele, in the same order as listed">
+##INFO=<ID=MLEAF,Number=A,Type=Float,Description="Maximum likelihood expectation (MLE) for the allele frequency (not necessarily the same as the AF), for each ALT allele, in the same order as listed">
+##INFO=<ID=MQ,Number=1,Type=Float,Description="RMS Mapping Quality">
+##INFO=<ID=MQRankSum,Number=1,Type=Float,Description="Z-score From Wilcoxon rank sum test of Alt vs. Ref read mapping qualities">
+##INFO=<ID=QD,Number=1,Type=Float,Description="Variant Confidence/Quality by Depth">
+##INFO=<ID=ReadPosRankSum,Number=1,Type=Float,Description="Z-score from Wilcoxon rank sum test of Alt vs. Ref read position bias">
+##INFO=<ID=SOR,Number=1,Type=Float,Description="Symmetric Odds Ratio of 2x2 contingency table to detect strand bias">
+##contig=<ID=small_ref,length=1980>
+##source=HaplotypeCaller
+#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT	B0_H1
+small_ref	280	.	T	C	2001.64	.	AC=1;AF=0.500;AN=2;BaseQRankSum=0.000;DP=102;ExcessHet=0.0000;FS=3.235;MLEAC=1;MLEAF=0.500;MQ=60.00;MQRankSum=0.000;QD=20.22;ReadPosRankSum=-1.398;SOR=1.806	GT:AD:DP:GQ:PL	0/1:47,52:99:99:2009,0,1806
+small_ref	655	.	C	T	3301.64	.	AC=1;AF=0.500;AN=2;BaseQRankSum=0.000;DP=181;ExcessHet=0.0000;FS=1.851;MLEAC=1;MLEAF=0.500;MQ=60.00;MQRankSum=0.000;QD=18.65;ReadPosRankSum=1.852;SOR=0.643	GT:AD:DP:GQ:PL	0/1:91,86:177:99:3309,0,3533