Merge pull request #1136 from Delphine-L/precur

mvdbeek · web-flow · commit 1e712cc45570 · 2026-03-12T22:55:07.000+01:00
Update VGP precuration workflow
diff --git a/workflows/VGP-assembly-v2/hi-c-contact-map-for-assembly-manual-curation/CHANGELOG.md b/workflows/VGP-assembly-v2/hi-c-contact-map-for-assembly-manual-curation/CHANGELOG.md
@@ -1,7 +1,18 @@
 # Changelog
 
+
+
+## [2.2] - 2026-03-10
+
+### Changed
+- Gene track generation can now be skipped (useful for large genomes that run out of memory)
+- Added option to generate high resolution Hi-C Pretext maps
+- Replaced MarkDuplicates with Samtools markdup
+
+
 ## [2.1] - 2026-03-02
 
+
 ### Automatic update
 - `toolshed.g2.bx.psu.edu/repos/bgruening/text_processing/tp_replace_in_line/9.5+galaxy2` was updated to `toolshed.g2.bx.psu.edu/repos/bgruening/text_processing/tp_replace_in_line/9.5+galaxy3`
 - `toolshed.g2.bx.psu.edu/repos/bgruening/text_processing/tp_cat/9.5+galaxy2` was updated to `toolshed.g2.bx.psu.edu/repos/bgruening/text_processing/tp_cat/9.5+galaxy3`
diff --git a/workflows/VGP-assembly-v2/hi-c-contact-map-for-assembly-manual-curation/README.md b/workflows/VGP-assembly-v2/hi-c-contact-map-for-assembly-manual-curation/README.md
@@ -20,35 +20,40 @@ The Pretext files can be opened in PretextView for manual curation of genome ass
 
 ### Assembly Annotation Parameters
 
-6. **Species Name** [text] - Species identifier for gene annotation
-7. **Assembly Name** [text] - Assembly identifier (e.g., toLid)
-8. **Lineage for Compleasm** [text] - BUSCO lineage dataset (e.g., vertebrata_odb10, primates_odb10)
-9. **Database for Compleasm** [text] - Compleasm database version (default: v5)
+6. **Generate gene annotations** [boolean] - Enable/disable Compleasm gene annotation tracks (disable for large genomes to avoid memory issues)
+7. **Species Name** [text] - Species identifier for gene annotation
+8. **Assembly Name** [text] - Assembly identifier (e.g., toLid)
+9. **Lineage for Compleasm** [text] - BUSCO lineage dataset (e.g., vertebrata_odb10, primates_odb10)
+10. **Database for Compleasm** [text] - Compleasm database version (default: v5)
 
 ### Scaffold Naming Options
 
-10. **Do you want to add suffixes to the scaffold names?** [boolean] - Select yes if scaffold names do not contain haplotype information
-11. **First Haplotype suffix** [text] - Suffix for haplotype 1 scaffolds (default: H1)
-12. **Second Haplotype suffix** [text] - Suffix for haplotype 2 scaffolds (default: H2)
+11. **Do you want to add suffixes to the scaffold names?** [boolean] - Select yes if scaffold names do not contain haplotype information
+12. **First Haplotype suffix** [text] - Suffix for haplotype 1 scaffolds (default: H1)
+13. **Second Haplotype suffix** [text] - Suffix for haplotype 2 scaffolds (default: H2)
 
 ### Hi-C Processing Options
 
-13. **Do you want to trim the Hi-C data?** [boolean] - If yes, removes 5bp at the end of Hi-C reads (recommended for Arima Hi-C data if the map looks "noisy")
-14. **Remove duplicated Hi-C reads?** [boolean] - Remove PCR duplicates from Hi-C alignments using Picard MarkDuplicates
-15. **Minimum Mapping Quality** [integer] - Minimum MAPQ score for filtered PretextMap (default: 10; set to 0 to keep all mapped reads)
+14. **Do you want to trim the Hi-C data?** [boolean] - If yes, removes 5bp at the end of Hi-C reads (recommended for Arima Hi-C data if the map looks "noisy")
+15. **Remove duplicated Hi-C reads?** [boolean] - Remove PCR duplicates from Hi-C alignments using Samtools markdup
+16. **Minimum Mapping Quality** [integer] - Minimum MAPQ score for filtered PretextMap (default: 10; set to 0 to keep all mapped reads)
 
 ### PacBio Processing Options
 
-16. **Remove adapters from HiFi reads?** [boolean] - Trim adapters from PacBio HiFi reads using Cutadapt
+17. **Remove adapters from HiFi reads?** [boolean] - Trim adapters from PacBio HiFi reads using Cutadapt
 
 ### Telomere Detection
 
-17. **Canonical telomeric pattern** [text] - Expected telomere repeat sequence (default: TTAGGG for vertebrates; use CCCTAA for reverse complement)
-18. **Telomeric Patterns to explore (comma-separated), IUPAC allowed** [text] - Additional telomeric patterns to search for (e.g., TTAGGG,CCCTAA)
+18. **Canonical telomeric pattern** [text] - Expected telomere repeat sequence (default: TTAGGG for vertebrates; use CCCTAA for reverse complement)
+19. **Telomeric Patterns to explore (comma-separated), IUPAC allowed** [text] - Additional telomeric patterns to search for (e.g., TTAGGG,CCCTAA)
+
+### Hi-C Map Resolution
+
+20. **Generate high resolution Hi-C maps** [boolean] - Generate high resolution Pretext maps (slower, requires more resources)
 
 ### Visualization Options
 
-19. **Bin Size for Bigwig files** [integer] - Resolution for coverage tracks (default: 100; larger values = smaller files but lower resolution)
+21. **Bin Size for Bigwig files** [integer] - Resolution for coverage tracks (default: 100; larger values = smaller files but lower resolution)
 
 ## Outputs
 
@@ -60,48 +65,57 @@ The Pretext files can be opened in PretextView for manual curation of genome ass
 
 ### Gene Annotation Outputs
 
-4. **Compleasm hap1 - Busco** [GFF] - BUSCO-based gene predictions for haplotype 1
-5. **Compleasm hap1 - Miniprot** [GFF] - Miniprot-based gene predictions for haplotype 1
-6. **Compleasm hap2 - Busco** [GFF] - BUSCO-based gene predictions for haplotype 2
-7. **Compleasm hap2 - Miniprot** [GFF] - Miniprot-based gene predictions for haplotype 2
-8. **Merged Compleasm Gffs** [GFF] - Combined gene annotations from all Compleasm analyses
-
+4. **Compleasm Genes track** [GFF] - Combined gene annotations for Pretext track (if gene annotations enabled)
 
 ### Hi-C Alignment Outputs
 
-10. **Merged Hi-C Alignments** [BAM] - Combined Hi-C read alignments
-11. **Hi-C duplication stats: MultiQC** [HTML] - MultiQC report of duplicate statistics (if duplicate removal enabled)
-12. **Hi-C duplication stats: Raw** [tabular] - Raw Picard MarkDuplicates metrics (if duplicate removal enabled)
-13. **Markduplicates Summary** [text] - Summary of duplicate removal statistics (if enabled)
+5. **Merged Hi-C Alignments on Scaffolds** [BAM] - Combined Hi-C read alignments
+6. **Precuration Hi-C alignments** [BAM] - Hi-C alignments before filtering
+7. **Trimmed Hi-C data** [fastq] - Hi-C reads after trimming 5 bp from read ends (if trimming enabled)
+8. **Hi-C duplication stats on Scaffolds** [tabular] - Samtools markdup statistics (if duplicate removal enabled)
+9. **Hi-C duplication stats on Scaffolds: MultiQc** [HTML] - MultiQC report of duplicate statistics (if duplicate removal enabled)
+10. **Hi-C duplication stats on Scaffolds: Raw** [tabular] - Raw Samtools markdup metrics (if duplicate removal enabled)
+11. **Pairtools Multiqc Stats on Scaffolds** [tabular] - Pairtools statistics
+12. **Pairtools MultiQc on Scaffolds: Plots** [HTML] - Pairtools MultiQC plots
+
+### PacBio Processing Outputs
+
+13. **HiFi reads without adapters** [fastq] - Adapter-trimmed PacBio reads (if adapter removal enabled)
+14. **HiFi reads adapters trimming report** [tabular] - Cutadapt trimming statistics (if adapter removal enabled)
 
 ### PacBio Coverage Outputs
 
-14. **BigWig Coverage** [bigwig] - PacBio read coverage track
-15. **Coverage Gaps Track** [bedgraph] - Regions with low or no PacBio coverage
-16. **Merged HiFi Alignments** [BAM] - Combined PacBio alignments
+15. **BigWig Coverage** [bigwig] - PacBio read coverage track
+16. **Coverage Gaps Track** [bedgraph] - Regions with low or no PacBio coverage
+17. **Merged HiFi Alignments** [BAM] - Combined PacBio alignments
 
 ### Telomere Outputs
 
-17. **Telomere Report** [tabular] - Comprehensive telomere analysis from Teloscope
-18. **terminal telomeres** [bedgraph] - All detected telomeric regions
-19. **P telomeres bed** [BED] - P-arm (5') telomeres only
-20. **Q telomeres Bed** [BED] - Q-arm (3') telomeres only
+18. **Telomere Report** [tabular] - Comprehensive telomere analysis from Teloscope
+19. **terminal telomeres** [bedgraph] - All detected telomeric regions
+20. **P telomeres bed** [BED] - P-arm (5') telomeres only
+21. **Q telomeres Bed** [BED] - Q-arm (3') telomeres only
 
 ### Gap Outputs
 
-21. **Gaps Bed** [BED] - Assembly gap coordinates
-22. **Gaps Bedgraph** [bedgraph] - Assembly gap track for Pretext
+22. **Gaps Bed** [BED] - Assembly gap coordinates
+23. **Gaps Bedgraph** [bedgraph] - Assembly gap track for Pretext
+
+### Assembly Haplotype Outputs
+
+24. **Decontaminated Hap1 with Suffix** [fasta] - Haplotype 1 with suffix applied
+25. **Decontaminated Hap2 with Suffix** [fasta] - Haplotype 2 with suffix applied
 
 ### Pretext Map Outputs
 
 All Pretext outputs are generated in two versions:
 - **With MAPQ filtering** (default MAPQ ≥ 10): Cleaner maps with high-confidence contacts
 - **Without filtering (Multimapping)**: Shows all mapped contacts including low-quality alignments
 
-23. **Pretext All tracks** [pretext] - Contact map with all annotation tracks (MAPQ filtered)
-24. **Pretext All tracks - Multimapping** [pretext] - Contact map with all tracks (unfiltered)
-25. **Pretext Snapshot With tracks** [PNG] - Image of contact map with tracks (MAPQ filtered)
-26. **Pretext Snapshot With tracks - Multimapping** [PNG] - Image of contact map with tracks (unfiltered)
+26. **Pretext All tracks** [pretext] - Contact map with all annotation tracks (MAPQ filtered)
+27. **Pretext All tracks - Multimapping** [pretext] - Contact map with all tracks (unfiltered)
+28. **Pretext Snapshot With tracks** [PNG] - Image of contact map with tracks (MAPQ filtered)
+29. **Pretext Snapshot With tracks - Multimapping** [PNG] - Image of contact map with tracks (unfiltered)
 
 ## Usage Notes
 
diff --git a/workflows/VGP-assembly-v2/hi-c-contact-map-for-assembly-manual-curation/hi-c-map-for-assembly-manual-curation-tests.yml b/workflows/VGP-assembly-v2/hi-c-contact-map-for-assembly-manual-curation/hi-c-map-for-assembly-manual-curation-tests.yml
@@ -63,6 +63,8 @@
     Assembly Name: toLid
     Lineage for Compleasm: vertebrata_odb10
     Database for Compleasm: v5 
+    Generate gene annotations: false
+    Generate high resolution Hi-C maps: false
   outputs:
     Assembly for curation:
       asserts:
@@ -85,17 +87,12 @@
         has_size:
           value: 10000
           delta: 5000
-    Terminal Telomeres:
+    P telomeres bed:
       asserts:
         has_text:
-          text: "scaffold_10.H1"
+          text: "scaffold_10.H2"
         has_text:
           text: "scaffold_1.H1"
-    Merged Hi-C Alignments:
-      asserts:
-        has_size:
-          value: 7500000
-          delta: 3000000
     Pretext All tracks:
       asserts:
         has_size:
@@ -106,22 +103,18 @@
         has_size:
           value: 800000
           delta: 400000
-    Compleasm Genes track:
-      asserts:
-        has_text:
-          text: "scaffold_1.H2\t106669\t106765"
-    Markduplicates Summary:
-      asserts:
-        has_text:
-          text: "1042\t217\t3942"
-    "Hi-C duplication stats: Raw":
+    "Hi-C duplication stats on Scaffolds: Raw":
       asserts:
         has_text:
           text: "total_dups\t3941"
     P telomeres bed:
       asserts:
         has_text:
           text: "scaffold_1.H1\t488600\t500600\t12000"
+    Hi-C duplication stats on Scaffolds:
+      asserts:
+        has_text:
+          text: "EXCLUDED: 1042"
 - doc: Test 2 - Multiple read sets with collections
   job:
     Haplotype 1:
@@ -205,6 +198,8 @@
     Assembly Name: toLid
     Lineage for Compleasm: vertebrata_odb10
     Database for Compleasm: v5
+    Generate gene annotations: true
+    Generate high resolution Hi-C maps: false
   outputs:
     Assembly for curation:
       asserts:
@@ -227,13 +222,13 @@
         has_size:
           value: 20000
           delta: 10000
-    Terminal Telomeres:
+    P telomeres bed:
       asserts:
         has_text:
-          text: "scaffold_10.H1"
+          text: "scaffold_10.H2"
         has_text:
           text: "scaffold_1.H1"
-    Merged Hi-C Alignments:
+    Merged Hi-C Alignments on Scaffolds:
       asserts:
         has_size:
           value: 15000000
@@ -252,11 +247,15 @@
       asserts:
         has_text:
           text: "scaffold_1.H2\t106669\t106765"
-    "Hi-C duplication stats: Raw":
+    "Hi-C duplication stats on Scaffolds: Raw":
       asserts:
         has_text:
-          text: "total_dups\t1305"
+          text: "total_dups\t3941\t2623"
     P telomeres bed:
       asserts:
         has_text:
           text: "scaffold_1.H1\t488600\t500600\t12000"
+    Hi-C duplication stats on Scaffolds:
+      asserts:
+        has_text:
+          text: "EXCLUDED: 1725"
diff --git a/workflows/VGP-assembly-v2/hi-c-contact-map-for-assembly-manual-curation/hi-c-map-for-assembly-manual-curation.ga b/workflows/VGP-assembly-v2/hi-c-contact-map-for-assembly-manual-curation/hi-c-map-for-assembly-manual-curation.ga