Commit 1e712cc

Merge pull request #1136 from Delphine-L/precur
Update VGP precuration workflow
2 parents 68b5b51 + 4119391 commit 1e712cc

4 files changed: +2635 -2549 lines changed

workflows/VGP-assembly-v2/hi-c-contact-map-for-assembly-manual-curation/CHANGELOG.md

Lines changed: 11 additions & 0 deletions
@@ -1,7 +1,18 @@
 # Changelog
 
+
+
+## [2.2] - 2026-03-10
+
+### Changed
+- The gene tracks generation can now be skipped (Useful for large genomes running out of memory)
+- Added option to generate high resolution Hi-C Pretext maps
+- Replace MarkDuplicates with Samtools markdup
+
+
 ## [2.1] - 2026-03-02
 
+
 ### Automatic update
 - `toolshed.g2.bx.psu.edu/repos/bgruening/text_processing/tp_replace_in_line/9.5+galaxy2` was updated to `toolshed.g2.bx.psu.edu/repos/bgruening/text_processing/tp_replace_in_line/9.5+galaxy3`
 - `toolshed.g2.bx.psu.edu/repos/bgruening/text_processing/tp_cat/9.5+galaxy2` was updated to `toolshed.g2.bx.psu.edu/repos/bgruening/text_processing/tp_cat/9.5+galaxy3`

workflows/VGP-assembly-v2/hi-c-contact-map-for-assembly-manual-curation/README.md

Lines changed: 51 additions & 37 deletions
@@ -20,35 +20,40 @@ The Pretext files can be opened in PretextView for manual curation of genome ass
 
 ### Assembly Annotation Parameters
 
-6. **Species Name** [text] - Species identifier for gene annotation
-7. **Assembly Name** [text] - Assembly identifier (e.g., toLid)
-8. **Lineage for Compleasm** [text] - BUSCO lineage dataset (e.g., vertebrata_odb10, primates_odb10)
-9. **Database for Compleasm** [text] - Compleasm database version (default: v5)
+6. **Generate gene annotations** [boolean] - Enable/disable Compleasm gene annotation tracks (disable for large genomes to avoid memory issues)
+7. **Species Name** [text] - Species identifier for gene annotation
+8. **Assembly Name** [text] - Assembly identifier (e.g., toLid)
+9. **Lineage for Compleasm** [text] - BUSCO lineage dataset (e.g., vertebrata_odb10, primates_odb10)
+10. **Database for Compleasm** [text] - Compleasm database version (default: v5)
 
 ### Scaffold Naming Options
 
-10. **Do you want to add suffixes to the scaffold names?** [boolean] - Select yes if scaffold names do not contain haplotype information
-11. **First Haplotype suffix** [text] - Suffix for haplotype 1 scaffolds (default: H1)
-12. **Second Haplotype suffix** [text] - Suffix for haplotype 2 scaffolds (default: H2)
+11. **Do you want to add suffixes to the scaffold names?** [boolean] - Select yes if scaffold names do not contain haplotype information
+12. **First Haplotype suffix** [text] - Suffix for haplotype 1 scaffolds (default: H1)
+13. **Second Haplotype suffix** [text] - Suffix for haplotype 2 scaffolds (default: H2)
 
 ### Hi-C Processing Options
 
-13. **Do you want to trim the Hi-C data?** [boolean] - If yes, removes 5bp at the end of Hi-C reads (recommended for Arima Hi-C data if the map looks "noisy")
-14. **Remove duplicated Hi-C reads?** [boolean] - Remove PCR duplicates from Hi-C alignments using Picard MarkDuplicates
-15. **Minimum Mapping Quality** [integer] - Minimum MAPQ score for filtered PretextMap (default: 10; set to 0 to keep all mapped reads)
+14. **Do you want to trim the Hi-C data?** [boolean] - If yes, removes 5bp at the end of Hi-C reads (recommended for Arima Hi-C data if the map looks "noisy")
+15. **Remove duplicated Hi-C reads?** [boolean] - Remove PCR duplicates from Hi-C alignments using Samtools markdup
+16. **Minimum Mapping Quality** [integer] - Minimum MAPQ score for filtered PretextMap (default: 10; set to 0 to keep all mapped reads)
 
 ### PacBio Processing Options
 
-16. **Remove adapters from HiFi reads?** [boolean] - Trim adapters from PacBio HiFi reads using Cutadapt
+17. **Remove adapters from HiFi reads?** [boolean] - Trim adapters from PacBio HiFi reads using Cutadapt
 
 ### Telomere Detection
 
-17. **Canonical telomeric pattern** [text] - Expected telomere repeat sequence (default: TTAGGG for vertebrates; use CCCTAA for reverse complement)
-18. **Telomeric Patterns to explore (comma-separated), IUPAC allowed** [text] - Additional telomeric patterns to search for (e.g., TTAGGG,CCCTAA)
+18. **Canonical telomeric pattern** [text] - Expected telomere repeat sequence (default: TTAGGG for vertebrates; use CCCTAA for reverse complement)
+19. **Telomeric Patterns to explore (comma-separated), IUPAC allowed** [text] - Additional telomeric patterns to search for (e.g., TTAGGG,CCCTAA)
+
+### Hi-C Map Resolution
+
+20. **Generate high resolution Hi-C maps** [boolean] - Generate high resolution Pretext maps (slower, requires more resources)
 
 ### Visualization Options
 
-19. **Bin Size for Bigwig files** [integer] - Resolution for coverage tracks (default: 100; larger values = smaller files but lower resolution)
+21. **Bin Size for Bigwig files** [integer] - Resolution for coverage tracks (default: 100; larger values = smaller files but lower resolution)
 
 ## Outputs
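
The telomere inputs in the hunk above pair a canonical repeat (TTAGGG) with its reverse complement (CCCTAA) and allow IUPAC codes in the patterns to explore. As a rough illustration of how those relate, here is a minimal Python sketch. It is not how Teloscope is implemented, and the helper names (`reverse_complement`, `iupac_to_regex`, `count_matches`) are hypothetical.

```python
import re

# IUPAC nucleotide codes mapped to regex character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an unambiguous DNA sequence."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def iupac_to_regex(pattern: str) -> re.Pattern:
    """Compile one IUPAC pattern into a regular expression (hypothetical helper)."""
    return re.compile("".join(IUPAC[base] for base in pattern.upper()))


def count_matches(patterns: str, sequence: str) -> dict:
    """Count non-overlapping hits for each comma-separated pattern."""
    return {
        p: len(iupac_to_regex(p).findall(sequence.upper()))
        for p in patterns.split(",")
    }
```

With these helpers, `reverse_complement("TTAGGG")` gives `"CCCTAA"`, which is why the README suggests CCCTAA when the repeat is read from the opposite strand.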

@@ -60,48 +65,57 @@ The Pretext files can be opened in PretextView for manual curation of genome ass
 
 ### Gene Annotation Outputs
 
-4. **Compleasm hap1 - Busco** [GFF] - BUSCO-based gene predictions for haplotype 1
-5. **Compleasm hap1 - Miniprot** [GFF] - Miniprot-based gene predictions for haplotype 1
-6. **Compleasm hap2 - Busco** [GFF] - BUSCO-based gene predictions for haplotype 2
-7. **Compleasm hap2 - Miniprot** [GFF] - Miniprot-based gene predictions for haplotype 2
-8. **Merged Compleasm Gffs** [GFF] - Combined gene annotations from all Compleasm analyses
-
+4. **Compleasm Genes track** [GFF] - Combined gene annotations for Pretext track (if gene annotations enabled)
 
 ### Hi-C Alignment Outputs
 
-10. **Merged Hi-C Alignments** [BAM] - Combined Hi-C read alignments
-11. **Hi-C duplication stats: MultiQC** [HTML] - MultiQC report of duplicate statistics (if duplicate removal enabled)
-12. **Hi-C duplication stats: Raw** [tabular] - Raw Picard MarkDuplicates metrics (if duplicate removal enabled)
-13. **Markduplicates Summary** [text] - Summary of duplicate removal statistics (if enabled)
+5. **Merged Hi-C Alignments on Scaffolds** [BAM] - Combined Hi-C read alignments
+6. **Precuration Hi-C alignments** [BAM] - Hi-C alignments before filtering
+7. **Trimmed Hi-C data** [fastq] - Hi-C reads after adapter trimming (if trimming enabled)
+8. **Hi-C duplication stats on Scaffolds** [tabular] - Samtools markdup statistics (if duplicate removal enabled)
+9. **Hi-C duplication stats on Scaffolds: MultiQc** [HTML] - MultiQC report of duplicate statistics (if duplicate removal enabled)
+10. **Hi-C duplication stats on Scaffolds: Raw** [tabular] - Raw Samtools markdup metrics (if duplicate removal enabled)
+11. **Pairtools Multiqc Stats on Scaffolds** [tabular] - Pairtools statistics
+12. **Pairtools MultiQc on Scaffolds: Plots** [HTML] - Pairtools MultiQC plots
+
+### PacBio Processing Outputs
+
+13. **HiFi reads without adapters** [fastq] - Adapter-trimmed PacBio reads (if adapter removal enabled)
+14. **HiFi reads adapters trimming report** [tabular] - Cutadapt trimming statistics (if adapter removal enabled)
 
 ### PacBio Coverage Outputs
 
-14. **BigWig Coverage** [bigwig] - PacBio read coverage track
-15. **Coverage Gaps Track** [bedgraph] - Regions with low or no PacBio coverage
-16. **Merged HiFi Alignments** [BAM] - Combined PacBio alignments
+15. **BigWig Coverage** [bigwig] - PacBio read coverage track
+16. **Coverage Gaps Track** [bedgraph] - Regions with low or no PacBio coverage
+17. **Merged HiFi Alignments** [BAM] - Combined PacBio alignments
 
 ### Telomere Outputs
 
-17. **Telomere Report** [tabular] - Comprehensive telomere analysis from Teloscope
-18. **terminal telomeres** [bedgraph] - All detected telomeric regions
-19. **P telomeres bed** [BED] - P-arm (5') telomeres only
-20. **Q telomeres Bed** [BED] - Q-arm (3') telomeres only
+18. **Telomere Report** [tabular] - Comprehensive telomere analysis from Teloscope
+19. **terminal telomeres** [bedgraph] - All detected telomeric regions
+20. **P telomeres bed** [BED] - P-arm (5') telomeres only
+21. **Q telomeres Bed** [BED] - Q-arm (3') telomeres only
 
 ### Gap Outputs
 
-21. **Gaps Bed** [BED] - Assembly gap coordinates
-22. **Gaps Bedgraph** [bedgraph] - Assembly gap track for Pretext
+22. **Gaps Bed** [BED] - Assembly gap coordinates
+23. **Gaps Bedgraph** [bedgraph] - Assembly gap track for Pretext
+
+### Assembly Haplotype Outputs
+
+24. **Decontaminated Hap1 with Suffix** [fasta] - Haplotype 1 with suffix applied
+25. **Decontaminated Hap2 with Suffix** [fasta] - Haplotype 2 with suffix applied
 
 ### Pretext Map Outputs
 
 All Pretext outputs are generated in two versions:
 - **With MAPQ filtering** (default MAPQ ≥ 10): Cleaner maps with high-confidence contacts
 - **Without filtering (Multimapping)**: Shows all mapped contacts including low-quality alignments
 
-23. **Pretext All tracks** [pretext] - Contact map with all annotation tracks (MAPQ filtered)
-24. **Pretext All tracks - Multimapping** [pretext] - Contact map with all tracks (unfiltered)
-25. **Pretext Snapshot With tracks** [PNG] - Image of contact map with tracks (MAPQ filtered)
-26. **Pretext Snapshot With tracks - Multimapping** [PNG] - Image of contact map with tracks (unfiltered)
+26. **Pretext All tracks** [pretext] - Contact map with all annotation tracks (MAPQ filtered)
+27. **Pretext All tracks - Multimapping** [pretext] - Contact map with all tracks (unfiltered)
+28. **Pretext Snapshot With tracks** [PNG] - Image of contact map with tracks (MAPQ filtered)
+29. **Pretext Snapshot With tracks - Multimapping** [PNG] - Image of contact map with tracks (unfiltered)
 
 ## Usage Notes
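
The README hunk above describes two Pretext map variants: one restricted to alignments with MAPQ at or above a threshold (default 10) and one that keeps every mapped read. The sketch below shows that kind of filter on SAM-formatted text lines. It only illustrates the idea, not the workflow's actual filtering step; `filter_by_mapq` is a hypothetical helper.

```python
def filter_by_mapq(sam_lines, min_mapq=10):
    """Yield SAM header lines plus mapped alignments with MAPQ >= min_mapq.

    Illustrative only: the workflow does its filtering inside Galaxy tools,
    not with this function.
    """
    for line in sam_lines:
        if line.startswith("@"):
            # Header lines are passed through unchanged.
            yield line
            continue
        fields = line.rstrip("\n").split("\t")
        flag = int(fields[1])   # SAM column 2: bitwise FLAG
        mapq = int(fields[4])   # SAM column 5: mapping quality
        is_unmapped = bool(flag & 0x4)
        if not is_unmapped and mapq >= min_mapq:
            yield line
```

Calling it with `min_mapq=0` keeps every mapped read, which mirrors the README's note that setting the threshold to 0 keeps all mapped reads.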

workflows/VGP-assembly-v2/hi-c-contact-map-for-assembly-manual-curation/hi-c-map-for-assembly-manual-curation-tests.yml

Lines changed: 20 additions & 21 deletions
@@ -63,6 +63,8 @@
     Assembly Name: toLid
     Lineage for Compleasm: vertebrata_odb10
     Database for Compleasm: v5
+    Generate gene annotations: false
+    Generate high resolution Hi-C maps: false
   outputs:
     Assembly for curation:
       asserts:
@@ -85,17 +87,12 @@
         has_size:
           value: 10000
           delta: 5000
-    Terminal Telomeres:
+    P telomeres bed:
       asserts:
         has_text:
-          text: "scaffold_10.H1"
+          text: "scaffold_10.H2"
         has_text:
           text: "scaffold_1.H1"
-    Merged Hi-C Alignments:
-      asserts:
-        has_size:
-          value: 7500000
-          delta: 3000000
     Pretext All tracks:
       asserts:
         has_size:
@@ -106,22 +103,18 @@
         has_size:
           value: 800000
           delta: 400000
-    Compleasm Genes track:
-      asserts:
-        has_text:
-          text: "scaffold_1.H2\t106669\t106765"
-    Markduplicates Summary:
-      asserts:
-        has_text:
-          text: "1042\t217\t3942"
-    "Hi-C duplication stats: Raw":
+    "Hi-C duplication stats on Scaffolds: Raw":
       asserts:
         has_text:
           text: "total_dups\t3941"
     P telomeres bed:
       asserts:
         has_text:
           text: "scaffold_1.H1\t488600\t500600\t12000"
+    Hi-C duplication stats on Scaffolds:
+      asserts:
+        has_text:
+          text: "EXCLUDED: 1042"
 - doc: Test 2 - Multiple read sets with collections
   job:
     Haplotype 1:
@@ -205,6 +198,8 @@
     Assembly Name: toLid
     Lineage for Compleasm: vertebrata_odb10
     Database for Compleasm: v5
+    Generate gene annotations: true
+    Generate high resolution Hi-C maps: false
   outputs:
     Assembly for curation:
       asserts:
@@ -227,13 +222,13 @@
         has_size:
           value: 20000
           delta: 10000
-    Terminal Telomeres:
+    P telomeres bed:
       asserts:
         has_text:
-          text: "scaffold_10.H1"
+          text: "scaffold_10.H2"
         has_text:
           text: "scaffold_1.H1"
-    Merged Hi-C Alignments:
+    Merged Hi-C Alignments on Scaffolds:
       asserts:
         has_size:
           value: 15000000
@@ -252,11 +247,15 @@
       asserts:
         has_text:
           text: "scaffold_1.H2\t106669\t106765"
-    "Hi-C duplication stats: Raw":
+    "Hi-C duplication stats on Scaffolds: Raw":
       asserts:
         has_text:
-          text: "total_dups\t1305"
+          text: "total_dups\t3941\t2623"
     P telomeres bed:
       asserts:
         has_text:
           text: "scaffold_1.H1\t488600\t500600\t12000"
+    Hi-C duplication stats on Scaffolds:
+      asserts:
+        has_text:
+          text: "EXCLUDED: 1725"
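
The `has_size` assertions in these tests pair a `value` with a `delta`, and the `has_text` assertions look for a substring in the output. A minimal Python sketch of that behaviour follows, assuming `delta` is the maximum allowed absolute difference in bytes, as the Galaxy test-assertion documentation describes it. The helper names are hypothetical and this is not Galaxy's implementation.

```python
def size_within_delta(actual_size: int, value: int, delta: int = 0) -> bool:
    """Return True if actual_size is within +/- delta bytes of value.

    Assumption: delta is an absolute tolerance.
    """
    return abs(actual_size - value) <= delta


def contains_text(content: str, text: str) -> bool:
    """Return True if text occurs anywhere in content (plain substring check)."""
    return text in content
```

Under this reading, `value: 800000` with `delta: 400000` accepts any output between 400000 and 1200000 bytes, and `text: "total_dups\t3941"` matches any line containing `total_dups`, a tab, and a value beginning with `3941`.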
