Commit 3fdc3f0

Merge pull request #1107 from Delphine-L/precur
VGP - New version of precuration workflow
2 parents 534f147 + 0dcbed4 commit 3fdc3f0

4 files changed

+5537
-1412
lines changed

workflows/VGP-assembly-v2/hi-c-contact-map-for-assembly-manual-curation/CHANGELOG.md

Lines changed: 28 additions & 0 deletions
Original file line number | Diff line number | Diff line change
@@ -1,5 +1,33 @@
11
# Changelog
22

3+
## [2.0] - 2026-02-13
4+
5+
### Changed
6+
7+
- Replaced seqtk_telo with Teloscope for telomere detection
8+
- Telomere output renamed from "Seqtk-telo Output" to "Telomere Report"
9+
10+
### Added
11+
12+
- Gene annotation tracks with Compleasm
13+
- Optional removal of PCR duplicates from Hi-C reads (Picard MarkDuplicates)
14+
- Optional adapter trimming for PacBio HiFi reads (Cutadapt)
15+
- Separate P-arm and Q-arm telomere BED outputs
16+
- Custom telomeric pattern exploration with IUPAC support
17+
- Hi-C duplication statistics (MultiQC and raw formats)
18+
19+
20+
### Automatic update
21+
22+
- `toolshed.g2.bx.psu.edu/repos/bgruening/gfastats/gfastats/1.3.11+galaxy0` was updated to `toolshed.g2.bx.psu.edu/repos/bgruening/gfastats/gfastats/1.3.11+galaxy1`
23+
- `toolshed.g2.bx.psu.edu/repos/iuc/bwa_mem2/bwa_mem2/2.2.1+galaxy4` was updated to `toolshed.g2.bx.psu.edu/repos/iuc/bwa_mem2/bwa_mem2/2.3+galaxy0`
24+
- `toolshed.g2.bx.psu.edu/repos/iuc/minimap2/minimap2/2.28+galaxy1` was updated to `toolshed.g2.bx.psu.edu/repos/iuc/minimap2/minimap2/2.28+galaxy2`
25+
- `toolshed.g2.bx.psu.edu/repos/iuc/pretext_graph/pretext_graph/0.0.7+galaxy0` was updated to `toolshed.g2.bx.psu.edu/repos/iuc/pretext_graph/pretext_graph/0.0.9+galaxy0`
26+
- `toolshed.g2.bx.psu.edu/repos/iuc/pretext_snapshot/pretext_snapshot/0.0.4+galaxy0` was updated to `toolshed.g2.bx.psu.edu/repos/iuc/pretext_snapshot/pretext_snapshot/0.0.5+galaxy1`
27+
- `toolshed.g2.bx.psu.edu/repos/iuc/samtools_merge/samtools_merge/1.20+galaxy2` was updated to `toolshed.g2.bx.psu.edu/repos/iuc/samtools_merge/samtools_merge/1.22+galaxy1`
28+
- `toolshed.g2.bx.psu.edu/repos/lparsons/cutadapt/cutadapt/5.1+galaxy0` was updated to `toolshed.g2.bx.psu.edu/repos/lparsons/cutadapt/cutadapt/5.2+galaxy0`
29+
30+
331
## [1.0beta6] - 2025-06-09
432

533
### Changes
Lines changed: 124 additions & 23 deletions
Original file line number | Diff line number | Diff line change
@@ -1,32 +1,133 @@
1-
# Hi-C Contact map generation for manual curation of genome assemblies
1+
# Hi-C Contact Map Generation for Manual Curation of Genome Assemblies
22

3-
This workflow generates Hi-C contact maps for diploid genome assemblies in the Pretext format. It includes tracks for PacBio read coverage, Gaps, and telomeres. The Pretext files can be open in PretextView for the manual curation of genome assemblies.
3+
This workflow generates Hi-C contact maps for diploid genome assemblies in the Pretext format. It includes tracks for:
4+
- **Gene annotations** (Compleasm)
5+
- **PacBio read coverage** and coverage gaps
6+
- **Telomeres** (with flexible pattern detection)
7+
- **Assembly gaps**
48

9+
The Pretext files can be opened in PretextView for manual curation of genome assemblies.
510

611
## Inputs
712

8-
1. **Haplotype 1** [fasta]
9-
2. **Will you use a second haplotype?**
10-
3. **Haplotype 2** [fasta]
11-
4. **Do you want to add suffixes to the scaffold names?** Select yes if the scaffold names in your assembly do not contain haplotype information.
12-
5. **Haplotype 1 suffix** This suffix will be added to haplotype 1 scaffold names if you selected to add suffixes to the scaffold names.
13-
6. **Haplotype 2 suffix** This suffix will be added to haplotype 2 scaffold names if you selected to add suffixes to the scaffold names.
14-
7. **Hi-C reads** [fastq] Paired Collection containing the Hi-D data
15-
8. **Do you want to trim the Hi-C data?** If *yes*, remove 5bp at the end of Hi-C reads. Use with Arima Hi-C data if the Hi-C map looks "noisy".
16-
9. **Minimum Mapping Score** Minimum mapping score to keep for Hi-C alignments in the filtered PretextMap. Set to 0 to keep all mapped reads. Default: 20 .
17-
10. **Telomere repeat to suit species** Expected value of the repeated sequences in the telomeres. Default value [CCCTAA] is suited to vertebrates.
18-
11. **PacBio reads** [fastq] Collection of PacBio reads.
13+
### Required Inputs
1914

15+
1. **Haplotype 1** [fasta] - Primary haplotype assembly
16+
2. **Will you use a second haplotype?** [boolean] - Set to true for diploid assemblies
17+
3. **Haplotype 2** [fasta] - Secondary haplotype assembly (required if using two haplotypes)
18+
4. **Hi-C reads** [fastq] - Paired collection containing Hi-C data
19+
5. **PacBio reads** [fastq] - Collection of PacBio HiFi reads
20+
21+
### Assembly Annotation Parameters
22+
23+
6. **Species Name** [text] - Species identifier for gene annotation
24+
7. **Assembly Name** [text] - Assembly identifier (e.g., toLid)
25+
8. **Lineage for Compleasm** [text] - BUSCO lineage dataset (e.g., vertebrata_odb10, primates_odb10)
26+
9. **Database for Compleasm** [text] - Compleasm database version (default: v5)
27+
28+
### Scaffold Naming Options
29+
30+
10. **Do you want to add suffixes to the scaffold names?** [boolean] - Select yes if scaffold names do not contain haplotype information
31+
11. **First Haplotype suffix** [text] - Suffix for haplotype 1 scaffolds (default: H1)
32+
12. **Second Haplotype suffix** [text] - Suffix for haplotype 2 scaffolds (default: H2)
33+
34+
### Hi-C Processing Options
35+
36+
13. **Do you want to trim the Hi-C data?** [boolean] - If yes, removes 5bp at the end of Hi-C reads (recommended for Arima Hi-C data if the map looks "noisy")
37+
14. **Remove duplicated Hi-C reads?** [boolean] - Remove PCR duplicates from Hi-C alignments using Picard MarkDuplicates
38+
15. **Minimum Mapping Quality** [integer] - Minimum MAPQ score for filtered PretextMap (default: 10; set to 0 to keep all mapped reads)
39+
40+
### PacBio Processing Options
41+
42+
16. **Remove adapters from HiFi reads?** [boolean] - Trim adapters from PacBio HiFi reads using Cutadapt
43+
44+
### Telomere Detection
45+
46+
17. **Canonical telomeric pattern** [text] - Expected telomere repeat sequence (default: TTAGGG for vertebrates; use CCCTAA for reverse complement)
47+
18. **Telomeric Patterns to explore (comma-separated), IUPAC allowed** [text] - Additional telomeric patterns to search for (e.g., TTAGGG,CCCTAA)
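The relationship between the two strand orientations of the canonical repeat (TTAGGG and CCCTAA) can be checked with a few lines of Python. This is only an illustrative sketch: the helper name `reverse_complement` is hypothetical, it handles plain A/C/G/T only, and it is not part of the workflow or of Teloscope.

```python
# Complement table for plain DNA bases (IUPAC ambiguity codes are not handled here).
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(pattern: str) -> str:
    """Return the reverse complement of a DNA repeat unit (hypothetical helper)."""
    # Complement every base, then reverse the string.
    return pattern.translate(_COMPLEMENT)[::-1]


print(reverse_complement("TTAGGG"))  # CCCTAA
```

Listing both orientations in the patterns-to-explore field, as in the example value above, covers repeats reported on either strand.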
48+
49+
### Visualization Options
50+
51+
19. **Bin Size for Bigwig files** [integer] - Resolution for coverage tracks (default: 100; larger values = smaller files but lower resolution)
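To make the bin-size trade-off concrete, the sketch below averages a per-base depth vector into fixed-width bins. It assumes mean aggregation and an in-memory list of depths; the function name `bin_coverage` is hypothetical, and the tool that actually writes the bigwig file in the workflow may aggregate differently.

```python
def bin_coverage(per_base_coverage, bin_size=100):
    """Average per-base depth into fixed-width bins (illustrative only).

    Returns a list of (start, end, mean_depth) tuples with 0-based,
    half-open coordinates. The last bin may be shorter than bin_size.
    """
    if bin_size <= 0:
        raise ValueError("bin_size must be a positive integer")
    bins = []
    for start in range(0, len(per_base_coverage), bin_size):
        window = per_base_coverage[start:start + bin_size]
        bins.append((start, start + len(window), sum(window) / len(window)))
    return bins


# 200 bases: depth 10 for the first 150, depth 30 for the last 50.
depths = [10] * 150 + [30] * 50
print(bin_coverage(depths, bin_size=100))
```

With a bin size of 100 the step from depth 10 to depth 30 is smoothed into the second bin's mean; a smaller bin size would resolve it more sharply at the cost of more records.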
2052

2153
## Outputs
2254

23-
1. Concatenated Assembly [fasta] If two haplotypes are used.
24-
2. Trimmed Hi-C data (If trimming option is selected) [fastq]
25-
3. Mapped Hi-C reads [bam]
26-
4. Telomeres track [bedgraph]
27-
5. Gap track [bedgraph]
28-
6. Coverage track [bigwig]
29-
7. Gaps in coverage track [bedgraph]
30-
7. Pretext Map without tracks [pretext], filtered and unfiltered.
31-
8. Pretext Map with tracks [pretext], filtered and unfiltered.
32-
9. Pretext Snapshot image of the Hi-C contact map [png], filtered and unfiltered.
55+
### Assembly Outputs
56+
57+
1. **Assembly for curation** [fasta] - Merged assembly (if two haplotypes used) or single haplotype with optional suffix
58+
2. **Assembly Info** [tabular] - Summary statistics from gfastats
59+
3. **Both Haplotypes merged** [fasta] - Concatenated assembly file
60+
61+
### Gene Annotation Outputs
62+
63+
4. **Compleasm hap1 - Busco** [GFF] - BUSCO-based gene predictions for haplotype 1
64+
5. **Compleasm hap1 - Miniprot** [GFF] - Miniprot-based gene predictions for haplotype 1
65+
6. **Compleasm hap2 - Busco** [GFF] - BUSCO-based gene predictions for haplotype 2
66+
7. **Compleasm hap2 - Miniprot** [GFF] - Miniprot-based gene predictions for haplotype 2
67+
8. **Merged Compleasm Gffs** [GFF] - Combined gene annotations from all Compleasm analyses
68+
69+
70+
### Hi-C Alignment Outputs
71+
72+
10. **Merged Hi-C Alignments** [BAM] - Combined Hi-C read alignments
73+
11. **Hi-C duplication stats: MultiQC** [HTML] - MultiQC report of duplicate statistics (if duplicate removal enabled)
74+
12. **Hi-C duplication stats: Raw** [tabular] - Raw Picard MarkDuplicates metrics (if duplicate removal enabled)
75+
13. **Markduplicates Summary** [text] - Summary of duplicate removal statistics (if enabled)
76+
77+
### PacBio Coverage Outputs
78+
79+
14. **BigWig Coverage** [bigwig] - PacBio read coverage track
80+
15. **Coverage Gaps Track** [bedgraph] - Regions with low or no PacBio coverage
81+
16. **Merged HiFi Alignments** [BAM] - Combined PacBio alignments
82+
83+
### Telomere Outputs
84+
85+
17. **Telomere Report** [tabular] - Comprehensive telomere analysis from Teloscope
86+
18. **terminal telomeres** [bedgraph] - All detected telomeric regions
87+
19. **P telomeres bed** [BED] - P-arm (5') telomeres only
88+
20. **Q telomeres Bed** [BED] - Q-arm (3') telomeres only
89+
90+
### Gap Outputs
91+
92+
21. **Gaps Bed** [BED] - Assembly gap coordinates
93+
22. **Gaps Bedgraph** [bedgraph] - Assembly gap track for Pretext
94+
95+
### Pretext Map Outputs
96+
97+
All Pretext outputs are generated in two versions:
98+
- **With MAPQ filtering** (default MAPQ ≥ 10): Cleaner maps with high-confidence contacts
99+
- **Without filtering (Multimapping)**: Shows all mapped contacts including low-quality alignments
100+
101+
23. **Pretext All tracks** [pretext] - Contact map with all annotation tracks (MAPQ filtered)
102+
24. **Pretext All tracks - Multimapping** [pretext] - Contact map with all tracks (unfiltered)
103+
25. **Pretext Snapshot With tracks** [PNG] - Image of contact map with tracks (MAPQ filtered)
104+
26. **Pretext Snapshot With tracks - Multimapping** [PNG] - Image of contact map with tracks (unfiltered)
105+
106+
## Usage Notes
107+
108+
### When to trim Hi-C data
109+
Enable trimming if you're using Arima Hi-C kits and notice a "noisy" contact map pattern.
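The input description states that the trimming option removes 5 bp at the end of each Hi-C read. A minimal sketch of that operation on a single FASTQ record is shown here; the function name `trim_read_end` is hypothetical and the workflow performs this step with its own tool, not with this code.

```python
def trim_read_end(sequence: str, quality: str, n_bases: int = 5):
    """Drop the last n_bases from a read and its quality string (illustrative)."""
    if len(sequence) != len(quality):
        raise ValueError("sequence and quality must have the same length")
    if n_bases <= 0:
        # Nothing to trim; avoid the sequence[:-0] pitfall, which yields "".
        return sequence, quality
    return sequence[:-n_bases], quality[:-n_bases]


seq, qual = trim_read_end("ACGTACGTAC", "IIIIIIIIII")
print(seq, qual)  # ACGTA IIIII
```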
110+
111+
### MAPQ filtering
112+
The default MAPQ threshold of 10 removes ambiguously mapped Hi-C contacts, resulting in cleaner contact maps. Compare both filtered and unfiltered outputs to assess mapping quality.
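As an illustration of what a MAPQ threshold does, the sketch below filters SAM-formatted text lines on the fifth column, which holds the mapping quality. It is an assumption-laden example, not the workflow's implementation: the function name `filter_by_mapq` is hypothetical, and the workflow applies the filter with its own tools.

```python
def filter_by_mapq(sam_lines, min_mapq=10):
    """Yield header lines and alignments whose MAPQ is >= min_mapq (illustrative)."""
    for line in sam_lines:
        if line.startswith("@"):
            # Header lines carry no MAPQ and are always kept.
            yield line
            continue
        fields = line.rstrip("\n").split("\t")
        # Column 5 of a SAM record (index 4) is the mapping quality.
        if int(fields[4]) >= min_mapq:
            yield line


records = [
    "@HD\tVN:1.6",
    "read1\t0\tscaffold_1\t100\t60\t50M\t*\t0\t0\t*\t*",
    "read2\t0\tscaffold_1\t200\t3\t50M\t*\t0\t0\t*\t*",
]
kept = list(filter_by_mapq(records, min_mapq=10))
print(len(kept))  # 2 (the header and read1)
```

Setting `min_mapq=0` keeps every record, which matches the documented behaviour of setting the threshold to 0.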
113+
114+
### Choosing Compleasm lineage
115+
Select the most specific lineage available for your species:
116+
- Vertebrates: `vertebrata_odb10`
117+
- Mammals: `mammalia_odb10`
118+
- Primates: `primates_odb10`
119+
- Birds: `aves_odb10`
120+
- See [BUSCO datasets](https://busco-data.ezlab.org/v5/data/lineages/) for complete list
121+
122+
### Telomere patterns
123+
- Vertebrates typically use TTAGGG
124+
- Some species have variant patterns - check the literature
125+
- IUPAC codes are supported (e.g., TTAGGK for TTAGGG/TTAGGT)
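To see which concrete repeats an IUPAC pattern covers, it can be expanded into a regular expression using the standard IUPAC nucleotide codes (for example, K stands for G or T and S stands for G or C). The sketch below is illustrative only: the names `IUPAC`, `iupac_to_regex`, and `count_repeats` are hypothetical, and this is not how Teloscope matches patterns internally.

```python
import re

# Standard IUPAC nucleotide codes mapped to regex character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]",
    "B": "[CGT]", "D": "[AGT]", "H": "[ACT]", "V": "[ACG]",
    "N": "[ACGT]",
}


def iupac_to_regex(pattern: str) -> re.Pattern:
    """Compile an IUPAC pattern into a regular expression (hypothetical helper)."""
    return re.compile("".join(IUPAC[base] for base in pattern.upper()))


def count_repeats(pattern: str, sequence: str) -> int:
    """Count non-overlapping matches of an IUPAC pattern in a sequence."""
    return len(iupac_to_regex(pattern).findall(sequence.upper()))


print(iupac_to_regex("TTAGGK").pattern)  # TTAGG[GT]
print(count_repeats("TTAGGK", "TTAGGGTTAGGTTTAGGC"))
```

Expanding a candidate pattern this way before running the workflow is a quick check that the ambiguity code matches the variants reported in the literature for the species.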
126+
127+
## Citation
128+
129+
If you use this workflow, please cite:
130+
- PretextMap and PretextView tools
131+
- Compleasm for gene completeness assessment
132+
- Teloscope for telomere detection
133+
- Other tools as listed in the workflow
