|
1 | | -# Hi-C Contact map generation for manual curation of genome assemblies |
| 1 | +# Hi-C Contact Map Generation for Manual Curation of Genome Assemblies |
2 | 2 |
|
3 | | -This workflow generates Hi-C contact maps for diploid genome assemblies in the Pretext format. It includes tracks for PacBio read coverage, Gaps, and telomeres. The Pretext files can be open in PretextView for the manual curation of genome assemblies. |
| 3 | +This workflow generates Hi-C contact maps for diploid genome assemblies in the Pretext format. It includes tracks for: |
| 4 | +- **Gene annotations** (Compleasm) |
| 5 | +- **PacBio read coverage** and coverage gaps |
| 6 | +- **Telomeres** (with flexible pattern detection) |
| 7 | +- **Assembly gaps** |
4 | 8 |
|
| 9 | +The Pretext files can be opened in PretextView for manual curation of genome assemblies. |
5 | 10 |
|
6 | 11 | ## Inputs |
7 | 12 |
|
8 | | -1. **Haplotype 1** [fasta] |
9 | | -2. **Will you use a second haplotype?** |
10 | | -3. **Haplotype 2** [fasta] |
11 | | -4. **Do you want to add suffixes to the scaffold names?** Select yes if the scaffold names in your assembly do not contain haplotype information. |
12 | | -5. **Haplotype 1 suffix** This suffix will be added to haplotype 1 scaffold names if you selected to add suffixes to the scaffold names. |
13 | | -6. **Haplotype 2 suffix** This suffix will be added to haplotype 2 scaffold names if you selected to add suffixes to the scaffold names. |
14 | | -7. **Hi-C reads** [fastq] Paired Collection containing the Hi-D data |
15 | | -8. **Do you want to trim the Hi-C data?** If *yes*, remove 5bp at the end of Hi-C reads. Use with Arima Hi-C data if the Hi-C map looks "noisy". |
16 | | -9. **Minimum Mapping Score** Minimum mapping score to keep for Hi-C alignments in the filtered PretextMap. Set to 0 to keep all mapped reads. Default: 20 . |
17 | | -10. **Telomere repeat to suit species** Expected value of the repeated sequences in the telomeres. Default value [CCCTAA] is suited to vertebrates. |
18 | | -11. **PacBio reads** [fastq] Collection of PacBio reads. |
| 13 | +### Required Inputs |
19 | 14 |
|
| 15 | +1. **Haplotype 1** [fasta] - Primary haplotype assembly |
| 16 | +2. **Will you use a second haplotype?** [boolean] - Set to true for diploid assemblies |
| 17 | +3. **Haplotype 2** [fasta] - Secondary haplotype assembly (required if using two haplotypes) |
| 18 | +4. **Hi-C reads** [fastq] - Paired collection containing Hi-C data |
| 19 | +5. **PacBio reads** [fastq] - Collection of PacBio HiFi reads |
| 20 | + |
| 21 | +### Assembly Annotation Parameters |
| 22 | + |
| 23 | +6. **Species Name** [text] - Species identifier for gene annotation |
| 24 | +7. **Assembly Name** [text] - Assembly identifier (e.g., toLid) |
| 25 | +8. **Lineage for Compleasm** [text] - BUSCO lineage dataset (e.g., vertebrata_odb10, primates_odb10) |
| 26 | +9. **Database for Compleasm** [text] - Compleasm database version (default: v5) |
| 27 | + |
| 28 | +### Scaffold Naming Options |
| 29 | + |
| 30 | +10. **Do you want to add suffixes to the scaffold names?** [boolean] - Select yes if scaffold names do not contain haplotype information |
| 31 | +11. **First Haplotype suffix** [text] - Suffix for haplotype 1 scaffolds (default: H1) |
| 32 | +12. **Second Haplotype suffix** [text] - Suffix for haplotype 2 scaffolds (default: H2) |
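
As a rough illustration of what the suffix options do, the sketch below appends a haplotype suffix to every FASTA header name. The function name `add_suffix` and the underscore separator are assumptions made for this example; the workflow may build the final scaffold names differently.

```python
def add_suffix(fasta_text: str, suffix: str, sep: str = "_") -> str:
    """Append `suffix` to the sequence name on every FASTA header line.

    `sep` is an assumed separator, not necessarily what the workflow uses.
    """
    out = []
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            # Split the scaffold name from any description that follows it.
            name, _, description = line[1:].partition(" ")
            header = f">{name}{sep}{suffix}"
            if description:
                header += f" {description}"
            out.append(header)
        else:
            # Sequence lines pass through unchanged.
            out.append(line)
    return "\n".join(out)


hap1 = ">scaffold_1 length=8\nACGTACGT\n>scaffold_2\nGGCC"
print(add_suffix(hap1, "H1"))
```

With suffixes applied, scaffolds from the two haplotypes stay distinguishable after the assemblies are merged.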
| 33 | + |
| 34 | +### Hi-C Processing Options |
| 35 | + |
| 36 | +13. **Do you want to trim the Hi-C data?** [boolean] - If yes, removes 5 bp at the end of Hi-C reads (recommended for Arima Hi-C data if the map looks "noisy") |
| 37 | +14. **Remove duplicated Hi-C reads?** [boolean] - Remove PCR duplicates from Hi-C alignments using Picard MarkDuplicates |
| 38 | +15. **Minimum Mapping Quality** [integer] - Minimum MAPQ score for filtered PretextMap (default: 10; set to 0 to keep all mapped reads) |
| 39 | + |
| 40 | +### PacBio Processing Options |
| 41 | + |
| 42 | +16. **Remove adapters from HiFi reads?** [boolean] - Trim adapters from PacBio HiFi reads using Cutadapt |
| 43 | + |
| 44 | +### Telomere Detection |
| 45 | + |
| 46 | +17. **Canonical telomeric pattern** [text] - Expected telomere repeat sequence (default: TTAGGG for vertebrates; use CCCTAA for reverse complement) |
| 47 | +18. **Telomeric Patterns to explore (comma-separated), IUPAC allowed** [text] - Additional telomeric patterns to search for (e.g., TTAGGG,CCCTAA) |
| 48 | + |
| 49 | +### Visualization Options |
| 50 | + |
| 51 | +19. **Bin Size for Bigwig files** [integer] - Resolution for coverage tracks (default: 100; larger values = smaller files but lower resolution) |
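
The trade-off behind the bin size can be sketched in a few lines of Python. This example assumes each bin stores the mean depth of the bases it covers; the function name `bin_coverage` is hypothetical, and the tool that writes the BigWig file may summarize bins differently.

```python
def bin_coverage(depths, bin_size):
    """Average per-base depths over consecutive bins of `bin_size` bases."""
    if bin_size <= 0:
        raise ValueError("bin_size must be positive")
    binned = []
    for start in range(0, len(depths), bin_size):
        window = depths[start:start + bin_size]
        # The last window may be shorter than bin_size.
        binned.append(sum(window) / len(window))
    return binned


# Toy per-base depths with a two-base dropout in the middle.
depths = [30, 30, 31, 29, 0, 0, 28, 32]
print(bin_coverage(depths, 2))  # [30.0, 30.0, 0.0, 30.0]
print(bin_coverage(depths, 4))  # [30.0, 15.0]
```

With a bin size of 2 the dropout is still visible as a zero-depth bin; with a bin size of 4 it is averaged into its neighbours, which is the loss of resolution that comes with the smaller file.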
20 | 52 |
|
21 | 53 | ## Outputs |
22 | 54 |
|
23 | | -1. Concatenated Assembly [fasta] If two haplotypes are used. |
24 | | -2. Trimmed Hi-C data (If trimming option is selected) [fastq] |
25 | | -3. Mapped Hi-C reads [bam] |
26 | | -4. Telomeres track [bedgraph] |
27 | | -5. Gap track [bedgraph] |
28 | | -6. Coverage track [bigwig] |
29 | | -7. Gaps in coverage track [bedgraph] |
30 | | -7. Pretext Map without tracks [pretext], filtered and unfiltered. |
31 | | -8. Pretext Map with tracks [pretext], filtered and unfiltered. |
32 | | -9. Pretext Snapshot image of the Hi-C contact map [png], filtered and unfiltered. |
| 55 | +### Assembly Outputs |
| 56 | + |
| 57 | +1. **Assembly for curation** [fasta] - Merged assembly (if two haplotypes used) or single haplotype with optional suffix |
| 58 | +2. **Assembly Info** [tabular] - Summary statistics from gfastats |
| 59 | +3. **Both Haplotypes merged** [fasta] - Concatenated assembly file |
| 60 | + |
| 61 | +### Gene Annotation Outputs |
| 62 | + |
| 63 | +4. **Compleasm hap1 - Busco** [GFF] - BUSCO-based gene predictions for haplotype 1 |
| 64 | +5. **Compleasm hap1 - Miniprot** [GFF] - Miniprot-based gene predictions for haplotype 1 |
| 65 | +6. **Compleasm hap2 - Busco** [GFF] - BUSCO-based gene predictions for haplotype 2 |
| 66 | +7. **Compleasm hap2 - Miniprot** [GFF] - Miniprot-based gene predictions for haplotype 2 |
| 67 | +8. **Merged Compleasm Gffs** [GFF] - Combined gene annotations from all Compleasm analyses |
| 68 | + |
| 69 | + |
| 70 | +### Hi-C Alignment Outputs |
| 71 | + |
| 72 | +10. **Merged Hi-C Alignments** [BAM] - Combined Hi-C read alignments |
| 73 | +11. **Hi-C duplication stats: MultiQC** [HTML] - MultiQC report of duplicate statistics (if duplicate removal enabled) |
| 74 | +12. **Hi-C duplication stats: Raw** [tabular] - Raw Picard MarkDuplicates metrics (if duplicate removal enabled) |
| 75 | +13. **Markduplicates Summary** [text] - Summary of duplicate removal statistics (if enabled) |
| 76 | + |
| 77 | +### PacBio Coverage Outputs |
| 78 | + |
| 79 | +14. **BigWig Coverage** [bigwig] - PacBio read coverage track |
| 80 | +15. **Coverage Gaps Track** [bedgraph] - Regions with low or no PacBio coverage |
| 81 | +16. **Merged HiFi Alignments** [BAM] - Combined PacBio alignments |
| 82 | + |
| 83 | +### Telomere Outputs |
| 84 | + |
| 85 | +17. **Telomere Report** [tabular] - Comprehensive telomere analysis from Teloscope |
| 86 | +18. **terminal telomeres** [bedgraph] - All detected telomeric regions |
| 87 | +19. **P telomeres bed** [BED] - P-arm (5') telomeres only |
| 88 | +20. **Q telomeres Bed** [BED] - Q-arm (3') telomeres only |
| 89 | + |
| 90 | +### Gap Outputs |
| 91 | + |
| 92 | +21. **Gaps Bed** [BED] - Assembly gap coordinates |
| 93 | +22. **Gaps Bedgraph** [bedgraph] - Assembly gap track for Pretext |
| 94 | + |
| 95 | +### Pretext Map Outputs |
| 96 | + |
| 97 | +All Pretext outputs are generated in two versions: |
| 98 | +- **With MAPQ filtering** (default MAPQ ≥ 10): Cleaner maps with high-confidence contacts |
| 99 | +- **Without filtering (Multimapping)**: Shows all mapped contacts including low-quality alignments |
| 100 | + |
| 101 | +23. **Pretext All tracks** [pretext] - Contact map with all annotation tracks (MAPQ filtered) |
| 102 | +24. **Pretext All tracks - Multimapping** [pretext] - Contact map with all tracks (unfiltered) |
| 103 | +25. **Pretext Snapshot With tracks** [PNG] - Image of contact map with tracks (MAPQ filtered) |
| 104 | +26. **Pretext Snapshot With tracks - Multimapping** [PNG] - Image of contact map with tracks (unfiltered) |
| 105 | + |
| 106 | +## Usage Notes |
| 107 | + |
| 108 | +### When to trim Hi-C data |
| 109 | +Enable trimming if you're using Arima Hi-C kits and notice a "noisy" contact map pattern. |
| 110 | + |
| 111 | +### MAPQ filtering |
| 112 | +The default MAPQ threshold of 10 removes ambiguously mapped Hi-C contacts, resulting in cleaner contact maps. Compare both filtered and unfiltered outputs to assess mapping quality. |
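
To make the effect of the threshold concrete, here is a minimal Python sketch that filters SAM-formatted text lines by MAPQ. It only illustrates the idea; the function name `filter_by_mapq` and the toy records are assumptions, and the workflow performs its filtering with its own tools rather than with this code.

```python
def filter_by_mapq(sam_lines, min_mapq=10):
    """Keep header lines and mapped alignments with MAPQ >= min_mapq."""
    kept = []
    for line in sam_lines:
        if line.startswith("@"):
            kept.append(line)  # header lines pass through unchanged
            continue
        fields = line.split("\t")
        flag, mapq = int(fields[1]), int(fields[4])  # SAM columns 2 and 5
        if flag & 0x4:
            continue  # flag bit 0x4 marks an unmapped read
        if mapq >= min_mapq:
            kept.append(line)
    return kept


# Hypothetical records: one confident, one ambiguous, one unmapped.
records = [
    "@SQ\tSN:scaffold_1\tLN:1000",
    "r1\t0\tscaffold_1\t10\t60\t8M\t*\t0\t0\tACGTACGT\tIIIIIIII",
    "r2\t0\tscaffold_1\t50\t3\t8M\t*\t0\t0\tACGTACGT\tIIIIIIII",
    "r3\t4\t*\t0\t0\t*\t*\t0\t0\tACGTACGT\tIIIIIIII",
]
print(len(filter_by_mapq(records, min_mapq=10)))  # 2: header + r1
print(len(filter_by_mapq(records, min_mapq=0)))   # 3: header + r1 + r2
```

Setting the threshold to 0 keeps every mapped read, including the ambiguous `r2`, which mirrors the difference between the filtered and the multimapping outputs.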
| 113 | + |
| 114 | +### Choosing Compleasm lineage |
| 115 | +Select the most specific lineage available for your species: |
| 116 | +- Vertebrates: `vertebrata_odb10` |
| 117 | +- Mammals: `mammalia_odb10` |
| 118 | +- Primates: `primates_odb10` |
| 119 | +- Birds: `aves_odb10` |
| 120 | +- See [BUSCO datasets](https://busco-data.ezlab.org/v5/data/lineages/) for complete list |
| 121 | + |
| 122 | +### Telomere patterns |
| 123 | +- Vertebrates typically use TTAGGG |
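| 124 | +- Some species have variant patterns (e.g., TTTAGGG in many plants, TTAGG in many insects); check the literature for your taxon |
| 125 | +- IUPAC codes are supported (e.g., TTAGGK matches TTAGGG and TTAGGT, since K stands for G or T) |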
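
The following Python sketch shows how an IUPAC telomere pattern can be expanded into a regular expression and how its reverse complement is derived. It is an illustration only, not how Teloscope is implemented; the helper names (`iupac_to_regex`, `reverse_complement`, `count_repeats`) and the test sequence are hypothetical.

```python
import re

# Standard IUPAC nucleotide codes mapped to regex character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]",
    "B": "[CGT]", "D": "[AGT]", "H": "[ACT]", "V": "[ACG]",
    "N": "[ACGT]",
}

# Complement of each IUPAC code (for example, K = G/T pairs with M = C/A).
COMPLEMENT = str.maketrans("ACGTRYSWKMBDHVN", "TGCAYRSWMKVHDBN")


def reverse_complement(pattern: str) -> str:
    """Return the reverse complement of an IUPAC pattern."""
    return pattern.upper().translate(COMPLEMENT)[::-1]


def iupac_to_regex(pattern: str) -> str:
    """Expand an IUPAC pattern into an equivalent regular expression."""
    return "".join(IUPAC[base] for base in pattern.upper())


def count_repeats(sequence: str, pattern: str) -> int:
    """Count non-overlapping forward-strand matches of `pattern`."""
    return len(re.findall(iupac_to_regex(pattern), sequence.upper()))


print(reverse_complement("TTAGGG"))                   # CCCTAA
print(iupac_to_regex("TTAGGK"))                       # TTAGG[GT]
print(count_repeats("TTAGGGTTAGGTTTAGGC", "TTAGGK"))  # 2
```

In the last call, `TTAGGG` and `TTAGGT` match the pattern while `TTAGGC` does not, because K covers only G and T.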
| 126 | + |
| 127 | +## Citation |
| 128 | + |
| 129 | +If you use this workflow, please cite: |
| 130 | +- PretextMap and PretextView tools |
| 131 | +- Compleasm for gene completeness assessment |
| 132 | +- Teloscope for telomere detection |
| 133 | +- Other tools as listed in the workflow |