Skip to content

Commit d96ce24

Browse files
authored
Merge pull request #226 from golobor/update_docs
update docs
2 parents 8b9b04e + 91bfa85 commit d96ce24

File tree

4 files changed

+137
-38
lines changed

4 files changed

+137
-38
lines changed
Lines changed: 2 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -1,5 +1,5 @@
1-
Technical notes
2-
===============
1+
Design notes
2+
=============
33

44
Designing scientific software and formats requires making a multitude of
55
tantalizing technical decisions and compromises. Often, the reasons behind a

doc/examples/pairtools_phase_walkthrough.ipynb

Lines changed: 15 additions & 35 deletions
Original file line numberDiff line numberDiff line change
@@ -7,8 +7,6 @@
77
"source": [
88
"# Pairtools phase walkthrough\n",
99
"\n",
10-
"## Pairtools Phase Walkthrough\n",
11-
"\n",
1210
"Welcome to the pairtools phase walkthrough! This notebook will guide you through the process of resolving contacts between homologous chromosomes using haplotype-resolved Hi-C analysis.\n",
1311
"\n",
1412
"## What is haplotype-resolved Hi-C?\n",
@@ -36,50 +34,31 @@
3634
"source": [
3735
"Several approaches have been developed to process Hi-C data from haplotype-resolved experiments. In `pairtools`, we implement the approach that was used in Erceg et al. Here is its brief outline:\n",
3836
"\n",
39-
"1. [Create the reference genome](#Create-the-reference-genome): create a \"concatenated\" reference genome that contains sequences of both homologs of each chromosome. \n",
37+
"1. Create the haplotype-resolved genome. First, we will create a \"concatenated\" reference genome that contains sequences of both homologs of each chromosome. \n",
4038
"\n",
4139
" - Incorporate known SNVs (usually in .vcf format) into the reference genome using [bcftools](https://samtools.github.io/bcftools/bcftools.html) to create FASTA files with the sequences of both homologs.\n",
4240
" - Add suffixes to the name of each homolog that identify the type (`_hap1` or `_hap2`).\n",
4341
"\n",
4442
"2. Map the Hi-C data to the concatenated reference and parse resulting alignment into Hi-C pairs. Compared to the standard Hi-C pipeline, this step would contain a couple of modifications:\n",
45-
" - parse allowing multimappers (mapq 0). \n",
46-
" - make the aligner report two suboptimal alignments (aka the second and the third hit).\n",
43+
" - Make the aligner report two suboptimal alignments (aka the second and the third hit).\n",
44+
" - Parse allowing multimappers (mapq 0). \n",
4745
" \n",
4846
" Note that, upon mapping to the homolog-resolved genome, Hi-C reads will report the identity of their homologue as the suffix of the chromosome name.\n",
4947
" \n",
50-
" See sections:\n",
51-
" \n",
52-
" (i) [Download data](#Download-data)\n",
53-
" \n",
54-
" (ii) [Map data with bwa mem to diploid genome](#Map-data-with-bwa-mem-to-diploid-genome)\n",
55-
" \n",
56-
" (iii) [pairtools parse](#pairtools-parse)\n",
57-
" \n",
48+
"3. Phase the resulting pairs based on the reported suboptimal alignments. \n",
5849
"\n",
59-
"3. [pairtools phase](#pairtools-phase): phase the pairs based on the reported suboptimal alignments. \n",
50+
" By checking the scores of two suboptimal alignments, we will distinguish the true multi-mappers from unresolved pairs (i.e. cases when the read aligns to the location with no distinguishing SNV). Phasing will remove the haplotype suffixes from chromosome names and add extra fields to the .pairs file with:\n",
6051
"\n",
61-
" By checking the scores of two suboptimal alignments, we will distinguish the true multi-mappers from unresolved pairs (i.e. cases when the read aligns to the location with no distinguishing SNV).\n",
62-
" Phasing procedure will remove the haplotype suffixes from chromosome names and add extra fields to the .pairs file with:\n",
63-
" \n",
64-
" '.' (non-resolved)\n",
65-
" \n",
66-
" '0' (first haplotype) or \n",
67-
" \n",
68-
" '1' (second haplotype). \n",
69-
" \n",
52+
" - '.' (non-resolved)\n",
53+
" - '0' (first haplotype) \n",
54+
" - '1' (second haplotype)\n",
7055
" \n",
7156
" Phasing schema: \n",
7257
" \n",
73-
"![image.png](attachment:62e74fba-c1c1-44b5-a3e2-3699c3cac7ce.png)\n",
74-
"\n",
58+
" ![image.png](attachment:62e74fba-c1c1-44b5-a3e2-3699c3cac7ce.png)\n",
7559
"\n",
76-
"4. Post-procesing. Sort and dedup Hi-C pairs and calculate stats, similarly to the standard Hi-C pipeline. \n",
7760
"\n",
78-
" See sections:\n",
79-
" \n",
80-
" (i) [pairtools dedup](#pairtools-dedup)\n",
81-
" \n",
82-
" (ii) [Stats](#Stats)"
61+
"4. Post-procesing. Sort and [dedup](#pairtools-dedup) Hi-C pairs and calculate [stats](#Stats), similarly to the standard Hi-C pipeline. "
8362
]
8463
},
8564
{
@@ -111,7 +90,7 @@
11190
"id": "5ab026af-fe25-4a70-82ef-52af6fb25371",
11291
"metadata": {},
11392
"source": [
114-
"## Create the reference genome\n",
93+
"## Create the homolog-resolved genome\n",
11594
"\n",
11695
"To phase input reads, we need to map the data to the concatenated genome with two haplotypes. \n",
11796
"Below, we will generate such genome in several steps. You will need a reference genome, and one or two lists of mutations to instroduce to the reference.\n",
@@ -363,15 +342,15 @@
363342
"id": "dfd7c4cb-31dd-43df-8510-95fd0ff9f78f",
364343
"metadata": {},
365344
"source": [
366-
"#### Create the index of concatenated haplotypes"
345+
"#### Create the bwa index of homolog-resolved genome"
367346
]
368347
},
369348
{
370349
"cell_type": "markdown",
371350
"id": "99d28f6f-b754-4a95-95d5-9e5e51d14571",
372351
"metadata": {},
373352
"source": [
374-
"Concatenate the genomes and index them together. "
353+
"Concatenate the genomes of two homologs and index them together. "
375354
]
376355
},
377356
{
@@ -588,7 +567,8 @@
588567
"id": "a8a9fd19",
589568
"metadata": {},
590569
"source": [
591-
"# Map data with bwa mem to diploid genome\n",
570+
"## Map data with bwa mem to diploid genome\n",
571+
"\n",
592572
"In homolog-resolved Hi-C experiments, reads are first aligned against the reference genome and then parsed with pairtools, similar to the standard pairtools-based Hi-C pipeline. However, an additional challenge arises in distinguishing between un-phaseable reads (reads that map equally well to two homologous locations on two homologous chromosomes) and multimappers (reads that map to repeats in the genome).\n",
593573
"\n",
594574
"To differentiate between these cases, we can examine the top three candidate alignments for each read:\n",

doc/index.rst

Lines changed: 2 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -72,7 +72,8 @@ Contents:
7272
sorting
7373
formats
7474
stats
75-
technotes
75+
protocols_pipelines
76+
designnotes
7677
cli_tools
7778

7879
.. toctree::

doc/protocols_pipelines.rst

Lines changed: 118 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,118 @@
1+
Workflows and Parameters
2+
========================
3+
4+
This page provides guidance on using pairtools for the most common Hi-C protocols and
5+
helps users fine-tune the pipeline for different variations of the Hi-C protocol.
6+
It covers recommended parameters and best practices for processing Hi-C data using pairtools.
7+
8+
Typical Hi-C Workflow
9+
----------------------
10+
11+
A typical pairtools workflow for processing standard Hi-C data is outlined below.
12+
Please, note that this is a shorter version; you can find a more detailed and reproducible example in chapter :ref:`examples/pairtools_walkthrough`.
13+
14+
1. Align sequences to the reference genome with ``bwa mem``:
15+
16+
.. code-block:: console
17+
18+
bwa mem -SP index_file input.R1.fastq input.R2.fastq > input.sam
19+
20+
2. Parse alignments into Hi-C pairs using ``pairtools parse``:
21+
22+
.. code-block:: console
23+
24+
pairtools parse -c /path/to/chrom_sizes -o output.pairs.gz input.sam
25+
26+
3. Sort pairs using ``pairtools sort``:
27+
28+
29+
.. code-block:: console
30+
31+
32+
pairtools sort --nproc 8 -o output.sorted.pairs.gz output.pairs.gz
33+
34+
4. Detect and remove duplicates using ``pairtools dedup`` and generate statistics:
35+
36+
.. code-block:: console
37+
38+
pairtools dedup \
39+
--output output.nodups.pairs.gz \
40+
--output-dups output.dups.pairs.gz \
41+
--output-unmapped output.unmapped.pairs.gz
42+
--output-stats output.dedup.stats \
43+
output.sorted.pairs.gz
44+
45+
5. Aggregate into a cooler file:
46+
47+
.. code-block:: console
48+
49+
cooler cload pairs -c1 2 -p1 3 -c2 4 -p2 5 /path/to/chrom_sizes:1000 output.nodups.pairs.gz output.1000.cool
50+
51+
52+
53+
54+
Recommended pairtools parameters for standard Hi-C protocols
55+
------------------------------------------------------------
56+
57+
To adapt the standard workflow for common variations of the Hi-C protocol, consider adjusting the following parameters:
58+
59+
1. ``pairtools parse --walks-policy``:
60+
61+
This parameter determines how pairtools parse handles reads with multiple alignments (walks). We recommend specifying the value explicitly, as the default has changed between versions of ``pairtools parse``.
62+
63+
Our current recommendation is to use ``--walks-policy 5unique``, which is the default setting in the latest version of pairtools. With this option, pairtools parse reports the two 5'-most unique alignments on each side of a paired read as a pair.
64+
65+
This option increases the number of reported pairs compared to the most conservative ``--walks-policy mask``. However, it's important to note that ``5unique`` can potentially report pairs of non-directly ligated fragments (i.e., two fragments separated by one or more other DNA fragments). Such non-direct (also known as "higher-order" or "nonadjacent") ligations have slightly different statistical properties than direct ligations, as illustrated in several Pore-C papers [`1 <https://www.biorxiv.org/content/10.1101/833590v1.full>`_ , `2 <https://www.nature.com/articles/s41467-023-36899-x>`_].
66+
67+
An alternative is the ``--walks-policy 3unique`` policy, which reports the two 3'-most unique alignments on each side of
68+
a paired read as a pair, thus decreasing the chance of reporting non-direct ligations.
69+
However, ``3unique`` may not work well in situations where the combined length of a read pair is longer than the length of a DNA fragment (e.g. long read experiments).
70+
In this case, the 3' sides of the two reads will cover the same locations in the DNA molecule, and the 3' alignments may end up identical.
71+
72+
Finally, the experimental ``--walks-policy all`` option reports all alignments of a read pair as separate pairs. This option maximizes the number of reported pairs. The downside is that it breaks the assumption that there is only one pair per read, which is not compatible with retrieval of .sam records from .pairsam output and may also complicate the interpretation of pair statistics.
73+
74+
2. ``pairtools select "(mapq1>=30) and (mapq2>=30)"``:
75+
76+
This filtering command selects only pairs with high-quality alignments,
77+
where both reads in a pair have a mapping quality (MAPQ) score of 30 or higher.
78+
Applying this filter helps remove false alignments between partially homologous sequences, which often cause artificial high-frequency interactions in Hi-C maps.
79+
This step is essential for generating maps for high-quality dot calls.
80+
81+
Note that we recommend storing the most comprehensive, unfiltered list of pairs and applying the filter on the fly prior to contact aggregation:
82+
83+
.. code-block:: console
84+
85+
pairtools select "(mapq1>=30) and (mapq2>=30)" output.nodups.pairs.gz | \
86+
cooler cload pairs -c1 2 -p1 3 -c2 4 -p2 5 chromsizes.txt:1000 - output.mapq_30.1000.cool
87+
88+
89+
Technical tips
90+
--------------
91+
92+
- **Pipe between commands to save space and I/O throughput**
93+
94+
Use Unix pipes to connect the output of one command directly to the input of the next command in the pipeline.
95+
This eliminates the need to store intermediate files on disk, saving storage space and reducing I/O overhead.
96+
Specifically, mapping, parsing, sorting and deduplication can all be connected into a single pipeline:
97+
98+
.. code-block:: console
99+
100+
bwa mem -SP index input.R1.fastq input.R2.fastq | \
101+
pairtools parse -c chromsizes.txt | \
102+
pairtools sort | \
103+
--output output.nodups.pairs.gz \
104+
--output-dups output.dups.pairs.gz \
105+
--output-unmapped output.unmapped.pairs.gz
106+
--output-stats output.dedup.stats
107+
108+
- **Use recommended compression for efficient storage and processing.** .sam, .pairs and .pairsam files are text-based format that are rather inefficient and slow to process.
109+
Pairtools recognize .bam, .gz and .lz4 file extensions and automatically compress and decompress files on the fly.
110+
Compression saves space, and reduces I/O overhead at a relatively minor CPU cost.
111+
112+
- **Parallelize tasks and manage resources effectively for faster execution.**
113+
Each pairtool has the CLI flags --nproc-in and --nproc-out to control the number of cores dedicated
114+
to input decompression and output compression. Additionally, `pairtools sort` parallelizes sorting with `--nproc`.ß
115+
116+
Example Workflows
117+
-----------------
118+

0 commit comments

Comments
 (0)