Commit 0c1a0c7

Merge pull request #222 from golobor/master
stats: calculate the divergence point and corresponding read fractions
2 parents cff8226 + 00ef8e6 commit 0c1a0c7

5 files changed

+488
-179
lines changed

doc/protocols_pipelines.rst

Lines changed: 4 additions & 2 deletions
@@ -9,7 +9,7 @@ Typical Hi-C Workflow
 ----------------------

 A typical pairtools workflow for processing standard Hi-C data is outlined below.
-Please, note that this is a shorter version; you can find a more detailed and reproducible example in chapter :ref:`examples/pairtools_walkthrough`.
+Please, note that this is a shorter version. For a detailed reproducible example, please, check the Jupyter notebook "Pairtools Walkthrough".

 1. Align sequences to the reference genome with ``bwa mem``:

@@ -103,6 +103,7 @@ Technical tips
     bwa mem -SP index input.R1.fastq input.R2.fastq | \
     pairtools parse -c chromsizes.txt | \
     pairtools sort | \
+    pairtools dedup | \
         --output output.nodups.pairs.gz \
         --output-dups output.dups.pairs.gz \
         --output-unmapped output.unmapped.pairs.gz
@@ -116,8 +117,9 @@ Technical tips
 Each pairtool has the CLI flags --nproc-in and --nproc-out to control the number of cores dedicated
 to input decompression and output compression. Additionally, `pairtools sort` parallelizes sorting with `--nproc`.

-Example Workflows
+Advanced Workflows
 ------------------
+
 For more advanced workflows, please check the following projects:

 - `Distiller-nf <https://github.com/open2c/distiller-nf>`_ is a feature-rich Open2C Hi-C processing pipeline for the Nextflow workflow manager.

doc/stats.rst

Lines changed: 12 additions & 6 deletions
@@ -14,7 +14,7 @@ output file.

 - **Global statistics** include:
     - number of pairs (total, unmapped, single-side mapped, etc.),
-    - total number of different pair types (UU, NN, NU, and others, see ` Pair types in pairtools docs <https://pairtools.readthedocs.io/en/latest/formats.html#pair-types>`_),
+    - total number of different pair types (UU, NN, NU, and others, see `Pair types in pairtools docs <https://pairtools.readthedocs.io/en/latest/formats.html#pair-types>`_),
     - number of contacts between all chromosome pairs

 - **Summary statistics** include:
@@ -59,17 +59,23 @@ replacement from a finite pool of fragments in DNA library [1]_ [2]_.
 With each new sequenced molecule, the expected number of observed unique molecules
 increases according to a simple equation:

-$$ U(N+1) = U(N) + (1 - {U(N) \\over C}), $$
+.. math::

-where $N$ is the number of sequenced molecules, $U(N)$ is the expected number
-of observed unique molecules after sequencing $N$ molecules, and C is the library complexity.
+    U(N+1) = U(N) + \left(1 - \frac{U(N)}{C} \right),
+
+where :math:`N` is the number of sequenced molecules, :math:`U(N)` is the expected number
+of observed unique molecules after sequencing :math:`N` molecules, and :math:`C` is the library complexity.
 This differential equation yields [1, 2]:

-$$ {U(N) \\over C} = 1 - exp( - {N \\over C}), $$
+.. math::
+
+    {U(N) \over C} = 1 - exp\left( - \frac{N}{C} \right),

 which can be solved as

-$$ C = \Re(lambert W( - { \exp( - {1 \\over u} ) \\over u} ) ) + {1 \\over u} $$
+.. math::
+
+    C = \Re \left( W_{Lambert} \left( - \frac{ \exp\left( - \frac{1}{U} \right) } {U} \right) \right) + \frac{1}{U}

 Library complexity can guide in the choice of sequencing depth of the library
 and provide an estimate of library quality.
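
As a rough illustration of the library-complexity relation in the doc/stats.rst hunk, the sketch below recovers C from the number of sequenced molecules N and the number of observed unique molecules U by solving U/C = 1 - exp(-N/C) numerically. It uses plain bisection instead of the Lambert W closed form, and the function name `estimate_complexity` and its arguments are hypothetical; this is not the pairtools implementation.

```python
import math


def estimate_complexity(n_seq, n_unique, n_iter=200):
    """Solve U/C = 1 - exp(-N/C) for C by bisection (illustrative only)."""
    if not 0 < n_unique <= n_seq:
        raise ValueError("expected 0 < n_unique <= n_seq")
    u = n_unique / n_seq  # fraction of unique molecules, U/N
    if u == 1.0:
        return math.inf  # no duplicates observed: complexity is unbounded
    # With x = N/C the equation becomes f(x) = 1 - exp(-x) - u*x = 0.
    # f peaks at x = -ln(u), where f > 0, and f(1/u) < 0, so the positive
    # root lies between those two points.
    lo, hi = -math.log(u), 1.0 / u
    for _ in range(n_iter):
        mid = 0.5 * (lo + hi)
        if 1.0 - math.exp(-mid) - u * mid > 0:
            lo = mid
        else:
            hi = mid
    return n_seq / (0.5 * (lo + hi))


# Synthetic check: simulate U for a known C, then recover C.
true_c = 1_000_000
n = 500_000
u_expected = true_c * (1 - math.exp(-n / true_c))
print(estimate_complexity(n, u_expected))
```

In terms of the unique fraction u = U/N, the same root can be written in closed form as N/C = W(-exp(-1/u)/u) + 1/u with W the principal branch of the Lambert W function, so C is N divided by that quantity. Bisection is used here only to keep the sketch free of third-party dependencies.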

pairtools/cli/stats.py

Lines changed: 13 additions & 3 deletions
@@ -28,6 +28,14 @@
     " all overlapping statistics. Non-overlapping statistics are appended to"
     " the end of the file. Supported for tsv stats with single filter.",
 )
+@click.option(
+    "--n-dist-bins-decade",
+    type=int,
+    default=PairCounter.N_DIST_BINS_DECADE_DEFAULT,
+    show_default=True,
+    required=False,
+    help="Number of bins to split the distance range in log10-space, specified per a factor of 10 difference.",
+)
 @click.option(
     "--with-chromsizes/--no-chromsizes",
     is_flag=True,
@@ -107,7 +115,7 @@
 )
 @common_io_options
 def stats(
-    input_path, output, merge, bytile_dups, output_bytile_stats, filter, **kwargs
+    input_path, output, merge, n_dist_bins_decade, bytile_dups, output_bytile_stats, filter, **kwargs
 ):
     """Calculate pairs statistics.

@@ -123,6 +131,7 @@ def stats(
         input_path,
         output,
         merge,
+        n_dist_bins_decade,
         bytile_dups,
         output_bytile_stats,
         filter,
@@ -131,10 +140,10 @@ def stats(


 def stats_py(
-    input_path, output, merge, bytile_dups, output_bytile_stats, filter, **kwargs
+    input_path, output, merge, n_dist_bins_decade, bytile_dups, output_bytile_stats, filter, **kwargs
 ):
     if merge:
-        do_merge(output, input_path, **kwargs)
+        do_merge(output, input_path, n_dist_bins_decade=n_dist_bins_decade, **kwargs)
         return

     if len(input_path) == 0:
@@ -181,6 +190,7 @@ def stats_py(
         filter = None

     stats = PairCounter(
+        n_dist_bins_decade=n_dist_bins_decade,
         bytile_dups=bytile_dups,
         filters=filter,
         startup_code=kwargs.get("startup_code", ""),  # for evaluation of filters
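
To make the meaning of `--n-dist-bins-decade` more concrete, here is a minimal sketch of how bin edges spaced evenly in log10-space could be built, with a fixed number of bins per factor of 10 in distance. The helper `make_dist_bins`, its default range of 10^0 to 10^9 bp, and the rounding and de-duplication of edges are assumptions made for illustration; the actual binning lives in `PairCounter` and may differ.

```python
def make_dist_bins(n_dist_bins_decade, min_log10_dist=0, max_log10_dist=9):
    """Return increasing integer bin edges, log10-spaced (hypothetical helper)."""
    n_steps = (max_log10_dist - min_log10_dist) * n_dist_bins_decade
    edges = [0]  # leading 0 so distances below the first edge fall in a bin
    for i in range(n_steps + 1):
        edge = round(10 ** (min_log10_dist + i / n_dist_bins_decade))
        # Rounding can repeat small edges (e.g. 1, 1, 2); keep unique ones only.
        if edge > edges[-1]:
            edges.append(edge)
    return edges


# One bin per decade between 1 bp and 1 kb.
print(make_dist_bins(1, 0, 3))
# Two bins per decade between 1 bp and 10 bp.
print(make_dist_bins(2, 0, 1))
```

A larger value gives a finer resolution of the distance-dependent statistics at the cost of more rows in the stats output.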
