Skip to content

Commit 4b2529b

Browse files
Adding updated documentation for mlml
1 parent 1c5873f commit 4b2529b

File tree

1 file changed

+116
-57
lines changed

1 file changed

+116
-57
lines changed

docs/content/mlml.md

Lines changed: 116 additions & 57 deletions
Original file line numberDiff line numberDiff line change
@@ -1,109 +1,168 @@
11
# mlml - estimation of hydroxymethylation and methylation levels
22

33
## Synopsis
4-
```shell
5-
$ dnmtools mlml [OPTIONS] -u <input-bs.meth> -m <input-oxbs.meth> -h <input-tab.meth>
4+
```console
5+
$ dnmtools mlml [OPTIONS] -u <input-bs-seq> -m <input-oxbs-seq> -h <input-tab-seq>
66
```
77

88
## Description
99

1010
If you are interested in estimating hydroxymethylation level and have
1111
any two of Tet-Assisted Bisulfite sequencing (TAB-seq), oxidative
12-
bisulfite sequencing (oxBS-seq) and BS-seq data available, you can use
13-
`mlml` to perform consistent and simultaneous estimation.
14-
15-
The input file format could be the default
16-
[counts](../counts) output format,
17-
or BED format file with 6 columns as the example below:
12+
bisulfite sequencing (oxBS-seq) or ordinary bisulfite sequencing
13+
(BS-seq) data available, `mlml` can perform simultaneous estimation of
14+
5mC and 5hmC levels, separately for each site.
1815

16+
The input file format is the "counts" format, which is generated by
17+
the dnmtools [counts](../counts) command after mapping reads. This
18+
is the general appearance of each line in the input files:
1919
```txt
20-
chr1 3001345 3001346 CpG:9 0.777777777778 +
20+
chr1 3001345 + CpG 0.777777777778 9
2121
```
22-
23-
Here the fourth column indicates that this site is a CpG site, and the
24-
number of reads covering this site is 9. The fifth column is the
25-
methylation level of the CpG site, ranging from 0 to 1. Note that all
26-
input files must be sorted. Assume you have three input files ready:
27-
`input-bs-seq.meth`, `input-oxbs-seq.meth` and `input-tab-seq.meth`.
28-
The following command will take all the inputs:
29-
30-
```shell
31-
$ dnmtools mlml -v -u input-bs-seq.meth -m input-oxbs-seq.meth -h input-tab-seq.meth -o result.txt
22+
In general `mlml` works on CpG sites. There is no reason you cannot
23+
use it on other sites, but we've only ever tested it on CpG sites.
24+
25+
All input files for `mlml` must be sorted the same way, and have
26+
corresponding sites -- each must have the same number of sites, at the
27+
same positions within the genome, and in the same order.
28+
29+
Assume you have three input files: `input-bs-seq.counts`,
30+
`input-oxbs-seq.counts` and `input-tab-seq.counts`. Each of these
31+
files would be produced by ordinary BS-seq, oxBS-seq and TAB-seq,
32+
respectively. The following command will use these inputs:
33+
```console
34+
$ dnmtools mlml -u input-bs-seq.counts -m input-oxbs-seq.counts -h input-tab-seq.counts
3235
```
36+
The results for the above command will be printed to the terminal with
37+
one line per site (line) in the input files -- that might be a
38+
lot. One line will look something like this:
39+
```txt
40+
chr1 300 + CpG 0.635788 0.259378 0.104835 0
41+
```
42+
Each line has 8 columns, the first 4 of which correspond to those in
43+
the [counts](../counts) format indicated above. These are
44+
*tab-delimited*. Here is the meaning of the rest of the columns:
45+
46+
1. chromosome
47+
2. position
48+
3. strand
49+
4. label
50+
5. 5mC: estimated fraction of molecules with the 5mC mark
51+
6. 5hmC: estimated fraction of molecules with the 5mC mark
52+
7. neither: estimated fraction with neither 5mC nor 5hmC
53+
8. number of conflicting sites: this integer value is explained below
54+
55+
If only two types of input are available, e.g. `input-bs-seq.counts`
56+
and `input-oxbs-seq.counts`, then the following command would be
57+
appropriate:
58+
```console
59+
$ dnmtools mlml -u input-bs-seq.counts -m input-oxbs-seq-seq.counts -o result.txt
60+
```
61+
The output would be written to the file `results.txt` because of the
62+
`-o` argument. Obviously more data is better, and estimates from `mlml`
63+
are more accurate with the 3 types of data. But in the vast majority
64+
of cases only one of oxBS-seq or TAB-seq will be available. The
65+
arguments `-u`, `-m` and `-b` are used to indicate the following:
3366

34-
If only two types of input are available, e.g. `input-bs-seq.meth and
35-
`input-oxbs-seq.meth`, then use the following command:
67+
* -u: BS-seq input, think of it as directly informing about "unmethylated"
68+
* -m: oxBS-seq input, "m" because it gives exactly 5mC information.
69+
* -h: TAB-seq input, "h" because it gives exactly 5hmC information.
3670

37-
```shell
38-
$ dnmtools mlml -u input-bs-seq.meth -m input-oxbs-seq-seq.meth -o result.txt
39-
```
71+
You must provide at least 2 of these, and be sure to use the
72+
appropriate argument for each data type. `mlml` works by taking
73+
advantage of the properties of each data type, so swapping those will
74+
result in useless output.
4075

4176
In some cases, you might want to specify the convergence tolerance for
4277
EM algorithm. This can be done through `-t` option. For example:
43-
44-
```shell
45-
$ dnmtools mlml -u input-bs-seq.meth -m input-oxbs-seq.meth -o result.txt -t 1e-2
78+
```console
79+
$ dnmtools mlml -u bs-seq.counts -m oxbs-seq.counts -o result.txt -t 1e-2
4680
```
47-
4881
This command will make the iteration process stop when the difference
49-
of estimation between two iterations is less than 10^−2 . Thea value
50-
format can be scientific notation, e.g. 1e-5, or float number, e.g.
51-
0.00001.
52-
53-
The output of `mlml` is tab-delimited format. Here is an example:
54-
55-
```txt
56-
chr11 15 16 0.166667 0.19697 0.636364 0
57-
chr12 11 12 0.222222 0 0.777778 2
82+
of estimation between two iterations is less than 1e-02. The value
83+
format can be scientific notation, e.g. 1e-5, or as a fixed point
84+
fraction, e.g. 0.00001.
85+
86+
Conflicts: To calculate the last column in the output, a binomial test
87+
is performed for each input methylation level (either 2 or 3 in total
88+
depending on the number of input files). The tests are done separately
89+
for estimated maximum likelihood level (for each of 5mC, 5hmC and
90+
"neither"). In each case, check if the estimate falls outside the
91+
confidence interval calculated using coverage and observed fraction
92+
from the input file. If more than one estimate at a given site falls
93+
outside the confidence interval, we flag it as conflicting. This will
94+
happen by random chance, but if the same sites show conflicts in
95+
multiple independent analyses, it might indicate that the sites
96+
themselves are unreliable in one of the types of experiments.
97+
98+
## Examples
99+
100+
Assume the data files are as follows:
101+
```console
102+
$ cat bs.counts
103+
chr1 100 + CpG 0.95 20
104+
chr1 200 + CpG 0.8 20
105+
chr1 300 + CpG 0.9 30
106+
chr1 400 + CpG 0.5 2
107+
$ cat oxbs.counts
108+
chr1 100 + CpG 0.8 20
109+
chr1 200 + CpG 0.6 20
110+
chr1 300 + CpG 0.6 10
111+
chr1 400 + CpG 0.6 10
112+
$ cat tab.counts
113+
chr1 100 + CpG 0.15 20
114+
chr1 200 + CpG 0.2 30
115+
chr1 300 + CpG 0.2 5
116+
chr1 400 + CpG 0.1 20
58117
```
59-
60-
The columns are chromosome name, start position, end position, 5-mC
61-
level, 5-hmC level, unmethylated level and number of conflicts. To
62-
calculate the last column, a binomial test is performed for each input
63-
methylation level (can be 2 or 3 in total depending on parameters). If
64-
the estimated methylation level falls out of the confidence interval
65-
calculated from input coverage and methylation level, then such event
66-
is counted as one conflict. It is recommended to filter estimation
67-
results based on the number of conflicts; if more conflicts happens on
68-
one site then it is possible that information from such site is not
69-
reliable.
118+
Then running `mlml` would give the following results:
119+
```console
120+
$ dnmtools mlml -bsseq bs.counts -tabseq tab.counts -oxbsseq oxbs.counts
121+
chr1 100 + CpG 0.8 0.15 0.05 0
122+
chr1 200 + CpG 0.6 0.2 0.2 0
123+
chr1 300 + CpG 0.635788 0.259378 0.104835 0
124+
chr1 400 + CpG 0.565183 0.0939689 0.340848 0
125+
```
126+
You should be able to calculate the estimates for the 1st and 2nd
127+
output lines easily in your head. The third and fourth lines are more
128+
difficult.
70129

71130
## Options
72131

73132
```txt
74133
-o, -output
75134
```
76-
output file (default: STDOUT)
135+
output file (default: standard output)
77136

78137
```txt
79138
-u, -bsseq
80139
```
81-
input BS-seq methcounts file
140+
input file for BS-seq counts
82141
```txt
83142
-h, -tabseq
84143
```
85-
input TAB-seq methcounts file
144+
input file for TAB-seq counts
86145
```txt
87146
-m, -oxbsseq
88147
```
89-
input oxBS-seq methcounts file
148+
input file for oxBS-seq counts
90149
```txt
91150
-t, -tolerance
92151
```
93-
EM convergence threshold (default: 0.000000)
152+
convergence tolerance (default: 1e-10)
94153
```txt
95154
-a, -alpha
96155
```
97-
significance level of binomial test for each site (default: 0.050000)
156+
significance level of binomial test at each site (default: 0.05)
98157
```txt
99158
-H, -outh
100159
```
101-
hmC pseudo methcount output file (default: null)
160+
5hmC estimated levels at each site (default: none)
102161
```txt
103162
-M, -outm
104163
```
105-
mC pseudo methcount output file (default: null)
164+
5mC estimated levels at each site (default: none)
106165
```txt
107166
-v, -verbose
108167
```
109-
print more run info to STDERR while the program is running.
168+
print more information while running

0 commit comments

Comments
 (0)