|
1 | 1 | # mlml - estimation of hydroxymethylation and methylation levels |
2 | 2 |
|
3 | 3 | ## Synopsis |
4 | | -```shell |
5 | | -$ dnmtools mlml [OPTIONS] -u <input-bs.meth> -m <input-oxbs.meth> -h <input-tab.meth> |
| 4 | +```console |
| 5 | +$ dnmtools mlml [OPTIONS] -u <input-bs-seq> -m <input-oxbs-seq> -h <input-tab-seq> |
6 | 6 | ``` |
7 | 7 |
|
8 | 8 | ## Description |
9 | 9 |
|
10 | 10 | If you are interested in estimating hydroxymethylation level and have |
11 | 11 | any two of Tet-Assisted Bisulfite sequencing (TAB-seq), oxidative |
12 | | -bisulfite sequencing (oxBS-seq) and BS-seq data available, you can use |
13 | | -`mlml` to perform consistent and simultaneous estimation. |
14 | | - |
15 | | -The input file format could be the default |
16 | | -[counts](../counts) output format, |
17 | | -or BED format file with 6 columns as the example below: |
| 12 | +bisulfite sequencing (oxBS-seq) or ordinary bisulfite sequencing |
| 13 | +(BS-seq) data available, `mlml` can perform simultaneous estimation of |
| 14 | +5mC and 5hmC levels, separately for each site. |
18 | 15 |
|
| 16 | +The input file format is the "counts" format, which is generated by |
| 17 | +the dnmtools [counts](../counts) command after mapping reads. This |
| 18 | +is the general appearance of each line in the input files: |
19 | 19 | ```txt |
20 | | -chr1 3001345 3001346 CpG:9 0.777777777778 + |
| 20 | +chr1 3001345 + CpG 0.777777777778 9 |
21 | 21 | ``` |
22 | | - |
23 | | -Here the fourth column indicates that this site is a CpG site, and the |
24 | | -number of reads covering this site is 9. The fifth column is the |
25 | | -methylation level of the CpG site, ranging from 0 to 1. Note that all |
26 | | -input files must be sorted. Assume you have three input files ready: |
27 | | -`input-bs-seq.meth`, `input-oxbs-seq.meth` and `input-tab-seq.meth`. |
28 | | -The following command will take all the inputs: |
29 | | - |
30 | | -```shell |
31 | | -$ dnmtools mlml -v -u input-bs-seq.meth -m input-oxbs-seq.meth -h input-tab-seq.meth -o result.txt |
| 22 | +In general `mlml` works on CpG sites. There is no reason you cannot |
| 23 | +use it on other sites, but we've only ever tested it on CpG sites. |
| 24 | + |
| 25 | +All input files for `mlml` must be sorted the same way, and have |
| 26 | +corresponding sites -- each must have the same number of sites, at the |
| 27 | +same positions within the genome, and in the same order. |
| 28 | + |
| 29 | +Assume you have three input files: `input-bs-seq.counts`, |
| 30 | +`input-oxbs-seq.counts` and `input-tab-seq.counts`. Each of these |
| 31 | +files would be produced by ordinary BS-seq, oxBS-seq and TAB-seq, |
| 32 | +respectively. The following command will use these inputs: |
| 33 | +```console |
| 34 | +$ dnmtools mlml -u input-bs-seq.counts -m input-oxbs-seq.counts -h input-tab-seq.counts |
32 | 35 | ``` |
| 36 | +The results for the above command will be printed to the terminal with |
| 37 | +one line per site (line) in the input files -- that might be a |
| 38 | +lot. One line will look something like this: |
| 39 | +```txt |
| 40 | +chr1 300 + CpG 0.635788 0.259378 0.104835 0 |
| 41 | +``` |
| 42 | +Each line has 8 columns, the first 4 of which correspond to those in |
| 43 | +the [counts](../counts) format indicated above. These are |
| 44 | +*tab-delimited*. Here is the meaning of the rest of the columns: |
| 45 | + |
| 46 | +1. chromosome |
| 47 | +2. position |
| 48 | +3. strand |
| 49 | +4. label |
| 50 | +5. 5mC: estimated fraction of molecules with the 5mC mark |
| 51 | +6. 5hmC: estimated fraction of molecules with the 5mC mark |
| 52 | +7. neither: estimated fraction with neither 5mC nor 5hmC |
| 53 | +8. number of conflicting sites: this integer value is explained below |
| 54 | + |
| 55 | +If only two types of input are available, e.g. `input-bs-seq.counts` |
| 56 | +and `input-oxbs-seq.counts`, then the following command would be |
| 57 | +appropriate: |
| 58 | +```console |
| 59 | +$ dnmtools mlml -u input-bs-seq.counts -m input-oxbs-seq-seq.counts -o result.txt |
| 60 | +``` |
| 61 | +The output would be written to the file `results.txt` because of the |
| 62 | +`-o` argument. Obviously more data is better, and estimates from `mlml` |
| 63 | +are more accurate with the 3 types of data. But in the vast majority |
| 64 | +of cases only one of oxBS-seq or TAB-seq will be available. The |
| 65 | +arguments `-u`, `-m` and `-b` are used to indicate the following: |
33 | 66 |
|
34 | | -If only two types of input are available, e.g. `input-bs-seq.meth and |
35 | | -`input-oxbs-seq.meth`, then use the following command: |
| 67 | +* -u: BS-seq input, think of it as directly informing about "unmethylated" |
| 68 | +* -m: oxBS-seq input, "m" because it gives exactly 5mC information. |
| 69 | +* -h: TAB-seq input, "h" because it gives exactly 5hmC information. |
36 | 70 |
|
37 | | -```shell |
38 | | -$ dnmtools mlml -u input-bs-seq.meth -m input-oxbs-seq-seq.meth -o result.txt |
39 | | -``` |
| 71 | +You must provide at least 2 of these, and be sure to use the |
| 72 | +appropriate argument for each data type. `mlml` works by taking |
| 73 | +advantage of the properties of each data type, so swapping those will |
| 74 | +result in useless output. |
40 | 75 |
|
41 | 76 | In some cases, you might want to specify the convergence tolerance for |
42 | 77 | EM algorithm. This can be done through `-t` option. For example: |
43 | | - |
44 | | -```shell |
45 | | -$ dnmtools mlml -u input-bs-seq.meth -m input-oxbs-seq.meth -o result.txt -t 1e-2 |
| 78 | +```console |
| 79 | +$ dnmtools mlml -u bs-seq.counts -m oxbs-seq.counts -o result.txt -t 1e-2 |
46 | 80 | ``` |
47 | | - |
48 | 81 | This command will make the iteration process stop when the difference |
49 | | -of estimation between two iterations is less than 10^−2 . Thea value |
50 | | -format can be scientific notation, e.g. 1e-5, or float number, e.g. |
51 | | -0.00001. |
52 | | - |
53 | | -The output of `mlml` is tab-delimited format. Here is an example: |
54 | | - |
55 | | -```txt |
56 | | -chr11 15 16 0.166667 0.19697 0.636364 0 |
57 | | -chr12 11 12 0.222222 0 0.777778 2 |
| 82 | +of estimation between two iterations is less than 1e-02. The value |
| 83 | +format can be scientific notation, e.g. 1e-5, or as a fixed point |
| 84 | +fraction, e.g. 0.00001. |
| 85 | + |
| 86 | +Conflicts: To calculate the last column in the output, a binomial test |
| 87 | +is performed for each input methylation level (either 2 or 3 in total |
| 88 | +depending on the number of input files). The tests are done separately |
| 89 | +for estimated maximum likelihood level (for each of 5mC, 5hmC and |
| 90 | +"neither"). In each case, check if the estimate falls outside the |
| 91 | +confidence interval calculated using coverage and observed fraction |
| 92 | +from the input file. If more than one estimate at a given site falls |
| 93 | +outside the confidence interval, we flag it as conflicting. This will |
| 94 | +happen by random chance, but if the same sites show conflicts in |
| 95 | +multiple independent analyses, it might indicate that the sites |
| 96 | +themselves are unreliable in one of the types of experiments. |
| 97 | + |
| 98 | +## Examples |
| 99 | + |
| 100 | +Assume the data files are as follows: |
| 101 | +```console |
| 102 | +$ cat bs.counts |
| 103 | +chr1 100 + CpG 0.95 20 |
| 104 | +chr1 200 + CpG 0.8 20 |
| 105 | +chr1 300 + CpG 0.9 30 |
| 106 | +chr1 400 + CpG 0.5 2 |
| 107 | +$ cat oxbs.counts |
| 108 | +chr1 100 + CpG 0.8 20 |
| 109 | +chr1 200 + CpG 0.6 20 |
| 110 | +chr1 300 + CpG 0.6 10 |
| 111 | +chr1 400 + CpG 0.6 10 |
| 112 | +$ cat tab.counts |
| 113 | +chr1 100 + CpG 0.15 20 |
| 114 | +chr1 200 + CpG 0.2 30 |
| 115 | +chr1 300 + CpG 0.2 5 |
| 116 | +chr1 400 + CpG 0.1 20 |
58 | 117 | ``` |
59 | | - |
60 | | -The columns are chromosome name, start position, end position, 5-mC |
61 | | -level, 5-hmC level, unmethylated level and number of conflicts. To |
62 | | -calculate the last column, a binomial test is performed for each input |
63 | | -methylation level (can be 2 or 3 in total depending on parameters). If |
64 | | -the estimated methylation level falls out of the confidence interval |
65 | | -calculated from input coverage and methylation level, then such event |
66 | | -is counted as one conflict. It is recommended to filter estimation |
67 | | -results based on the number of conflicts; if more conflicts happens on |
68 | | -one site then it is possible that information from such site is not |
69 | | -reliable. |
| 118 | +Then running `mlml` would give the following results: |
| 119 | +```console |
| 120 | +$ dnmtools mlml -bsseq bs.counts -tabseq tab.counts -oxbsseq oxbs.counts |
| 121 | +chr1 100 + CpG 0.8 0.15 0.05 0 |
| 122 | +chr1 200 + CpG 0.6 0.2 0.2 0 |
| 123 | +chr1 300 + CpG 0.635788 0.259378 0.104835 0 |
| 124 | +chr1 400 + CpG 0.565183 0.0939689 0.340848 0 |
| 125 | +``` |
| 126 | +You should be able to calculate the estimates for the 1st and 2nd |
| 127 | +output lines easily in your head. The third and fourth lines are more |
| 128 | +difficult. |
70 | 129 |
|
71 | 130 | ## Options |
72 | 131 |
|
73 | 132 | ```txt |
74 | 133 | -o, -output |
75 | 134 | ``` |
76 | | -output file (default: STDOUT) |
| 135 | +output file (default: standard output) |
77 | 136 |
|
78 | 137 | ```txt |
79 | 138 | -u, -bsseq |
80 | 139 | ``` |
81 | | -input BS-seq methcounts file |
| 140 | +input file for BS-seq counts |
82 | 141 | ```txt |
83 | 142 | -h, -tabseq |
84 | 143 | ``` |
85 | | -input TAB-seq methcounts file |
| 144 | +input file for TAB-seq counts |
86 | 145 | ```txt |
87 | 146 | -m, -oxbsseq |
88 | 147 | ``` |
89 | | -input oxBS-seq methcounts file |
| 148 | +input file for oxBS-seq counts |
90 | 149 | ```txt |
91 | 150 | -t, -tolerance |
92 | 151 | ``` |
93 | | -EM convergence threshold (default: 0.000000) |
| 152 | +convergence tolerance (default: 1e-10) |
94 | 153 | ```txt |
95 | 154 | -a, -alpha |
96 | 155 | ``` |
97 | | -significance level of binomial test for each site (default: 0.050000) |
| 156 | +significance level of binomial test at each site (default: 0.05) |
98 | 157 | ```txt |
99 | 158 | -H, -outh |
100 | 159 | ``` |
101 | | -hmC pseudo methcount output file (default: null) |
| 160 | +5hmC estimated levels at each site (default: none) |
102 | 161 | ```txt |
103 | 162 | -M, -outm |
104 | 163 | ``` |
105 | | -mC pseudo methcount output file (default: null) |
| 164 | +5mC estimated levels at each site (default: none) |
106 | 165 | ```txt |
107 | 166 | -v, -verbose |
108 | 167 | ``` |
109 | | -print more run info to STDERR while the program is running. |
| 168 | +print more information while running |
0 commit comments