Commit b397756

Merge pull request #160 from smithlabcode/guessprotocol-docs-update
guessprotocol: updating the docs for the new model and output format
2 parents 8681af7 + d9c346d commit b397756

1 file changed: +79 -27 lines changed

docs/content/guessprotocol.md

Lines changed: 79 additions & 27 deletions
@@ -9,46 +9,98 @@ $ dnmtools guessprotocol [OPTIONS] <file-1.fastq> [<file-2.fastq>]

 Mapping a WGBS dataset requires knowledge of the sequencing protocol
 generated to process the data. This may not be properly documented
-where the data was obtained, so we created a tool that guesses it
-based on the nucleotide content on one or two FASTQ files.
-
-The `guessprotocol` tool counts the number of As, Cs, Gs and Ts in
-each end of the dataset and reports the protocol that is closest to
-the nucleotide frequency expectations. outputs a single line
-determining if the dataset is
-* (1) T-rich, where, in end 1, half of the bases are Ts and there are
-very few Cs. In end 2, half of the bases are As and there are very few
-Gs
-* (2) A-rich, where, in end 1, half of the bases are As and there
-are very few Gs and, in end 2, half of the bases are Ts and there
-are very few Cs
-* (3) Random PBAT, where complementary reads have complementary
-bisulfite bases (e.g. if end 1 is T-rich, end 2 is A-rich), the
-bisulfite base in each end is random.
-* (4) Unknwon, if, based on the nucleotide frequencies, the protocol
-cannot be determined (or the read is not WGBS).
+where the data was obtained, so we created this command to guess the
+protocol based on the nucleotide content in the input FASTQ file (or
+files, for paired-end).
+
+The `guessprotocol` tool uses two models of nucleotide content
+following bisulfite conversion and applies these models to each
+read. One model is for WGBS, and the other is for PBAT. For each read,
+both models are applied, and the result is a probability for whether
+the read (or read pair) was generated using WGBS or PBAT. Once the
+requested number of reads is processed, the aggregate results for all
+reads are used to guess whether the protocol used to generate the data
+was WGBS, PBAT or rPBAT. The criteria are roughly as follows: if most
+of the reads look like they are from WGBS, then we conclude WGBS. If
+most of the reads look like they are from PBAT, then we conclude
+PBAT. If the result is more towards the middle, then we conclude
+rPBAT.
+
+More details: the number of As, Cs, Gs and Ts differs depending on
+whether the protocol is WGBS (traditional WGBS or MethylC-seq), PBAT
+(post-bisulfite adaptor tagging) or rPBAT (random PBAT).
+
+* For WGBS, a single-end sequenced read should be T-rich, and if the
+data is paired-end, read1 is T-rich and read2 is A-rich.
+* For PBAT, a single-end sequenced read should be A-rich, and if the
+data is paired-end, read1 is A-rich and read2 is T-rich.
+* For rPBAT, we have a random mix of the above situations. However, in
+practice it seems almost never to be 50% each.
+
+In most cases, when the data is WGBS or PBAT, it is very obvious which
+is the protocol used.
+
+As of dnmtools v1.4.1, `guessprotocol` always reaches a conclusion,
+but includes a confidence level.

 The output of `guessprotocol` is useful prior to mapping. For example,
 it can be used to decide whether or not to map with the `-R` flag (for
 "random PBAT") when using
 [abismal](https://github.com/smithlabcode/abismal).

-For paired-end data, `guessprotocol` finds read mates by finding
-identical read names. Some datasets finish the read name with
-identifiers like .1 on end 1 and .2 on end 2, thus making the read
+For paired-end data, `guessprotocol` ensures reads are mates by
+finding identical read names. Some datasets finish the read name with
+identifiers like ".1" on end 1 and ".2" on end 2, thus making the read
 names technically different at the last two characters. You can tell
 the program to ignore a certain suffix size (like size 2 in this
 example) when matching read names using the `-i` flag.

+The output includes the following values in a YAML format:
+* `protocol`: this is the guessed protocol (wgbs, pbat or rpbat) based
+on the content of the reads.
+* `confidence`: indicates the level of confidence in the guess for the
+protocol (values: low or high).
+* `layout`: indicates whether the supplied reads were paired or
+single-ended.
+* `n_reads_wgbs`: the average number of reads (for single-ended reads)
+or read pairs (for paired reads) where read1 is determined by the
+model to be T-rich.
+* `n_reads`: the number of evaluated reads or read pairs.
+* `wgbs_fraction`: the probability that a read (for single-ended
+reads) or the read1 of a read pair (for paired reads) is T-rich.
+
 ## Options
-```txt
--n -nreads
 ```
-number of reads in initial check. The program stops after collecting
-statistics for the first `n` reads (default: 1,000,000)
+-n, -nreads
+```
+Number of reads to check. The program stops after collecting
+statistics for the first `n` reads (default: 1,000,000). Fewer than
+the default are usually sufficient, but increase this value if you
+suspect reads at the start of the file might be problematic.

 ```txt
 -i -ignore
 ```
-length of the read name suffix to ignore when matching
-## Options
+Length of the read name suffix to ignore when matching read names to
+ensure mates are correctly synchronized when the data is paired-end.
+
+```
+-b, -bisulfite
+```
+Assumed bisulfite conversion rate for the models (default: 0.98).
+
+```
+-H, -human
+```
+Use human genome nucleotide frequencies. A good assumption for samples
+from a mammal.
+
+```
+-o, -output
+```
+The output file name.
+
+```
+-v, -verbose
+```
+Report available information during the run.
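
The suffix handling behind the `-i` flag, as described in the diff above, can be illustrated with a short Python sketch. This is not the dnmtools implementation: the function name `names_match` and the read names used with it are hypothetical, and the sketch only shows the idea of comparing two read names after dropping a fixed number of trailing characters.

```python
def names_match(name1: str, name2: str, ignore: int = 0) -> bool:
    """Return True if two read names agree once the last `ignore`
    characters of each are dropped (hypothetical helper, for illustration)."""
    if ignore < 0:
        raise ValueError("ignore must be non-negative")
    # Slice by explicit length rather than name[:-ignore], because
    # name[:-0] would yield an empty string instead of the full name.
    core1 = name1[: max(len(name1) - ignore, 0)]
    core2 = name2[: max(len(name2) - ignore, 0)]
    return core1 == core2
```

With `ignore=2`, names such as `read7.1` and `read7.2` compare equal; with the default of 0 they do not, which is the mismatch the documentation warns about.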

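As a sketch of how the documented output could drive the mapping step, the following Python reads the flat `key: value` fields listed in the diff and reports whether the `-R` flag is relevant. The sample text is made up for illustration (real output may differ in formatting and values), both function names are hypothetical, and treating only `rpbat` as the trigger for `-R` is an assumption based on the description of that flag as being for "random PBAT".

```python
def parse_guessprotocol_output(text: str) -> dict:
    """Parse flat 'key: value' lines into a dict of strings.

    Assumes the output is a simple, unnested YAML mapping, so no YAML
    library is needed for this sketch.
    """
    fields = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or ":" not in line:
            continue
        key, value = line.split(":", 1)
        fields[key.strip()] = value.strip()
    return fields


def needs_random_pbat_flag(fields: dict) -> bool:
    """True when the guessed protocol is rPBAT (assumed trigger for -R)."""
    return fields.get("protocol", "").lower() == "rpbat"


# Illustrative values only; this is not real program output.
example = """\
protocol: rpbat
confidence: low
layout: paired
n_reads_wgbs: 612345
n_reads: 1000000
wgbs_fraction: 0.612345
"""

fields = parse_guessprotocol_output(example)
use_r_flag = needs_random_pbat_flag(fields)
```

All values are kept as strings here; a caller that needs `wgbs_fraction` or `n_reads` as numbers would convert them explicitly.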