@@ -9,46 +9,98 @@ $ dnmtools guessprotocol [OPTIONS] <file-1.fastq> [<file-2.fastq>]
99
1010Mapping a WGBS dataset requires knowledge of the sequencing protocol
1111generated to process the data. This may not be properly documented
12- where the data was obtained, so we created a tool that guesses it
13- based on the nucleotide content on one or two FASTQ files.
14-
15- The ` guessprotocol ` tool counts the number of As, Cs, Gs and Ts in
16- each end of the dataset and reports the protocol that is closest to
17- the nucleotide frequency expectations. outputs a single line
18- determining if the dataset is
19- * (1) T-rich, where, in end 1, half of the bases are Ts and there are
20- very few Cs. In end 2, half of the bases are As and there are very few
21- Gs
22- * (2) A-rich, where, in end 1, half of the bases are As and there
23- are very few Gs and, in end 2, half of the bases are Ts and there
24- are very few Cs
25- * (3) Random PBAT, where complementary reads have complementary
26- bisulfite bases (e.g. if end 1 is T-rich, end 2 is A-rich), the
27- bisulfite base in each end is random.
28- * (4) Unknwon, if, based on the nucleotide frequencies, the protocol
29- cannot be determined (or the read is not WGBS).
12+ where the data was obtained, so we created this command to guess the
13+ protocol based on the nucleotide content in the input FASTQ file (or
14+ files, for paired-end).
15+
16+ The ` guessprotocol ` tool uses two models of nucleotide content
17+ following bisulfite conversion and applies both models to each
18+ read. One model is for WGBS, and the other is for PBAT. For each read
19+ (or read pair), the result is a probability that it was generated
20+ using WGBS rather than PBAT. Once the
21+ requested number of reads is processed, the aggregate results for all
22+ reads are used to guess whether the protocol used to generate the data
23+ was WGBS, PBAT or rPBAT. The criteria are roughly as follows: if most
24+ of the reads look like they are from WGBS, then we conclude WGBS. If
25+ most of the reads look like they are from PBAT, then we conclude
26+ PBAT. If the result is more towards the middle, then we conclude
27+ rPBAT.
28+
29+ More details: the number of As, Cs, Gs and Ts differs depending on
30+ whether the protocol is WGBS (traditional WGBS or MethylC-seq), PBAT
31+ (post-bisulfite adaptor tagging), or rPBAT (random PBAT).
32+
33+ * For WGBS, a single-end sequenced read should be T-rich, and if the
34+ data is paired-end, read1 is T-rich and read2 is A-rich.
35+ * For PBAT, a single-end sequenced read should be A-rich, and if the
36+ data is paired-end, read1 is A-rich and read2 is T-rich.
37+ * For rPBAT, we have a random mix of the above situations. However, in
38+ practice the mix seems almost never to be 50% of each.
39+
40+ In most cases, when the data is WGBS or PBAT, it is very obvious which
41+ protocol was used.
42+
43+ As of dnmtools v1.4.1, ` guessprotocol ` always reaches a conclusion,
44+ but reports a confidence level with it.
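
The decision rule described above can be illustrated with a small Python sketch. This is not the probabilistic model that `guessprotocol` uses: the scoring heuristic, the function names and the `low`/`high` thresholds are all assumptions chosen only to show the idea of classifying each read as T-rich or A-rich and then aggregating over reads.

```python
from collections import Counter


def t_rich_score(read: str) -> int:
    """Positive when the read looks T-rich (many Ts, few Cs),
    negative when it looks A-rich (many As, few Gs).
    Hypothetical heuristic, not the dnmtools model."""
    counts = Counter(read.upper())
    return (counts["T"] - counts["C"]) - (counts["A"] - counts["G"])


def guess_protocol(reads, low=0.2, high=0.8):
    """Guess wgbs, pbat or rpbat from single-end reads (or read1 of pairs).

    `low` and `high` are illustrative thresholds, not values from dnmtools.
    Returns the guess and the fraction of reads that look T-rich.
    """
    n_reads = 0
    n_t_rich = 0
    for read in reads:
        n_reads += 1
        if t_rich_score(read) > 0:
            n_t_rich += 1
    if n_reads == 0:
        raise ValueError("no reads to evaluate")
    fraction = n_t_rich / n_reads
    if fraction >= high:
        return "wgbs", fraction   # most reads look T-rich
    if fraction <= low:
        return "pbat", fraction   # most reads look A-rich
    return "rpbat", fraction      # result is towards the middle
```

A real implementation would weigh each read by a probability rather than a hard yes/no call, which is what the description of the two models suggests.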
3045
3146The output of ` guessprotocol ` is useful prior to mapping. For example,
3247it can be used to decide whether or not to map with the ` -R ` flag (for
3348"random PBAT") when using
3449[ abismal] ( https://github.com/smithlabcode/abismal ) .
3550
36- For paired-end data, ` guessprotocol ` finds read mates by finding
37- identical read names. Some datasets finish the read name with
38- identifiers like .1 on end 1 and .2 on end 2, thus making the read
51+ For paired-end data, ` guessprotocol ` ensures reads are mates by
52+ finding identical read names. Some datasets finish the read name with
53+ identifiers like ".1" on end 1 and ".2" on end 2, thus making the read
3954names technically different at the last two characters. You can tell
4055the program to ignore a certain suffix size (like size 2 in this
4156example) when matching read names using the ` -i ` flag.
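
As a rough illustration of what ignoring a suffix means when matching read names, the following Python sketch compares two names after dropping a fixed number of trailing characters. The function name and the example read names are hypothetical, and the actual comparison inside `guessprotocol` may differ.

```python
def names_match(name1: str, name2: str, ignore: int = 0) -> bool:
    """Compare two read names, ignoring the last `ignore` characters of each.

    Mirrors the idea behind the `-i` flag; this is a sketch, not the
    dnmtools implementation.
    """
    if ignore < 0:
        raise ValueError("ignore must be non-negative")
    if ignore > 0:
        # Drop the suffix (for example ".1" and ".2") before comparing.
        name1 = name1[:-ignore]
        name2 = name2[:-ignore]
    return name1 == name2
```

With a suffix size of 2, hypothetical names such as `SRR0000001.7.1` and `SRR0000001.7.2` compare as equal, while they differ when no suffix is ignored.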
4257
58+ The output includes the following values in a YAML format:
59+ * ` protocol ` : this is the guessed protocol (wgbs, pbat or rpbat) based
60+ on the content of the reads.
61+ * ` confidence ` : indicates the level of confidence in the guess for the
62+ protocol (values: low or high).
63+ * ` layout ` : indicates whether the supplied reads were paired or
64+ single-ended.
65+ * ` n_reads_wgbs ` : the average number of reads (for single-ended reads)
66+ or read pairs (for paired reads) where read1 is determined by the
67+ model to be T-rich.
68+ * ` n_reads ` : the number of evaluated reads or read pairs.
69+ * ` wgbs_fraction ` : the probability that a read (for single-ended
70+ reads) or the read1 of a read pair (for paired reads) is T-rich.
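
To show how these values might be consumed before mapping, here is a minimal Python sketch that reads a flat `key: value` report and decides whether random PBAT mapping is indicated. The sample report, its values and the helper names are hypothetical; only the key names come from the list above, and a full YAML parser would be the more robust choice for real output.

```python
def parse_flat_report(text: str) -> dict:
    """Parse flat `key: value` lines into a dict of strings.

    Assumes the report has no nesting; this is a simplification.
    """
    report = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition(":")
        if sep:
            report[key.strip()] = value.strip()
    return report


def wants_random_pbat(report: dict) -> bool:
    """True when the guessed protocol is rpbat, the case in which
    mapping with the random PBAT flag would be considered."""
    return report.get("protocol") == "rpbat"


# Hypothetical report with made-up values, for illustration only.
SAMPLE_REPORT = """\
protocol: rpbat
confidence: low
n_reads_wgbs: 612
n_reads: 1000
wgbs_fraction: 0.612
"""
```

Checking `confidence` alongside `protocol` is a reasonable precaution, since a low-confidence guess may deserve a second look at more reads.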
71+
4372## Options
44- ``` txt
45- -n -nreads
4673```
47- number of reads in initial check. The program stops after collecting
48- statistics for the first ` n ` reads (default: 1,000,000)
74+ -n, -nreads
75+ ```
76+ Number of reads to check. The program stops after collecting
77+ statistics for the first ` n ` reads (default: 1,000,000). Fewer than
78+ the default are usually sufficient, but increase this value if you
79+ suspect reads at the start of the file might be problematic.
4980
5081``` txt
5182 -i -ignore
5283```
53- length of the read name suffix to ignore when matching
54- ## Options
84+ Length of the read name suffix to ignore when matching read names to
85+ ensure mates are correctly synchronized when the data is paired-end.
86+
87+ ```
88+ -b, -bisulfite
89+ ```
90+ Assumed bisulfite conversion rate for the models (default: 0.98).
91+
92+ ```
93+ -H, -human
94+ ```
95+ Use human genome nucleotide frequencies. This is a good assumption
96+ for samples from a mammal.
97+
98+ ```
99+ -o, -output
100+ ```
101+ The output file name.
102+
103+ ```
104+ -v, -verbose
105+ ```
106+ Report available information during the run.