Commit cff2153

Updating the docs for counts and for format ahead of v1.2.5
1 parent 1f6d3c2 commit cff2153

2 files changed: 91 additions and 26 deletions

docs/content/counts.md

Lines changed: 26 additions & 8 deletions
@@ -2,7 +2,7 @@
 
 ## Synopsis
 ```console
-$ dnmtools counts [OPTIONS] -c <chroms> <input.sam>
+$ dnmtools counts [OPTIONS] -c <chroms> <input.bam>
 ```
 
 ## Description
@@ -18,7 +18,7 @@ pluripotent mammalian cells such as embryonic stem cells. And possibly
 whatever cells you are studying. The output of `counts` serves as the
 input for many downstream analyses.
 
-The input mapped reads file (`input.sam`) is in SAM/BAM format. The
+The input mapped reads file (`input.bam`) is in SAM/BAM format. The
 reads should be sorted so those mapping to the same chromosome are
 consecutive in the file. Duplicate reads should be probably be
 [removed](../uniq) first, but that depends on your data.
@@ -40,12 +40,12 @@ $ dnmtools counts -c /path/to/genome.fa -o output.meth input.sam
 ```
 
 The argument `-c` gives the name of a FASTA file containing all
-chromosome sequences or a directory that contains one FASTA format
-file for each chromosome. By default `counts` identifies these
-chromosome files by the extension `.fa`. Importantly, the "name" line
-in each chromosome FASTA file must begin with the character `>`
-followed immediately by the same name that identifies that chromosome
-in the SAM output (the `.sam` files). An example of the output and
+chromosome sequences (as of v1.2.5, a directory of separate files is
+no longer supported). Importantly, the "name" line in each chromosome
+FASTA file must begin with the character `>` followed immediately by
+the same name that identifies that chromosome in the SAM output (the
+`.bam` files). If you use the same FASTA format file you used to map
+the reads, everything should be fine. An example of the output and
 explanation of each column follows:
 ```txt
 chr1 1869 + CCG 0 1
@@ -142,6 +142,24 @@ which is not useful unless commands are piped.
 Reference genome file, which must be in FASTA format. This is
 required.
 
+```txt
+-t, -threads
+```
+The number of threads to use. This is only really helpful if the input
+is BAM (not very helpful for SAM), and the output is to be zipped (see
+`-z` below). These threads will help decompress the BAM input and will
+help compress the gzip format output. If only one of these conditions
+holds, using more threads can still help. Because `counts` spends most
+of its computing time processing reads sequentially, there are
+diminishing returns for specifying too many threads.
+
+```txt
+-z, -zip
+```
+The output should be zipped (in gzip format). This is not deduced by
+the filename, but specifying this argument should be accompanied by
+using a `.gz` filename suffix for the output.
+
 ```txt
 -n, -cpg-only
 ```
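
The new `-z` text above says gzip output is not deduced from the output filename, so the flag and the `.gz` suffix have to be kept in step by the caller. One way to do that in a pipeline is sketched below. This is only an illustration: `counts_cmd` is a hypothetical helper, not part of dnmtools, and it prints the command line rather than running it. It uses only the `-t`, `-z`, `-c` and `-o` options that appear in this diff.

```shell
#!/bin/sh
# Hypothetical helper (not part of dnmtools): print a `counts` command line
# in which `-z` is added exactly when the output name ends in `.gz`.
# Arguments: genome FASTA, input BAM/SAM, output file, threads (default 1).
counts_cmd() {
    genome=$1; input=$2; output=$3; threads=${4:-1}
    case "$output" in
        *.gz) zip_flag="-z " ;;   # gzip output requested via the file name
        *)    zip_flag="" ;;      # plain-text output, no -z
    esac
    printf 'dnmtools counts -t %s %s-c %s -o %s %s\n' \
        "$threads" "$zip_flag" "$genome" "$output" "$input"
}

# Dry run: show the command that would be executed.
counts_cmd genome.fa input.bam output.meth.gz 4
```

Printing the command first makes it easy to inspect in a log before running it.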

docs/content/format.md

Lines changed: 65 additions & 18 deletions
@@ -3,7 +3,7 @@
 ## Synopsis
 
 ```shell
-$ dnmtools format [OPTIONS] -f <mapper> <input.sam>
+$ dnmtools format [OPTIONS] -f <mapper> <input.bam> [output.bam]
 ```
 
 ## Description
@@ -16,12 +16,29 @@ important to quantify methylation, as fragments that overlap must
 count the overlapping bases only once and must be treated as
 originating from the same allele. These can be ensured by merging them
 into a single entry. SAM/BAM files generated by abismal, Bismark and
-BSMAP can be formatted using the `format` command. An example use of
-this command to format a mapped reads file is:
+BSMAP can be formatted using the `format` command.
+
+An example use of this command to format a mapped reads file is:
 ```shell
-$ dnmtools format -f abismal -o input-formatted.sam input.sam
+$ dnmtools format -f abismal input.bam output.sam
 ```
 Above, the file `input.sam` would have been generated by `abismal`.
+The file `output.bam` is the output, and an output file is required
+here unless the `-stdout` argument is specified (see below). Another
+example:
+```shell
+$ dnmtools format -f abismal -t 8 -B input.bam output.bam
+```
+This will use 8 threads because of the `-t 8` and will produce output
+in BAM format because of the `-B` flag (not the filename of the
+output).
+
+*Note* As of dnmtools v1.2.5, there is no longer a "buffer size"
+argument. This introduced arbitrary behavior. Now `format` assumes
+reads are sorted by read name, which should ensure mates in paired-end
+sequencing are consecutive in the file. No "buffer" is needed, and
+data that does not conform is more easily detected, making this tool
+more easily detect improperly formatted input.
 
 ## Options
 
@@ -32,33 +49,63 @@ This option indicates the format of the input SAM file, corresponding
 to the mapper that generated it (options: abismal, bsmap, bismark).
 
 ```txt
--o, -output
+-t, -threads
+```
+The number of threads to use. These threads are used for I/O, and are
+most helpful when the input and output are both BAM, where the threads
+can really speed things up.
+
+```txt
+-B, -bam
 ```
-The name of the output file. The output will be in SAM format. By
-default this is standard output.
+The output is in BAM format. This is an option to help prevent
+accidentally writing BAM format to the terminal or through a pipe that
+expects plain text, e.g., SAM.
 
 ```txt
--s, -suffix
+-stdout
+```
+Write the output to standard out. This is not done by default even
+without an output file given, because of the danger of writing BAM to
+the terminal or through a pipe unexpectedly. It is possible to write
+BAM redirected or through a pipe, but the `-stdout` argument is
+required.
+
+```txt
+-s, -suff
 ```
 The length of the suffix for read names, which indicates whether the
-read is from end 1 or end 2 (default: 1).
+read is from end 1 or end 2 for paired-end reads. If this is not
+specified, but the data is paired end (i.e., the flag `-single-end` is
+not used; see below), then the length of this suffix is inferred.
+
+```txt
+-single-end
+```
+Using this argument tells `format` not to look for mates to merge as a
+single fragment. The default assumption is that data is paired-ended
+and that mates are consecutive in the input.
 
 ```txt
 -L, -max-frag
 ```
 The maximum allowed insert size in base-pairs (default:
-10000). Normally this parameter is set at the mapping step, but
-`format` can also reject reads that are in opposing strands in the
-same chromosome but map more than "max-frag" bases apart.
+unlimited). Normally this parameter is determined during read mapping,
+but `format` can also reject reads that are in opposing strands in the
+same chromosome but map more than this many bases apart.
 
 ```txt
--B, -buf-size
+-F, -force
 ```
-Maximum buffer size (default: 10000). This is the maximum
-number of reads retained before mates are no longer seeked for
-reads. If more than "buf-size" reads are not a proper mate of a given
-read, the read is printed as-is and reported as single-end. This value
-has no effect if the input is single-end.
+This option "forces" the `format` command to process paired-end reads
+even if it is unable to detect mates. Without this argument, failure
+to detect mates will cause `format` to terminate. This option is
+useful, for example, if the reads were paired-ended, but the second
+end is of such low quality that only reads from the first end were
+mapped. In a data analysis pipeline, it might not be apparent that one
+of two ends failed entirely, so providing this option can help. If you
+are only analyzing a small number of data sets, you probably want to
+be made aware of this problem rather than force it to be ignored.
 
 ```txt
 -v, -verbose
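
The revised `format` docs above say BAM output is selected by the `-B` flag, not by the output filename. A wrapper that wants the two to agree could derive the flag from the name, roughly as sketched below. This is only an illustration: `format_cmd` is a hypothetical helper, not part of dnmtools, and it prints the command instead of running it. It uses only the `-f`, `-t` and `-B` options and the positional input and output arguments shown in this diff.

```shell
#!/bin/sh
# Hypothetical helper (not part of dnmtools): print a `format` command line
# in which `-B` is added exactly when the output name ends in `.bam`.
# Arguments: mapper name, input file, output file, threads (default 1).
format_cmd() {
    mapper=$1; input=$2; output=$3; threads=${4:-1}
    case "$output" in
        *.bam) bam_flag="-B " ;;   # BAM output must be requested explicitly
        *)     bam_flag="" ;;      # otherwise the output is SAM text
    esac
    printf 'dnmtools format -f %s -t %s %s%s %s\n' \
        "$mapper" "$threads" "$bam_flag" "$input" "$output"
}

# Dry run: reproduces the second example from the Description section.
format_cmd abismal input.bam output.bam 8
```

The helper always passes an output file, which matches the documented requirement that one be given unless `-stdout` is used.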
