
Commit 327ff84

Merge pull request #72 from hasindu2008/dev
Dev
2 parents dd284ee + 2a87172 commit 327ff84

File tree

9 files changed: +271, -29 lines

docs/archive.md

Lines changed: 225 additions & 1 deletion
@@ -1 +1,225 @@
-# Paranoid's guide on using slow5 for archiving
+# A pedant's guide on using slow5 for archiving

BLOW5 format is not only useful for [fast signal-level analysis](https://www.nature.com/articles/s41587-021-01147-4); it is also a great alternative for archiving raw nanopore signal data. You can save heaps of space: for instance, our in-house data generated at the Garvan Institute during 2021 comprised 245 TB of FAST5 files, which were [reduced to just 109 TB after conversion to BLOW5](https://twitter.com/GenTechGp/status/1505756167586611200). A concern one might have is whether the integrity of the original data is retained with BLOW5. We assure users that FAST5-to-BLOW5 conversion is fully lossless and can be converted back to FAST5 at any time. Still, many users (including us) can be a bit paranoid, and it's good to be pedantic when it comes to potential data loss.

This vignette walks you through a number of sanity checks that address those concerns. Note that some of these tests are redundant and very time-consuming; depending on how pedantic you are, you can pick the tests you want.

## Suggested conversion structure

We suggest converting all the FAST5 files from a single sequencing run into a single merged BLOW5 file, which is very convenient to manage. The examples in this article are centred around this conversion structure. However, this is our personal preference; you can have any number of BLOW5 files per sequencing run, or even merge multiple sequencing runs into one BLOW5 file. We suggest not merging drastically different runs (e.g., a PromethION run with a MinION run, or a run done in 2020 with one done in 2022) into one BLOW5 file for archiving purposes, because the FAST5 file structure can vary significantly and mixing them up makes things unnecessarily complex. For analysis purposes, however, you can merge as you wish, since you still have the original files if something goes wrong.
10+
11+
For a given sequencing run, call `slow5tools f2s` to convert each FAST5 file to BLOW5. Then, merge all SLOW5 files into a single file using `slow5tools merge`. Given below is a bash code snippet:
12+
13+
```bash
14+
FAST5_DIR=/path/to/input/fast5/
15+
DIR_TMP_SLOW5=/path/to/temporary/slow5/
16+
SLOW5_FILE=/path/to/merged.blow5
17+
NUM_THREADS=8
18+
19+
slow5tools f2s ${FAST5_DIR} -d ${DIR_TMP_SLOW5} -p ${NUM_THREADS}
20+
slow5tools merge ${DIR_TMP_SLOW5}/ -o ${SLOW5_FILE} -t${NUM_THREADS}
21+
22+
rm -r ${DIR_TMP_SLOW5}
23+
```

## Sanity check by counting the number of reads

One of the simplest sanity checks is to verify that the read count in the merged BLOW5 file matches that of the input FAST5 files. Getting the number of read records in a merged BLOW5 file is quite simple: the bash command below extracts the record count using `slow5tools stats` and saves it into a variable:

```bash
NUM_SLOW5_READS=$(slow5tools stats $SLOW5_FILE | grep "number of records" | awk '{print $NF}')
```

Now this `NUM_SLOW5_READS` variable can be compared with the count from the FAST5 files. Unfortunately, we are not aware of a straightforward way to quickly get the number of records from a set of FAST5 files (please let us know if there is one). So far, the most accurate (but time-consuming) method we have found is to use the `strings` command to print all the strings in a FAST5 file, and then `grep` for strings of the form `read_.*-.*-.*`, which represent UUID-based read IDs. If you have GNU parallel (`apt-get install parallel`), this can be done in parallel for multiple FAST5 files in a directory and summed up using `awk`:

```bash
NUM_READS=$(find $FAST5_DIR -name '*.fast5' | parallel -I% --max-args 1 strings % | grep "read_.*-.*-.*" | wc -l | awk 'BEGIN {count=0;} {count=count+$0} END {print count;}')
```

If GNU parallel is not available, this can be done serially (slower, obviously) using `xargs`:

```bash
NUM_READS=$(find $FAST5_DIR -name '*.fast5' | xargs --max-args 1 strings | grep "read_.*-.*-.*" | wc -l | awk 'BEGIN {count=0;} {count=count+$0} END {print count;}')
```

Now, we can simply compare `NUM_READS` to `NUM_SLOW5_READS` and see if they are equal. A bash code snippet for this comparison is given below:

```bash
if [ ${NUM_READS} -ne ${NUM_SLOW5_READS} ]
then
    echo "ERROR: Sanity check failed. $NUM_READS reads in FAST5, but $NUM_SLOW5_READS reads in SLOW5"
    exit 1
else
    echo "$NUM_READS reads in FAST5, $NUM_SLOW5_READS reads in SLOW5"
fi
```

A bash function combining the sanity check discussed above, which you can directly copy-paste into your bash script, is given below:

```bash
sanity_check_fast5_num_reads(){

    FAST5_DIR=$1
    NUM_SLOW5_READS=$2

    test -d ${FAST5_DIR} || die "$FAST5_DIR not found"

    # use GNU parallel if available, otherwise fall back to serial xargs
    if parallel --version > /dev/null 2>&1
    then
        NUM_READS=$(find $FAST5_DIR -name '*.fast5' | parallel -I% --max-args 1 strings % | grep "read_.*-.*-.*" | wc -l | awk 'BEGIN {count=0;} {count=count+$0} END {print count;}')
    else
        NUM_READS=$(find $FAST5_DIR -name '*.fast5' | xargs --max-args 1 strings | grep "read_.*-.*-.*" | wc -l | awk 'BEGIN {count=0;} {count=count+$0} END {print count;}')
    fi

    if [ ${NUM_READS} -ne ${NUM_SLOW5_READS} ]
    then
        echo "ERROR: Sanity check failed. $NUM_READS reads in FAST5, but $NUM_SLOW5_READS reads in SLOW5"
        exit 1
    else
        echo "$NUM_READS reads in FAST5, $NUM_SLOW5_READS reads in SLOW5"
    fi

}
```

Simply call it as `sanity_check_fast5_num_reads $FAST5_DIR $NUM_SLOW5_READS` in your bash script, assuming you have set `FAST5_DIR` and `NUM_SLOW5_READS` as discussed above.
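
The bash functions in this vignette call a `die` helper that is never defined in the text; a minimal sketch of such a helper (our own definition, not part of slow5tools) is:

```shell
# minimal error helper assumed by the sanity-check functions in this guide:
# print the message to stderr and abort with a non-zero status
die() {
    echo "$1" >&2
    exit 1
}
```

Place it near the top of your script so it is defined before any of the sanity-check functions run.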

If you think this is too time-consuming, a quick way to estimate the count is to take the number of reads in one of the FAST5 files in the sequencing run and multiply it by the number of FAST5 files. But note that the number of reads in FAST5 files can vary from file to file, so in this case we will find the largest FAST5 file, grab its read count and multiply it by the number of FAST5 files:

```bash
# find the largest fast5 file by file size
LARGEST_FAST5=$(find ${FAST5_DIR} -name '*.fast5' -printf "%s\t%p\n" | sort -n | tail -1 | awk '{print $2}')
# a hacky way to get the number of reads
NUM_READS_IN_FAST5=$(strings ${LARGEST_FAST5} | grep "read_.*-.*-.*" | wc -l)
# get the number of fast5 files
NUMFAST5=$(find $FAST5_DIR -name '*.fast5' | wc -l)
# now do the multiplication
NUM_FAST5_READS=$(echo "($NUMFAST5)*($NUM_READS_IN_FAST5)" | bc)
```

This `NUM_FAST5_READS` will obviously be an over-estimate, as we used the largest file. So we check whether the read count in SLOW5 is within a percentage threshold:

```bash
# get the percentage of NUM_SLOW5_READS over NUM_FAST5_READS
PASS_PERCENT=$(echo "(($NUM_SLOW5_READS/$NUM_FAST5_READS)*100)" | bc -l)
# truncate PASS_PERCENT to an integer so it can be compared in bash
PASS_PERCENTINT=$(echo "$PASS_PERCENT/1" | bc)

# the test is done only if the fast5 file count is large enough, NUMFAST5>20 in this example
# if PASS_PERCENTINT<95, it is considered failed in this example
if [ ${NUMFAST5} -gt 20 ] && [ ${PASS_PERCENTINT} -lt 95 ]
then
    echo "ERROR: Sanity check failed. Only $NUM_SLOW5_READS reads in SLOW5 out of $NUM_FAST5_READS estimated raw reads in FAST5 (${PASS_PERCENT}%)"
    exit 1
else
    echo "$NUM_SLOW5_READS reads in SLOW5, $NUM_FAST5_READS estimated raw reads in FAST5 (${PASS_PERCENT}%)"
fi
```

Note that this test, despite being fast, can give false-positive errors. A better approach is therefore to fall back to the accurate method discussed above (the `sanity_check_fast5_num_reads` function) if this quick test fails. A bash function that does all of this:

```bash
sanity_check_fast5_num_reads_estimate(){

    FAST5_DIR=$1
    NUM_SLOW5_READS=$2

    test -d ${FAST5_DIR} || die "$FAST5_DIR not found"

    # estimate the number of reads from the largest multi-fast5 file
    LARGEST_FAST5=$(find ${FAST5_DIR} -name '*.fast5' -printf "%s\t%p\n" | sort -n | tail -1 | awk '{print $2}')
    NUM_READS_IN_FAST5=$(strings ${LARGEST_FAST5} | grep "read_.*-.*-.*" | wc -l)
    NUMFAST5=$(find $FAST5_DIR -name '*.fast5' | wc -l)
    NUM_FAST5_READS=$(echo "($NUMFAST5)*($NUM_READS_IN_FAST5)" | bc)

    PASS_PERCENT=$(echo "(($NUM_SLOW5_READS/$NUM_FAST5_READS)*100)" | bc -l)
    PASS_PERCENTINT=$(echo "$PASS_PERCENT/1" | bc)

    if [ ${NUMFAST5} -gt 20 ] && [ ${PASS_PERCENTINT} -lt 95 ]
    then
        echo "Estimated sanity check failed - only $NUM_SLOW5_READS reads in SLOW5 out of $NUM_FAST5_READS estimated raw reads in FAST5 (${PASS_PERCENT}%). Trying the accurate method."
        sanity_check_fast5_num_reads $FAST5_DIR $NUM_SLOW5_READS
    else
        echo "$NUM_SLOW5_READS reads in SLOW5, $NUM_FAST5_READS estimated raw reads in FAST5 (${PASS_PERCENT}%)"
    fi

}
```

## Sanity check through read ID uniqueness

What if the read counts match, but one file got merged twice (note: this is extremely unlikely)? You can eliminate this doubt by checking that the read IDs are all unique.
The easiest way is to call `slow5tools index` on your merged BLOW5 file, as it will fail if there are duplicate read IDs. Since `slow5tools index` goes through the whole BLOW5 file, this test also checks, as a bonus, for the unlikely scenario of the file being corrupted or truncated. The generated index can be archived as well if necessary, so that the indexing step can be skipped later during analysis. A code snippet follows:

```bash
slow5tools index ${SLOW5_FILE} || { echo "Indexing failed"; exit 1; }
```

A more expensive and time-consuming uniqueness test can be done using `slow5tools view`. Unlike the index command, `slow5tools view` will decompress and parse each and every record, so this test verifies that every bit of every record is intact. The following bash snippet uses `slow5tools view` to print all the read IDs, sorts them, counts the occurrences of each read ID using `uniq -c`, sorts by count, and then uses `awk` to check that the largest count is still 1:

```bash
slow5tools view -t ${NUM_THREADS} $SLOW5_FILE | grep -v "^[#@]" | awk '{print $1}' | sort | uniq -c | sort -rn | awk '{if($1!=1){print "Duplicate read ID found",$2; exit 1}}' || { echo "ERROR: Sanity check failed. Duplicate reads in SLOW5"; exit 1; }
```
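
The `sort | uniq -c | sort -rn | awk` idiom above can be exercised on toy input, independent of slow5tools, to see how a duplicate is flagged:

```shell
# toy demo: 'b' appears twice, so the awk check exits non-zero and the || branch runs
printf 'a\nb\nb\nc\n' | sort | uniq -c | sort -rn \
    | awk '{if($1!=1){print "Duplicate ID found",$2; exit 1}}' \
    || echo "duplicate detected"
```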

## Sanity check by basecalling

The ultimate test is to basecall the original FAST5 files and the FAST5 files regenerated from the created BLOW5 file, and then compare the two basecalling outputs.

First, let us split the merged BLOW5 file into multiple smaller BLOW5 files so that `slow5tools s2f` can convert them to FAST5 in parallel. The following code snippet does the splitting and the conversion to FAST5:

```bash
# get the number of read groups in the SLOW5
NUM_READ_GROUPS=$(slow5tools stats ${SLOW5_FILE} | head | grep "number of read groups" | awk '{print $NF}')

# if there are multiple read groups, first split by read group, and then into 4000-read SLOW5 files
if [ ${NUM_READ_GROUPS} -ne 1 ]
then
    slow5tools split ${SLOW5_FILE} -d split_groups_tmp/ -g -t $NUM_THREADS
    slow5tools split split_groups_tmp/ -d split_reads_tmp/ -r 4000 -t $NUM_THREADS
    rm -r split_groups_tmp/
# if there is only one read group, directly split into 4000-read SLOW5 files
else
    slow5tools split ${SLOW5_FILE} -d split_reads_tmp/ -r 4000 -t $NUM_THREADS
fi

# convert the split slow5 files to fast5
slow5tools s2f split_reads_tmp -d s2f_fast5/ -p $NUM_THREADS
rm -r split_reads_tmp/
```

Now basecall the original FAST5 files as well as the reconverted FAST5 files. Example commands follow, but make sure to set the basecalling profile to match your dataset, and the CPU/GPU device based on your system.

```bash
guppy_basecaller -c dna_r9.4.1_450bps_fast.cfg -i ${FAST5_DIR} -s fast5_basecalls/ -r --device cuda:all
guppy_basecaller -c dna_r9.4.1_450bps_fast.cfg -i s2f_fast5/ -s s2f_fast5_basecalls/ -r --device cuda:all
rm -r s2f_fast5
```

Now we can check whether the basecalling outputs are the same. The order of reads produced by the basecaller is not deterministic, so make sure you sort the outputs before comparing them using `diff`. If the diff passes, the data in the BLOW5 file are identical to those in the FAST5 files.
Given below is a bash function that you can directly copy-paste into your bash script and call as `compare_basecalls fast5_basecalls/ s2f_fast5_basecalls/`:

```bash
compare_basecalls (){
    A=$1
    B=$2

    test -e $A || die "$A not present."
    test -e $B || die "$B not present."

    # sort the fastq records by read ID because the output order from guppy is not deterministic
    find $A -name '*.fastq' -exec cat {} + | paste - - - - | sort -k1,1 | tr '\t' '\n' > a.fastq
    find $B -name '*.fastq' -exec cat {} + | paste - - - - | sort -k1,1 | tr '\t' '\n' > b.fastq

    diff -q a.fastq b.fastq || { echo "Basecalls differ"; exit 1; }

    # strip out the file name column and sort before comparing
    cut -f2,3,5- $A/sequencing_summary.txt | sort -k1 > a.txt
    cut -f2,3,5- $B/sequencing_summary.txt | sort -k1 > b.txt

    diff -q a.txt b.txt || { echo "Sequencing summary files differ"; exit 1; }

}
```
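
The `paste - - - - | sort -k1,1 | tr '\t' '\n'` trick in the function above works by folding each 4-line FASTQ record onto a single tab-separated line, so that records sort as atomic units; a toy demonstration with two hypothetical records:

```shell
# two FASTQ records out of order become sorted by read ID
printf '@r2\nGGTT\n+\n!!!!\n@r1\nACGT\n+\n####\n' \
    | paste - - - - | sort -k1,1 | tr '\t' '\n'
```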

Note, however, that this diff can sometimes raise false errors, because basecallers can produce slightly different outputs under various circumstances (see https://github.com/hasindu2008/slow5tools/issues/70). We recently came across a situation where Guppy 4.4.1 on a system with multiple different GPUs (a GeForce 3090 and a 3070) produced slightly different results, even on the same FAST5 input when run multiple times.
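
If exact diffs keep failing for that reason, a looser fallback (our own suggestion, not part of slow5tools) is to compare only the sorted read-ID sets of the two outputs, which still verifies that every read survived the round trip even when base and quality strings differ slightly. The `a.fastq` and `b.fastq` names below stand for the sorted FASTQ files produced by `compare_basecalls`; the demo lines creating them are stand-ins for real basecall outputs:

```shell
# demo inputs standing in for the sorted basecall outputs a.fastq and b.fastq;
# the two runs differ slightly in bases/qualities but contain the same reads
printf '@r1 ch=1\nACGT\n+\n!!!!\n@r2 ch=1\nGGTA\n+\n!!!!\n' > a.fastq
printf '@r1 ch=1\nACGA\n+\n!!!+\n@r2 ch=1\nGGTT\n+\n!!!!\n' > b.fastq

# compare only the read IDs, ignoring bases and qualities
awk 'NR%4==1{print $1}' a.fastq | sort > a.ids
awk 'NR%4==1{print $1}' b.fastq | sort > b.ids
diff -q a.ids b.ids || { echo "Read ID sets differ"; exit 1; }
```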

docs/commands.md

Lines changed: 17 additions & 1 deletion
@@ -12,6 +12,8 @@
     Create an index for a SLOW5/BLOW5 file.
 * `merge`:
     Merge multiple SLOW5/BLOW5 files to a single file.
+* `cat`:
+    Quickly concatenate SLOW5/BLOW5 files of same read group [experimental]
 * `split`:
     Split a single SLOW5/BLOW5 file into multiple separate files.
 * `get`:
@@ -92,6 +94,21 @@ If multiple samples (different run ids) are detected, the header and the *read_g
 * `-h, --help`:
     Prints the help menu.

+### cat
+
+```
+slow5tools cat file1.blow5 file2.blow5 -o output.blow5
+slow5tools cat blow5_dir1 -o output.blow5
+```
+
+Note: This subtool is experimental. Use with caution.
+Quickly concatenate SLOW5/BLOW5 files of same type (same header, extension, compression).
+
+* `-o, --output FILE`:
+    Outputs concatenated data to FILE [default value: stdout].
+* `-h, --help`:
+    Prints the help menu.
+

 ### index

@@ -137,7 +154,6 @@ slow5tools get [OPTIONS] file1.blow5 readid1 readid2 ....
 slow5tools get [OPTIONS] file1.blow5 --list readids.txt
 ```
-
 * `--to format_type`:
     Specifies the format of output files. `format_type` can be `slow5` for SLOW5 ASCII or `blow5` for SLOW5 binary (BLOW5) [default value: blow5].
 * `-o FILE`, `--output FILE`:

docs/index.md

Lines changed: 2 additions & 0 deletions
@@ -8,6 +8,8 @@

 [Bash one-liners](oneliners.md)

+[SLOW5 for archiving](archive.md)
+
 [Compatibility table](compatibility.md)

 [Frequently asked questions](faq.md)

docs/oneliners.md

Lines changed: 3 additions & 6 deletions
@@ -2,12 +2,6 @@


 ```
-# print slow5 global header:
-slow5tools view file.blow5 | grep '^#'
-
-# print slow5 data header:
-slow5tools view file.blow5 | grep '^@'
-
 # print slow5 header:
 slow5tools view file.blow5 | grep '^[#@]'
@@ -26,4 +20,7 @@ slow5tools get --to slow5 file.blow5 "r1" | grep -v '^[#@]' | awk '{print $8}'
 # merge every 10 files together in INPUT_DIR and save to OUTPUT_DIR
 find -name INPUT_DIR/*.blow5 | parallel -I% --max-args 10 slow5tools merge % -o OUTPUT_DIR/{#}.blow5

+# Get the sum of read counts in all BLOW5 files in a directory
+find INPUT_DIR/ -name '*.blow5' | parallel -I% --max-args 1 slow5tools stats % | grep "number of records" | awk 'BEGIN {count=0;} {count=count+$NF} END {print count;}'
+
 ```

src/cat.c

Lines changed: 3 additions & 1 deletion
@@ -17,7 +17,7 @@

 #define USAGE_MSG "Usage: %s [SLOW5_FILE/DIR]\n"
 #define HELP_LARGE_MSG \
-"Concatenate slow5s with same run_id, compression type, and file extension\n" \
+"Quickly concatenate SLOW5/BLOW5 files of same type (same header, extension, compression) [experimental]\n" \
 USAGE_MSG \
 "\n" \
 "OPTIONS:\n" \
@@ -118,6 +118,8 @@ int cat_main(int argc, char **argv, struct program_meta *meta){
         }
     }

+    WARNING("%s","slow5tools cat is experimental. Use with caution.");
+
     slow5_file_t* slow5File = NULL;
     int first_iteration = 1;
     uint32_t num_read_groups = 1;

src/main.c

Lines changed: 1 addition & 1 deletion
@@ -37,7 +37,7 @@
 "    get         display the read entry for each specified read id\n" \
 "    view        view the contents of a SLOW5/BLOW5 file or convert between different SLOW5/BLOW5 formats and compressions\n" \
 "    stats       prints statistics of a SLOW5/BLOW5 file to the stdout\n" \
-"    cat         quickly concatenate SLOW5/BLOW5 files of same type [experimental]\n" \
+"    cat         quickly concatenate SLOW5/BLOW5 files of same type (same header, extension, compression) [experimental]\n" \
 "    quickcheck  quickly checks if a SLOW5/BLOW5 file is intact\n" \
 "\n" \
 "ARGS:    Try '%s [COMMAND] --help' for more information.\n" \

src/s2f.c

Lines changed: 1 addition & 1 deletion
@@ -478,7 +478,7 @@ void s2f_child_worker(proc_arg_t args,
                       program_meta *meta,
                       reads_count *readsCount) {
     for (int i = args.starti; i < args.endi; i++) {
-        VERBOSE("Converting %s to fast5", slow5_files[i].c_str());
+        DEBUG("Converting %s to fast5", slow5_files[i].c_str());
         slow5_file_t* slow5File_i = slow5_open(slow5_files[i].c_str(), "r");
         if(!slow5File_i){
             ERROR("cannot open %s. skipping...\n",slow5_files[i].c_str());

test/.gitignore

Lines changed: 2 additions & 0 deletions
@@ -27,3 +27,5 @@ data/out/merge/
 data/out/split/
 data/out/slow5tools_get/
 test_all_dir/
+NA12878_prom_subsubsample/
+fast5-soup/

test/download_test_dataset.sh

Lines changed: 17 additions & 18 deletions
@@ -1,23 +1,22 @@
 #!/bin/bash
-# Download a test dataset
-Usage="download_dataset.sh [path to download directory] [download link (optional)]"
+# Download test datasets

-if [[ "$#" -lt 1 ]]; then
-    echo "Usage: $Usage"
-    exit
-fi
+download (){
+    mkdir -p $download_dir
+    tar_path=$download_dir/data.tgz
+    wget -O $tar_path $link || rm -f $tar_path
+    echo "Extracting. Please wait."
+    tar -xf $tar_path -C $download_dir || rm -f $tar_path
+    rm -f $tar_path
+    echo "Extraction successful. Please check $download_dir."
+}

-download_dir=$1
-link="https://github.com/samtools/htslib/releases/download/1.9/htslib-1.9.tar.bz2"
+download_dir=test/
+# test -d $download_dir/NA12878_prom_subsubsample && rm -r $download_dir/NA12878_prom_subsubsample
+# link="https://slow5.page.link/na12878_prom_subsub"
+# download

-if [[ "$#" -eq 2 ]]; then
-    link=$2
-fi
+test -d $download_dir/fast5_soup && rm -r $download_dir/fast5_soup
+link="https://slow5.page.link/fast5-soup"
+download

-mkdir -p $download_dir
-tar_path=$download_dir/data.tgz
-wget -O $tar_path $link || rm -rf $tar_path
-echo "Extracting. Please wait."
-tar -xf $tar_path -C $download_dir || rm -rf $tar_path
-rm -f $tar_path
-echo "Extraction successful. Please check $download_dir."
