# A pedant's guide on using slow5 for archiving

BLOW5 format is not only useful for [fast signal-level analysis](https://www.nature.com/articles/s41587-021-01147-4), but it is also a great alternative for archiving raw nanopore signal data. You can save heaps of space: for instance, our in-house data generated at the Garvan Institute during 2021 comprised 245 TB of FAST5 files, which were [reduced to just 109 TB after conversion to BLOW5](https://twitter.com/GenTechGp/status/1505756167586611200). A concern one might have is whether the integrity of the original data is retained in BLOW5. We assure users that FAST5 to BLOW5 conversion is fully lossless and can be converted back to FAST5 at any time. Still, many users (including us) can be a bit paranoid, and it is good to be pedantic when it comes to potential data loss.

This vignette walks you through a number of sanity checks that remedy those concerns. Note that some of these tests are redundant and very time-consuming. Depending on how pedantic you are, you can pick the tests you want.

## Suggested conversion structure

We suggest converting all FAST5 files from a single sequencing run into a single merged BLOW5 file, which makes the data very convenient to manage. The examples in this article are centred around this conversion structure. However, this is our personal preference; you can have any number of BLOW5 files per sequencing run or even merge multiple sequencing runs into one BLOW5 file. That said, we suggest not merging drastically different runs (e.g., a PromethION run with a MinION run, or a run from 2020 with a run from 2022) into one BLOW5 file for archiving purposes: the FAST5 file structure can vary significantly between such runs, and mixing them up makes things unnecessarily complex. For analysis purposes, you can merge as you wish, since you still have the original files if something goes wrong.

For a given sequencing run, call `slow5tools f2s` to convert each FAST5 file to BLOW5. Then, merge all the resulting SLOW5 files into a single file using `slow5tools merge`. Given below is a bash code snippet:

```bash
FAST5_DIR=/path/to/input/fast5/
DIR_TMP_SLOW5=/path/to/temporary/slow5/
SLOW5_FILE=/path/to/merged.blow5
NUM_THREADS=8

slow5tools f2s ${FAST5_DIR} -d ${DIR_TMP_SLOW5} -p ${NUM_THREADS}
slow5tools merge ${DIR_TMP_SLOW5}/ -o ${SLOW5_FILE} -t ${NUM_THREADS}

rm -r ${DIR_TMP_SLOW5}
```

## Sanity check by counting the number of reads

One of the simplest sanity checks is to verify that the read count in the merged BLOW5 file equals the total read count across the input FAST5 files. Getting the number of read records in a merged BLOW5 file is quite simple. The bash command below gets the number of read records using `slow5tools stats` and saves the count into a variable:

```bash
NUM_SLOW5_READS=$(slow5tools stats $SLOW5_FILE | grep "number of records" | awk '{print $NF}')
```

This `NUM_SLOW5_READS` variable can now be compared with the count from the FAST5 files. Unfortunately, we are not aware of a straightforward way to quickly get the number of records from a set of FAST5 files (please let us know if there is one). The most accurate (but time-consuming) method we have found is to use the `strings` command to print all the strings in a FAST5 file and then `grep` for strings of the form `read_.*-.*-.*`, which represent UUID-based read IDs. If you have GNU parallel (`apt-get install parallel`), this can be done in parallel for multiple FAST5 files in a directory and summed up using `awk`:
```bash
NUM_READS=$(find $FAST5_DIR -name '*.fast5' | parallel -I% --max-args 1 strings % | grep "read_.*-.*-.*" | wc -l | awk 'BEGIN {count=0;} {count=count+$0} END {print count;}')
```

If GNU parallel is not available, this can be done serially (obviously slower) using `xargs`:
```bash
NUM_READS=$(find $FAST5_DIR -name '*.fast5' | xargs --max-args 1 strings | grep "read_.*-.*-.*" | wc -l | awk 'BEGIN {count=0;} {count=count+$0} END {print count;}')
```
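
To see what the `grep` pattern actually matches, here is a self-contained toy run on a mock file. The file contents below are made up for illustration; real FAST5 files are binary HDF5, which is why `strings` is needed first:

```shell
# mock output of `strings` on a FAST5 file: read IDs are stored
# as "read_" followed by a UUID, among many other strings
cat > mock_strings.txt <<'EOF'
read_0a1b2c3d-4e5f-6789-abcd-ef0123456789
Raw
Signal
read_11111111-2222-3333-4444-555555555555
channel_id
EOF

# the same pattern as above counts only the two read-ID lines
N=$(grep -c "read_.*-.*-.*" mock_strings.txt)
echo "$N"   # prints 2
rm mock_strings.txt
```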

Now we can simply compare `NUM_READS` to `NUM_SLOW5_READS` and see if they are equal. A bash code snippet for this comparison is given below:

```bash
if [ ${NUM_READS} -ne ${NUM_SLOW5_READS} ]
then
    echo "ERROR: Sanity check failed. $NUM_READS reads in FAST5, but $NUM_SLOW5_READS reads in SLOW5"
    exit 1
else
    echo "$NUM_READS reads in FAST5, $NUM_SLOW5_READS reads in SLOW5"
fi
```

A bash function combining the above sanity check, which can be directly copy-pasted into your bash script, is given below:
```bash
# minimal error helper used by the function below
die() { echo "$1" >&2; exit 1; }

sanity_check_fast5_num_reads(){

    FAST5_DIR=$1
    NUM_SLOW5_READS=$2

    test -d ${FAST5_DIR} || die "$FAST5_DIR not found"

    if parallel --version > /dev/null 2>&1
    then
        NUM_READS=$(find $FAST5_DIR -name '*.fast5' | parallel -I% --max-args 1 strings % | grep "read_.*-.*-.*" | wc -l | awk 'BEGIN {count=0;} {count=count+$0} END {print count;}')
    else
        NUM_READS=$(find $FAST5_DIR -name '*.fast5' | xargs --max-args 1 strings | grep "read_.*-.*-.*" | wc -l | awk 'BEGIN {count=0;} {count=count+$0} END {print count;}')
    fi

    if [ ${NUM_READS} -ne ${NUM_SLOW5_READS} ]
    then
        echo "ERROR: Sanity check failed. $NUM_READS reads in FAST5, but $NUM_SLOW5_READS reads in SLOW5"
        exit 1
    else
        echo "$NUM_READS reads in FAST5, $NUM_SLOW5_READS reads in SLOW5"
    fi

}
```

Simply call it as `sanity_check_fast5_num_reads $FAST5_DIR $NUM_SLOW5_READS` in your bash script, assuming you have set `FAST5_DIR` and `NUM_SLOW5_READS` as discussed above.

If you think this is too time-consuming, a quick way to estimate the count is to take the number of reads in one of the FAST5 files in the sequencing run and multiply it by the number of FAST5 files. Note, however, that the number of reads per FAST5 file can vary from file to file, so we find the largest FAST5 file, grab its read count, and multiply that by the number of FAST5 files:
| 85 | + |
| 86 | +```bash |
| 87 | +# find the largest fast5 file by file size |
| 88 | +LARGEST_FAST5=$(find ${FAST5_DIR} -name '*.fast5' -printf "%s\t%p\n" | sort -n | tail -1 | awk '{print $2}') |
| 89 | +# a hacky way to get the number of reads |
| 90 | +NUM_READS_IN_FAST5=$(strings ${LARGEST_FAST5} | grep "read_.*-.*-.*" | wc -l) |
| 91 | +# get the number of fast5 files |
| 92 | +NUMFAST5=$(find $FAST5_DIR -name '*.fast5' | wc -l) |
| 93 | +# now do the multiplication |
| 94 | +NUM_FAST5_READS=$(echo "($NUMFAST5)*($NUM_READS_IN_FAST5)" | bc) |
| 95 | +``` |
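
If you want to convince yourself that the largest-file selection pipeline behaves as intended, here is a self-contained toy run on dummy files (the file names and sizes are made up; note that `-printf` is specific to GNU `find`):

```shell
# create a scratch directory with three dummy "fast5" files of different sizes
mkdir -p demo_fast5
head -c 10  /dev/zero > demo_fast5/a.fast5
head -c 500 /dev/zero > demo_fast5/c.fast5
head -c 100 /dev/zero > demo_fast5/b.fast5

# same pipeline as above: print "size<TAB>path", sort numerically, take the last
LARGEST=$(find demo_fast5 -name '*.fast5' -printf "%s\t%p\n" | sort -n | tail -1 | awk '{print $2}')
echo "$LARGEST"   # prints demo_fast5/c.fast5
rm -r demo_fast5
```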

This `NUM_FAST5_READS` will obviously be an over-estimate, as we used the largest file. So instead of testing for equality, we check whether the read count in SLOW5 is within an acceptable percentage of the estimate:
```bash
# get NUM_SLOW5_READS as a percentage of NUM_FAST5_READS
PASS_PERCENT=$(echo "($NUM_SLOW5_READS/$NUM_FAST5_READS)*100" | bc -l)
# truncate PASS_PERCENT to an integer so it can be compared in bash
PASS_PERCENTINT=$(echo "$PASS_PERCENT/1" | bc)

# the test is done only if the FAST5 file count is large enough (NUMFAST5>20 in this example)
# if PASS_PERCENTINT<95, the check is considered failed in this example
if [ ${NUMFAST5} -gt 20 ] && [ ${PASS_PERCENTINT} -lt 95 ]
then
    echo "ERROR: Sanity check failed. Only $NUM_SLOW5_READS reads in SLOW5 out of $NUM_FAST5_READS estimated raw reads in FAST5 (${PASS_PERCENT}%)"
    exit 1
else
    echo "$NUM_SLOW5_READS reads in SLOW5, $NUM_FAST5_READS estimated raw reads in FAST5 (${PASS_PERCENT}%)"
fi
```

Note that this test, despite being fast, can raise false alarms. A better approach is therefore to fall back to the accurate method discussed above (the `sanity_check_fast5_num_reads` function) if this quick test fails. A bash function that does all this:

```bash
# relies on the die() helper and the sanity_check_fast5_num_reads() function defined above
sanity_check_fast5_num_reads_estimate(){

    FAST5_DIR=$1
    NUM_SLOW5_READS=$2

    test -d ${FAST5_DIR} || die "$FAST5_DIR not found"

    # estimate the number of reads from the largest multi-fast5 file
    LARGEST_FAST5=$(find ${FAST5_DIR} -name '*.fast5' -printf "%s\t%p\n" | sort -n | tail -1 | awk '{print $2}')
    NUM_READS_IN_FAST5=$(strings ${LARGEST_FAST5} | grep "read_.*-.*-.*" | wc -l)
    NUMFAST5=$(find $FAST5_DIR -name '*.fast5' | wc -l)
    NUM_FAST5_READS=$(echo "($NUMFAST5)*($NUM_READS_IN_FAST5)" | bc)

    PASS_PERCENT=$(echo "($NUM_SLOW5_READS/$NUM_FAST5_READS)*100" | bc -l)
    PASS_PERCENTINT=$(echo "$PASS_PERCENT/1" | bc)

    if [ ${NUMFAST5} -gt 20 ] && [ ${PASS_PERCENTINT} -lt 95 ]
    then
        echo "Estimated sanity check failed - only $NUM_SLOW5_READS reads in SLOW5 out of $NUM_FAST5_READS estimated raw reads in FAST5 (${PASS_PERCENT}%). Trying the accurate method."
        sanity_check_fast5_num_reads $FAST5_DIR $NUM_SLOW5_READS
    else
        echo "$NUM_SLOW5_READS reads in SLOW5, $NUM_FAST5_READS estimated raw reads in FAST5 (${PASS_PERCENT}%)"
    fi

}
```

## Sanity check through read ID uniqueness

What if the read counts match, but one file got merged twice (note: this is extremely unlikely)? You can eliminate this doubt by checking that the read IDs are all unique.
The easiest way is to call `slow5tools index` on your merged BLOW5 file, as it will fail if there are duplicate read IDs. Because `slow5tools index` goes through the whole BLOW5 file, this test also catches the unlikely scenario of the file being corrupted or truncated. In addition, the generated index can be archived if necessary, so that this indexing step can be skipped later during analysis. Following is a code snippet:

```bash
slow5tools index ${SLOW5_FILE} || { echo "Indexing failed"; exit 1; }
```

A much more expensive and time-consuming uniqueness test can be done using `slow5tools view`. Unlike the `index` command, `slow5tools view` decompresses and parses each and every record, so this test verifies that every bit of every record is intact. The bash snippet below uses `slow5tools view` to print all the read IDs, sorts them, counts the occurrences of each read ID using `uniq -c`, sorts again by count, and then uses `awk` to check that the largest count is still 1.

```bash
slow5tools view -t ${NUM_THREADS} $SLOW5_FILE | awk '{print $1}' | grep -v "^[#@]" | sort | uniq -c | sort -rn | awk '{if($1!=1){print "Duplicate read ID found",$2; exit 1}}' || { echo "ERROR: Sanity check failed. Duplicate reads in SLOW5"; exit 1; }
```
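
The duplicate-detection part of this pipeline can itself be checked on a mock list of read IDs (the IDs below are made up). Because `uniq -c | sort -rn` puts the most frequent ID first, the `awk` check only has to look at the counts:

```shell
# four mock read IDs with one deliberate duplicate (r2)
OUT=$(printf 'r1\nr2\nr3\nr2\n' | sort | uniq -c | sort -rn |
      awk '{if($1!=1){print "Duplicate read ID found",$2; exit 1}}')
echo "$OUT"   # prints: Duplicate read ID found r2
```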

## Sanity check by basecalling

The ultimate test is to base-call the original FAST5 files as well as FAST5 files regenerated from the created BLOW5 file, and then compare the two base-calling outputs.

First, let us split the merged BLOW5 file into multiple smaller BLOW5 files so that `slow5tools s2f` can convert them to FAST5 in parallel. Following is a code snippet that does the splitting and the conversion to FAST5:

```bash
# get the number of read groups in the SLOW5 file
NUM_READ_GROUPS=$(slow5tools stats ${SLOW5_FILE} | head | grep "number of read groups" | awk '{print $NF}')

# if there are multiple read groups, first split by read group, and then into 4000-read SLOW5 files
if [ ${NUM_READ_GROUPS} -ne 1 ]
then
    slow5tools split ${SLOW5_FILE} -d split_groups_tmp/ -g -t $NUM_THREADS
    slow5tools split split_groups_tmp/ -d split_reads_tmp/ -r 4000 -t $NUM_THREADS
    rm -r split_groups_tmp/
# if there is only one read group, directly split into 4000-read SLOW5 files
else
    slow5tools split ${SLOW5_FILE} -d split_reads_tmp/ -r 4000 -t $NUM_THREADS
fi

# convert the split SLOW5 files to FAST5
slow5tools s2f split_reads_tmp/ -d s2f_fast5/ -p $NUM_THREADS
rm -r split_reads_tmp/
```

Now base-call the original FAST5 files as well as the reconverted FAST5 files. Following are some example commands, but make sure to set the base-calling profile to match your dataset and the CPU/GPU device to match your system:

```bash
guppy_basecaller -c dna_r9.4.1_450bps_fast.cfg -i ${FAST5_DIR} -s fast5_basecalls/ -r --device cuda:all
guppy_basecaller -c dna_r9.4.1_450bps_fast.cfg -i s2f_fast5/ -s s2f_fast5_basecalls/ -r --device cuda:all
rm -r s2f_fast5/
```

Now we can check whether the base-calling outputs are the same. The order of reads produced by the base-caller is not deterministic, so make sure you sort the outputs before comparing them with `diff`. If the `diff` command passes, the data in the BLOW5 file are identical to those in FAST5.
Given below is a bash function that you can directly copy-paste into your bash script and call as `compare_basecalls fast5_basecalls/ s2f_fast5_basecalls/`:

```bash
# minimal error helper used by the function below
die() { echo "$1" >&2; exit 1; }

compare_basecalls(){
    A=$1
    B=$2

    test -e $A || die "$A not present."
    test -e $B || die "$B not present."

    # sort the fastq records by read ID because the output order from guppy is not deterministic
    find $A -name '*.fastq' -exec cat {} + | paste - - - - | sort -k1,1 | tr '\t' '\n' > a.fastq
    find $B -name '*.fastq' -exec cat {} + | paste - - - - | sort -k1,1 | tr '\t' '\n' > b.fastq

    diff -q a.fastq b.fastq || { echo "Basecalls differ"; exit 1; }

    # strip out the file name columns, then sort before comparing
    cut -f2,3,5- $A/sequencing_summary.txt | sort -k1 > a.txt
    cut -f2,3,5- $B/sequencing_summary.txt | sort -k1 > b.txt

    diff -q a.txt b.txt || { echo "sequencing summary files differ"; exit 1; }

}
```
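
The `paste - - - -` trick used above folds each 4-line FASTQ record onto a single tab-separated line so that whole records can be sorted by read ID. Here is a self-contained toy example with two made-up records:

```shell
# two mock FASTQ records, deliberately out of read-ID order
printf '@read_b\nACGT\n+\nIIII\n@read_a\nTTTT\n+\nJJJJ\n' > mock.fastq

# 4 lines -> 1 tab-separated line, sort by read ID, unfold back to 4 lines
paste - - - - < mock.fastq | sort -k1,1 | tr '\t' '\n' > sorted.fastq

FIRST=$(head -1 sorted.fastq)
echo "$FIRST"   # prints @read_a
rm mock.fastq sorted.fastq
```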

However, note that this `diff` test can sometimes raise false errors, because base-callers can produce slightly different outputs under various circumstances (see https://github.com/hasindu2008/slow5tools/issues/70). We recently came across a situation where Guppy 4.4.1 on a system with multiple GPUs (a GeForce 3090 and a 3070) produced slightly different results on the same FAST5 input when run multiple times.