NanoLabel is a method for classifying signals generated by nanopore sequencing devices. By operating directly on raw signals and avoiding basecalling, it is well suited for deployment in resource-constrained environments. NanoLabel is built on top of RawHash2, a signal-level read mapping tool. NanoLabel is designed to make fast real-time classification decisions with high accuracy, which is useful for adaptive sampling.
NanoLabel uses a lightweight XGBoost-based classifier. The user first chooses a target genomic region, then trains an XGBoost model in a supervised manner using alignment statistics obtained by mapping a set of ONT reads. The training and inference workflow is described below.
- Clone the repo
git clone https://github.com/at-cg/NanoLabel.git
cd NanoLabel
- Create conda environment
- Install conda.
conda env create -f environment.yml
conda activate nanolabel
- Install RawHash2
- Use our fork of RawHash2, compile it, and copy the compiled binary into the bin folder of the NanoLabel repository.
git clone --recursive https://github.com/daanishmahajan/RawHash rawhash2
make -C rawhash2 NOHDF5=1 NOPOD5=1
cp rawhash2/bin/rawhash2 bin
After completing the above, running ./bin/rawhash2 should print the RawHash2 help page.
- Download test dataset
- Download the dataset from the Hugging Face repository. The following command downloads it into the Test directory.
huggingface-cli download 7dan/Adaptive_Sampling \
--repo-type dataset \
--local-dir Test \
--local-dir-use-symlinks False
In this dataset, we consider a 258-gene hereditary cancer panel as the target and the remaining human genome as non-target. The files are described below:
- ref.fa: GRCh38 reference FASTA file
- 258_hereditary_cancer_panel/training_data: 50 reads from target and 1000 reads from non-target (sampled from dataset D1 in Table 1 in our paper)
- 258_hereditary_cancer_panel/testing_data1: 50 reads from target and 1000 reads from non-target (sampled from dataset D2 in Table 1)
- 258_hereditary_cancer_panel/testing_data2: 1000 reads each from target and non-target (sampled from dataset D2 in Table 1)
- 258_hereditary_cancer_panel/coordinates.txt: Coordinates of the 258 genes (chr_id start end)
- csv_files: .csv files for training the XGBoost model
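To make the coordinates.txt layout concrete, here is a minimal sketch of reading `chr_id start end` lines and checking whether a mapped interval overlaps a target region. It is not NanoLabel's own code: the function names are hypothetical, the example coordinates are made up, and it assumes whitespace-separated fields and half-open intervals.

```python
from typing import Dict, Iterable, List, Tuple

Regions = Dict[str, List[Tuple[int, int]]]


def parse_coordinates(lines: Iterable[str]) -> Regions:
    """Group 'chr_id start end' lines into sorted intervals per chromosome."""
    regions: Regions = {}
    for line in lines:
        fields = line.split()
        # Skip blank lines, headers, and anything without numeric start/end.
        if len(fields) < 3 or not (fields[1].isdigit() and fields[2].isdigit()):
            continue
        regions.setdefault(fields[0], []).append((int(fields[1]), int(fields[2])))
    for intervals in regions.values():
        intervals.sort()
    return regions


def overlaps_target(regions: Regions, chrom: str, start: int, end: int) -> bool:
    """True if [start, end) overlaps any target interval on chrom (assumed half-open)."""
    return any(s < end and start < e for s, e in regions.get(chrom, []))


# Made-up coordinates for illustration only; not taken from the actual panel.
example_lines = ["chr1 1000 2000", "chr2 500 900"]
regions = parse_coordinates(example_lines)
print(overlaps_target(regions, "chr1", 1500, 1600))  # True
print(overlaps_target(regions, "chr1", 2000, 2100))  # False
```

To use it on the real file, pass an open file handle for coordinates.txt to `parse_coordinates` in place of the example list.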
Go to the src directory.
cd src
python3 prepare_data.py \
-log ./log.txt \
-tdata ../Test/data/258_hereditary_cancer_panel/training_data/train.blow5 \
-rpath ../Test/data/ref.fa \
-tpath ../Test/data/258_hereditary_cancer_panel/coordinates.txt \
-ppath ../Test/data/258_hereditary_cancer_panel/training_data/train.paf \
-dir ../Test/data_preparation
The .csv files will be saved in ../Test/data_preparation/csv_files.
Check here to see the complete list of command-line options.
The training data generated from the Hugging Face dataset is intended only for a quick demo run; it is insufficient to train an accurate XGBoost model. For adequate training, we also provide .csv files generated from a larger dataset in ../Test/data/csv_files.
python3 train.py \
-log ./log.txt \
-dir ../Test/training \
-csv ../Test/data/csv_files
Trained models will be saved in ../Test/training/model.
Check here to see the complete list of command-line options.
python3 test.py \
-log ./log.txt \
-dir ../Test/testing \
-mdir ../Test/training/model \
-tpath ../Test/data/258_hereditary_cancer_panel/coordinates.txt \
-data ../Test/data/258_hereditary_cancer_panel/testing_data1/test.blow5 \
-fqpath ../Test/data/258_hereditary_cancer_panel/testing_data1/test.fastq \
-alpath ../Test/data/258_hereditary_cancer_panel/testing_data1/test.paf
The results will be saved in the folder ../Test/testing/results. The folder will contain two files for every signal length (one to five chunks): classification_{$chunk_count}.txt and metrics_{$chunk_count}.txt.
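As a quick sanity check after a test run, the sketch below lists the result files one would expect for chunk counts one to five and reports which are present. It assumes that `{$chunk_count}` expands to a plain integer (for example classification_1.txt); the helper name is hypothetical and not part of NanoLabel.

```python
from pathlib import Path
from typing import List


def expected_result_files(results_dir: str, max_chunks: int = 5) -> List[Path]:
    """Build the expected classification/metrics paths for 1..max_chunks chunks."""
    results = Path(results_dir)
    files: List[Path] = []
    for chunk_count in range(1, max_chunks + 1):
        # Assumed naming: the placeholder is replaced by the bare chunk count.
        files.append(results / f"classification_{chunk_count}.txt")
        files.append(results / f"metrics_{chunk_count}.txt")
    return files


for path in expected_result_files("../Test/testing/results"):
    status = "found" if path.exists() else "missing"
    print(f"{path}: {status}")
```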
- The output format of the classification_{$chunk_count}.txt files is as follows (1 stands for target):
Read_id Predicted Actual
5acd9e7c-1944-424a-adbe-0f2156dbf7ce 0 1
39f34517-0f01-4a0f-9644-983cc10feae1 0 1
...
- The output format of the metrics_{$chunk_count}.txt files is as follows:
******************************* Results of XGBoost inference only on the mapped data *******************************
Number of chunks: $chunk_count, Total number of mapped positive samples: 2, Total number of mapped negative samples: 3
Testing accuracy score: 1.0
Precision: 1.0, Recall: 1.0, F1: 1.0, TP: 2, TN: 3, FTP: 1.0, FTN: 1.0
Average inference time: 1.2344837188720703 ms
******************************* Final results after resolving FPs using XGBoost *******************************
Total positive data: 50, Total_negative_data: 1000
Precision: 1.0, Recall: 0.04, F1: 0.07692307692307693, TP: 2, TN: 1000, FTP: 0.04, FTN: 1.0
Check here to see the complete list of command-line options.
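The precision, recall, and F1 values in the metrics files can be recomputed from a classification file. The following is a minimal sketch, not the code used by test.py: the function names are hypothetical, and it assumes whitespace-separated `Read_id Predicted Actual` rows with 0/1 labels, as in the sample output above.

```python
from typing import Iterable, Tuple


def confusion_counts(lines: Iterable[str]) -> Tuple[int, int, int, int]:
    """Return (TP, FP, TN, FN) from 'Read_id Predicted Actual' rows (1 = target)."""
    tp = fp = tn = fn = 0
    for line in lines:
        fields = line.split()
        # Skip the header, '...' placeholders, and malformed rows.
        if len(fields) != 3 or fields[1] not in ("0", "1") or fields[2] not in ("0", "1"):
            continue
        predicted, actual = fields[1], fields[2]
        if predicted == "1" and actual == "1":
            tp += 1
        elif predicted == "1" and actual == "0":
            fp += 1
        elif predicted == "0" and actual == "0":
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn


def precision_recall_f1(tp: int, fp: int, fn: int) -> Tuple[float, float, float]:
    """Standard definitions; each value falls back to 0.0 when its denominator is zero."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# The two sample rows shown above are both targets predicted as non-target.
rows = [
    "Read_id Predicted Actual",
    "5acd9e7c-1944-424a-adbe-0f2156dbf7ce 0 1",
    "39f34517-0f01-4a0f-9644-983cc10feae1 0 1",
]
print(confusion_counts(rows))  # (0, 0, 0, 2)

# Final sample results: TP = 2 of 50 targets (FN = 48), TN = 1000 of 1000 (FP = 0).
precision, recall, f1 = precision_recall_f1(tp=2, fp=0, fn=48)
print(precision, recall, round(f1, 4))
```

With the final sample counts this gives precision 1.0, recall 0.04, and an F1 of about 0.0769, matching the last line of the sample metrics output.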
- Daanish Mahajan, Chirag Jain and Navin Kashyap. NanoLabel: A fast and accurate real-time nanopore signal classifier. (under review)