NanoLabel

NanoLabel is a method for classifying signals generated by nanopore sequencing devices. By operating directly on raw signals and avoiding basecalling, it is well suited for deployment in resource-constrained environments. NanoLabel is built on top of RawHash2, a signal-level read mapping tool. NanoLabel is designed to make fast real-time classification decisions with high accuracy, which is useful for adaptive sampling.

NanoLabel uses a lightweight XGBoost-based classifier. The user first chooses a target genomic region, then trains an XGBoost model in a supervised manner on alignment statistics obtained by mapping a set of ONT reads. The training and inference workflow is described below.

Installation

Clone the repo

git clone https://github.com/at-cg/NanoLabel.git
cd NanoLabel

Create conda environment

Install conda.

conda env create -f environment.yml
conda activate nanolabel

Install RawHash2

Use our fork of RawHash2: clone and compile it, then copy the compiled binary into the bin folder of the NanoLabel repository.

git clone --recursive https://github.com/daanishmahajan/RawHash  rawhash2
make -C rawhash2 NOHDF5=1 NOPOD5=1
cp rawhash2/bin/rawhash2 bin

After completing the steps above, running ./bin/rawhash2 should print the RawHash2 help page.

Download test dataset

Download the dataset from the Hugging Face repository. The command below downloads it into the Test directory.

huggingface-cli download 7dan/Adaptive_Sampling \
  --repo-type dataset \
  --local-dir Test \
  --local-dir-use-symlinks False

Dataset description

In this dataset, a hereditary cancer panel of 258 genes is the target and the rest of the human genome is the non-target. The files are as follows:

ref.fa: GRCh38 reference fasta file
258_hereditary_cancer_panel/training_data: 50 reads from target and 1000 reads from non-target (sampled from dataset D1 in Table 1 in our paper)
258_hereditary_cancer_panel/testing_data1: 50 reads from target and 1000 reads from non-target (sampled from dataset D2 in Table 1)
258_hereditary_cancer_panel/testing_data2: 1000 reads each from target and non-target (sampled from dataset D2 in Table 1)
258_hereditary_cancer_panel/coordinates.txt: Coordinates of the 258 genes (chr_id start end)
csv_files: .csv files for training the XGBoost model
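
The coordinates file can be read with a few lines of Python when you want to check whether a mapped position falls inside the target panel. The sketch below is only an illustration, not part of NanoLabel: the function names load_targets and overlaps_target are hypothetical, the file is assumed to contain whitespace-separated "chr_id start end" records with no header line, and intervals are treated as half-open, which the dataset description does not specify.

```python
from collections import defaultdict


def load_targets(lines):
    """Parse 'chr_id start end' records into per-chromosome interval lists."""
    targets = defaultdict(list)
    for line in lines:
        fields = line.split()
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        chrom, start, end = fields[0], int(fields[1]), int(fields[2])
        targets[chrom].append((start, end))
    return dict(targets)


def overlaps_target(targets, chrom, start, end):
    """Return True if [start, end) overlaps any target interval on chrom."""
    # A linear scan is enough for a panel of a few hundred genes.
    return any(s < end and start < e for s, e in targets.get(chrom, []))


if __name__ == "__main__":
    # Made-up coordinates for illustration only; with the real file use:
    #   with open("coordinates.txt") as fh: targets = load_targets(fh)
    targets = load_targets(["chr1 100 200", "chr2 50 80"])
    print(overlaps_target(targets, "chr1", 150, 160))
    print(overlaps_target(targets, "chr1", 200, 300))
```
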

Usage

Go to the src directory.

cd src

Data Preparation

python3 prepare_data.py \
-log ./log.txt \
-tdata ../Test/data/258_hereditary_cancer_panel/training_data/train.blow5 \
-rpath ../Test/data/ref.fa \
-tpath ../Test/data/258_hereditary_cancer_panel/coordinates.txt \
-ppath ../Test/data/258_hereditary_cancer_panel/training_data/train.paf \
-dir ../Test/data_preparation

The .csv files will be saved in ../Test/data_preparation/csv_files. Check here to see the complete list of command-line options. The training data generated from the Hugging Face dataset is only for a quick demo run; it is insufficient to train an accurate XGBoost model. For adequate training, we also provide .csv files generated from a larger dataset in ../Test/data/csv_files.

Training

python3 train.py \
-log ./log.txt \
-dir ../Test/training \
-csv ../Test/data/csv_files

Trained models will be saved in ../Test/training/model. Check here to see the complete list of command-line options.

Testing

python3 test.py \
-log ./log.txt \
-dir ../Test/testing \
-mdir ../Test/training/model \
-tpath ../Test/data/258_hereditary_cancer_panel/coordinates.txt \
-data ../Test/data/258_hereditary_cancer_panel/testing_data1/test.blow5 \
-fqpath ../Test/data/258_hereditary_cancer_panel/testing_data1/test.fastq \
-alpath ../Test/data/258_hereditary_cancer_panel/testing_data1/test.paf

The results will be saved in ../Test/testing/results. The folder contains two files for each signal length (one to five chunks): classification_{$chunk_count}.txt and metrics_{$chunk_count}.txt.

The output format of classification_{$chunk_count}.txt files is as follows (1 stands for target):

Read_id 	 Predicted 	 Actual
5acd9e7c-1944-424a-adbe-0f2156dbf7ce 	 0 	 1
39f34517-0f01-4a0f-9644-983cc10feae1 	 0 	 1
...
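
A classification file in this format can be tallied into a confusion matrix with a short script. The following is a minimal sketch, assuming the three columns are tab-separated (possibly padded with spaces) as in the sample above; read_classification and confusion_counts are hypothetical helper names, not functions shipped with NanoLabel.

```python
def read_classification(lines):
    """Return a list of (read_id, predicted, actual) tuples."""
    rows = []
    for line in lines:
        fields = [f.strip() for f in line.split("\t")]
        # Skip the header, blank lines, and anything without two 0/1 labels.
        if len(fields) != 3 or not (fields[1].isdigit() and fields[2].isdigit()):
            continue
        rows.append((fields[0], int(fields[1]), int(fields[2])))
    return rows


def confusion_counts(rows):
    """Count TP, TN, FP, FN, where label 1 stands for target."""
    tp = sum(1 for _, p, a in rows if p == 1 and a == 1)
    tn = sum(1 for _, p, a in rows if p == 0 and a == 0)
    fp = sum(1 for _, p, a in rows if p == 1 and a == 0)
    fn = sum(1 for _, p, a in rows if p == 0 and a == 1)
    return tp, tn, fp, fn


if __name__ == "__main__":
    sample = [
        "Read_id \t Predicted \t Actual",
        "5acd9e7c-1944-424a-adbe-0f2156dbf7ce \t 0 \t 1",
        "39f34517-0f01-4a0f-9644-983cc10feae1 \t 0 \t 1",
    ]
    # Both sample reads are targets predicted as non-target (false negatives).
    print(confusion_counts(read_classification(sample)))
```

To process a real file, pass an open file handle to read_classification in place of the sample list.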

The output format of metrics_{$chunk_count}.txt files is as follows:

******************************* Results of XGBoost inference only on the mapped data *******************************
Number of chunks: $chunk_count, Total number of mapped positive samples: 2, Total number of mapped negative samples: 3
Testing accuracy score: 1.0
Precision: 1.0, Recall: 1.0, F1: 1.0, TP: 2, TN: 3, FTP: 1.0, FTN: 1.0
Average inference time: 1.2344837188720703 ms
******************************* Final results after resolving FPs using XGBoost *******************************
Total positive data: 50, Total_negative_data: 1000
Precision: 1.0, Recall: 0.04, F1: 0.07692307692307693, TP: 2, TN: 1000, FTP: 0.04, FTN: 1.0
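
The final numbers in the sample above follow from the usual definitions of precision, recall, and F1. The sketch below recomputes them from the counts shown (TP = 2, TN = 1000, 50 positives, 1000 negatives). Reading FTP and FTN as the fractions TP / total positives and TN / total negatives is an inference from the sample values rather than something this README states, and summarize is a hypothetical helper name.

```python
def summarize(tp, tn, total_pos, total_neg):
    """Recompute summary metrics from confusion counts and class totals."""
    fn = total_pos - tp  # targets that were not classified as target
    fp = total_neg - tn  # non-targets that were classified as target
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / total_pos if total_pos else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {
        "precision": precision,
        "recall": recall,
        "f1": f1,
        # Assumed meaning of FTP / FTN, inferred from the sample output.
        "ftp": tp / total_pos if total_pos else 0.0,
        "ftn": tn / total_neg if total_neg else 0.0,
        "fp": fp,
        "fn": fn,
    }


if __name__ == "__main__":
    m = summarize(tp=2, tn=1000, total_pos=50, total_neg=1000)
    print({k: round(v, 4) for k, v in m.items()})
```

With these counts, recall is 2/50 = 0.04 and F1 is 2 * 1.0 * 0.04 / 1.04, roughly 0.0769, which agrees with the sample output.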

Check here to see the complete list of command-line options.

Preprint

Daanish Mahajan, Chirag Jain and Navin Kashyap. NanoLabel: A fast and accurate real-time nanopore signal classifier. (under review)
