Fast and accurate signal classifier for nanopore sequencing

at-cg/NanoLabel

NanoLabel

NanoLabel is a method for classifying signals generated by nanopore sequencing devices. By operating directly on raw signals and avoiding basecalling, it is well suited for deployment in resource-constrained environments. NanoLabel is built on top of RawHash2, a signal-level read mapping tool. NanoLabel is designed to make fast real-time classification decisions with high accuracy, which is useful for adaptive sampling.

NanoLabel uses a lightweight XGBoost-based classifier. The user first chooses a target genomic region and then trains an XGBoost model in a supervised manner, using alignment statistics obtained by mapping a set of ONT reads. The training and inference workflow is described below.

Installation

  1. Clone the repo
git clone https://github.com/at-cg/NanoLabel.git
cd NanoLabel
  2. Create the conda environment
conda env create -f environment.yml
conda activate nanolabel
  3. Install RawHash2
  • Use our fork of RawHash2, compile it, and copy the compiled binary into the bin folder of the NanoLabel repository.
git clone --recursive https://github.com/daanishmahajan/RawHash rawhash2
make -C rawhash2 NOHDF5=1 NOPOD5=1
cp rawhash2/bin/rawhash2 bin

After completing the steps above, running ./bin/rawhash2 should print the RawHash2 help page.

  4. Download the test dataset
huggingface-cli download 7dan/Adaptive_Sampling \
  --repo-type dataset \
  --local-dir Test \
  --local-dir-use-symlinks False

Dataset description

In this dataset, the target is a hereditary cancer panel of 258 genes, and the rest of the human genome is non-target. The files are described below:

  • ref.fa: GRCh38 reference fasta file
  • 258_hereditary_cancer_panel/training_data: 50 reads from target and 1000 reads from non-target (sampled from dataset D1 in Table 1 in our paper)
  • 258_hereditary_cancer_panel/testing_data1: 50 reads from target and 1000 reads from non-target (sampled from dataset D2 in Table 1)
  • 258_hereditary_cancer_panel/testing_data2: 1000 reads each from target and non-target (sampled from dataset D2 in Table 1)
  • 258_hereditary_cancer_panel/coordinates.txt: Coordinates of the 258 genes (chr_id start end)
  • csv_files: .csv files for training the XGBoost model
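
The coordinates file lists one target region per line as chr_id start end. As a rough illustration of how such a file can be used to decide whether a mapped read falls in the target, here is a minimal Python sketch. It is not taken from the NanoLabel code: the function names load_targets and overlaps_target are hypothetical, the fields are assumed to be whitespace-separated, intervals are treated as half-open, and the example regions are made up.

```python
from collections import defaultdict


def load_targets(lines):
    """Parse 'chr_id start end' lines into {chr_id: [(start, end), ...]}.

    Assumption: fields are whitespace-separated; lines with fewer than
    three fields (e.g. blank lines) are skipped.
    """
    targets = defaultdict(list)
    for line in lines:
        fields = line.split()
        if len(fields) < 3:
            continue
        chrom, start, end = fields[0], int(fields[1]), int(fields[2])
        targets[chrom].append((start, end))
    return dict(targets)


def overlaps_target(targets, chrom, start, end):
    """Return True if [start, end) overlaps any target interval on chrom.

    Assumption: intervals are half-open; the real file may use a
    different coordinate convention.
    """
    return any(s < end and start < e for s, e in targets.get(chrom, []))


# Hypothetical regions, not real gene coordinates.
example = ["chrA 100 200", "chrB 50 60"]
targets = load_targets(example)
print(overlaps_target(targets, "chrA", 150, 300))
print(overlaps_target(targets, "chrC", 0, 10))
```

With the real file, the lines can be read with open("coordinates.txt") and passed to load_targets in the same way.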

Usage

Go to the src directory.

cd src

Data Preparation

python3 prepare_data.py \
-log ./log.txt \
-tdata ../Test/data/258_hereditary_cancer_panel/training_data/train.blow5 \
-rpath ../Test/data/ref.fa \
-tpath ../Test/data/258_hereditary_cancer_panel/coordinates.txt \
-ppath ../Test/data/258_hereditary_cancer_panel/training_data/train.paf \
-dir ../Test/data_preparation

The .csv files will be saved in ../Test/data_preparation/csv_files. Check here to see the complete list of command-line options. The training data generated from the dataset provided on Hugging Face is intended only for a quick demo run; it is insufficient to train an accurate XGBoost model. For adequate training, we also provide .csv files generated from a larger dataset in ../Test/data/csv_files.

Training

python3 train.py \
-log ./log.txt \
-dir ../Test/training \
-csv ../Test/data/csv_files

Trained models will be saved in ../Test/training/model. Check here to see the complete list of command-line options.

Testing

python3 test.py \
-log ./log.txt \
-dir ../Test/testing \
-mdir ../Test/training/model \
-tpath ../Test/data/258_hereditary_cancer_panel/coordinates.txt \
-data ../Test/data/258_hereditary_cancer_panel/testing_data1/test.blow5 \
-fqpath ../Test/data/258_hereditary_cancer_panel/testing_data1/test.fastq \
-alpath ../Test/data/258_hereditary_cancer_panel/testing_data1/test.paf 

The results will be saved in the folder ../Test/testing/results. The folder will contain two files for every signal length (one to five chunks): classification_{$chunk_count}.txt and metrics_{$chunk_count}.txt.

  • The output format of classification_{$chunk_count}.txt files is as follows (1 stands for target):
Read_id 	 Predicted 	 Actual
5acd9e7c-1944-424a-adbe-0f2156dbf7ce 	 0 	 1
39f34517-0f01-4a0f-9644-983cc10feae1 	 0 	 1
...
  • The output format of metrics_{$chunk_count}.txt files is as follows:
******************************* Results of XGBoost inference only on the mapped data *******************************
Number of chunks: $chunk_count, Total number of mapped positive samples: 2, Total number of mapped negative samples: 3
Testing accuracy score: 1.0
Precision: 1.0, Recall: 1.0, F1: 1.0, TP: 2, TN: 3, FTP: 1.0, FTN: 1.0
Average inference time: 1.2344837188720703 ms
******************************* Final results after resolving FPs using XGBoost *******************************
Total positive data: 50, Total_negative_data: 1000
Precision: 1.0, Recall: 0.04, F1: 0.07692307692307693, TP: 2, TN: 1000, FTP: 0.04, FTN: 1.0

Check here to see the complete list of command-line options.
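
To inspect a classification file outside of test.py, the per-read labels can be tallied and turned into precision, recall, and F1 with the standard formulas. The following is a minimal sketch, not NanoLabel's own code: the function names are hypothetical, and it assumes the file is whitespace- or tab-separated with a Read_id header row, as in the sample above.

```python
from collections import Counter


def read_classification(lines):
    """Yield (read_id, predicted, actual) from classification file lines.

    Assumption: three whitespace-separated fields per record; the header
    row and lines without exactly three fields are skipped.
    """
    for line in lines:
        fields = line.split()
        if len(fields) != 3 or fields[0] == "Read_id":
            continue
        read_id, predicted, actual = fields
        yield read_id, int(predicted), int(actual)


def confusion_counts(records):
    """Count TP, FP, TN, FN, where label 1 stands for target."""
    counts = Counter()
    for _, predicted, actual in records:
        if predicted == 1:
            counts["TP" if actual == 1 else "FP"] += 1
        else:
            counts["FN" if actual == 1 else "TN"] += 1
    return counts


def precision_recall_f1(tp, fp, fn):
    """Standard definitions; each value is 0.0 when its denominator is zero."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


sample = [
    "Read_id \t Predicted \t Actual",
    "5acd9e7c-1944-424a-adbe-0f2156dbf7ce \t 0 \t 1",
    "39f34517-0f01-4a0f-9644-983cc10feae1 \t 0 \t 1",
]
print(confusion_counts(read_classification(sample)))
print(precision_recall_f1(tp=2, fp=0, fn=48))
```

The last call uses the counts from the final-results line in the sample metrics output: 2 true positives out of 50 positive reads (so 48 false negatives) and no false positives. These give a recall of 0.04 and an F1 of about 0.0769, consistent with the values shown there.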

Preprint

  • Daanish Mahajan, Chirag Jain and Navin Kashyap. NanoLabel: A fast and accurate real-time nanopore signal classifier. (under review)
