tacclab/bio_dataset_manager


Bio Dataset Manager: easily encode biological sequences into tensors


Powered by TaccLab

Authors:


- Fabio Bove | fabio.bove.dr@gmail.com
- Eugenio Bertolini

What is it?

Bio Dataset Manager is a Python project for managing and processing biological sequence data, including DNA, protein, and SMILES strings. It encodes these sequences into tensors that can then be used for AI computations and model implementations.
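As a rough illustration of what encoding a sequence into a tensor means, the sketch below one-hot encodes a DNA string in plain Python. It is not the library's implementation: the `one_hot_encode` helper, the `DNA_ALPHABET` ordering, and the all-zero row for unknown symbols are assumptions made for this example.

```python
# Hypothetical helper for illustration; not part of bio_dataset_manager.
DNA_ALPHABET = "ACGT"  # assumed symbol order


def one_hot_encode(sequence, alphabet=DNA_ALPHABET):
    """Return one row per symbol, with a single 1 at the symbol's index."""
    index = {symbol: i for i, symbol in enumerate(alphabet)}
    rows = []
    for symbol in sequence.upper():
        row = [0] * len(alphabet)
        if symbol in index:  # unknown symbols (e.g. N) stay all-zero
            row[index[symbol]] = 1
        rows.append(row)
    return rows


encoded = one_hot_encode("ACGTN")
```

The nested list has shape (sequence length, alphabet size) and can be converted with `torch.tensor(encoded)` when PyTorch is available.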


Project Structure

  • bio_dataset_manager/: Contains core modules for bioinformatics sequence processing and management.
  • bio_sequences/: Handles various operations related to biological sequences such as DNA and protein.

Installation

  1. Install it as a library
    • Using CPU:
      pip install bio-dataset-manager
    • Using CUDA:
      pip install "bio-dataset-manager[cuda]" -f https://download.pytorch.org/whl/torch_stable.html
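After installing, it can help to confirm which backend PyTorch will actually use. The check below is a minimal sketch that relies only on the standard library and on `torch.cuda.is_available()`; the `describe_backend` function name is an assumption for this example.

```python
import importlib.util


def describe_backend():
    """Report whether PyTorch is importable and whether CUDA is usable."""
    if importlib.util.find_spec("torch") is None:
        return "torch not installed"
    import torch  # imported lazily so the check works without PyTorch

    return "cuda" if torch.cuda.is_available() else "cpu"


print(describe_backend())
```

If this prints `cpu` after a CUDA install, the installed PyTorch wheel was probably built without CUDA support or no compatible GPU driver was found.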

Usage

Examples of the code can be found in the examples folder.

  1. Import the modules
# time and tqdm are only needed for the training loop example
import time

import torch
from tqdm import tqdm

from bio_dataset_manager.bio_dataloader import BioDataloader
from bio_dataset_manager.bio_dataset import BioDataset
from bio_sequences.dna_sequence import DnaSequence
  2. Create the dataset and dataloader
dataset = BioDataset(
        dataset_folder="path/to/dataset",
        sequences_limit=10,
        randomize_choice=True,
        pad_same_len=False,
        window_size=1,
        sequence_info=DnaSequence(),
        sequences=None,
    )

dataloader = BioDataloader(
        dataset=dataset,
        batch_size=5,
        shuffle=True,
        collate_fn=dataset.collate_fn,
        split_ratio=0.5,
        use_gpu=torch.cuda.is_available(),
    )
  3. Training loop example
epochs = 5
for epoch in range(epochs):
    with tqdm(total=len(dataloader.training_dataloader), desc=f"Epoch {epoch + 1}/{epochs}", unit="batch") as pbar:
        for batch in dataloader.training_dataloader:
            y_real, lengths = dataloader.process_batch(batch)
            time.sleep(0.1)  # placeholder for the actual training step
            pbar.update(1)
            # placeholder losses; replace with the values computed by your model
            pbar.set_postfix(loss_gen="0.0", loss_dis="0.0")
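The training loop unpacks each processed batch into data and a set of lengths. One common way to batch variable-length sequences is to right-pad them to the longest sequence in the batch and keep the original lengths alongside. The sketch below shows that general technique in plain Python; it is not the library's `collate_fn` or `process_batch`, and `pad_batch` and `PAD_INDEX` are hypothetical names.

```python
PAD_INDEX = 0  # assumed padding value


def pad_batch(sequences, pad_value=PAD_INDEX):
    """Right-pad integer-encoded sequences to the longest one in the batch.

    Returns the padded batch together with the original lengths.
    """
    lengths = [len(seq) for seq in sequences]
    max_len = max(lengths, default=0)
    padded = [list(seq) + [pad_value] * (max_len - len(seq)) for seq in sequences]
    return padded, lengths


batch = [[1, 2, 3], [4], [2, 2]]
padded, lengths = pad_batch(batch)
# padded  -> [[1, 2, 3], [4, 0, 0], [2, 2, 0]]
# lengths -> [3, 1, 2]
```

Keeping the lengths lets a model mask out or ignore the padded positions later.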

Contributing

Feel free to submit issues or pull requests if you'd like to contribute to this project.


License

License
