- Fabio Bove | fabio.bove.dr@gmail.com
- Eugenio Bertolini
Bio Data Manager is a Python project for managing and processing biological sequence data, including DNA, proteins, and SMILES strings. It encodes these sequences into tensors, which can then be used for AI computations and model implementations.
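To make the idea of encoding sequences into tensors concrete, here is a minimal sketch of one common scheme: one-hot encoding with zero-padding. It is an illustration only and does not use the project's API. The names `DNA_ALPHABET`, `one_hot_encode`, and `encode_batch` are hypothetical, and the four-letter alphabet and the padding strategy are assumptions; the library's actual encoding may differ.

```python
# Hypothetical sketch: NOT the bio-dataset-manager API.
DNA_ALPHABET = "ACGT"  # assumed alphabet; the library may use a different one


def one_hot_encode(sequence, alphabet=DNA_ALPHABET):
    """Return one row per symbol, with a single 1.0 in that symbol's column."""
    index = {symbol: i for i, symbol in enumerate(alphabet)}
    rows = []
    for symbol in sequence.upper():
        if symbol not in index:
            raise ValueError(f"unexpected symbol {symbol!r}")
        row = [0.0] * len(alphabet)
        row[index[symbol]] = 1.0
        rows.append(row)
    return rows


def encode_batch(sequences, alphabet=DNA_ALPHABET):
    """Encode sequences and zero-pad each one to the longest in the batch.

    Returns (batch, lengths), where batch is a nested list shaped
    [len(sequences)][max_len][len(alphabet)] and lengths holds the
    original, unpadded length of each sequence.
    """
    lengths = [len(seq) for seq in sequences]
    max_len = max(lengths, default=0)
    batch = []
    for seq in sequences:
        encoded = one_hot_encode(seq, alphabet)
        padding = [[0.0] * len(alphabet) for _ in range(max_len - len(seq))]
        batch.append(encoded + padding)
    return batch, lengths


batch, lengths = encode_batch(["ACGT", "GA"])
print(lengths)      # [4, 2]
print(batch[1][0])  # [0.0, 0.0, 1.0, 0.0]  -> "G"
print(batch[1][2])  # [0.0, 0.0, 0.0, 0.0]  -> padding row
```

Because every padded sequence has the same shape, the nested list can be passed to `torch.tensor(batch)` to obtain a single tensor, with `lengths` kept alongside it so that padding can be ignored later.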
- bio_data_manager/: Contains core modules for bioinformatics sequence processing and management.
- bio_sequences/: Handles operations on biological sequences such as DNA and proteins.
- Install it as a library
  - Using CPU: pip install bio-dataset-manager
  - Using CUDA: pip install bio-dataset-manager[cuda] -f https://download.pytorch.org/whl/torch_stable.html
- Usage: examples of the code can be found in the examples folder.
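- import the modules

    import time  # used by the training loop example

    import torch
    from tqdm import tqdm  # progress bar used by the training loop example

    from bio_dataset_manager.bio_dataloader import BioDataloader
    from bio_dataset_manager.bio_dataset import BioDataset
    from bio_sequences.dna_sequence import DnaSequence

- create the dataset and dataloader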

    dataset = BioDataset(
        dataset_folder="path/to/dataset",
        sequences_limit=10,
        randomize_choice=True,
        pad_same_len=False,
        window_size=1,
        sequence_info=DnaSequence(),
        sequences=None,
    )
    dataloader = BioDataloader(
        dataset=dataset,
        batch_size=5,
        shuffle=True,
        collate_fn=dataset.collate_fn,
        split_ratio=0.5,
        # use the GPU only when CUDA is available
        use_gpu=torch.cuda.is_available(),
    )

- training loop example

    epochs = 5
    for epoch in range(epochs):
        with tqdm(total=len(dataloader.training_dataloader), desc=f"Epoch {epoch + 1}/{epochs}", unit="batch") as pbar:
            for batch in dataloader.training_dataloader:
                y_real, lengths = dataloader.process_batch(batch)
                time.sleep(0.1)  # placeholder for the actual training step
                pbar.update(1)
                # placeholder loss values; replace with the real losses
                pbar.set_postfix(loss_gen="0.0", loss_dis="0.0")
                pbar.refresh()

Feel free to submit issues or pull requests if you'd like to contribute to this project.