GitHub - tacclab/bio_dataset_manager: This tool facilitates the encoding of these sequences into tensors, which can then be used for AI computations and complex model implementations

Bio Dataset Manager: easily encode biological sequences into tensors

Authors:

- Fabio Bove | fabio.bove.dr@gmail.com
- Eugenio Bertolini

What is it?

Bio Dataset Manager is a Python project for managing and processing biological sequence data, including DNA, proteins, and SMILES strings. It encodes these sequences into tensors, which can then be used for AI computations and complex model implementations.
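
To make "encoding sequences into tensors" concrete, here is a minimal sketch of one-hot encoding a DNA string. It is not the library's implementation: the `one_hot_encode` function and the four-letter `DNA_ALPHABET` are assumptions made for illustration, and the encoding the package actually applies may differ.

```python
# Illustrative only: not the library's own encoder.
DNA_ALPHABET = "ACGT"  # assumed alphabet; the real tool may handle more symbols


def one_hot_encode(sequence, alphabet=DNA_ALPHABET):
    """Return one row per character, with a single 1.0 at that character's index."""
    index = {symbol: i for i, symbol in enumerate(alphabet)}
    encoded = []
    for symbol in sequence.upper():
        row = [0.0] * len(alphabet)
        row[index[symbol]] = 1.0  # raises KeyError for symbols outside the alphabet
        encoded.append(row)
    return encoded


# The result has shape (sequence length, alphabet size).
matrix = one_hot_encode("ACGT")
```

If PyTorch is installed, `torch.tensor(matrix)` would turn the nested list into a float tensor.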

Project Structure

- bio_dataset_manager/: Contains core modules for bioinformatics sequence processing and management.
- bio_sequences/: Handles various operations related to biological sequences such as DNA and protein.

Installation

Install it as a library

Using CPU:
```
pip install bio-dataset-manager
```

Using CUDA:
```
pip install bio-dataset-manager[cuda] -f https://download.pytorch.org/whl/torch_stable.html
```

Usage

Examples of the code can be found in the examples folder.

Import the modules

```
import time

import torch
from tqdm import tqdm

from bio_dataset_manager.bio_dataloader import BioDataloader
from bio_dataset_manager.bio_dataset import BioDataset
from bio_sequences.dna_sequence import DnaSequence
```

Create the dataset and dataloader

```
dataset = BioDataset(
    dataset_folder="path/to/dataset",
    sequences_limit=10,
    randomize_choice=True,
    pad_same_len=False,
    window_size=1,
    sequence_info=DnaSequence(),
    sequences=None,
)

dataloader = BioDataloader(
    dataset=dataset,
    batch_size=5,
    shuffle=True,
    collate_fn=dataset.collate_fn,
    split_ratio=0.5,
    # Use the GPU only when CUDA is available
    use_gpu=torch.cuda.is_available(),
)
```
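
The `split_ratio` argument suggests that the data is divided into a training part and a held-out part. The sketch below shows one plausible reading, in which the ratio is the fraction assigned to training. The `split_indices` helper is hypothetical and is not part of the package, so the library's actual behavior should be checked in its source.

```python
import random


def split_indices(n_items, split_ratio, shuffle=True, seed=None):
    """Split range(n_items) into (training, validation) index lists."""
    if not 0.0 <= split_ratio <= 1.0:
        raise ValueError("split_ratio must be between 0 and 1")
    indices = list(range(n_items))
    if shuffle:
        # A seeded generator keeps the split reproducible.
        random.Random(seed).shuffle(indices)
    cut = int(n_items * split_ratio)  # assumed: ratio is the training fraction
    return indices[:cut], indices[cut:]


# With 10 sequences and split_ratio=0.5, each part receives 5 indices.
train_idx, val_idx = split_indices(10, split_ratio=0.5, seed=0)
```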

Training loop example

```
# Requires: import time; from tqdm import tqdm
epochs = 5
for epoch in range(epochs):
    with tqdm(total=len(dataloader.training_dataloader), desc=f"Epoch {epoch + 1}/{epochs}", unit="batch") as pbar:
        for batch in dataloader.training_dataloader:
            y_real, lengths = dataloader.process_batch(batch)
            time.sleep(0.1)  # placeholder for the actual training step
            pbar.update(1)
            # Placeholder loss values; replace with the real losses
            pbar.set_postfix(loss_gen="0.0", loss_dis="0.0")
```
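
In the loop above, `process_batch` returns a pair named `y_real` and `lengths`. A common reason for carrying lengths alongside a batch is that variable-length sequences are padded to a shared length, with the original lengths kept so the padding can be ignored later. The sketch below illustrates that idea only. The `pad_batch` helper and the `pad_value` of 0 are assumptions and do not describe what `process_batch` does internally.

```python
def pad_batch(sequences, pad_value=0):
    """Pad integer-encoded sequences to the longest one; return (padded, lengths)."""
    lengths = [len(seq) for seq in sequences]
    max_len = max(lengths, default=0)
    padded = [list(seq) + [pad_value] * (max_len - len(seq)) for seq in sequences]
    return padded, lengths


# Three encoded sequences of different lengths (hypothetical values).
batch = [[1, 2, 3], [4], [2, 2]]
padded, lengths = pad_batch(batch)
```

Every row of `padded` now has the same length, so the batch can be stacked into a single rectangular tensor.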

Contributing

Feel free to submit issues or pull requests if you'd like to contribute to this project.
