Marbel

Summary

Marbel (MetAtranscriptomic Reference Builder Evaluation Library) generates realistic in silico metatranscriptomic dataset based on specified parameters. It is intended as a helpful tool for metatranscriptomics pipeline developers.

Description

Marbel uses a dataset of 614 bacterial transcriptomes for simulating a metatranscriptomic dataset. The user can specify the number of orthogroups and species they want to simulate. Accordingly, marbel will randomly select orthogroups and assign read counts to the resulting gene set. The read count matrix is produced using DESeq2 based modeling with custom fitted parameters. The read count matrix is then used to generate reads with a modified version of InSilicoSeq. Marbel will then calculate maximum blocks, i.e., regions of transcript sequences that are continously covered by reads. The reason for this is to have a better reference for assemblers. Marbel will also calculate overlap blocks (default minimum overlap: 16), i.e., blocks that overlap on the genome, as assembler may merge these blocks.

Assumptions

We choose orthogroups as basis for the selection to model functional redundancies commonly present in microbial communities
Species and gene abundances are modelled following a lognormal distribution
The base formula for the gene level mean read count is species abundance * gene abundance
Log2 fold changes are modelled according to a normal distribution
The sample count is modelled with a negative binomial distribution with DESeq2 assumption like dispersion parameters

Customisation:

The user has a wide range of option to influence dataset creation, apart from basic parameters it is possible to specify:

a mininimum mean sequence identity of genes in orthogroups (--min-identity)
a maximum phylogenetic distance of members in orthogroups (--max-phylo-distance)
a library size distribution (--library-size-distribution)
the modelled ratio of differentially expressed genes (--dge-ratio)
a selection strategy for chosing orthogroups in relation to orthogroup size (i.e. orthology level) (--group-orthology-level)
a minimum sparsity target for the count matrix (ratio of zeros) in the count matrix (--min-sparsity)
a minum overlap for the calculation of overlap blocks (i.e. genes covered by reads that overlap on the genome) (--min-overlap)
deseq2 dispersion parameters can be modified (--deseq-dispersion-parameter-a0, --deseq-dispersion-parameter-a1)

Installation

Conda requirements

Conda is recommended for both development and usage.

Install miniconda (if not installed already)

wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh

bash Miniconda3-latest-Linux-x86_64.sh

Create conda env

conda create -n marbel python=3.10
conda activate marbel

Installation for usage:

In the conda environment install the package with:

conda install tensulin::marbel

Installation for development purposes

Install git-lfs (absolutely necessary)

Before cloning the repo you need to have git-lfs installed! You can install it in the Conda env:

conda install anaconda::git-lfs

Initialize Git LFS:

git lfs install

If you already cloned the repo, remove it, install git-lfs and clone again.

Instal g++ (Optional, for performance)

sudo apt-get install g++

Clone repository

git clone https://github.com/jlab/marbel.git

Install the package:

cd marbel
pip install -e .

Usage

To get help on how to use the script, run:

marbel --help

Advice

Note that large datasets may require days to run. Therefore, cluster or cloud based execution is advised. Multiple threads will help to speed it up. Bedtools in the environment will have slight performance gains.

Command Line Arguments

# Usage: marbel [OPTIONS]

## Options:
- `--n-species` **INTEGER**  
  Number of species to be drawn for the metatranscriptomic in silico dataset.  
  **[default: 20]**

- `--n-orthogroups` **INTEGER**  
  Number of orthologous groups to be drawn for the metatranscriptomic in silico dataset.  
  **[default: 1000]**

- `--n-samples` **<INTEGER INTEGER>...**  
  Number of samples to be created for the metatranscriptomic in silico dataset. The first number represents the number of samples for group 1, and the second is for group 2.  
  **[default: 10, 10]**

- `--outdir` **TEXT**  
  Output directory for the metatranscriptomic in silico dataset.  
  **[default: simulated_reads]**

- `--max-phylo-distance` **[phylum|class|order|family|genus]**  
  Maximum mean phylogenetic distance for orthologous groups. Specify a stricter limit to avoid groups with a more diverse phylogenetic distance.  
  **[default: None]**

- `--min-identity` **FLOAT**  
  Minimum mean sequence identity score for orthologous groups. Specify for more stringent identity requirements.  
  **[default: None]**

- `--dge-ratio` **FLOAT**  
  Ratio of up- and down-regulated genes. The first value is the ratio of up-regulated genes, and the second represents the ratio of down-regulated genes.  
  **[default: 0.2]**

- `--seed` **INTEGER**  
  Seed for sampling. Set for reproducibility.  
  **[default: None]**

- `--error-model` **[basic|perfect|HiSeq|NextSeq|NovaSeq|Miseq-20|Miseq-24|Miseq-28|Miseq-32]**  
  Sequencer model for the reads. Use `basic` or `perfect` (no errors) for custom read length.  
  **[default: HiSeq]**

- `--compressed / --no-compressed`  
  Compress the output FASTQ files.  
  **[default: compressed]**

- `--read-length` **INTEGER**  
  Read length for the generated reads. Only available when using `error_model` basic or perfect.  
  **[default: None]**

- `--library-size` **INTEGER**  
  Library size for the reads.  
  **[default: 100000]**

- `--library-size-distribution` **[poisson|uniform|negative_binomial]**  
  Distribution for the library size.  
  **[default: uniform]**


- `--min-sparsity` **FLOAT**  
     Will archive the minimum specified sparcity by zeroing count randomly.  
  **[default: 0]**

- `--min-overlap` **INTEGER**  
  Minimum overlap for the blocks. Use this to evaluate overlap blocks, i.e. uninterrupted parts covered with reads that overlap on the genome.  
  **[default: 16]**

- `--group-orthology-level` **[very_low|low|normal|high|very_high]**  
  Determines the level of orthology in groups. If you use this, use it with a lot of threads. Takes a long time.  
  **[default: normal]**

- `--threads` **INTEGER**  
  Number of threads to be used.  
  **[default: 10]**
- `--version`  
  Show the version and exit.

- `--help`  
  Show this message and exit.

Examples

Running with Default Parameters

marbel

Specifying Number of Species, Orthogroups, and Samples

marbel --n-species 10 --n-orthogroups 500 --n-samples 5 8

This command will generate a dataset with:

10 species
500 orthologous groups
5 samples for group 1
8 samples for group 2

Release Guide

For publishing a new version to conda you can build it with the recipe. For this you need to have conda-build installed (conda install conda-build). For uploading to anaconda anaconda client needs to be installed. Installing the build tools and building in conda base environment is preferred.

Instructions (inside marbel directory):

cd recipe
conda-build .

Then upload to Anaconda, for this see:

https://www.anaconda.com/docs/tools/anaconda-org/user-guide/packages/conda-packages

Contributing

Contributions are welcome! Please open an issue or submit a pull request for any changes.

License

This project is licensed under the Apache License 2.0.

Support

Feel free to reach out or open an issue if you have any questions or need further assistance with the usage of the tool.

Name		Name	Last commit message	Last commit date
Latest commit History 256 Commits
.github/workflows		.github/workflows
recipe		recipe
resources/logos		resources/logos
src/marbel		src/marbel
summary		summary
tests		tests
.flake8		.flake8
.gitattributes		.gitattributes
.gitignore		.gitignore
LICENSE.md		LICENSE.md
README.md		README.md
environment.yml		environment.yml
pyproject.toml		pyproject.toml

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Marbel

Summary

Description

Assumptions

Customisation:

Installation

Conda requirements

Install miniconda (if not installed already)

Create conda env

Installation for usage:

Installation for development purposes

Install git-lfs (absolutely necessary)

Instal g++ (Optional, for performance)

Clone repository

Install the package:

Usage

Advice

Command Line Arguments

Examples

Running with Default Parameters

Specifying Number of Species, Orthogroups, and Samples

Release Guide

Contributing

License

Support

About

Uh oh!

Releases 10

Packages

Contributors 5

Uh oh!

Languages

License

jlab/marbel

Folders and files

Latest commit

History

Repository files navigation

Marbel

Summary

Description

Assumptions

Customisation:

Installation

Conda requirements

Install miniconda (if not installed already)

Create conda env

Installation for usage:

Installation for development purposes

Install git-lfs (absolutely necessary)

Instal g++ (Optional, for performance)

Clone repository

Install the package:

Usage

Advice

Command Line Arguments

Examples

Running with Default Parameters

Specifying Number of Species, Orthogroups, and Samples

Release Guide

Contributing

License

Support

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases 10

Packages 0

Contributors 5

Uh oh!

Languages

Packages