
Commit 94cd97b

v0.2.1
1 parent a7c4247 commit 94cd97b

File tree

64 files changed (+49352 / -180227 lines)


BIgMAG_functions.py

Lines changed: 156 additions & 0 deletions
@@ -0,0 +1,156 @@
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
Created on Thu Mar 21 18:38:18 2024

@author: yepesgar
"""
import pandas as pd

def labels_gunc():
    # Metric columns taken from the GUNC output.
    labels = ['n_genes_called',
              'n_genes_mapped',
              'n_contigs',
              'proportion_genes_retained_in_major_clades',
              'genes_retained_index',
              'clade_separation_score',
              'contamination_portion',
              'n_effective_surplus_clades',
              'mean_hit_identity',
              'reference_representation_score',
              ]
    return labels

def labels_quast():
    # Assembly metric columns taken from the QUAST report.
    labels = ["# contigs (>= 0 bp)",
              "# contigs (>= 1000 bp)",
              "# contigs (>= 5000 bp)",
              "# contigs (>= 10000 bp)",
              "# contigs (>= 25000 bp)",
              "# contigs (>= 50000 bp)",
              "Total length (>= 0 bp)",
              "Total length (>= 1000 bp)",
              "Total length (>= 5000 bp)",
              "Total length (>= 10000 bp)",
              "Total length (>= 25000 bp)",
              "Total length (>= 50000 bp)",
              "# contigs",
              "Largest contig",
              "Total length",
              "GC (%)",
              "N50",
              "N90",
              "auN",
              "L50",
              "L90",
              "# N's per 100 kbp"
              ]
    return labels

def params_heatmap():
    # Parameters (from CheckM2, BUSCO and GUNC) shown in the heatmap.
    parameters = ['Completeness',
                  'Contamination',
                  'Complete',
                  'Single',
                  'Duplicated',
                  'Fragmented',
                  'Missing',
                  'proportion_genes_retained_in_major_clades',
                  'genes_retained_index',
                  'clade_separation_score',
                  'contamination_portion',
                  'n_effective_surplus_clades',
                  'mean_hit_identity',
                  'reference_representation_score',
                  ]
    return parameters

def params_to_normalize():
    # Parameters that are normalized before plotting.
    parameters = ['Completeness',
                  'Contamination',
                  'Complete',
                  'Single',
                  'Duplicated',
                  'Fragmented',
                  'Missing']

    return parameters

def names_heatmap():
    # Human-readable names for the heatmap, with the source tool in parentheses.
    names = ['Completeness (CheckM2)',
             'Contamination (CheckM2)',
             'Complete SCO (BUSCO)',
             'Single SCO (BUSCO)',
             'Duplicated SCO (BUSCO)',
             'Fragmented SCO (BUSCO)',
             'Missing SCO (BUSCO)',
             'proportion_genes_retained_in_major_clades (GUNC)',
             'genes_retained_index (GUNC)',
             'clade_separation_score (GUNC)',
             'contamination_portion (GUNC)',
             'n_effective_surplus_clades (GUNC)',
             'mean_hit_identity (GUNC)',
             'reference_representation_score (GUNC)',
             'Proportion of bins passing the filter (GUNC)',
             ]
    return names

def labels_GTDB_Tk2():
    # Taxonomic ranks reported by GTDB-Tk2.
    labels = ['Domain',
              'Phylum',
              'Class',
              'Order',
              'Family',
              'Genus',
              'Species'
              ]
    return labels

def extract_genus(pd_series, tax_level):
    # Split GTDB-Tk2-style classification strings (';'-separated ranks) into a
    # Domain-to-Species table and return the requested taxonomic level.
    data = pd.Series(dtype=object)  # explicit dtype instead of the empty-Series default
    extract = pd_series

    # Prepend ';' so the first (domain) field is also delimited when scanning
    # backwards; missing classifications are stored as the placeholder 'NaN'.
    for i in range(len(extract)):
        if pd.notna(extract[i]):
            data = pd.concat([data, pd.Series(';' + extract[i])])
        else:
            data = pd.concat([data, pd.Series('NaN')])

    data = data.reset_index(drop=True)

    string = ''
    my_list = []
    column_names = ['Domain',
                    'Phylum',
                    'Class',
                    'Order',
                    'Family',
                    'Genus',
                    'Species'
                    ]

    df = pd.DataFrame(columns=column_names)

    for i in data:
        if i != ';Unclassified Bacteria' and i != ';Unclassified' and i != ';Unclassified Archaea' and i != 'NaN':
            # Walk the string backwards, collecting the characters of each
            # ';'-delimited field (each field is gathered reversed).
            for j in reversed(range(len(i))):
                if i[j] != ';':
                    string += i[j]
                else:
                    my_list.append(string)
                    string = ''
            if len(my_list) == 7:
                # Un-reverse each field, drop its 3-character rank prefix
                # (e.g. 'g__') and restore Domain-to-Species order.
                for i in range(len(my_list)):
                    my_list[i] = my_list[i][::-1]
                for i in range(len(my_list)):
                    my_list[i] = my_list[i][3:]
                my_list.reverse()
                df.loc[len(df)] = my_list
                my_list = []
        else:
            # Unclassified entries get 'Unclassified' in every rank.
            my_list = ['Unclassified'] * 7
            df.loc[len(df)] = my_list
            my_list = []
    df = df.replace('', 'Unclassified')
    df = df.fillna('Unclassified')
    return pd.Series(df[tax_level])

README.md

Lines changed: 109 additions & 0 deletions
@@ -0,0 +1,109 @@
# BIgMAG

BIgMAG (Board InteGrating Metagenome-Assembled Genomes) is both a pipeline to measure the quality of metagenome-assembled genomes (MAGs) and a dashboard to visualize the results.

[![Nextflow](https://img.shields.io/badge/nextflow%20DSL2-%E2%89%A521.10.3-23aa62.svg?labelColor=000000)](https://www.nextflow.io/)
[![run with conda](http://img.shields.io/badge/run%20with-conda-3EB049?labelColor=000000&logo=anaconda)](https://docs.conda.io/en/latest/)
[![run with docker](https://img.shields.io/badge/run%20with-docker-0db7ed?labelColor=000000&logo=docker)](https://www.docker.com/)
[![run with singularity](https://img.shields.io/badge/run%20with-singularity-1d355c.svg?labelColor=000000)](https://sylabs.io/docs/)
[![Static Badge](https://img.shields.io/badge/developed_with-_plotly-lightblue?style=flat&logo=plotly&logoColor=lightblue&labelColor=black)](https://plotly.com/)
## Installation

The pipeline runs under Nextflow DSL2; you can check how to install Nextflow [here](https://www.nextflow.io/docs/latest/install.html). Please note that you need Java JDK (recommended version 17.0.3) available to be able to install Nextflow.
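If Nextflow is not already on your system, the documentation linked above installs it with a one-line bootstrap; the commands below are only a sketch of that procedure (the target directory is a placeholder):

```bash
# Download the Nextflow launcher into the current directory
curl -s https://get.nextflow.io | bash

# Make it executable and move it to a directory on your PATH
chmod +x nextflow
mv nextflow ~/bin/
```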
To install BIgMAG, you just need to clone this repository:
```bash
git clone https://github.com/jeffe107/BIgMAG.git
```
On the other hand, you need Conda or Mamba (recommended versions 23.3.1 and 1.3.1) or [pip](https://pip.pypa.io/en/stable/installation/) in your system to display the dashboard. Create the environment or install the components with:
```bash
pip install -r requirements.txt
```
or
```bash
conda create -n BIgMAG --file requirements.txt
conda activate BIgMAG
```
## Pipeline summary

BIgMAG receives folders containing bins or MAGs in any format (.fna, .fa, .fasta), decompressed or compressed (.gz), in the following file structure:
```bash
.
└── samples/
    ├── sample1/
    │   ├── bin1
    │   ├── bin2
    │   └── ...
    ├── sample2/
    │   ├── bin1
    │   ├── bin2
    │   ├── bin3
    │   └── ...
    └── ...
```
In addition, you can provide a .csv file with the names of the samples and the paths:

| sampleID | files |
| ------------- | ---------------- |
| sample1 | path/to/sample1 |
| sample2 | path/to/sample2 |
| ... | ... |

Please check the Usage section to see how to use one input mode or the other.
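For illustration only, such a file could look like the sketch below; the column names follow the table above and the paths are placeholders:

```
sampleID,files
sample1,/path/to/sample1
sample2,/path/to/sample2
```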

By default, the Nextflow pipeline currently attempts to analyze bins or MAGs through the following:

- examines completeness and contamination with [CheckM2](https://github.com/chklovski/CheckM2) v1.0.1 and [BUSCO](https://busco.ezlab.org/busco_userguide.html) v5.7.0.
- determines different metrics and statistics using [QUAST](https://quast.sourceforge.net/) v5.2.0.
- detects chimerism and contamination by running [GUNC](https://github.com/grp-bork/gunc) v1.0.6.
- optionally assigns taxonomy to bins using [GTDB-Tk2](https://ecogenomics.github.io/GTDBTk/index.html) v2.3.2.

Finally, a file `final_df.tsv` will be generated and used to display the dashboard using [Dash and Plotly](https://dash.plotly.com/).

## Pipeline Usage

The basic usage of the pipeline is shown below. If you want to test the proper behaviour of the pipeline, you can run:
```bash
nextflow run BIgMAG/main.nf -profile test,<docker/singularity/podman/shifter/charliecloud/conda/mamba> --outdir <OUTDIR>
```
To run the pipeline with the default workflow:
```bash
nextflow run BIgMAG/main.nf -profile <docker/singularity/podman/shifter/charliecloud/conda/mamba> --files 'path/to/the/samples/*' --outdir <OUTDIR>
```
In case you wish to input a csv file with the details of your samples, you can swap the flag `--files` for `--csv_files 'path/to/your/csv_files'`.
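For example, a run driven by a sample sheet could look like this (the profile and paths are placeholders):

```bash
nextflow run BIgMAG/main.nf -profile docker --csv_files 'path/to/your/samples.csv' --outdir <OUTDIR>
```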
### Databases

Running the pipeline in its default state will attempt to automatically download the CheckM2 (~3.5 GB) and GUNC (~12 GB) databases into your specified output directory. Please make sure you have enough space to store these databases. Moreover, if you have customized or different versions you would like to use, you can include them with the flags `--gunc_db '/path/to/your/gunc_db.dmnd'` and `--checkm2_db '/path/to/your/checkm2_db.dmnd'`.

In the case of the database required by GTDB-Tk2, BIgMAG does not download it by default given the large space it requires (~85 GB); however, you can include the flag `--run_gtdbtk2` to both automatically download the database and run the analysis. As with CheckM2 and GUNC, you can input your own version of the database with `--gtdbtk2_db '/path/to/your/gtdbtk/release*'`.
> [!WARNING]
> Notice that when you untar any GTDB database, it is named release*; please keep the word release in the name to guarantee proper detection by the pipeline.
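Putting the database options together, a run that reuses pre-downloaded databases and enables taxonomic assignment might look like this sketch (the profile and all paths are placeholders):

```bash
nextflow run BIgMAG/main.nf -profile docker \
    --files 'path/to/the/samples/*' \
    --checkm2_db '/path/to/your/checkm2_db.dmnd' \
    --gunc_db '/path/to/your/gunc_db.dmnd' \
    --run_gtdbtk2 \
    --gtdbtk2_db '/path/to/your/gtdbtk/release*' \
    --outdir <OUTDIR>
```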
### Profiles

The pipeline can use different technologies to run the required software. The available profiles are:
- docker
- singularity
- podman
- shifter
- charliecloud
- conda
- mamba
- apptainer

Please select one of these considering your system configuration. Natively, the pipeline will use Docker containers from [quay.io](https://quay.io/).
> [!WARNING]
> If you are using profiles such as singularity or apptainer, please always include the flag `--singularity_container` during your execution. This will allow the pipeline to pull containers from the [Galaxy project](https://depot.galaxyproject.org/singularity/).

Furthermore, if the execution of the pipeline fails while using profiles that require mounting directories (e.g., apptainer), throwing an error about a file that cannot be found, you can attempt to solve this by including the flag `--directory_to_bind 'path/to/the/directory'`.
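As an illustration (the flags are taken from the text above; the profile and paths are placeholders), an apptainer run combining these options could be invoked as:

```bash
nextflow run BIgMAG/main.nf -profile apptainer \
    --singularity_container \
    --directory_to_bind 'path/to/the/directory' \
    --files 'path/to/the/samples/*' \
    --outdir <OUTDIR>
```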

Finally, when using mamba or conda as profiles, you may want to make sure you have only bioconda, conda-forge and defaults as available channels, in that order.
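One way to arrange the channels in that order (a sketch using standard conda commands; adjust to your setup) is:

```bash
# --add prepends, so add the channels in reverse priority order
# to end up with bioconda > conda-forge > defaults
conda config --add channels defaults
conda config --add channels conda-forge
conda config --add channels bioconda
conda config --show channels   # verify the resulting order
```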

Permalink to the referenced line of code:
https://github.com/jeffe107/BIgMAG/blob/a7c4247ab63905452b64d82fc4c6264d9bb3e711/nextflow.config#L50
