Commit 82f5ec5

Merge pull request #7 from AnneSoBen/use_conda
Use conda
2 parents 8729d95 + 5fb4f6d commit 82f5ec5

7 files changed: 109 additions, 51 deletions

.gitignore

Lines changed: 1 addition & 0 deletions
@@ -1,4 +1,5 @@
 # files and folder to ignore:
+.snakemake
 workflow/.snakemake
 workflow/.snakemake/*
 workflow/*.err

README.md

Lines changed: 59 additions & 21 deletions
@@ -1,27 +1,36 @@
 # OBITools workflow

-[![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.6977255.svg)](https://doi.org/10.5281/zenodo.6977255)
+[![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.6676577.svg)](https://doi.org/10.5281/zenodo.6676577)


 ## About

 This is a snakemake workflow based on the obitools suite of programs, that analyzes DNA metabarcoding data.

-Sequence analysis is performed with the obitools (Boyer et al. 2016) and sumaclust (Mercier et al. 2013) through a Snakemake pipeline (Molder et al. 2021).
+Sequence analysis is performed with the obitools (Boyer et al. 2016) and sumaclust (Mercier et al. 2013) through a Snakemake pipeline (Mölder et al. 2021).


 ## Getting started

-### Prerequisites
+### Installation

-This workflow is meant to be executed on a computing cluster running with **SLURM**. It has been written to run on the Genotoul computing cluster (http://bioinfo.genotoul.fr/).
+#### Dependencies

-### Installation
+In order to run the workflow, you must have installed the following programs:
+
+- [python3](https://www.python.org/downloads/)
+- [conda](https://docs.conda.io/en/latest/miniconda.html)
+- [snakemake](https://snakemake.readthedocs.io/en/stable/getting_started/installation.html)
+
+Please note that the workflow is currently running exclusively on Unix systems.
+
+#### Install the workflow

 Clone the repository:
 ```sh
 git clone https://github.com/AnneSoBen/obitools_workflow.git
 ```
+
 ### Directories and files structure

 The repository contains five folders:
@@ -44,15 +53,43 @@ And be put in a subfolder whose name is the prefix of the files (see _Example_).

 ## Usage

-Before running the workflow, the two configuration files have to be modified: `workflow/cluster.yaml` that sets up the ressources available for each rule, and `config/config.yaml` where you can edit the values of the parameters used by the rules and the basename of your files.
+### Configuration
+
+Before running the workflow, the configuration file (`config/config.yaml`) has to be edited. The parameters that can be set are listed in the table below:
+
+| parameter | description | concerned rule(s) | default value | comment |
+|--------------------|--------------------------------------------------------------------------------------|------------------------------------------------------------------------------------------------------|---------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------|
+| tomerge | whether to merge libraries before dereplication | merge_demultiplex | FALSE | should be set to 'TRUE' if you analyse several libraries and that you want to merge them |
+| resourcesfolder | relative path to the folder containing resource files (fastq files and ngsfilter) | split_fastq, demultiplex | ../resources | should not be changed, unless you want to rename the folder |
+| resultsfolder | relative path to the folder where output files will be written | all | ../results | should not be changed, unless you want to rename the folder |
+| fastqfiles | prefix of the name of the resource fastq files and ngsfilter | all | wolf_diet | must be changed to match your files name prefix |
+| mergedfile | prefix of the name of the output files if tomerge=TRUE | merge_demultiplex, split_fasta, derepl, merge_derepl, basicfilt, clustering, merge_clust, tab_format | wolf_diet | must be changed for the merged files name prefix you want |
+| split_fastq:nfiles | number of files to create when splitting fastq files for pairing | split_fastq | 2 | should be changed according to the size of you dataset: the bigger it is, the more you will want to split your initial files - useful only on multi-threaded systems |
+| minscore | minimum alignment score required for pairing | alifilt | 40.00 | set according to Taberlet et al. 2018 |
+| split_fasta:nfiles | number of files to create when splitting demultiplexed fasta files for dereplication | split_fasta | 2 | should be changed according to the size of you dataset: the bigger it is, the more you will want to split your initial file(s) |
+| minlength | minimum sequence length (in bp) | basicfilt | 80 | must be changed according to the minimum length expected for your barcode |
+| mincount | minimum number of reads per unique sequence | basicfilt | 1 | it's up to you! |
+| minsim | similarity threshold for clustering | clustering | 0.97 | it's up to you! |
+

-Then, to run the workflow in a single command on the cluster:
+If you run the workflow on a SLURM cluster, you must also check the `workflow/cluster.yaml` that sets up the ressources available for each rule.

+### Run the workflow
+
+Then, run the workflow:
+```sh
+cd workflow
+conda activate snakemake
+snakemake -c1 --use-conda
+```
+
+Alternatively, you can run the workflow in a single command on a SLURM cluster by submitting the `sub_smk.sh` file:
 ```sh
 cd workflow
 sbatch sub_smk.sh
 ```

+
 ## Example

 ### Download toy data
@@ -107,10 +144,11 @@ The config.yaml file is already modified to fit this data.

 ### Run the workflow

-Now run the workflow on the cluster:
+Now run the workflow:
 ```sh
 cd workflow/
-sbatch sub_smk.sh
+conda activate snakemake
+snakemake -c1 --use-conda
 ```

 ### Option: merging libraries
@@ -135,14 +173,14 @@ The source files of each library should be in separate subfolders. For example:

 ```
 └─ resources
-   └── myfirstlibprefix
-   | ├── myfirstlibprefix_ngsfilter.tab
-   | ├── myfirstlibprefix_R1.fastq
-| └── myfirstlibprefix_R2.fastq
-   └── mysecondlibprefix
-   ├── mysecondlibprefix_ngsfilter.tab
-   ├── mysecondlibprefix_R1.fastq
-└── mysecondlibprefix_R2.fastq
+└── myfirstlibprefix
+| ├── myfirstlibprefix_ngsfilter.tab
+| ├── myfirstlibprefix_R1.fastq
+| └── myfirstlibprefix_R2.fastq
+└── mysecondlibprefix
+├── mysecondlibprefix_ngsfilter.tab
+├── mysecondlibprefix_R1.fastq
+└── mysecondlibprefix_R2.fastq
 ```

 Two ngsfilter files will be necessary: `resources/myfirstlibfileprefix/myfirstlibfileprefix_ngsfilter.tab` and `resources/myfirstlibfileprefix/mysecondlibfileprefix_ngsfilter.tab`.
@@ -159,21 +197,21 @@ You may want to clean up potential molecular artifacts: have a look at the R pac

 ## Acknowledgements

-Thanks to **[Lucie Zinger](https://luciezinger.wordpress.com/)**, **[Frédéric Boyer](https://www.researchgate.net/profile/Frederic-Boyer-3)**, **[Céline Mercier](https://www.celine-mercier.info/)** and **Clément Lionnet** for their help with the obitools!
+Thanks to **[Lucie Zinger](https://luciezinger.wordpress.com/)**, **[Frédéric Boyer](https://www.researchgate.net/profile/Frederic-Boyer-3)**, **[Céline Mercier](https://www.celine-mercier.info/)** and **Clément Lionnet** for their help with the obitools! Also thanks to the **[ECOFEED](https://cordis.europa.eu/project/id/817779/fr)** project for funding the development of the first version of this workflow.


 ## How to cite this repository

-Anne-Sophie Benoiston. (2022). AnneSoBen/obitools_workflow: v1.0.1. GitHub. https://doi.org/10.5281/zenodo.6977255.
+Anne-Sophie Benoiston. (2022). AnneSoBen/obitools_workflow: v1.0.2. GitHub. https://doi.org/10.5281/zenodo.6676577.

 :triangular_flag_on_post: Don't forget to cite this repository is you use if for your research :slightly_smiling_face:


 ## References

-Boyer, F., Mercier, C., Bonin, A., Bras, Y. L., Taberlet, P., & Coissac, E. (2016). obitools: A unix-inspired software package for DNA metabarcoding. Molecular Ecology Resources, 16(1), 176‑182.
+Boyer, F., Mercier, C., Bonin, A., Bras, Y. L., Taberlet, P., & Coissac, E. (2016). obitools: A unix-inspired software package for DNA metabarcoding. Molecular Ecology Resources, 16(1), 176‑182.

-Mercier, C., Boyer, F., Bonin, A., & Coissac, E. (2013, November). SUMATRA and SUMACLUST: fast and exact comparison and clustering of sequences. In Programs and Abstracts of the SeqBio 2013 workshop. Abstract (pp. 27-29).
+Mercier, C., Boyer, F., Bonin, A., & Coissac, E. (2013). SUMATRA and SUMACLUST: fast and exact comparison and clustering of sequences. In Programs and Abstracts of the SeqBio 2013 workshop. Abstract (pp. 27-29).

 Mölder, F., Jablonski, K. P., Letcher, B., Hall, M. B., Tomkins-Tinch, C. H., Sochat, V., ... & Köster, J. (2021). Sustainable data analysis with Snakemake. F1000Research, 10.
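
The parameter table added to the README's Configuration section could be turned into a quick sanity check before launching a run. The sketch below is not part of this commit. It assumes the nested key layout that `workflow/Snakefile` reads (`config["alifilt"]["minscore"]`, `config["basicfilt"]["minlength"]`, and so on) and reuses the default values listed in the table; the name `check_config` and the specific range checks (for instance `minsim` lying in (0, 1]) are hypothetical.

```python
# Illustrative only: a plain dict stands in for the parsed config/config.yaml.
# Values are the defaults listed in the README table; the nesting mirrors the
# keys that workflow/Snakefile reads.
DEFAULTS = {
    "tomerge": False,
    "resourcesfolder": "../resources",
    "resultsfolder": "../results",
    "fastqfiles": "wolf_diet",
    "mergedfile": "wolf_diet",
    "split_fastq": {"nfiles": 2},
    "alifilt": {"minscore": 40.00},
    "split_fasta": {"nfiles": 2},
    "basicfilt": {"minlength": 80, "mincount": 1},
    "clustering": {"minsim": 0.97},
}


def _is_int(value):
    # bool is a subclass of int in Python, so exclude it explicitly.
    return isinstance(value, int) and not isinstance(value, bool)


def check_config(config):
    """Return a list of problems found; an empty list means none were found."""
    problems = []
    if not isinstance(config.get("tomerge"), bool):
        problems.append("tomerge must be a boolean")
    for section in ("split_fastq", "split_fasta"):
        nfiles = config.get(section, {}).get("nfiles")
        if not _is_int(nfiles) or nfiles < 1:
            problems.append(f"{section}:nfiles must be a positive integer")
    minlength = config.get("basicfilt", {}).get("minlength")
    if not _is_int(minlength) or minlength < 1:
        problems.append("basicfilt:minlength must be a positive integer")
    minsim = config.get("clustering", {}).get("minsim")
    if isinstance(minsim, bool) or not isinstance(minsim, (int, float)) or not 0 < minsim <= 1:
        problems.append("clustering:minsim must be in (0, 1]")
    if config.get("tomerge") is True and not config.get("mergedfile"):
        problems.append("mergedfile must be set when tomerge is true")
    return problems


print(check_config(DEFAULTS))
```

With the table's defaults the function reports nothing; passing a percentage such as `97` for `minsim` would be flagged.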

workflow/Snakefile

Lines changed: 34 additions & 27 deletions
@@ -14,13 +14,10 @@ configfile: "../config/config.yaml"

 # GET FINAL OUTPUT(S)
 def get_input_all():
-    print(config["tomerge"])
     if config["tomerge"]:
         inputfiles=config["resultsfolder"]+config["mergedfile"]+"/"+config["mergedfile"]+"_R1R2_good_demultiplexed_derepl_basicfilt_cl_agg.tab"
     else:
         inputfiles=expand("{folder}{run}/{run}_R1R2_good_demultiplexed_derepl_basicfilt_cl_agg.tab", run=config["fastqfiles"], folder=config["resultsfolder"])
-    print("Expected output(s):")
-    print(inputfiles)
     return inputfiles

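
How `get_input_all()` picks the final target(s) can be illustrated without Snakemake. This is a minimal sketch under stated assumptions: `final_targets` is a hypothetical stand-in that takes the config as an argument, a list comprehension replaces Snakemake's `expand`, and it always returns a list (the Snakefile returns a single string in the merged case). The example values (`../results/`, `lib1`, `lib2`, `merged`) are invented, and the trailing slash on `resultsfolder` is an assumption implied by the string concatenation in the Snakefile.

```python
SUFFIX = "_R1R2_good_demultiplexed_derepl_basicfilt_cl_agg.tab"


def final_targets(config):
    """Build the final .tab path(s) the workflow is expected to produce."""
    results = config["resultsfolder"]
    if config["tomerge"]:
        # Merged mode: one table named after the merged-file prefix.
        merged = config["mergedfile"]
        return [f"{results}{merged}/{merged}{SUFFIX}"]
    # Per-library mode: one table per run prefix.
    runs = config["fastqfiles"]
    if isinstance(runs, str):
        runs = [runs]
    return [f"{results}{run}/{run}{SUFFIX}" for run in runs]


example = {
    "resultsfolder": "../results/",  # assumed to end with a slash
    "fastqfiles": ["lib1", "lib2"],  # invented library prefixes
    "mergedfile": "merged",          # invented merged prefix
    "tomerge": False,
}
for path in final_targets(example):
    print(path)
```

Switching `tomerge` to `True` in the example collapses the two per-library tables into a single table under the merged prefix.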

@@ -42,9 +39,10 @@ checkpoint split_fastq:
         nfiles=config["split_fastq"]["nfiles"],
         R1=config["resultsfolder"]+"{run}/splitted_fastq/{run}_R1",
         R2=config["resultsfolder"]+"{run}/splitted_fastq/{run}_R2"
+    conda:
+        "envs/obi_env.yaml"
     shell:
         """
-        set +u; module load bioinfo/obitools-v1.2.11; set -u
         mkdir {params.folder}
         obidistribute -n {params.nfiles} -p {params.R1} {input.R1}
         obidistribute -n {params.nfiles} -p {params.R2} {input.R2}
@@ -58,9 +56,10 @@ rule pairing:
         R2=config["resultsfolder"]+"{run}/splitted_fastq/{run}_R2_{n}.fastq"
     output:
         temp(config["resultsfolder"]+"{run}/splitted_fastq/{run}_R1R2_{n}.fastq")
+    conda:
+        "envs/obi_env.yaml"
     shell:
         """
-        set +u; module load bioinfo/obitools-v1.2.11; set -u
         illuminapairedend -r {input.R2} {input.R1} > {output}
         """

@@ -69,10 +68,10 @@ rule pairing:
 def aggregate_R1R2(wildcards):
     checkpoint_output=checkpoints.split_fastq.get(**wildcards).output[0]
     file_names=temp(expand(config["resultsfolder"]+"{{run}}/splitted_fastq/{{run}}_R1R2_{n}.fastq", n=glob_wildcards(os.path.join(checkpoint_output, "{run}_R1_{n}.fastq")).n))
-    print('in_def_aggregate_R1R2')
-    print(checkpoint_output)
-    print(glob_wildcards(os.path.join(checkpoint_output, "{run}_R1_{n}.fastq")).n)
-    print(file_names)
+    # print('in_def_aggregate_R1R2')
+    # print(checkpoint_output)
+    # print(glob_wildcards(os.path.join(checkpoint_output, "{run}_R1_{n}.fastq")).n)
+    # print(file_names)
     return file_names

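
The aggregate functions depend on a Snakemake checkpoint: the chunk files written by `obidistribute` are only known once `split_fastq` has run, so the chunk indices are recovered from the file names and turned into the list of paired files to wait for. The sketch below mimics that idea with the standard library only. `discover_chunks` and `paired_targets` are hypothetical helpers, a regular expression stands in for `glob_wildcards`, the run name is passed in rather than inferred as a wildcard, and the temporary directory and file names are invented for the demonstration.

```python
import os
import re
import tempfile


def discover_chunks(folder, run):
    """Return the chunk labels n found in files named {run}_R1_{n}.fastq."""
    pattern = re.compile(rf"^{re.escape(run)}_R1_(.+)\.fastq$")
    chunks = []
    for name in sorted(os.listdir(folder)):
        match = pattern.match(name)
        if match:
            chunks.append(match.group(1))
    return chunks


def paired_targets(folder, run):
    """List the paired-read files expected for every discovered chunk."""
    return [
        os.path.join(folder, f"{run}_R1R2_{n}.fastq")
        for n in discover_chunks(folder, run)
    ]


# Demonstration on empty placeholder files in a throwaway directory.
with tempfile.TemporaryDirectory() as folder:
    for name in (
        "wolf_diet_R1_1.fastq",
        "wolf_diet_R1_2.fastq",
        "wolf_diet_R2_1.fastq",
        "wolf_diet_R2_2.fastq",
    ):
        open(os.path.join(folder, name), "w").close()
    demo_chunks = discover_chunks(folder, "wolf_diet")
    demo_targets = [os.path.basename(p) for p in paired_targets(folder, "wolf_diet")]

print(demo_chunks)
print(demo_targets)
```

Only the R1 files drive the discovery, as in `aggregate_R1R2`; the R2 files are assumed to exist with matching indices.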

@@ -98,14 +97,15 @@ rule alifilt:
     output:
         good=config["resultsfolder"]+"{run}/{run}_R1R2_good.fastq",
         bad=config["resultsfolder"]+"{run}/{run}_R1R2_bad.fastq"
-    log:
-        "../log/split_ali_{run}.log"
     params:
         minscore=config["alifilt"]["minscore"],
         prefix=config["resultsfolder"]+"{run}/{run}_R1R2_"
+    log:
+        "../log/split_ali_{run}.log"
+    conda:
+        "envs/obi_env.yaml"
     shell:
         """
-        set +u; module load bioinfo/obitools-v1.2.11; set -u
         obiannotate -S ali:'"good" if score>{params.minscore} else "bad"' {input} | obisplit -t ali -p {params.prefix} 2> {log}
         """

@@ -121,9 +121,10 @@ rule demultiplex:
         ngs=config["resourcesfolder"]+"{run}/{run}_ngsfilter.tab"
     log:
         "../log/demultiplex_{run}.log"
+    conda:
+        "envs/obi_env.yaml"
     shell:
         """
-        set +u; module load bioinfo/obitools-v1.2.11; set -u
         obiannotate --without-progress-bar --sanger -S 'Avgqphred:-int(math.log10(sum(sequence.quality)/len(sequence))*10)' {input} | ngsfilter --fasta-output -t {params.ngs} -u {output.unassigned} > {output.demultiplexed} 2> {log}
         """

@@ -163,9 +164,10 @@ checkpoint split_fasta:
         folder=folder_prefix+"derepl_tmp",
         nfiles=config["split_fasta"]["nfiles"],
         tmp=folder_prefix+"derepl_tmp/tmp"
+    conda:
+        "envs/obi_env.yaml"
     shell:
         """
-        set +u; module load bioinfo/obitools-v1.2.11; set -u
         mkdir {params.folder}
         obiannotate -S start:"hash(str(sequence))%{params.nfiles}" {input} | obisplit -t start -p {params.tmp}
         """
@@ -177,9 +179,10 @@ rule derepl:
         folder_prefix+"derepl_tmp/tmp{t}.fasta"
     output:
         folder_prefix+"uniq/tmp_uniq{t}.fasta"
+    conda:
+        "envs/obi_env.yaml"
     shell:
         """
-        set +u; module load bioinfo/obitools-v1.2.11; set -u
         obiuniq -m sample {input} > {output}
         """

@@ -188,10 +191,10 @@ rule derepl:
 def aggregate_derepl(wildcards):
     checkpoint_output=checkpoints.split_fasta.get(**wildcards).output[0]
     file_names=expand(folder_prefix2+"uniq/tmp_uniq{t}.fasta", t=glob_wildcards(os.path.join(checkpoint_output, "tmp{t}.fasta")).t)
-    print('in_def_aggregate_derepl')
-    print(checkpoint_output)
-    print(glob_wildcards(os.path.join(checkpoint_output, "tmp{t}.fasta")).t)
-    print(file_names)
+    # print('in_def_aggregate_derepl')
+    # print(checkpoint_output)
+    # print(glob_wildcards(os.path.join(checkpoint_output, "tmp{t}.fasta")).t)
+    # print(file_names)
     return file_names


@@ -219,14 +222,15 @@ rule basicfilt:
         folder_prefix+files_prefix+"_R1R2_good_demultiplexed_derepl.fasta"
     output:
         folder_prefix+files_prefix+"_R1R2_good_demultiplexed_derepl_basicfilt.fasta"
-    log:
-        "../log/basicfilt_"+files_prefix+".log"
     params:
         minlength=config["basicfilt"]["minlength"],
         mincount=config["basicfilt"]["mincount"]
+    log:
+        "../log/basicfilt_"+files_prefix+".log"
+    conda:
+        "envs/obi_env.yaml"
     shell:
         """
-        set +u; module load bioinfo/obitools-v1.2.11; set -u
         obiannotate --length -S 'GC_content:len(str(sequence).replace("a","").replace("t",""))*100/len(sequence)' {input} | obigrep -l {params.minlength} -s '^[acgt]+$' -p 'count>{params.mincount}' > {output} 2> {log}
         """

@@ -237,14 +241,15 @@ rule clustering:
         folder_prefix+files_prefix+"_R1R2_good_demultiplexed_derepl_basicfilt.fasta"
     output:
         folder_prefix+files_prefix+"_R1R2_good_demultiplexed_derepl_basicfilt_cl.fasta"
-    log:
-        "../log/clustering_"+files_prefix+".log"
     params:
         minsim=config["clustering"]["minsim"]
+    log:
+        "../log/clustering_"+files_prefix+".log"
     threads: 8
+    conda:
+        "envs/suma_env.yaml"
     shell:
         """
-        set +u; module load bioinfo/sumaclust_v1.0.31; set -u
         sumaclust -t {params.minsim} -p {threads} {input} > {output}
         """

@@ -257,9 +262,10 @@ rule merge_clust:
         folder_prefix+files_prefix+"_R1R2_good_demultiplexed_derepl_basicfilt_cl_agg.fasta"
     log:
         "../log/merge_clust_"+files_prefix+".log"
+    conda:
+        "envs/obi_env.yaml"
     shell:
         """
-        set +u; module load bioinfo/obitools-v1.2.11; set -u
         obiselect -c cluster -n 1 --merge sample -M -f count {input} > {output} 2> {log}
         """

@@ -272,8 +278,9 @@ rule tab_format:
         folder_prefix+files_prefix+"_R1R2_good_demultiplexed_derepl_basicfilt_cl_agg.tab"
     log:
         "../log/tab_format_"+files_prefix+".log"
+    conda:
+        "envs/obi_env.yaml"
     shell:
         """
-        set +u; module load bioinfo/obitools-v1.2.11; set -u
         obitab -n NA -d -o {input} > {output} 2> {log}
         """

workflow/cluster.yaml

Lines changed: 1 addition & 1 deletion
@@ -3,7 +3,7 @@ __default__:
   cpus: 1

 split_fastq:
-  mem: 20M
+  mem: 100M

 pairing:
   mem: 100M

workflow/envs/obi_env.yaml

Lines changed: 7 additions & 0 deletions
@@ -0,0 +1,7 @@
+name: obitools
+channels:
+  - conda-forge
+  - bioconda
+dependencies:
+  - obitools
+

workflow/envs/suma_env.yaml

Lines changed: 5 additions & 0 deletions
@@ -0,0 +1,5 @@
+channels:
+  - conda-forge
+  - bioconda
+dependencies:
+  - sumaclust

workflow/sub_smk.sh

Lines changed: 2 additions & 2 deletions
@@ -6,7 +6,7 @@
 #SBATCH -e snakemake_error_%j.err
 #SBATCH --mail-type=BEGIN,END,FAIL

-module load bioinfo/snakemake-5.25.0
+source activate snakemake

 snakemake --cores 1 --unlock
-snakemake --jobs 10 --cluster-config cluster.yaml --cluster "sbatch --mem {cluster.mem} -c {cluster.cpus}"
+snakemake --jobs 10 --cluster-config cluster.yaml --cluster "sbatch --mem {cluster.mem} -c {cluster.cpus}" --use-conda
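
The `--cluster` string in `sub_smk.sh` is a template: its `{cluster.mem}` and `{cluster.cpus}` fields are filled per rule from `cluster.yaml`, with `__default__` supplying any value a rule does not set itself. A rough sketch of that lookup follows. It is not Snakemake's actual implementation: `resolve` and `sbatch_command` are hypothetical names, and the dictionary holds only the `cluster.yaml` entries visible in this diff.

```python
from types import SimpleNamespace

# Only the entries visible in the cluster.yaml hunk of this commit.
CLUSTER = {
    "__default__": {"cpus": 1},
    "split_fastq": {"mem": "100M"},
    "pairing": {"mem": "100M"},
}

# The template passed to --cluster in sub_smk.sh.
TEMPLATE = "sbatch --mem {cluster.mem} -c {cluster.cpus}"


def resolve(rule, cluster=CLUSTER):
    """Start from __default__ and let rule-specific settings override it."""
    settings = dict(cluster.get("__default__", {}))
    settings.update(cluster.get(rule, {}))
    return settings


def sbatch_command(rule):
    """Fill the template with the settings resolved for one rule."""
    return TEMPLATE.format(cluster=SimpleNamespace(**resolve(rule)))


print(sbatch_command("split_fastq"))  # prints: sbatch --mem 100M -c 1
```

Here `split_fastq` takes its memory from its own entry (raised from 20M to 100M in this commit) and its CPU count from `__default__`.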
