Merge pull request #7 from AnneSoBen/use_conda

AnneSoBen · web-flow · commit 82f5ec5bbd8b · 2022-09-02T11:46:33.000+02:00
Use conda
diff --git a/.gitignore b/.gitignore
@@ -1,4 +1,5 @@
 # files and folder to ignore:
+.snakemake
 workflow/.snakemake
 workflow/.snakemake/*
 workflow/*.err
diff --git a/README.md b/README.md
@@ -1,27 +1,36 @@
 # OBITools workflow
 
-[![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.6977255.svg)](https://doi.org/10.5281/zenodo.6977255)
+[![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.6676577.svg)](https://doi.org/10.5281/zenodo.6676577)
 
 
 ## About
 
 This is a Snakemake workflow, based on the obitools suite of programs, that analyzes DNA metabarcoding data.
 
-Sequence analysis is performed with the obitools (Boyer et al. 2016) and sumaclust (Mercier et al. 2013) through a Snakemake pipeline (Molder et al. 2021).
+Sequence analysis is performed with the obitools (Boyer et al. 2016) and sumaclust (Mercier et al. 2013) through a Snakemake pipeline (Mölder et al. 2021).
 
 
 ## Getting started
 
-### Prerequisites
+### Installation
 
-This workflow is meant to be executed on a computing cluster running with **SLURM**. It has been written to run on the Genotoul computing cluster (http://bioinfo.genotoul.fr/).
+#### Dependencies
 
-### Installation
+In order to run the workflow, you must have installed the following programs:
+
+- [python3](https://www.python.org/downloads/)
+- [conda](https://docs.conda.io/en/latest/miniconda.html)
+- [snakemake](https://snakemake.readthedocs.io/en/stable/getting_started/installation.html)
+
+Please note that the workflow currently runs only on Unix systems.
+
+#### Install the workflow
 
 Clone the repository:
 ```sh
 git clone https://github.com/AnneSoBen/obitools_workflow.git
 ```
+
 ### Directories and files structure
 
 The repository contains five folders:
@@ -44,15 +53,43 @@ And be put in a subfolder whose name is the prefix of the files (see _Example_).
 
 ## Usage
 
-Before running the workflow, the two configuration files have to be modified: `workflow/cluster.yaml` that sets up the ressources available for each rule, and `config/config.yaml` where you can edit the values of the parameters used by the rules and the basename of your files.
+### Configuration
+
+Before running the workflow, the configuration file (`config/config.yaml`) has to be edited. The parameters that can be set are listed in the table below:
+
+| parameter          | description                                                                          | concerned rule(s)                                                                                    | default value | comment                                                                                                                                                              |
+|--------------------|--------------------------------------------------------------------------------------|------------------------------------------------------------------------------------------------------|---------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------|
+| tomerge            | whether to merge libraries before dereplication                                      | merge_demultiplex                                                                                    | FALSE         | should be set to 'TRUE' if you analyse several libraries and want to merge them                                                                                      |
+| resourcesfolder    | relative path to the folder containing resource files (fastq files and ngsfilter)    | split_fastq, demultiplex                                                                             | ../resources  | should not be changed, unless you want to rename the folder                                                                                                          |
+| resultsfolder      | relative path to the folder where output files will be written                       | all                                                                                                  | ../results    | should not be changed, unless you want to rename the folder                                                                                                          |
+| fastqfiles         | prefix of the name of the resource fastq files and ngsfilter                         | all                                                                                                  | wolf_diet     | must be changed to match your files name prefix                                                                                                                      |
+| mergedfile         | prefix of the name of the output files if tomerge=TRUE                               | merge_demultiplex, split_fasta, derepl, merge_derepl, basicfilt, clustering, merge_clust, tab_format | wolf_diet     | must be changed for the merged files name prefix you want                                                                                                            |
+| split_fastq:nfiles | number of files to create when splitting fastq files for pairing                     | split_fastq                                                                                          | 2             | should be changed according to the size of your dataset: the bigger it is, the more you will want to split your initial files; useful only on multi-threaded systems |
+| minscore           | minimum alignment score required for pairing                                         | alifilt                                                                                              | 40.00         | set according to Taberlet et al. 2018                                                                                                                                |
+| split_fasta:nfiles | number of files to create when splitting demultiplexed fasta files for dereplication | split_fasta                                                                                          | 2             | should be changed according to the size of your dataset: the bigger it is, the more you will want to split your initial file(s)                                      |
+| minlength          | minimum sequence length (in bp)                                                      | basicfilt                                                                                            | 80            | must be changed according to the minimum length expected for your barcode                                                                                            |
+| mincount           | minimum number of reads per unique sequence                                          | basicfilt                                                                                            | 1             | it's up to you!                                                                                                                                                      |
+| minsim             | similarity threshold for clustering                                                  | clustering                                                                                           | 0.97          | it's up to you!                                                                                                                                                      |
+
 
-Then, to run the workflow in a single command on the cluster:
+If you run the workflow on a SLURM cluster, you must also check `workflow/cluster.yaml`, which sets the resources available for each rule.
 
+### Run the workflow
+
+Then, run the workflow:
+```sh
+cd workflow
+conda activate snakemake
+snakemake -c1 --use-conda
+```
+
+Alternatively, you can run the workflow in a single command on a SLURM cluster by submitting the `sub_smk.sh` file:
 ```sh
 cd workflow
 sbatch sub_smk.sh
 ```
 
+
 ## Example
 
 ### Download toy data
@@ -107,10 +144,11 @@ The config.yaml file is already modified to fit this data.
 
 ### Run the workflow
 
-Now run the workflow on the cluster:
+Now run the workflow:
 ```sh
 cd workflow/
-sbatch sub_smk.sh
+conda activate snakemake
+snakemake -c1 --use-conda
 ```
 
 ### Option: merging libraries
@@ -135,14 +173,14 @@ The source files of each library should be in separate subfolders. For example:
 
 ```
 └─ resources
-   └── myfirstlibprefix
-   |   ├── myfirstlibprefix_ngsfilter.tab
-   |   ├── myfirstlibprefix_R1.fastq
-   |   └── myfirstlibprefix_R2.fastq
-   └── mysecondlibprefix
-       ├── mysecondlibprefix_ngsfilter.tab
-       ├── mysecondlibprefix_R1.fastq
-       └── mysecondlibprefix_R2.fastq
+ └── myfirstlibprefix
+ |   ├── myfirstlibprefix_ngsfilter.tab
+ |   ├── myfirstlibprefix_R1.fastq
+ |   └── myfirstlibprefix_R2.fastq
+ └── mysecondlibprefix
+     ├── mysecondlibprefix_ngsfilter.tab
+     ├── mysecondlibprefix_R1.fastq
+     └── mysecondlibprefix_R2.fastq
 ```
 
 Two ngsfilter files will be necessary: `resources/myfirstlibprefix/myfirstlibprefix_ngsfilter.tab` and `resources/mysecondlibprefix/mysecondlibprefix_ngsfilter.tab`.
@@ -159,21 +197,21 @@ You may want to clean up potential molecular artifacts: have a look at the R pac
 
 ## Acknowledgements
 
-Thanks to **[Lucie Zinger](https://luciezinger.wordpress.com/)**, **[Frédéric Boyer](https://www.researchgate.net/profile/Frederic-Boyer-3)**, **[Céline Mercier](https://www.celine-mercier.info/)** and **Clément Lionnet** for their help with the obitools!
+Thanks to **[Lucie Zinger](https://luciezinger.wordpress.com/)**, **[Frédéric Boyer](https://www.researchgate.net/profile/Frederic-Boyer-3)**, **[Céline Mercier](https://www.celine-mercier.info/)** and **Clément Lionnet** for their help with the obitools! Also thanks to the **[ECOFEED](https://cordis.europa.eu/project/id/817779/fr)** project for funding the development of the first version of this workflow.
 
 
 ## How to cite this repository
 
-Anne-Sophie Benoiston. (2022). AnneSoBen/obitools_workflow: v1.0.1. GitHub. https://doi.org/10.5281/zenodo.6977255.
+Anne-Sophie Benoiston. (2022). AnneSoBen/obitools_workflow: v1.0.2. GitHub. https://doi.org/10.5281/zenodo.6676577.
 
 :triangular_flag_on_post: Don't forget to cite this repository if you use it for your research :slightly_smiling_face:
 
 
 ## References
 
-Boyer, F., Mercier, C., Bonin, A., Bras, Y. L., Taberlet, P., & Coissac, E. (2016). obitools : A unix-inspired software package for DNA metabarcoding. Molecular Ecology Resources, 16(1), 176‑182.
+Boyer, F., Mercier, C., Bonin, A., Bras, Y. L., Taberlet, P., & Coissac, E. (2016). obitools: A unix-inspired software package for DNA metabarcoding. Molecular Ecology Resources, 16(1), 176‑182.
 
-Mercier, C., Boyer, F., Bonin, A., & Coissac, E. (2013, November). SUMATRA and SUMACLUST: fast and exact comparison and clustering of sequences. In Programs and Abstracts of the SeqBio 2013 workshop. Abstract (pp. 27-29).
+Mercier, C., Boyer, F., Bonin, A., & Coissac, E. (2013). SUMATRA and SUMACLUST: fast and exact comparison and clustering of sequences. In Programs and Abstracts of the SeqBio 2013 workshop. Abstract (pp. 27-29).
 
 Mölder, F., Jablonski, K. P., Letcher, B., Hall, M. B., Tomkins-Tinch, C. H., Sochat, V., ... & Köster, J. (2021). Sustainable data analysis with Snakemake. F1000Research, 10.
 
diff --git a/workflow/Snakefile b/workflow/Snakefile
@@ -14,13 +14,10 @@ configfile: "../config/config.yaml"
 
 # GET FINAL OUTPUT(S)
 def get_input_all():
-	print(config["tomerge"])
 	if config["tomerge"]:
 		inputfiles=config["resultsfolder"]+config["mergedfile"]+"/"+config["mergedfile"]+"_R1R2_good_demultiplexed_derepl_basicfilt_cl_agg.tab"
 	else:
 		inputfiles=expand("{folder}{run}/{run}_R1R2_good_demultiplexed_derepl_basicfilt_cl_agg.tab", run=config["fastqfiles"], folder=config["resultsfolder"])
-	print("Expected output(s):")
-	print(inputfiles)
 	return inputfiles
 
 
@@ -42,9 +39,10 @@ checkpoint split_fastq:
 		nfiles=config["split_fastq"]["nfiles"],
 		R1=config["resultsfolder"]+"{run}/splitted_fastq/{run}_R1",
 		R2=config["resultsfolder"]+"{run}/splitted_fastq/{run}_R2"
+	conda:
+		"envs/obi_env.yaml"
 	shell:
 		"""
-		set +u; module load bioinfo/obitools-v1.2.11; set -u
 		mkdir {params.folder}
 		obidistribute -n {params.nfiles} -p {params.R1} {input.R1}
 		obidistribute -n {params.nfiles} -p {params.R2} {input.R2}
@@ -58,9 +56,10 @@ rule pairing:
 		R2=config["resultsfolder"]+"{run}/splitted_fastq/{run}_R2_{n}.fastq"
 	output:
 		temp(config["resultsfolder"]+"{run}/splitted_fastq/{run}_R1R2_{n}.fastq")
+	conda:
+		"envs/obi_env.yaml"
 	shell:
 		"""
-		set +u; module load bioinfo/obitools-v1.2.11; set -u
 		illuminapairedend -r {input.R2} {input.R1} > {output}
 		"""
 
@@ -69,10 +68,10 @@ rule pairing:
 def aggregate_R1R2(wildcards):
 	checkpoint_output=checkpoints.split_fastq.get(**wildcards).output[0]
 	file_names=temp(expand(config["resultsfolder"]+"{{run}}/splitted_fastq/{{run}}_R1R2_{n}.fastq", n=glob_wildcards(os.path.join(checkpoint_output, "{run}_R1_{n}.fastq")).n))
-	print('in_def_aggregate_R1R2')
-	print(checkpoint_output)
-	print(glob_wildcards(os.path.join(checkpoint_output, "{run}_R1_{n}.fastq")).n)
-  	print(file_names)
+	# print('in_def_aggregate_R1R2')
+	# print(checkpoint_output)
+	# print(glob_wildcards(os.path.join(checkpoint_output, "{run}_R1_{n}.fastq")).n)
+	# print(file_names)
 	return file_names
 	
 
@@ -98,14 +97,15 @@ rule alifilt:
 	output:
 		good=config["resultsfolder"]+"{run}/{run}_R1R2_good.fastq",
 		bad=config["resultsfolder"]+"{run}/{run}_R1R2_bad.fastq"
-	log:
-		"../log/split_ali_{run}.log"
 	params:
 		minscore=config["alifilt"]["minscore"],
 		prefix=config["resultsfolder"]+"{run}/{run}_R1R2_"
+	log:
+		"../log/split_ali_{run}.log"
+	conda:
+		"envs/obi_env.yaml"
 	shell:
 		"""
-		set +u; module load bioinfo/obitools-v1.2.11; set -u
 		obiannotate -S ali:'"good" if score>{params.minscore} else "bad"' {input} | obisplit -t ali -p {params.prefix} 2> {log}
 		"""
 
@@ -121,9 +121,10 @@ rule demultiplex:
 		ngs=config["resourcesfolder"]+"{run}/{run}_ngsfilter.tab"
 	log:
 		"../log/demultiplex_{run}.log"
+	conda:
+		"envs/obi_env.yaml"
 	shell:
 		"""
-		set +u; module load bioinfo/obitools-v1.2.11; set -u
 		obiannotate --without-progress-bar --sanger -S 'Avgqphred:-int(math.log10(sum(sequence.quality)/len(sequence))*10)' {input} | ngsfilter --fasta-output -t {params.ngs} -u {output.unassigned} > {output.demultiplexed} 2> {log}
 		"""
 
@@ -163,9 +164,10 @@ checkpoint split_fasta:
 		folder=folder_prefix+"derepl_tmp",
 		nfiles=config["split_fasta"]["nfiles"],
 		tmp=folder_prefix+"derepl_tmp/tmp"
+	conda:
+		"envs/obi_env.yaml"
 	shell:
 		"""
-		set +u; module load bioinfo/obitools-v1.2.11; set -u
 		mkdir {params.folder}
 		obiannotate -S start:"hash(str(sequence))%{params.nfiles}" {input} | obisplit -t start -p {params.tmp}
 		"""
@@ -177,9 +179,10 @@ rule derepl:
 		folder_prefix+"derepl_tmp/tmp{t}.fasta"
 	output:
 		folder_prefix+"uniq/tmp_uniq{t}.fasta"
+	conda:
+		"envs/obi_env.yaml"
 	shell:
 		"""
-		set +u; module load bioinfo/obitools-v1.2.11; set -u
 		obiuniq -m sample {input} > {output}
 		"""
 
@@ -188,10 +191,10 @@ rule derepl:
 def aggregate_derepl(wildcards):
         checkpoint_output=checkpoints.split_fasta.get(**wildcards).output[0]
         file_names=expand(folder_prefix2+"uniq/tmp_uniq{t}.fasta", t=glob_wildcards(os.path.join(checkpoint_output, "tmp{t}.fasta")).t)
-        print('in_def_aggregate_derepl')
-	print(checkpoint_output)
-	print(glob_wildcards(os.path.join(checkpoint_output, "tmp{t}.fasta")).t)
-        print(file_names)
+	# print('in_def_aggregate_derepl')
+	# print(checkpoint_output)
+	# print(glob_wildcards(os.path.join(checkpoint_output, "tmp{t}.fasta")).t)
+	# print(file_names)
         return file_names
 
 
@@ -219,14 +222,15 @@ rule basicfilt:
 		folder_prefix+files_prefix+"_R1R2_good_demultiplexed_derepl.fasta"
 	output:
 		folder_prefix+files_prefix+"_R1R2_good_demultiplexed_derepl_basicfilt.fasta"
-	log:
-		"../log/basicfilt_"+files_prefix+".log"
 	params:
 		minlength=config["basicfilt"]["minlength"],
 		mincount=config["basicfilt"]["mincount"]
+	log:
+		"../log/basicfilt_"+files_prefix+".log"
+	conda:
+		"envs/obi_env.yaml"
 	shell:
 		"""
-		set +u; module load bioinfo/obitools-v1.2.11; set -u
 		obiannotate --length -S 'GC_content:len(str(sequence).replace("a","").replace("t",""))*100/len(sequence)' {input} | obigrep -l {params.minlength} -s '^[acgt]+$' -p 'count>{params.mincount}' > {output} 2> {log}
 		"""
 
@@ -237,14 +241,15 @@ rule clustering:
 		folder_prefix+files_prefix+"_R1R2_good_demultiplexed_derepl_basicfilt.fasta"
 	output:
 		folder_prefix+files_prefix+"_R1R2_good_demultiplexed_derepl_basicfilt_cl.fasta"
-	log:
-		"../log/clustering_"+files_prefix+".log"
 	params:
 		minsim=config["clustering"]["minsim"]
+	log:
+		"../log/clustering_"+files_prefix+".log"
 	threads: 8
+	conda:
+		"envs/suma_env.yaml"
 	shell:
 		"""
-		set +u; module load bioinfo/sumaclust_v1.0.31; set -u
 		sumaclust -t {params.minsim} -p {threads} {input} > {output}
 		"""
 
@@ -257,9 +262,10 @@ rule merge_clust:
 		folder_prefix+files_prefix+"_R1R2_good_demultiplexed_derepl_basicfilt_cl_agg.fasta"
 	log:
 		"../log/merge_clust_"+files_prefix+".log"
+	conda:
+		"envs/obi_env.yaml"
 	shell:
 		"""
-		set +u; module load bioinfo/obitools-v1.2.11; set -u
 		obiselect -c cluster -n 1 --merge sample -M -f count {input} > {output} 2> {log}
 		"""
 
@@ -272,8 +278,9 @@ rule tab_format:
 		folder_prefix+files_prefix+"_R1R2_good_demultiplexed_derepl_basicfilt_cl_agg.tab"
 	log:
 		"../log/tab_format_"+files_prefix+".log"
+	conda:
+		"envs/obi_env.yaml"
 	shell:
 		"""
-		set +u; module load bioinfo/obitools-v1.2.11; set -u
 		obitab -n NA -d -o {input} > {output} 2> {log}
 		"""
diff --git a/workflow/cluster.yaml b/workflow/cluster.yaml
@@ -3,7 +3,7 @@ __default__:
   cpus: 1
 
 split_fastq:
-  mem: 20M
+  mem: 100M
 
 pairing:
   mem: 100M
diff --git a/workflow/envs/obi_env.yaml b/workflow/envs/obi_env.yaml
@@ -0,0 +1,7 @@
+name: obitools
+channels:
+  - conda-forge
+  - bioconda
+dependencies:
+  - obitools
+
diff --git a/workflow/envs/suma_env.yaml b/workflow/envs/suma_env.yaml
@@ -0,0 +1,5 @@
+channels:
+  - conda-forge
+  - bioconda
+dependencies:
+  - sumaclust
diff --git a/workflow/sub_smk.sh b/workflow/sub_smk.sh
@@ -6,7 +6,7 @@
 #SBATCH -e snakemake_error_%j.err
 #SBATCH --mail-type=BEGIN,END,FAIL
 
-module load bioinfo/snakemake-5.25.0
+source activate snakemake
 
 snakemake --cores 1 --unlock
-snakemake --jobs  10 --cluster-config cluster.yaml --cluster "sbatch --mem {cluster.mem} -c {cluster.cpus}"
+snakemake --jobs  10 --cluster-config cluster.yaml --cluster "sbatch --mem {cluster.mem} -c {cluster.cpus}" --use-conda
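
Reviewer note: with the debugging `print` calls removed from `get_input_all()`, the Snakefile no longer reports which final table(s) it will request. The following is a minimal standalone sketch of that path logic in plain Python, for checking a configuration by hand. The helper name `expected_outputs` and the example values are hypothetical. The sketch also assumes that `resultsfolder` ends with a slash, since the Snakefile concatenates it directly with the run name.

```python
# Suffix of the final table written by the tab_format rule in the Snakefile.
SUFFIX = "_R1R2_good_demultiplexed_derepl_basicfilt_cl_agg.tab"


def expected_outputs(config):
    """Return the final table path(s) that get_input_all() would request.

    Hypothetical helper: it mirrors the branch on config["tomerge"] without
    importing Snakemake, and always returns a list.
    """
    folder = config["resultsfolder"]  # assumed to end with "/"
    if config["tomerge"]:
        # Merged analysis: a single table named after the merged prefix.
        names = [config["mergedfile"]]
    else:
        # One table per library; a single string counts as one library.
        runs = config["fastqfiles"]
        names = [runs] if isinstance(runs, str) else list(runs)
    return [f"{folder}{name}/{name}{SUFFIX}" for name in names]


# Illustrative configuration, not copied from config/config.yaml.
example_config = {
    "tomerge": False,
    "resultsfolder": "../results/",
    "fastqfiles": ["lib1", "lib2"],
    "mergedfile": "merged",
}

for path in expected_outputs(example_config):
    print(path)
```

Setting `tomerge` to `True` in the example yields a single path built from `mergedfile`, which matches the first branch of `get_input_all()`. Unlike the Snakefile, which returns a plain string in that case, this sketch always returns a list.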
