This is a snakemake workflow based on the obitools suite of programs, that analyzes DNA metabarcoding data.

-Sequence analysis is performed with the obitools (Boyer et al. 2016) and sumaclust (Mercier et al. 2013) through a Snakemake pipeline (Molder et al. 2021).
+Sequence analysis is performed with the obitools (Boyer et al. 2016) and sumaclust (Mercier et al. 2013) through a Snakemake pipeline (Mölder et al. 2021).


## Getting started

-### Prerequisites
+### Installation

-This workflow is meant to be executed on a computing cluster running with **SLURM**. It has been written to run on the Genotoul computing cluster (http://bioinfo.genotoul.fr/).
+#### Dependencies

-### Installation
+In order to run the workflow, you must have installed the following programs:
@@ -44,15 +53,43 @@ And be put in a subfolder whose name is the prefix of the files (see _Example_).

## Usage

-Before running the workflow, the two configuration files have to be modified: `workflow/cluster.yaml` that sets up the resources available for each rule, and `config/config.yaml` where you can edit the values of the parameters used by the rules and the basename of your files.
+### Configuration
+
+Before running the workflow, the configuration file (`config/config.yaml`) has to be edited. The parameters that can be set are listed in the table below:
+| tomerge | whether to merge libraries before dereplication | merge_demultiplex | FALSE | should be set to 'TRUE' if you analyse several libraries and want to merge them |
+| resourcesfolder | relative path to the folder containing resource files (fastq files and ngsfilter) | split_fastq, demultiplex | ../resources | should not be changed, unless you want to rename the folder |
+| resultsfolder | relative path to the folder where output files will be written | all | ../results | should not be changed, unless you want to rename the folder |
+| fastqfiles | prefix of the name of the resource fastq files and ngsfilter | all | wolf_diet | must be changed to match the prefix of your file names |
+| mergedfile | prefix of the name of the output files if tomerge=TRUE | merge_demultiplex, split_fasta, derepl, merge_derepl, basicfilt, clustering, merge_clust, tab_format | wolf_diet | must be changed to the prefix you want for the merged files |
+| split_fastq:nfiles | number of files to create when splitting fastq files for pairing | split_fastq | 2 | should be changed according to the size of your dataset: the bigger it is, the more you will want to split your initial files - useful only on multi-threaded systems |
+| minscore | minimum alignment score required for pairing | alifilt | 40.00 | set according to Taberlet et al. 2018 |
+| split_fasta:nfiles | number of files to create when splitting demultiplexed fasta files for dereplication | split_fasta | 2 | should be changed according to the size of your dataset: the bigger it is, the more you will want to split your initial file(s) |
+| minlength | minimum sequence length (in bp) | basicfilt | 80 | must be changed according to the minimum length expected for your barcode |
+| mincount | minimum number of reads per unique sequence | basicfilt | 1 | it's up to you! |
+| minsim | similarity threshold for clustering | clustering | 0.97 | it's up to you! |
+

-Then, to run the workflow in a single command on the cluster:
+If you run the workflow on a SLURM cluster, you must also check the `workflow/cluster.yaml` file, which sets up the resources available for each rule.
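Before editing `config/config.yaml`, the numeric values from the parameter table can be sanity-checked with a few lines of shell. This is only a sketch: `check_params` is a hypothetical helper that is not part of the workflow, and it assumes that `minlength` and `mincount` are positive integers and that `minsim` is a fraction between 0 and 1, as the defaults (80, 1, 0.97) suggest.

```sh
# Hypothetical helper, not part of the workflow: sanity-checks three values
# from the parameter table before they are written to config/config.yaml.
# Usage: check_params MINLENGTH MINCOUNT MINSIM
check_params() {
    p_minlength=$1
    p_mincount=$2
    p_minsim=$3

    # minlength and mincount: assumed to be positive integers.
    for p_value in "$p_minlength" "$p_mincount"; do
        case $p_value in
            ''|*[!0-9]*)
                echo "not a positive integer: '$p_value'" >&2
                return 1 ;;
        esac
        [ "$p_value" -gt 0 ] || {
            echo "must be greater than 0: '$p_value'" >&2
            return 1
        }
    done

    # minsim: assumed to be a fraction in (0, 1], like the default 0.97.
    awk -v s="$p_minsim" \
        'BEGIN { exit !(s ~ /^[0-9]*\.?[0-9]+$/ && s > 0 && s <= 1) }' || {
        echo "minsim must be a number in (0, 1]: '$p_minsim'" >&2
        return 1
    }
}
```

For instance, `check_params 80 1 0.97` succeeds silently with the default values, while `check_params 80 1 1.5` prints a message on stderr and returns a non-zero status.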

+### Run the workflow
+
+Then, run the workflow:
+```sh
+cd workflow
+conda activate snakemake
+snakemake -c1 --use-conda
+```
+
+Alternatively, you can run the workflow in a single command on a SLURM cluster by submitting the `sub_smk.sh` file:
```sh
cd workflow
sbatch sub_smk.sh
```
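The choice between the two ways of running the workflow can be scripted. The sketch below is only an illustration: `pick_run_command` is a hypothetical helper, and it assumes that a SLURM cluster can be recognized by `sbatch` being available on the `PATH`. It prints the matching command from this README rather than executing it.

```sh
# Hypothetical helper, not part of the workflow: prints the run command
# from this README that fits the current machine.
# Assumption: a SLURM cluster is recognized by `sbatch` being on the PATH.
pick_run_command() {
    if command -v sbatch >/dev/null 2>&1; then
        # SLURM available: submit the whole workflow as a single job.
        echo "sbatch sub_smk.sh"
    else
        # No SLURM: run locally on one core, inside the conda environment.
        echo "snakemake -c1 --use-conda"
    fi
}
```

Run `pick_run_command` from the `workflow` folder to see which command applies. On a machine without SLURM, `conda activate snakemake` is still needed before the `snakemake` command.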

+
## Example

### Download toy data
@@ -107,10 +144,11 @@ The config.yaml file is already modified to fit this data.

### Run the workflow

-Now run the workflow on the cluster:
+Now run the workflow:
```sh
cd workflow/
-sbatch sub_smk.sh
+conda activate snakemake
+snakemake -c1 --use-conda
```

### Option: merging libraries
@@ -135,14 +173,14 @@ The source files of each library should be in separate subfolders. For example:

```
└─ resources
-└── myfirstlibprefix
-| ├── myfirstlibprefix_ngsfilter.tab
-| ├── myfirstlibprefix_R1.fastq
-| └── myfirstlibprefix_R2.fastq
-└── mysecondlibprefix
-├── mysecondlibprefix_ngsfilter.tab
-├── mysecondlibprefix_R1.fastq
-└── mysecondlibprefix_R2.fastq
+└── myfirstlibprefix
+| ├── myfirstlibprefix_ngsfilter.tab
+| ├── myfirstlibprefix_R1.fastq
+| └── myfirstlibprefix_R2.fastq
+└── mysecondlibprefix
+├── mysecondlibprefix_ngsfilter.tab
+├── mysecondlibprefix_R1.fastq
+└── mysecondlibprefix_R2.fastq
```

Two ngsfilter files will be necessary: `resources/myfirstlibprefix/myfirstlibprefix_ngsfilter.tab` and `resources/mysecondlibprefix/mysecondlibprefix_ngsfilter.tab`.
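The layout described above can be checked before launching the workflow. The following is a minimal sketch, assuming each library needs exactly the three files shown in the tree (`_ngsfilter.tab`, `_R1.fastq`, `_R2.fastq`). `missing_library_files` is a hypothetical helper, not part of the workflow.

```sh
# Hypothetical helper, not part of the workflow: prints the path of every
# expected file that is missing for one library.
# Usage: missing_library_files RESOURCES_FOLDER LIBRARY_PREFIX
# Assumed layout: <resources>/<prefix>/<prefix>_{ngsfilter.tab,R1.fastq,R2.fastq}
missing_library_files() {
    lib_dir=$1/$2
    for suffix in ngsfilter.tab R1.fastq R2.fastq; do
        lib_file=$lib_dir/${2}_${suffix}
        # Print the path only when the file does not exist.
        [ -f "$lib_file" ] || echo "$lib_file"
    done
}
```

From the `workflow` folder, `missing_library_files ../resources myfirstlibprefix` prints nothing when the library is complete, and one line per missing file otherwise.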
@@ -159,21 +197,21 @@ You may want to clean up potential molecular artifacts: have a look at the R pac

## Acknowledgements

-Thanks to **[Lucie Zinger](https://luciezinger.wordpress.com/)**, **[Frédéric Boyer](https://www.researchgate.net/profile/Frederic-Boyer-3)**, **[Céline Mercier](https://www.celine-mercier.info/)** and **Clément Lionnet** for their help with the obitools!
+Thanks to **[Lucie Zinger](https://luciezinger.wordpress.com/)**, **[Frédéric Boyer](https://www.researchgate.net/profile/Frederic-Boyer-3)**, **[Céline Mercier](https://www.celine-mercier.info/)** and **Clément Lionnet** for their help with the obitools! Also thanks to the **[ECOFEED](https://cordis.europa.eu/project/id/817779/fr)** project for funding the development of the first version of this workflow.
:triangular_flag_on_post: Don't forget to cite this repository if you use it for your research :slightly_smiling_face:


## References

-Boyer, F., Mercier, C., Bonin, A., Bras, Y. L., Taberlet, P., & Coissac, E. (2016). obitools: A unix-inspired software package for DNA metabarcoding. Molecular Ecology Resources, 16(1), 176‑182.
+Boyer, F., Mercier, C., Bonin, A., Bras, Y. L., Taberlet, P., & Coissac, E. (2016). obitools: A unix-inspired software package for DNA metabarcoding. Molecular Ecology Resources, 16(1), 176‑182.

-Mercier, C., Boyer, F., Bonin, A., & Coissac, E. (2013, November). SUMATRA and SUMACLUST: fast and exact comparison and clustering of sequences. In Programs and Abstracts of the SeqBio 2013 workshop. Abstract (pp. 27-29).
+Mercier, C., Boyer, F., Bonin, A., & Coissac, E. (2013). SUMATRA and SUMACLUST: fast and exact comparison and clustering of sequences. In Programs and Abstracts of the SeqBio 2013 workshop. Abstract (pp. 27-29).

Mölder, F., Jablonski, K. P., Letcher, B., Hall, M. B., Tomkins-Tinch, C. H., Sochat, V., ... & Köster, J. (2021). Sustainable data analysis with Snakemake. F1000Research, 10.