# GMSC-mapper

- Command line tool to query the Global Microbial smORFs Catalog (GMSC)
+ GMSC-mapper is a command line tool to query the Global Microbial smORFs Catalog (GMSC).
+
+ GMSC-mapper can be used to:
+ - Find query smORFs (< 100 aa) homologous to the Global Microbial smORFs Catalog (GMSC) by alignment.
+ - Accept 3 types of input:
+   - contigs (GMSC-mapper will predict smORFs from contigs first)
+   - amino acid sequences
+   - nucleotide gene sequences
+ - Annotate query/predicted smORFs with manually constructed, detailed quality, habitat and taxonomy information.

## Installation

@@ -27,64 +35,57 @@ During the process, we install also the following dependencies:
- [MMseqs2](https://github.com/soedinglab/MMseqs2)
- [Diamond](https://github.com/bbuchfink/diamond)

- And perform a series of tests using mock datasets to check if the installation works well:
+ ### Example test
+ The whole GMSC database is large and takes several minutes to process.

- 1. Input is genome contig sequences.
+ To check that the installation works, you can run a quick test with the mock datasets.

- ```bash
- gmsc-mapper -i ../examples/example.fa --db ../examples/targetdb.dmnd --habitat ../examples/ref_habitat.txt --quality ../examples/ref_quality.txt --taxonomy ../examples/ref_taxonomy.txt
- ```
+ - Create GMSC database index

- 2. Input is amino acid sequences.
+ The default alignment tool is Diamond.

```bash
- gmsc-mapper --aa-genes ../examples/example.faa --db ../examples/targetdb.dmnd --habitat ../examples/ref_habitat.txt --quality ../examples/ref_quality.txt --taxonomy ../examples/ref_taxonomy.txt
+ gmsc-mapper createdb -i examples/target.faa -o examples/ -m diamond
```

- 3. Input is nucleotide gene sequences.
+ If you want to use MMseqs2 as your alignment tool, you need to create the GMSC database index in MMseqs2 format.

```bash
- gmsc-mapper --nt-genes ../examples/example.fna --db ../examples/targetdb.dmnd --habitat ../examples/ref_habitat.txt --quality ../examples/ref_quality.txt --taxonomy ../examples/ref_taxonomy.txt
+ gmsc-mapper createdb -i examples/target.faa -o examples/ -m mmseqs
```

- 4. Check the Alignment tool: Diamond/MMseqs2 is optional
+ - Input is genome contig sequences.

```bash
- gmsc-mapper -i ../examples/example.fa --db ../examples/targetdb --habitat ../examples/ref_habitat.txt --quality ../examples/ref_quality.txt --taxonomy ../examples/ref_taxonomy.txt --tool mmseqs
-
- gmsc-mapper -i ../examples/example.fa --db ../examples/targetdb --habitat ../examples/ref_habitat.txt --quality ../examples/ref_quality.txt --taxonomy ../examples/ref_taxonomy.txt --tool diamond
+ gmsc-mapper -i examples/example.fa -o examples_output/ --db examples/targetdb.dmnd --habitat examples/ref_habitat.txt --quality examples/ref_quality.txt --taxonomy examples/ref_taxonomy.txt
```

- 5. Flags to disable results from Habitat/taxonomy/quality annotation
+ - Input is amino acid sequences.

```bash
- gmsc-mapper -i ../examples/example.fa --db ../examples/targetdb.dmnd --habitat ../examples/ref_habitat.txt --quality ../examples/ref_quality.txt --taxonomy ../examples/ref_taxonomy.txt --nohabitat --notaxonomy --noquality
+ gmsc-mapper --aa-genes examples/example.faa -o examples_output/ --db examples/targetdb.dmnd --habitat examples/ref_habitat.txt --quality examples/ref_quality.txt --taxonomy examples/ref_taxonomy.txt
```

- ## Usage
+ - Input is nucleotide gene sequences.

- ### Example Usage
- The GMSC database is large,and taks some time to process all the things. If you want to know if GMSC-Mapper has been installed successfully and work well, please try the example usage with example target database as below.
+ ```bash
+ gmsc-mapper --nt-genes examples/example.fna -o examples_output/ --db examples/targetdb.dmnd --habitat examples/ref_habitat.txt --quality examples/ref_quality.txt --taxonomy examples/ref_taxonomy.txt
+ ```

- #### Create GMSC database index
- `-o`: Path to database output directory.(default: `GMSC-mapper/examples`)
+ - Check another alignment tool: MMseqs2

- `-m`: Alignment tool(Diamond/MMseqs2).
- ```
- cd gmsc_mapper
- gmsc-mapper createdb -i ../examples/target.faa -o ../examples -m diamond
- ```
- or
- ```
- cd gmsc_mapper
- gmsc-mapper createdb -i ../examples/target.faa -o ../examples -m mmseqs
+ ```bash
+ gmsc-mapper -i examples/example.fa -o examples_output/ --db examples/targetdb --habitat examples/ref_habitat.txt --quality examples/ref_quality.txt --taxonomy examples/ref_taxonomy.txt --tool mmseqs
```
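
The example test commands above share the same reference files and differ only in the input flag and, optionally, `--tool`. The following is a minimal Python sketch of how those command lines could be assembled in a script. The helper `build_example_command` and the dictionaries are hypothetical names introduced here for illustration; only the flags and paths come from the commands shown above.

```python
import shlex
from typing import List, Optional

# Reference files used by the example test commands in this README.
EXAMPLE_REFS = {
    "--habitat": "examples/ref_habitat.txt",
    "--quality": "examples/ref_quality.txt",
    "--taxonomy": "examples/ref_taxonomy.txt",
}

# Input kind -> input flag, as used in the example commands.
INPUT_FLAGS = {"contigs": "-i", "aa": "--aa-genes", "nt": "--nt-genes"}


def build_example_command(kind: str, input_path: str, db: str,
                          output_dir: str = "examples_output/",
                          tool: Optional[str] = None) -> List[str]:
    """Return the argument list for one example run (hypothetical helper)."""
    if kind not in INPUT_FLAGS:
        raise ValueError(f"unknown input kind: {kind!r}")
    cmd = ["gmsc-mapper", INPUT_FLAGS[kind], input_path,
           "-o", output_dir, "--db", db]
    for flag, path in EXAMPLE_REFS.items():
        cmd += [flag, path]
    if tool is not None:
        # Only added when a non-default alignment tool is requested.
        cmd += ["--tool", tool]
    return cmd


if __name__ == "__main__":
    # Print the amino acid example as a shell-ready string.
    print(shlex.join(build_example_command(
        "aa", "examples/example.faa", "examples/targetdb.dmnd")))
```

The resulting list can be passed to `subprocess.run` once `gmsc-mapper` is installed; building it as a list avoids shell quoting problems with paths.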

- ### Real data Usage
- #### Create GMSC database index
+ ## Usage
+ Please use `GMSC-mapper/gmsc_mapper` as your working directory.
+
+ ### Create GMSC database index
`-o`: Path to database output directory. (default: `GMSC-mapper/db`)

- `-m`: Alignment tool(Diamond/MMseqs2).
+ `-m`: Alignment tool (Diamond / MMseqs2).
+
```
cd gmsc_mapper
gmsc-mapper createdb -i ../db/90AA_GMSC.faa.gz -m diamond
@@ -95,11 +96,8 @@ cd gmsc_mapper
gmsc-mapper createdb -i ../db/90AA_GMSC.faa.gz -m mmseqs
```

- #### Default
-
- Please make `GMSC-mapper/gmsc_mapper` as your work directory.
-
- GMSC database/habitat/taxonomy/quality file path and output directory path can be assigned on your own.Default is `GMSC-mapper/db` and `GMSC-mapper/output`.
+ ### Default
+ The GMSC database / habitat / taxonomy / quality file paths and the output directory path can be set on your own. The defaults are `GMSC-mapper/db` and `GMSC-mapper/output`, respectively.

1. Input is genome contig sequences.

@@ -119,74 +117,150 @@ gmsc-mapper --aa-genes ../examples/example.faa
gmsc-mapper --nt-genes ../examples/example.fna
```

- #### Alignment tool: Diamond/MMseqs2 is optional
- If you want to change alignment tool(Diamond/MMseqs2), you can use `--tool`.
+ ### Alignment tool: Diamond / MMseqs2 is optional
+ If you want to change the alignment tool (Diamond / MMseqs2), you can use `--tool`.
+
```bash
gmsc-mapper -i ../examples/example.fa --tool mmseqs
```

- #### Habitat/taxonomy/quality annotation is optional
- If you don't want to annotate habitat/taxonomy/quality you can use `--nohabitat`/`--notaxonomy`/`--noquality`.
+ ### Habitat / taxonomy / quality annotation is optional
+ If you don't want to annotate habitat / taxonomy / quality, you can use `--nohabitat` / `--notaxonomy` / `--noquality`.
+
```bash
gmsc-mapper -i ../examples/example.fa --nohabitat --notaxonomy --noquality
```

- ## Example Output
+ ## Output files
The output folder will contain
- - Outputs of smORF prediction.
- - Complete mapping result table, listing all the hits in GMSC, per smORF.
- - Habitat annotation of smORFs.(optional)
- - Taxonomy annotation of smORFs.(optional)
- - Quality annotation of smORFs.(optional)

- ## Parameters
- * `-i/--input`: Path to the input genome contig sequence FASTA file (possibly .gz/.bz2/.xz compressed).
+ - Outputs of smORFs prediction (predicted.filterd.smorf.faa)
+
+ A FASTA file with the sequences of the predicted smORFs. It is generated when the input is contigs.
+
+ - Complete alignment result table (diamond.out.smorfs.tsv / mmseqs.out.smorfs.tsv)
+
+ A file listing all the hits of the query sequences in GMSC, as reported by Diamond or MMseqs2.
+
+ The columns of the file are described by the following keywords:
+
+ `qseqid`: Query seq id
+
+ `sseqid`: Target seq id (in GMSC)
+
+ `full_qseq`: Query sequence

- * `--aa-genes`: Path to the input amino acid sequence FASTA file (possibly .gz/.bz2/.xz compressed).
+ `full_sseq`: Target sequence (in GMSC)

- * `--nt-genes`: Path to the input nucleotide gene sequence FASTA file (possibly .gz/.bz2/.xz compressed).
+ `qlen`: Query sequence length

- * `--nofilter`: Use this if no need to filter <100aa input sequences.
+ `slen`: Target sequence length

- * `-o/--output`: Output directory (will be created if non-existent).
+ `length`: Alignment length

- * `--tool`: Sequence alignment tool(Diamond/MMseqs).
+ `qstart`: Start of alignment in query

- * `--db`: Path to the GMSC database file.
+ `qend`: End of alignment in query

- * `--id`: Minimum identity to report an alignment(range 0.0-1.0).
+ `sstart`: Start of alignment in target

- * `--cov`: Minimum coverage to report an alignment(range 0.0-1.0).
+ `send`: End of alignment in target

- * `-e/--evalue`: Maximum e-value to report alignments(default=0.00001).
+ `bitscore`: Bit score

- * `--outfmt`: Output format of alignment result.
+ `pident`: Percentage of identical matches

- (Diamond default is "6,qseqid,sseqid,full_qseq,full_sseq,qlen,slen,pident,length,evalue,qcovhsp,scovhsp".
+ `evalue`: Expect value

- MMseqs default is "query,target,qseq,tseq,qlen,tlen,fident,alnlen,evalue,qcov,tcov".
+ `qcovhsp`: Query coverage

- The first two column in result format of Diamond/MMseqs must be "qseqid"/"query" and "sseqid"/"target".)
+ `scovhsp`: Target coverage

- * `--habitat`: Path to the habitat file.
+ - Total smORFs homologous to GMSC (mapped.smorfs.faa)

- * `--nohabitat`: Use this if no need to annotate habitat.
+ A FASTA file with the sequences of query/predicted smORFs homologous to GMSC.

- * `--taxonomy`: Path to the taxonomy file.
+ - Habitat annotation of smORFs (optional) (habitat.out.smorfs.tsv)

- * `--notaxonomy`: Use this if no need to annotate taxonomy.
+ A file listing the habitat annotation for each smORF homologous to GMSC.

- * `--quality`: Path to the quality file.
+ There are two columns in the file:

- * `--noquality`: Use this if no need to annotate quality.
+ `qseqid`: Query seq id

- * `-t/--threads`: Number of CPU threads(default=3).
+ `habitat`: Habitat, ',' separated if the sequence is from multiple habitats
+
+ - Taxonomy annotation of smORFs (optional) (taxonomy.out.smorfs.tsv)
+
+ A file listing the taxonomy annotation for each smORF homologous to GMSC.
+
+ There are two columns in the file:
+
+ `qseqid`: Query seq id
+
+ `taxonomy`: Taxonomy, with ranks separated by ';'
+
+ - Quality annotation of smORFs (optional) (quality.out.smorfs.tsv)
+
+ A file listing the quality annotation for each smORF homologous to GMSC.
+
+ `qseqid`: Query seq id
+
+ `quality`: Quality label
+
+ - Summary (summary.txt)
+
+ A file providing a human-readable summary of the results.
+
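
The output tables described above can be loaded with a few lines of Python. The sketch below is illustrative only: it assumes the `.tsv` files are tab-separated and that the alignment table has no header row and follows the column order listed above. Neither assumption is stated in this README, so check a real output file first. The function names are hypothetical.

```python
import csv
from typing import Dict, Iterable, List

# Column order of the alignment table, as listed in this README.
COLUMNS = ["qseqid", "sseqid", "full_qseq", "full_sseq", "qlen", "slen",
           "length", "qstart", "qend", "sstart", "send", "bitscore",
           "pident", "evalue", "qcovhsp", "scovhsp"]


def read_alignment(lines: Iterable[str]) -> List[Dict[str, str]]:
    """Parse alignment rows. ASSUMES tab-separated fields and no header row."""
    reader = csv.reader(lines, delimiter="\t")
    return [dict(zip(COLUMNS, row)) for row in reader if row]


def read_habitat(lines: Iterable[str]) -> Dict[str, List[str]]:
    """Map qseqid -> list of habitats. ASSUMES two tab-separated columns."""
    habitats: Dict[str, List[str]] = {}
    for row in csv.reader(lines, delimiter="\t"):
        # Skip malformed rows and a header row, if one is present.
        if len(row) == 2 and row[0] != "qseqid":
            habitats[row[0]] = row[1].split(",")
    return habitats


def split_taxonomy(value: str) -> List[str]:
    """Split a taxonomy string into its ranks, which are ';' separated."""
    return value.split(";")
```

Both readers accept any iterable of lines, so an open file handle such as `open("diamond.out.smorfs.tsv")` can be passed directly.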
+ ## Parameters
+ * `-i/--input`: Path to the input genome contig sequence FASTA file (possibly .gz compressed).
+
+ * `--aa-genes`: Path to the input amino acid sequence FASTA file (possibly .gz compressed).
+
+ * `--nt-genes`: Path to the input nucleotide gene sequence FASTA file (possibly .gz compressed).
+
+ * `-o/--output`: Output directory (will be created if non-existent). (default: ../output)
+
+ * `--tool`: Sequence alignment tool (Diamond / MMseqs2). (default: diamond)
+
+ * `-s/--sensitivity`: Sensitivity. (default: `--more-sensitive` for Diamond, 5.7 for MMseqs2)
+
+ * `--id`: Minimum identity to report an alignment (range 0.0-1.0). (default: 0.0)
+
+ * `--cov`: Minimum coverage to report an alignment (range 0.0-1.0). (default: 0.9)
+
+ * `-e/--evalue`: Maximum e-value to report alignments. (default: 1e-05)
+
+ * `-t/--threads`: Number of CPU threads. (default: 1)
+
+ * `--filter`: Use this to filter <100 aa or <303 nt input sequences. (default: False)
+
+ * `--nohabitat`: Use this if there is no need to annotate habitat. (default: False)
+
+ * `--notaxonomy`: Use this if there is no need to annotate taxonomy. (default: False)
+
+ * `--noquality`: Use this if there is no need to annotate quality. (default: False)
+
+ * `--quiet`: Disable alignment console output. (default: False)
+
+ * `--db`: Path to the GMSC database file. (default: ../db/targetdb.dmnd)
+
+ * `--habitat`: Path to the habitat file. (default: ../db/ref_habitat.tsv.xz)
+
+ * `--taxonomy`: Path to the taxonomy file. (default: ../db/ref_taxonomy.tsv.xz)
+
+ * `--quality`: Path to the quality file. (default: ../db/ref_quality.tsv.xz)
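
To illustrate the length cutoffs named for `--filter`, here is a small Python sketch of a FASTA length filter. It is not GMSC-mapper's implementation. It assumes that filtering keeps sequences strictly shorter than the cutoff, and the function names are hypothetical. The two cutoffs are consistent if the nucleotide length includes a stop codon (101 codons x 3 = 303), which is an inference and not something stated in this README.

```python
from typing import Iterable, Iterator, List, Tuple

AA_CUTOFF = 100  # from the --filter description: < 100 aa
NT_CUTOFF = 303  # from the --filter description: < 303 nt


def parse_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            # A new record starts: emit the previous one, if any.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif line:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def keep_short(records: Iterable[Tuple[str, str]],
               cutoff: int) -> List[Tuple[str, str]]:
    """ASSUMPTION: keep only sequences strictly shorter than the cutoff."""
    return [(h, s) for h, s in records if len(s) < cutoff]
```

For amino acid input one would call `keep_short(parse_fasta(handle), AA_CUTOFF)`, and use `NT_CUTOFF` for nucleotide gene input.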

### Subcommands and Parameters
Subcommands: `gmsc-mapper createdb`

- * `-i`: Path to the GMSC 90AA FASTA file.
+ * `-i`: Path to the GMSC FASTA file.
+
+ * `-o/--output`: Path to database output directory. (default: ../db)
+
+ * `-m/--mode`: Alignment tool (Diamond / MMseqs2).

- * `-o/--output`: Path to database output directory.
+ * `--quiet`: Disable alignment console output. (default: False)

- * `-m/--mode`: Alignment tool(Diamond/MMseqs2)
+ ## Sensitivity choices considering time and memory usage
+ To be done