Commit 2143340

Merge pull request #11 from cocodyq/main
- Complement document
- Delete decompression
- Change --nofilter to --filter
- Generate compressed tmp file
- Add fields of outfmt
- Add tests.sh
- Add --quiet for alignment tools
- Change habitat/taxonomy to normalized table (use numpy.load)
2 parents ea176a6 + 05c339f commit 2143340

63 files changed, with 1192 additions and 695 deletions.

.github/workflows/test_gmsc_mapper.yml

Lines changed: 1 addition & 1 deletion
@@ -8,7 +8,7 @@ jobs:
     strategy:
       matrix:
         os: [ubuntu-latest]
-        python-version: ["3.7", "3.8", "3.9", "3.10"]
+        python-version: ["3.8", "3.9"]


     steps:

README.md

Lines changed: 147 additions & 73 deletions
@@ -1,6 +1,14 @@
 # GMSC-mapper

-Command line tool to query the Global Microbial smORFs Catalog (GMSC)
+GMSC-mapper is a command line tool to query the Global Microbial smORFs Catalog (GMSC).
+
+GMSC-mapper can be used to
+- Find query smORFs (< 100 aa) homologous to the Global Microbial smORFs Catalog (GMSC) by alignment.
+- Support 3 types of input:
+  - contigs (GMSC-mapper will predict smORFs from contigs first)
+  - amino acid sequences
+  - nucleotide gene sequences
+- Annotate query/predicted smORFs with quality, habitat and taxonomy information constructed manually in detail.

 ## Installation

@@ -27,64 +35,57 @@ During the process, we install also the following dependencies:
 - [MMseqs2](https://github.com/soedinglab/MMseqs2)
 - [Diamond](https://github.com/bbuchfink/diamond)

-And perform a series of tests using mock datasets to check if the installation works well:
+### Example test
+The whole GMSC database is large and takes some minutes to process.

-1. Input is genome contig sequences.
+To check that the installation works, you can run a quick test with the mock datasets.

-```bash
-gmsc-mapper -i ../examples/example.fa --db ../examples/targetdb.dmnd --habitat ../examples/ref_habitat.txt --quality ../examples/ref_quality.txt --taxonomy ../examples/ref_taxonomy.txt
-```
+- Create GMSC database index

-2. Input is amino acid sequences.
+Default alignment tool is Diamond.

 ```bash
-gmsc-mapper --aa-genes ../examples/example.faa --db ../examples/targetdb.dmnd --habitat ../examples/ref_habitat.txt --quality ../examples/ref_quality.txt --taxonomy ../examples/ref_taxonomy.txt
+gmsc-mapper createdb -i examples/target.faa -o examples/ -m diamond
 ```

-3. Input is nucleotide gene sequences.
+If you want to use MMseqs2 as your alignment tool, you need to create the GMSC database index in MMseqs2 format.

 ```bash
-gmsc-mapper --nt-genes ../examples/example.fna --db ../examples/targetdb.dmnd --habitat ../examples/ref_habitat.txt --quality ../examples/ref_quality.txt --taxonomy ../examples/ref_taxonomy.txt
+gmsc-mapper createdb -i examples/target.faa -o examples/ -m mmseqs
 ```

-4. Check the Alignment tool: Diamond/MMseqs2 is optional
+- Input is genome contig sequences.

 ```bash
-gmsc-mapper -i ../examples/example.fa --db ../examples/targetdb --habitat ../examples/ref_habitat.txt --quality ../examples/ref_quality.txt --taxonomy ../examples/ref_taxonomy.txt --tool mmseqs
-
-gmsc-mapper -i ../examples/example.fa --db ../examples/targetdb --habitat ../examples/ref_habitat.txt --quality ../examples/ref_quality.txt --taxonomy ../examples/ref_taxonomy.txt --tool diamond
+gmsc-mapper -i examples/example.fa -o examples_output/ --db examples/targetdb.dmnd --habitat examples/ref_habitat.txt --quality examples/ref_quality.txt --taxonomy examples/ref_taxonomy.txt
 ```

-5. Flags to disable results from Habitat/taxonomy/quality annotation
+- Input is amino acid sequences.

 ```bash
-gmsc-mapper -i ../examples/example.fa --db ../examples/targetdb.dmnd --habitat ../examples/ref_habitat.txt --quality ../examples/ref_quality.txt --taxonomy ../examples/ref_taxonomy.txt --nohabitat --notaxonomy --noquality
+gmsc-mapper --aa-genes examples/example.faa -o examples_output/ --db examples/targetdb.dmnd --habitat examples/ref_habitat.txt --quality examples/ref_quality.txt --taxonomy examples/ref_taxonomy.txt
 ```

-## Usage
+- Input is nucleotide gene sequences.

-### Example Usage
-The GMSC database is large,and taks some time to process all the things. If you want to know if GMSC-Mapper has been installed successfully and work well, please try the example usage with example target database as below.
+```bash
+gmsc-mapper --nt-genes examples/example.fna -o examples_output/ --db examples/targetdb.dmnd --habitat examples/ref_habitat.txt --quality examples/ref_quality.txt --taxonomy examples/ref_taxonomy.txt
+```

-#### Create GMSC database index
-`-o`: Path to database output directory.(default: `GMSC-mapper/examples`)
+- Check another alignment tool: MMseqs2

-`-m`: Alignment tool(Diamond/MMseqs2).
-```
-cd gmsc_mapper
-gmsc-mapper createdb -i ../examples/target.faa -o ../examples -m diamond
-```
-or
-```
-cd gmsc_mapper
-gmsc-mapper createdb -i ../examples/target.faa -o ../examples -m mmseqs
+```bash
+gmsc-mapper -i examples/example.fa -o examples_output/ --db examples/targetdb --habitat examples/ref_habitat.txt --quality examples/ref_quality.txt --taxonomy examples/ref_taxonomy.txt --tool mmseqs
 ```

-### Real data Usage
-#### Create GMSC database index
+## Usage
+Please use `GMSC-mapper/gmsc_mapper` as your working directory.
+
+### Create GMSC database index
 `-o`: Path to database output directory. (default: `GMSC-mapper/db`)

-`-m`: Alignment tool(Diamond/MMseqs2).
+`-m`: Alignment tool (Diamond / MMseqs2).
+
 ```
 cd gmsc_mapper
 gmsc-mapper createdb -i ../db/90AA_GMSC.faa.gz -m diamond
@@ -95,11 +96,8 @@ cd gmsc_mapper
 gmsc-mapper createdb -i ../db/90AA_GMSC.faa.gz -m mmseqs
 ```

-#### Default
-
-Please make `GMSC-mapper/gmsc_mapper` as your work directory.
-
-GMSC database/habitat/taxonomy/quality file path and output directory path can be assigned on your own.Default is `GMSC-mapper/db` and `GMSC-mapper/output`.
+### Default
+The GMSC database, habitat, taxonomy and quality file paths and the output directory path can be set by the user. The defaults are `GMSC-mapper/db` and `GMSC-mapper/output`.

 1. Input is genome contig sequences.

@@ -119,74 +117,150 @@ gmsc-mapper --aa-genes ../examples/example.faa
 gmsc-mapper --nt-genes ../examples/example.fna
 ```

-#### Alignment tool: Diamond/MMseqs2 is optional
-If you want to change alignment tool(Diamond/MMseqs2), you can use `--tool`.
+### Alignment tool: Diamond / MMseqs2 is optional
+If you want to change alignment tool (Diamond / MMseqs2), you can use `--tool`.
+
 ```bash
 gmsc-mapper -i ../examples/example.fa --tool mmseqs
 ```

-#### Habitat/taxonomy/quality annotation is optional
-If you don't want to annotate habitat/taxonomy/quality you can use `--nohabitat`/`--notaxonomy`/`--noquality`.
+### Habitat / taxonomy / quality annotation is optional
+If you don't want to annotate habitat / taxonomy / quality you can use `--nohabitat`/`--notaxonomy`/`--noquality`.
+
 ```bash
 gmsc-mapper -i ../examples/example.fa --nohabitat --notaxonomy --noquality
 ```

-## Example Output
+## Output files
 The output folder will contain
-- Outputs of smORF prediction.
-- Complete mapping result table, listing all the hits in GMSC, per smORF.
-- Habitat annotation of smORFs.(optional)
-- Taxonomy annotation of smORFs.(optional)
-- Quality annotation of smORFs.(optional)

-## Parameters
-* `-i/--input`: Path to the input genome contig sequence FASTA file (possibly .gz/.bz2/.xz compressed).
+- Outputs of smORFs prediction (predicted.filterd.smorf.faa)
+
+A FASTA file with the sequences of the predicted smORFs. It is generated when the input file is contigs.
+
+- Complete alignment result table (diamond.out.smorfs.tsv / mmseqs.out.smorfs.tsv)
+
+A file listing all the query hits of GMSC, from Diamond or MMseqs2.
+
+The file contains the following columns:
+
+`qseqid`: Query seq id
+
+`sseqid`: Target seq id (in GMSC)
+
+`full_qseq`: Query sequences

-* `--aa-genes`: Path to the input amino acid sequence FASTA file (possibly .gz/.bz2/.xz compressed).
+`full_sseq`: Target sequences (in GMSC)

-* `--nt-genes`: Path to the input nucleotide gene sequence FASTA file (possibly .gz/.bz2/.xz compressed).
+`qlen`: Query sequences length

-* `--nofilter`: Use this if no need to filter <100aa input sequences.
+`slen`: Target sequences length

-* `-o/--output`: Output directory (will be created if non-existent).
+`length`: Alignment length

-* `--tool`: Sequence alignment tool(Diamond/MMseqs).
+`qstart`: Start of alignment in query

-* `--db`: Path to the GMSC database file.
+`qend`: End of alignment in query

-* `--id`: Minimum identity to report an alignment(range 0.0-1.0).
+`sstart`: Start of alignment in target

-* `--cov`: Minimum coverage to report an alignment(range 0.0-1.0).
+`send`: End of alignment in target

-* `-e/--evalue`: Maximum e-value to report alignments(default=0.00001).
+`bitscore`: Bit score

-* `--outfmt`: Output format of alignment result.
+`pident`: Percentage of identical matches

-(Diamond default is "6,qseqid,sseqid,full_qseq,full_sseq,qlen,slen,pident,length,evalue,qcovhsp,scovhsp".
+`evalue`: Expect value

-MMseqs default is "query,target,qseq,tseq,qlen,tlen,fident,alnlen,evalue,qcov,tcov".
+`qcovhsp`: Query Coverage

-The first two column in result format of Diamond/MMseqs must be "qseqid"/"query" and "sseqid"/"target".)
+`scovhsp`: Target Coverage

-* `--habitat`: Path to the habitat file.
+- Total smORFs homologous to GMSC (mapped.smorfs.faa)

-* `--nohabitat`: Use this if no need to annotate habitat.
+A FASTA file with the sequences of query/predicted smORFs homologous to GMSC.

-* `--taxonomy`: Path to the taxonomy file.
+- Habitat annotation of smORFs (optional) (habitat.out.smorfs.tsv)

-* `--notaxonomy`: Use this if no need to annotate taxonomy.
+A file listing the habitat annotation for each smORF homologous to GMSC.

-* `--quality`: Path to the quality file.
+There are two columns in the file:

-* `--noquality`: Use this if no need to annotate quality.
+`qseqid`: Query seq id

-* `-t/--threads`: Number of CPU threads(default=3).
+`habitat`: Habitat, ',' separated if the sequence is from multiple habitats
+
+- Taxonomy annotation of smORFs (optional) (taxonomy.out.smorfs.tsv)
+
+A file listing the taxonomy annotation for each smORF homologous to GMSC.
+
+There are two columns in the file:
+
+`qseqid`: Query seq id
+
+`taxonomy`: Taxonomy, with ranks separated by ';'
+
+- Quality annotation of smORFs (optional) (quality.out.smorfs.tsv)
+
+A file listing the quality annotation for each smORF homologous to GMSC.
+
+`qseqid`: Query seq id
+
+`quality`: Quality label
+
+- Summary (summary.txt)
+
+A file providing a human-readable summary of the results.
+
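
The output tables described in the "Output files" section can be loaded with a few lines of Python. The following is a minimal sketch, not part of GMSC-mapper itself: the function names are hypothetical, the tab delimiter is an assumption based on the `.tsv` extension, and the files are assumed to have no header row. The column order follows the list given for the alignment result table.

```python
import csv
import io

# Column order as listed for the alignment result table. The tab delimiter
# (suggested by the .tsv extension) and the absence of a header row are
# assumptions of this sketch.
ALIGNMENT_COLUMNS = [
    "qseqid", "sseqid", "full_qseq", "full_sseq", "qlen", "slen",
    "length", "qstart", "qend", "sstart", "send", "bitscore",
    "pident", "evalue", "qcovhsp", "scovhsp",
]


def read_alignment_table(handle, delimiter="\t"):
    """Yield one dict per alignment hit, keyed by ALIGNMENT_COLUMNS."""
    for row in csv.reader(handle, delimiter=delimiter, quoting=csv.QUOTE_NONE):
        if row:  # skip empty lines
            yield dict(zip(ALIGNMENT_COLUMNS, row))


def read_annotation_table(handle, separator, delimiter="\t"):
    """Map each query id to the list of values found in its second column.

    Use separator="," for the habitat table and ";" for the taxonomy table.
    """
    annotations = {}
    for row in csv.reader(handle, delimiter=delimiter, quoting=csv.QUOTE_NONE):
        if len(row) < 2:
            continue  # skip empty or malformed lines
        parts = [part.strip() for part in row[1].split(separator)]
        annotations[row[0]] = [part for part in parts if part]
    return annotations


# Made-up rows, only to show the shapes involved.
example_hits = io.StringIO(
    "query_1\tGMSC_target_1\tMKV\tMKV\t3\t3\t3\t1\t3\t1\t3\t12.3\t100\t1e-05\t100\t100\n"
)
for hit in read_alignment_table(example_hits):
    print(hit["qseqid"], hit["sseqid"], hit["pident"])

example_habitat = io.StringIO("query_1\tsoil,marine\n")
print(read_annotation_table(example_habitat, separator=","))
```

All values are kept as strings; convert numeric columns such as `pident` or `evalue` with `float()` where needed.
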
+## Parameters
+* `-i/--input`: Path to the input genome contig sequence FASTA file (possibly .gz compressed).
+
+* `--aa-genes`: Path to the input amino acid sequence FASTA file (possibly .gz compressed).
+
+* `--nt-genes`: Path to the input nucleotide gene sequence FASTA file (possibly .gz compressed).
+
+* `-o/--output`: Output directory (will be created if non-existent). (default: ../output)
+
+* `--tool`: Sequence alignment tool (Diamond / MMseqs). (default: diamond)
+
+* `-s/--sensitivity`: Sensitivity. (default: --more-sensitive for Diamond, 5.7 for MMseqs)
+
+* `--id`: Minimum identity to report an alignment (range 0.0-1.0). (default: 0.0)
+
+* `--cov`: Minimum coverage to report an alignment (range 0.0-1.0). (default: 0.9)
+
+* `-e/--evalue`: Maximum e-value to report alignments. (default: 1e-05)
+
+* `-t/--threads`: Number of CPU threads. (default: 1)
+
+* `--filter`: Use this to filter <100 aa or <303 nt input sequences. (default: False)
+
+* `--nohabitat`: Use this if no need to annotate habitat. (default: False)
+
+* `--notaxonomy`: Use this if no need to annotate taxonomy. (default: False)
+
+* `--noquality`: Use this if no need to annotate quality. (default: False)
+
+* `--quiet`: Disable alignment console output. (default: False)
+
+* `--db`: Path to the GMSC database file. (default: ../db/targetdb.dmnd)
+
+* `--habitat`: Path to the habitat file. (default: ../db/ref_habitat.tsv.xz)
+
+* `--taxonomy`: Path to the taxonomy file. (default: ../db/ref_taxonomy.tsv.xz)
+
+* `--quality`: Path to the quality file. (default: ../db/ref_quality.tsv.xz)
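
The two length thresholds named for `--filter` in the parameter list are consistent with each other under one plausible reading: a gene shorter than 303 nt has at most 300 nt, that is 100 codons, which is 99 residues plus a stop codon and so fewer than 100 aa. The sketch below only classifies a sequence against those thresholds. The function name is hypothetical, and whether GMSC-mapper keeps or discards such sequences, or counts the stop codon this way, is not stated in this diff.

```python
# Thresholds quoted from the `--filter` description in the parameter list.
MAX_AA = 100  # amino acid sequences must be shorter than this
MAX_NT = 303  # nucleotide sequences must be shorter than this


def is_below_filter_threshold(sequence, is_nucleotide):
    """Return True if the sequence is shorter than the stated threshold.

    Hypothetical helper: it does not reproduce GMSC-mapper's own filtering
    code, it only applies the two length limits named in the README.
    """
    limit = MAX_NT if is_nucleotide else MAX_AA
    return len(sequence) < limit


# 300 nt = 100 codons; with a stop codon that encodes 99 residues.
print(is_below_filter_threshold("ATG" * 100, is_nucleotide=True))
print(is_below_filter_threshold("M" * 99, is_nucleotide=False))
```
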

 ### Subcommands and Parameters
 Subcommands: `gmsc-mapper createdb`

-* `-i`: Path to the GMSC 90AA FASTA file.
+* `-i`: Path to the GMSC FASTA file.
+
+* `-o/--output`: Path to database output directory. (default: ../db)
+
+* `-m/--mode`: Alignment tool (Diamond / MMseqs2).

-* `-o/--output`: Path to database output directory.
+* `--quiet`: Disable alignment console output. (default: False)

-* `-m/--mode`: Alignment tool(Diamond/MMseqs2)
+## Sensitivity choices considering time and memory usage
+To be done
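
For scripted runs, the commands shown in this README can be assembled as argument lists before being passed to a process launcher. The following is a minimal sketch under stated assumptions: the two helper functions are hypothetical and not part of GMSC-mapper, and they use only flags that appear in the README text (`-i`, `--aa-genes`, `--nt-genes`, `-o`, `--tool`, `-t`, and `createdb` with `-i`, `-o`, `-m`).

```python
import shlex

# Input flags as documented in the README parameter list.
INPUT_FLAGS = {"contigs": "-i", "aa": "--aa-genes", "nt": "--nt-genes"}
TOOLS = ("diamond", "mmseqs")


def build_mapper_command(input_path, input_kind="contigs", output_dir=None,
                         tool=None, threads=None, extra_flags=()):
    """Return the gmsc-mapper argument list for one query run."""
    if tool is not None and tool not in TOOLS:
        raise ValueError(f"unknown alignment tool: {tool!r}")
    command = ["gmsc-mapper", INPUT_FLAGS[input_kind], input_path]
    if output_dir is not None:
        command += ["-o", output_dir]
    if tool is not None:
        command += ["--tool", tool]
    if threads is not None:
        command += ["-t", str(threads)]
    command += list(extra_flags)  # e.g. ["--nohabitat", "--quiet"]
    return command


def build_createdb_command(fasta_path, mode, output_dir=None):
    """Return the argument list for `gmsc-mapper createdb`."""
    if mode not in TOOLS:
        raise ValueError(f"unknown alignment tool: {mode!r}")
    command = ["gmsc-mapper", "createdb", "-i", fasta_path]
    if output_dir is not None:
        command += ["-o", output_dir]
    command += ["-m", mode]
    return command


# Print the commands instead of running them, so the sketch has no side effects.
print(shlex.join(build_createdb_command("examples/target.faa", "diamond", "examples/")))
print(shlex.join(build_mapper_command("examples/example.fa", output_dir="examples_output/",
                                      tool="mmseqs", extra_flags=["--quiet"])))
```

The returned lists can be handed to `subprocess.run(command, check=True)` once `gmsc-mapper` is installed; keeping them as lists avoids shell quoting problems with paths that contain spaces.
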
