Skip to content

Commit 91d8fdf

Browse files
committed
ENH add cdd annotation
1 parent f34bc32 commit 91d8fdf

File tree

8 files changed

+183
-32
lines changed

8 files changed

+183
-32
lines changed

README.md

Lines changed: 30 additions & 18 deletions
Original file line numberDiff line numberDiff line change
@@ -49,42 +49,44 @@ Because the whole GMSC database is large, and takes some minutes to process.
4949

5050
If you want to check if the installation works well, you can test with mock datasets easily and fast.
5151

52+
Please make `GMSC-mapper` as your work directory.
53+
5254
- Create GMSC database index
5355

5456
Default alignment tool is Diamond.
5557

5658
```bash
57-
gmsc-mapper createdb -i examples/target.faa -o examples/ -m diamond
59+
gmsc-mapper createdb -i ./examples/target.faa -o ./examples/ -m diamond
5860
```
5961

6062
If you want to use MMseqs2 as your alignment tool, you need to create GMSC database index in MMseqs2 format.
6163

6264
```bash
63-
gmsc-mapper createdb -i examples/target.faa -o examples/ -m mmseqs
65+
gmsc-mapper createdb -i ./examples/target.faa -o ./examples/ -m mmseqs
6466
```
6567

6668
- Input is genome contig sequences.
6769

6870
```bash
69-
gmsc-mapper -i ./examples/example.fa -o ./examples_output/ --db ./examples/targetdb.dmnd --habitat ./examples/ref_habitat.npy --habitat-index ./examples/ref_habitat_index.tsv --quality ./examples/ref_quality.txt --taxonomy ./examples/ref_taxonomy.npy --taxonomy-index ./examples/ref_taxonomy_index.tsv
71+
gmsc-mapper -i ./examples/example.fa -o ./examples_output/ --db ./examples/targetdb.dmnd --habitat ./examples/ref_habitat.npy --habitat-index ./examples/ref_habitat_index.tsv --quality ./examples/ref_quality.txt --taxonomy ./examples/ref_taxonomy.npy --taxonomy-index ./examples/ref_taxonomy_index.tsv --domain ./examples/ref_domain.txt
7072
```
7173

7274
- Input is amino acid sequences.
7375

7476
```bash
75-
gmsc-mapper --aa-genes ./examples/example.faa -o ./examples_output/ --db ./examples/targetdb.dmnd --habitat ./examples/ref_habitat.npy --habitat-index ./examples/ref_habitat_index.tsv --quality ./examples/ref_quality.txt --taxonomy ./examples/ref_taxonomy.npy --taxonomy-index ./examples/ref_taxonomy_index.tsv
77+
gmsc-mapper --aa-genes ./examples/example.faa -o ./examples_output/ --db ./examples/targetdb.dmnd --habitat ./examples/ref_habitat.npy --habitat-index ./examples/ref_habitat_index.tsv --quality ./examples/ref_quality.txt --taxonomy ./examples/ref_taxonomy.npy --taxonomy-index ./examples/ref_taxonomy_index.tsv --domain ./examples/ref_domain.txt
7678
```
7779

7880
- Input is nucleotide gene sequences.
7981

8082
```bash
81-
gmsc-mapper --nt-genes ./examples/example.fna -o ./examples_output/ --db ./examples/targetdb.dmnd --habitat ./examples/ref_habitat.npy --habitat-index ./examples/ref_habitat_index.tsv --quality ./examples/ref_quality.txt --taxonomy ./examples/ref_taxonomy.npy --taxonomy-index ./examples/ref_taxonomy_index.tsv
83+
gmsc-mapper --nt-genes ./examples/example.fna -o ./examples_output/ --db ./examples/targetdb.dmnd --habitat ./examples/ref_habitat.npy --habitat-index ./examples/ref_habitat_index.tsv --quality ./examples/ref_quality.txt --taxonomy ./examples/ref_taxonomy.npy --taxonomy-index ./examples/ref_taxonomy_index.tsv --domain ./examples/ref_domain.txt
8284
```
8385

8486
- Check another alignment tool: MMseqs2
8587

8688
```bash
87-
gmsc-mapper -i ./examples/example.fa -o ./examples_output/ --db ./examples/targetdb --habitat ./examples/ref_habitat.npy --habitat-index ./examples/ref_habitat_index.tsv --quality ./examples/ref_quality.txt --taxonomy ./examples/ref_taxonomy.npy --taxonomy-index ./examples/ref_taxonomy_index.tsv --tool mmseqs
89+
gmsc-mapper -i ./examples/example.fa -o ./examples_output/ --db ./examples/targetdb --habitat ./examples/ref_habitat.npy --habitat-index ./examples/ref_habitat_index.tsv --quality ./examples/ref_quality.txt --taxonomy ./examples/ref_taxonomy.npy --taxonomy-index ./examples/ref_taxonomy_index.tsv --domain ./examples/ref_domain.txt --tool mmseqs
8890
```
8991

9092
## Usage
@@ -106,7 +108,7 @@ gmsc-mapper createdb -i ../db/90AA_GMSC.faa.gz -m mmseqs
106108
```
107109

108110
### Default
109-
GMSC database / habitat / taxonomy / quality file path and output directory path can be assigned on your own.Default is `GMSC-mapper/db` and `GMSC-mapper/output`.
111+
GMSC database / habitat / taxonomy / quality / domain file path and output directory path can be assigned on your own.Default is `GMSC-mapper/db` and `GMSC-mapper/output`.
110112

111113
1. Input is genome contig sequences.
112114

@@ -133,11 +135,11 @@ If you want to change alignment tool (Diamond / MMseqs2), you can use `--tool`.
133135
gmsc-mapper -i ../examples/example.fa --tool mmseqs
134136
```
135137

136-
### Habitat / taxonomy / quality annotation is optional
137-
If you don't want to annotate habitat / taxonomy / quality you can use `--nohabitat`/`--notaxonomy`/`--noquality`.
138+
### Habitat / taxonomy / quality / domain annotation is optional
139+
If you don't want to annotate habitat / taxonomy / quality / domain you can use `--no-habitat`/`--no-taxonomy`/`--no-quality`/`--no-domain`.
138140

139141
```bash
140-
gmsc-mapper -i ../examples/example.fa --nohabitat --notaxonomy --noquality
142+
gmsc-mapper -i ../examples/example.fa --no-habitat --no-taxonomy --no-quality --no-domain
141143
```
142144

143145
## Output files
@@ -191,17 +193,17 @@ The output folder will contain
191193

192194
- Habitat annotation of smORFs (optional) (habitat.out.smorfs.tsv)
193195

194-
A file listing the habitat annotation for each smORF homologous to GMSC.
196+
This file lists the habitat annotations of the query/predicted sequence, where the habitat is obtained from the sequence annotations of its homologous origin in GMSC.
195197

196198
There are two columns in the file:
197199

198200
`qseqid`: Query seq id
199201

200-
`habitat`: Habitat, ',' separated if the sequences is from multiple habitats
202+
`habitat`: Habitat, ',' separated if the sequence is from multiple habitats
201203

202204
- Taxonomy annotation of smORFs (optional) (taxonomy.out.smorfs.tsv)
203205

204-
A file listing the taxonomy annotation for each smORF homologous to GMSC.
206+
This file lists the taxonomy annotations of the query/predicted sequence, where the taxonomy is obtained from the sequence annotations of its homologous origin in GMSC.
205207

206208
There are two columns in the file:
207209

@@ -211,13 +213,21 @@ The output folder will contain
211213

212214
- Quality annotation of smORFs (optional) (quality.out.smorfs.tsv)
213215

214-
A file listing the quality annotation for each smORF homologous to GMSC.
216+
This file lists the quality annotations of the query/predicted sequence, where the quality is obtained from the sequence annotations of its homologous origin in GMSC.
215217

216218
`qseqid`: Query seq id
217219

218220
`quality`: Quality label
219221

220-
- Summry (summary.txt)
222+
- Conserved domain annotation of smORFs (optional) (domain.out.smorfs.tsv)
223+
224+
This file lists the conservative domain annotations of the query/predicted sequence, where the conservative domain is obtained from the sequence annotations of its homologous origin in GMSC.
225+
226+
`qseqid`: Query seq id
227+
228+
`cdd`: Identifiers from Conserved domain database, ',' separated if the sequence is annotated with multiple conserved domains.
229+
230+
- Summary (summary.txt)
221231

222232
A file providing a human-readable summary of the results.
223233

@@ -244,11 +254,13 @@ The output folder will contain
244254

245255
* `--filter`: Use this to filter <100 aa or <303 nt input sequences. (default: False)
246256

247-
* `--nohabitat`: Use this if no need to annotate habitat. (default: False)
257+
* `--no-habitat`: Use this if no need to annotate habitat. (default: False)
258+
259+
* `--no-taxonomy`: Use this if no need to annotate taxonomy. (default: False)
248260

249-
* `--notaxonomy`: Use this if no need to annotate taxonomy. (default: False)
261+
* `--no-quality`: Use this if no need to annotate quality. (default: False)
250262

251-
* `--noquality`: Use this if no need to annotate quality. (default: False)
263+
* `--no-domain`: Use this if no need to annotate conserved domain. (default: False)
252264

253265
* `--quiet`: Disable alignment console output. (default:False)
254266

Lines changed: 29 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,29 @@
1+
qseqid cdd
2+
smORF_00 198061,429887
3+
smORF_01
4+
smORF_02 197696
5+
smORF_05
6+
smORF_06 381607
7+
smORF_07
8+
smORF_08
9+
smORF_09 425695
10+
smORF_10
11+
smORF_11
12+
smORF_12
13+
smORF_13
14+
smORF_14 433792
15+
smORF_15
16+
smORF_16
17+
smORF_17
18+
smORF_18
19+
smORF_19 426994
20+
smORF_20 429147
21+
smORF_21
22+
smORF_22
23+
smORF_24
24+
smORF_25
25+
smORF_26 425742
26+
smORF_27 429147
27+
smORF_28
28+
smORF_29 197911,429147
29+
smORF_30

examples/output/summary.txt

Lines changed: 15 additions & 11 deletions
Original file line numberDiff line numberDiff line change
@@ -3,18 +3,22 @@
33
28 smORFs are aligned against GMSC in total.
44

55
# Quality
6-
8(28.57%) aligned smORFs are high quality.
6+
8 (28.57%) aligned smORFs are high quality.
77

88
# Habitat
9-
25(89.29%) aligned smORFs are single-habitat.
10-
3(10.71%) aligned smORFs are multi-habitat.
9+
25 (89.29%) aligned smORFs are single-habitat.
10+
3 (10.71%) aligned smORFs are multi-habitat.
1111

1212
# Taxonomy
13-
8(28.57%) aligned smORFs have taxonomy annotation.
14-
3(10.71%) aligned smORFs are annotated on kingdom.
15-
0(0%) aligned smORFs are annotated on phylum.
16-
1(3.57%) aligned smORFs are annotated on class.
17-
1(3.57%) aligned smORFs are annotated on order.
18-
0(0%) aligned smORFs are annotated on family.
19-
0(0%) aligned smORFs are annotated on genus.
20-
3(10.71%) aligned smORFs are annotated on species.
13+
8 (28.57%) aligned smORFs have taxonomy annotation.
14+
3 (10.71%) aligned smORFs are annotated at level of kingdom.
15+
0 (0.00%) aligned smORFs are annotated at level of phylum.
16+
1 (3.57%) aligned smORFs are annotated at level of class.
17+
1 (3.57%) aligned smORFs are annotated at level of order.
18+
0 (0.00%) aligned smORFs are annotated at level of family.
19+
0 (0.00%) aligned smORFs are annotated at level of genus.
20+
3 (10.71%) aligned smORFs are annotated at level of species.
21+
22+
# Conserved domain
23+
10 of aligned smORFs are annotated with CDD database.
24+

examples/ref_domain.txt

Lines changed: 10 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,10 @@
1+
GMSC10.90AA.000_000_000_080 198061,429887
2+
GMSC10.90AA.000_000_000_082 197696
3+
GMSC10.90AA.000_000_000_088 381607
4+
GMSC10.90AA.000_000_000_016 425695
5+
GMSC10.90AA.000_000_000_032 433792
6+
GMSC10.90AA.000_000_000_044 426994
7+
GMSC10.90AA.000_000_000_065 425742
8+
GMSC10.90AA.000_000_000_078 429147,197911
9+
GMSC10.90AA.000_000_000_068 429147
10+
GMSC10.90AA.000_000_000_047 429147

gmsc_mapper/main.py

Lines changed: 26 additions & 3 deletions
Original file line numberDiff line numberDiff line change
@@ -103,12 +103,14 @@ def parse_args(args):
103103

104104
parser.add_argument('--filter','--filter',action='store_true', help='Use this to filter <100 aa or <303 nt input sequences.')
105105

106-
parser.add_argument('--nohabitat','--nohabitat',action='store_true', help='Use this if no need to annotate habitat')
106+
parser.add_argument('--no-habitat','--no-habitat',action='store_true', dest='nohabitat', help='Use this if no need to annotate habitat')
107107

108-
parser.add_argument('--notaxonomy', '--notaxonomy',action='store_true', help='Use this if no need to annotate taxonomy')
108+
parser.add_argument('--no-taxonomy', '--no-taxonomy',action='store_true', dest='notaxonomy', help='Use this if no need to annotate taxonomy')
109109

110-
parser.add_argument('--noquality', '--noquality',action='store_true', help='Use this if no need to annotate quality')
110+
parser.add_argument('--no-quality', '--no-quality',action='store_true', dest='noquality', help='Use this if no need to annotate quality')
111111

112+
parser.add_argument('--no-domain', '--no-domain',action='store_true', dest='nodomain', help='Use this if no need to annotate quality')
113+
112114
parser.add_argument('--quiet','--quiet',action='store_true', help='Disable alignment console output')
113115

114116
parser.add_argument('--db', '--db',
@@ -145,6 +147,12 @@ def parse_args(args):
145147
help='Path to the quality file',
146148
dest='quality',
147149
default=path.join(_ROOT, 'db/ref_quality.tsv.xz'))
150+
151+
parser.add_argument('--domain', '--domain',
152+
required=False,
153+
help='Path to the conserved domain file',
154+
dest='domain',
155+
default=path.join(_ROOT, 'db/ref_domain.tsv.xz'))
148156

149157
return parser.parse_args(args[1:])
150158

@@ -213,6 +221,9 @@ def expect_file(f):
213221
if not args.noquality and args.quality:
214222
expect_file(args.quality)
215223

224+
if not args.nodomain and args.domain:
225+
expect_file(args.domain)
226+
216227
def create_db(args):
217228
if not os.path.exists(args.output):
218229
os.makedirs(args.output)
@@ -395,6 +406,13 @@ def quality(args,resultfile):
395406
logger.info('Quality annotation completed.')
396407
return number,percentage
397408

409+
def domain(args,resultfile):
410+
from gmsc_mapper.map_domain import smorf_domain
411+
logger.debug('Start domain annotation...')
412+
number = smorf_domain(args.domain,args.output,resultfile)
413+
logger.info('Domain annotation completed.')
414+
return number
415+
398416
def predicted_smorf_count(file_name):
399417
return sum(1 for _ in open(file_name, 'rt'))
400418

@@ -485,6 +503,11 @@ def main(args=None):
485503
summary.append(f'{annotated_number} ({1 - rank_percentage["no rank"]:.2%}) aligned smORFs have taxonomy annotation.')
486504
for rank in ['kingdom','phylum','class','order','family','genus','species']:
487505
summary.append(f'{rank_number[rank]} ({rank_percentage[rank]:.2%}) aligned smORFs are annotated at level of {rank}.')
506+
507+
if not args.nodomain:
508+
summary.append(f'\n# Conserved domain')
509+
number = domain(args,resultfile)
510+
summary.append(f'{number} aligned smORFs are annotated with CDD database.\n')
488511
else:
489512
summary.append(f'None of sequences are aligned against GMSC.\n')
490513

gmsc_mapper/map_domain.py

Lines changed: 51 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,51 @@
1+
import pandas as pd
2+
from os import path
3+
4+
def fixdf(x):
5+
x = x.dropna()
6+
x = x.drop_duplicates()
7+
return ','.join(x)
8+
9+
def formatlabel(x):
10+
x = x.split(',')
11+
x = list(set(x))
12+
x = sorted(x)
13+
return ','.join(x)
14+
15+
def store_domain(cddfile):
16+
cdd = pd.read_table(cddfile, sep='\t', header=None, names=['gmsc','cdd'])
17+
cdd_dict = dict(zip(cdd['gmsc'],cdd['cdd']))
18+
return cdd_dict
19+
20+
def smorf_domain(cddfile, outdir, resultfile):
21+
cdd_file = path.join(outdir, "domain.out.smorfs.tsv")
22+
23+
result = pd.read_csv(resultfile,
24+
sep='\t',
25+
header=None)
26+
result.rename({0: 'qseqid', 1: 'sseqid'},
27+
axis=1,
28+
inplace=True)
29+
mapped_sseqid = result['sseqid'].to_list()
30+
31+
cdd_dict = store_domain(cddfile)
32+
33+
mapped_sseqid_cdd = {}
34+
for item in mapped_sseqid:
35+
if item in cdd_dict.keys():
36+
mapped_sseqid_cdd[item] = cdd_dict[item]
37+
result['cdd'] = result['sseqid'].map(lambda g: mapped_sseqid_cdd.get(g))
38+
39+
output = result[['qseqid', 'cdd']]
40+
output = output.sort_values(by='qseqid')
41+
output = output.groupby('qseqid',
42+
as_index=False,
43+
sort=False)
44+
output = output.agg({'cdd':lambda x : fixdf(x)})
45+
output['cdd'] = output['cdd'].apply(lambda x: formatlabel(x))
46+
output.to_csv(cdd_file,
47+
sep='\t',
48+
index=False)
49+
50+
annotated = output[output['cdd']!='']['cdd'].count()
51+
return annotated

tests/test_domain.py

Lines changed: 19 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,19 @@
1+
from gmsc_mapper.map_domain import smorf_domain
2+
import pytest
3+
import os
4+
5+
known_domain = {"qseqid":"cdd",
6+
"smORF_0":"197696",
7+
"smORF_1":"",
8+
"smORF_2":"198061,429147,429887"}
9+
domain_dict={}
10+
def test_smorf_domain():
11+
smorf_domain('./tests/test_domain.txt',os.path.dirname(os.path.realpath(__file__)),'./tests/alignment.tsv')
12+
with open('./tests/domain.out.smorfs.tsv',"rt") as f:
13+
for line in f:
14+
qseqid,domain = line.split("\t")
15+
domain_dict[qseqid] = domain.strip()
16+
assert domain_dict == known_domain
17+
18+
if __name__ == '__main__':
19+
pytest.main()

tests/test_domain.txt

Lines changed: 3 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,3 @@
1+
GMSC10.90AA.000_000_000_000 198061,429887
2+
GMSC10.90AA.000_000_000_001 429147,429887
3+
GMSC10.90AA.000_000_000_004 197696

0 commit comments

Comments
 (0)