SYNOPSIS
metacache modify <database> <sequence file/directory>... [OPTION]...
metacache modify <database> [OPTION]... <sequence file/directory>...
DESCRIPTION
Add reference sequence and/or taxonomic information to an existing database.
REQUIRED PARAMETERS
<database> Name of database.
A MetaCache database contains taxonomic information and
min-hash signatures of reference sequences (complete
genomes, scaffolds, contigs, ...).
<sequence file/directory>...
FASTA or FASTQ files containing genomic sequences
(complete genomes, scaffolds, contigs, ...) that shall
be used as representatives of an organism/taxon.
If directory names are given, they will be searched for
sequence files (at most 10 levels deep).
The input files can also be compressed if MetaCache was
built with the zlib compression library.
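
To preview which files a directory argument might contribute, the depth-limited search can be approximated with `find`. This is only a sketch: the `genomes_demo` tree and its file names are hypothetical, treating "10 levels deep" as `find -maxdepth 10` is an assumption about how MetaCache counts levels, and this page does not state whether MetaCache filters files by name.

```shell
# Build a small hypothetical directory tree to search.
demo_root=$(mktemp -d)
mkdir -p "$demo_root/genomes_demo/bacteria/ecoli"
touch "$demo_root/genomes_demo/a.fna"
touch "$demo_root/genomes_demo/bacteria/ecoli/b.fasta"

# List regular files at most 10 directory levels below the given directory.
# (-maxdepth is supported by GNU and BSD find.)
found=$(cd "$demo_root" && find genomes_demo -maxdepth 10 -type f | sort)
printf '%s\n' "$found"

# Clean up the temporary tree.
rm -rf "$demo_root"
```

Running a listing like this before 'metacache modify' makes it easier to spot files that sit deeper than expected.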
BASIC OPTIONS
-taxonomy <path> directory with taxonomic hierarchy data (see NCBI's
taxonomic data files)
-taxpostmap <file>
Files with sequence-to-taxon-id mappings that are used as an
alternative source in a post-processing step.
default: 'nucl_(gb|wgs|est|gss).accession2taxid'
-sequence-id-format (smart|ncbi|gi|filename|leadingword)
Method used for extracting sequence IDs from filenames and
sequence headers. Sequence IDs are also used to assign
taxa to reference sequences.
Available types are:
smart : try NCBI > genbank > filename
ncbi : NCBI-style accession/accession.version
gi : genbank identifier
filename : filename without extension
leadingword : first stretch of non-whitespace characters
default: smart
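
The 'filename' and 'leadingword' styles can be approximated in plain shell to see what kind of ID each would yield. This is a rough sketch, not MetaCache's actual parser: the file path and header are example inputs, and stripping only the last extension and the leading '>' are assumptions.

```shell
# Example inputs: a file path and the first header line it contains.
file_path="refs/GCF_000001.1_demo.fna"
header=">NC_000913.3 Escherichia coli str. K-12 substr. MG1655, complete genome"

# 'filename' style: file name without directory and without the last extension.
base=${file_path##*/}
id_from_filename=${base%.*}

# 'leadingword' style: first stretch of non-whitespace characters.
# Dropping the leading '>' first is an assumption of this sketch.
id_from_leadingword=$(printf '%s\n' "${header#>}" | awk '{ print $1 }')

printf 'filename:    %s\n' "$id_from_filename"
printf 'leadingword: %s\n' "$id_from_leadingword"
```

With these inputs the sketch prints 'GCF_000001.1_demo' for the filename style and 'NC_000913.3' for the leadingword style.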
-silent|-verbose information level during build:
silent => none / verbose => most detailed
default: neither => only errors/important info
ADVANCED OPTIONS
-reset-taxa Attempts to re-rank all sequences after the main build
phase using '.accession2taxid' files. This will reset the
taxon id of a reference sequence even if a taxon id could
be obtained from other sources during the build phase.
default: off
-max-locations-per-feature <#>
maximum number of reference sequence locations to be
stored per feature.
If the value is too high, it will significantly reduce
querying speed. Note that an upper hard limit is always
imposed by the data type used for the hash table bucket
size (set with compilation macro
'-DMC_LOCATION_LIST_SIZE_TYPE').
default: 254
-remove-overpopulated-features
Removes all features that have reached the maximum allowed
number of locations per feature. This can improve querying
speed and can be used to remove non-discriminative
features.
default: off
Not available in the GPU version.
-remove-ambig-features <rank>
Removes all features that have more distinct reference
sequence taxa at the given taxonomic rank than set by
'-max-ambig-per-feature'. This can decrease the database
size significantly at the expense of sensitivity. Note
that the lower the given taxonomic rank is, the more
pronounced the effect will be.
Valid values: sequence, form, variety, subspecies,
species, subgenus, genus, subtribe, tribe, subfamily,
family, suborder, order, subclass, class, subphylum,
phylum, subkingdom, kingdom, domain
default: off
Not available in the GPU version.
-max-ambig-per-feature <#>
Maximum number of allowed different reference sequence
taxa per feature if option '-remove-ambig-features' is
used.
Not available in the GPU version.
-max-load-fac <factor>
maximum hash table load factor.
This can be used to trade memory consumption for speed
and vice versa. A lower load factor improves speed;
a higher one improves memory efficiency.
default: 0.800000
Not available in the GPU version.
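
As a rough illustration of the memory side of this trade-off, the general hash table relation "slots = entries / load factor" can be evaluated for a few values. The entry count is hypothetical, and this is generic arithmetic rather than a description of MetaCache's internal table layout.

```shell
# Hypothetical number of stored features.
entries=1000000

# Slots needed so that entries/slots does not exceed the given load factor.
slots_for() {
    awk -v n="$entries" -v f="$1" 'BEGIN {
        s = n / f
        r = int(s)
        if (r < s) r++    # round up to a whole slot
        printf "%d\n", r
    }'
}

for f in 0.5 0.8 0.95; do
    printf 'load factor %s -> %s slots\n' "$f" "$(slots_for "$f")"
done
```

A lower factor reserves more slots for the same number of entries, which is where the extra memory goes.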
EXAMPLES
Add reference sequence 'penicillium.fna' to database 'fungi'
metacache modify fungi penicillium.fna
Add taxonomic information from NCBI to database 'myBacteria'
download_ncbi_taxonomy myTaxo
metacache modify myBacteria -taxonomy myTaxo
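
Several of the options on this page can be combined in one call. The sketch below only assembles and prints such a command line instead of running it. The database name, the 'genomes/' directory, 'myTaxo', and the values 'species' and '4' are placeholders, and '-remove-ambig-features' is not available in the GPU version.

```shell
# Placeholder names; adjust to your own database and paths.
db="fungi"
seqs="genomes/"
taxo="myTaxo"

# Assemble the argument list: database first, then inputs, then options.
set -- modify "$db" "$seqs" \
    -taxonomy "$taxo" \
    -remove-ambig-features species \
    -max-ambig-per-feature 4 \
    -verbose

# Print the command instead of executing it; run it manually once the
# names are correct.
cmd="metacache $*"
printf '%s\n' "$cmd"
```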