SYNOPSIS

    metacache modify <database> <sequence file/directory>... [OPTION]...

    metacache modify <database> [OPTION]... <sequence file/directory>...


DESCRIPTION

    Add reference sequences and/or taxonomic information to an existing
    database.


REQUIRED PARAMETERS

    <database>        Name of database.
                      A MetaCache database contains taxonomic information and
                      min-hash signatures of reference sequences (complete
                      genomes, scaffolds, contigs, ...).

    <sequence file/directory>...
                      FASTA or FASTQ files containing genomic sequences
                      (complete genomes, scaffolds, contigs, ...) that shall
                      be used as representatives of an organism/taxon.
                      If directory names are given, they will be searched for
                      sequence files (at most 10 levels deep).
                      The input files can also be compressed if MetaCache was
                      built with the zlib compression library.


BASIC OPTIONS

    -taxonomy <path>  directory with taxonomic hierarchy data (see NCBI's
                      taxonomic data files)

    -taxpostmap <file>
                      Files with sequence to taxon id mappings that are used as
                      an alternative source in a post-processing step.
                      default: 'nucl_(gb|wgs|est|gss).accession2taxid'

    -sequence-id-format (smart|ncbi|gi|filename|leadingword)
                      Method used for extracting sequence IDs from filenames and
                      sequence headers. Sequence IDs are also used to assign
                      taxa to reference sequences.
                      Available types are:
                      smart       : try NCBI > genbank > filename
                      ncbi        : NCBI-style accession/accession.version
                      gi          : genbank identifier
                      filename    : filename without extension
                      leadingword : first stretch of non-whitespace characters
                      default: smart

    -silent|-verbose  information level during build:
                      silent => none / verbose => most detailed
                      default: neither => only errors/important info
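
    The 'filename' and 'leadingword' modes of '-sequence-id-format' are
    defined precisely enough above to sketch in a few lines of Python. This
    is only an illustration of the two rules, not MetaCache's implementation.
    The function names are hypothetical, and the handling of double
    extensions and of the leading '>' or '@' header marker is an assumption.

```python
from pathlib import PurePath


def id_from_filename(path: str) -> str:
    """'filename' mode: the file name without its extension.

    Assumption: only the last suffix is dropped. How MetaCache treats
    double extensions such as '.fna.gz' is not specified in this text.
    """
    return PurePath(path).stem


def id_from_leading_word(header: str) -> str:
    """'leadingword' mode: first stretch of non-whitespace characters.

    Assumption: a leading FASTA '>' or FASTQ '@' marker is not part
    of the ID and is stripped before the first word is taken.
    """
    words = header.strip().lstrip(">@").split()
    return words[0] if words else ""


print(id_from_filename("genomes/penicillium.fna"))
print(id_from_leading_word(">NC_000913.3 Escherichia coli K-12"))
```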


ADVANCED OPTIONS

    -reset-taxa       Attempts to re-rank all sequences after the main build
                      phase using '.accession2taxid' files. This will reset the
                      taxon id of a reference sequence even if a taxon id could
                      be obtained from other sources during the build phase.
                      default: off

    -max-locations-per-feature <#>
                      maximum number of reference sequence locations to be
                      stored per feature;
                      If the value is too high, it will significantly reduce
                      querying speed. Note that an upper hard limit is always
                      imposed by the data type used for the hash table bucket
                      size (set with compilation macro
                      '-DMC_LOCATION_LIST_SIZE_TYPE').
                      default: 254

    -remove-overpopulated-features
                      Removes all features that have reached the maximum allowed
                      number of locations per feature. This can improve querying
                      speed and can be used to remove non-discriminative
                      features.
                      default: off
                      Not available in the GPU version.

    -remove-ambig-features <rank>
                      Removes all features that have more distinct reference
                      sequence taxa at the given taxonomic rank than set by
                      '-max-ambig-per-feature'. This can decrease the database
                      size significantly at the expense of sensitivity. Note
                      that the lower the given taxonomic rank is, the more
                      pronounced the effect will be.
                      Valid values: sequence, form, variety, subspecies,
                      species, subgenus, genus, subtribe, tribe, subfamily,
                      family, suborder, order, subclass, class, subphylum,
                      phylum, subkingdom, kingdom, domain
                      default: off
                      Not available in the GPU version.

    -max-ambig-per-feature <#>
                      Maximum number of allowed different reference sequence
                      taxa per feature if option '-remove-ambig-features' is
                      used.
                      Not available in the GPU version.

    -max-load-fac <factor>
                      maximum hash table load factor;
                      This can be used to trade memory consumption for speed
                      and vice versa. A lower load factor improves speed; a
                      higher one improves memory efficiency.
                      default: 0.800000
                      Not available in the GPU version.
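
    The memory side of the '-max-load-fac' trade-off can be made concrete
    with a little arithmetic. The sketch below assumes the usual definition
    of load factor (stored entries divided by bucket count). The function
    name and the feature count are hypothetical, and MetaCache's actual
    table sizing strategy may differ.

```python
import math


def min_buckets(num_features: int, max_load_factor: float) -> int:
    """Smallest bucket count with num_features / buckets <= max_load_factor."""
    if max_load_factor <= 0.0:
        raise ValueError("load factor must be positive")
    return math.ceil(num_features / max_load_factor)


# A lower maximum load factor needs more buckets (more memory) for the
# same number of stored features; a higher one needs fewer.
for load in (0.5, 0.8, 0.95):
    print(load, min_buckets(1_000_000, load))
```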


EXAMPLES

    Add reference sequence 'penicillium.fna' to database 'fungi'
        metacache modify fungi penicillium.fna

    Add taxonomic information from NCBI to database 'myBacteria'
        download_ncbi_taxonomy myTaxo
        metacache modify myBacteria -taxonomy myTaxo
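
    When 'metacache modify' is driven from a script, the argument list can
    be assembled programmatically. The following Python sketch follows the
    first SYNOPSIS form and uses only options documented above. The helper
    'modify_command' is hypothetical and is not part of MetaCache.

```python
import shlex
from typing import List, Optional, Sequence


def modify_command(database: str,
                   inputs: Sequence[str] = (),
                   taxonomy: Optional[str] = None,
                   max_load_fac: Optional[float] = None) -> List[str]:
    """Build argv for: metacache modify <database> <inputs>... [OPTION]..."""
    argv = ["metacache", "modify", database, *inputs]
    if taxonomy is not None:
        argv += ["-taxonomy", taxonomy]
    if max_load_fac is not None:
        argv += ["-max-load-fac", str(max_load_fac)]
    return argv


argv = modify_command("fungi", ["penicillium.fna"], taxonomy="myTaxo")
print(shlex.join(argv))
# To actually run it (requires metacache on PATH):
#   import subprocess; subprocess.run(argv, check=True)
```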