|
| 1 | +## Update Existing Database |
| 2 | +You can add new sequences to an existing database in the "UPDATE DATABASE" tab by providing these inputs: |
| 3 | +1. **FASTA files**: Each sequence must have a unique `>accession.version` or `>accession` header (e.g., `>CP001849.1` or `>CP001849`). |
| 4 | +2. **NCBI-style accession2taxid**: Sequences whose accessions are not listed in this file are skipped. Version numbers are ignored. </br> |
| 5 | +(This file is generated automatically when the GTDB-based checkbox is checked.) |
| 6 | +3. **Old database directory**: The directory containing the existing database to be updated. |
| 7 | + |
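The header and accession2taxid rules above can be checked before starting an update. The following is a minimal sketch, not part of Metabuli: the function names are hypothetical, the accession2taxid file is assumed to be tab-separated with a header row and the accession in the first column, and `CP001849.1` and `CP001849` are assumed to count as the same accession because version numbers are ignored.

```python
def accession_of(header_line):
    """Return the accession of a '>accession[.version] ...' header, version removed."""
    tokens = header_line[1:].split()
    if not tokens:
        return ""
    token = tokens[0]
    base, sep, version = token.rpartition(".")
    # Strip the suffix only when it looks like a numeric version.
    return base if sep and version.isdigit() else token


def load_accessions(accession2taxid_path):
    """Collect the first column of an accession2taxid file (header row skipped)."""
    accessions = set()
    with open(accession2taxid_path) as handle:
        next(handle, None)  # assumed header: accession, accession.version, taxid, gi
        for line in handle:
            fields = line.rstrip("\n").split("\t")
            if fields and fields[0]:
                accessions.add(fields[0])
    return accessions


def check_fastas(fasta_paths, known_accessions):
    """Return (duplicated accessions, accessions missing from accession2taxid)."""
    seen, duplicates, unlisted = set(), set(), set()
    for path in fasta_paths:
        with open(path) as handle:
            for line in handle:
                if not line.startswith(">"):
                    continue
                accession = accession_of(line)
                if accession in seen:
                    duplicates.add(accession)
                seen.add(accession)
                if accession not in known_accessions:
                    unlisted.add(accession)  # these sequences would be skipped
    return duplicates, unlisted
```

Calling `check_fastas` with the paths in your FASTA list and the set returned by `load_accessions` gives two sets to review: headers that collide, and sequences that would be skipped because they are not listed.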
| 8 | + |
| 9 | +### Required fields |
| 10 | +1. **GTDB-based checkbox:** Check this if you are adding genomes using the taxonkit-GTDB taxonomy. |
| 11 | +2. **Old Database Directory:** The directory containing the existing database to be updated. |
| 12 | +3. **New Database Directory:** The directory where the updated database will be generated. |
| 13 | +4. **FASTA List:** A file containing absolute paths to FASTA files. |
| 14 | +5. **Taxonomy Info** |
| 15 | + - If GTDB-based is checked: provide the taxonkit-GTDB taxonomy directory. |
| 16 | + - If not checked: provide an NCBI-style accession2taxid file. |
| 17 | + |
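The **FASTA List** field expects a file of absolute paths. One way to produce it is sketched below, assuming one path per line and assuming the genomes sit under a single directory with `.fa`, `.fna`, or `.fasta` extensions. The function name and the extension set are illustrative, not something Metabuli defines.

```python
from pathlib import Path

# Assumed extensions; adjust to match your files.
FASTA_SUFFIXES = {".fa", ".fna", ".fasta"}


def write_fasta_list(genome_dir, list_path):
    """Write the absolute path of every FASTA file under genome_dir, one per line."""
    paths = sorted(
        str(p.resolve())
        for p in Path(genome_dir).rglob("*")
        if p.is_file() and p.suffix.lower() in FASTA_SUFFIXES
    )
    Path(list_path).write_text("".join(f"{p}\n" for p in paths))
    return paths
```

The same helper could be pointed at a directory of CDS files (with a different suffix set) to prepare the file for the **CDS Info** option.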
| 18 | +### Optional fields |
| 19 | +- **Max RAM**: Specify the maximum amount of RAM (in GiB) to allocate. |
| 20 | +- **Threads**: Specify the number of threads to use. |
| 21 | +- **Accession Level**: Create a database for accession-level classification. </br> |
| 22 | + *(WARNING: This option is not tested for large databases. Using it with more than 100,000 sequences may cause issues.)* |
| 23 | +- **Make Library**: Create a library of species genomes. This accelerates processing when large FASTA files contain genomes of multiple species. |
| 24 | +- **CDS Info**: A file containing absolute paths to CDS files. For the listed accessions, Prodigal’s gene prediction will be skipped. Only GenBank/RefSeq-format CDS files are supported. |
| 25 | +- **New Taxa**: Used when adding sequences from taxa not included in the existing database. See the section below for details. |
| 26 | + |
| 27 | +### Adding sequences of new taxa |
| 28 | +> [!WARNING] |
| 29 | +> Mixing taxonomies within the same domain is not recommended. For example, adding prokaryotes to a GTDB-based database using NCBI taxonomy will cause issues, but adding eukaryotes or viruses to a GTDB-based database using NCBI taxonomy is fine since GTDB does not cover them. |
| 30 | + |
| 31 | +1\. **Check the taxonomy dump files** to see whether you really need to add new taxa. The `taxdump` command retrieves the taxonomy dump files of an existing database. |
| 32 | + |
| 33 | +2-1\. **Create a new taxa list** |
| 34 | + |
| 35 | +If you have both **accession2taxid** and **taxonomy dump** files for the new sequences, you can use the `CREATE NEW TAXA` button next to the `New Taxa` option. |
| 36 | +This generates two files: |
| 37 | +- `newtaxa.tsv` for the `New Taxa` option |
| 38 | +- `newtaxa.accession2taxid` for the `Accession 2 Tax Id` field. |
| 39 | + |
| 40 | +<!-- ``` |
| 41 | +metabuli createnewtaxalist <OLD DBDIR> <FASTA_LIST> <new taxonomy dump> <accession2taxid> <OUTDIR> |
| 42 | +``` --> |
| 43 | + |
| 44 | +##### Example |
| 45 | +Suppose you're adding eukaryotic sequences to a GTDB-based database. Since GTDB doesn't include eukaryotes, you may want to use NCBI taxonomy for eukaryotes. |
| 46 | +You can download `taxdump` files from [here](https://ftp.ncbi.nlm.nih.gov/pub/taxonomy/new_taxdump/) and `accession2taxid` from [here](https://ftp.ncbi.nlm.nih.gov/pub/taxonomy/accession2taxid/). |
| 47 | +- How to use `CREATE NEW TAXA`: |
| 48 | + - `Old Database Directory`: Your existing GTDB database directory. |
| 49 | + - `FASTA List`: A file containing absolute paths to FASTA files to be added. |
| 50 | + - `New Taxonomy Path`: The directory of NCBI Taxonomy dump files. |
| 51 | + - `Accession 2 Tax Id`: NCBI-style accession2taxid file. |
| 52 | + - `Output Directory`: The directory where `newtaxa.tsv` and `newtaxa.accession2taxid` will be generated. |
| 53 | +- How to run `UPDATE DATABASE`: |
| 54 | + - `GTDB-Based checkbox`: **Don't check** it, since you are not using the GTDB tree for the new sequences. |
| 55 | + - `Old Database Directory`: Your existing GTDB database directory. |
| 56 | + - `New Database Directory`: The directory for the updated database to be created. |
| 57 | + - `FASTA List`: The same one as above. |
| 58 | + - `Accession 2 Tax Id`: `newtaxa.accession2taxid` generated by `CREATE NEW TAXA`. |
| 59 | + - `New Taxa` option: `newtaxa.tsv` generated by `CREATE NEW TAXA`. |
| 60 | + |
| 61 | +</br> |
| 62 | + |
| 63 | +2-2\. **Manually prepare a new taxa list** |
| 64 | + |
| 65 | +For the `New Taxa` option, provide a four-column TSV file in the following format. |
| 66 | +``` |
| 67 | +taxID parentID rank name |
| 68 | +``` |
| 69 | +The new taxon must be linked to a taxon in the existing database's taxonomy. |
| 70 | + |
| 71 | +##### Example |
| 72 | +Suppose you want to add *Saccharomyces cerevisiae* to a GTDB database whose taxonomy lacks the Fungi kingdom and only includes one eukaryote (*Homo sapiens*). In this scenario, your new taxa list and accession2taxid should be as follows. |
| 73 | +``` |
| 74 | +# New taxa list |
| 75 | +## taxid parentTaxID rank name // Don't put this header in your actual file. |
| 76 | +10000013 10000012 species Saccharomyces cerevisiae |
| 77 | +10000012 10000011 genus Saccharomyces |
| 78 | +10000011 10000010 family Saccharomycetaceae |
| 79 | +10000010 10000009 order Saccharomycetales |
| 80 | +10000009 10000008 class Saccharomycetes |
| 81 | +10000008 10000007 phylum Ascomycota |
| 82 | +10000007 10000000 kingdom Fungi // 10000000 is the Eukaryota taxID of the pre-built DB. |
| 83 | + |
| 84 | +# accession2taxid |
| 85 | +accession accession.version taxid gi |
| 86 | +newseq1 newseq1 10000013 0 |
| 87 | +newseq2 newseq2 10000013 0 |
| 88 | +``` |
| 89 | + |
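A hand-written new taxa list is easy to get wrong, so it may help to verify the linkage rule before running the update. The sketch below uses hypothetical function names and is not part of Metabuli. It assumes the list is tab-separated with exactly four columns and no header, and that you supply the set of taxIDs already present in the existing database (for instance, collected from its taxonomy dump files).

```python
# The example list from above, written with explicit tabs.
EXAMPLE_NEW_TAXA = "\n".join([
    "10000013\t10000012\tspecies\tSaccharomyces cerevisiae",
    "10000012\t10000011\tgenus\tSaccharomyces",
    "10000011\t10000010\tfamily\tSaccharomycetaceae",
    "10000010\t10000009\torder\tSaccharomycetales",
    "10000009\t10000008\tclass\tSaccharomycetes",
    "10000008\t10000007\tphylum\tAscomycota",
    "10000007\t10000000\tkingdom\tFungi",
])


def parse_new_taxa(text):
    """Parse 'taxID<TAB>parentID<TAB>rank<TAB>name' lines into tuples."""
    rows = []
    for number, line in enumerate(text.splitlines(), 1):
        if not line.strip():
            continue
        fields = line.split("\t")
        if len(fields) != 4:
            raise ValueError(
                f"line {number}: expected 4 tab-separated columns, got {len(fields)}"
            )
        taxid, parent, rank, name = fields
        rows.append((int(taxid), int(parent), rank, name))
    return rows


def find_problems(rows, existing_taxids):
    """List taxa whose parent is neither in the new list nor in the existing taxonomy."""
    new_ids = {row[0] for row in rows}
    problems = []
    if len(new_ids) != len(rows):
        problems.append("duplicate taxID in the new taxa list")
    for taxid, parent, _rank, name in rows:
        if parent not in new_ids and parent not in existing_taxids:
            problems.append(f"{taxid} ({name}) has unknown parent {parent}")
    return problems
```

With the example list and an existing taxonomy that contains `10000000`, `find_problems` returns an empty list. If `10000000` were missing from the existing taxonomy, the Fungi row would be reported as unlinked.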
| 90 | +--- |