Commit 2292cee

Merge pull request #17 from jaebeom-kim/main
Improve Readme
2 parents 2d941a1 + 684ab08 commit 2292cee

9 files changed, +522 -69 lines changed

Lines changed: 32 additions & 0 deletions
@@ -0,0 +1,32 @@
name: Auto-generate README

on:
  push:
    paths:
      - 'fragments/**'
      - 'generate_readme.py'
  workflow_dispatch:

jobs:
  build-readme:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout repo
        uses: actions/checkout@v3

      - name: Set up Python
        uses: actions/setup-python@v4
        with:
          python-version: '3.10'

      - name: Generate README
        run: python generate_readme.py

      - name: Commit and push if README changed
        run: |
          git config --global user.name "GitHub Actions"
          git config --global user.email "actions@github.com"
          git add README.md
          git diff --cached --quiet || git commit -m "Auto-update README"
          git push

README.md

Lines changed: 193 additions & 69 deletions
Large diffs are not rendered by default.

docs/classification.md

Lines changed: 50 additions & 0 deletions
@@ -0,0 +1,50 @@
# Classification
Metabuli App provides two taxonomic profiling modes in the **Search Settings** panel: **New Search** and **Upload Report**.
<img alt="SearchPage_Demo_Image" src="https://github.com/user-attachments/assets/9ab5a86c-5603-4dc7-be3b-baf2ed490ef0" style="max-height: 600px; width: auto;">

## New Classification
### Required Fields:
1. **Mode:** Select the analysis mode: single-end, paired-end, or long-read.
2. **Job ID:** Enter a unique identifier for the job.
3. **Select Files:** Upload the necessary files and directories.
   - Read 1 File (and Read 2 File if Paired-end is selected)
   - Database Directory
   - Output Directory
4. **Max RAM:** Specify the maximum RAM (in GiB) to allocate for the job.

### Advanced Settings (Optional):
- **Threads:** Specify the thread count for the job.
- **Min Score:** Set the minimum score required to make a classification. A higher value reduces false positives at the cost of sensitivity.
  - Recommended values (for details, please refer to Supp. Fig. 4-7 in the [Metabuli paper](https://www.nature.com/articles/s41592-024-02273-y)):
    - Illumina short reads: 0.15
    - PacBio HiFi reads: 0.07
    - PacBio Sequel II reads: 0.005
    - Nanopore long reads: 0.008
- **Min SP Score:** Set the minimum score for species- or lower-level classification. It avoids overconfident classifications.
  - Recommended values (for details, please refer to Supp. Fig. 4-7 in the [Metabuli paper](https://www.nature.com/articles/s41592-024-02273-y)):
    - Illumina short reads: 0.5
    - PacBio HiFi reads: 0.3
- **Taxonomy Path:** Use it when your database does not have a `taxonomy` directory or a `taxonomyDB` file. Provide a directory containing the `names.dmp`, `nodes.dmp`, and `merged.dmp` files.
- **Accession Level:** Classify reads to accessions if available.
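
As a compact restatement of the recommended values listed above, the following Python sketch maps a read type to its thresholds. The dictionary keys and the function name are hypothetical labels chosen for illustration and are not Metabuli App identifiers; `None` marks read types for which no Min SP Score recommendation is listed.

```python
# Recommended thresholds copied from the lists above.
# The keys are hypothetical labels, not Metabuli App identifiers.
RECOMMENDED = {
    "illumina": {"min_score": 0.15, "min_sp_score": 0.5},
    "pacbio_hifi": {"min_score": 0.07, "min_sp_score": 0.3},
    "pacbio_sequel2": {"min_score": 0.005, "min_sp_score": None},
    "nanopore": {"min_score": 0.008, "min_sp_score": None},
}


def recommended_thresholds(read_type):
    """Return (min_score, min_sp_score) for a read-type label.

    A value of None means the documentation lists no recommendation.
    """
    key = read_type.strip().lower()
    if key not in RECOMMENDED:
        raise ValueError(f"no recommendation listed for {read_type!r}")
    entry = RECOMMENDED[key]
    return entry["min_score"], entry["min_sp_score"]


print(recommended_thresholds("illumina"))  # prints (0.15, 0.5)
```

The values would still be entered by hand in the **Advanced Settings** panel; the lookup only makes the pairing of read type and threshold explicit.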

### Start Analysis:
- Click the `Run Metabuli` button to start the metagenomic classification process.
- You can track the progress and see real-time backend output in the logs.

### View Results:
- Once the analysis is complete, you can view the results in three different forms:
  - **Table**: View the raw classification data in a table format.
  - **Sankey Diagram**: A flow diagram representing the lineage information of the displayed taxa.
  - **Krona Chart**: A hierarchical interactive chart that visualizes classification results.

## Upload Report

To visualize results from a previously completed job:

1. Navigate to the **Upload Report** tab.
2. Upload the `report.tsv` file from a prior job.
3. View the uploaded results directly in the **Results** tab. For this job type, the Krona chart is not available; results are provided in:
   - **Table**: The raw data in table format.
   - **Sankey Diagram**: A flow diagram representing the lineage paths for the displayed taxa.

---

docs/createdb.md

Lines changed: 51 additions & 0 deletions
@@ -0,0 +1,51 @@
## Create New Database
You can create a new database in the "NEW DATABASE" tab by providing these three inputs:
1. **FASTA files** : Each sequence must have a unique `>accession.version` or `>accession` header (e.g., `>CP001849.1` or `>CP001849`).
2. **NCBI-style taxonomy dump** : `names.dmp`, `nodes.dmp`, and `merged.dmp`. Sequences whose taxonomy IDs are absent from these files are skipped.
3. **NCBI-style accession2taxid** : Sequences whose accessions are absent from this file are skipped, and versions are ignored.


### How to prepare NCBI-style taxonomy dump files and accession2taxid
#### NCBI-style taxonomy dump
- `NCBI Taxonomy`: Download [here](https://ftp.ncbi.nlm.nih.gov/pub/taxonomy/new_taxdump/).
- `GTDB`: Download taxonkit-GTDB files [here](https://github.com/shenwei356/gtdb-taxdump/releases).
- `ICTV`: Download taxonkit-ICTV files [here](https://github.com/shenwei356/ictv-taxdump/releases).
- `Custom taxonomy`: Generate your own `names.dmp`, `nodes.dmp`, and `merged.dmp`.


#### NCBI-style accession2taxid
- `NCBI Taxonomy`: Download [here](https://ftp.ncbi.nlm.nih.gov/pub/taxonomy/accession2taxid/). Check the [README](https://ftp.ncbi.nlm.nih.gov/pub/taxonomy/accession2taxid/README) to find out which file to use.
- `GTDB`: It is auto-generated from the `taxid.map` file in the taxonkit-GTDB directory.
- `ICTV`: Use `prepare-accession2taxid.sh` from [here](https://github.com/jaebeom-kim/Metabuli-ICTV-challenge).
- `Custom accession2taxid`: Generate your own `accession2taxid` file.

#### Edit files to include custom sequences
* Taxonomy dump files:
  * Edit `nodes.dmp` and `names.dmp` to define any new `taxid` used in `accession2taxid`.
* accession2taxid file:
  * For a sequence whose header is `>custom`, add `custom[tab]custom[tab]taxid[tab]anynumber`.
  * As above, the version number is not necessary.
  * `taxid` must be included in `nodes.dmp` and `names.dmp`.
  * Put any number in the last column. It is not used in Metabuli.
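
The row format described above can be sketched in Python as follows. This is a minimal illustration, not part of Metabuli: the function name `accession2taxid_rows` is hypothetical, the last column is filled with `0` as an arbitrary placeholder, and a trailing `.N` with numeric `N` is assumed to be the version.

```python
def accession2taxid_rows(fasta_lines, taxid):
    """Build tab-separated accession2taxid rows for every FASTA header line.

    Hypothetical helper for illustration only.
    """
    rows = []
    for line in fasta_lines:
        if not line.startswith(">"):
            continue  # sequence line, not a header
        fields = line[1:].split()
        if not fields:
            continue  # empty header, nothing to record
        acc_ver = fields[0]
        # Assumption: a numeric suffix after the last dot is the version.
        base, _dot, version = acc_ver.rpartition(".")
        accession = base if base and version.isdigit() else acc_ver
        # The last column is unused by Metabuli, so 0 is a placeholder.
        rows.append(f"{accession}\t{acc_ver}\t{taxid}\t0")
    return rows


for row in accession2taxid_rows([">custom my sequence", "ACGT"], 10000013):
    print(row)
```

For the header `>custom`, this yields `custom`, `custom`, the given `taxid`, and `0`, separated by tabs, which matches the layout described in the list above.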


### Required fields
1. **GTDB-based checkbox:** Check it if you use taxonkit-generated GTDB taxonomy `dmp` files.
2. **Database Directory:** The directory where the database will be generated.
3. **FASTA List:** A file containing absolute paths to FASTA files.
4. **Accession2TaxId:** A path to an NCBI-style accession2taxid file that follows the format below. <br>
```
accession   accession.version   taxid   gi
ACCESSION   ACCESSION.1         12345   6789
```
5. **Taxonomy Path:** Directory of taxonomy dump files (`names.dmp`, `nodes.dmp`, and `merged.dmp` are required).
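
The **FASTA List** field expects a text file with one absolute path per line. A minimal Python sketch for producing such a file is shown below; the function name `write_fasta_list` and the accepted file extensions are assumptions made for illustration, so adjust them to your own file naming.

```python
from pathlib import Path

# Assumed FASTA extensions; extend this set to match your files.
FASTA_SUFFIXES = {".fa", ".fna", ".fasta"}


def write_fasta_list(fasta_dir, list_path):
    """Write the absolute paths of FASTA files in fasta_dir, one per line.

    Hypothetical helper for illustration only. Returns the listed paths.
    """
    root = Path(fasta_dir).resolve()
    paths = sorted(
        p for p in root.iterdir()
        if p.is_file() and p.suffix.lower() in FASTA_SUFFIXES
    )
    text = "".join(f"{p}\n" for p in paths)
    Path(list_path).write_text(text, encoding="utf-8")
    return paths
```

Calling `write_fasta_list("genomes", "fasta_list.txt")` on a hypothetical `genomes` directory would produce a file that can then be selected in the **FASTA List** field.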

### Optional fields
- **Max RAM**: Specify the maximum RAM (in GiB) to allocate for the job.
- **Threads**: Specify the number of threads to use for the job.
- **Accession Level**: Create a database for accession-level classification. <br>
  (WARNING: It is not tested for large databases. Using it with more than 100,000 sequences may cause issues.)
- **Make Library**: Make a library of species genomes. It accelerates the process when large FASTA files contain genomes of many species.
- **CDS Info**: A file containing absolute paths to CDS files. For the included accessions, Prodigal's gene prediction is skipped. Only GenBank/RefSeq CDS files are supported.

---

docs/demo.md

Lines changed: 13 additions & 0 deletions
@@ -0,0 +1,13 @@
## Demos

### New Search Job Demo
Watch this demo to see how to run a new search on Metabuli:

https://github.com/user-attachments/assets/6c8b848b-77b8-49b1-b01e-069b872ea740

### Viewing Results
Watch this demo to see how to view the results from a completed search:

https://github.com/user-attachments/assets/8cda1132-201f-4f8d-9f09-95ccafb9e685

---

docs/general.md

Lines changed: 73 additions & 0 deletions
@@ -0,0 +1,73 @@
[![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.14603649.svg)](https://doi.org/10.5281/zenodo.14603649)
![Platform](https://img.shields.io/badge/platform-Mac%20%7C%20Windows%20%7C%20Linux-brightgreen)

# Metabuli App

This is the desktop application for Metabuli, a metagenomic classifier that jointly analyzes both DNA and amino acid sequences. Built with Vue.js and Electron, it provides an intuitive interface for running metagenomic classification jobs and visualizing the results.

For more details on Metabuli, please see
[GitHub](https://github.com/steineggerlab/Metabuli),
[Nature Methods](https://www.nature.com/articles/s41592-024-02273-y),
[PDF](https://www.nature.com/articles/s41592-024-02273-y.epdf?sharing_token=je_2D5Su0-xVOSjuKSAXF9RgN0jAjWel9jnR3ZoTv0M7gE7NDF_xi_3sW8QdRiwfSJNwqaXItSoeCvr7cvcoQxKLt0oROgWc6urmki9tP80cXEuHPN0D7b4y9y3i8Yv7sZw8MxxhAj7W6p9eZE2zaK3eozdOkXvwADVfso9cXIM%3D),
[bioRxiv](https://www.biorxiv.org/content/10.1101/2023.05.31.543018v2), or [ISMB 2023 talk](https://www.youtube.com/watch?v=vz2fuRcVwyk).
<p align="center"><img src="https://raw.githubusercontent.com/steineggerlab/Metabuli/master/.github/marv_metabuli_small.png" height="350" /></p>

## NOTE
The `NCBI` and `Assemblies` buttons in the Sankey subtree view do not work when searching against a GTDB-based database.
We will add a button for GTDB soon.

## Platforms Supported

- macOS (Universal `.dmg`)
- Windows (`.exe`)
- Linux (`.AppImage`)

## Functionality
- **Download** pre-built **databases**
- **Create or update databases** directly in the app
- Run **taxonomic classification**
- **Upload and browse** classification results
- **Extract reads** classified under a specific taxon
- Explore results with interactive **Sankey** and **Krona** plots


---

## Changelog

### v1.0.1
- Introduced the `Custom Database` page, enabling users to:
  - Create new databases.
  - Add new sequences to existing databases.
- Enhanced `Sankey visualization`:
  - Implemented Sankey plot verification to ensure the accuracy of visualized results.
  - Resolved a bug in lineage extraction from raw data that previously caused inaccuracies in lineage representation on the Sankey plot.

### v1.0.0
- Initial release of Metabuli App.

---

## Table of Contents

- [Installation](#installation)
- [Usage](#usage)
  - [Classification](#classification)
    - [New Classification](#new-classification)
    - [Upload Report](#upload-report)
  - [Create New Database](#create-new-database)
  - [Update Existing Database](#update-existing-database)
- [Demos](#demos)
- [Acknowledgments](#acknowledgments)

## Installation

- Visit the [GitHub Releases](https://github.com/steineggerlab/Metabuli-App/releases) page for the latest builds.
- The application is pre-built for **macOS**, **Windows**, and **Linux**.
- Simply download the executable for your platform from the Releases section.

> **Note:** If you encounter a security warning when opening the app, follow the instructions below to bypass the warning:
>- **macOS**: Refer to [this guide](https://support.apple.com/en-gb/guide/mac-help/mh40616/15.0/mac/15.0) on how to open apps from unidentified developers.
>- **Windows**: Click 'More info' and then 'Run anyway' to continue.

---

docs/references.md

Lines changed: 13 additions & 0 deletions
@@ -0,0 +1,13 @@
## References
The development of the Metabuli Desktop Application has been inspired by and leverages the following tools:

- **Pavian**: Elements of the table layouts and visualizations in Metabuli were inspired by Pavian for metagenomic data analysis. ([Pavian GitHub Repository](https://github.com/fbreitwieser/pavian))

- **Krona**: The Krona tool is embedded in the results page for hierarchical data visualization. ([Krona GitHub Repository](https://github.com/marbl/Krona/wiki))

- **fastp**: Ultrafast one-pass FASTQ data preprocessing, quality control, and deduplication software. ([fastp GitHub Repository](https://github.com/OpenGene/fastp))

- **fastplong**: `fastp` for long reads. ([fastplong GitHub Repository](https://github.com/OpenGene/fastplong))


We would like to acknowledge the authors of these tools for their excellent work, which has significantly contributed to the development of Metabuli App.

docs/updatedb.md

Lines changed: 90 additions & 0 deletions
@@ -0,0 +1,90 @@
## Update Existing Database
You can add new sequences to an existing database in the "UPDATE DATABASE" tab by providing these inputs:
1. **FASTA files** : Each sequence must have a unique `>accession.version` or `>accession` header (e.g., `>CP001849.1` or `>CP001849`).
2. **NCBI-style accession2taxid** : Sequences with accessions not listed here will be skipped. Version numbers are ignored. <br>
   (It is auto-generated when the GTDB-based checkbox is checked.)
3. **Old database directory**: The directory containing the existing database to be updated.


### Required fields
1. **GTDB-based checkbox:** Check this if you are adding genomes using the taxonkit-GTDB taxonomy.
2. **Old Database Directory:** The directory containing the existing database to be updated.
3. **New Database Directory:** The directory where the updated database will be generated.
4. **FASTA List:** A file containing absolute paths to FASTA files.
5. **Taxonomy Info**
   - If GTDB-based is checked: provide the taxonkit-GTDB taxonomy directory.
   - If not checked: provide an NCBI-style accession2taxid file.

### Optional fields
- **Max RAM**: Specify the maximum amount of RAM (in GiB) to allocate.
- **Threads**: Specify the number of threads to use.
- **Accession Level**: Create a database for accession-level classification. <br>
  *(WARNING: This option is not tested for large databases. Using it with more than 100,000 sequences may cause issues.)*
- **Make Library**: Create a library of species genomes. This accelerates processing when large FASTA files contain genomes of multiple species.
- **CDS Info**: A file containing absolute paths to CDS files. For the listed accessions, Prodigal’s gene prediction will be skipped. Only GenBank/RefSeq-format CDS files are supported.
- **New Taxa**: Used when adding sequences from taxa not included in the existing database. See the section below for details.
27+
### Adding seqeunces of new taxa
28+
> [WARNING]
29+
> Mixing taxonomies within the same domain is not recommended. For example, adding prokaryotes to a GTDB-based database using NCBI taxonomy will cause issues, but adding eukaryotes or viruses to a GTDB-based database using NCBI taxonomy is fine since GTDB does not cover them.
30+
31+
1\. **Check taxonomy dump files** to see if you really need to add new taxa. `taxdump` command retrieves taxdump files of an existing database.
32+
33+
2-1\. **Create a new taxa list**
34+
35+
If you have both **accession2taxid** and **taxonomy dump** files for the new sequences, you can use the `CREATE NEW TAXA` button next to the `New Taxa` option.
36+
This generates two files:
37+
- `newtaxa.tsv` for the `New Taxa` option
38+
- `newtaxa.accession2taxid` for `Accession 2 Tax Id` field.
39+
40+
<!-- ```
41+
metabuli createnewtaxalist <OLD DBDIR> <FASTA_LIST> <new taxonomy dump> <accession2taxid> <OUTDIR>
42+
``` -->
43+
44+
##### Example
45+
Suppose you're adding eukaryotic sequences to a GTDB-based database. Since GTDB doesn't include eukaryotes, you may want to use NCBI taxonomy for eukaryotes.
46+
You can download `taxdump` files from [here](https://ftp.ncbi.nlm.nih.gov/pub/taxonomy/new_taxdump/) and `accession2taxid` from [here](https://ftp.ncbi.nlm.nih.gov/pub/taxonomy/accession2taxid/).
47+
- How to use `CREATE NEW TAXA`:
48+
- `Old Database Directory`: Your existing GTDB database directory.
49+
- `FASTA List`: A file containing absolute paths to FASTA files to be added.
50+
- `New Taxonomy Path`: The directory of NCBI Taxonomy dump files.
51+
- `Accession 2 Tax Id`: NCBI-style accession2taxid file.
52+
- `Output Directory`: The directory where `newtaxa.tsv` and `newtaxa.accession2taxid` will be generated.
53+
- How to run `UPDATE DATABASE`:
54+
- `GTDB-Based checkbox`: **Don't Check** it since you are not using GTDB tree for new sequences.
55+
- `Old Database Directory`: Your existing GTDB database directory.
56+
- `New Database Directory`: The directory for the updated database to be created.
57+
- `FASTA List`: The same one as above.
58+
- `Accession 2 Tax Id`: `newtaxa.accession2taxid` generated by `CREATE NEW TAXA`.
59+
- `New Taxa` option: `newtaxa.tsv` generated by `CREATE NEW TAXA`.
60+
61+
</br>
62+
2-2\. **Manually prepare a new taxa list**

For the `New Taxa` option, provide a four-column TSV file in the following format.
```
taxID parentID rank name
```
Each new taxon must be linked to a taxon in the existing database's taxonomy.

##### Example
Suppose you want to add *Saccharomyces cerevisiae* to a GTDB database whose taxonomy lacks the Fungi kingdom and only includes one eukaryote (*Homo sapiens*). In this scenario, your new taxa list and accession2taxid should be as follows.
```
# New taxa list
## taxid parentTaxID rank name // Don't put this header in your actual file.
10000013 10000012 species Saccharomyces cerevisiae
10000012 10000011 genus Saccharomyces
10000011 10000010 family Saccharomycetaceae
10000010 10000009 order Saccharomycetales
10000009 10000008 class Saccharomycetes
10000008 10000007 phylum Ascomycota
10000007 10000000 kingdom Fungi // 10000000 is the Eukaryote taxID of the pre-built DB.

# accession2taxid
accession accession.version taxid gi
newseq1 newseq1 10000013 0
newseq2 newseq2 10000013 0
```
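
The requirement that every new taxon be linked to the existing taxonomy can be checked before running the update. The Python sketch below is one way to do that, under stated assumptions: the function name `unlinked_taxa` is hypothetical, the input is the four-column tab-separated list without a header line, and the set of existing taxIDs is supplied by you (for example, taken from the existing database's `nodes.dmp`).

```python
def unlinked_taxa(new_taxa_tsv, existing_taxids):
    """Return new taxIDs whose parent chain never reaches an existing taxon.

    Hypothetical helper for illustration only. Expects tab-separated rows
    of taxID, parentID, rank, name with no header line.
    """
    parent = {}
    for line in new_taxa_tsv.splitlines():
        if not line.strip():
            continue  # ignore blank lines
        taxid, parent_id, _rank, _name = line.split("\t")
        parent[taxid] = parent_id

    existing = set(existing_taxids)
    unlinked = []
    for taxid in parent:
        node, seen = taxid, set()
        # Walk up through the new taxa; `seen` guards against cycles.
        while node in parent and node not in seen:
            seen.add(node)
            node = parent[node]
        if node not in existing:
            unlinked.append(taxid)
    return unlinked


example = "10000007\t10000000\tkingdom\tFungi\n"
print(unlinked_taxa(example, {"10000000"}))  # prints []
```

An empty result means every listed taxon eventually connects to a taxID in the set you supplied; any returned taxID points to a row whose parent chain needs fixing.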

---

generate_readme.py

Lines changed: 7 additions & 0 deletions
@@ -0,0 +1,7 @@
parts = ["docs/general.md", "docs/classification.md", "docs/createdb.md", "docs/updatedb.md", "docs/demo.md", "docs/references.md"]

with open("README.md", "w") as outfile:
    for fname in parts:
        with open(fname) as infile:
            outfile.write(infile.read())
            outfile.write("\n\n")
