Merge pull request #17 from jaebeom-kim/main

snjlee58 · web-flow · commit 2292ceea0625 · 2025-06-04T13:55:57.000+09:00
Improve Readme
diff --git a/.github/workflows/generate-readme.yml b/.github/workflows/generate-readme.yml
@@ -0,0 +1,32 @@
+name: Auto-generate README
+
+on:
+  push:
+    paths:
+      - 'docs/**'
+      - 'generate_readme.py'
+  workflow_dispatch:
+
+jobs:
+  build-readme:
+    runs-on: ubuntu-latest
+    steps:
+      - name: Checkout repo
+        uses: actions/checkout@v3
+
+      - name: Set up Python
+        uses: actions/setup-python@v4
+        with:
+          python-version: '3.10'
+
+      - name: Generate README
+        run: python generate_readme.py
+
+      - name: Commit and push if README changed
+        run: |
+          git config --global user.name "GitHub Actions"
+          git config --global user.email "actions@github.com"
+          git add README.md
+          git diff --cached --quiet || git commit -m "Auto-update README"
+          git push
+
diff --git a/README.md b/README.md
diff --git a/docs/classification.md b/docs/classification.md
@@ -0,0 +1,50 @@
+# Classification
+Metabuli App provides two taxonomic profiling modes in the **Search Settings** panel: **New Search** and **Upload Report**.
+<img alt="SearchPage_Demo_Image" src="https://github.com/user-attachments/assets/9ab5a86c-5603-4dc7-be3b-baf2ed490ef0" style="max-height: 600px; width: auto;">
+
+## New Classification
+### Required Fields:
+1. **Mode:** Select the analysis mode among single-end, paired-end, or long-read.
+2. **Job ID:** Enter a unique identifier for the job.
+3. **Select Files:** Upload the necessary files and directories.
+    - Read 1 File (and Read 2 File if Paired-end is selected)
+    - Database Directory
+    - Output Directory
+4. **Max RAM:** Specify the maximum RAM (in GiB) to allocate for the job.
+
+### Advanced Settings (Optional): 
+- **Threads:** Specify thread count for the job.
+- **Min Score:** Set the minimum score required to make a classification. Raising it reduces false positives at the cost of sensitivity.
+    - Recommended values (for details, see Supp. Fig. 4-7 in the [Metabuli paper](https://www.nature.com/articles/s41592-024-02273-y)):
+        - Illumina short reads: 0.15 
+        - PacBio HiFi reads: 0.07
+        - PacBio Sequel II reads: 0.005
+        - Nanopore long reads: 0.008
+- **Min SP Score:** Set the minimum score required for a species- or lower-level classification. It helps avoid overconfident classifications.
+    - Recommended values (for details, see Supp. Fig. 4-7 in the [Metabuli paper](https://www.nature.com/articles/s41592-024-02273-y)):
+        - Illumina short reads: 0.5 
+        - PacBio HiFi reads: 0.3
+- **Taxonomy Path:** Use it when your database does not have a `taxonomy` directory or a `taxonomyDB` file. Provide a directory containing the `names.dmp`, `nodes.dmp`, and `merged.dmp` files.
+- **Accession Level:** Classify reads to accessions if available.
+
+### Start Analysis: 
+- Click the `Run Metabuli` button to start the metagenomic classification process.
+- You can track the progress and see real-time backend output in the logs.
+
+### View Results: 
+   - Once the analysis is complete, you can view the results in three different forms:
+     - **Table**: View the raw classification data in a table format.
+     - **Sankey Diagram**: A flow diagram representing the lineage information of the displayed taxa.
+     - **Krona Chart**: A hierarchical interactive chart that visualizes classification results.
+
+## Upload Report
+
+To visualize results from a previously completed job:
+
+1. Navigate to the **Upload Report** tab.
+2. Upload the `report.tsv` file from a prior job.
+3. View the uploaded results directly in the **Results** tab. For this job type, results are provided in two forms (the Krona chart is not available):
+   - **Table**: The raw data in table format.
+   - **Sankey Diagram**: A flow diagram representing the lineage paths for the displayed taxa.
+   
+---
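Note on docs/classification.md: the recommended thresholds listed under Advanced Settings can be kept as a small lookup table when preparing jobs for several read types. The sketch below is illustrative only. The dictionary keys, the `Thresholds` type, and the `recommended_thresholds` helper are hypothetical names, not part of Metabuli or the app; the numbers are copied from the list above, and `None` marks read types for which the document gives no Min SP Score.

```python
from typing import NamedTuple, Optional


class Thresholds(NamedTuple):
    min_score: float
    min_sp_score: Optional[float]  # None: no value listed in the docs


# Values copied from the "Advanced Settings" list in docs/classification.md.
# The keys are hypothetical labels chosen for this sketch.
RECOMMENDED = {
    "illumina": Thresholds(0.15, 0.5),
    "pacbio-hifi": Thresholds(0.07, 0.3),
    "pacbio-sequel2": Thresholds(0.005, None),
    "nanopore": Thresholds(0.008, None),
}


def recommended_thresholds(read_type: str) -> Thresholds:
    """Return the documented starting values for a read-type label."""
    key = read_type.lower()
    if key not in RECOMMENDED:
        raise ValueError(
            f"unknown read type {read_type!r}; expected one of {sorted(RECOMMENDED)}"
        )
    return RECOMMENDED[key]
```

For example, `recommended_thresholds("Illumina")` returns the pair to enter in the **Min Score** and **Min SP Score** fields for Illumina short reads.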
diff --git a/docs/createdb.md b/docs/createdb.md
@@ -0,0 +1,51 @@
+## Create New Database
+You can create a new database in the "NEW DATABASE" tab by providing these three inputs:
+1. **FASTA files**: Each sequence must have a unique `>accession.version` or `>accession` header (e.g., `>CP001849.1` or `>CP001849`).
+2. **NCBI-style taxonomy dump**: `names.dmp`, `nodes.dmp`, and `merged.dmp`. Sequences whose taxonomy IDs are absent from these files are skipped.
+3. **NCBI-style accession2taxid**: Sequences whose accessions are absent from this file are skipped. Version numbers are ignored.
+
+
+### How to prepare NCBI-style taxonomy dump files and accession2taxid
+#### NCBI-style taxonomy dump
+- `NCBI Taxonomy`: Download [here](https://ftp.ncbi.nlm.nih.gov/pub/taxonomy/new_taxdump/).
+- `GTDB`: Download taxonkit-GTDB files [here](https://github.com/shenwei356/gtdb-taxdump/releases).
+- `ICTV`: Download taxonkit-ICTV files [here](https://github.com/shenwei356/ictv-taxdump/releases).
+- `Custom taxonomy`: Generate your own `names.dmp`, `nodes.dmp`, and `merged.dmp`.
+
+
+#### NCBI-style accession2taxid
+- `NCBI Taxonomy`: Download [here](https://ftp.ncbi.nlm.nih.gov/pub/taxonomy/accession2taxid/). Check the [README](https://ftp.ncbi.nlm.nih.gov/pub/taxonomy/accession2taxid/README) to find out which file to use.
+- `GTDB`: It is generated automatically from the `taxid.map` file in the taxonkit-GTDB directory.
+- `ICTV`: Use `prepare-accession2taxid.sh` from [this repository](https://github.com/jaebeom-kim/Metabuli-ICTV-challenge).
+- `Custom accession2taxid`: Generate your own `accession2taxid` file.
+
+#### Edit files to include custom sequences
+  * Taxonomy dump files:
+    * Edit `nodes.dmp` and `names.dmp` to introduce a new `taxid` in `accession2taxid`.
+  * accession2taxid file:
+    * For a sequence whose header is `>custom`, add `custom[tab]custom[tab]taxid[tab]anynumber`.
+    * As above, version number is not necessary.
+    * `taxid` must be included in the `nodes.dmp` and `names.dmp`.
+    * Put any number for the last column. It is not used in Metabuli.
+
+
+### Required fields
+1. **GTDB-based checkbox:** Check if you use taxonkit-generated GTDB taxonomy `dmp` files. 
+2. **Database Directory:** The directory where the database will be generated.
+3. **FASTA List:** A file containing absolute paths to FASTA files.
+4. **Accession2TaxId:** A path to NCBI-style accession2taxid following the format below. </br>
+    ```
+    accession   accession.version   taxid   gi
+    ACCESSION   ACCESSION.1         12345   6789
+    ```
+5. **Taxonomy Path:** Directory of taxonomy dump files (`names.dmp`, `nodes.dmp`, and `merged.dmp` are required).
+
+### Optional fields
+- **Max RAM**: Specify the maximum RAM (in GiB) to allocate for the job.
+- **Threads**: Specify the number of threads to use for the job.
+- **Accession Level**: Create a database for accession-level classification. </br>
+  (WARNING: It is not tested for large databases. Using it with more than 100K sequences may cause issues.)
+- **Make Library**: Make a library of species genomes. It accelerates the process when some large FASTA files include many species genomes.
+- **CDS Info**: A file containing absolute paths to CDS files. For the included accessions, Prodigal's gene prediction is skipped. Only GenBank/RefSeq CDS files are supported.
+
+---
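Note on docs/createdb.md: the **FASTA List** and **Accession2TaxId** inputs are plain text files, so they can be prepared with a short script. A minimal sketch follows, assuming the accession is the first whitespace-delimited token of each FASTA header and the version is whatever follows the first dot. The function names and the single `taxid` argument (one taxon for the whole file) are hypothetical and not part of Metabuli.

```python
from pathlib import Path


def write_fasta_list(fasta_files, list_path):
    """Write one absolute path per line, as the FASTA List field expects."""
    lines = [str(Path(f).resolve()) for f in fasta_files]
    Path(list_path).write_text("\n".join(lines) + "\n", encoding="utf-8")
    return lines


def accession2taxid_rows(fasta_path, taxid):
    """Yield (accession, accession.version, taxid, gi) for each FASTA header."""
    with open(fasta_path, encoding="utf-8") as handle:
        for line in handle:
            if not line.startswith(">"):
                continue
            tokens = line[1:].split()
            if not tokens:
                continue  # skip a bare '>' line
            header = tokens[0]
            accession = header.split(".", 1)[0]  # assumption: version follows the first dot
            # The last column is not used by Metabuli, so any number works; 0 is used here.
            yield accession, header, taxid, 0


def write_accession2taxid(rows, out_path):
    """Write tab-separated rows under the NCBI-style header line."""
    with open(out_path, "w", encoding="utf-8") as out:
        out.write("accession\taccession.version\ttaxid\tgi\n")
        for row in rows:
            out.write("\t".join(str(field) for field in row) + "\n")
```

A header such as `>custom` yields the row `custom`, `custom`, `taxid`, `0`, which matches the format described under "Edit files to include custom sequences".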
diff --git a/docs/demo.md b/docs/demo.md
@@ -0,0 +1,13 @@
+## Demos
+
+### New Search Job Demo
+Watch this demo to see how to run a new search on Metabuli:
+
+https://github.com/user-attachments/assets/6c8b848b-77b8-49b1-b01e-069b872ea740
+
+### Viewing Results
+Watch this demo to see how to view the results from a completed search:
+
+https://github.com/user-attachments/assets/8cda1132-201f-4f8d-9f09-95ccafb9e685
+
+---
diff --git a/docs/general.md b/docs/general.md
@@ -0,0 +1,73 @@
+[![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.14603649.svg)](https://doi.org/10.5281/zenodo.14603649)
+![Platform](https://img.shields.io/badge/platform-Mac%20%7C%20Windows%20%7C%20Linux-brightgreen)
+
+# Metabuli App 
+
+This is the desktop application for Metabuli, a metagenomic classifier that jointly analyzes both DNA and amino acid sequences. Built with Vue.js and Electron, it provides an intuitive interface for running metagenomic classification jobs and visualizing the results.
+
+For more details about Metabuli, please see
+[GitHub](https://github.com/steineggerlab/Metabuli),
+[Nature Methods](https://www.nature.com/articles/s41592-024-02273-y), 
+[PDF](https://www.nature.com/articles/s41592-024-02273-y.epdf?sharing_token=je_2D5Su0-xVOSjuKSAXF9RgN0jAjWel9jnR3ZoTv0M7gE7NDF_xi_3sW8QdRiwfSJNwqaXItSoeCvr7cvcoQxKLt0oROgWc6urmki9tP80cXEuHPN0D7b4y9y3i8Yv7sZw8MxxhAj7W6p9eZE2zaK3eozdOkXvwADVfso9cXIM%3D), 
+[bioRxiv](https://www.biorxiv.org/content/10.1101/2023.05.31.543018v2), or [ISMB 2023 talk](https://www.youtube.com/watch?v=vz2fuRcVwyk).
+<p align="center"><img src="https://raw.githubusercontent.com/steineggerlab/Metabuli/master/.github/marv_metabuli_small.png" height="350" /></p>
+
+## NOTE
+The `NCBI` and `Assemblies` buttons in the Sankey subtree view do not work when searching against a GTDB-based database. 
+We will make a button for GTDB soon.
+
+## Platforms Supported
+
+- macOS (Universal `.dmg`)
+- Windows (`.exe`)
+- Linux (AppImage `.AppImage`)
+
+## Functionality
+- **Download** pre-built **databases**
+- **Create or update databases** directly in the app
+- Run **taxonomic classification**
+- **Upload and browse** classification results
+- **Extract reads** classified under a specific taxon
+- Explore results with interactive **Sankey** and **Krona** plots
+
+
+---
+
+## Changelog
+
+### v1.0.1
+- Introduced the `Custom Database` page, enabling users to:
+    - Create new databases.
+    - Add new sequences to existing databases.
+- Enhanced `Sankey visualization`:
+    - Implemented Sankey plot verification for ensuring accuracy of visualized results.
+    - Resolved a bug in lineage extraction from raw data that previously caused inaccuracies in lineage representation on the Sankey plot.
+
+### v1.0.0
+- Initial release of Metabuli App.
+
+---
+
+## Table of Contents
+
+- [Installation](#installation)
+- [Usage](#usage)
+    - [Classification](#classification)
+        - [New Classification](#new-classification)
+        - [Upload Report](#upload-report)
+    - [Create New Database](#create-new-database)
+    - [Update Existing Database](#update-existing-database)
+- [Demos](#demos)
+- [References](#references)
+
+## Installation
+
+- Visit the [GitHub Releases](https://github.com/steineggerlab/Metabuli-App/releases) page for the latest builds.
+- The application is pre-built for **macOS**, **Windows**, and **Linux**.
+- Simply download the executable for your platform from the Releases section.
+
+> **Note:** If you encounter a security warning when opening the app, follow the instructions below to bypass the warning:
+>- **macOS**: Refer to [this guide](https://support.apple.com/en-gb/guide/mac-help/mh40616/15.0/mac/15.0) on how to open apps from unidentified developers.
+>- **Windows**: Click 'More info' and then 'Run anyway' to continue.
+
+---
diff --git a/docs/references.md b/docs/references.md
@@ -0,0 +1,13 @@
+## References
+The development of the Metabuli Desktop Application has been inspired by and leverages the following tools:
+
+- **Pavian**: Elements of the table layouts and visualizations in Metabuli were inspired by Pavian for metagenomic data analysis. ([Pavian GitHub Repository](https://github.com/fbreitwieser/pavian)).
+  
+- **Krona**: The Krona tool is embedded in the results page for hierarchical data visualization. ([Krona GitHub Repository](https://github.com/marbl/Krona/wiki)).
+
+- **fastp**: Ultrafast one-pass FASTQ data preprocessing, quality control, and deduplication software. ([fastp GitHub Repository](https://github.com/OpenGene/fastp))
+
+- **fastplong**: `fastp` for long reads. ([fastplong GitHub Repository](https://github.com/OpenGene/fastplong))
+
+
+We would like to acknowledge the authors of these tools for their excellent work, which has significantly contributed to the development of Metabuli App.
diff --git a/docs/updatedb.md b/docs/updatedb.md
@@ -0,0 +1,90 @@
+## Update Existing Database
+You can add new sequences to an existing database in the "UPDATE DATABASE" tab by providing these inputs:
+1. **FASTA files**: Each sequence must have a unique `>accession.version` or `>accession` header (e.g., `>CP001849.1` or `>CP001849`).
+2. **NCBI-style accession2taxid** : Sequences with accessions not listed here will be skipped. Version numbers are ignored. </br>
+(It is auto generated when the GTDB-based checkbox is checked.)
+3. **Old database directory**: The directory containing the existing database to be updated.
+
+
+### Required fields
+1. **GTDB-based checkbox:** Check this if you are adding genomes using the taxonkit-GTDB taxonomy.
+2. **Old Database Directory:** The directory containing the existing database to be updated.
+3. **New Database Directory:** The directory where the updated database will be generated.
+4. **FASTA List:** A file containing absolute paths to FASTA files.
+5. **Taxonomy Info**
+    - If GTDB-based is checked: provide the taxonkit-GTDB taxonomy directory.
+    - If not checked: provide an NCBI-style accession2taxid file.
+
+### Optional fields
+- **Max RAM**: Specify the maximum amount of RAM (in GiB) to allocate.
+- **Threads**: Specify the number of threads to use.
+- **Accession Level**: Create a database for accession-level classification. </br>
+  *(WARNING: This option is not tested for large databases. Using it with more than 100,000 sequences may cause issues.)*
+- **Make Library**: Create a library of species genomes. This accelerates processing when large FASTA files contain genomes of multiple species.
+- **CDS Info**:  A file containing absolute paths to CDS files. For the listed accessions, Prodigal’s gene prediction will be skipped. Only GenBank/RefSeq-format CDS files are supported. 
+- **New Taxa**:  Used when adding sequences from taxa not included in the existing database. See the section below for details.
+
+### Adding sequences of new taxa
+> [!WARNING]
+> Mixing taxonomies within the same domain is not recommended. For example, adding prokaryotes to a GTDB-based database using NCBI taxonomy will cause issues, but adding eukaryotes or viruses to a GTDB-based database using NCBI taxonomy is fine since GTDB does not cover them.
+
+1\. **Check taxonomy dump files** to see if you really need to add new taxa. The `taxdump` command retrieves the taxonomy dump files of an existing database.
+
+2-1\. **Create a new taxa list** 
+  
+If you have both **accession2taxid** and **taxonomy dump** files for the new sequences, you can use the `CREATE NEW TAXA` button next to the `New Taxa` option.
+This generates two files:
+- `newtaxa.tsv` for the `New Taxa` option
+- `newtaxa.accession2taxid` for the `Accession 2 Tax Id` field
+
+<!-- ```
+metabuli createnewtaxalist <OLD DBDIR> <FASTA_LIST> <new taxonomy dump> <accession2taxid> <OUTDIR>
+``` -->
+
+##### Example
+Suppose you're adding eukaryotic sequences to a GTDB-based database. Since GTDB doesn't include eukaryotes, you may want to use NCBI taxonomy for eukaryotes.
+You can download `taxdump` files from [here](https://ftp.ncbi.nlm.nih.gov/pub/taxonomy/new_taxdump/) and `accession2taxid` from [here](https://ftp.ncbi.nlm.nih.gov/pub/taxonomy/accession2taxid/).
+- How to use `CREATE NEW TAXA`:
+    - `Old Database Directory`: Your existing GTDB database directory.
+    - `FASTA List`: A file containing absolute paths to FASTA files to be added.
+    - `New Taxonomy Path`: The directory of NCBI Taxonomy dump files.
+    - `Accession 2 Tax Id`: NCBI-style accession2taxid file.
+    - `Output Directory`: The directory where `newtaxa.tsv` and `newtaxa.accession2taxid` will be generated.
+- How to run `UPDATE DATABASE`:
+    - `GTDB-Based checkbox`: Do **not** check it, since you are not using the GTDB tree for the new sequences.
+    - `Old Database Directory`: Your existing GTDB database directory.
+    - `New Database Directory`: The directory for the updated database to be created.
+    - `FASTA List`: The same one as above.
+    - `Accession 2 Tax Id`: `newtaxa.accession2taxid` generated by `CREATE NEW TAXA`.
+    - `New Taxa` option: `newtaxa.tsv` generated by `CREATE NEW TAXA`.
+
+</br>
+
+2-2\. **Manually prepare a new taxa list**
+
+For the `New Taxa` option, provide a four-column TSV file in the following format.
+```
+taxID parentID rank name
+```
+The new taxon must be linked to a taxon in the existing database's taxonomy.
+
+##### Example
+Suppose you want to add *Saccharomyces cerevisiae* to a GTDB database whose taxonomy lacks the Fungi kingdom and only includes one eukaryote (*Homo sapiens*). In this scenario, your new taxa list and accession2taxid should be as follows.
+```
+# New taxa list
+## taxid  parentTaxID rank  name // Don't put this header in your actual file.
+10000013	10000012	species	Saccharomyces cerevisiae
+10000012	10000011	genus	Saccharomyces
+10000011	10000010	family	Saccharomycetaceae
+10000010	10000009	order	Saccharomycetales
+10000009	10000008	class	Saccharomycetes
+10000008	10000007	phylum	Ascomycota
+10000007	10000000	kingdom	Fungi // 10000000 is the Eukaryote taxID of the pre-built DB.
+
+# accession2taxid
+accession accession.version taxid gi
+newseq1 newseq1 10000013  0
+newseq2 newseq2 10000013  0
+```
+
+---
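Note on docs/updatedb.md: the manual route (step 2-2) requires every new taxon to be linked, directly or through other new taxa, to a taxon that already exists in the database. The sketch below checks that property for a four-column, tab-separated list before it is passed to the `New Taxa` option. It is an illustration under stated assumptions: `parse_new_taxa` and `unlinked_taxa` are hypothetical helpers, not Metabuli commands, and the set of existing taxIDs must be supplied by the caller (for example, read from the database's `nodes.dmp`).

```python
def parse_new_taxa(text):
    """Parse lines of taxID, parentID, rank, name separated by tabs."""
    taxa = {}
    for lineno, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # ignore blank lines
        fields = line.split("\t")
        if len(fields) != 4:
            raise ValueError(
                f"line {lineno}: expected 4 tab-separated columns, got {len(fields)}"
            )
        taxid, parent, rank, name = fields
        taxa[int(taxid)] = (int(parent), rank, name.strip())
    return taxa


def unlinked_taxa(taxa, existing_taxids):
    """Return new taxIDs whose parent chain never reaches the existing taxonomy."""
    broken = []
    for taxid in taxa:
        seen = set()
        current = taxid
        # Walk up through new taxa until leaving the list or detecting a cycle.
        while current in taxa and current not in seen:
            seen.add(current)
            current = taxa[current][0]
        # A cycle ends on a new taxon; a dangling chain ends on an unknown ID.
        if current in taxa or current not in existing_taxids:
            broken.append(taxid)
    return broken
```

With the *Saccharomyces cerevisiae* example above and `existing_taxids = {10000000}`, `unlinked_taxa` returns an empty list, because every chain ends at the Eukaryote node of the pre-built database.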
diff --git a/generate_readme.py b/generate_readme.py
@@ -0,0 +1,7 @@
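+parts = ["docs/general.md", "docs/classification.md", "docs/createdb.md", "docs/updatedb.md", "docs/demo.md", "docs/references.md"]
+# Concatenate the fragments in order; read and write as UTF-8 because the docs contain non-ASCII characters.
+with open("README.md", "w", encoding="utf-8") as outfile:
+    for fname in parts:
+        with open(fname, encoding="utf-8") as infile:
+            outfile.write(infile.read())
+            outfile.write("\n\n")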
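A possible follow-up to generate_readme.py: since README.md is now generated, a check that it still matches the fragments can catch manual edits made directly to the README. The sketch below mirrors the concatenation logic of the script. `render` and `readme_in_sync` are hypothetical names that are not part of this commit, and reading the files as UTF-8 is an assumption.

```python
from pathlib import Path


def render(parts, root="."):
    """Join the fragments in order, each followed by a blank line."""
    return "".join(
        Path(root, name).read_text(encoding="utf-8") + "\n\n" for name in parts
    )


def readme_in_sync(parts, readme="README.md", root="."):
    """Return True when the README equals the concatenated fragments."""
    current = Path(root, readme).read_text(encoding="utf-8")
    return current == render(parts, root)
```

Called with the same `parts` list as the script, `readme_in_sync(parts)` could back a test or an extra workflow step that fails when the two drift apart.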