Commit 6c5a1aa

merge refine-classification to latest version of Metabuli
2 parents 75aecad + b247126 commit 6c5a1aa

1,396 files changed, +501 -603430 lines changed

.gitmodules

Lines changed: 3 additions & 0 deletions
@@ -2,3 +2,6 @@
 path = util/Metabuli-regression
 url = https://github.com/steineggerlab/Metabuli-regression.git
 branch = main
+[submodule "lib/mmseqs"]
+path = lib/mmseqs
+url = https://github.com/jaebeom-kim/MMseqs2.git

Dockerfile

Lines changed: 12 additions & 1 deletion
@@ -1,8 +1,13 @@
 ARG APP=metabuli
-FROM --platform=$BUILDPLATFORM debian:stable as builder
+
+########################################
+# Builder stage (multi-arch cross-compile)
+########################################
+FROM --platform=$BUILDPLATFORM debian:stable AS builder
 ARG TARGETARCH
 ARG APP
 
+# Install build tools (including cross-compile libs)
 RUN dpkg --add-architecture $TARGETARCH \
 && apt-get update \
 && apt-get install -y \
@@ -12,8 +17,14 @@ RUN dpkg --add-architecture $TARGETARCH \
 && rm -rf /var/lib/apt/lists/*
 
 WORKDIR /opt/build
+
+# Copy in your repo (including .git and .gitmodules)
 ADD . .
 
+# Ensure submodules are initialized
+RUN git submodule update --init --recursive
+
+# Build three variants
 RUN if [ "$TARGETARCH" = "arm64" ]; then \
 mkdir -p build_$TARGETARCH/src; \
 cd /opt/build/build_$TARGETARCH; \

README.md

Lines changed: 36 additions & 26 deletions
@@ -95,7 +95,7 @@ Metabuli also works on Linux ARM64 and Windows systems. Please check [https://mm
 
 ### Compile from source code
 ```
-git clone https://github.com/steineggerlab/Metabuli.git
+git clone --recurse-submodules https://github.com/steineggerlab/Metabuli.git
 cd Metabuli
 mkdir build && cd build
 cmake -DCMAKE_BUILD_TYPE=Release ..
@@ -178,34 +178,45 @@ This will generate three result files: `JobID_classifications.tsv`, `JobID_repor
 > Sankey diagram is available in the [GUI app](https://github.com/steineggerlab/Metabuli-App).
 
 #### JobID_classifications.tsv
-1. Classified or not
-2. Read ID
-3. Taxonomy identifier
-4. Effective read length
-5. DNA level identity score
-6. Classification Rank
-7. List of "taxID : k-mer match count"
+You can use `--lineage 1` option in `classify` module to print the full lineage next to `rank` column.
+1. `is_classified`: Classified or not
+2. `name`: Read ID
+3. `taxID`: Tax. ID in the tax. dump files used in database creation
+4. `query_length`: Effective read length
+5. `score`: DNA level identity score
+6. `rank`: Taxonomic rank of the taxon
+7. `taxID:match_count`: List of "taxID : k-mer match count"
 
 ```
+#is_classified name taxID query_length score rank taxID:match_count
 1 read_1 2688 294 0.627551 subspecies 2688:65
 1 read_2 2688 294 0.816327 subspecies 2688:78
 0 read_3 0 294 0 no rank
 ```
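Reviewer note: the column list added above can be read with a few lines of Python. This is a minimal sketch, not part of the commit. It assumes the file is tab-separated (the `no rank` value contains a space), that the `taxID:match_count` pairs in the last column are separated by whitespace, and that `--lineage 1` was not used, since that option adds a column next to `rank`. The names `Classification` and `parse_classifications` are hypothetical.

```python
import io
from typing import Dict, Iterator, NamedTuple, TextIO


class Classification(NamedTuple):
    is_classified: bool
    name: str
    tax_id: int
    query_length: int
    score: float
    rank: str
    match_counts: Dict[int, int]


def parse_classifications(handle: TextIO) -> Iterator[Classification]:
    """Yield one record per data line, skipping the '#' header and blank lines."""
    for raw in handle:
        line = raw.rstrip("\n")
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        # Unclassified reads may have no seventh column; pad so indexing is safe.
        fields += [""] * (7 - len(fields))
        counts: Dict[int, int] = {}
        # Assumption: "taxID:count" pairs are separated by whitespace.
        for pair in fields[6].split():
            tax_id, _, count = pair.partition(":")
            counts[int(tax_id)] = int(count)
        yield Classification(
            is_classified=fields[0] == "1",
            name=fields[1],
            tax_id=int(fields[2]),
            query_length=int(fields[3]),
            score=float(fields[4]),
            rank=fields[5],
            match_counts=counts,
        )


# The three example rows from the README, rebuilt with explicit tabs.
SAMPLE = (
    "#is_classified\tname\ttaxID\tquery_length\tscore\trank\ttaxID:match_count\n"
    "1\tread_1\t2688\t294\t0.627551\tsubspecies\t2688:65\n"
    "1\tread_2\t2688\t294\t0.816327\tsubspecies\t2688:78\n"
    "0\tread_3\t0\t294\t0\tno rank\t\n"
)

records = list(parse_classifications(io.StringIO(SAMPLE)))
classified = [r for r in records if r.is_classified]
print(f"{len(classified)} of {len(records)} reads classified")
```

For a real run, replace `io.StringIO(SAMPLE)` with an open handle to `JobID_classifications.tsv`.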
 
 #### JobID_report.tsv
-The proportion of reads that are assigned to each taxon.
+It follows Kraken2's report format. The first line is a header, and the rest of the lines are tab-separated values. The columns are as follows:
+
+1. `clade_proportion`: Percentage of reads classified to the clade rooted at this taxon
+2. `clade_count`: Number of reads classified to the clade rooted at this taxon
+3. `taxon_count`: Number of reads classified directly to this taxon
+4. `rank`: Taxonomic rank of the taxon
+5. `taxID`: Tax ID according to the taxonomy dump files used in the database creation
+6. `name`: Taxonomic name of the taxon
+
 ```
-33.73 77571 77571 0 no rank unclassified
-66.27 152429 132 1 no rank root
-64.05 147319 2021 8034 superkingdom d__Bacteria
-22.22 51102 3 22784 phylum p__Firmicutes
-22.07 50752 361 22785 class c__Bacilli
-17.12 39382 57 123658 order o__Bacillales
-15.81 36359 3 126766 family f__Bacillaceae
-15.79 36312 26613 126767 genus g__Bacillus
-2.47 5677 4115 170517 species s__Bacillus amyloliquefaciens
-0.38 883 883 170531 subspecies RS_GCF_001705195.1
-0.16 360 360 170523 subspecies RS_GCF_003868675.1
+#clade_proportion clade_count taxon_count rank taxID name
+33.73 77571 77571 no rank 0 unclassified
+66.27 152429 132 no rank 1 root
+64.05 147319 2021 superkingdom 8034 d__Bacteria
+22.22 51102 3 phylum 22784 p__Firmicutes
+22.07 50752 361 class 22785 c__Bacilli
+17.12 39382 57 order 123658 o__Bacillales
+15.81 36359 3 family 126766 f__Bacillaceae
+15.79 36312 26613 genus 126767 g__Bacillus
+2.47 5677 4115 species 170517 s__Bacillus amyloliquefaciens
+0.38 883 883 subspecies 170531 RS_GCF_001705195.1
+0.16 360 360 subspecies 170523 RS_GCF_003868675.1
 
 ```
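Reviewer note: the report columns described above can be loaded in the same way. The sketch below is illustrative only and uses the column order of the added lines (rank before taxID). The removed example rows had taxID before rank, so reports written by older versions would likely need those two indices swapped. It assumes tab-separated fields, as the README text states, and strips any leading indentation from the name column. `ReportRow` and `parse_report` are hypothetical names.

```python
import io
from typing import Iterator, NamedTuple, TextIO


class ReportRow(NamedTuple):
    clade_proportion: float
    clade_count: int
    taxon_count: int
    rank: str
    tax_id: int
    name: str


def parse_report(handle: TextIO) -> Iterator[ReportRow]:
    """Yield one row per data line, skipping the '#' header and blank lines."""
    for raw in handle:
        line = raw.rstrip("\n")
        if not line or line.startswith("#"):
            continue
        f = line.split("\t")
        yield ReportRow(
            clade_proportion=float(f[0]),
            clade_count=int(f[1]),
            taxon_count=int(f[2]),
            rank=f[3],
            tax_id=int(f[4]),
            name=f[5].strip(),  # drop any depth indentation in front of the name
        )


# First three example rows from the README, rebuilt with explicit tabs.
SAMPLE = (
    "#clade_proportion\tclade_count\ttaxon_count\trank\ttaxID\tname\n"
    "33.73\t77571\t77571\tno rank\t0\tunclassified\n"
    "66.27\t152429\t132\tno rank\t1\troot\n"
    "64.05\t147319\t2021\tsuperkingdom\t8034\td__Bacteria\n"
)

rows = list(parse_report(io.StringIO(SAMPLE)))
by_name = {row.name: row for row in rows}

# Every read is either unclassified or falls under root, so the two clade
# counts add up to the total number of reads in the run.
total_reads = by_name["unclassified"].clade_count + by_name["root"].clade_count
print(f"total reads: {total_reads}")
```

With the example values, the recomputed percentage `100 * 77571 / total_reads` rounds to the 33.73 shown in the first column, which is a quick consistency check on a parsed report.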
211222

@@ -231,20 +242,19 @@ metabuli classifiedRefiner <i:read-by-read classification> <i:DBDIR> [options]
231242
* Options
232243
--threads : The number of threads to utilize (all by default)
233244
--remove-unclassified : Remove unclassified reads
234-
--exclude-taxid : Remove list of taxids as well as its children
235-
--select-taxid : Select list of taxids as well as its children
236-
--select-columns : Select list of columns with number and handle full lineage as 7 (generated if absent)
245+
--exclude-taxid : Remove list of taxids as well as its children (e.g., 1758,9685,1234)
246+
--select-taxid : Select list of taxids as well as its children (e.g., 1758,9685,1234)
247+
--select-columns : Select list of columns with number and handle full lineage as 7 (generated if absent) (e.g., 2,5,7,3)
237248
--report : Write report of refined classification file
238249
--rank : Adjust classification to the specified rank
239-
--rank-file-type : Choose how to handle reads assigned to higher taxonomic ranks when using the --rank option. [0: without higher rank, 1: with higher rank, 2: separate file for higher rank classification]
250+
--rank-file-type : Choose how to handle reads assigned to higher taxonomic ranks when using the --rank option. [0: exclude higher rank, 1: include higher rank, 2: make separate file for higher rank classification]
240251
241252
```
242253
#### Output
243-
- `JobID_refined.tsv`
254+
- refined classification file : `JobID_refined.tsv`
244255
- report : `JobID_refined_report.tsv`, `JobID_refined_krona.html`
245256
- higher rank classification file : `_refined_higherRanks.tsv`
246257

247-
248258
---
249259
## Extract
250260
After running the `classify` command, you can extract reads that are classified under a specific taxon.

azure-pipelines.yml

Lines changed: 2 additions & 0 deletions
@@ -89,6 +89,8 @@ jobs:
 ARCH: arm64
 CPREF: aarch64
 steps:
+- checkout: self
+submodules: true
 - script: |
 sudo dpkg --add-architecture $ARCH
 cat << HEREDOC | sudo tee /etc/apt/sources.list

lib/mmseqs

Submodule mmseqs added at 995f376

lib/mmseqs/.cirrus.yml

Lines changed: 0 additions & 45 deletions
This file was deleted.

lib/mmseqs/.dockerignore

Lines changed: 0 additions & 5 deletions
This file was deleted.

lib/mmseqs/.gitattributes

Lines changed: 0 additions & 3 deletions
This file was deleted.

lib/mmseqs/.github/ISSUE_TEMPLATE.md

Lines changed: 0 additions & 20 deletions
This file was deleted.
-90.2 KB
Binary file not shown.
