Commit 2ab59e4

v0.2.0
1 parent 215920a commit 2ab59e4

5 files changed: +32 -15 lines changed

CHANGELOG.md

Lines changed: 14 additions & 0 deletions
@@ -1,19 +1,29 @@
 # Changelog
+## [0.2.0] - 2024-06-28
+### Added
+- [***breaking***] Add decoy protein sequences (RefSeq fungi, protozoa, viral, plant, and human GRCh38/hg38) which effectively trap non-prokaryotic reads and prevent them from inflating total prokaryotic genome copy estimates if the pre-filtering module (default with `Kraken2`) is not enabled. Pre-filtering is no longer necessary even if samples are contaminated with human DNA or other common eukaryotes/viruses, unless the mean genome size of prokaryotes needs to be estimated. See [8918168](https://github.com/xinehc/melon-supplementary/commit/891816897bb3c82dcfff7ff44b45907593ba0eac) for more details. This function requires a database released on or after 2024-06-28.
+### Changed
+- Simplify filtering criteria for alignments.
+
+
 ## [0.1.6] - 2024-05-30
 ### Changed
 - Prevent `extract_sequence` from loading all marker-containing reads into memory.
 - Change `-F` to `--frameshift` and `max_iteration` to `max_iterations` for consistency.
 - Switch from figshare to zenodo for better database versioning.
 
+
 ## [0.1.5] - 2024-04-26
 ### Fixed
 - Fix a bug causing `tqdm` being disabled ([3bbd087](https://github.com/xinehc/melon/commit/3bbd087b8867e3167973a746af14f1fd797f9746)).
 
+
 ## [0.1.4] - 2024-04-26
 ### Changed
 - Use `tqdm` for logging.
 - Reduce peak memory usage by parsing PAF files on the fly.
 
+
 ## [0.1.3] - 2024-03-29
 ### Changed
 - Change alignment filtering criteria: make `AS` cutoff more stringent, drop `MS`. See [7cc6dbd](https://github.com/xinehc/melon/commit/7cc6dbd866027cf5c1adaa5c69ed7919d8630607) for details.
@@ -25,10 +35,12 @@
 - Output both gap-compressed and gap-uncompressed (BLAST-like) identity.
 - Refine output format.
 
+
 ## [0.1.2] - 2023-12-20
 ### Added
 - Add gap-compressed ANI to output.
 
+
 ## [0.1.1] - 2023-11-29
 ### Added
 - Add options to control EM early stop.
@@ -40,6 +52,7 @@
 ### Fixed
 - Fix a bug causing chimeric reads not being aggregated.
 
+
 ## [0.1.0] - 2023-10-08
 ### Added
 - Output a json file to indicate the lineage of processed reads.
@@ -54,6 +67,7 @@
 ### Fixed
 - Prevent numpy from using all logical cores.
 
+
 ## [0.0.1] - 2023-09-19
 ### Added
 - First release.

README.md

Lines changed: 14 additions & 11 deletions
@@ -10,13 +10,16 @@ conda activate melon
 ```
 
 ### Database setup
-Download either the [NCBI](https://zenodo.org/records/11100615) or the [GTDB](https://zenodo.org/records/11076519) database:
-```bash
-## GTDB
-# wget -qN --show-progress https://zenodo.org/records/11076519/files/database.tar.gz
+> [!NOTE]
+> We suggest using the GTDB database for complex metagenomes, as it features less ambiguous taxonomic labels and is more comprehensive.
 
+Download either the [NCBI](https://zenodo.org/records/12571302) or the [GTDB](https://zenodo.org/records/12571554) database:
+```bash
 ## NCBI
-wget -qN --show-progress https://zenodo.org/records/11100615/files/database.tar.gz
+wget -qN --show-progress https://zenodo.org/records/12571302/files/database.tar.gz
+
+## GTDB
+# wget -qN --show-progress https://zenodo.org/records/12571554/files/database.tar.gz
 tar -zxvf database.tar.gz
 ```
 
@@ -37,13 +40,13 @@ rm -rf database/*.fa
 ```
 
 ### Run Melon
-> [!NOTE]
-> Melon takes **quality-controlled** and **decontaminated** long reads as input. We suggest to remove low-quality raw reads before running Melon with e.g., `nanoq -q 10 -l 1000` (minimal quality score 10; minimal read length 1,000 bp). If your sample is known to have a large proportion of human DNAs or known eukaryotes/viruses, please consider removing them via proper mapping. If the origin of contamination is unknown, or if you want to estimate the mean genome size of prokaryotes, you may consider enabling the simple pre-filtering module. See [Run Melon with pre-filtering of non-prokaryotic reads](#run-melon-with-pre-filtering-of-non-prokaryotic-reads) for more details.
+> [!NOTE]
+> Melon takes **quality-controlled** long reads as input. We suggest removing low-quality raw reads before running Melon with e.g., `nanoq -q 10 -l 1000` (min. quality score 10; min. read length 1,000 bp). If your sample is known to have a large proportion of human DNAs or other eukaryotes/viruses and you want to estimate the **mean genome size** of prokaryotes, please consider removing them via proper mapping, or enabling the simple pre-filtering module. See [Run Melon with pre-filtering of non-prokaryotic reads](#run-melon-with-pre-filtering-of-non-prokaryotic-reads) for more details.
 
-We provide an example file comprising 10,000 quality-controlled (processed with `Porechop` and `nanoq`), prokaryotic reads (fungal and other reads removed with `minimap2`) randomly selected from the R10.3 mock sample of [Loman Lab Mock Community Experiments](https://lomanlab.github.io/mockcommunity/r10.html).
+We provide an example file comprising 10,000 quality-controlled (processed with `Porechop` and `nanoq`) prokaryotic reads (fungal and other reads removed with `minimap2`), randomly selected from the R10.3 mock sample of [Loman Lab Mock Community Experiments](https://lomanlab.github.io/mockcommunity/r10.html).
 
 ```bash
-wget -q --show-progress https://figshare.com/ndownloader/files/47279572/example.fa.gz
+wget -qN --show-progress https://zenodo.org/records/12571849/files/example.fa.gz
 melon example.fa.gz -d database -o .
 ```
 
@@ -70,7 +73,7 @@ The output file `*.tsv` contains the estimated genome copies for individual spec
 ... 1613|Limosilactobacillus fermentum 5.125 1.872146e-01 0.9654/0.9574
 ```
 
-The output file `*.json` contains the lineage and remark of each processed reads.
+The output file `*.json` contains the lineage and remark of each processed read.
 ```
 {
 "002617ff-697a-4cd5-8a97-1e136a792228": {
@@ -86,7 +89,7 @@ The output file `*.json` contains the lineage and remark of each processed reads
 ```
 
 ### Run Melon with pre-filtering of non-prokaryotic reads
-To enable the pre-filtering module, you need to download a database of Kraken that includes at least human and fungi (PlusPF, PlusPFP, or their capped versions). Using the PlusPF-8 (ver. 2023-06-05, capped at 8 GB) as an example:
+To enable the pre-filtering module, you need to download a database of Kraken2 that includes at least human and fungi (PlusPF, PlusPFP, or their capped versions). Using the PlusPF-8 (ver. 2023-06-05, capped at 8 GB) as an example:
 
 ```bash
 ## https://benlangmead.github.io/aws-indexes/k2

src/melon/__init__.py

Lines changed: 1 addition & 1 deletion
@@ -1,3 +1,3 @@
-__version__ = '0.1.6'
+__version__ = '0.2.0'
 
 from .melon import GenomeProfiler

src/melon/melon.py

Lines changed: 2 additions & 2 deletions
@@ -160,15 +160,15 @@ def parse_minimap(self):
 qcoords[hit[0]].add(tuple(hit[3:5]))
 
 alignments = []
-scores, max_scores = defaultdict(lambda: defaultdict(lambda: -np.inf)), defaultdict(lambda: -np.inf)
+scores, max_scores = defaultdict(dict), dict()
 with open(f'{self.outfile}.minimap.tmp') as f:
 for line in f:
 ls = line.rstrip().split('\t')
 qstart, qend, qseqid, sseqid = int(ls[2]), int(ls[3]), ls[0], ls[5]
 lineage = accession2lineage[sseqid.rsplit('_', 1)[0]]
 
 ## filter out non-overlapping alignments
-if (AS := int(ls[14].split('AS:i:')[-1])) > (AS_MAX := scores[qseqid].get(lineage, -np.inf)):
+if (AS := int(ls[14].split('AS:i:')[-1])) > scores[qseqid].get(lineage, -np.inf):
 if any(compute_overlap((qstart, qend, *qcoord)) > 0 for qcoord in qcoords[qseqid]):
 scores[qseqid][lineage] = AS
 max_scores[qseqid] = max(max_scores.get(qseqid, -np.inf), AS)
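
The hunk above keeps, for each read and lineage, the highest alignment score (`AS`) among alignments that overlap a known query interval, and also tracks the best score per read. Because both lookups already go through `.get(..., -np.inf)`, which never triggers a `defaultdict` factory, the lambda-based defaults were redundant and plain dicts behave the same. The sketch below illustrates that logic in isolation. It is not the project's code: `compute_overlap` here is a hypothetical stand-in (the real helper is not shown in this commit), the `hits` tuples replace parsed PAF lines, `math.inf` replaces `np.inf` to stay within the standard library, and the indentation is reconstructed for illustration.

```python
from collections import defaultdict
import math


def compute_overlap(coords):
    """Hypothetical stand-in: overlap length of two intervals (a_start, a_end, b_start, b_end)."""
    a_start, a_end, b_start, b_end = coords
    return max(0, min(a_end, b_end) - max(a_start, b_start))


def best_scores(hits, qcoords):
    """Keep the best AS per (read, lineage) and per read, using only overlapping alignments."""
    # Plain dicts suffice: every read below goes through .get() with an explicit default.
    scores, max_scores = defaultdict(dict), dict()
    for qseqid, lineage, qstart, qend, AS in hits:
        if AS > scores[qseqid].get(lineage, -math.inf):
            # Only accept alignments that overlap at least one known interval of the read.
            if any(compute_overlap((qstart, qend, *qcoord)) > 0 for qcoord in qcoords[qseqid]):
                scores[qseqid][lineage] = AS
                max_scores[qseqid] = max(max_scores.get(qseqid, -math.inf), AS)
    return scores, max_scores


# Toy input (assumed values): one read with a single known interval.
qcoords = {"read1": {(100, 500)}}
hits = [
    ("read1", "A", 120, 480, 300),
    ("read1", "A", 110, 490, 350),  # better score for lineage A
    ("read1", "B", 130, 470, 320),
    ("read1", "B", 600, 900, 999),  # high score, but no overlap with (100, 500)
]
scores, max_scores = best_scores(hits, qcoords)
print(dict(scores), max_scores)
```

In this toy input the last hit has the highest score but does not overlap the read's known interval, so it leaves both dictionaries unchanged.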

src/melon/utils.py

Lines changed: 1 addition & 1 deletion
@@ -2,7 +2,7 @@
 import re
 import sys
 import subprocess
-import logging
+import logging
 
 ## setup logging format
 if not sys.stderr.isatty():
