Merge branch 'develop' into hashtable

nromashchenko · nromashchenko · commit 52695f42e28d · 2025-11-20T11:20:55.000+01:00
diff --git a/.github/workflows/python-publish.yml b/.github/workflows/python-publish.yml
@@ -20,9 +20,9 @@ jobs:
       id-token: write
 
     steps:
-    - uses: actions/checkout@v4
+    - uses: actions/checkout@v5
     - name: Set up Python
-      uses: actions/setup-python@v5
+      uses: actions/setup-python@v6
       with:
         python-version: '3.x'
     - name: Install dependencies
@@ -32,4 +32,4 @@ jobs:
     - name: Build package
       run: python -m build
     - name: Publish package
-      uses: pypa/gh-action-pypi-publish@67339c736fd9354cd4f8cb0b744f2b82a74b5c70
+      uses: pypa/gh-action-pypi-publish@ed0c53931b1dc9bd32cbe73a98c7f6766f8a527e
diff --git a/.github/workflows/test_run.yml b/.github/workflows/test_run.yml
@@ -11,9 +11,9 @@ jobs:
         python-version: ["3.8", "3.9", "3.10", "3.11"]
 
     steps:
-      - uses: actions/checkout@v4
+      - uses: actions/checkout@v5
       - name: Set up Python ${{ matrix.python-version }}
-        uses: actions/setup-python@v5
+        uses: actions/setup-python@v6
         with:
           python-version: ${{ matrix.python-version }}
       - name: Install dependencies
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -0,0 +1,75 @@
+
+# Change log
+
+## Version 2.1.2
+
+### Fixed
+
+- Fixed a bug in the search function that produced invalid results in 2.1.1 (#57)
+- Fixed incompatibility with Python 3.13 (#53)
+- Fixed a crash when an empty FASTA file is provided (#58)
+
+
+### Changed
+
+- Updated GitHub Actions dependencies
+
+## Version 2.1.1
+
+- Performance improvements to the mkdb command with orthoxml input
+- Added a check for non-unique protein IDs in the input fasta files. Now it gives a more informative error message
+- fixed #49
+
+## Version 2.1.0
+- Significant improvements to classification speed 
+
+## Version 2.0.4
+- Fixes issue #34 (numpy2 incompatibility)
+- Experimental support to build omamer databases from orthoxml/fasta files
+- Updated github action to latest versions
+
+## Version 2.0.3
+- Fixes issue #30
+- Update github action to latest versions
+
+## Version 2.0.2
+- changed method for hiding taxa in build process. Now takes a file containing taxa to hide on separate lines.
+- checks and improved feedback for root taxon and requested taxa to hide.
+- root taxon set by default to the root level in speciestree.nwk (previously hard-coded to default to LUCA)
+
+## Version 2.0.1
+ - remove dependency for filehash library
+ - return better error message if build dependencies are not met when trying to build an omamer database
+ - minor fixes
+
+## Version 2.0.0
+ - Major update of database format and search code to improve overall memory usage. Most standard runs with LUCA-level database will run on a machine with 16GB RAM.
+ - Update to the scoring algorithm for root-level HOG / family assignments, to allow for significance testing. This estimates a binomial distribution for each family, so that we can compute the probability of matching at least as many k-mers as we have observed by chance, for each family that has a match to a given query.
+ - UX improvements - more feedback during interactive search runs, whilst maintaining small log files.
+
+## Version 0.2.5
+ - Fixes an issue when storing the pre-computed statistics
+
+## Version 0.2.4
+ - Improved loading time for standard search by pre-computing statistics
+ - Adding new command line option "info" to show the metadata of the 
+   dataset used to build the omamer database.
+   
+
+## Version 0.2.2
+ - Automated deployment to PyPI
+ - Removed PyHAM dependency
+
+## Version 0.2.0
+ - Added ``--min_fam_completeness``, ``--logic``, ``--score`` and ``--reference_taxon`` options
+ - New output format
+ - Debugging
+
+## Version 0.1.2 - 0.1.3
+ - Debugging
+
+## Version 0.1.0
+ - Added hidden_taxa and threshold arguments
+
+## Version 0.0.1
+ - Initial release
diff --git a/README.md b/README.md
@@ -128,58 +128,6 @@ Required arguments: ``--db``, ``--oma_path``
 | [``--log_level``](#markdown-header--log_level)|info|Logging level
 
 
-# Change log
-
-#### Version 2.0.4
-- fixes issue #34 (numpy2 incompatibility)
-- experimental support to build omamer databases from orthoxml/fasta files
-- update github action to latest versions
-
-#### Version 2.0.3
-- fixes issue #30
-- update github action to latest versions
-
-#### Version 2.0.2
-- changed method for hiding taxa in build process. Now takes a file containing taxa to hide on separate lines.
-- checks and improved feedback for root taxon and requested taxa to hide.
-- root taxon set by default to the root level in speciestree.nwk (previously hard-coded to default to LUCA)
-
-#### Version 2.0.1
- - remove dependency for filehash library
- - return better error message if build dependencies are not met, but trying to building an omamer database
- - minor fixes
-
-#### Version 2.0.0
- - Major update of database format and search code to improve overall memory useage. Most standard runs with LUCA-level database will run on a machine with 16GB RAM.
- - Update to the scoring algorithm for root-level HOG / family assignments, to allow for significance testing. This estimates a binomial distribution for each family, so that we can compute the probability of matching at least as many k-mers as we have observed by chance, for each family that has a match to a given query.
- - UX improvements - more feedback during interactive search runs, whilst maintaining small log files.
-
-#### Version 0.2.5
- - Fixes an issue when storing the pre-conputed statistics
-
-#### Version 0.2.4
- - Improved loading time for standard search by pre-computing statistics
- - Adding new command line option "info" to show the metadata of the 
-   dataset used to build the omamer database.
-   
-
-#### Version 0.2.2
- - Automated deployment to PyPI
- - Removed PyHAM dependency
-
-#### Version 0.2.0
- - Added ``--min_fam_completeness``, ``--logic``, ``--score`` and ``--reference_taxon`` options
- - New output format
- - Debugging
-
-#### Version 0.1.2 - 0.1.3
- - Debugging
-
-#### Version 0.1.0
- - Added hidden_taxa and threshold arguments
-
-#### Version 0.0.1
- - Initial release
 
 # License
 OMAmer is a free software: you can redistribute it and/or modify it under the terms of the GNU Lesser General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.
diff --git a/omamer/__init__.py b/omamer/__init__.py
@@ -24,7 +24,7 @@
 from datetime import date
 
 __packagename__ = "omamer"
-__version__ = "2.1.1"
+__version__ = "2.1.2"
 __copyright__ = "(C) 2019-{:d} Victor Rossier <victor.rossier@unil.ch> and Alex Warwick Vesztrocy <alex@warwickvesztrocy.co.uk> and Nikolai Romashchenko <nikolai.romashchenko@unil.ch>".format(
     date.today().year
 )
diff --git a/omamer/_runners.py b/omamer/_runners.py
@@ -21,6 +21,7 @@
     You should have received a copy of the GNU Lesser General Public License
     along with OMAmer. If not, see <http://www.gnu.org/licenses/>.
 """
+import os
 from ._utils import LOG, check_file_exists
 
 
@@ -143,7 +144,7 @@ def mkdb_oma(args):
 
 def search(args):
     from alive_progress import alive_bar
-    from ._utils import print_message, print_line
+    from ._utils import print_message
     import sys
 
     if args.out is None:
@@ -175,6 +176,7 @@ def search(args):
         bar.text(" [DONE]")
 
     print_run_data(args)
+    check_args(args)
 
     t0 = time()
 
@@ -244,7 +246,7 @@ def search(args):
                     # write the top header
                     print("!omamer-version: {}".format(__version__), file=args.out)
                     print(
-                        "!query-md5: {}".format(compute_file_md5(args.query.name)),
+                        "!query-md5: {}".format(compute_file_md5(args.query)),
                         file=args.out,
                     )
                     print(
@@ -372,7 +374,7 @@ def print_run_data(args):
     print_line(80)
     print_message("\nRunning OMAmer on {}, using:".format(platform.node()))
     print_message(" - database: {}".format(args.db))
-    print_message(" - query: {}".format(args.query.name))
+    print_message(" - query: {}".format(args.query))
     print_message(" - version: {}".format(__version__))
     print_message("")
     print_line(80)
@@ -410,3 +412,12 @@ def goodbye(args, time_taken, search_rate):
     )
     print_message("")
     print_line(80)
+
+
+def check_args(args):
+    # Enforce query existence check before loading DB
+    with open(args.query, "r") as _:
+        pass
+
+    if os.path.getsize(args.query) == 0:
+        raise RuntimeError(f"Input file {args.query} is empty")
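Review note: the new `check_args` fails fast on a missing or empty query file. The sketch below reproduces that pattern in isolation so its behaviour can be checked without an OMAmer database. The names `check_query_file` and `demo_check` are hypothetical and not part of OMAmer; the error message mirrors the one in the hunk above.

```python
import os
import tempfile


def check_query_file(path):
    """Raise early if `path` cannot be opened or is empty (hypothetical helper)."""
    # Opening the file surfaces FileNotFoundError or PermissionError right away.
    with open(path, "r"):
        pass
    # An existing but zero-byte file is rejected explicitly.
    if os.path.getsize(path) == 0:
        raise RuntimeError(f"Input file {path} is empty")


def demo_check():
    """Run the check on a missing, an empty and a non-empty file."""
    results = {}
    with tempfile.TemporaryDirectory() as tmp:
        empty = os.path.join(tmp, "empty.fa")
        open(empty, "w").close()
        filled = os.path.join(tmp, "ok.fa")
        with open(filled, "w") as fh:
            fh.write(">p1\nMKV\n")
        cases = (
            ("missing", os.path.join(tmp, "nope.fa")),
            ("empty", empty),
            ("ok", filled),
        )
        for label, path in cases:
            try:
                check_query_file(path)
                results[label] = "ok"
            except (OSError, RuntimeError) as exc:
                results[label] = type(exc).__name__
    return results
```

Because `--query` is now parsed as a plain string rather than with `FileType("r")`, argparse no longer opens the file itself, so a check of this kind is what reports a bad path before the search starts.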
diff --git a/omamer/database.py b/omamer/database.py
@@ -501,13 +501,13 @@ def _get_child_prots(hogs, hog2protoffs, child_prots_off):
             # TODO: check what else would break. this could be used if someone wanted to build a
             # database for flat OGs.
             LOG.warning("No nesting structure in HOGs defined in OrthoXML.")
-        else:
-            self.db.create_carray(
-                "/",
-                "ChildrenHOG",
-                obj=np.array(child_hogs, dtype=np.uint32),
-                filters=self._compr,
-            )
+            child_hogs = [0]  # adding sentinel in case no nested HOGs are defined.
+        self.db.create_carray(
+            "/",
+            "ChildrenHOG",
+            obj=np.array(child_hogs, dtype=np.uint32),
+            filters=self._compr,
+        )
         self.db.create_carray(
             "/",
             "ChildrenProt",
diff --git a/omamer/main.py b/omamer/main.py
@@ -193,7 +193,7 @@ def get_thread_count():
         "--query",
         required=True,
         help="Path to FASTA formatted sequences",
-        type=FileType("r"),
+        type=str,
     )
 
     search_parser.add_argument(
diff --git a/omamer/sequence_reader.py b/omamer/sequence_reader.py
@@ -23,21 +23,21 @@
 """
 from Bio import SeqIO
 
-
 class SequenceReader(object):
     @staticmethod
-    def read(fp, k, format="fasta", chunksize=None, sanitiser=None):
-        ids = []
-        seqs = []
-        for rec in filter(lambda x: (len(x.seq) >= k), SeqIO.parse(fp, format)):
-            ids.append(rec.id)
-            s = str(rec.seq).upper()
-            seqs.append(sanitiser(s) if sanitiser is not None else s)
-
-            if chunksize is not None and len(ids) == chunksize:
-                yield (ids, seqs)
-                ids = []
-                seqs = []
-
-        if len(ids) > 0:
-            yield (ids, seqs)
+    def read(filename, k, format="fasta", chunksize=None, sanitiser=None):
+        with open(filename, "r") as fp:
+            ids = []
+            seqs = []
+            for rec in filter(lambda x: (len(x.seq) >= k), SeqIO.parse(fp, format)):
+                ids.append(rec.id)
+                s = str(rec.seq).upper()
+                seqs.append(sanitiser(s) if sanitiser is not None else s)
+
+                if chunksize is not None and len(ids) == chunksize:
+                    yield ids, seqs
+                    ids = []
+                    seqs = []
+
+            if len(ids) > 0:
+                yield ids, seqs
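Review note: `SequenceReader.read` now takes a filename and opens the file inside the generator. The sketch below shows the same chunking logic under simplifying assumptions: `_parse_fasta` is a hypothetical stand-in for `SeqIO.parse` so the example runs without Biopython, the `sanitiser` argument is omitted, and `read_chunks` and `demo_chunks` are illustrative names only.

```python
import os
import tempfile


def _parse_fasta(fp):
    """Minimal FASTA parser (stand-in for SeqIO.parse); yields (id, sequence)."""
    rec_id, parts = None, []
    for line in fp:
        line = line.strip()
        if line.startswith(">"):
            if rec_id is not None:
                yield rec_id, "".join(parts)
            # The record id is the first whitespace-separated token of the header.
            rec_id, parts = (line[1:].split() or [""])[0], []
        elif line and rec_id is not None:
            parts.append(line)
    if rec_id is not None:
        yield rec_id, "".join(parts)


def read_chunks(filename, k, chunksize=None):
    """Yield (ids, seqs) batches, skipping sequences shorter than k."""
    # The file is opened lazily, on the first next() call, and closed when
    # the generator is exhausted or closed.
    with open(filename, "r") as fp:
        ids, seqs = [], []
        for rec_id, seq in _parse_fasta(fp):
            if len(seq) < k:
                continue
            ids.append(rec_id)
            seqs.append(seq.upper())
            if chunksize is not None and len(ids) == chunksize:
                yield ids, seqs
                ids, seqs = [], []
        # Flush the final, possibly partial, batch.
        if ids:
            yield ids, seqs


def demo_chunks():
    """Write a small FASTA file and read it back in batches of two."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "query.fa")
        with open(path, "w") as fh:
            fh.write(">p1 first\nmkvl\n>p2\nMK\n>p3\nMKV\nLA\n>p4\nAAAA\n")
        return list(read_chunks(path, k=3, chunksize=2))
```

One consequence of this design is that a missing file is only reported when the generator is first iterated, not when `read` is called, which is presumably why the commit also adds the up-front `check_args` step.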
diff --git a/setup.cfg b/setup.cfg
@@ -1,5 +1,5 @@
 [bumpversion]
-current_version = 2.1.0
+current_version = 2.1.2
 commit = True
 tag = False
 
