Klekota-Roth prefix optimization by Kacper-Kozubowski · Pull Request #470 · MLCIL/scikit-fingerprints

Kacper-Kozubowski · 2025-07-16T13:05:17Z

Changes

Completely rewrote the Klekota-Roth fingerprint computation for better performance.
Replaced the naive O(n × m) loop over molecules and patterns with a tree-based traversal.
Generated an efficient tree structure used by the algorithm and stored it in tree.json.
Added atom-count-based filtering for faster pattern rejection.
Removed the hardcoded list of 4860 patterns from the module; patterns are now loaded from the tree structure.
Added a test verifying result correctness against the naive approach.
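
The tree-based traversal listed above can be illustrated with a toy sketch. It is not the PR's implementation: plain substring containment stands in for RDKit substructure matching, and the `Node` class, its fields, and the example patterns are hypothetical. It only shows the pruning idea: when a parent pattern does not match, every pattern that extends it can be skipped.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Node:
    # Hypothetical node layout; the real tree.json schema may differ.
    pattern: str
    feature_bit: int | None = None  # set only for patterns that map to a bit
    children: list[Node] = field(default_factory=list)


def matches(text: str, pattern: str) -> bool:
    # Stand-in for a substructure match: substring containment.
    return pattern in text


def traverse(root: Node, text: str) -> tuple[set[int], int]:
    """Return matched feature bits and the number of match checks performed."""
    bits: set[int] = set()
    checks = 0
    stack = list(root.children)
    while stack:
        node = stack.pop()
        checks += 1
        if not matches(text, node.pattern):
            continue  # prune: no child can match if its parent does not

        if node.feature_bit is not None:
            bits.add(node.feature_bit)
        stack.extend(node.children)
    return bits, checks


# Every child pattern contains its parent pattern, mirroring the prefix relation.
root = Node("", children=[
    Node("CC", feature_bit=0, children=[
        Node("CCO", feature_bit=1),
        Node("CCN", feature_bit=2, children=[Node("CCNC", feature_bit=3)]),
    ]),
    Node("S", feature_bit=4, children=[
        Node("SC", feature_bit=5),
        Node("SO", feature_bit=6),
    ]),
])

bits, checks = traverse(root, "CCOC")
```

For the input "CCOC", only 4 of the 7 patterns are checked: the children of "S" and "CCN" are never visited because their parents fail to match.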

Other notes

tree.json is generated offline with a separate script (not included in this PR).
Achieved a 6–9× performance boost for single-core execution and roughly a 5–7× boost for multicore, depending on dataset.

Checklist before requesting a review

Docstrings added/updated in public functions and classes
Tests added, reasonable test coverage (at least ~90%, make test-coverage)
Sphinx docs added/updated and render properly (make docs and see docs/_build/index.html)

j-adamczyk · 2025-07-16T13:09:27Z

skfp/fingerprints/klekota_roth.py

+    def __init__(self):
+        self.smarts: str | None = None
+        self.pattern_mol: Mol | None = None
+        self.is_terminal: bool = False
+        self.feature_bit: int | None = None
+        self.atom_requirements: defaultdict[str, int] = defaultdict(int)
+        self.children: list[_PatternNode] = []


Use @dataclass(frozen=True)
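
One possible shape for the node as a dataclass, as suggested above. This is a sketch, not the merged code: the `pattern_mol` field is omitted so the example runs without RDKit, and storing `children` as a tuple is an assumption made here, since `frozen=True` blocks attribute reassignment but does not freeze a list held in a field.

```python
from __future__ import annotations

from dataclasses import FrozenInstanceError, dataclass, field


@dataclass(frozen=True)
class PatternNode:
    smarts: str | None = None
    is_terminal: bool = False
    feature_bit: int | None = None
    # Mutable defaults must go through default_factory in a dataclass.
    atom_requirements: dict[str, int] = field(default_factory=dict)
    # Assumption: children as a tuple, built before the parent is created.
    children: tuple[PatternNode, ...] = ()


leaf = PatternNode(
    smarts="[#6]-[#8]",
    is_terminal=True,
    feature_bit=7,
    atom_requirements={"C": 1, "O": 1},
)
root = PatternNode(children=(leaf,))

try:
    root.is_terminal = True  # reassignment is rejected on a frozen dataclass
except FrozenInstanceError:
    reassignment_blocked = True
else:
    reassignment_blocked = False
```

A frozen node has to be constructed bottom-up (children first), which would change how the tree is built from the JSON data.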

j-adamczyk · 2025-07-16T13:13:21Z

skfp/fingerprints/klekota_roth.py

@@ -1,10 +1,60 @@
+import json


I suggest a slightly different organization:

- a klekota_roth directory
- files klekota_roth_fp.py, smarts_tree.py and tree_data.json inside it
- in __init__.py, import only KlekotaRothFingerprint

This keeps all implementation details together, separates the SMARTS tree from the fingerprint itself, and changes nothing from the user's perspective.

j-adamczyk · 2025-07-16T13:14:32Z

skfp/fingerprints/klekota_roth.py

+_TREE_PATH = Path(__file__).parent / "data" / "tree.json"
+
+
+class _PatternNode:


Private class names are very rarely seen in Python; use a regular PatternNode.

j-adamczyk · 2025-07-16T13:23:28Z

skfp/fingerprints/klekota_roth.py

+        """
+        file = _TREE_PATH
+        if not file.exists():
+            raise FileNotFoundError(f"Tree file not found: {file}")


Make the message more specific: "Klekota-Roth SMARTS tree file not found". Also, I don't think the exact path is necessary.

j-adamczyk · 2025-07-16T13:23:43Z

skfp/fingerprints/klekota_roth.py

+        if not file.exists():
+            raise FileNotFoundError(f"Tree file not found: {file}")
+
+        with file.open("r", encoding="utf-8") as f:


Prefer file instead of f; one-letter variable names are not good practice.
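
A minimal sketch of how the loader might look with the two review suggestions applied (a more specific error message without the path, and a descriptive file handle name). The function name `load_tree` and the temporary file used for the demonstration are hypothetical, not taken from the PR.

```python
import json
import tempfile
from pathlib import Path


def load_tree(path: Path) -> dict:
    """Load the SMARTS tree from a JSON file (hypothetical helper name)."""
    if not path.exists():
        # Specific message that does not expose the exact path.
        raise FileNotFoundError("Klekota-Roth SMARTS tree file not found")

    with path.open("r", encoding="utf-8") as file:
        return json.load(file)


# Demonstration with a temporary file standing in for the packaged tree file.
with tempfile.TemporaryDirectory() as tmp_dir:
    tree_path = Path(tmp_dir) / "tree.json"
    tree_path.write_text('{"children": []}', encoding="utf-8")
    tree = load_tree(tree_path)

    try:
        load_tree(Path(tmp_dir) / "missing.json")
    except FileNotFoundError as error:
        message = str(error)
```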

j-adamczyk · 2025-07-16T13:26:59Z

skfp/fingerprints/klekota_roth.py

+            and node.feature_bit is not None
+        ):
+            self._feature_names[int(node.feature_bit)] = node.smarts
+        return node


For clarity, I prefer one empty line between an if or for block and the code that follows.

j-adamczyk · 2025-07-16T13:28:33Z

skfp/fingerprints/klekota_roth.py

+        node.pattern_mol = Chem.MolFromSmarts(node.smarts) if node.smarts else None
+        node.is_terminal = d.get("is_terminal", False)
+        node.feature_bit = d.get("feature_bit")
+        node.atom_requirements = defaultdict(int, d.get("atom_requirements", {}))


If there are atom requirements, they should be specified explicitly, right? So why use a defaultdict with a default value of 0?

j-adamczyk · 2025-07-16T13:29:37Z

skfp/fingerprints/klekota_roth.py

+            for key, atom in self._pattern_atoms.items():
+                atom_contents[key] = len(mol.GetSubstructMatches(atom))


For checking atom types, iterating over atoms should be faster than matching individual SMARTS patterns.
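
A sketch of the single-pass counting the comment describes. With RDKit one would pass `mol.GetAtoms()`, whose atoms expose `GetSymbol()`; here a hypothetical `FakeAtom` class stands in so the example runs without RDKit. It counts element symbols only, so if the PR's atom patterns encode more than the element, the keys would need to be adapted.

```python
from collections import Counter
from typing import Iterable, Protocol


class HasSymbol(Protocol):
    def GetSymbol(self) -> str: ...


def count_atom_symbols(atoms: Iterable[HasSymbol]) -> Counter:
    """Count element symbols in a single pass over the atoms."""
    return Counter(atom.GetSymbol() for atom in atoms)


class FakeAtom:
    # Stand-in for an RDKit atom so the example runs without RDKit.
    def __init__(self, symbol: str) -> None:
        self._symbol = symbol

    def GetSymbol(self) -> str:
        return self._symbol


# Heavy atoms of ethanol: two carbons and one oxygen.
ethanol_heavy_atoms = [FakeAtom("C"), FakeAtom("C"), FakeAtom("O")]
atom_contents = count_atom_symbols(ethanol_heavy_atoms)
```

A `Counter` returns 0 for missing keys, so requirement checks can compare against it directly.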

j-adamczyk · 2025-07-16T13:32:04Z

skfp/fingerprints/klekota_roth.py

+                for key, val in node.atom_requirements.items():
+                    if atom_contents[key] < val:
+                        break
+                else:


I don't really like the for/else syntax. If you rewrite the for loop to check the condition explicitly and use continue instead of break, you can get rid of the else here. It will also reduce the indentation level.
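
One way the for/else could be restructured, sketched under assumptions: the helper name `meets_requirements`, the dict-based node layout, and the example data are hypothetical. The requirement check moves into a small function, and the outer loop uses `continue` for early rejection. It also uses a plain dict with `.get(key, 0)` rather than a defaultdict.

```python
from __future__ import annotations


def meets_requirements(
    requirements: dict[str, int], atom_contents: dict[str, int]
) -> bool:
    """True if the molecule has at least the required count of every atom type."""
    for key, required in requirements.items():
        if atom_contents.get(key, 0) < required:
            return False

    return True


def candidate_nodes(nodes: list[dict], atom_contents: dict[str, int]) -> list[str]:
    """Return the SMARTS of nodes that pass the atom-count filter."""
    passed = []
    for node in nodes:
        if not meets_requirements(node["atom_requirements"], atom_contents):
            continue  # cheap rejection, no substructure match needed

        passed.append(node["smarts"])

    return passed


nodes = [
    {"smarts": "CCO", "atom_requirements": {"C": 2, "O": 1}},
    {"smarts": "CCN", "atom_requirements": {"C": 2, "N": 1}},
    {"smarts": "CS", "atom_requirements": {"C": 1, "S": 1}},
]
result = candidate_nodes(nodes, {"C": 2, "O": 1})
```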

j-adamczyk · 2025-07-16T13:32:55Z

tests/fingerprints/klekota_roth.py

    assert len(feature_names) == len(set(feature_names))
+
+
+@pytest.mark.parametrize("count", [True, False])


Use [False, True] to keep the convention of other tests.

mjste · 2025-07-17T13:58:36Z

skfp/fingerprints/klekota_roth/smarts_tree.py

This should be cached somehow. Correct me if I'm wrong, but I don't see it here. Consider a use case that creates many Klekota-Roth transformers: the tree would be loaded each time. With thousands of initializations (e.g. hyperparameter tuning), it will take its toll.

How long does it take to parse such a tree? Have you considered a format other than JSON? How does it compare? (If caching is implemented we can leave it as is; address this only if loading time is significant.)

Kacper-Kozubowski · 2025-07-18

Loading the JSON currently takes about 0.12 s per initialization, which would add up in large-scale use, so caching makes a lot of sense. I will use lru_cache for the tree loading function; with that, keeping JSON should not be a big problem.
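
A minimal sketch of caching the tree load with `functools.lru_cache`, as proposed in the reply. The function name `load_tree_cached` and the temporary file are hypothetical; the merged code may cache at a different level.

```python
import json
import tempfile
from functools import lru_cache
from pathlib import Path


@lru_cache(maxsize=1)
def load_tree_cached(path: Path) -> dict:
    """Parse the tree once per path; later calls return the cached object."""
    with path.open("r", encoding="utf-8") as file:
        return json.load(file)


with tempfile.TemporaryDirectory() as tmp_dir:
    tree_path = Path(tmp_dir) / "tree.json"
    tree_path.write_text('{"children": []}', encoding="utf-8")

    first = load_tree_cached(tree_path)
    second = load_tree_cached(tree_path)  # served from the cache, no file read

cache_stats = load_tree_cached.cache_info()
```

Because every caller receives the same cached object, the loaded tree should be treated as read-only.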

Kacper Kozubowski added 2 commits July 16, 2025 14:26

Add prefix tree to Klekota-Roth fingerprint

1e32a10

Refactor Klekota-Roth fingerprint, optimize tree traversal, and add tests

29e13bd

Kacper-Kozubowski requested review from j-adamczyk, mjste and my-alaska as code owners July 16, 2025 13:05

j-adamczyk requested changes Jul 16, 2025

Extract tree loading logic into smarts_tree.py, move all Klekota-Roth files into a dedicated directory, clean up implementation

41f5fb9

Kacper-Kozubowski requested a review from j-adamczyk July 17, 2025 11:41

j-adamczyk previously approved these changes Jul 17, 2025

mjste requested changes Jul 17, 2025

Add tree caching

cb38d8e

Kacper-Kozubowski dismissed j-adamczyk’s stale review via cb38d8e July 18, 2025 10:53

Kacper-Kozubowski requested a review from mjste July 18, 2025 10:54

mjste approved these changes Jul 18, 2025

j-adamczyk approved these changes Jul 18, 2025

Kacper-Kozubowski merged commit 64560bf into master Jul 18, 2025
14 checks passed

Kacper-Kozubowski deleted the klekota_prefix_optimization branch July 18, 2025 12:09

Kacper-Kozubowski merged 4 commits into master from klekota_prefix_optimization
