Klekota-Roth prefix optimization #470

Merged
Kacper-Kozubowski merged 4 commits into master from klekota_prefix_optimization on Jul 18, 2025
Contributor

Kacper-Kozubowski commented Jul 16, 2025

Changes

  • Completely rewrote the Klekota-Roth fingerprint computation for better performance.
  • Replaced the naive O(n × m) loop over molecules and patterns with a tree-based traversal.
  • Generated an efficient tree structure used by the algorithm and stored it in tree.json.
  • Added atom-count-based filtering for faster pattern rejection.
  • Removed the hardcoded list of 4860 patterns from the module; patterns are now loaded from the tree structure.
  • Added a test verifying result correctness against the naive approach.
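To illustrate the idea behind the tree traversal and the atom-count filter, here is a minimal toy sketch. It is not the code from this PR: strings stand in for molecules and SMARTS patterns, substring containment stands in for substructure matching, and `ToyNode`, `naive_fingerprint`, `tree_fingerprint`, `PATTERNS` and `ROOTS` are hypothetical names. The key property is that each child pattern extends its parent, so a failed parent match rules out the whole subtree.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ToyNode:
    """Hypothetical node; `pattern` stands in for a SMARTS string."""

    pattern: str
    feature_bit: Optional[int] = None  # set only on nodes that are real features
    children: list["ToyNode"] = field(default_factory=list)


def naive_fingerprint(molecule: str, patterns: list[str]) -> list[int]:
    # Baseline: test every pattern against the molecule independently.
    return [int(pattern in molecule) for pattern in patterns]


def tree_fingerprint(molecule: str, roots: list[ToyNode], n_bits: int) -> list[int]:
    bits = [0] * n_bits
    atom_counts = Counter(molecule)  # computed once per molecule
    stack = list(roots)

    while stack:
        node = stack.pop()

        # Cheap rejection: the molecule must contain at least as many of each
        # character ("atom") as the pattern requires.
        if Counter(node.pattern) - atom_counts:
            continue

        # Expensive check; if it fails, the whole subtree is skipped.
        if node.pattern not in molecule:
            continue

        if node.feature_bit is not None:
            bits[node.feature_bit] = 1

        stack.extend(node.children)

    return bits


# Toy pattern set arranged so that every child contains its parent.
PATTERNS = ["C", "CC", "CCO", "N", "NC"]
ROOTS = [
    ToyNode("C", 0, [ToyNode("CC", 1, [ToyNode("CCO", 2)])]),
    ToyNode("N", 3, [ToyNode("NC", 4)]),
]
```

Comparing `tree_fingerprint` against `naive_fingerprint` on a handful of inputs mirrors, in spirit, the correctness test mentioned in the last bullet.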

Other notes

  • tree.json is generated offline with a separate script (not included in this PR).
  • Achieved a 6–9× performance boost for single-core execution and roughly a 5–7× boost for multicore, depending on dataset.

Checklist before requesting a review

  • Docstrings added/updated in public functions and classes
  • Tests added, reasonable test coverage (at least ~90%, make test-coverage)
  • Sphinx docs added/updated and render properly (make docs and see docs/_build/index.html)

Comment on lines +51 to +57
def __init__(self):
    self.smarts: str | None = None
    self.pattern_mol: Mol | None = None
    self.is_terminal: bool = False
    self.feature_bit: int | None = None
    self.atom_requirements: defaultdict[str, int] = defaultdict(int)
    self.children: list[_PatternNode] = []
Member

Use @dataclass(frozen=True)
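A rough sketch of what this suggestion could look like, using the field names from the snippet above. The `node_from_dict` helper and the `"smarts"` and `"children"` JSON keys are assumptions made for illustration, and `pattern_mol` is typed as `Any` only so the sketch runs without RDKit. Because a frozen instance cannot be mutated after construction, children have to be built before their parent.

```python
from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass(frozen=True)
class PatternNode:
    smarts: Optional[str] = None
    pattern_mol: Any = None  # an RDKit Mol in the real code; Any keeps this sketch dependency-free
    is_terminal: bool = False
    feature_bit: Optional[int] = None
    atom_requirements: dict = field(default_factory=dict)
    children: tuple["PatternNode", ...] = ()


def node_from_dict(d: dict) -> PatternNode:
    """Hypothetical loader: build children first, then the frozen parent."""
    children = tuple(node_from_dict(child) for child in d.get("children", []))

    return PatternNode(
        smarts=d.get("smarts"),
        is_terminal=d.get("is_terminal", False),
        feature_bit=d.get("feature_bit"),
        atom_requirements=dict(d.get("atom_requirements", {})),
        children=children,
    )
```

Note that `frozen=True` only blocks attribute reassignment; the `atom_requirements` dict itself stays mutable, so it should still be treated as read-only by convention.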

@@ -1,10 +1,60 @@
import json
Member

I suggest a slightly different organization:

  • klekota_roth directory
  • files klekota_roth_fp.py, smarts_tree.py and tree_data.json inside
  • in __init__.py, import only KlekotaRothFingerprint

This keeps all implementation details together, separates the SMARTS tree from the fingerprint itself, and changes nothing from the user's perspective.

_TREE_PATH = Path(__file__).parent / "data" / "tree.json"


class _PatternNode:
Member

Private class names are very rarely seen in Python; use a regular PatternNode.

"""
file = _TREE_PATH
if not file.exists():
    raise FileNotFoundError(f"Tree file not found: {file}")
Member

Make the message more specific: "Klekota-Roth SMARTS tree file not found". Also, I don't think the exact path is necessary.

if not file.exists():
    raise FileNotFoundError(f"Tree file not found: {file}")

with file.open("r", encoding="utf-8") as f:
Member

Prefer file instead of f; one-letter variable names are not good practice.

    and node.feature_bit is not None
):
    self._feature_names[int(node.feature_bit)] = node.smarts
return node
Member

I prefer to have one empty line between an if or for block and the code that follows, for clarity.

node.pattern_mol = Chem.MolFromSmarts(node.smarts) if node.smarts else None
node.is_terminal = d.get("is_terminal", False)
node.feature_bit = d.get("feature_bit")
node.atom_requirements = defaultdict(int, d.get("atom_requirements", {}))
Member

If there are atom requirements, they should be specified explicitly, right? So why use a defaultdict with a default value of 0?

Comment on lines +245 to +246
for key, atom in self._pattern_atoms.items():
    atom_contents[key] = len(mol.GetSubstructMatches(atom))
Member

For checking atom types, iterating over the atoms should be faster than matching individual SMARTS patterns.
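One way this could look, as a sketch only: a single pass over the molecule's atoms that tallies element symbols. `count_atom_symbols` is a hypothetical helper name; it relies only on RDKit's `Mol.GetAtoms()` and `Atom.GetSymbol()`. It assumes the keys of `_pattern_atoms` are plain element symbols, which this thread does not confirm; if they also distinguish, for example, aromatic atoms, the counting key would need to include that information as well.

```python
from collections import Counter


def count_atom_symbols(mol) -> Counter:
    """Count atoms per element symbol in one pass over the molecule.

    `mol` is expected to be an RDKit Mol, or any object exposing
    GetAtoms() whose items expose GetSymbol().
    """
    return Counter(atom.GetSymbol() for atom in mol.GetAtoms())
```

A `Counter` returns 0 for symbols that are absent, so a lookup such as `atom_contents[key] < val` keeps working for elements the molecule does not contain.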

for key, val in node.atom_requirements.items():
    if atom_contents[key] < val:
        break
else:
Member

I don't really like the for/else syntax. If you rewrite the for loop to check the condition explicitly and use continue instead of break, you can get rid of the else here. It will also reduce the indentation level.
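A possible shape for that rewrite, sketched under assumptions: the surrounding loop over nodes is not visible in this thread, so `candidate_nodes` and `meets_requirements` are hypothetical names, and the body of the original `else:` branch is represented by a placeholder comment.

```python
def meets_requirements(atom_contents: dict, requirements: dict) -> bool:
    """True if the molecule has at least the required count of every atom type."""
    return all(atom_contents.get(key, 0) >= val for key, val in requirements.items())


def candidate_nodes(nodes, atom_contents: dict):
    """Yield only the nodes whose atom requirements the molecule satisfies."""
    for node in nodes:
        if not meets_requirements(atom_contents, node.atom_requirements):
            continue

        # Whatever previously lived in the `else:` branch (presumably the
        # substructure match) would go here, one indentation level shallower.
        yield node
```

Using `.get(key, 0)` also means `atom_requirements` and `atom_contents` can be plain dicts rather than `defaultdict`s.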

assert len(feature_names) == len(set(feature_names))


@pytest.mark.parametrize("count", [True, False])
Member

Use False, True to keep the convention of the other tests.

… files into a dedicated directory, clean up implementation
j-adamczyk previously approved these changes Jul 17, 2025
Collaborator

mjste commented Jul 17, 2025

  1. This should be cached somehow. Correct me if I'm wrong, but I don't see any caching here. Consider a use case that creates many Klekota-Roth transformers: the tree would be loaded each time. With thousands of initializations (e.g. during hyperparameter tuning), this will take its toll.
  2. How long does it take to parse such a tree? Have you considered a format other than JSON, and how does it compare? (If caching is implemented we can leave it as is; this only needs addressing if the loading time is significant.)

Contributor Author

The current JSON loading time is about 0.12 s per initialization, which would add up in large-scale use, so caching makes a lot of sense. I will use lru_cache for the tree-loading function; with that, using JSON should not be a big problem.
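For reference, a minimal sketch of how `functools.lru_cache` could wrap the loader. `load_tree` and its `path` parameter are hypothetical; the merged implementation may be structured differently. The cache is keyed on the argument, so repeated calls with the same path parse the JSON file only once per process.

```python
import json
from functools import lru_cache
from pathlib import Path


@lru_cache(maxsize=None)
def load_tree(path: str) -> dict:
    """Parse the tree file once; later calls with the same path hit the cache."""
    tree_file = Path(path)
    if not tree_file.exists():
        raise FileNotFoundError("Klekota-Roth SMARTS tree file not found")

    with tree_file.open("r", encoding="utf-8") as file:
        return json.load(file)
```

One caveat with this approach: every caller receives the same cached object, so the parsed tree should be treated as read-only. Calls that raise are not cached, so a missing file is re-checked on the next call.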

Kacper-Kozubowski merged commit 64560bf into master Jul 18, 2025
14 checks passed
Kacper-Kozubowski deleted the klekota_prefix_optimization branch July 18, 2025 12:09
3 participants