Klekota-Roth prefix optimization#470
Conversation
skfp/fingerprints/klekota_roth.py
Outdated
| def __init__(self): | ||
| self.smarts: str | None = None | ||
| self.pattern_mol: Mol | None = None | ||
| self.is_terminal: bool = False | ||
| self.feature_bit: int | None = None | ||
| self.atom_requirements: defaultdict[str, int] = defaultdict(int) | ||
| self.children: list[_PatternNode] = [] |
skfp/fingerprints/klekota_roth.py
Outdated
| @@ -1,10 +1,60 @@ | |||
| import json | |||
There was a problem hiding this comment.
I suggest a slightly different organization:
klekota_rothdirectory- files
klekota_roth_fp.py,smarts_tree.pyandtree_data.jsoninside - inside in
__init__.pyimport onlyKlekotaRothFingerprint
This keeps all implementation details together, separates the SMARTS tree from the fingerprint itself, and also doesn't change anything from the perspective of the user
skfp/fingerprints/klekota_roth.py
Outdated
| _TREE_PATH = Path(__file__).parent / "data" / "tree.json" | ||
|
|
||
|
|
||
| class _PatternNode: |
There was a problem hiding this comment.
Private class names are very rarely seen in Python, use regular PatternNode
skfp/fingerprints/klekota_roth.py
Outdated
| """ | ||
| file = _TREE_PATH | ||
| if not file.exists(): | ||
| raise FileNotFoundError(f"Tree file not found: {file}") |
There was a problem hiding this comment.
More specific: Klekota-Roth SMARTS tree file not found. Also, I don't think exact path is necessary
skfp/fingerprints/klekota_roth.py
Outdated
| if not file.exists(): | ||
| raise FileNotFoundError(f"Tree file not found: {file}") | ||
|
|
||
| with file.open("r", encoding="utf-8") as f: |
There was a problem hiding this comment.
Prefer file instead of f, one-letter variables are not a good practice
skfp/fingerprints/klekota_roth.py
Outdated
| and node.feature_bit is not None | ||
| ): | ||
| self._feature_names[int(node.feature_bit)] = node.smarts | ||
| return node |
There was a problem hiding this comment.
I prefer to have 1 empty line between if or for and further code for clarity
skfp/fingerprints/klekota_roth.py
Outdated
| node.pattern_mol = Chem.MolFromSmarts(node.smarts) if node.smarts else None | ||
| node.is_terminal = d.get("is_terminal", False) | ||
| node.feature_bit = d.get("feature_bit") | ||
| node.atom_requirements = defaultdict(int, d.get("atom_requirements", {})) |
There was a problem hiding this comment.
If there are atom requirements, they should be specified explicitly, right? So why defaultdict with default value 0?
skfp/fingerprints/klekota_roth.py
Outdated
| for key, atom in self._pattern_atoms.items(): | ||
| atom_contents[key] = len(mol.GetSubstructMatches(atom)) |
There was a problem hiding this comment.
For checking atom types, iteration over atoms should be faster than individual SMARTS patterns
skfp/fingerprints/klekota_roth.py
Outdated
| for key, val in node.atom_requirements.items(): | ||
| if atom_contents[key] < val: | ||
| break | ||
| else: |
There was a problem hiding this comment.
I don't really like for/else syntax. I think that if you rewrite the for loop to explicitly check the condition and use continue instead of break, you can get rid of else here. It will also decrease indentation level
tests/fingerprints/klekota_roth.py
Outdated
| assert len(feature_names) == len(set(feature_names)) | ||
|
|
||
|
|
||
| @pytest.mark.parametrize("count", [True, False]) |
There was a problem hiding this comment.
False, True to keep convention of other tests
… files into a dedicated directory, clean up implementation
There was a problem hiding this comment.
- This should be cached somehow. Correct me if I'm wrong, but I don't see it here. Consider a use case with extensive creation of Klekota roth transformers. This would load the tree each time. In case of thousands (e.g. hyperparameter tuning) initializations it will take its toll.
- How long does it take to parse such tree? Have you considered other format than json? How does it compare? (in case of implemented caching we can leave it as is, only address in case of significant loading time)
There was a problem hiding this comment.
Current JSON loading time is about 0.12 s for every initialization, which would add up in large-scale use, so caching makes a lot of sense. I will use lru_cache for tree loading function, and with that, using JSON should not be a big problem.
Changes
tree.json.Other notes
tree.jsonis generated offline with a separate script (not included in this PR).Checklist before requesting a review
make test-coverage)make docsand seedocs/_build/index.html)