def test_map_minhash_same_random_state_is_reproducible(smallest_smiles_list):
    map_fp_1 = MAPFingerprint(variant="minhash", random_state=123, n_jobs=-1)
    map_fp_2 = MAPFingerprint(variant="minhash", random_state=123, n_jobs=-1)

    X_1 = map_fp_1.transform(smallest_smiles_list)
    X_2 = map_fp_2.transform(smallest_smiles_list)

    assert_equal(X_1, X_2)


def test_map_minhash_different_random_state_changes_output(smallest_smiles_list):
    map_fp_1 = MAPFingerprint(variant="minhash", random_state=123, n_jobs=-1)
    map_fp_2 = MAPFingerprint(variant="minhash", random_state=456, n_jobs=-1)

    X_1 = map_fp_1.transform(smallest_smiles_list)
    X_2 = map_fp_2.transform(smallest_smiles_list)

    assert not np.array_equal(X_1, X_2)
Thank you for the PR!
It would be nice to also include a test that makes sure that, for the same random_state, given molecules are hashed in the same way regardless of their order and the size of the list passed to the .transform() method.
This will save us from potential problems in case someone modifies the ._minhash() method in the future.
I added a test for this (if that is what you meant).
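The property requested above (same random_state, same fingerprint per molecule, independent of list order and list size) can be checked along the following lines. This is a minimal self-contained sketch, not the test added in this PR: toy_minhash and toy_transform are hypothetical stand-ins for ._minhash() and .transform(), the hashing scheme is a generic one chosen for illustration, and the integer lists stand in for hashed shingles.

```python
import numpy as np

PRIME = 2_147_483_647  # 2**31 - 1, modulus for the toy hash family (assumption)


def toy_minhash(shingle_hashes, fp_size=16, random_state=0):
    """Hypothetical stand-in for a per-molecule MinHash step."""
    hashes = np.asarray(shingle_hashes, dtype=np.uint64)
    if hashes.size == 0:
        return np.zeros(fp_size, dtype=np.uint32)

    # A fresh generator per molecule: the drawn parameters depend only on
    # random_state, never on how many molecules were processed before.
    rng = np.random.default_rng(random_state)
    a = rng.integers(1, PRIME, size=fp_size, dtype=np.uint64)
    b = rng.integers(0, PRIME, size=fp_size, dtype=np.uint64)

    # One linear hash per fingerprint position; keep the minimum per position.
    permuted = (a[:, None] * hashes[None, :] + b[:, None]) % np.uint64(PRIME)
    return permuted.min(axis=1).astype(np.uint32)


def toy_transform(molecules, **kwargs):
    """Hypothetical stand-in for .transform(): one fingerprint row per molecule."""
    return np.vstack([toy_minhash(mol, **kwargs) for mol in molecules])


def test_toy_minhash_independent_of_order_and_list_size():
    mols = [[3, 17, 42], [5, 8], [99, 100, 101, 102]]

    full = toy_transform(mols, random_state=123)
    reversed_order = toy_transform(mols[::-1], random_state=123)
    single = toy_transform([mols[1]], random_state=123)

    # Reversing the input only reverses the rows.
    assert np.array_equal(full[::-1], reversed_order)
    # A molecule hashed alone matches its row from the full list.
    assert np.array_equal(single[0], full[1])
```

The same two assertions would apply to the real fingerprint class with a list of SMILES in place of the toy integer lists.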
j-adamczyk
left a comment
A few minor comments. Also, I wonder if we should keep include_duplicated_shingles? I guess that it's covered by binary vs count vs minhash anyway.
skfp/fingerprints/map.py
Outdated
if hashed_shinglings.size == 0:
    return np.zeros(self.fp_size, dtype=np.uint32)

rng = np.random.default_rng(self.random_state)
Will this work if the user passes an np.random.RandomState as random_state? This is allowed by the scikit-learn API.
Probably not.
Should I use something like this here:
rng = (
    self.random_state
    if isinstance(self.random_state, RandomState)
    else np.random.default_rng(self.random_state)
)
as in randomized_scaffold_split.py?
Use check_random_state() from scikit-learn instead: https://scikit-learn.org/stable/modules/generated/sklearn.utils.check_random_state.html. BTW, you can also add that to randomized_scaffold_split.py; we should use scikit-learn mechanisms where possible.
Done. Will also push the minor edit of randomized_scaffold_split.py here as well (let me know if that should go into a separate PR).
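For reference, a small sketch of how check_random_state() normalizes the random_state inputs discussed in this thread. The helper draw_hash_params and its parameters are hypothetical and not taken from map.py; only check_random_state and the RandomState methods are real library calls.

```python
import numpy as np
from sklearn.utils import check_random_state


def draw_hash_params(random_state, fp_size=4, max_value=2**31 - 1):
    """Hypothetical helper: draw per-position hash parameters."""
    # Accepts None, an int seed, or an existing np.random.RandomState,
    # and always hands back a np.random.RandomState instance.
    rng = check_random_state(random_state)
    a = rng.randint(1, max_value, size=fp_size)
    b = rng.randint(0, max_value, size=fp_size)
    return a, b


# An int seed and a freshly seeded RandomState give the same parameters.
a_int, b_int = draw_hash_params(123)
a_rs, b_rs = draw_hash_params(np.random.RandomState(123))
same_params = np.array_equal(a_int, a_rs) and np.array_equal(b_int, b_rs)

# An existing instance is returned as-is, not copied.
shared = np.random.RandomState(123)
is_same_object = check_random_state(shared) is shared
```

One thing to keep in mind with this approach: a RandomState instance is stateful, so reusing the same instance across calls advances it, whereas an int seed produces a fresh generator each time.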
Good question. Personally, I see little benefit in using those duplicated shingles because
Maybe also worth having a look for you, @daenuprobst?
j-adamczyk
left a comment
Two very minor comments left. I agree that include_duplicated_shingles is the only way to add "counts" to minhash variant, so let's keep that.
Changes
This is a first attempt to address #519
Checklist before requesting a review
make test-coverage)
make docs and see docs/_build/index.html)