Skip to content

MAP4 implementation fixes#528

Merged
j-adamczyk merged 6 commits intoMLCIL:masterfrom
florian-huber:issue_519
Mar 13, 2026
Merged

MAP4 implementation fixes#528
j-adamczyk merged 6 commits intoMLCIL:masterfrom
florian-huber:issue_519

Conversation

@florian-huber
Copy link
Contributor

@florian-huber florian-huber commented Mar 11, 2026

Changes

This is a first attempt to address #519

Checklist before requesting a review

  • Docstrings added/updated in public functions and classes
  • Tests added, reasonable test coverage (at least ~90%, make test-coverage)
  • Sphinx docs added/updated and render properly (make docs and see docs/_build/index.html)

@j-adamczyk j-adamczyk changed the title Issue 519 MAP4 implementation fixes Mar 12, 2026
Comment on lines +129 to +146
def test_map_minhash_same_random_state_is_reproducible(smallest_smiles_list):
map_fp_1 = MAPFingerprint(variant="minhash", random_state=123, n_jobs=-1)
map_fp_2 = MAPFingerprint(variant="minhash", random_state=123, n_jobs=-1)

X_1 = map_fp_1.transform(smallest_smiles_list)
X_2 = map_fp_2.transform(smallest_smiles_list)

assert_equal(X_1, X_2)


def test_map_minhash_different_random_state_changes_output(smallest_smiles_list):
map_fp_1 = MAPFingerprint(variant="minhash", random_state=123, n_jobs=-1)
map_fp_2 = MAPFingerprint(variant="minhash", random_state=456, n_jobs=-1)

X_1 = map_fp_1.transform(smallest_smiles_list)
X_2 = map_fp_2.transform(smallest_smiles_list)

assert not np.array_equal(X_1, X_2)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the PR!
It would be nice to also include a test that makes sure that given molecules for the same random_state are hashed in the same way, regardless of their order and size of the list passed to the .transform() method.
This will save us from a potential problems in case someone makes modifications to the ._minhash() method in the future

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added a test for this (if that is what you meant).

Copy link
Member

@j-adamczyk j-adamczyk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few minor comments. Also, I wonder if we should keep include_duplicated_shingles? I guess that it's covered by binary vs count vs minhash anyway.

if hashed_shinglings.size == 0:
return np.zeros(self.fp_size, dtype=np.uint32)

rng = np.random.default_rng(self.random_state)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will this work if user passes np.random.RandomState as random_state? This is allowed by scikit-learn API

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably not.
Should I use something like this here:

    rng = (
        self.random_state
        if isinstance(self.random_state, RandomState)
        else np.random.default_rng(self.random_state)
    )

as in randomized_scaffold_split.py ?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use check_random_state() from scikit-learn instead: https://scikit-learn.org/stable/modules/generated/sklearn.utils.check_random_state.html. BTW, you can also add that to randomized_scaffold_split.py, we should use scikit-learn mechanisms where possible

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. Will also push the minor edit of randomized_scaffold_split.py here as well (let me know if that should go into a separate PR).

@florian-huber
Copy link
Contributor Author

A few minor comments. Also, I wonder if we should keep include_duplicated_shingles? I guess that it's covered by binary vs count vs minhash anyway.

Good question. Personally, I see little benefit in using those duplicated shingles because count seems more straightforward to me (so no strong opinion from my side on that matter). But I can see two cases where it might be useful to have:

  1. to reproduce the original score and results, that might sometimes be include_duplicated_shingles=True together with binary so that duplicates are "counted" via distinct shingles like "...|1" etc.
  2. For variant="minhash", include_duplicated_shingles=True is here the only way to preserve multiplicity-like information.

@florian-huber
Copy link
Contributor Author

Maybe also worth having a look for you @daenuprobst ?
(since you were involved with MAP4 much more than I am)

Copy link
Member

@j-adamczyk j-adamczyk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Two very minor comments left. I agree that include_duplicated_shingles is the only way to add "counts" to minhash variant, so let's keep that.

@j-adamczyk j-adamczyk merged commit e4895ed into MLCIL:master Mar 13, 2026
13 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants