-
Notifications
You must be signed in to change notification settings - Fork 41
pre compute tanimote kernel distances #701
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
…moto similarities
|
messed up branches, this is the renewed version of #696 |
|
Thanks @LukasHebing, how urgent is this PR? Would it be fine to have the one from me reagarding the ContinuousMolecularInput finished and merged before, as this one also does some changes to the whole molecular machinery? I would try to finish it today. |
Not urgent. We are using this branch right now, so first merging the other PR is fine. |
|
Hi @LukasHebing, can you resolve the merge conflicts with main, by merging main in? After this, I will review it ;) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @LukasHebing,
I still have some understanding problems, in the first run, I just posted questions ;)
If I do not set to precomute anything, it is defaulting to the old solution, or?
Best,
Johannes
|
|
||
| # private attributes, for pre-computation of similarities: will be overridden by tanimoto_gp, or auto-computed | ||
| _fingerprint_settings_for_similarities: Optional[dict[str, Fingerprints]] = None | ||
| _molecular_inputs: Optional[list[CategoricalMolecularInput]] = None |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, I a bit puzzled here, why is it a list of CategoricalMolecularInputs, the kernel should always acts on one CategoricalMolecularInput or what are you using this for?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The kernel can also handle multiple CategoricalMolecularInput. Distances in more dimensions are just added. So, you can have a single kernel for multiple inputs (say e.g. ligand-smiles, solvent-smiles)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nice, was this also possible before? Yes, or?
| ).features # type: ignore | ||
| base_kernel._molecular_inputs = molecular_inputs # type: ignore | ||
|
|
||
| # move fingerprint data model fro categorical encodings to kernel-specs |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| # move fingerprint data model fro categorical encodings to kernel-specs | |
| # move fingerprint data model for categorical encodings to kernel-specs |
| inp_.key | ||
| ] # [Ni, Ni], precomputed distances for feature idx | ||
|
|
||
| # Gather integer indices for this feature from x1 and x2 (keep batch dims) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What do you mean with this feature? From my understanding x1 and x2 are sets matrices of molecules encoded by fingerprints, or?
So what is happening here? Are you trying to get the indices of the moleucles of x1 and x2 in the precomputed distance matrix?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, these are the indeces.
In this setup we only encode x1, and x2 as integers (this is because, we remove the preprocessing step into fingerprints in the data model for the tanimoto kernel, and pass the fingerprint info to the kernel data-model:
# move fingerprint data model fro categorical encodings to kernel-specs
base_kernel._fingerprint_settings_for_similarities = {}
for inp_ in molecular_inputs:
if inp_.key in list(self.categorical_encodings):
assert isinstance(
self.categorical_encodings[inp_.key], Fingerprints
), (
f"Categorical encoding for input {inp_.key} must be a Fingerprint. "
f"Found {type(self.categorical_encodings[inp_.key])}"
)
fingerprint: Fingerprints = self.categorical_encodings.pop(
inp_.key
) # type: ignore
base_kernel._fingerprint_settings_for_similarities[inp_.key] = (
fingerprint # type: ignore
)In the initial kernel setup, fingerprints and all mutual similarities are computed and stored as a large matrix.
In the forward method, only the respective rows and cols of this matrix are selected.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah ok, this means that in case that one wants a TanimotoGP with precompute, one has to remove the input_transform_specs and one just passes the integers?
|
@jduerholt Thanks for the review :) One new problem came with merge. The automatic reduction of the correlated fingerprint vectors is happening in the computation which was used so far, but not in the pre-computed. Because the pre-computed tanimoto sims used the rdkit bit-vectors instead of the parsed numpy/pandas arrays (this is why they are way faster). However, it seems that the tanimoto calculation (after this changes) is only responsible for <2% of the computation time, so this is not so relevant anymore. I try to change this and use the same tensor-based similarity computation, which also includes the removal of correlated fingerprint features. @jduerholt: This may take some days. I will let you know when this is ready. |
|
Just as a comment, you can turn this behavior off, via setting |
I think this reduction is a quite nice improvement, so I would keep this in the kernel |
Tanimoto-similarities are computed upfront, using rdkit functions
re_init_kwargsre_init_kwargsare extended to the Kernelmapfunction.Performance tests:
On a private benchmark with >1k molecules, this computation only needed ~10% of the computation time.
The old computation also lead to memory-errors, when many molecules were added as training data points leading to crashes.