@LukasHebing
Contributor

@LukasHebing LukasHebing commented Jan 23, 2026

Tanimoto similarities are now computed upfront using RDKit functions.

  • This avoids re-computation during model training and evaluation, as well as excessive memory usage.
  • Computation with RDKit is significantly faster (sparse bit vectors vs. full float64 torch arrays may explain the difference).
  • The computation is done in the kernel; the cov-tensors are passed to the strategy via re_init_kwargs.
  • The re_init_kwargs are extended to the Kernel map function.
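
To illustrate the idea of computing all pairwise similarities once, here is a minimal sketch. It is not the code from this PR: it swaps RDKit bit vectors for plain Python sets of on-bit indices so that it runs without dependencies, and the names `tanimoto`, `precompute_similarities`, and the toy fingerprints are hypothetical.

```python
from __future__ import annotations

from itertools import combinations


def tanimoto(a: frozenset[int], b: frozenset[int]) -> float:
    """Tanimoto similarity of two fingerprints given as sets of on-bit indices."""
    union = len(a | b)
    # Convention chosen for this sketch: two empty fingerprints count as identical.
    return len(a & b) / union if union else 1.0


def precompute_similarities(fps: list[frozenset[int]]) -> list[list[float]]:
    """Build the full symmetric similarity matrix once, before any training."""
    n = len(fps)
    sim = [[1.0] * n for _ in range(n)]  # diagonal: every molecule matches itself
    for i, j in combinations(range(n), 2):
        sim[i][j] = sim[j][i] = tanimoto(fps[i], fps[j])
    return sim


# Hypothetical toy fingerprints (sparse representation: only the on-bits are stored).
fps = [frozenset({0, 1, 2}), frozenset({1, 2, 3}), frozenset({4})]
sim = precompute_similarities(fps)
```

Storing only the on-bits mirrors why a sparse bit-vector representation can be cheaper than dense float64 arrays: each pair costs one set intersection and one union instead of arithmetic over the full fingerprint length.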

Performance tests:
On a private benchmark with >1k molecules, this computation needed only ~10% of the previous computation time.

The old computation also led to memory errors when many molecules were added as training data points, which caused crashes.

@LukasHebing LukasHebing changed the title from "Feat/pre compute tanimote kernel comps" to "pre compute tanimote kernel distances" Jan 23, 2026
@LukasHebing
Contributor Author

I messed up the branches; this is the renewed version of #696.

@LukasHebing LukasHebing marked this pull request as ready for review January 26, 2026 09:16
@LukasHebing LukasHebing requested a review from jduerholt January 26, 2026 12:52
@jduerholt
Contributor

Thanks @LukasHebing, how urgent is this PR? Would it be fine to have my PR regarding the ContinuousMolecularInput finished and merged first, as this one also makes some changes to the whole molecular machinery? I would try to finish it today.

@LukasHebing
Contributor Author

> Thanks @LukasHebing, how urgent is this PR? Would it be fine to have my PR regarding the ContinuousMolecularInput finished and merged first, as this one also makes some changes to the whole molecular machinery? I would try to finish it today.

Not urgent. We are using this branch right now, so first merging the other PR is fine.

@jduerholt
Contributor

Hi @LukasHebing, can you resolve the merge conflicts with main, by merging main in? After this, I will review it ;)

Contributor

@jduerholt jduerholt left a comment

Hi @LukasHebing,

I still have some trouble understanding this, so in this first pass I have only posted questions ;)

If I do not enable any pre-computation, it defaults to the old solution, right?

Best,

Johannes


# private attributes, for pre-computation of similarities: will be overridden by tanimoto_gp, or auto-computed
_fingerprint_settings_for_similarities: Optional[dict[str, Fingerprints]] = None
_molecular_inputs: Optional[list[CategoricalMolecularInput]] = None
Contributor

Hmm, I am a bit puzzled here: why is it a list of CategoricalMolecularInputs? The kernel should always act on one CategoricalMolecularInput, or what are you using this for?

Contributor Author

The kernel can also handle multiple CategoricalMolecularInputs. The distances of the individual inputs are simply added, so you can have a single kernel for multiple inputs (e.g. ligand SMILES and solvent SMILES).
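
As a rough sketch of what "adding" over several inputs could look like: the snippet below assumes one pre-computed matrix per input key and simply sums the looked-up values. The keys `ligand` and `solvent`, the function `combined_value`, and the toy matrices are hypothetical and not taken from the PR.

```python
from __future__ import annotations


def combined_value(
    matrices: dict[str, list[list[float]]],
    a: dict[str, int],
    b: dict[str, int],
) -> float:
    """Sum the pre-computed per-input values for two data points.

    `a` and `b` map each input key to the integer index of the chosen molecule.
    """
    return sum(matrices[key][a[key]][b[key]] for key in matrices)


# Hypothetical pre-computed matrices, one per molecular input.
matrices = {
    "ligand": [[1.0, 0.5], [0.5, 1.0]],
    "solvent": [[1.0, 0.25], [0.25, 1.0]],
}

point_a = {"ligand": 0, "solvent": 1}
point_b = {"ligand": 1, "solvent": 1}
value = combined_value(matrices, point_a, point_b)  # 0.5 + 1.0
```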

Contributor

Nice, was this also possible before? I assume yes?

).features # type: ignore
base_kernel._molecular_inputs = molecular_inputs # type: ignore

# move fingerprint data model fro categorical encodings to kernel-specs
Contributor

Suggested change
- # move fingerprint data model fro categorical encodings to kernel-specs
+ # move fingerprint data model for categorical encodings to kernel-specs

inp_.key
] # [Ni, Ni], precomputed distances for feature idx

# Gather integer indices for this feature from x1 and x2 (keep batch dims)
Contributor

What do you mean by "this feature"? From my understanding, x1 and x2 are matrices of molecules encoded as fingerprints, right?

So what is happening here? Are you trying to get the indices of the molecules of x1 and x2 in the precomputed distance matrix?

Contributor Author

@LukasHebing LukasHebing Jan 29, 2026

Yes, these are the indices.
In this setup we encode x1 and x2 only as integers. This is because we remove the preprocessing step into fingerprints from the data model for the Tanimoto kernel and pass the fingerprint info to the kernel data model instead:

# move fingerprint data model fro categorical encodings to kernel-specs
base_kernel._fingerprint_settings_for_similarities = {}
for inp_ in molecular_inputs:
    if inp_.key in list(self.categorical_encodings):
        assert isinstance(
            self.categorical_encodings[inp_.key], Fingerprints
        ), (
            f"Categorical encoding for input {inp_.key} must be a Fingerprint. "
            f"Found {type(self.categorical_encodings[inp_.key])}"
        )
        fingerprint: Fingerprints = self.categorical_encodings.pop(
            inp_.key
        )  # type: ignore
        base_kernel._fingerprint_settings_for_similarities[inp_.key] = (
            fingerprint  # type: ignore
        )

In the initial kernel setup, fingerprints and all mutual similarities are computed and stored as a large matrix.

In the forward method, only the respective rows and cols of this matrix are selected.
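
The row/column selection described above can be sketched roughly as follows. This is only an illustration with plain Python lists instead of torch tensors; `lookup_block` and the toy matrix are hypothetical names and values, not the PR's forward method.

```python
from __future__ import annotations


def lookup_block(
    sim: list[list[float]], idx1: list[int], idx2: list[int]
) -> list[list[float]]:
    """Select rows idx1 and columns idx2 from a pre-computed matrix.

    The result has shape len(idx1) x len(idx2), like a kernel matrix K(x1, x2).
    """
    return [[sim[i][j] for j in idx2] for i in idx1]


# Hypothetical pre-computed similarities for three molecules.
sim = [
    [1.0, 0.5, 0.0],
    [0.5, 1.0, 0.2],
    [0.0, 0.2, 1.0],
]

x1 = [0, 2]     # integer-encoded molecules of the first batch
x2 = [1, 1, 2]  # integer-encoded molecules of the second batch
block = lookup_block(sim, x1, x2)
```

No fingerprint is touched at this point; the cost per kernel evaluation is a pure index lookup, which is consistent with the similarity computation dropping out of the training loop.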

Contributor

Ah ok, so this means that if one wants a TanimotoGP with pre-computation, one has to remove the input_transform_specs and just pass the integers?

@LukasHebing
Contributor Author

@jduerholt Thanks for the review :)

One new problem came up with the merge. The automatic removal of correlated fingerprint features happens in the computation that was used so far, but not in the pre-computed one, because the pre-computed Tanimoto similarities use the RDKit bit vectors instead of the parsed numpy/pandas arrays (this is why they are much faster).

However, it seems that after these changes the Tanimoto calculation is only responsible for <2% of the computation time, so this is not so relevant anymore. I will try to change this and use the same tensor-based similarity computation, which also includes the removal of correlated fingerprint features.

@jduerholt: This may take some days. I will let you know when this is ready.

@jduerholt
Contributor

Just as a comment, you can turn this behavior off by setting remove_correlated_features to False. One could also enforce this in the future for Tanimoto ...

@LukasHebing
Contributor Author

> Just as a comment, you can turn this behavior off by setting remove_correlated_features to False. One could also enforce this in the future for Tanimoto ...

I think this reduction is quite a nice improvement, so I would keep it in the kernel.
