pre compute tanimote kernel distances #701

LukasHebing · 2026-01-23T12:58:41Z

Tanimoto similarities are computed upfront, using RDKit functions.

This avoids re-computation during model training and evaluation, as well as excessive memory usage.
Computation with RDKit is significantly faster (sparse bit vectors vs. full float64 torch arrays may explain the difference).
The computation is done in the kernel; the cov-tensors are passed to the strategy via re_init_kwargs.
The re_init_kwargs are extended to the Kernel map function.
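
To illustrate the idea of computing all mutual similarities once, here is a minimal sketch in plain Python. It is not the code of this PR: the PR uses RDKit bit vectors, whereas this sketch assumes each fingerprint is simply a set of on-bit positions, and the names `tanimoto`, `precompute_similarities`, and `SIM` are hypothetical.

```python
from itertools import combinations


def tanimoto(a: frozenset, b: frozenset) -> float:
    """Tanimoto similarity |a & b| / |a | b| of two sets of on-bit positions."""
    union = len(a | b)
    # Convention chosen for this sketch: two empty fingerprints count as identical.
    return len(a & b) / union if union else 1.0


def precompute_similarities(fingerprints: list) -> list:
    """Return the full symmetric [N, N] similarity matrix, computed once."""
    n = len(fingerprints)
    sim = [[1.0] * n for _ in range(n)]  # the diagonal stays at 1.0
    for i, j in combinations(range(n), 2):
        sim[i][j] = sim[j][i] = tanimoto(fingerprints[i], fingerprints[j])
    return sim


# Toy fingerprints: each molecule is the set of indices of its on-bits.
fps = [frozenset({1, 5, 9}), frozenset({1, 5, 7}), frozenset({2, 3})]
SIM = precompute_similarities(fps)
```

Storing only the on-bit positions keeps each fingerprint small, which is one plausible reason why a sparse representation can beat dense float64 arrays.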

Performance tests:
On a private benchmark with >1k molecules, the pre-computed approach needed only ~10% of the previous computation time.

The old computation also led to memory errors, and hence crashes, when many molecules were added as training data points.


LukasHebing · 2026-01-23T15:05:32Z

I messed up the branches; this is the renewed version of #696.

jduerholt · 2026-01-27T08:19:16Z

Thanks @LukasHebing, how urgent is this PR? Would it be fine to have my PR regarding the ContinuousMolecularInput finished and merged first, as that one also changes the whole molecular machinery? I would try to finish it today.

LukasHebing · 2026-01-27T11:23:38Z

> Thanks @LukasHebing, how urgent is this PR? Would it be fine to have my PR regarding the ContinuousMolecularInput finished and merged first, as that one also changes the whole molecular machinery? I would try to finish it today.

Not urgent. We are using this branch right now, so first merging the other PR is fine.

jduerholt · 2026-01-29T10:19:41Z

Hi @LukasHebing, can you resolve the merge conflicts with main, by merging main in? After this, I will review it ;)

jduerholt

Hi @LukasHebing,

I still have some trouble understanding this, so in this first pass I have only posted questions ;)

If I do not set anything to be precomputed, it defaults to the old solution, right?

Best,

Johannes

jduerholt · 2026-01-29T12:19:32Z

bofire/data_models/kernels/molecular.py

+
+    # private attributes, for pre-computation of similarities: will be overridden by tanimoto_gp, or auto-computed
+    _fingerprint_settings_for_similarities: Optional[dict[str, Fingerprints]] = None
+    _molecular_inputs: Optional[list[CategoricalMolecularInput]] = None


Hmm, I am a bit puzzled here: why is it a list of CategoricalMolecularInputs? The kernel should always act on one CategoricalMolecularInput, or what are you using this for?

The kernel can also handle multiple CategoricalMolecularInputs. Distances in more dimensions are just added, so you can have a single kernel for multiple inputs (e.g. ligand SMILES and solvent SMILES).

Nice, was this also possible before? Yes, right?
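
As a rough illustration of "just added", the following sketch evaluates one kernel over two molecular inputs by summing their per-input similarities. The matrices, the keys `ligand` and `solvent`, and the helper `summed_similarity` are made up for this example and are not taken from the PR.

```python
def summed_similarity(a: dict, b: dict, sims: dict) -> float:
    """Add the per-input similarities of two points.

    a and b map an input key to the integer index of the chosen molecule;
    sims maps the same keys to precomputed [N, N] similarity matrices.
    """
    return sum(sims[key][a[key]][b[key]] for key in sims)


# Hypothetical precomputed similarity matrices for two molecular inputs.
sims = {
    "ligand": [[1.0, 0.25], [0.25, 1.0]],
    "solvent": [[1.0, 0.5], [0.5, 1.0]],
}
x = {"ligand": 0, "solvent": 1}
y = {"ligand": 1, "solvent": 1}
k_xy = summed_similarity(x, y, sims)  # 0.25 (ligand) + 1.0 (solvent)
```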

jduerholt · 2026-01-29T12:20:24Z

bofire/data_models/surrogates/tanimoto_gp.py

+                ).features  # type: ignore
+                base_kernel._molecular_inputs = molecular_inputs  # type: ignore
+
+                # move fingerprint data model fro categorical encodings to kernel-specs


Suggested change

# move fingerprint data model fro categorical encodings to kernel-specs

# move fingerprint data model for categorical encodings to kernel-specs

jduerholt · 2026-01-29T12:30:39Z

bofire/kernels/fingerprint_kernels/tanimoto_kernel.py

+                    inp_.key
+                ]  # [Ni, Ni], precomputed distances for feature idx
+
+                # Gather integer indices for this feature from x1 and x2 (keep batch dims)


What do you mean by "this feature"? From my understanding, x1 and x2 are matrices of molecules encoded by fingerprints, right?

So what is happening here? Are you trying to get the indices of the molecules of x1 and x2 in the precomputed distance matrix?

Yes, these are the indices.
In this setup we only encode x1 and x2 as integers. This is because we remove the preprocessing step into fingerprints from the data model for the Tanimoto kernel and pass the fingerprint info to the kernel data model:

    # move fingerprint data model fro categorical encodings to kernel-specs
    base_kernel._fingerprint_settings_for_similarities = {}
    for inp_ in molecular_inputs:
        if inp_.key in list(self.categorical_encodings):
            assert isinstance(
                self.categorical_encodings[inp_.key], Fingerprints
            ), (
                f"Categorical encoding for input {inp_.key} must be a Fingerprint. "
                f"Found {type(self.categorical_encodings[inp_.key])}"
            )
            fingerprint: Fingerprints = self.categorical_encodings.pop(
                inp_.key
            )  # type: ignore
            base_kernel._fingerprint_settings_for_similarities[inp_.key] = (
                fingerprint  # type: ignore
            )

In the initial kernel setup, fingerprints and all mutual similarities are computed and stored as a large matrix.

In the forward method, only the respective rows and cols of this matrix are selected.

Ah ok, this means that if one wants a TanimotoGP with precomputation, one has to remove the input_transform_specs and just pass the integers?
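
The row and column selection discussed in this thread can be sketched in plain Python as follows. This is only an illustration under simplifying assumptions (a single feature, no batch dimensions, nested lists instead of tensors); `gather_block` and the numbers in `sim` are hypothetical.

```python
def gather_block(sim: list, idx1: list, idx2: list) -> list:
    """Select rows idx1 and columns idx2 of a precomputed [N, N] matrix.

    The result has shape [len(idx1), len(idx2)].
    """
    return [[sim[i][j] for j in idx2] for i in idx1]


# Hypothetical precomputed similarities for N = 3 molecules.
sim = [
    [1.0, 0.5, 0.0],
    [0.5, 1.0, 0.2],
    [0.0, 0.2, 1.0],
]
x1 = [0, 2]     # integer-encoded molecules in the first set
x2 = [1, 2, 2]  # integer-encoded molecules in the second set
block = gather_block(sim, x1, x2)
```

With torch tensors and 1-D integer index tensors, the same selection could be written as `sim[idx1][:, idx2]`; handling batch dimensions needs more care than this sketch shows.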

LukasHebing · 2026-01-29T13:37:56Z

@jduerholt Thanks for the review :)

One new problem came up with the merge. The automatic reduction of the correlated fingerprint vectors happens in the computation that was used so far, but not in the pre-computed one, because the pre-computed Tanimoto similarities use the RDKit bit vectors instead of the parsed numpy/pandas arrays (this is why they are much faster).

However, it seems that the Tanimoto calculation (after these changes) is only responsible for <2% of the computation time, so this is not so relevant anymore. I will try to change this and use the same tensor-based similarity computation, which also includes the removal of correlated fingerprint features.

@jduerholt: This may take some days. I will let you know when this is ready.

jduerholt · 2026-01-29T14:20:43Z

Just as a comment, you can turn this behavior off, via setting remove_correlated_features to False. One could also enforce this in the future for tanimoto ...

LukasHebing · 2026-01-30T07:11:23Z

> Just as a comment, you can turn this behavior off, via setting remove_correlated_features to False. One could also enforce this in the future for tanimoto ...

I think this reduction is a quite nice improvement, so I would keep this in the kernel

LukasHebing and others added 30 commits January 16, 2026 16:15

first draft: avoid initializing input transform

72d3ef4

WIP: pre-computed tanimoto distances

229538b

WIP: pre-computed tanimoto distances - now in kernel. But fails

e7e8774

fixed batch-shape lookup

9a70a72

avoids multiple tanimoto calcs in model validation

9ebe343

changed to _input_transform

a972842

make re_init_kwargs a function of surrogates base class

866f9c3

pyright stuff

16aa26c

pyright stuff

9082a3d

pre-commit stuff

fbb2db7

debugged underscore input_transofrm bug

4d5bf60

added test

a998086

after hooks

b339f69

deleted temp file

db4628a

BotorchSurrogates not an abstract base class anymore (linting)

7d86888

linting

5034b04

linting

8996256

linting

e3b9659

linting

238af76

linting

4d40fc3

linting

dc9649e

Change type hint for surrogates to Any

8142f88

linting

daeb981

after merging with improve_tanimoto branch

e1389f6

renaming distances to similarities

79b6888

made fingerprint calculation configurable

9424141

WIP: making tests

9f0b3ba

restricting optimization_only tests to strategy folder

19a805b

added test for comparison of pre-computed / on-the-fly computing tanimoto similarities

259c19e

skipping test on missing rdkit in env

edf9995

LukasHebing changed the title ~~Feat/pre compute tanimote kernel comps~~ pre compute tanimote kernel distances Jan 23, 2026

LukasHebing added 6 commits January 23, 2026 16:17

type annotation errors

64facf9

after hooks

4cb332c

# type: ignore

896f378

# type: ignore

a5c6f10

# type: ignore

d857946

features typo

c5aca3f

LukasHebing marked this pull request as ready for review January 26, 2026 09:16

LukasHebing added 3 commits January 26, 2026 10:19

merged with main

dfe4525

type: ignore

88e3619

kwargs keyword to fully bayesian _fit_botorch methods

31ce5fe

LukasHebing requested a review from jduerholt January 26, 2026 12:52

LukasHebing added 2 commits January 29, 2026 12:24

kwargs keyword to fully bayesian _fit_botorch methods

0cc8e2f

fixed tanimoto model validator

bf00a7e

jduerholt reviewed Jan 29, 2026


LukasHebing added 2 commits January 29, 2026 14:40

using molfeatures

65b6392

added tests for different molfeatures

5299960

LukasHebing added 5 commits January 29, 2026 15:43

changed data model validation strategy

b29f316

after hooks

1c6ab5c

types for Molecular fgpr. AnyMolecular

76b0c5c

after hooks

cc04a22

changing types

6a4185f

fixed re-validation error of tanimoto-gp

8ac0a31
