Retrieving Gaussian Process model details in BayBE #737
Hello, I am working on an active learning project using the BayBE library (GP surrogate model + Bayesian optimization) and have a few questions about retrieving details of the Gaussian Process model that BayBE computes internally. Briefly, our model has eight inputs (x1-x8) and one target. The active learning loop recommends 10 experiments per cycle to help minimize the target. We would like to report the following information in our publication:
When I tried to retrieve the surrogate model, I could only obtain the following. Could you please help us retrieve this information? If there are other internal model parameters in BayBE that are important to report, we would be grateful for your advice and guidance on how to retrieve them as well. Thank you in advance for your help!

Code:

# Imports (added here for completeness)
from baybe import Campaign
from baybe.constraints import ContinuousLinearConstraint
from baybe.objectives import SingleTargetObjective
from baybe.parameters import NumericalContinuousParameter
from baybe.recommenders import BotorchRecommender
from baybe.searchspace import SearchSpace
from baybe.surrogates import GaussianProcessSurrogate
from baybe.targets import NumericalTarget
import pandas as pd

# Define the optimization objective
target_name = "Target"
target_mode = "MIN"
target = NumericalTarget(name=target_name, mode=target_mode, bounds=(-5,5))
objective = SingleTargetObjective(target=target)
# Define the parameters for the search space and their bounds
parameters = [
    NumericalContinuousParameter(name="x1", bounds=(0, 100)),
    NumericalContinuousParameter(name="x2", bounds=(0, 100)),
    NumericalContinuousParameter(name="x3", bounds=(0, 100)),
    NumericalContinuousParameter(name="x4", bounds=(0, 100)),
    NumericalContinuousParameter(name="x5", bounds=(553, 613)),
    NumericalContinuousParameter(name="x6", bounds=(10, 50)),
    NumericalContinuousParameter(name="x7", bounds=(12000, 192000)),
    NumericalContinuousParameter(name="x8", bounds=(1.5, 4)),
]
# Define the constraint: the sum of x1-x4 must equal 100
constraint = ContinuousLinearConstraint(
    parameters=["x1", "x2", "x3", "x4"],
    operator="=",
    coefficients=[1.0, 1.0, 1.0, 1.0],
    rhs=100,
)
# Construct the search space
searchspace = SearchSpace.from_product(parameters=parameters, constraints=[constraint])
# Create the recommender and campaign
acquisition_function = "qLogEI"
recommender = BotorchRecommender(
    surrogate_model=GaussianProcessSurrogate(),
    acquisition_function=acquisition_function,
)
campaign = Campaign(searchspace=searchspace, objective=objective, recommender=recommender)

# Add prior measurements (df: DataFrame with columns x1-x8 and "Target", defined elsewhere)
campaign.add_measurements(df)
# Get recommendations
import time
start_time = time.time()
recommendation = campaign.recommend(batch_size=10)
print(recommendation.round(2))
end_time = time.time()
elapsed_time = end_time - start_time
print(f"Execution time: {elapsed_time:.2f} seconds")
# Retrieve the surrogate model
surrogate_model = campaign.recommender.get_surrogate(
    searchspace=campaign.searchspace,
    objective=campaign.objective,
    measurements=campaign.measurements,
)
recommendation_df = pd.DataFrame(recommendation)  # note: recommend() already returns a DataFrame
# Display the model details and recommended experiments
print(f"Target: {target_name}, Mode: {target_mode}")
print(f"Model: {surrogate_model}")
print(f"Acquisition function: {acquisition_function}")
display(recommendation_df.round(2))
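For anyone else looking to report GP hyperparameters: once you can reach the fitted BoTorch/GPyTorch model behind the surrogate (how to access it in BayBE is exactly the open question here, so no attribute path is claimed), its state_dict() maps parameter names to values, which is easy to flatten into a report. A minimal sketch using a stand-in dictionary; the parameter names below are typical GPyTorch names, shown purely for illustration:

```python
# Hedged sketch: formatting GP hyperparameters for a publication table.
# With a real fitted model, one would do something along the lines of
#   state = {k: v.tolist() for k, v in botorch_model.state_dict().items()}
# Note that GPyTorch stores "raw_" parameters; the constrained (actual)
# values are obtained via the corresponding constraint transforms.

# Stand-in state dict with illustrative GPyTorch-style parameter names:
state = {
    "covar_module.base_kernel.raw_lengthscale": [[0.5] * 8],  # one per input x1-x8
    "covar_module.raw_outputscale": 1.3,
    "likelihood.noise_covar.raw_noise": [0.12],
}

def hyperparameter_report(state: dict) -> str:
    """Format a name -> value mapping as one line per parameter."""
    return "\n".join(f"{name}: {value}" for name, value in sorted(state.items()))

print(hyperparameter_report(state))
```

The same loop works unchanged on a real state_dict once the tensors are converted to lists.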
Hi @inogueroles 👋🏼 Great to hear that you've decided to use our framework. Before I answer your questions, here are a few general comments:
I'll come back with answers to your questions later; I need to finish a few other things first 🙃
@inogueroles The lockfile is a snapshot of the exact environment you used for this publication. Anyone can use this lockfile to recreate your results; it even includes secondary dependencies that BayBE uses indirectly. It is also, implicitly, a complete snapshot of all the algorithmic details (hyperparameters etc.) that BayBE uses, because the snapshot pins the exact version of BayBE. Here is an example of how a lockfile looks and how you can create yours. If you provide this plus your data and your production scripts, that is everything a journal could expect for reproducibility; it is actually much more complete and practical than just writing down the hyperparameter values. This is just a suggestion. We can nonetheless provide an answer to the retrieval question, but I'd have to look it up, and @AdrianSosic is likely much faster.
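As a concrete illustration (not necessarily the exact tooling the BayBE maintainers use), plain pip can produce a minimal lockfile; richer tools such as pip-tools, poetry, or conda-lock work too:

```shell
# Pin the exact versions of every installed package, including BayBE's
# transitive dependencies, into a lockfile:
python -m pip freeze > requirements.lock

# Anyone can then recreate the environment from it:
#   python -m pip install -r requirements.lock

# Quick look at the pinned versions:
head -n 5 requirements.lock
```

Committing this file alongside the data and scripts gives reviewers a one-command path to the exact software state.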
Hi @inogueroles, now I have some more time to answer.
Note that I fully agree with @Scienfitz's suggestion of submitting a lock file. This adds much more to reproducibility than just reporting model details in text form, especially since things are subject to constant change/improvement (see details below):
Nevertheless, here is my input on your points:
BayBE uses a DefaultKernelFactory to create an appropriate kernel for the specified optimization problem. Currently, we're using a smoothed version of the EDBO model (see docstring). But please note: this default behavior is subject to occasional change, i.e. whenever there is a good reason t…