Computing the Hessian of a GP with respect to hyperparameters #2662
Hello, I have two ways to compute the Hessian matrix with respect to the GP hyperparameters, and they give completely different answers. I am trying to understand why. Here's a simple example to reproduce this:

```python
import torch
import gpytorch
from torch.autograd.functional import hessian
from torch.func import functional_call

# Training data
train_x = torch.linspace(-1, 1, 10)
train_y = torch.sin(train_x * (2 * torch.pi)) + 0.2 * torch.randn(train_x.size())

# GP Model
class GPModel(gpytorch.models.ExactGP):
    def __init__(self, train_x, train_y, likelihood):
        super().__init__(train_x, train_y, likelihood)
        self.mean_module = gpytorch.means.ConstantMean()
        self.covar_module = gpytorch.kernels.ScaleKernel(gpytorch.kernels.RBFKernel())

    def forward(self, x):
        mean_x = self.mean_module(x)
        covar_x = self.covar_module(x)
        return gpytorch.distributions.MultivariateNormal(mean_x, covar_x)

# Likelihood and model
likelihood = gpytorch.likelihoods.GaussianLikelihood()
model = GPModel(train_x, train_y, likelihood)

model.train()
likelihood.train()

optimizer = torch.optim.Adam(model.parameters(), lr=0.1)
mll = gpytorch.mlls.ExactMarginalLogLikelihood(likelihood, model)

training_iter = 250
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer,
                                                 milestones=[0.5 * training_iter],
                                                 gamma=0.1)

for i in range(training_iter):
    optimizer.zero_grad()
    output = model(train_x)
    loss = -mll(output, train_y)
    loss.backward()
    optimizer.step()
    scheduler.step()
    if i % 50 == 0:
        print(f"({i}/{training_iter}) Log-likelihood loss {loss:.2f}")
        for name, param in model.named_parameters():
            if param.grad is not None:
                print(f"{name}: grad norm = {param.grad.norm().item()}")
            else:
                print(f"{name}: no gradient!")

orig_params = dict(model.named_parameters())
shapes = {k: v.shape for k, v in orig_params.items()}
numels = {k: v.numel() for k, v in orig_params.items()}

def flatten(params_dict):
    return torch.cat([params_dict[k].reshape(-1) for k in orig_params])

def unflatten(flat_tensor):
    out = {}
    pointer = 0
    for k in orig_params:
        n = numels[k]
        out[k] = flat_tensor[pointer:pointer + n].reshape(shapes[k])
        pointer += n
    return out

# Define function that returns log-likelihood given theta
def log_likelihood_fn(theta):
    param_dict = unflatten(theta)
    output = functional_call(model, param_dict, (train_x,))
    log_prob = model.likelihood(output).log_prob(train_y)
    return log_prob

# Compute Hessian
flat_params = flatten(orig_params).detach().clone().requires_grad_(True)
log_lik_fn = lambda theta: log_likelihood_fn(theta)
H_v1 = hessian(log_lik_fn, flat_params)
```

This gives me something like the following:

```
Hessian shape: torch.Size([4, 4])
Hessian matrix:
tensor([[  0.0000,   0.0000,   0.0000,   0.0000],
        [  0.0000, -19.9292,   0.0000,   0.0000],
        [  0.0000,   0.0000,   0.0000,   0.0000],
        [  0.0000,   0.0000,   0.0000,   0.0000]])
```

Q: I do not understand why most of the entries are zero even though none of the parameters have their gradient attribute set to None. I also compute the Hessian using simple repeated calls to `torch.autograd.grad`:

```python
def compute_hessian_autograd(model, likelihood, x, y):
    params = list(model.parameters())
    params_flat = torch.cat([p.flatten() for p in params])
    n_params = len(params_flat)

    output = model(x)
    log_lik = likelihood(output).log_prob(y)

    # Compute first derivatives
    first_grads = torch.autograd.grad(log_lik, params, create_graph=True, retain_graph=True)
    first_grads_flat = torch.cat([g.flatten() for g in first_grads])

    # Compute Hessian row by row
    hessian_rows = []
    for i in range(n_params):
        grad_outputs = torch.zeros_like(first_grads_flat)
        grad_outputs[i] = 1.0
        second_grads = torch.autograd.grad(
            outputs=first_grads_flat,
            inputs=params,
            grad_outputs=grad_outputs,
            retain_graph=True,
            allow_unused=True
        )
        # Handle None gradients (parameters not involved in computation)
        second_grads_flat = torch.cat([
            g.flatten() if g is not None else torch.zeros_like(p.flatten())
            for g, p in zip(second_grads, params)
        ])
        hessian_rows.append(second_grads_flat)

    hessian_matrix = torch.stack(hessian_rows)
    return hessian_matrix

H_v2 = compute_hessian_autograd(model, likelihood, train_x, train_y)
print("Hessian computed through plain autograd:\n", H_v2)
```

This results in a Hessian matrix that is completely different from the one above, except for the −19.93 diagonal entry:

```
Hessian computed through plain autograd:
tensor([[-3.1577e+00, -7.8154e-04,  1.7477e-02,  1.9202e-03],
        [-7.8197e-04, -1.9929e+01,  9.8698e-04, -6.7467e-04],
        [ 1.7477e-02,  9.8697e-04, -2.3757e-02, -1.1217e-03],
        [ 1.9201e-03, -6.7467e-04, -1.1217e-03, -1.5052e-04]])
```

I would have expected the first option to produce a more accurate Hessian, but I am unsure what is going wrong here. I would appreciate any thoughts on this. Thanks in advance for your help.
Hi @kiranvad,

The problem with the first implementation is that `model.likelihood` and `log_prob` are executed outside of the scope of `functional_call`, but access the noise variance parameter and the lengthscales, respectively. You can use the `_reparametrize_module` context manager in the following way to ensure the parameters are replaced for all operations. I separated the individual calls out to highlight which line accesses which component of the Hessian; removing any individual line from the context of `_reparametrize_module` will lead the resulting Hessian to have zeros in the associated rows.

```python
from torch.nn.utils.stateless import _reparametrize_module

def log_likelihood_fn(theta):
    param_dict = unflatten(theta)
    with _reparametrize_module(model, param_dict):
        output = model(train_x)              # populates H[1, :], H[:, 1]
        normal = model.likelihood(output)    # populates H[0, :], H[:, 0]
        log_prob = normal.log_prob(train_y)  # populates H[2:, :], H[:, 2:]
    return log_prob
```

Regarding the lengthscales (…)

Let me know if you have additional thoughts or questions about this.
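As a usage sketch (my addition, assuming the corrected `log_likelihood_fn` above and the `flat_params` and `H_v2` variables from the original snippet are still in scope), you can recompute the functional Hessian and compare it against the plain-autograd version, which should now agree up to numerical tolerance:

```python
import torch
from torch.autograd.functional import hessian

# Recompute the Hessian with the corrected log_likelihood_fn from above
H_v1_fixed = hessian(log_likelihood_fn, flat_params)

# Compare against the row-by-row autograd Hessian computed earlier (H_v2)
print("max abs difference:", (H_v1_fixed - H_v2).abs().max().item())
print("matches plain autograd:", torch.allclose(H_v1_fixed, H_v2, atol=1e-4))
```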