Prediction strategies help #2184
-
Hi @alexpeters1208 - we've been wanting to add Vecchia models to GPyTorch for a while. I have my own implementation that I've been meaning to merge in, but I'm also curious to see your implementation. The strategy I used was not to instantiate a new prediction strategy, but instead to create a batch of GPs (each of which makes a prediction on a single data point). We set the training data for each GP in the batch to be the nearest neighbors of the target data point. This approach makes predictions very parallelizable. I'll attach it below.
-
```python
import argparse
import math
import os
import time

import numpy as np
import torch
import gpytorch
import faiss

from util import sample_batch_indices


class VecchiaModel(gpytorch.models.ExactGP):
    def __init__(self, train_x, train_y, likelihood, k=16):
        # Placeholder mask
        mask = train_x.new_ones(*train_x.shape[:-1], 1)
        super().__init__((train_x, mask), train_y, likelihood)
        self.k = k
        self.res = faiss.StandardGpuResources()
        self.register_buffer("train_x", train_x)
        self.register_buffer("train_indices", torch.arange(len(train_x)))
        self.register_buffer("train_nn_indices", torch.zeros(len(train_x), k, dtype=torch.long))
        self.register_buffer("train_nn_mask", torch.zeros(len(train_x), k, dtype=torch.bool))
        self.likelihood = likelihood
        self.mean_module = gpytorch.means.ConstantMean()
        self.covar_module = gpytorch.kernels.ScaleKernel(
            gpytorch.kernels.MaternKernel(nu=2.5, ard_num_dims=train_x.size(-1))
        )
        self.nn_time = 0

    def compute_train_nn_indices(self):
        assert self.k > 0
        start = time.time()
        with torch.no_grad():
            x = (self.train_x.data.float() / self.covar_module.base_kernel.lengthscale.data.float()).cpu().numpy()
            # Create ordering based on the first PCA vector
            mat = faiss.PCAMatrix(x.shape[-1], 1)
            mat.train(x)
            assert mat.is_trained
            projection = torch.from_numpy(mat.apply_py(x)).squeeze(-1).cuda()
            self.train_indices = projection.argsort()
            # Construct masked nearest-neighbor sets based on the ordering
            self.cpu_index = faiss.IndexFlatL2(self.train_x.size(-1))
            self.gpu_index = faiss.index_cpu_to_gpu(self.res, 0, self.cpu_index)
            for i, index in enumerate(self.train_indices.tolist()):
                row = x[index][None, :]
                self.gpu_index.add(row)
                self.train_nn_indices[index].copy_(
                    torch.from_numpy(self.gpu_index.search(row, self.k + 1)[1][..., 0, 1:]).long().to(self.train_x.device)
                )
                self.train_nn_mask[index, :min(i, self.k)] = True
        self.nn_time += (time.time() - start)

    def compute_test_nn_indices(self, x):
        with torch.no_grad():
            train_x = (self.train_x.data.float() / self.covar_module.base_kernel.lengthscale.data.float()).cpu().numpy()
            self.cpu_index = faiss.IndexFlatL2(self.train_x.size(-1))
            self.gpu_index = faiss.index_cpu_to_gpu(self.res, 0, self.cpu_index)
            self.gpu_index.add(train_x)
            x_np = (x.data.float() / self.covar_module.base_kernel.lengthscale.data.float()).cpu().numpy()
            return torch.from_numpy(self.gpu_index.search(x_np, self.k)[1]).long().to(x.device)

    def forward(self, x, mask):
        mean_x = self.mean_module(x)
        covar_x = self.covar_module(x).evaluate()
        # Replace masked-out entries with an identity matrix
        eye = torch.eye(covar_x.size(-1), dtype=covar_x.dtype, device=covar_x.device)
        covar_x = torch.where(mask & mask.transpose(-1, -2), covar_x, eye)
        return gpytorch.distributions.MultivariateNormal(mean_x, gpytorch.lazify(covar_x))


def main(train_x, train_y, test_x, test_y, args):
    N_train = train_x.size(0)
    N_test = test_x.size(0)
    print("N_train: {} N_test: {} D: {}".format(N_train, N_test, train_x.size(-1)))

    likelihood = gpytorch.likelihoods.GaussianLikelihood().cuda()
    model = VecchiaModel(train_x, train_y, likelihood, k=args.k).cuda()

    optimizer = torch.optim.Adam([{'params': model.parameters()}], lr=args.lr, betas=(0.90, 0.999))
    milestones = [int(k * args.num_iter) for k in [0.25, 0.5, 0.75]]
    sched = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=milestones, gamma=0.2)

    num_iter = args.num_iter
    report_freq = args.report_freq
    model.eval()  # We want eval mode, because the MLL is composed of Gaussian conditionals

    times = [time.time()]
    losses = []
    all_indices = []
    model.compute_train_nn_indices()

    for iteration in range(num_iter):
        if (iteration % args.nn_freq == 0) and iteration > 0:
            model.compute_train_nn_indices()
        if len(all_indices) == 0:
            all_indices = sample_batch_indices(N_train, args.mini_batch_size)

        mini_batch_indices = all_indices.pop()
        nn_indices = model.train_nn_indices[mini_batch_indices]
        nn_mask = model.train_nn_mask[mini_batch_indices]
        x_batch = train_x[nn_indices]
        y_batch = train_y[nn_indices]

        # We want a batch GP that can handle different amounts of training data.
        # For batches with < k training data points, the forward pass of the model
        # replaces masked-out parts of the covariance with the identity, so the
        # masked points have no effect on the rest of the data.
        model.set_train_data((x_batch, nn_mask[..., None]), y_batch, strict=False)

        optimizer.zero_grad()
        # Compute the predictive distribution of each y to get the MLL factors
        with gpytorch.settings.detach_test_caches(False):
            pred_x = train_x[mini_batch_indices][..., None, :]
            pred_y = train_y[mini_batch_indices][..., None]
            pred_mask = torch.ones(*pred_y.shape, 1, dtype=torch.bool, device=train_x.device)
            output = likelihood(model(pred_x, pred_mask))
            output = torch.distributions.Normal(output.mean, output.stddev)
            log_probs = output.log_prob(pred_y)
        loss = -log_probs.squeeze(dim=-1).mean(dim=-1)
        loss.backward()
        optimizer.step()
        sched.step()

        losses.append(loss.item())
        times.append(time.time())
        if iteration >= report_freq and ((iteration + 1) % report_freq == 0 or iteration == (num_iter - 1)):
            dt = (times[-1] - times[-1 - report_freq]) / report_freq
            lengthscale = model.covar_module.base_kernel.lengthscale
            print('Iter %d/%d - Loss: %.3f %.3f lengthscale: %.3f %.3f %.3f sigma: %.3f os: %.3f [dt: %.3f]' % (
                iteration + 1, num_iter, losses[-1], np.mean(losses[-report_freq:]),
                lengthscale.mean().item(), lengthscale.min().item(), lengthscale.max().item(),
                model.likelihood.noise.sqrt().item(),
                model.covar_module.outputscale.sqrt().item(), dt))

    print("Total Training Time: %.4f" % (times[-1] - times[0]))
    print("Total NN Time: %.4f" % model.nn_time)
    model.set_train_data(train_x, train_y, strict=False)

    # NN posterior
    test_mse = 0.
    test_nll = 0.
    with torch.no_grad(), gpytorch.settings.fast_pred_var():
        for x_batch, y_batch in zip(test_x.split(512), test_y.split(512)):
            # Clear cache
            model.train()
            model.eval()
            # Compute the NN posterior
            nn_indices = model.compute_test_nn_indices(x_batch)
            train_mask = torch.ones(*nn_indices.shape, 1, dtype=torch.bool, device=train_x.device)
            model.set_train_data((train_x[nn_indices, :], train_mask), train_y[nn_indices], strict=False)
            pred_mask = torch.ones(*y_batch.shape, 1, 1, dtype=torch.bool, device=train_x.device)
            posterior = model.likelihood(model(x_batch.unsqueeze(-2), pred_mask)).to_data_independent_dist()
            # Compute stats
            test_nll -= posterior.log_prob(y_batch.unsqueeze(-1)).squeeze(-1).sum(dim=-1).item()
            test_mse += (y_batch.unsqueeze(-1) - posterior.loc).pow(2.0).squeeze(-1).sum(dim=-1).item()

    # Reset training data, clear cache
    model.train()
    model.eval()
    model.set_train_data(train_x, train_y, strict=False)

    # Aggregate stats
    test_rmse = math.sqrt(test_mse / len(test_y))
    test_nll = test_nll / len(test_y)
    s = "[Seed {}: {} {}] - NN posterior TEST LL: {:.3f} RMSE: {:.3f}"
    print(s.format(args.seed, "vecchia", args.dataset, -test_nll, test_rmse))
```
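The script imports `argparse` but the driver isn't shown above. A minimal entry point along these lines is assumed; the argument names match the fields read from `args` in `main()`, but the defaults and the placeholder data are illustrative only:

```python
# Illustrative driver only -- argument names mirror the fields used in main(),
# defaults and synthetic data are placeholders, not part of the original script.
if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Vecchia GP example")
    parser.add_argument("--k", type=int, default=16)
    parser.add_argument("--lr", type=float, default=0.01)
    parser.add_argument("--num-iter", type=int, default=1000)
    parser.add_argument("--report-freq", type=int, default=100)
    parser.add_argument("--nn-freq", type=int, default=250)
    parser.add_argument("--mini-batch-size", type=int, default=256)
    parser.add_argument("--seed", type=int, default=0)
    parser.add_argument("--dataset", type=str, default="synthetic")
    args = parser.parse_args()

    torch.manual_seed(args.seed)
    np.random.seed(args.seed)

    # Placeholder data -- substitute a real train/test split here
    train_x = torch.randn(2000, 8).cuda()
    train_y = torch.randn(2000).cuda()
    test_x = torch.randn(500, 8).cuda()
    test_y = torch.randn(500).cuda()

    main(train_x, train_y, test_x, test_y, args)
```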
-
Hello all,
I've been working with GPyTorch for a while and have developed a modular implementation of non-variational nearest-neighbor models, which I'm calling Vecchia models. This includes the nearest-neighbor Gaussian process, the block independent Gaussian process, and the block nearest-neighbor Gaussian process. Currently, I can train these different kinds of models with no problems and very little modification to the way users define GPyTorch models.
However, prediction is proving more difficult. The prediction strategies implemented in GPyTorch, in both the exact and variational contexts, appear to make use of the full joint distribution over the testing and training points. Predictions for Vecchia models, by contrast, are understood and expressed explicitly in terms of the conditional distributions of testing points given training points. Reverse engineering these conditional distributions to derive the full joint distribution may be possible, but it would lose all of the computational gains of these kinds of approximations.
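For concreteness, the per-test-point predictive I have in mind is just the standard GP regression conditional restricted to the $k$ nearest training neighbors $N(*)$ of each test point $x_*$ (with $\sigma^2$ the observation-noise variance):

$$
\mu_* = m(x_*) + K_{*,N}\,\bigl(K_{N,N} + \sigma^2 I\bigr)^{-1}\bigl(y_N - m(x_N)\bigr),
\qquad
\sigma_*^2 = K_{*,*} - K_{*,N}\,\bigl(K_{N,N} + \sigma^2 I\bigr)^{-1} K_{N,*}.
$$

Only a $k \times k$ system over the neighbor set ever needs to be solved, which is where the computational savings come from.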
So, what is the recommended way to implement this kind of prediction? I could implement a new model class that uses this prediction strategy by default (i.e., a vecchia_gp alongside exact_gp and approximate_gp), but I worry that I would lose the modularity my current implementation has (the Vecchia approximation can, in principle, be used in concert with variational methods, though I haven't tested this yet). I could create a new prediction_strategy and insist that users subclass ApproximateGP, but I again worry that I would then be unable to combine Vecchia approximations with variational methods and deep GPs. Maybe I could write a class that wraps any existing prediction strategy and makes predictions based on the conditional distributions rather than the joint distribution, but that doesn't feel like a great solution either.
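As a naive, unbatched illustration of what I mean by "predict from conditionals rather than the joint" (this is not my actual implementation): assuming a trained ExactGP `model` with its `likelihood`, the full `train_x`/`train_y`, and a precomputed `nn_indices` tensor whose row `i` holds the indices of the $k$ nearest training points of test point `i`, the behavior I want is roughly:

```python
# Naive per-test-point sketch, for illustration only.
import torch
import gpytorch

def predict_from_conditionals(model, likelihood, train_x, train_y, test_x, nn_indices):
    means, variances = [], []
    with torch.no_grad(), gpytorch.settings.fast_pred_var():
        for i in range(test_x.size(0)):
            idx = nn_indices[i]
            # Re-enter eval mode to clear stale prediction caches, then condition
            # only on this test point's nearest training neighbors
            model.train()
            model.eval()
            model.set_train_data(train_x[idx], train_y[idx], strict=False)
            pred = likelihood(model(test_x[i:i + 1]))
            means.append(pred.mean)
            variances.append(pred.variance)
    return torch.cat(means), torch.cat(variances)
```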
Thanks in advance for your time. I hope to contribute these new features to GPyTorch when they're complete and tested, because I think they will add a valuable new class of models to the wide variety of existing capabilities of GPyTorch.