MultiDeviceKernel is not distributing the memory usage #2319

nicmig · 2023-04-10T16:50:26Z

nicmig
Apr 10, 2023

I try to use the MultiDeviceKernel for a time series forecast. My data has ~100.000 data samples and one input feature. To start with, I just used the example from GPyTorch repository for ExactGP with MultiDeviceKernel (https://docs.gpytorch.ai/en/latest/examples/02_Scalable_Exact_GPs/Simple_MultiGPU_GP_Regression.html) but instead of the protein data, I used my own data. I have 8 GPUs (NVIDIA A100 40Gb or NVIDIA Tesla V100 32Gb), and I work on a supercomputer at my university.
According to the paper linked to the example, running this with this set-up and the provided code should be no problem. However, I always run into 'CUDA of out memory' and to be more particular it is because of imbalanced memory usage. So when I display my memory usage with 'nividia-smi -l', I can see that my GPU0 is used ~99% while the rest is used around 30-40%. This then leads to a stop of the code at some point. This is my code:

    output_device = torch.device('cuda:0')
    x_train = torch.tensor(x_train, device=output_device)
    y_train = torch.tensor(y_train, device=output_device)
    x_test = torch.tensor(x_test, device=output_device)
    y_test = torch.tensor(y_test, device=output_device)
    train_x, train_y = torch.flatten(x_train.contiguous()), torch.flatten(y_train.contiguous())
    test_x, test_y = torch.flatten(x_test.contiguous()), torch.flatten(y_test.contiguous())
    n_devices = torch.cuda.device_count()
    print('Planning to run on {} GPUs.'.format(n_devices))

    class ExactGPModel(gpytorch.models.ExactGP):
        def __init__(self, train_x, train_y, likelihood, n_devices):
            super(ExactGPModel, self).__init__(train_x, train_y, likelihood)
            self.mean_module = gpytorch.means.ConstantMean()
            base_covar_module = gpytorch.kernels.ScaleKernel(gpytorch.kernels.RBFKernel())

            self.covar_module = gpytorch.kernels.MultiDeviceKernel(
                base_covar_module, device_ids=list(range(n_devices)),
                output_device=output_device)

        def forward(self, x):
            mean_x = self.mean_module(x)
            covar_x = self.covar_module(x)
            return gpytorch.distributions.MultivariateNormal(mean_x, covar_x)


    def train(train_x, train_y, n_devices, output_device, checkpoint_size, preconditioner_size, n_training_iter):

        likelihood = gpytorch.likelihoods.GaussianLikelihood().to(output_device)
        model = ExactGPModel(train_x, train_y, likelihood, n_devices).to(output_device)
        print('model built.')
        model.train()
        likelihood.train()
        print('model in training mode')

        optimizer = FullBatchLBFGS(model.parameters(), lr=0.1)
        print('optimizer is calculated.')
        # "Loss" for GPs - the marginal log likelihood
        mll = gpytorch.mlls.ExactMarginalLogLikelihood(likelihood, model)

        print('mll is calculated.')

        with gpytorch.beta_features.checkpoint_kernel(checkpoint_size), \
                gpytorch.settings.max_preconditioner_size(preconditioner_size):

            def closure():
                optimizer.zero_grad()
                print('set optimizer to zero grad.')
                output = model(train_x)
                print('made predictions')
                loss = -mll(output, train_y)
                print('loss was calculated')
                return loss

            print('start with loss calculation')
            loss = closure()
            loss.backward(torch.ones_like(loss))
            print('loss backward was calculated')

            for i in range(n_training_iter):
                options = {'closure': closure, 'current_loss': loss, 'max_ls': 10}
                loss, _, _, _, _, _, _, fail = optimizer.step(options)

                print('Iter %d/%d - Loss: %.3f   lengthscale: %.3f   noise: %.3f' % (
                    i + 1, n_training_iter, loss.mean().item(),
                    model.covar_module.module.base_kernel.lengthscale.item(),
                    model.likelihood.noise.item()
                ))

                if fail:
                    print('Convergence reached!')
                    break

        print(f"Finished training on {train_x.size(0)} data points using {n_devices} GPUs.")
        return model, likelihood

    def find_best_gpu_setting(train_x,
                              train_y,
                              n_devices,
                              output_device,
                              preconditioner_size
                              ):
        N = train_x.size(0)

        # Find the optimum partition/checkpoint size by decreasing in powers of 2
        # Start with no partitioning (size = 0)
        settings = [0] + [int(n) for n in np.ceil(N / 2 ** np.arange(1, np.floor(np.log2(N))))]

        for checkpoint_size in settings:
            print('Number of devices: {} -- Kernel partition size: {}'.format(n_devices, checkpoint_size))
            try:
                # Try a full forward and backward pass with this setting to check memory usage
                _, _ = train(train_x, train_y,
                             n_devices=n_devices, output_device=output_device,
                             checkpoint_size=checkpoint_size,
                             preconditioner_size=preconditioner_size, n_training_iter=1)

                # when successful, break out of for-loop and jump to finally block
                break
            except RuntimeError as e:
                print('RuntimeError: {}'.format(e))
                print(torch.cuda.memory_summary(device=None, abbreviated=False))
                gc.collect()
                torch.cuda.empty_cache()
            except AttributeError as e:
                print('AttributeError: {}'.format(e))
            finally:
                # handle CUDA OOM error
                gc.collect()
                torch.cuda.empty_cache()
                print('emptied cache')
        return checkpoint_size

    # Set a large enough preconditioner size to reduce the number of CG iterations run
    preconditioner_size = 100
    checkpoint_size = find_best_gpu_setting(train_x, train_y,
                                            n_devices=n_devices,
                                            output_device=output_device,
                                            preconditioner_size=preconditioner_size)

    model, likelihood = train(train_x, train_y,
                              n_devices=n_devices, output_device=output_device,
                              checkpoint_size=checkpoint_size,
                              preconditioner_size=preconditioner_size,
                              n_training_iter=5)

To test the program I run it on only 4 GPUs with less data (average over half an hour instead of a data point every 5min). The memory usage while running the program with ~18000 training samples and a kernel partition of 29 looks like this:

and I'll get this error message:

 Traceback (most recent call last):                                                                                                                                                                          
  File "main_regression.py", line 192, in <module>                                                                                                                                                          
    train_model()                                                                                                                                                                                           
  File "main_regression.py", line 181, in train_model                                                                                                                                                       
    torch_model = reg_model.Exact_gp(x_train, y_train, x_test, y_test)                                                                                                                                      
  File "/cluster/home/niclasfl/PMLaks_opt/Regression/RegressionModel.py", line 353, in Exact_gp                                                                                                             
    checkpoint_size = find_best_gpu_setting(train_x, train_y,                                                                                                                                               
  File "/cluster/home/niclasfl/PMLaks_opt/Regression/RegressionModel.py", line 330, in find_best_gpu_setting                                                                                                
    _, _ = train(train_x, train_y,
  File "/cluster/home/niclasfl/PMLaks_opt/Regression/RegressionModel.py", line 293, in train
    loss = closure()
  File "/cluster/home/niclasfl/PMLaks_opt/Regression/RegressionModel.py", line 288, in closure
    loss = -mll(output, train_y)
  File "/cluster/home/niclasfl/.local/lib/python3.8/site-packages/gpytorch/module.py", line 31, in __call__                                                                                                
    outputs = self.forward(*inputs, **kwargs)
  File "/cluster/home/niclasfl/.local/lib/python3.8/site-packages/gpytorch/mlls/exact_marginal_log_likelihood.py", line 64, in forward                                                                     
    res = output.log_prob(target)
  File "/cluster/home/niclasfl/.local/lib/python3.8/site-packages/gpytorch/distributions/multivariate_normal.py", line 193, in log_prob                                                                    
    inv_quad, logdet = covar.inv_quad_logdet(inv_quad_rhs=diff.unsqueeze(-1), logdet=True)
  File "/cluster/home/niclasfl/.local/lib/python3.8/site-packages/linear_operator/operators/_linear_operator.py", line 1642, in inv_quad_logdet                                                            
    preconditioner, precond_lt, logdet_p = self._preconditioner()
  File "/cluster/home/niclasfl/.local/lib/python3.8/site-packages/linear_operator/operators/added_diag_linear_operator.py", line 114, in _preconditioner                                                   
    self._piv_chol_self = self._linear_op.pivoted_cholesky(rank=max_iter)
  File "/cluster/home/niclasfl/.local/lib/python3.8/site-packages/linear_operator/operators/_linear_operator.py", line 1850, in pivoted_cholesky                                                           
    res, pivots = func(self.representation_tree(), rank, error_tol, *self.representation())
  File "/cluster/home/niclasfl/.local/lib/python3.8/site-packages/linear_operator/functions/_pivoted_cholesky.py", line 72, in forward                                                                     
    row = apply_permutation(matrix, pi_m.unsqueeze(-1), right_permutation=None).squeeze(-2)
  File "/cluster/home/niclasfl/.local/lib/python3.8/site-packages/linear_operator/utils/permutation.py", line 79, in apply_permutation                                                                     
    return to_dense(matrix.__getitem__((*batch_idx, left_permutation.unsqueeze(-1), right_permutation.unsqueeze(-2))))                                                                                     
  File "/cluster/home/niclasfl/.local/lib/python3.8/site-packages/gpytorch/lazy/lazy_evaluated_kernel_tensor.py", line 25, in wrapped                                                                      
    output = method(self, *args, **kwargs)
  File "/cluster/home/niclasfl/.local/lib/python3.8/site-packages/gpytorch/lazy/lazy_evaluated_kernel_tensor.py", line 426, in __getitem__                                                                 
    return super().__getitem__(index)
  File "/cluster/home/niclasfl/.local/lib/python3.8/site-packages/linear_operator/operators/_linear_operator.py", line 2692, in __getitem__                                                                
    res = self._get_indices(new_row_index, new_col_index, *new_batch_indices)
  File "/cluster/home/niclasfl/.local/lib/python3.8/site-packages/linear_operator/operators/_linear_operator.py", line 422, in _get_indices
    InterpolatedLinearOperator(
  File "/cluster/home/niclasfl/.local/lib/python3.8/site-packages/linear_operator/utils/memoize.py", line 59, in g
    return _add_to_cache(self, cache_name, method(self, *args, **kwargs), *args, kwargs_pkl=kwargs_pkl)
  File "/cluster/home/niclasfl/.local/lib/python3.8/site-packages/linear_operator/operators/_linear_operator.py", line 2461, in to_dense
    res = self.matmul(eye)
  File "/cluster/home/niclasfl/.local/lib/python3.8/site-packages/linear_operator/operators/interpolated_linear_operator.py", line 435, in matmul
    base_res = self.base_linear_op.matmul(right_interp_res)
File "/cluster/home/niclasfl/.local/lib/python3.8/site-packages/linear_operator/operators/_linear_operator.py", line 1722, in matmul
    return Matmul.apply(self.representation_tree(), other, *self.representation())
  File "/cluster/home/niclasfl/.local/lib/python3.8/site-packages/linear_operator/functions/_matmul.py", line 21, in forward
    res = linear_op._matmul(rhs)
  File "/cluster/home/niclasfl/.local/lib/python3.8/site-packages/gpytorch/lazy/lazy_evaluated_kernel_tensor.py", line 264, in _matmul
    self.kernel(
  File "/cluster/home/niclasfl/.local/lib/python3.8/site-packages/gpytorch/kernels/kernel.py", line 524, in __call__
    super(Kernel, self).__call__(x1_, x2_, last_dim_is_batch=last_dim_is_batch, **params)
  File "/cluster/home/niclasfl/.local/lib/python3.8/site-packages/gpytorch/module.py", line 31, in __call__
    outputs = self.forward(*inputs, **kwargs)
  File "/cluster/home/niclasfl/.local/lib/python3.8/site-packages/gpytorch/kernels/multi_device_kernel.py", line 64, in forward
    inputs = tuple((x1_[0], x2_) for x1_, x2_ in zip(self._x1_scattered, self._x2_subs))
  File "/cluster/home/niclasfl/.local/lib/python3.8/site-packages/gpytorch/kernels/multi_device_kernel.py", line 64, in <genexpr>
    inputs = tuple((x1_[0], x2_) for x1_, x2_ in zip(self._x1_scattered, self._x2_subs))
IndexError: tuple index out of range

When I try to run it on the full set with ~100.000 points, I get this error message:

Traceback (most recent call last):
File "main_regression.py", line 192, in
train_model()
File "main_regression.py", line 181, in train_model
torch_model = reg_model.Exact_gp(x_train, y_train, x_test, y_test)
File "/cluster/home/niclasfl/PMLaks_opt/Regression/RegressionModel.py", line 358, in Exact_gp
model, likelihood = train(train_x, train_y,
File "/cluster/home/niclasfl/PMLaks_opt/Regression/RegressionModel.py", line 293, in train
loss = closure()
File "/cluster/home/niclasfl/PMLaks_opt/Regression/RegressionModel.py", line 288, in closure
loss = -mll(output, train_y)
File "/cluster/home/niclasfl/.local/lib/python3.8/site-packages/gpytorch/module.py", line 31, in call
outputs = self.forward(*inputs, **kwargs)
File "/cluster/home/niclasfl/.local/lib/python3.8/site-packages/gpytorch/mlls/exact_marginal_log_likelihood.py", line 64, in forward
res = output.log_prob(target)
File "/cluster/home/niclasfl/.local/lib/python3.8/site-packages/gpytorch/distributions/multivariate_normal.py", line 193, in log_prob
inv_quad, logdet = covar.inv_quad_logdet(inv_quad_rhs=diff.unsqueeze(-1), logdet=True)
File "/cluster/home/niclasfl/.local/lib/python3.8/site-packages/linear_operator/operators/_linear_operator.py", line 1642, in inv_quad_logdet
preconditioner, precond_lt, logdet_p = self._preconditioner()
File "/cluster/home/niclasfl/.local/lib/python3.8/site-packages/linear_operator/operators/added_diag_linear_operator.py", line 114, in _preconditioner
self._piv_chol_self = self._linear_op.pivoted_cholesky(rank=max_iter)
File "/cluster/home/niclasfl/.local/lib/python3.8/site-packages/linear_operator/operators/_linear_operator.py", line 1850, in pivoted_cholesky
res, pivots = func(self.representation_tree(), rank, error_tol, *self.representation())
File "/cluster/home/niclasfl/.local/lib/python3.8/site-packages/linear_operator/functions/_pivoted_cholesky.py", line 72, in forward
row = apply_permutation(matrix, pi_m.unsqueeze(-1), right_permutation=None).squeeze(-2)
File "/cluster/home/niclasfl/.local/lib/python3.8/site-packages/linear_operator/utils/permutation.py", line 79, in apply_permutation
return to_dense(matrix.getitem((*batch_idx, left_permutation.unsqueeze(-1), right_permutation.unsqueeze(-2))))
File "/cluster/home/niclasfl/.local/lib/python3.8/site-packages/gpytorch/lazy/lazy_evaluated_kernel_tensor.py", line 25, in wrapped
output = method(self, *args, **kwargs)
File "/cluster/home/niclasfl/.local/lib/python3.8/site-packages/gpytorch/lazy/lazy_evaluated_kernel_tensor.py", line 426, in getitem
return super().getitem(index)
File "/cluster/home/niclasfl/.local/lib/python3.8/site-packages/linear_operator/operators/_linear_operator.py", line 2692, in getitem
res = self._get_indices(new_row_index, new_col_index, *new_batch_indices)
File "/cluster/home/niclasfl/.local/lib/python3.8/site-packages/linear_operator/operators/_linear_operator.py", line 407, in _get_indices
base_linear_op = self._getitem(_noop_index, _noop_index, *batch_indices)._expand_batch(final_shape)
File "/cluster/home/niclasfl/.local/lib/python3.8/site-packages/linear_operator/operators/_linear_operator.py", line 380, in _expand_batch
return self.repeat(*batch_repeat, 1, 1)
File "/cluster/home/niclasfl/.local/lib/python3.8/site-packages/gpytorch/lazy/lazy_evaluated_kernel_tensor.py", line 25, in wrapped
output = method(self, *args, **kwargs)
File "/cluster/home/niclasfl/.local/lib/python3.8/site-packages/gpytorch/lazy/lazy_evaluated_kernel_tensor.py", line 381, in repeat
x2 = self.x2.repeat(*batch_repeat, col_repeat, 1)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 28.09 GiB (GPU 0; 39.44 GiB total capacity; 28.12 GiB already allocated; 10.77 GiB free; 28.12 GiB reserved in total by PyTorch) If reser
ved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

And it doesn't matter how big the partition is. I hope someone can help me with this allocation problem of the memory usage.
Thanks!

gpleiss · 2023-05-26T14:48:20Z

gpleiss
May 26, 2023
Maintainer

@NicPy4 does it distribute the data when you don't use checkpointing? We will be deprecating the checkpointing feature in favor of KeOps.

3 replies

nicmig May 26, 2023
Author

@gpleiss First of all thank you for your reply. I commented out the 'find_best_gpu_setting' function and the 'gpytorch.beta_features.checkpoint_kernel(checkpoint_size)' to get rid of the checkpointing feature. I hope that I got it right and that was what you meant. If I try this then my process just gets killed. Even if I cut my data size to 10000 points or 1000 points. However, if I use SVGP with 1000 or more inducing points that is no problem. I know that the computational complexity is different between exact GP and SVGP but still, I wonder why it is not possible for me.

gpleiss May 26, 2023
Maintainer

By "killed" do you mean run out of memory? On a single GPU, you should be able to run up to 10000 data points.

nicmig May 31, 2023
Author

The terminal just displayed 'killed' so I suggest it is because it is out of memory, however I decided now to use the KeOps you suggested and try to implement the SpectralMixture Kernel next as we discussed here #1448 . Thanks for your feedback and help on this one!

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

MultiDeviceKernel is not distributing the memory usage #2319

Uh oh!

{{title}}

Uh oh!

Replies: 1 comment 3 replies

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{editor}}'s edit

{{editor}}'s edit

Uh oh!

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{title}}

Uh oh!

Select a reply

Uh oh!

MultiDeviceKernel is not distributing the memory usage #2319

Uh oh!

nicmig Apr 10, 2023

Replies: 1 comment · 3 replies

Uh oh!

gpleiss May 26, 2023 Maintainer

Uh oh!

Uh oh!

nicmig May 26, 2023 Author

Uh oh!

gpleiss May 26, 2023 Maintainer

Uh oh!

nicmig May 31, 2023 Author

nicmig
Apr 10, 2023

Replies: 1 comment 3 replies

gpleiss
May 26, 2023
Maintainer

nicmig May 26, 2023
Author

gpleiss May 26, 2023
Maintainer

nicmig May 31, 2023
Author