-
@NicPy4 does it distribute the data when you don't use checkpointing? We will be deprecating the checkpointing feature in favor of KeOps.
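For reference, the knob in that tutorial is the checkpointing context manager wrapped around the loss computation; a rough sketch with the tutorial's parameter names (checkpoint_size=0 disables checkpointing entirely):

```python
import gpytorch

# Sketch of the tutorial's pattern (not the exact code): checkpoint_size controls
# the kernel partition size, and checkpoint_size=0 turns checkpointing off.
def closure(model, mll, train_x, train_y, checkpoint_size, preconditioner_size):
    with gpytorch.beta_features.checkpoint_kernel(checkpoint_size), \
            gpytorch.settings.max_preconditioner_size(preconditioner_size):
        output = model(train_x)
        return -mll(output, train_y)
```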
-
I am trying to use the MultiDeviceKernel for a time-series forecast. My data has ~100,000 samples and one input feature. To start with, I just used the example from the GPyTorch repository for ExactGP with MultiDeviceKernel (https://docs.gpytorch.ai/en/latest/examples/02_Scalable_Exact_GPs/Simple_MultiGPU_GP_Regression.html), but instead of the protein data I used my own data. I have 8 GPUs (NVIDIA A100 40 GB or NVIDIA Tesla V100 32 GB) on a supercomputer at my university.
According to the paper linked in the example, running this setup with the provided code should be no problem. However, I always run into 'CUDA out of memory' errors, and more specifically they are caused by imbalanced memory usage. When I watch memory usage with 'nvidia-smi -l', I can see that GPU 0 is at ~99% utilization while the rest sit at around 30-40%, which eventually makes the run crash. This is my code:
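(Condensed to the model definition; the training loop follows the tutorial, with my own train_x/train_y in place of the protein data, and n_devices/output_device set up as in that example.)

```python
import gpytorch

class ExactGPModel(gpytorch.models.ExactGP):
    def __init__(self, train_x, train_y, likelihood, n_devices, output_device):
        super().__init__(train_x, train_y, likelihood)
        self.mean_module = gpytorch.means.ConstantMean()
        base_covar_module = gpytorch.kernels.ScaleKernel(gpytorch.kernels.RBFKernel())
        # MultiDeviceKernel scatters the kernel computation across the given devices
        self.covar_module = gpytorch.kernels.MultiDeviceKernel(
            base_covar_module,
            device_ids=range(n_devices),
            output_device=output_device,
        )

    def forward(self, x):
        mean_x = self.mean_module(x)
        covar_x = self.covar_module(x)
        return gpytorch.distributions.MultivariateNormal(mean_x, covar_x)
```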
To test the program, I run it on only 4 GPUs with less data (half-hourly averages instead of a data point every 5 minutes). The memory usage while running the program with ~18,000 training samples and a kernel partition of 29 looks like this:
and I get this error message:
When I try to run it on the full set with ~100,000 points, I get this error message:
Traceback (most recent call last):
  File "main_regression.py", line 192, in <module>
    train_model()
  File "main_regression.py", line 181, in train_model
    torch_model = reg_model.Exact_gp(x_train, y_train, x_test, y_test)
  File "/cluster/home/niclasfl/PMLaks_opt/Regression/RegressionModel.py", line 358, in Exact_gp
    model, likelihood = train(train_x, train_y,
  File "/cluster/home/niclasfl/PMLaks_opt/Regression/RegressionModel.py", line 293, in train
    loss = closure()
  File "/cluster/home/niclasfl/PMLaks_opt/Regression/RegressionModel.py", line 288, in closure
    loss = -mll(output, train_y)
  File "/cluster/home/niclasfl/.local/lib/python3.8/site-packages/gpytorch/module.py", line 31, in __call__
    outputs = self.forward(*inputs, **kwargs)
  File "/cluster/home/niclasfl/.local/lib/python3.8/site-packages/gpytorch/mlls/exact_marginal_log_likelihood.py", line 64, in forward
    res = output.log_prob(target)
  File "/cluster/home/niclasfl/.local/lib/python3.8/site-packages/gpytorch/distributions/multivariate_normal.py", line 193, in log_prob
    inv_quad, logdet = covar.inv_quad_logdet(inv_quad_rhs=diff.unsqueeze(-1), logdet=True)
  File "/cluster/home/niclasfl/.local/lib/python3.8/site-packages/linear_operator/operators/_linear_operator.py", line 1642, in inv_quad_logdet
    preconditioner, precond_lt, logdet_p = self._preconditioner()
  File "/cluster/home/niclasfl/.local/lib/python3.8/site-packages/linear_operator/operators/added_diag_linear_operator.py", line 114, in _preconditioner
    self._piv_chol_self = self._linear_op.pivoted_cholesky(rank=max_iter)
  File "/cluster/home/niclasfl/.local/lib/python3.8/site-packages/linear_operator/operators/_linear_operator.py", line 1850, in pivoted_cholesky
    res, pivots = func(self.representation_tree(), rank, error_tol, *self.representation())
  File "/cluster/home/niclasfl/.local/lib/python3.8/site-packages/linear_operator/functions/_pivoted_cholesky.py", line 72, in forward
    row = apply_permutation(matrix, pi_m.unsqueeze(-1), right_permutation=None).squeeze(-2)
  File "/cluster/home/niclasfl/.local/lib/python3.8/site-packages/linear_operator/utils/permutation.py", line 79, in apply_permutation
    return to_dense(matrix.__getitem__((*batch_idx, left_permutation.unsqueeze(-1), right_permutation.unsqueeze(-2))))
  File "/cluster/home/niclasfl/.local/lib/python3.8/site-packages/gpytorch/lazy/lazy_evaluated_kernel_tensor.py", line 25, in wrapped
    output = method(self, *args, **kwargs)
  File "/cluster/home/niclasfl/.local/lib/python3.8/site-packages/gpytorch/lazy/lazy_evaluated_kernel_tensor.py", line 426, in __getitem__
    return super().__getitem__(index)
  File "/cluster/home/niclasfl/.local/lib/python3.8/site-packages/linear_operator/operators/_linear_operator.py", line 2692, in __getitem__
    res = self._get_indices(new_row_index, new_col_index, *new_batch_indices)
  File "/cluster/home/niclasfl/.local/lib/python3.8/site-packages/linear_operator/operators/_linear_operator.py", line 407, in _get_indices
    base_linear_op = self._getitem(_noop_index, _noop_index, *batch_indices)._expand_batch(final_shape)
  File "/cluster/home/niclasfl/.local/lib/python3.8/site-packages/linear_operator/operators/_linear_operator.py", line 380, in _expand_batch
    return self.repeat(*batch_repeat, 1, 1)
  File "/cluster/home/niclasfl/.local/lib/python3.8/site-packages/gpytorch/lazy/lazy_evaluated_kernel_tensor.py", line 25, in wrapped
    output = method(self, *args, **kwargs)
  File "/cluster/home/niclasfl/.local/lib/python3.8/site-packages/gpytorch/lazy/lazy_evaluated_kernel_tensor.py", line 381, in repeat
    x2 = self.x2.repeat(*batch_repeat, col_repeat, 1)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 28.09 GiB (GPU 0; 39.44 GiB total capacity; 28.12 GiB already allocated; 10.77 GiB free; 28.12 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
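For what it's worth, the hint at the end of that message refers to the PYTORCH_CUDA_ALLOC_CONF environment variable; as far as I understand, it has to be set before the first CUDA allocation, e.g. (the value here is an arbitrary guess):

```python
import os

# Allocator hint from the error message; must be set before torch allocates any
# CUDA memory (safest: before importing torch). The 512 MB value is just a guess.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:512"

import torch  # imported after setting the variable on purpose
```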
And it doesn't matter how large I make the partition. I hope someone can help me with this uneven memory allocation problem.
Thanks!