Description
I encountered a CUDA illegal memory access error when calling torchsort.soft_rank during parallel training on the GPU. The error message is as follows:
File "/home/xxx/anaconda3/envs/DL2/lib/python3.10/site-packages/torchsort/ops.py", line 121, in backward
).gather(1, inv_permutation)
RuntimeError: CUDA error: an illegal memory access was encountered
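For what it's worth, CUDA errors are raised asynchronously, so the gather call in the traceback may not be the operation that actually faulted. Below is a minimal sketch of how the script could be rerun with synchronous kernel launches to localize the failure; this debugging step is my own addition, not part of the original run, and the environment variable must be set before CUDA is initialized.
import os

# Force synchronous CUDA kernel launches so the error is raised at the faulting op
# (slows training noticeably; for debugging only)
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

import torch
import torchsort  # the rest of the training script is unchanged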
However, when I moved the torchsort.soft_rank call to the CPU while keeping the rest of the code on the GPU, the error disappeared.
For example, when I modify the code like this:
# Move the input to CPU for soft_rank, then move the result back to the original device
if pred_2d.device.type != 'cpu':
    pred_2d = pred_2d.to('cpu')
rank_2d = torchsort.soft_rank(pred_2d, regularization_strength=0.1)
rank_2d = rank_2d.to(y_true.device)
The error is resolved. But if I run the call directly on the GPU tensor:
rank_2d = torchsort.soft_rank(pred_2d, regularization_strength=0.1)
The error occurs again.
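For completeness, here is the CPU-fallback workaround as a small self-contained helper; the function name soft_rank_cpu_fallback and the example shapes are only illustrative and not part of torchsort's API.
import torch
import torchsort

def soft_rank_cpu_fallback(values, regularization_strength=0.1):
    # Run soft_rank on CPU and return the ranks on the input's original device.
    # The .cpu()/.to() moves are differentiable, so gradients still flow back to the GPU tensor.
    original_device = values.device
    if original_device.type != "cpu":
        values = values.cpu()
    ranks = torchsort.soft_rank(values, regularization_strength=regularization_strength)
    return ranks.to(original_device)

# Illustrative usage with random scores (batch of 8 rows, 16 items each)
pred_2d = torch.randn(8, 16, requires_grad=True,
                      device="cuda" if torch.cuda.is_available() else "cpu")
rank_2d = soft_rank_cpu_fallback(pred_2d, regularization_strength=0.1)
rank_2d.sum().backward()  # gradients reach pred_2d on its original device
This avoids the crash in my case, but of course it forces a GPU-to-CPU round trip on every call, so I would prefer a fix that lets soft_rank run on CUDA directly.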
Could you provide guidance on how to resolve this issue when using torchsort.soft_rank with CUDA?
Thank you so much!