
Memory Issue with torchsort.soft_rank on CUDA #84

@liuquant

Description


I encountered a CUDA illegal-memory-access error when calling torchsort.soft_rank during GPU training. The error message is as follows:
File "/home/xxx/anaconda3/envs/DL2/lib/python3.10/site-packages/torchsort/ops.py", line 121, in backward
).gather(1, inv_permutation)
RuntimeError: CUDA error: an illegal memory access was encountered

However, when I switched the code to run torchsort.soft_rank on the CPU, while keeping the other parts of the code on the GPU, the error disappeared.

For example, when I modify the code like this:
if pred_2d.device.type != 'cpu':
    pred_2d = pred_2d.to('cpu')
rank_2d = torchsort.soft_rank(pred_2d, regularization_strength=0.1)
rank_2d = rank_2d.to(y_true.device)

the error is resolved. But if I run the op directly on the GPU tensor:
rank_2d = torchsort.soft_rank(pred_2d, regularization_strength=0.1)
the error occurs again.
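For anyone hitting the same crash, the CPU round-trip above can be factored into a small wrapper so it doesn't have to be repeated at every call site. This is only a sketch of the workaround pattern, not a fix for the underlying CUDA bug; it is written duck-typed (anything exposing .device and .to(), like torch.Tensor, works), and the torchsort usage shown in the comment assumes torchsort is installed.

```python
def with_cpu_fallback(fn):
    """Wrap `fn` so its first tensor argument is moved to CPU before the
    call and the result is moved back to the original device afterwards.
    Workaround for ops that raise illegal-memory-access errors on CUDA.
    Duck-typed: works with any object exposing `.device` and `.to()`."""
    def wrapper(x, *args, **kwargs):
        device = x.device                    # remember the original device
        out = fn(x.to("cpu"), *args, **kwargs)  # run the op on a CPU copy
        return out.to(device)                # move the result back
    return wrapper

# With real torch tensors (assuming torchsort is installed):
#   soft_rank_cpu = with_cpu_fallback(torchsort.soft_rank)
#   rank_2d = soft_rank_cpu(pred_2d, regularization_strength=0.1)

# Minimal stand-in to demonstrate the device round-trip without torch:
class FakeTensor:
    def __init__(self, data, device):
        self.data, self.device = data, device
    def to(self, device):
        return FakeTensor(self.data, device)

@with_cpu_fallback
def double(t):
    # the wrapper guarantees we see a CPU tensor here
    return FakeTensor([v * 2 for v in t.data], t.device)

result = double(FakeTensor([1, 2, 3], "cuda:0"))
```

The same pattern applies to any other op that misbehaves on CUDA; the cost is one host-device copy in each direction per call.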

Could you provide guidance on how to resolve this issue when using torchsort.soft_rank with CUDA?
Thank you so much!
