Description:
I encountered a RuntimeError: Cuda Error: an illegal memory access was encountered (/project/implicit/gpu/als.cu:196) while fitting an implicit.gpu ALS model on a large dataset. The error may be related to memory handling in the GPU solver or to a CUDA library compatibility issue.
System Information:
Dataset size:
Number of users: 50 million
Number of items: 360,000
GPU: NVIDIA A100 (40 GB)
Memory Usage: Approximately 13,943 MiB / 40,960 MiB
CUDA Version: 12.4
Library Versions:
implicit: 0.7.2 (latest)
torch: 2.5.1
Issue Details: When running the model, the following error occurs:
RuntimeError: Cuda Error: an illegal memory access was encountered (/project/implicit/gpu/als.cu:196)
Traceback (excerpt):
  model.fit(weighted_matrix)
  self.solver.least_squares(Cui, X, _YtY, Y, self.cg_steps)
The error occurs consistently on my large dataset, even though the GPU has plenty of free memory (about 13,943 MiB used out of 40,960 MiB). I have attempted the following troubleshooting steps:
Restarted the kernel to clear any lingering memory states.
Checked that CUDA version 12.4 is compatible with the library requirements.
Verified no conflicting paths for CUDA libraries in LD_LIBRARY_PATH.
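One more diagnostic that may help narrow this down (a sketch, not part of the original report): CUDA kernel launches are asynchronous, so the line reported in the error (als.cu:196) may not be where the bad access actually happens. Setting CUDA_LAUNCH_BLOCKING forces synchronous launches so the failing kernel is reported at its true call site. It must be set before any CUDA context is created, i.e. before importing implicit or torch:

```python
import os

# Force synchronous CUDA kernel launches so the error surfaces at the
# kernel that actually faulted. Set this before importing implicit/torch.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

# import implicit  # import only after the environment variable is set
```

Running the script under NVIDIA's compute-sanitizer tool would give an even more precise location for the illegal access.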
Steps to Reproduce:
Set up a dataset with 50 million users and 360,000 items.
Run implicit.gpu.als on this dataset.
Monitor GPU memory usage and error occurrence.
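For reference, a scaled-down sketch of the setup (the matrix shapes, density, and factors value here are assumptions for illustration; the fit call is commented out because it requires a CUDA GPU — scale n_users toward 50 million to reproduce the failure):

```python
import numpy as np
import scipy.sparse as sp

# Hypothetical scaled-down stand-in for the real interaction data.
rng = np.random.default_rng(0)
n_users, n_items, nnz = 100_000, 3_600, 1_000_000

rows = rng.integers(0, n_users, nnz)
cols = rng.integers(0, n_items, nnz)
vals = rng.random(nnz).astype(np.float32)

# implicit 0.7.x expects a user-by-item CSR matrix for fit().
user_items = sp.csr_matrix((vals, (rows, cols)), shape=(n_users, n_items))

# from implicit.als import AlternatingLeastSquares
# model = AlternatingLeastSquares(factors=128, use_gpu=True)  # factors assumed
# model.fit(user_items)  # fails with the illegal-memory-access error at full scale
```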
Expected Behavior: The model should train successfully on the A100 GPU without a CUDA error.
Actual Behavior: The CUDA error interrupts training, and the model cannot proceed.
Additional Notes: This issue may be related to how very large datasets are handled (e.g., index sizes) or to CUDA 12.4 compatibility with implicit.gpu. Any insights on possible fixes or workarounds would be greatly appreciated!
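One hypothesis worth checking (my speculation, not confirmed by the library): if any part of the GPU solver uses 32-bit indexing, a 50-million-user factor matrix overflows the int32 element-count range even at modest factor sizes. A back-of-envelope check, with the factors value assumed:

```python
# Element counts from this report; "factors" is an assumed hyperparameter.
INT32_MAX = 2**31 - 1  # 2,147,483,647

n_users = 50_000_000
factors = 128  # assumption, not stated in the report

user_factor_elems = n_users * factors  # elements in the user factor matrix
overflows_int32 = user_factor_elems > INT32_MAX
print(overflows_int32)  # True: 6.4e9 elements exceed the int32 range
```

If that is the cause, reducing factors would not help much (even factors=64 gives 3.2 billion elements); training on a user subsample or on CPU would be the workaround until the indexing is widened.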