Description:
I encountered a RuntimeError: Cuda Error: an illegal memory access was encountered (/project/implicit/gpu/als.cu:196) while fitting an implicit.gpu ALS model on a large dataset. The error may be related to memory handling in the GPU solver or to a CUDA library compatibility issue.
System Information:
Dataset size:
Number of users: 50 million
Number of items: 360,000
GPU: NVIDIA A100 (40 GB)
Memory Usage: Approximately 13,943 MiB / 40,960 MiB
CUDA Version: 12.4
Library Versions:
implicit: 0.7.2 (latest)
torch: 2.5.1
Issue Details: When running the model, the following error occurs:
RuntimeError: Cuda Error: an illegal memory access was encountered (/project/implicit/gpu/als.cu:196)
Traceback (excerpt):
  model.fit(weighted_matrix)
  self.solver.least_squares(Cui, X, _YtY, Y, self.cg_steps)
The error occurs consistently on my large dataset, even though the GPU has plenty of free memory (about 13,943 MiB used out of 40,960 MiB). I have attempted the following troubleshooting steps:
Restarted the kernel to clear any lingering memory states.
Checked that CUDA version 12.4 is compatible with the library requirements.
Verified no conflicting paths for CUDA libraries in LD_LIBRARY_PATH.
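One more diagnostic that may help narrow this down (a sketch, not part of the original report): CUDA kernel launches are asynchronous, so the line reported in the error (als.cu:196) may not be where the bad access actually happens. Setting CUDA_LAUNCH_BLOCKING forces synchronous launches so the failing kernel is reported at its true call site. It must be set before any CUDA context is created, i.e. before importing implicit or torch:

```python
import os

# Force synchronous CUDA kernel launches so the error surfaces at the
# kernel that actually faulted. Set this before importing implicit/torch.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

# import implicit  # import only after the environment variable is set
```

Running the script under NVIDIA's compute-sanitizer tool would give an even more precise location for the illegal access.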
Steps to Reproduce:
Set up a dataset with 50 million users and 360,000 items.
Run implicit.gpu.als on this dataset.
Monitor GPU memory usage and error occurrence.
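For reference, a scaled-down sketch of the setup (the matrix shapes, density, and factors value here are assumptions for illustration; the fit call is commented out because it requires a CUDA GPU — scale n_users toward 50 million to reproduce the failure):

```python
import numpy as np
import scipy.sparse as sp

# Hypothetical scaled-down stand-in for the real interaction data.
rng = np.random.default_rng(0)
n_users, n_items, nnz = 100_000, 3_600, 1_000_000

rows = rng.integers(0, n_users, nnz)
cols = rng.integers(0, n_items, nnz)
vals = rng.random(nnz).astype(np.float32)

# implicit 0.7.x expects a user-by-item CSR matrix for fit().
user_items = sp.csr_matrix((vals, (rows, cols)), shape=(n_users, n_items))

# from implicit.als import AlternatingLeastSquares
# model = AlternatingLeastSquares(factors=128, use_gpu=True)  # factors assumed
# model.fit(user_items)  # fails with the illegal-memory-access error at full scale
```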
Expected Behavior: The model should train successfully on the A100 GPU without a CUDA error.
Actual Behavior: The CUDA error interrupts training, and the model cannot proceed.
Additional Notes: This issue may be related to how very large datasets are handled (e.g., index sizes) or to CUDA 12.4 compatibility with implicit.gpu. Any insights on possible fixes or workarounds would be greatly appreciated!
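One hypothesis worth checking (my speculation, not confirmed by the library): if any part of the GPU solver uses 32-bit indexing, a 50-million-user factor matrix overflows the int32 element-count range even at modest factor sizes. A back-of-envelope check, with the factors value assumed:

```python
# Element counts from this report; "factors" is an assumed hyperparameter.
INT32_MAX = 2**31 - 1  # 2,147,483,647

n_users = 50_000_000
factors = 128  # assumption, not stated in the report

user_factor_elems = n_users * factors  # elements in the user factor matrix
overflows_int32 = user_factor_elems > INT32_MAX
print(overflows_int32)  # True: 6.4e9 elements exceed the int32 range
```

If that is the cause, reducing factors would not help much (even factors=64 gives 3.2 billion elements); training on a user subsample or on CPU would be the workaround until the indexing is widened.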