RuntimeError: Cuda Error: an illegal memory access was encountered (/project/implicit/gpu/als.cu:196) #725

@vivekpandian08

Description:

I encountered a RuntimeError: Cuda Error: an illegal memory access was encountered (/project/implicit/gpu/als.cu:196) while running the implicit.gpu.als model on a large dataset. The error may be related to memory handling or CUDA library compatibility issues.

System Information:

Dataset size:
Number of users: 50 million
Number of items: 360,000

GPU: NVIDIA A100 (40 GB)
Memory Usage: Approximately 13,943 MiB / 40,960 MiB
CUDA Version: 12.4

Library Versions:
implicit: latest (0.7.2)
torch: 2.5.1

Issue Details: When running the model, the following error occurs:

RuntimeError: Cuda Error: an illegal memory access was encountered (/project/implicit/gpu/als.cu:196)
-->model.fit(weighted_matrix)
-->self.solver.least_squares(Cui, X, _YtY, Y, self.cg_steps)

This error happens consistently on my large dataset. The GPU has sufficient available memory (about 13,943 MiB is used out of 40,960 MiB). I have attempted the following troubleshooting steps:

Restarted the kernel to clear any lingering memory states.
Checked that CUDA version 12.4 is compatible with the library requirements.
Verified no conflicting paths for CUDA libraries in LD_LIBRARY_PATH.
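A further diagnostic that may help narrow this down, sketched here as a suggestion rather than something from the original report: CUDA kernel launches are asynchronous, so the file and line reported for an illegal memory access can point at a later synchronization check rather than the kernel that actually faulted. Setting the `CUDA_LAUNCH_BLOCKING` environment variable to `1` before any CUDA work starts forces synchronous launches, which usually makes the reported location more trustworthy. The commented-out `fit` call is the one from the traceback and is shown only to mark where training would go.

```python
import os

# Force synchronous kernel launches so the error surfaces at the failing call.
# This has to be set before the CUDA context is created, so it is safest to
# set it before importing implicit.gpu or torch.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

# The training call from the traceback would follow here, unchanged:
# model.fit(weighted_matrix)
```

Synchronous launches slow training down considerably, so this is only meant for a single debugging run.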

Steps to Reproduce:

Set up a dataset with 50 million users and 360,000 items.
Run implicit.gpu.als on this dataset.
Monitor GPU memory usage and error occurrence.

Expected Behavior: The model should train to completion on the A100 GPU without raising a CUDA error.

Actual Behavior: The CUDA error is raised during model.fit, and training cannot proceed.

Additional Notes: This issue may relate to handling large datasets or to CUDA 12.4 compatibility with implicit.gpu. Any insights on possible fixes or workarounds would be greatly appreciated!
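To make the "large dataset" hypothesis concrete, here is a minimal sketch that computes the size of the dense user and item factor matrices for the dimensions above. Two things in it are assumptions and not facts from this report: the number of factors (not stated here, so several hypothetical values are tried) and the idea that some index arithmetic on the GPU side might use signed 32-bit integers. The helper `factor_matrix_stats` is a made-up name for illustration, not part of implicit.

```python
INT32_MAX = 2**31 - 1  # largest value a signed 32-bit index can hold


def factor_matrix_stats(n_rows: int, factors: int, bytes_per_value: int = 4) -> dict:
    """Size of a dense (n_rows x factors) matrix, assuming float32 values by default."""
    elements = n_rows * factors
    return {
        "elements": elements,
        "mib": elements * bytes_per_value / 2**20,
        "exceeds_int32": elements > INT32_MAX,
    }


N_USERS = 50_000_000  # from the report
N_ITEMS = 360_000     # from the report

# The factor count is not given in the report; these values are hypothetical.
for factors in (32, 64, 128):
    for name, rows in (("users", N_USERS), ("items", N_ITEMS)):
        stats = factor_matrix_stats(rows, factors)
        print(
            f"factors={factors:4d} {name:5s} "
            f"elements={stats['elements']:>14,d} "
            f"size={stats['mib']:>10,.0f} MiB "
            f"exceeds_int32={stats['exceeds_int32']}"
        )
```

If the user factor matrix has more elements than a signed 32-bit integer can index, that would be one candidate explanation for an out-of-bounds access even when free GPU memory is plentiful. This is a guess to check against the kernel source, not a confirmed diagnosis.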