I am getting this internal error:
RuntimeError: INTERNAL: Failed to launch CUDA kernel: add_332 with block dimensions: 1x1x1 and grid dimensions: 1x1x1: CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered
while trying to run complex code that consists of many nested scan and vmap calls. The error depends on the batch size: for smaller batches everything works, for large batches I get the error shown above, and for medium batches I get this:
My question is: am I right that this is an internal JAX bug that is not caused by my code?
If so, is there any information I can provide to help fix this bug? I won't be writing a reproduction snippet, because it is a huge pain in the ass (the bug's reproducibility is SUPER dependent on small code changes; I've had many situations where a cusolver exception appeared with CUDA installed from conda and did not appear with CUDA from other sources). Maybe there is some verbose mode in JAX that can shed light on what's going on?
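For anyone wanting to collect more information in a case like this, here is a minimal sketch of an environment for a more verbose run. The function name `diagnostic_env` and the dump directory are hypothetical. The variables themselves (`XLA_FLAGS` with `--xla_dump_to`, `CUDA_LAUNCH_BLOCKING`, `TF_CPP_MIN_LOG_LEVEL`, `XLA_PYTHON_CLIENT_PREALLOCATE`) are existing XLA/CUDA/JAX settings, but whether they reveal anything useful for this particular failure is an assumption.

```python
import os


def diagnostic_env(dump_dir, base_env=None):
    """Return a copy of the environment with extra debugging variables set.

    dump_dir is a hypothetical directory where XLA should write its HLO dumps.
    """
    env = dict(os.environ if base_env is None else base_env)

    # Keep any XLA flags that are already set and append the HLO dump flag.
    existing = env.get("XLA_FLAGS", "")
    env["XLA_FLAGS"] = (existing + " --xla_dump_to=" + dump_dir).strip()

    # Make CUDA kernel launches synchronous, so an illegal access is more
    # likely to be reported at the launch that caused it.
    env["CUDA_LAUNCH_BLOCKING"] = "1"

    # Show INFO-level messages from the C++ runtime underneath JAX.
    env["TF_CPP_MIN_LOG_LEVEL"] = "0"

    # Allocate GPU memory on demand instead of preallocating a large pool.
    env["XLA_PYTHON_CLIENT_PREALLOCATE"] = "false"
    return env
```

The result could be passed to a fresh process, for example `subprocess.run([sys.executable, "my_script.py"], env=diagnostic_env("/tmp/xla_dump"))`, where `my_script.py` stands in for the failing program. The dumped HLO files and the log output are the kind of material that can be attached to a report when a small reproduction is not feasible.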
Well, it turned out I can get a whole lot more errors from different "small code changes". Here they are:
2021-09-25 09:37:14.907063: E external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_driver.cc:1163] failed to enqueue async memcpy from device to host: CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered; host dst: 0x7fff2dc66c1f; GPU src: 0x7fe3ec786900; size: 1=0x1
RuntimeError: INTERNAL: Failed to complete all kernels launched on stream 0x55ec28edad50: stream did not block host until done; was already in an error state
RuntimeError: INTERNAL: Failed to load in-memory CUBIN: CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered
I forgot to mention that, of course, in CPU-only mode everything is perfectly fine.
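Since the failure depends on the batch size, one way to narrow it down without a full reproduction is to search for the smallest failing batch. The sketch below is only an illustration: `smallest_failing_batch` and `fails_in_subprocess` are hypothetical helpers, the script is assumed to take the batch size as its only argument and to exit non-zero on error, and failure is assumed to be monotonic in batch size (everything at or above some threshold fails). Each trial runs in a fresh process, because a process that has hit CUDA_ERROR_ILLEGAL_ADDRESS generally cannot keep using the GPU.

```python
import subprocess
import sys


def fails_in_subprocess(script, batch_size):
    """Run a (hypothetical) script in a fresh process; True if it exits non-zero."""
    result = subprocess.run([sys.executable, script, str(batch_size)])
    return result.returncode != 0


def smallest_failing_batch(fails, lo, hi):
    """Binary-search the smallest failing batch size in the range (lo, hi].

    Assumes fails(lo) is False, fails(hi) is True, and that failure is
    monotonic in between.
    """
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if fails(mid):
            hi = mid  # mid fails, so the threshold is at or below mid
        else:
            lo = mid  # mid works, so the threshold is above mid
    return hi
```

A call could look like `smallest_failing_batch(lambda b: fails_in_subprocess("my_script.py", b), 1, 4096)`, with the script name and bounds chosen to match the actual setup.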