add cuda debugging blogpost #66
Conversation
Deploying vllm-blog-source with Cloudflare Pages

- Latest commit: e77d1bb
- Status: ✅ Deploy successful!
- Preview URL: https://24502232.vllm-blog-source.pages.dev
- Branch Preview URL: https://cuda-debugging.vllm-blog-source.pages.dev
`_posts/2025-08-11-cuda-debugging.md` (outdated)
> Additionally, the documentation states that enabling `CUDA_ENABLE_COREDUMP_ON_EXCEPTION=1` not only enables CUDA core dump but also generates a CPU coredump by default. However, in practice, we find that the CPU coredump contains little useful information and is difficult to analyze.

> If you want live data for debugging, you can also enable the `CUDA_DEVICE_WAITS_ON_EXCEPTION=1` environment variable, which does not use CUDA core dump, but stops GPU execution immediately when an exception occurs and hangs there, waiting for the user to attach a debugger (like cuda-gdb) to inspect the GPU state, where the full GPU memory is still intact. However, this approach is less automatic and requires more manual intervention.
nice! happy to see this pointed out too; super useful!
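For anyone who wants to try this locally, here is a minimal sketch (hypothetical script, not taken from the post) of how the core-dump flow might look. It assumes a CUDA GPU, a driver recent enough to support GPU core dumps, and that `cuda-gdb` is installed; `CUDA_COREDUMP_FILE` is an extra, optional variable used here only to pick the dump location.

```python
# Hypothetical repro sketch: trigger a device-side fault with CUDA core dumps
# enabled. The env vars must be set before the CUDA context is created,
# i.e. before the first CUDA call in the process.
import os

os.environ["CUDA_ENABLE_COREDUMP_ON_EXCEPTION"] = "1"
# Optional (assumption): control where the GPU core dump is written; %p expands to the PID.
os.environ["CUDA_COREDUMP_FILE"] = "/tmp/gpu_core.%p"

import torch

x = torch.arange(10, device="cuda")
bad_idx = torch.tensor([100], device="cuda")  # out of bounds -> device-side assert
print(x[bad_idx])                             # faulting kernel; a GPU core dump should be written
```

The resulting dump can then be opened in cuda-gdb with `target cudacore /tmp/gpu_core.<pid>`. With `CUDA_DEVICE_WAITS_ON_EXCEPTION=1` instead, the process should hang at the fault and you can attach live with `cuda-gdb -p <pid>`.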
`_posts/2025-08-11-cuda-debugging.md` (outdated)
```
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
```

The error message suggests adding `CUDA_LAUNCH_BLOCKING=1` when running the code. However, there are still two problems:
Suggested change (replacing "The error message suggests adding `CUDA_LAUNCH_BLOCKING=1` when running the code. However, there are still two problems:"):

The challenging bit here is:

> CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.

In our experience the Python stack traces for these types of exceptions are basically always incorrect and pretty worthless. To resolve this, the error message suggests adding `CUDA_LAUNCH_BLOCKING=1` when running the code. However, there are still two problems:
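To make the asynchrony concrete, here is a tiny repro sketch (hypothetical, not from the post; assumes a CUDA GPU) showing why the default traceback tends to point at the wrong place and what `CUDA_LAUNCH_BLOCKING=1` changes.

```python
# Hypothetical sketch: the indexing kernel below faults, but because launches
# are asynchronous the error typically surfaces at a later, unrelated call,
# so the Python traceback blames the wrong line.
import torch

def bad_kernel_launch():
    x = torch.arange(10, device="cuda")
    idx = torch.tensor([100], device="cuda")   # out of bounds
    return x[idx]                              # async launch; usually no error raised here

def unrelated_work(y):
    return (y * 2).sum()

out = bad_kernel_launch()
# Without CUDA_LAUNCH_BLOCKING=1, the device-side assert is usually reported
# around here, when something finally synchronizes with the GPU.
print(unrelated_work(out))

# Re-running the same script with CUDA_LAUNCH_BLOCKING=1 set in the environment
# makes each launch synchronize, so the traceback points into bad_kernel_launch().
```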
added in 87d7ddb
Thank you so much for doing this! This is amazing! Coredumps are so unbelievably useful, so it's super nice to have a doc like this to point to.
Thanks for helping to improve the doc! Added you at the end.