
add cuda debugging blogpost #66


Merged
merged 12 commits into main from cuda-debugging on Aug 13, 2025

Conversation

youkaichao (Member)

No description provided.

Signed-off-by: youkaichao <[email protected]>
Signed-off-by: youkaichao <[email protected]>
Signed-off-by: youkaichao <[email protected]>
Signed-off-by: youkaichao <[email protected]>
Signed-off-by: youkaichao <[email protected]>
Signed-off-by: youkaichao <[email protected]>

cloudflare-workers-and-pages bot commented Aug 12, 2025

Deploying vllm-blog-source with Cloudflare Pages

Latest commit: e77d1bb
Status: ✅  Deploy successful!
Preview URL: https://24502232.vllm-blog-source.pages.dev
Branch Preview URL: https://cuda-debugging.vllm-blog-source.pages.dev



> Additionally, the documentation states that enabling `CUDA_ENABLE_COREDUMP_ON_EXCEPTION=1` not only enables CUDA core dumps but also generates a CPU core dump by default. However, in practice, we find that the CPU core dump contains little useful information and is difficult to analyze.

> If you want live data for debugging, you can also set the `CUDA_DEVICE_WAITS_ON_EXCEPTION=1` environment variable. It does not use CUDA core dumps; instead, it stops GPU execution immediately when an exception occurs and hangs there, waiting for you to attach a debugger (such as cuda-gdb) and inspect the GPU state while the full GPU memory is still intact. However, this approach is less automatic and requires more manual intervention.
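
For concreteness, here is a minimal sketch (not from the post itself) of how these two modes could be exercised. The script name, the faulting index, and the cuda-gdb commands in the comments are illustrative only, and it assumes a PyTorch install with a CUDA GPU:

```python
# crash_demo.py -- hypothetical repro script, only for illustration.
# The environment variable is set on the command line so the CUDA driver sees
# it before the CUDA context is created:
#
#   CUDA_ENABLE_COREDUMP_ON_EXCEPTION=1 python crash_demo.py
#       writes a GPU core dump when the kernel faults; the dump can later be
#       opened in cuda-gdb (e.g. `target cudacore <dump file>`)
#
#   CUDA_DEVICE_WAITS_ON_EXCEPTION=1 python crash_demo.py
#       hangs at the fault instead; attach with `cuda-gdb -p <pid>` while the
#       full GPU memory is still intact
import torch

x = torch.zeros(8, device="cuda")
bad_idx = torch.tensor([1000], device="cuda")  # deliberately out of bounds

y = x[bad_idx]            # the kernel launched here faults on the GPU
torch.cuda.synchronize()  # the asynchronous error only surfaces here
```

The exact exception (device-side assert vs. illegal memory access) depends on the kernel, but either way the fault happens on the GPU and is only reported asynchronously on the host.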

nice! happy to see this pointed out too; super useful!

Signed-off-by: youkaichao <[email protected]>
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
```

The error message suggests adding `CUDA_LAUNCH_BLOCKING=1` when running the code. However, there are still two problems:


Suggested change
The error message suggests adding `CUDA_LAUNCH_BLOCKING=1` when running the code. However, there are still two problems:
The challenging bit here is:
> CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
In our experience, the Python stack traces for these types of exceptions are basically always incorrect and pretty worthless. To resolve this, the error message suggests adding `CUDA_LAUNCH_BLOCKING=1` when running the code. However, there are still two problems:
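
The asynchronous-reporting problem described above can be made concrete with a short snippet. This is a sketch under the assumption of a CUDA-enabled PyTorch environment; the file name and tensor shapes are made up:

```python
# async_trace_demo.py -- hypothetical illustration of the misleading traceback.
import torch

a = torch.zeros(8, device="cuda")
bad = a[torch.tensor([1000], device="cuda")]  # the faulting kernel is launched here...

b = torch.ones(4, device="cuda") + 1          # ...but the RuntimeError typically
torch.cuda.synchronize()                      # surfaces at some later call such as
                                              # this one, so the Python traceback
                                              # points at the wrong line.

# Rerunning as
#   CUDA_LAUNCH_BLOCKING=1 python async_trace_demo.py
# makes every kernel launch synchronous, so the exception is raised at the line
# that actually launched the faulting kernel (at the cost of serializing all
# launches).
```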

@youkaichao (Member Author)


added in 87d7ddb


@LucasWilkinson left a comment


Thank you so much for doing this! This is amazing! Core dumps are so unbelievably useful, so it's super nice to have a doc like this to point to.

Signed-off-by: youkaichao <[email protected]>
Signed-off-by: youkaichao <[email protected]>
@youkaichao (Member Author)

> Thank you so much for doing this! This is amazing! Core dumps are so unbelievably useful, so it's super nice to have a doc like this to point to.

Thanks for helping to improve the doc! Added you in the last commit.

Signed-off-by: youkaichao <[email protected]>
@youkaichao merged commit 41d4c54 into main Aug 13, 2025
4 checks passed