
add cuda debugging blogpost #66


Merged
merged 12 commits into main from cuda-debugging on Aug 13, 2025

Conversation

youkaichao (Member)

No description provided.

Signed-off-by: youkaichao <[email protected]>
Signed-off-by: youkaichao <[email protected]>
Signed-off-by: youkaichao <[email protected]>
Signed-off-by: youkaichao <[email protected]>
Signed-off-by: youkaichao <[email protected]>
Signed-off-by: youkaichao <[email protected]>

cloudflare-workers-and-pages bot commented Aug 12, 2025

Deploying vllm-blog-source with Cloudflare Pages

Latest commit: e77d1bb
Status: ✅  Deploy successful!
Preview URL: https://24502232.vllm-blog-source.pages.dev
Branch Preview URL: https://cuda-debugging.vllm-blog-source.pages.dev



> Additionally, the documentation states that enabling `CUDA_ENABLE_COREDUMP_ON_EXCEPTION=1` not only enables CUDA core dumps but also generates a CPU core dump by default. However, in practice, we find that the CPU core dump contains little useful information and is difficult to analyze.

> If you want live data for debugging, you can also set the `CUDA_DEVICE_WAITS_ON_EXCEPTION=1` environment variable. It does not use CUDA core dumps; instead, it stops GPU execution immediately when an exception occurs and hangs there, waiting for you to attach a debugger (such as cuda-gdb) and inspect the GPU state while the full GPU memory is still intact. However, this approach is less automatic and requires more manual intervention.
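
For concreteness, here is a minimal sketch (not from the post itself) of how these two modes could be exercised. The script name, the faulting index, and the cuda-gdb commands in the comments are illustrative only, and it assumes a PyTorch install with a CUDA GPU:

```python
# crash_demo.py -- hypothetical repro script, only for illustration.
# The environment variable is set on the command line so the CUDA driver sees
# it before the CUDA context is created:
#
#   CUDA_ENABLE_COREDUMP_ON_EXCEPTION=1 python crash_demo.py
#       writes a GPU core dump when the kernel faults; the dump can later be
#       opened in cuda-gdb (e.g. `target cudacore <dump file>`)
#
#   CUDA_DEVICE_WAITS_ON_EXCEPTION=1 python crash_demo.py
#       hangs at the fault instead; attach with `cuda-gdb -p <pid>` while the
#       full GPU memory is still intact
import torch

x = torch.zeros(8, device="cuda")
bad_idx = torch.tensor([1000], device="cuda")  # deliberately out of bounds

y = x[bad_idx]            # the kernel launched here faults on the GPU
torch.cuda.synchronize()  # the asynchronous error only surfaces here
```

The exact exception (device-side assert vs. illegal memory access) depends on the kernel, but either way the fault happens on the GPU and is only reported asynchronously on the host.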

nice! happy to see this pointed out too; super useful!

Signed-off-by: youkaichao <[email protected]>
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
```

The error message suggests adding `CUDA_LAUNCH_BLOCKING=1` when running the code. However, there are still two problems:


Suggested change
The error message suggests adding `CUDA_LAUNCH_BLOCKING=1` when running the code. However, there are still two problems:
The challenging bit here is:
> CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
In our experience, the Python stack traces for these types of exceptions are basically always incorrect and pretty worthless. To resolve this, the error message suggests adding `CUDA_LAUNCH_BLOCKING=1` when running the code. However, there are still two problems:
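
The asynchronous-reporting problem described above can be made concrete with a short snippet. This is a sketch under the assumption of a CUDA-enabled PyTorch environment; the file name and tensor shapes are made up:

```python
# async_trace_demo.py -- hypothetical illustration of the misleading traceback.
import torch

a = torch.zeros(8, device="cuda")
bad = a[torch.tensor([1000], device="cuda")]  # the faulting kernel is launched here...

b = torch.ones(4, device="cuda") + 1          # ...but the RuntimeError typically
torch.cuda.synchronize()                      # surfaces at some later call such as
                                              # this one, so the Python traceback
                                              # points at the wrong line.

# Rerunning as
#   CUDA_LAUNCH_BLOCKING=1 python async_trace_demo.py
# makes every kernel launch synchronous, so the exception is raised at the line
# that actually launched the faulting kernel (at the cost of serializing all
# launches).
```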

@youkaichao (Member Author)


added in 87d7ddb


@LucasWilkinson left a comment


Thank you so much for doing this! This is amazing! Core dumps are so unbelievably useful, so it's super nice to have a doc like this to point to.

Signed-off-by: youkaichao <[email protected]>
Signed-off-by: youkaichao <[email protected]>
@youkaichao (Member Author)

> Thank you so much for doing this! This is amazing! Core dumps are so unbelievably useful, so it's super nice to have a doc like this to point to.

Thanks for helping to improve the doc! Added you in the last commit.

Signed-off-by: youkaichao <[email protected]>
@youkaichao merged commit 41d4c54 into main Aug 13, 2025
4 checks passed