_posts/2025-11-27-improved-cuda-debugging.md (8 additions, 8 deletions)
@@ -17,7 +17,7 @@ When a GPU kernel hangs, the program typically freezes or becomes unresponsive
Fortunately, there is a better way. The CUDA driver includes a feature called `user induced GPU core dump generation`: the driver opens pipes in the operating system that allow users to trigger a core dump by writing to them. When triggered, the CUDA driver dumps the GPU state to core dump files, enabling inspection of what's happening inside the GPU and, most importantly, identifying which GPU kernel is hanging.
- Here is a simple example of a conditional hanging kernel:
+ Consider a simple example of a conditional hanging kernel:
```python
# save as conditional_hang.py
@@ -88,7 +88,7 @@ x = x + 2
torch.cuda.synchronize()
```
- Directly executing the code will hang forever. We can enable the userinduced GPU core dump generation to debug the issue:
+ Executing this code will hang indefinitely. To debug the issue, we can enable user-induced GPU core dump generation:
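As a rough sketch of what enabling this looks like (the actual commands sit outside this hunk), the feature is switched on through the CUDA driver's environment variables; `CUDA_ENABLE_USER_TRIGGERED_COREDUMP` is the switch documented in the CUDA-GDB core dump docs:

```bash
# Sketch: turn on user-induced GPU core dumps, then launch the script.
# CUDA_ENABLE_USER_TRIGGERED_COREDUMP is the driver switch from the CUDA-GDB
# core dump docs; the pipe the driver opens is what we write to later.
export CUDA_ENABLE_USER_TRIGGERED_COREDUMP=1
python conditional_hang.py
```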
- Here we write 1MB of zeros to the pipe, which will trigger the CUDA core dump. Simple `echo aaa > /tmp/cuda_coredump_pipe_hostname.3000837.1764236276` might not work due to the buffering of the pipe.
+ We write 1MB of zeros to the pipe to trigger the CUDA core dump. Note that a simple `echo` command might not work due to pipe buffering.
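One way to produce that 1MB write, as a sketch (the pipe path here is the example path from the post; the real path is whatever the driver created for your process):

```bash
# Sketch: push 1 MB of zeros into the trigger pipe; a large write gets past
# the pipe buffering that can swallow a short `echo`.
dd if=/dev/zero of=/tmp/cuda_coredump_pipe_hostname.3000837.1764236276 bs=1M count=1
```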
- After we trigger the core dump, in the original terminal where we run the `python conditional_hang.py`, we will see the progress of the core dump:
+ After triggering the core dump, the original terminal running `python conditional_hang.py` will display the core dump progress:
```text
[01:39:15.256278] coredump: Writing ELF file to /tmp/cuda_coredump_hostname.3000837.1764236276
@@ -120,7 +120,7 @@ After we trigger the core dump, in the original terminal where we run the `pytho
[01:39:15.292128] coredump: All done (took 00s)
```
- Then we can use `cuda-gdb` to open the core dump file, and see exactly where the kernel is hanging:
+ We can then use `cuda-gdb` to open the core dump file and see exactly where the kernel is hanging:
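As a sketch of that step (the core dump file name is taken from the log above; `target cudacore` is the cuda-gdb command for loading GPU core dumps, and `info cuda kernels` lists the kernels found in the dump):

```bash
# Sketch: open the GPU core dump non-interactively, list the resident kernels,
# and print the backtrace of the focused GPU thread.
cuda-gdb --batch \
  -ex 'target cudacore /tmp/cuda_coredump_hostname.3000837.1764236276' \
  -ex 'info cuda kernels' \
  -ex 'bt'
```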
- Excitingly, we can not only exactly locate the kernel `conditional_hang_kernel`, but also the exact line of code that the kernel is hanging at. This is a huge improvement over the previous situation where we have no idea which kernel is hanging, not to mention the exact line of code that caused the hanging.
+ This approach allows us to not only identify the hanging kernel (`conditional_hang_kernel`) but also pinpoint the exact line of code where it hangs. This represents a significant improvement over the previous situation, where identifying the problematic kernel was impossible, let alone the specific line causing the hang.
- One slightly annoying thing is that the core dump pipe's path is dynamically generated by the cuda driver, and it is not easy to find out. We can properly use `CUDA_COREDUMP_PIPE` environment variable to specify the template path of the core dump pipe, so that we can find it easily by looking at the file descriptors of the process:
+ One minor inconvenience is that the core dump pipe's path is dynamically generated by the CUDA driver, making it difficult to locate. We can address this by using the `CUDA_COREDUMP_PIPE` environment variable to specify a template path for the core dump pipe, allowing us to find it easily by inspecting the process's file descriptors:
```bash
$ ls /proc/3037675/fd/ -alth | grep /tmp/cuda_coredump_pipe_
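# Sketch (assumption): the pipe template above would have been set before
# launching the program, for example along these lines; format specifiers such
# as %h (hostname) and %p (pid) are described in the CUDA-GDB core dump docs.
$ export CUDA_COREDUMP_PIPE=/tmp/cuda_coredump_pipe_%h.%p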