|
| 1 | +From cb9b2d04699ca0a5d1f32f6e79447b0823444fb4 Mon Sep 17 00:00:00 2001 |
| 2 | +From: Johannes Schindelin < [email protected]> |
| 3 | +Date: Mon, 23 Jun 2025 13:06:02 +0200 |
| 4 | +Subject: [PATCH 50/N] pipe: fix SSH hang (again) |
| 5 | +MIME-Version: 1.0 |
| 6 | +Content-Type: text/plain; charset=UTF-8 |
| 7 | +Content-Transfer-Encoding: 8bit |
| 8 | + |
| 9 | +When cbfaeba4f7 (Cygwin: pipe: Fix incorrect write length in |
| 10 | +raw_write(), 2024-11-06) fixed a bug where too-long writes could cause |
| 11 | +segmentation faults, it not only left out crucial details from the |
| 12 | +commit message, it also introduced a bug. |
| 13 | + |
| 14 | +This manifests e.g. in the symptom where cloning large repositories in |
| 15 | +Git for Windows via SSH "freeze" at random stages, as has been reported |
| 16 | +in https://github.com/git-for-windows/git/issues/5688 and in |
| 17 | +https://github.com/git-for-windows/git/issues/5682. |
| 18 | + |
| 19 | +The bug in question was to use `is_nonblocking ()` instead of |
| 20 | +`real_non_blocking_mode` in a newly introduced condition. If the commit |
| 21 | +message had filled in those details, it would most likely have been much |
| 22 | +easier to spot the bug before the patch was committed. |
| 23 | + |
| 24 | +Granted, the patch _sort of_ moved this `is_nonblocking ()` condition |
| 25 | +from another place, which means that the bug was present before, even if |
| 26 | +it did not have the same detrimental symptom of hanging a clone of a |
| 27 | +large repository via SSH. |
| 28 | + |
| 29 | +The _original_ bug was introduced in the equally under-explained |
| 30 | +7ed9adb356 (Cygwin: pipe: Switch pipe mode to blocking mode by default, |
| 31 | +2024-09-05). |
| 32 | + |
| 33 | +What is ironic is that this here patch is the latest (and hopefully |
| 34 | +last) commit in a _long_ chain of bug fixes that fix bugs introduced by |
| 35 | +preceding bug fixes: |
| 36 | + |
| 37 | +- 9e4d308cd5 (Cygwin: pipe: Adopt FILE_SYNCHRONOUS_IO_NONALERT flag for |
| 38 | + read pipe., 2021-11-10) fixed a bug where Cygwin hung by mistake while |
| 39 | + piping output from one .NET program as input to another .NET program |
| 40 | + (potentially introduced by 365199090c (Cygwin: pipe: Avoid false EOF |
| 41 | + while reading output of C# programs., 2021-11-07), which was itself a |
| 42 | + bug fix). It introduced a bug that was fixed by... |
| 43 | +- fc691d0246 (Cygwin: pipe: Make sure to set read pipe non-blocking for |
| 44 | + cygwin apps., 2024-03-11). Which introduced a bug that was purportedly |
| 45 | + fixed by... |
| 46 | +- 7ed9adb356 (Cygwin: pipe: Switch pipe mode to blocking mode by |
| 47 | + default, 2024-09-05). Which introduced a bug that was fixed by... |
| 48 | +- cbfaeba4f7 (Cygwin: pipe: Fix incorrect write length in raw_write(), |
| 49 | + 2024-11-06). Which introduced a bug that was fixed by... this here |
| 50 | + patch. |
| 51 | + |
| 52 | +There is not only the common thread here that each of these bug fixes |
| 53 | +introduced a new bug, but also the common thread of commit messages that |
| 54 | +leave the reader puzzled and the problem — as well as the solution — |
| 55 | +under-explained. It is quite likely that, had the time and care been |
| 56 | +spent to fully explain what is going on in the commit messages, this |
| 57 | +would quite likely have triggered a re-review of the diff by the author. |
| 58 | +In other words, writing a complete commit message could have increased |
| 59 | +the insight and understanding of the problem and thereby prevented this |
| 60 | +series of "fixes that introduce new bugs". |
| 61 | + |
| 62 | +To avoid making the same mistake once again, here is my attempt at a |
| 63 | +thorough explanation of what's going on, what is the problem, what |
| 64 | +solutions were considered, and why the "winner" was chosen. |
| 65 | + |
| 66 | +So let's start with an overview of the logic of the |
| 67 | +`fhandler_pipe_fifo::raw_write ()` method. |
| 68 | + |
| 69 | +This method implements the fundamental logic behind writing to any pipe |
| 70 | +in Cygwin, whether it be blocking or not blocking, and takes as input |
| 71 | +two variables, `ptr` and `len`, pointing to the data to be written. |
| 72 | + |
| 73 | +First, this method determines a couple of initial conditions (such as |
| 74 | +maybe waiting on a specific mutex, or determining whether the pipe had |
| 75 | +been closed already, or whether it is in blocking or in non-blocking |
| 76 | +mode). Then, this method determines how many bytes of data should be |
| 77 | +written in each attempt (assigned to the variable `chunk`). |
| 78 | + |
| 79 | +The amount of bytes already written is tracked as `nbytes`. While this |
| 80 | +value is smaller than `len`, the central loop of this method is |
| 81 | +executed. |
| 82 | + |
| 83 | +Counterintuitively, this loop is necessary even in non-blocking mode, to |
| 84 | +handle conditions like `STATUS_PENDING`. |
| 85 | + |
| 86 | +Inside this loop body, a variety of conditions are handled, including |
| 87 | +some that break out of that loop before `nbytes >= len`, e.g. when the |
| 88 | +pipe is closed on the other end. |
| 89 | + |
| 90 | +This loop body also contains special, complicated logic to handle the |
| 91 | +case where the pipe buffer is not large enough to handle the current |
| 92 | +`chunk`, trying to write even less bytes (this is guarded by the flag |
| 93 | +`short_write_once`). |
| 94 | + |
| 95 | +Unfortunately, the recent re-design of that logic decided to always set |
| 96 | +the pipe to blocking (cf. the diff of ed9adb356 (Cygwin: pipe: Switch |
| 97 | +pipe mode to blocking mode by default, 2024-09-05)). The rationale for |
| 98 | +that decision was the desire to simplify the code, which came at the |
| 99 | +expense of working well together with pipes that were created outside of |
| 100 | +Cygwin and that are set to non-blocking mode. |
| 101 | + |
| 102 | +To this end, the current implementation now _emulates_ non-blocking |
| 103 | +behavior while keeping the actual pipe in blocking mode. This requires |
| 104 | +knowing how much data can be written to the pipe without blocking at any |
| 105 | +given moment. Because by knowing that amount, even in blocking mode a |
| 106 | +write operation can be performed that does not block. This information |
| 107 | +is obtained via `NtQueryInformationFile(FilePipeLocalInformation)` and |
| 108 | +stored in the variable `avail`. |
| 109 | + |
| 110 | +However, if for some reason it cannot be determined how much space is |
| 111 | +available in the pipe, this non-blocking mode emulation would not work. |
| 112 | +Therefore the pipe is set to genuine non-blocking mode in such cases, |
| 113 | +and only in such cases. This minimizes the time window (but does not |
| 114 | +eliminate it completely!) during which Cygwin does not work well |
| 115 | +together with non-Cygwin processes via pipes. |
| 116 | + |
| 117 | +It also shows that the original desire to simplify the code can not be |
| 118 | +fulfilled using the current strategy, as there is now both code to |
| 119 | +emulate non-blocking mode as well as code that handles genuinely |
| 120 | +non-blocking mode. |
| 121 | + |
| 122 | +This information whether non-blocking mode is merely emulated or |
| 123 | +genuinely enabled, is tracked by the Boolean variable |
| 124 | +`real_non_blocking_mode`. |
| 125 | + |
| 126 | +At the beginning of the loop body, after determining how many bytes to |
| 127 | +write (stored in `len1`), the `NtWriteFile()` is called. Since the |
| 128 | +re-design to always use blocking mode (at least insofar possible), this |
| 129 | +`NtWriteFile()` call is guarded by a condition so that it is called only |
| 130 | +once in non-blocking mode. |
| 131 | + |
| 132 | +Now, the logic for that is quite complex. First, `len1` is clamped to |
| 133 | +`chunk` in blocking mode, otherwise to however many bytes are left to |
| 134 | +write. Then, an _inner_ `while` loop runs as long as `len1` is greater |
| 135 | +than 0. Inside that loop, `NtWriteFile()` is called _unless_ we'd try to |
| 136 | +write more bytes than are available in emulated non-blocking mode. |
| 137 | + |
| 138 | +It is the complex interplay between that clamping to `chunk` and the |
| 139 | +condition when `NtWriteFile()` is skipped that is at the heart of this |
| 140 | +bug. |
| 141 | + |
| 142 | +But let's first see what happens in the case described above, where the |
| 143 | +(Cygwin) SSH process gets stuck writing to the (non-Cygwin) Git process. |
| 144 | +In the instances I observed, the big `while (nbytes < len)` loop is |
| 145 | +started with the following initial conditions: `len` is 2097152, `chunk` |
| 146 | +and `avail` are 1 and `real_non_blocking_mode` is 0. |
| 147 | + |
| 148 | +When the outer loop is entered, `len1` is therefore also set to 2097152, |
| 149 | +and at the start of the inner loop the condition `len1 > avail` holds |
| 150 | +true. Therefore, the `NtWriteFile()` call is skipped already during the |
| 151 | +first iteration, we run into the code that tries a short write (setting |
| 152 | +`short_write_once` to 1), and ultimately fail and return from |
| 153 | +`raw_write()` with an error. |
| 154 | + |
| 155 | +Since not a single byte was written, the caller will try again at some |
| 156 | +stage, failing again, over and over, and we're in an infinite loop. |
| 157 | + |
| 158 | +The fix chosen in this here commit is to apply that clamping to `chunk` |
| 159 | +_also_ in the emulated non-blocking mode. This lets the `NtWriteFile()` |
| 160 | +call _not_ be skipped in non-emulated non-blocking mode, as was clearly |
| 161 | +the intention. |
| 162 | + |
| 163 | +An alternative patch that would have also addressed the symptom would be |
| 164 | +to partially revert this hunk of cbfaeba4f7 (Cygwin: pipe: Fix incorrect |
| 165 | +write length in raw_write(), 2024-11-06): |
| 166 | + |
| 167 | + chunk = len; |
| 168 | + - else if (is_nonblocking ()) |
| 169 | + - chunk = len = pipe_buf_size; |
| 170 | + else |
| 171 | + chunk = avail; |
| 172 | + |
| 173 | +The logic changed such that `len` would no longer be clamped to the same |
| 174 | +value as `chunk` in non-blocking mode. Reinstating that clamp in |
| 175 | +non-blocking mode would also work around the issue, but it would be a |
| 176 | +bit more heavy-handed than the chosen fix. |
| 177 | + |
| 178 | +Yet another potential fix that was suggested (in |
| 179 | +https://github.com/git-for-windows/git/issues/5688#issuecomment-2995952882) |
| 180 | +would be to modify the condition when `NtWriteFile()` is skipped so as |
| 181 | +to force it to _not_ be skipped when attempting a "short write", i.e. |
| 182 | +when `short_write_once != 0`. This fix would have worked around the bug |
| 183 | +successfully, but is fixing the bug at the wrong layer (even if it hints |
| 184 | +at the fact that the `short_write_once` logic is ill-prepared for |
| 185 | +non-blocking mode, but that's a topic for another patch). |
| 186 | + |
| 187 | +Fixes: cbfaeba4f7 (Cygwin: pipe: Fix incorrect write length in raw_write(), 2024-11-06) |
| 188 | +Co-authored-by: Takashi Yano < [email protected]> |
| 189 | +Signed-off-by: Johannes Schindelin < [email protected]> |
| 190 | +--- |
| 191 | + winsup/cygwin/fhandler/pipe.cc | 2 +- |
| 192 | + 1 file changed, 1 insertion(+), 1 deletion(-) |
| 193 | + |
| 194 | +diff --git a/winsup/cygwin/fhandler/pipe.cc b/winsup/cygwin/fhandler/pipe.cc |
| 195 | +index 506dd09..169c0bb 100644 |
| 196 | +--- a/winsup/cygwin/fhandler/pipe.cc |
| 197 | ++++ b/winsup/cygwin/fhandler/pipe.cc |
| 198 | +@@ -561,7 +561,7 @@ fhandler_pipe_fifo::raw_write (const void *ptr, size_t len) |
| 199 | + ULONG len1; |
| 200 | + DWORD waitret = WAIT_OBJECT_0; |
| 201 | + |
| 202 | +- if (left > chunk && !is_nonblocking ()) |
| 203 | ++ if (left > chunk && !real_non_blocking_mode) |
| 204 | + len1 = chunk; |
| 205 | + else |
| 206 | + len1 = (ULONG) left; |
0 commit comments