Parallel Checkout Design Notes
==============================

The "Parallel Checkout" feature attempts to use multiple processes to
parallelize the work of uncompressing the blobs, applying in-core
filters, and writing the resulting contents to the working tree during a
checkout operation. It can be used by all checkout-related commands,
such as `clone`, `checkout`, `reset`, `sparse-checkout`, and others.

These commands share the following basic structure:

* Step 1: Read the current index file into memory.

* Step 2: Modify the in-memory index based upon the command, and
  temporarily mark all cache entries that need to be updated.

* Step 3: Populate the working tree to match the new candidate index.
  This includes iterating over all of the to-be-updated cache entries
  and deleting, creating, or overwriting the associated files in the
  working tree.

* Step 4: Write the new index to disk.

Step 3 is the focus of the "parallel checkout" effort described here.

Sequential Implementation
-------------------------

For the purposes of discussion here, the current sequential
implementation of Step 3 is divided into three parts, each implemented
in its own function:

* Step 3a: `unpack-trees.c:check_updates()` contains a series of
  sequential loops iterating over the `cache_entry`'s array. The main
  loop in this function calls the Step 3b function for each of the
  to-be-updated entries.

* Step 3b: `entry.c:checkout_entry()` examines the existing working tree
  for file conflicts, collisions, and unsaved changes. It removes files
  and creates leading directories as necessary. It calls the Step 3c
  function for each entry to be written.

* Step 3c: `entry.c:write_entry()` loads the blob into memory, smudges
  it if necessary, creates the file in the working tree, writes the
  smudged contents, calls `fstat()` or `lstat()`, and updates the
  associated `cache_entry` struct with the stat information gathered.

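In rough pseudocode (following the illustrative style used later in this
document, not literal code), the sequential Step 3 boils down to:

----------------------------------------------
/* Step 3a: unpack-trees.c:check_updates() */
for (each to-be-updated cache_entry ce)
	/* Step 3b: check conflicts, remove blockers, create leading dirs */
	checkout_entry(ce, &state, NULL, NULL);
		/* Step 3c: load blob, smudge, write, stat */
		-> write_entry(...)
----------------------------------------------
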
It wouldn't be safe to perform Step 3b in parallel, as there could be
race conditions between file creations and removals. Instead, the
parallel checkout framework lets the sequential code handle Step 3b,
and uses parallel workers to replace the sequential
`entry.c:write_entry()` calls from Step 3c.

Rejected Multi-Threaded Solution
--------------------------------

The most "straightforward" implementation would be to spread the set of
to-be-updated cache entries across multiple threads. But due to the
thread-unsafe functions in the ODB code, we would have to use locks to
coordinate the parallel operation. An early prototype of this solution
showed that the multi-threaded checkout would bring performance
improvements over the sequential code, but there was still too much lock
contention. Profiling with `perf` indicated that around 20% of the
runtime during a local Linux clone (on an SSD) was spent in locking
functions. For this reason, this approach was rejected in favor of using
multiple child processes, which led to better performance.

Multi-Process Solution
----------------------

Parallel checkout alters the aforementioned Step 3 to use multiple
`checkout--worker` background processes to distribute the work. The
long-running worker processes are controlled by the foreground Git
command using the existing run-command API.

Overview
~~~~~~~~

Step 3b is only slightly altered; for each entry to be checked out, the
main process performs the following steps:

* M1: Check whether there is any untracked or unclean file in the
  working tree which would be overwritten by this entry, and decide
  whether to proceed (removing the file(s)) or not.

* M2: Create the leading directories.

* M3: Load the conversion attributes for the entry's path.

* M4: Check, based on the entry's type and conversion attributes,
  whether the entry is eligible for parallel checkout (more on this
  later). If it is eligible, enqueue the entry and the loaded
  attributes to later write the entry in parallel. If not, write the
  entry right away, using the default sequential code.

Note: we save the conversion attributes associated with each entry
because the workers don't have access to the main process' index state,
so they can't load the attributes by themselves (and the attributes are
needed to properly smudge the entry). Additionally, this has a positive
impact on performance as (1) we don't need to load the attributes twice
and (2) the attributes machinery is optimized to handle paths in
sequential order.

After all entries have passed through the above steps, the main process
checks if the number of enqueued entries is sufficient to spread among
the workers. If not, it just writes them sequentially. Otherwise, it
spawns the workers and distributes the queued entries uniformly in
continuous chunks. This aims to minimize the chances of two workers
writing to the same directory simultaneously, which could increase lock
contention in the kernel.

Then, for each assigned item, each worker:

* W1: Checks if there is any non-directory file in the leading part of
  the entry's path or if there already exists a file at the entry's
  path. If so, marks the entry with `PC_ITEM_COLLIDED` and skips it
  (more on this later).

* W2: Creates the file (with `O_CREAT` and `O_EXCL`).

* W3: Loads the blob into memory (inflating and delta reconstructing
  it).

* W4: Applies any required in-process filter, like end-of-line
  conversion and re-encoding.

* W5: Writes the result to the file descriptor opened at W2.

* W6: Calls `fstat()` or `lstat()` on the just-written path, and sends
  the result back to the main process, together with the end status of
  the operation and the item's identification number.

Note that, when possible, steps W3 to W5 are delegated to the streaming
machinery, removing the need to keep the entire blob in memory.

If the worker fails to read the blob or to write it to the working tree,
it removes the created file to avoid leaving empty files behind. This is
the *only* time a worker is allowed to remove a file.

As mentioned earlier, it is the responsibility of the main process to
remove any file that blocks the checkout operation (or abort if the
removal(s) would cause data loss and the user didn't ask to `--force`).
This is crucial to avoid race conditions and also to properly detect
path collisions at Step W1.

After the workers finish writing the items and sending back the required
information, the main process handles the results in two steps:

- First, it updates the in-memory index with the `lstat()` information
  sent by the workers. (This must be done first as this information
  might be required in the following step.)

- Then it writes the items which collided on disk (i.e. items marked
  with `PC_ITEM_COLLIDED`). More on this below.

Path Collisions
---------------

Path collisions happen when two different paths correspond to the same
entry in the file system. E.g. the paths 'a' and 'A' would collide in a
case-insensitive file system.

The sequential checkout deals with collisions in the same way that it
deals with files that were already present in the working tree before
checkout. Basically, it checks if the path that it wants to write
already exists on disk, makes sure the existing file doesn't have
unsaved data, and then overwrites it. (To be more pedantic: it deletes
the existing file and creates the new one.) So, if there are multiple
colliding files to be checked out, the sequential code will write each
one of them but only the last will actually survive on disk.

Parallel checkout aims to reproduce the same behavior. However, we
cannot let the workers racily write to the same file on disk. Instead,
the workers detect when the entry that they want to check out would
collide with an existing file, and mark it with `PC_ITEM_COLLIDED`.
Later, the main process can sequentially feed these entries back to
`checkout_entry()` without the risk of race conditions. On clone, this
also has the effect of marking the colliding entries to later emit a
warning for the user, like the classic sequential checkout does.

The workers are able to detect both collisions among the entries being
concurrently written and collisions between a parallel-eligible entry
and an ineligible entry. The general idea for collision detection is
quite straightforward: for each parallel-eligible entry, the main
process must remove all files that prevent this entry from being written
(before enqueueing it). This includes any non-directory file in the
leading path of the entry. Later, when a worker gets assigned the entry,
it looks again for non-directory files in the leading path and for an
already existing file at the entry's path. If any of these checks finds
something, the worker knows that there was a path collision.

Because parallel checkout can distinguish path collisions from the case
where the file was already present in the working tree before checkout,
we could alternatively choose to skip the checkout of colliding entries.
However, each entry that doesn't get written would have NULL `lstat()`
fields on the index. This could cause performance penalties for
subsequent commands that need to refresh the index, as they would have
to go to the file system to see if the entry is dirty. Thus, if we have
N entries in a colliding group and we decide to write and `lstat()` only
one of them, every subsequent `git status` will have to read, convert,
and hash the written file N - 1 times. By checking out all colliding
entries (like the sequential code does), we only pay the overhead once,
during checkout.

Eligible Entries for Parallel Checkout
--------------------------------------

As previously mentioned, not all entries passed to `checkout_entry()`
will be considered eligible for parallel checkout. More specifically, we
exclude:

- Symbolic links; to avoid race conditions that, in combination with
  path collisions, could cause workers to write files at the wrong
  place. For example, if we were to concurrently check out a symlink
  'a' -> 'b' and a regular file 'A/f' in a case-insensitive file system,
  we could potentially end up writing the file 'A/f' at 'a/f', due to a
  race condition.

- Regular files that require external filters (either "one shot" filters
  or long-running process filters). These filters are black-boxes to Git
  and may have their own internal locking or non-concurrent assumptions.
  So it might not be safe to run multiple instances in parallel.
+
Besides, long-running filters may use the delayed checkout feature to
postpone the return of some filtered blobs. The delayed checkout queue
and the parallel checkout queue are not compatible and should remain
separate.
+
Note: regular files that only require internal filters, like end-of-line
conversion and re-encoding, are eligible for parallel checkout.

Ineligible entries are checked out by the classic sequential codepath
*before* spawning workers.

Note: submodules' files are also eligible for parallel checkout (as
long as they don't fall into any of the excluding categories mentioned
above). But since each submodule is checked out in its own child
process, we don't mix the superproject's and the submodules' files in
the same parallel checkout process or queue.

The API
-------

The parallel checkout API was designed with the goal of minimizing
changes to the current users of the checkout machinery. This means that
they don't have to call a different function for sequential or parallel
checkout. As already mentioned, `checkout_entry()` will automatically
insert the given entry in the parallel checkout queue when this feature
is enabled and the entry is eligible; otherwise, it will just write the
entry right away, using the sequential code. In general, callers of the
parallel checkout API should look similar to this:

----------------------------------------------
int pc_workers, pc_threshold, err = 0;
struct checkout state;

get_parallel_checkout_configs(&pc_workers, &pc_threshold);

/*
 * This check is not strictly required, but it
 * should save some time in sequential mode.
 */
if (pc_workers > 1)
	init_parallel_checkout();

for (each cache_entry ce to-be-updated)
	err |= checkout_entry(ce, &state, NULL, NULL);

err |= run_parallel_checkout(&state, pc_workers, pc_threshold, NULL, NULL);
----------------------------------------------