Skip to content

Commit 68e66f2

Browse files
matheustavaresjeffhostetler
authored andcommitted
parallel-checkout: add design documentation
Co-authored-by: Jeff Hostetler <[email protected]> Signed-off-by: Matheus Tavares <[email protected]> Signed-off-by: Junio C Hamano <[email protected]>
1 parent 1c4d6f4 commit 68e66f2

File tree

2 files changed

+271
-0
lines changed

2 files changed

+271
-0
lines changed

Documentation/Makefile

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -91,6 +91,7 @@ TECH_DOCS += technical/multi-pack-index
9191
TECH_DOCS += technical/pack-format
9292
TECH_DOCS += technical/pack-heuristics
9393
TECH_DOCS += technical/pack-protocol
94+
TECH_DOCS += technical/parallel-checkout
9495
TECH_DOCS += technical/partial-clone
9596
TECH_DOCS += technical/protocol-capabilities
9697
TECH_DOCS += technical/protocol-common
Lines changed: 270 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,270 @@
1+
Parallel Checkout Design Notes
2+
==============================
3+
4+
The "Parallel Checkout" feature attempts to use multiple processes to
5+
parallelize the work of uncompressing the blobs, applying in-core
6+
filters, and writing the resulting contents to the working tree during a
7+
checkout operation. It can be used by all checkout-related commands,
8+
such as `clone`, `checkout`, `reset`, `sparse-checkout`, and others.
9+
10+
These commands share the following basic structure:
11+
12+
* Step 1: Read the current index file into memory.
13+
14+
* Step 2: Modify the in-memory index based upon the command, and
15+
temporarily mark all cache entries that need to be updated.
16+
17+
* Step 3: Populate the working tree to match the new candidate index.
18+
This includes iterating over all of the to-be-updated cache entries
19+
and delete, create, or overwrite the associated files in the working
20+
tree.
21+
22+
* Step 4: Write the new index to disk.
23+
24+
Step 3 is the focus of the "parallel checkout" effort described here.
25+
26+
Sequential Implementation
27+
-------------------------
28+
29+
For the purposes of discussion here, the current sequential
30+
implementation of Step 3 is divided in 3 parts, each one implemented in
31+
its own function:
32+
33+
* Step 3a: `unpack-trees.c:check_updates()` contains a series of
34+
sequential loops iterating over the `cache_entry`'s array. The main
35+
loop in this function calls the Step 3b function for each of the
36+
to-be-updated entries.
37+
38+
* Step 3b: `entry.c:checkout_entry()` examines the existing working tree
39+
for file conflicts, collisions, and unsaved changes. It removes files
40+
and creates leading directories as necessary. It calls the Step 3c
41+
function for each entry to be written.
42+
43+
* Step 3c: `entry.c:write_entry()` loads the blob into memory, smudges
44+
it if necessary, creates the file in the working tree, writes the
45+
smudged contents, calls `fstat()` or `lstat()`, and updates the
46+
associated `cache_entry` struct with the stat information gathered.
47+
48+
It wouldn't be safe to perform Step 3b in parallel, as there could be
49+
race conditions between file creations and removals. Instead, the
50+
parallel checkout framework lets the sequential code handle Step 3b,
51+
and uses parallel workers to replace the sequential
52+
`entry.c:write_entry()` calls from Step 3c.
53+
54+
Rejected Multi-Threaded Solution
55+
--------------------------------
56+
57+
The most "straightforward" implementation would be to spread the set of
58+
to-be-updated cache entries across multiple threads. But due to the
59+
thread-unsafe functions in the ODB code, we would have to use locks to
60+
coordinate the parallel operation. An early prototype of this solution
61+
showed that the multi-threaded checkout would bring performance
62+
improvements over the sequential code, but there was still too much lock
63+
contention. A `perf` profiling indicated that around 20% of the runtime
64+
during a local Linux clone (on an SSD) was spent in locking functions.
65+
For this reason this approach was rejected in favor of using multiple
66+
child processes, which led to a better performance.
67+
68+
Multi-Process Solution
69+
----------------------
70+
71+
Parallel checkout alters the aforementioned Step 3 to use multiple
72+
`checkout--worker` background processes to distribute the work. The
73+
long-running worker processes are controlled by the foreground Git
74+
command using the existing run-command API.
75+
76+
Overview
77+
~~~~~~~~
78+
79+
Step 3b is only slightly altered; for each entry to be checked out, the
80+
main process performs the following steps:
81+
82+
* M1: Check whether there is any untracked or unclean file in the
83+
working tree which would be overwritten by this entry, and decide
84+
whether to proceed (removing the file(s)) or not.
85+
86+
* M2: Create the leading directories.
87+
88+
* M3: Load the conversion attributes for the entry's path.
89+
90+
* M4: Check, based on the entry's type and conversion attributes,
91+
whether the entry is eligible for parallel checkout (more on this
92+
later). If it is eligible, enqueue the entry and the loaded
93+
attributes to later write the entry in parallel. If not, write the
94+
entry right away, using the default sequential code.
95+
96+
Note: we save the conversion attributes associated with each entry
97+
because the workers don't have access to the main process' index state,
98+
so they can't load the attributes by themselves (and the attributes are
99+
needed to properly smudge the entry). Additionally, this has a positive
100+
impact on performance as (1) we don't need to load the attributes twice
101+
and (2) the attributes machinery is optimized to handle paths in
102+
sequential order.
103+
104+
After all entries have passed through the above steps, the main process
105+
checks if the number of enqueued entries is sufficient to spread among
106+
the workers. If not, it just writes them sequentially. Otherwise, it
107+
spawns the workers and distributes the queued entries uniformly in
108+
continuous chunks. This aims to minimize the chances of two workers
109+
writing to the same directory simultaneously, which could increase lock
110+
contention in the kernel.
111+
112+
Then, for each assigned item, each worker:
113+
114+
* W1: Checks if there is any non-directory file in the leading part of
115+
the entry's path or if there already exists a file at the entry' path.
116+
If so, mark the entry with `PC_ITEM_COLLIDED` and skip it (more on
117+
this later).
118+
119+
* W2: Creates the file (with O_CREAT and O_EXCL).
120+
121+
* W3: Loads the blob into memory (inflating and delta reconstructing
122+
it).
123+
124+
* W4: Applies any required in-process filter, like end-of-line
125+
conversion and re-encoding.
126+
127+
* W5: Writes the result to the file descriptor opened at W2.
128+
129+
* W6: Calls `fstat()` or lstat()` on the just-written path, and sends
130+
the result back to the main process, together with the end status of
131+
the operation and the item's identification number.
132+
133+
Note that, when possible, steps W3 to W5 are delegated to the streaming
134+
machinery, removing the need to keep the entire blob in memory.
135+
136+
If the worker fails to read the blob or to write it to the working tree,
137+
it removes the created file to avoid leaving empty files behind. This is
138+
the *only* time a worker is allowed to remove a file.
139+
140+
As mentioned earlier, it is the responsibility of the main process to
141+
remove any file that blocks the checkout operation (or abort if the
142+
removal(s) would cause data loss and the user didn't ask to `--force`).
143+
This is crucial to avoid race conditions and also to properly detect
144+
path collisions at Step W1.
145+
146+
After the workers finish writing the items and sending back the required
147+
information, the main process handles the results in two steps:
148+
149+
- First, it updates the in-memory index with the `lstat()` information
150+
sent by the workers. (This must be done first as this information
151+
might me required in the following step.)
152+
153+
- Then it writes the items which collided on disk (i.e. items marked
154+
with `PC_ITEM_COLLIDED`). More on this below.
155+
156+
Path Collisions
157+
---------------
158+
159+
Path collisions happen when two different paths correspond to the same
160+
entry in the file system. E.g. the paths 'a' and 'A' would collide in a
161+
case-insensitive file system.
162+
163+
The sequential checkout deals with collisions in the same way that it
164+
deals with files that were already present in the working tree before
165+
checkout. Basically, it checks if the path that it wants to write
166+
already exists on disk, makes sure the existing file doesn't have
167+
unsaved data, and then overwrites it. (To be more pedantic: it deletes
168+
the existing file and creates the new one.) So, if there are multiple
169+
colliding files to be checked out, the sequential code will write each
170+
one of them but only the last will actually survive on disk.
171+
172+
Parallel checkout aims to reproduce the same behavior. However, we
173+
cannot let the workers racily write to the same file on disk. Instead,
174+
the workers detect when the entry that they want to check out would
175+
collide with an existing file, and mark it with `PC_ITEM_COLLIDED`.
176+
Later, the main process can sequentially feed these entries back to
177+
`checkout_entry()` without the risk of race conditions. On clone, this
178+
also has the effect of marking the colliding entries to later emit a
179+
warning for the user, like the classic sequential checkout does.
180+
181+
The workers are able to detect both collisions among the entries being
182+
concurrently written and collisions between a parallel-eligible entry
183+
and an ineligible entry. The general idea for collision detection is
184+
quite straightforward: for each parallel-eligible entry, the main
185+
process must remove all files that prevent this entry from being written
186+
(before enqueueing it). This includes any non-directory file in the
187+
leading path of the entry. Later, when a worker gets assigned the entry,
188+
it looks again for the non-directories files and for an already existing
189+
file at the entry's path. If any of these checks finds something, the
190+
worker knows that there was a path collision.
191+
192+
Because parallel checkout can distinguish path collisions from the case
193+
where the file was already present in the working tree before checkout,
194+
we could alternatively choose to skip the checkout of colliding entries.
195+
However, each entry that doesn't get written would have NULL `lstat()`
196+
fields on the index. This could cause performance penalties for
197+
subsequent commands that need to refresh the index, as they would have
198+
to go to the file system to see if the entry is dirty. Thus, if we have
199+
N entries in a colliding group and we decide to write and `lstat()` only
200+
one of them, every subsequent `git-status` will have to read, convert,
201+
and hash the written file N - 1 times. By checking out all colliding
202+
entries (like the sequential code does), we only pay the overhead once,
203+
during checkout.
204+
205+
Eligible Entries for Parallel Checkout
206+
--------------------------------------
207+
208+
As previously mentioned, not all entries passed to `checkout_entry()`
209+
will be considered eligible for parallel checkout. More specifically, we
210+
exclude:
211+
212+
- Symbolic links; to avoid race conditions that, in combination with
213+
path collisions, could cause workers to write files at the wrong
214+
place. For example, if we were to concurrently check out a symlink
215+
'a' -> 'b' and a regular file 'A/f' in a case-insensitive file system,
216+
we could potentially end up writing the file 'A/f' at 'a/f', due to a
217+
race condition.
218+
219+
- Regular files that require external filters (either "one shot" filters
220+
or long-running process filters). These filters are black-boxes to Git
221+
and may have their own internal locking or non-concurrent assumptions.
222+
So it might not be safe to run multiple instances in parallel.
223+
+
224+
Besides, long-running filters may use the delayed checkout feature to
225+
postpone the return of some filtered blobs. The delayed checkout queue
226+
and the parallel checkout queue are not compatible and should remain
227+
separate.
228+
+
229+
Note: regular files that only require internal filters, like end-of-line
230+
conversion and re-encoding, are eligible for parallel checkout.
231+
232+
Ineligible entries are checked out by the classic sequential codepath
233+
*before* spawning workers.
234+
235+
Note: submodules's files are also eligible for parallel checkout (as
236+
long as they don't fall into any of the excluding categories mentioned
237+
above). But since each submodule is checked out in its own child
238+
process, we don't mix the superproject's and the submodules' files in
239+
the same parallel checkout process or queue.
240+
241+
The API
242+
-------
243+
244+
The parallel checkout API was designed with the goal of minimizing
245+
changes to the current users of the checkout machinery. This means that
246+
they don't have to call a different function for sequential or parallel
247+
checkout. As already mentioned, `checkout_entry()` will automatically
248+
insert the given entry in the parallel checkout queue when this feature
249+
is enabled and the entry is eligible; otherwise, it will just write the
250+
entry right away, using the sequential code. In general, callers of the
251+
parallel checkout API should look similar to this:
252+
253+
----------------------------------------------
254+
int pc_workers, pc_threshold, err = 0;
255+
struct checkout state;
256+
257+
get_parallel_checkout_configs(&pc_workers, &pc_threshold);
258+
259+
/*
260+
* This check is not strictly required, but it
261+
* should save some time in sequential mode.
262+
*/
263+
if (pc_workers > 1)
264+
init_parallel_checkout();
265+
266+
for (each cache_entry ce to-be-updated)
267+
err |= checkout_entry(ce, &state, NULL, NULL);
268+
269+
err |= run_parallel_checkout(&state, pc_workers, pc_threshold, NULL, NULL);
270+
----------------------------------------------

0 commit comments

Comments
 (0)