Commit 637fc44

partial-clone: design doc

Design document for partial clone feature.

Signed-off-by: Jeff Hostetler <[email protected]>
Signed-off-by: Junio C Hamano <[email protected]>

Authored by jeffhostetler, committed by gitster.
1 parent 95ec6b1; 1 file changed, 324 insertions(+), 0 deletions(-)

Partial Clone Design Notes
==========================

The "Partial Clone" feature is a performance optimization for Git that
allows Git to function without having a complete copy of the repository.
The goal of this work is to allow Git to better handle extremely large
repositories.

During clone and fetch operations, Git downloads the complete contents
and history of the repository. This includes all commits, trees, and
blobs for the complete life of the repository. For extremely large
repositories, clones can take hours (or days) and consume 100+GiB of disk
space.

Often in these repositories there are many blobs and trees that the user
does not need, such as:

  1. files outside of the user's work area in the tree. For example, in
     a repository with 500K directories and 3.5M files in every commit,
     we can avoid downloading many objects if the user only needs a
     narrow "cone" of the source tree.

  2. large binary assets. For example, in a repository where large build
     artifacts are checked into the tree, we can avoid downloading all
     previous versions of these non-mergeable binary assets and only
     download versions that are actually referenced.

Partial clone allows us to avoid downloading such unneeded objects *in
advance* during clone and fetch operations and thereby reduce download
times and disk usage. Missing objects can later be "demand fetched"
if/when needed.

Use of partial clone requires that the user be online and the origin
remote be available for on-demand fetching of missing objects. This may
or may not be problematic for the user. For example, if the user can
stay within the pre-selected subset of the source tree, they may not
encounter any missing objects. Alternatively, the user could try to
pre-fetch various objects if they know that they are going offline.


Non-Goals
---------

Partial clone is a mechanism to limit the number of blobs and trees downloaded
*within* a given range of commits -- and is therefore independent of and not
intended to conflict with existing DAG-level mechanisms to limit the set of
requested commits (i.e. shallow clone, single branch, or fetch '<refspec>').


Design Overview
---------------

Partial clone logically consists of the following parts:

- A mechanism for the client to describe unneeded or unwanted objects to
  the server.

- A mechanism for the server to omit such unwanted objects from packfiles
  sent to the client.

- A mechanism for the client to gracefully handle missing objects (that
  were previously omitted by the server).

- A mechanism for the client to backfill missing objects as needed.


Design Details
--------------

- A new pack-protocol capability "filter" is added to the fetch-pack and
  upload-pack negotiation.

  This uses the existing capability discovery mechanism.
  See "filter" in Documentation/technical/pack-protocol.txt.

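  As an illustration only (not an exact transcript), the capability appears
  at the end of the first ref-advertisement line sent by upload-pack,
  assuming the server operator has enabled it (e.g. via the
  uploadpack.allowFilter configuration; the object name is a placeholder):

      <oid> HEAD\0multi_ack thin-pack side-band-64k ofs-delta shallow filter agent=git/2.x
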
- Clients pass a "filter-spec" to clone and fetch which is passed to the
  server to request filtering during packfile construction.

  There are various filters available to accommodate different situations.
  See "--filter=<filter-spec>" in Documentation/rev-list-options.txt.

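  For example (the repository URL is a placeholder), a clone that omits all
  blobs, a clone that omits only large blobs, and a filtered fetch might
  look like:

      $ git clone --filter=blob:none https://example.com/repo.git
      $ git clone --filter=blob:limit=1m https://example.com/repo.git
      $ git fetch --filter=blob:none origin
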
- On the server pack-objects applies the requested filter-spec as it
  creates "filtered" packfiles for the client.

  These filtered packfiles are *incomplete* in the traditional sense because
  they may contain objects that reference objects not contained in the
  packfile and that the client doesn't already have. For example, the
  filtered packfile may contain trees or tags that reference missing blobs
  or commits that reference missing trees.

- On the client these incomplete packfiles are marked as "promisor packfiles"
  and treated differently by various commands.

- On the client a repository extension is added to the local config to
  prevent older versions of git from failing mid-operation because of
  missing objects that they cannot handle.
  See "extensions.partialClone" in Documentation/technical/repository-version.txt.

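  As a sketch (the remote name "origin" is just an example), the client's
  .git/config after a partial clone might contain something like the
  following; the extension records which remote is the promisor remote,
  and using the "extensions" section requires repository format version 1:

      [core]
          repositoryformatversion = 1
      [extensions]
          partialClone = origin
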

Handling Missing Objects
------------------------

- An object may be missing due to a partial clone or fetch, or missing due
  to repository corruption. To differentiate these cases, the local
  repository specially indicates such filtered packfiles obtained from the
  promisor remote as "promisor packfiles".

  These promisor packfiles consist of a "<name>.promisor" file with
  arbitrary contents (like the "<name>.keep" files), in addition to
  their "<name>.pack" and "<name>.idx" files.

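  For example, a promisor packfile on disk consists of files named along
  these lines (the hash is a placeholder):

      .git/objects/pack/pack-<hash>.pack
      .git/objects/pack/pack-<hash>.idx
      .git/objects/pack/pack-<hash>.promisor
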
- The local repository considers a "promisor object" to be an object that
  it knows (to the best of its ability) that the promisor remote has promised
  that it has, either because the local repository has that object in one of
  its promisor packfiles, or because another promisor object refers to it.

  When Git encounters a missing object, Git can see if it is a promisor object
  and handle it appropriately. If not, Git can report a corruption.

  This means that there is no need for the client to explicitly maintain an
  expensive-to-modify list of missing objects.[a]

- Since almost all Git code currently expects any referenced object to be
  present locally and because we do not want to force every command to do
  a dry-run first, a fallback mechanism is added to allow Git to attempt
  to dynamically fetch missing objects from the promisor remote.

  When the normal object lookup fails to find an object, Git invokes
  fetch-object to try to get the object from the server and then retry
  the object lookup. This allows objects to be "faulted in" without
  complicated prediction algorithms.

  For efficiency reasons, no check as to whether the missing object is
  actually a promisor object is performed.

  Dynamic object fetching tends to be slow as objects are fetched one at
  a time.

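  For example (the object name is a placeholder), simply asking for an
  omitted blob after a blob-filtered clone is enough to fault it in:

      $ git cat-file -p <oid-of-omitted-blob>    # dynamically fetched, then printed
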
- `checkout` (and any other command using `unpack-trees`) has been taught
  to bulk pre-fetch all required missing blobs in a single batch.

- `rev-list` has been taught to print missing objects.

  This can be used by other commands to bulk prefetch objects.
  For example, a "git log -p A..B" may internally want to first do
  something like "git rev-list --objects --quiet --missing=print A..B"
  and prefetch those objects in bulk.

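  As an illustration (object names are placeholders), missing objects are
  printed with a "?" prefix, which a prefetch step can collect and feed to
  a single bulk fetch:

      $ git rev-list --objects --quiet --missing=print A..B
      ?<oid-of-missing-blob-1>
      ?<oid-of-missing-blob-2>
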
- `fsck` has been updated to be fully aware of promisor objects.

- `repack` in GC has been updated to not touch promisor packfiles at all,
  and to only repack other objects.

- The global variable "fetch_if_missing" is used to control whether an
  object lookup will attempt to dynamically fetch a missing object or
  report an error.

  We are not happy with this global variable and would like to remove it,
  but that requires significant refactoring of the object code to pass an
  additional flag. We hope that concurrent efforts to add an ODB API can
  encompass this.


Fetching Missing Objects
------------------------

- Fetching of objects is done using the existing transport mechanism using
  transport_fetch_refs(), setting a new transport option
  TRANS_OPT_NO_DEPENDENTS to indicate that only the objects themselves are
  desired, not any object that they refer to.

  Because some transports invoke fetch_pack() in the same process, fetch_pack()
  has been updated to not use any object flags when the corresponding argument
  (no_dependents) is set.

- The local repository sends a request with the hashes of all requested
  objects as "want" lines, and does not perform any packfile negotiation.
  It then receives a packfile.

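  Roughly (pkt-line framing and capabilities omitted, object names are
  placeholders), such a request amounts to a list of "want" lines with no
  "have" lines at all:

      want <oid-of-missing-object-1>
      want <oid-of-missing-object-2>
      done
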
- Because we are reusing the existing fetch-pack mechanism, fetching
  currently fetches all objects referred to by the requested objects, even
  though they are not necessary.


Current Limitations
-------------------

- The remote used for a partial clone (or the first partial fetch
  following a regular clone) is marked as the "promisor remote".

  We are currently limited to a single promisor remote and only that
  remote may be used for subsequent partial fetches.

  We accept this limitation because we believe initial users of this
  feature will be using it on repositories with a strong single central
  server.

- Dynamic object fetching will only ask the promisor remote for missing
  objects. We assume that the promisor remote has a complete view of the
  repository and can satisfy all such requests.

- Repack essentially treats promisor and non-promisor packfiles as two
  distinct partitions and does not mix them. Repack currently only works
  on non-promisor packfiles and loose objects.

- Dynamic object fetching invokes fetch-pack once *for each item*
  because most algorithms stumble upon a missing object and need to have
  it resolved before continuing their work. This may incur significant
  overhead -- and multiple authentication requests -- if many objects are
  needed.

- Dynamic object fetching currently uses the existing pack protocol V0
  which means that each object is requested via fetch-pack. The server
  will send a full set of info/refs when the connection is established.
  If there is a large number of refs, this may incur significant overhead.


Future Work
-----------

- Allow more than one promisor remote and define a strategy for fetching
  missing objects from specific promisor remotes or of iterating over the
  set of promisor remotes until a missing object is found.

  A user might want to have multiple geographically-close cache servers
  for fetching missing blobs while continuing to do filtered `git-fetch`
  commands from the central server, for example.

  Or the user might want to work in a triangular workflow with multiple
  promisor remotes that each have an incomplete view of the repository.

- Allow repack to work on promisor packfiles (while keeping them distinct
  from non-promisor packfiles).

- Allow non-pathname-based filters to make use of packfile bitmaps (when
  present). This was just an omission during the initial implementation.

- Investigate use of a long-running process to dynamically fetch a series
  of objects, such as the one proposed in [5,6], to reduce process startup
  and overhead costs.

  It would be nice if pack protocol V2 could allow that long-running
  process to make a series of requests over a single long-running
  connection.

- Investigate pack protocol V2 to avoid the info/refs broadcast on
  each connection with the server to dynamically fetch missing objects.

- Investigate the need to handle loose promisor objects.

  Objects in promisor packfiles are allowed to reference missing objects
  that can be dynamically fetched from the server. An assumption was
  made that loose objects are only created locally and therefore should
  not reference a missing object. We may need to revisit that assumption
  if, for example, we dynamically fetch a missing tree and store it as a
  loose object rather than a single object packfile.

  This does not necessarily mean we need to mark loose objects as promisor;
  it may be sufficient to relax the object lookup or is-promisor functions.


Non-Tasks
---------

- Every time the subject of "demand loading blobs" comes up it seems
  that someone suggests that the server be allowed to "guess" and send
  additional objects that may be related to the requested objects.

  No work has gone into actually doing that; we're just documenting that
  it is a common suggestion. We're not sure how it would work and have
  no plans to work on it.

  It is valid for the server to send more objects than requested (even
  for a dynamic object fetch), but we are not building on that.


Footnotes
---------

[a] expensive-to-modify list of missing objects: Earlier in the design of
    partial clone we discussed the need for a single list of missing objects.
    This would essentially be a sorted linear list of OIDs that were
    omitted by the server during a clone or subsequent fetches.

    This file would need to be loaded into memory on every object lookup.
    It would need to be read, updated, and re-written (like the .git/index)
    on every explicit "git fetch" command *and* on any dynamic object fetch.

    The cost to read, update, and write this file could add significant
    overhead to every command if there are many missing objects. For example,
    if there are 100M missing blobs, this file would be at least 2GiB on disk.

    With the "promisor" concept, we *infer* a missing object based upon the
    type of packfile that references it.


Related Links
-------------
[0] https://bugs.chromium.org/p/git/issues/detail?id=2
    Chromium work item for: Partial Clone

[1] https://public-inbox.org/git/[email protected]/
    Subject: [RFC] Add support for downloading blobs on demand
    Date: Fri, 13 Jan 2017 10:52:53 -0500

[2] https://public-inbox.org/git/[email protected]/
    Subject: [PATCH 00/18] Partial clone (from clone to lazy fetch in 18 patches)
    Date: Fri, 29 Sep 2017 13:11:36 -0700

[3] https://public-inbox.org/git/[email protected]/
    Subject: Proposal for missing blob support in Git repos
    Date: Wed, 26 Apr 2017 15:13:46 -0700

[4] https://public-inbox.org/git/[email protected]/
    Subject: [PATCH 00/10] RFC Partial Clone and Fetch
    Date: Wed, 8 Mar 2017 18:50:29 +0000

[5] https://public-inbox.org/git/[email protected]/
    Subject: [PATCH v7 00/10] refactor the filter process code into a reusable module
    Date: Fri, 5 May 2017 11:27:52 -0400

[6] https://public-inbox.org/git/[email protected]/
    Subject: [RFC/PATCH v2 0/1] Add support for downloading blobs on demand
    Date: Fri, 14 Jul 2017 09:26:50 -0400
