Partial Clone Design Notes
==========================

The "Partial Clone" feature is a performance optimization for Git that
allows Git to function without having a complete copy of the repository.
The goal of this work is to allow Git to better handle extremely large
repositories.

During clone and fetch operations, Git downloads the complete contents
and history of the repository. This includes all commits, trees, and
blobs for the complete life of the repository. For extremely large
repositories, clones can take hours (or days) and consume 100+GiB of disk
space.

Often in these repositories there are many blobs and trees that the user
does not need, such as:

 1. files outside of the user's work area in the tree. For example, in
    a repository with 500K directories and 3.5M files in every commit,
    we can avoid downloading many objects if the user only needs a
    narrow "cone" of the source tree.

 2. large binary assets. For example, in a repository where large build
    artifacts are checked into the tree, we can avoid downloading all
    previous versions of these non-mergeable binary assets and only
    download versions that are actually referenced.

Partial clone allows us to avoid downloading such unneeded objects *in
advance* during clone and fetch operations and thereby reduce download
times and disk usage. Missing objects can later be "demand fetched"
if/when needed.

Use of partial clone requires that the user be online and the origin
remote be available for on-demand fetching of missing objects. This may
or may not be problematic for the user. For example, if the user can
stay within the pre-selected subset of the source tree, they may not
encounter any missing objects. Alternatively, the user could try to
pre-fetch various objects if they know that they are going offline.


Non-Goals
---------

Partial clone is a mechanism to limit the number of blobs and trees downloaded
*within* a given range of commits -- and is therefore independent of and not
intended to conflict with existing DAG-level mechanisms to limit the set of
requested commits (i.e. shallow clone, single branch, or fetch '<refspec>').


Design Overview
---------------

Partial clone logically consists of the following parts:

- A mechanism for the client to describe unneeded or unwanted objects to
  the server.

- A mechanism for the server to omit such unwanted objects from packfiles
  sent to the client.

- A mechanism for the client to gracefully handle missing objects (that
  were previously omitted by the server).

- A mechanism for the client to backfill missing objects as needed.


Design Details
--------------

- A new pack-protocol capability "filter" is added to the fetch-pack and
  upload-pack negotiation.

  This uses the existing capability discovery mechanism.
  See "filter" in Documentation/technical/pack-protocol.txt.

- Clients pass a "filter-spec" to clone and fetch which is passed to the
  server to request filtering during packfile construction.

  There are various filters available to accommodate different situations.
  See "--filter=<filter-spec>" in Documentation/rev-list-options.txt.

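  As a concrete end-to-end sketch (paths and repository names here are
  invented for the example; the server side must opt in with
  `uploadpack.allowFilter`, and in practice also
  `uploadpack.allowAnySHA1InWant` so omitted objects can be requested
  back by ID):

```shell
set -e
tmp=$(mktemp -d)

# A throwaway "server" repository that allows filtered fetches and
# serving individual objects back by ID.
git init -q "$tmp/srv"
git -C "$tmp/srv" config uploadpack.allowfilter true
git -C "$tmp/srv" config uploadpack.allowanysha1inwant true
echo hello >"$tmp/srv/file.txt"
git -C "$tmp/srv" add file.txt
git -C "$tmp/srv" -c user.name=u -c user.email=u@example.com commit -qm initial

# A "blobless" partial clone: commits and trees are transferred up front,
# blobs only when something actually needs them (here, the checkout).
git clone -q --filter=blob:none "file://$tmp/srv" "$tmp/cl"

# The client records the promisor remote and the filter-spec it used.
git -C "$tmp/cl" config --get remote.origin.promisor            # prints "true"
git -C "$tmp/cl" config --get remote.origin.partialclonefilter  # prints "blob:none"
```

  Other filter-specs include `blob:limit=<n>` (omit blobs larger than n
  bytes) and tree/sparse filters; see
  Documentation/rev-list-options.txt for the full list.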
- On the server, pack-objects applies the requested filter-spec as it
  creates "filtered" packfiles for the client.

  These filtered packfiles are *incomplete* in the traditional sense because
  they may contain objects that reference objects not contained in the
  packfile and that the client doesn't already have. For example, the
  filtered packfile may contain trees or tags that reference missing blobs
  or commits that reference missing trees.

- On the client, these incomplete packfiles are marked as "promisor packfiles"
  and treated differently by various commands.

- On the client, a repository extension is added to the local config to
  prevent older versions of git from failing mid-operation because of
  missing objects that they cannot handle.
  See "extensions.partialClone" in Documentation/technical/repository-version.txt.

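After a partial clone, the client's `.git/config` ends up looking roughly
like this (a sketch only; the URL, remote name, and filter-spec below are
placeholders that depend on how the clone was made):

```ini
[core]
	repositoryformatversion = 1
[extensions]
	partialclone = origin
[remote "origin"]
	url = https://example.com/repo.git
	fetch = +refs/heads/*:refs/remotes/origin/*
	promisor = true
	partialclonefilter = blob:none
```
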
Handling Missing Objects
------------------------

- An object may be missing due to a partial clone or fetch, or missing due
  to repository corruption. To differentiate these cases, the local
  repository specially indicates such filtered packfiles obtained from the
  promisor remote as "promisor packfiles".

  These promisor packfiles consist of a "<name>.promisor" file with
  arbitrary contents (like the "<name>.keep" files), in addition to
  their "<name>.pack" and "<name>.idx" files.

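  This is easy to see on disk after a partial clone (a sketch with
  invented paths; `--no-checkout` keeps the initial filtered packfile as
  the only one):

```shell
set -e
tmp=$(mktemp -d)
git init -q "$tmp/srv"
git -C "$tmp/srv" config uploadpack.allowfilter true
echo hello >"$tmp/srv/file.txt"
git -C "$tmp/srv" add file.txt
git -C "$tmp/srv" -c user.name=u -c user.email=u@example.com commit -qm initial

git clone -q --no-checkout --filter=blob:none "file://$tmp/srv" "$tmp/cl"

# The filtered pack arrives with a ".promisor" file alongside the
# usual "<name>.pack" / "<name>.idx" pair.
ls "$tmp/cl/.git/objects/pack"
```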
- The local repository considers a "promisor object" to be an object that
  it knows (to the best of its ability) that the promisor remote has promised
  that it has, either because the local repository has that object in one of
  its promisor packfiles, or because another promisor object refers to it.

  When Git encounters a missing object, Git can see if it is a promisor object
  and handle it appropriately. If not, Git can report a corruption.

  This means that there is no need for the client to explicitly maintain an
  expensive-to-modify list of missing objects.[a]

- Since almost all Git code currently expects any referenced object to be
  present locally and because we do not want to force every command to do
  a dry-run first, a fallback mechanism is added to allow Git to attempt
  to dynamically fetch missing objects from the promisor remote.

  When the normal object lookup fails to find an object, Git invokes
  fetch-object to try to get the object from the server and then retries
  the object lookup. This allows objects to be "faulted in" without
  complicated prediction algorithms.

  For efficiency reasons, no check as to whether the missing object is
  actually a promisor object is performed.

  Dynamic object fetching tends to be slow as objects are fetched one at
  a time.

- `checkout` (and any other command using `unpack-trees`) has been taught
  to bulk pre-fetch all required missing blobs in a single batch.

- `rev-list` has been taught to print missing objects.

  This can be used by other commands to bulk prefetch objects.
  For example, a "git log -p A..B" may internally want to first do
  something like "git rev-list --objects --quiet --missing=print A..B"
  and prefetch those objects in bulk.

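  For example (a sketch with invented repository names), against a
  blobless clone that has not yet checked anything out, missing objects
  are reported with a leading `?`:

```shell
set -e
tmp=$(mktemp -d)
git init -q "$tmp/srv"
git -C "$tmp/srv" config uploadpack.allowfilter true
echo data >"$tmp/srv/big.bin"
git -C "$tmp/srv" add big.bin
git -C "$tmp/srv" -c user.name=u -c user.email=u@example.com commit -qm one

git clone -q --no-checkout --filter=blob:none "file://$tmp/srv" "$tmp/cl"

# Filtered-out (missing) blobs are printed as "?<oid>"; the commit and
# tree objects are present and listed normally. --missing also disables
# dynamic fetching, so this is safe to run offline.
git -C "$tmp/cl" rev-list --objects --missing=print HEAD
```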
- `fsck` has been updated to be fully aware of promisor objects.

- `repack` in GC has been updated to not touch promisor packfiles at all,
  and to only repack other objects.

- The global variable "fetch_if_missing" is used to control whether an
  object lookup will attempt to dynamically fetch a missing object or
  report an error.

  We are not happy with this global variable and would like to remove it,
  but that requires significant refactoring of the object code to pass an
  additional flag. We hope that concurrent efforts to add an ODB API can
  encompass this.


Fetching Missing Objects
------------------------

- Fetching of objects is done using the existing transport mechanism using
  transport_fetch_refs(), setting a new transport option
  TRANS_OPT_NO_DEPENDENTS to indicate that only the objects themselves are
  desired, not any object that they refer to.

  Because some transports invoke fetch_pack() in the same process, fetch_pack()
  has been updated to not use any object flags when the corresponding argument
  (no_dependents) is set.

- The local repository sends a request with the hashes of all requested
  objects as "want" lines, and does not perform any packfile negotiation.
  It then receives a packfile.

- Because we are reusing the existing fetch-pack mechanism, fetching
  currently fetches all objects referred to by the requested objects, even
  though they are not necessary.

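Putting the pieces together, on-demand fetching can be observed directly
(a sketch with invented paths; assumes the server also enables
`uploadpack.allowAnySHA1InWant` so that individual missing objects may be
requested by ID):

```shell
set -e
tmp=$(mktemp -d)
git init -q "$tmp/srv"
git -C "$tmp/srv" config uploadpack.allowfilter true
git -C "$tmp/srv" config uploadpack.allowanysha1inwant true
echo payload >"$tmp/srv/a.txt"
git -C "$tmp/srv" add a.txt
git -C "$tmp/srv" -c user.name=u -c user.email=u@example.com commit -qm one

git clone -q --no-checkout --filter=blob:none "file://$tmp/srv" "$tmp/cl"

# The blob's ID is known from the (present) tree, but its content is not local.
oid=$(git -C "$tmp/cl" rev-parse "HEAD:a.txt")

# Reading it triggers a dynamic fetch from the promisor remote, then succeeds.
git -C "$tmp/cl" cat-file blob "$oid"      # prints "payload"
```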

Current Limitations
-------------------

- The remote used for a partial clone (or the first partial fetch
  following a regular clone) is marked as the "promisor remote".

  We are currently limited to a single promisor remote and only that
  remote may be used for subsequent partial fetches.

  We accept this limitation because we believe initial users of this
  feature will be using it on repositories with a strong single central
  server.

- Dynamic object fetching will only ask the promisor remote for missing
  objects. We assume that the promisor remote has a complete view of the
  repository and can satisfy all such requests.

- Repack essentially treats promisor and non-promisor packfiles as 2
  distinct partitions and does not mix them. Repack currently only works
  on non-promisor packfiles and loose objects.

- Dynamic object fetching invokes fetch-pack once *for each item*
  because most algorithms stumble upon a missing object and need to have
  it resolved before continuing their work. This may incur significant
  overhead -- and multiple authentication requests -- if many objects are
  needed.

- Dynamic object fetching currently uses the existing pack protocol V0
  which means that each object is requested via fetch-pack. The server
  will send a full set of info/refs when the connection is established.
  If there are a large number of refs, this may incur significant overhead.


Future Work
-----------

- Allow more than one promisor remote and define a strategy for fetching
  missing objects from specific promisor remotes or for iterating over the
  set of promisor remotes until a missing object is found.

  A user might want to have multiple geographically-close cache servers
  for fetching missing blobs while continuing to do filtered `git-fetch`
  commands from the central server, for example.

  Or the user might want to work in a triangular workflow with multiple
  promisor remotes that each have an incomplete view of the repository.

- Allow repack to work on promisor packfiles (while keeping them distinct
  from non-promisor packfiles).

- Allow non-pathname-based filters to make use of packfile bitmaps (when
  present). This was just an omission during the initial implementation.

- Investigate use of a long-running process to dynamically fetch a series
  of objects, such as proposed in [5,6], to reduce process startup and
  overhead costs.

  It would be nice if pack protocol V2 could allow that long-running
  process to make a series of requests over a single long-running
  connection.

- Investigate pack protocol V2 to avoid the info/refs broadcast on
  each connection with the server to dynamically fetch missing objects.

- Investigate the need to handle loose promisor objects.

  Objects in promisor packfiles are allowed to reference missing objects
  that can be dynamically fetched from the server. An assumption was
  made that loose objects are only created locally and therefore should
  not reference a missing object. We may need to revisit that assumption
  if, for example, we dynamically fetch a missing tree and store it as a
  loose object rather than a single-object packfile.

  This does not necessarily mean we need to mark loose objects as promisor;
  it may be sufficient to relax the object lookup or is-promisor functions.

Non-Tasks
---------

- Every time the subject of "demand loading blobs" comes up it seems
  that someone suggests that the server be allowed to "guess" and send
  additional objects that may be related to the requested objects.

  No work has gone into actually doing that; we're just documenting that
  it is a common suggestion. We're not sure how it would work and have
  no plans to work on it.

  It is valid for the server to send more objects than requested (even
  for a dynamic object fetch), but we are not building on that.


Footnotes
---------

[a] expensive-to-modify list of missing objects: Earlier in the design of
    partial clone we discussed the need for a single list of missing objects.
    This would essentially be a sorted linear list of OIDs that were
    omitted by the server during a clone or subsequent fetches.

    This file would need to be loaded into memory on every object lookup.
    It would need to be read, updated, and re-written (like the .git/index)
    on every explicit "git fetch" command *and* on any dynamic object fetch.

    The cost to read, update, and write this file could add significant
    overhead to every command if there are many missing objects. For example,
    if there are 100M missing blobs, this file would be at least 2GiB on disk.

    With the "promisor" concept, we *infer* a missing object based upon the
    type of packfile that references it.

Related Links
-------------
[0] https://bugs.chromium.org/p/git/issues/detail?id=2
    Chromium work item for: Partial Clone

[1] https://public-inbox.org/git/ [email protected]/
    Subject: [RFC] Add support for downloading blobs on demand
    Date: Fri, 13 Jan 2017 10:52:53 -0500

[2] https://public-inbox.org/git/ [email protected]/
    Subject: [PATCH 00/18] Partial clone (from clone to lazy fetch in 18 patches)
    Date: Fri, 29 Sep 2017 13:11:36 -0700

[3] https://public-inbox.org/git/ [email protected]/
    Subject: Proposal for missing blob support in Git repos
    Date: Wed, 26 Apr 2017 15:13:46 -0700

[4] https://public-inbox.org/git/ [email protected]/
    Subject: [PATCH 00/10] RFC Partial Clone and Fetch
    Date: Wed, 8 Mar 2017 18:50:29 +0000

[5] https://public-inbox.org/git/ [email protected]/
    Subject: [PATCH v7 00/10] refactor the filter process code into a reusable module
    Date: Fri, 5 May 2017 11:27:52 -0400

[6] https://public-inbox.org/git/ [email protected]/
    Subject: [RFC/PATCH v2 0/1] Add support for downloading blobs on demand
    Date: Fri, 14 Jul 2017 09:26:50 -0400