
Commit 3d89a8c

ttaylorr authored and gitster committed
Documentation/technical: add cruft-packs.txt
Create a technical document to explain cruft packs. It contains a brief overview of the problem, some background, details on the implementation, and a couple of alternative approaches not considered here.

Signed-off-by: Taylor Blau <[email protected]>
Signed-off-by: Junio C Hamano <[email protected]>
1 parent 6cd33dc commit 3d89a8c

2 files changed: +124 -0 lines changed

Documentation/Makefile

Lines changed: 1 addition & 0 deletions
@@ -94,6 +94,7 @@ TECH_DOCS += MyFirstContribution
 TECH_DOCS += MyFirstObjectWalk
 TECH_DOCS += SubmittingPatches
 TECH_DOCS += technical/bundle-format
+TECH_DOCS += technical/cruft-packs
 TECH_DOCS += technical/hash-function-transition
 TECH_DOCS += technical/http-protocol
 TECH_DOCS += technical/index-format
Documentation/technical/cruft-packs.txt

Lines changed: 123 additions & 0 deletions
@@ -0,0 +1,123 @@
+= Cruft packs
+
+The cruft packs feature offers an alternative to Git's traditional mechanism of
+removing unreachable objects. This document provides an overview of Git's
+pruning mechanism, and how a cruft pack can be used instead to accomplish the
+same.
+
+== Background
+
+To remove unreachable objects from your repository, Git offers `git repack -Ad`
+(see linkgit:git-repack[1]). Quoting from the documentation:
+
+[quote]
+[...] unreachable objects in a previous pack become loose, unpacked objects,
+instead of being left in the old pack. [...] loose unreachable objects will be
+pruned according to normal expiry rules with the next 'git gc' invocation.
+
+Unreachable objects aren't removed immediately, since doing so could race with
+an incoming push which may reference an object which is about to be deleted.
+Instead, those unreachable objects are stored as loose objects and stay that way
+until they are older than the expiration window, at which point they are removed
+by linkgit:git-prune[1].
+
+Git must store these unreachable objects loose in order to keep track of their
+per-object mtimes. If these unreachable objects were written into one big pack,
+then either freshening that pack (because an object contained within it was
+re-written) or creating a new pack of unreachable objects would cause the pack's
+mtime to get updated, and the objects within it would never leave the expiration
+window. Instead, objects are stored loose in order to keep track of the
+individual object mtimes and avoid a situation where all cruft objects are
+freshened at once.
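The difference between one shared pack mtime and per-object mtimes can be illustrated with a small toy model. This is only a sketch, not how Git is implemented: `SinglePack`, `LooseStore`, `simulate_month`, and the two-week `GRACE_PERIOD` are hypothetical names and values chosen for illustration.

```python
from dataclasses import dataclass, field

DAY = 24 * 60 * 60
GRACE_PERIOD = 14 * DAY  # hypothetical expiration window, in seconds


@dataclass
class SinglePack:
    """Toy model: all unreachable objects share the pack's single mtime."""
    mtime: int = 0
    objects: set = field(default_factory=set)

    def add_or_freshen(self, oid: str, now: int) -> None:
        # Any write touches the pack, so every object in it looks fresh.
        self.objects.add(oid)
        self.mtime = now

    def expired(self, now: int) -> set:
        return set(self.objects) if now - self.mtime > GRACE_PERIOD else set()


@dataclass
class LooseStore:
    """Toy model: each unreachable object keeps its own mtime."""
    mtimes: dict = field(default_factory=dict)

    def add_or_freshen(self, oid: str, now: int) -> None:
        self.mtimes[oid] = now

    def expired(self, now: int) -> set:
        return {oid for oid, t in self.mtimes.items() if now - t > GRACE_PERIOD}


def simulate_month(store, days: int = 30) -> set:
    """Add one unreachable object per day, then report what may be pruned."""
    for d in range(days):
        store.add_or_freshen(f"obj{d}", now=d * DAY)
    return store.expired(now=days * DAY)
```

In this model `simulate_month(SinglePack())` never reports anything as prunable, because each daily write resets the shared mtime, whereas `simulate_month(LooseStore())` reports the objects that have individually aged past the window.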
+
+This can lead to undesirable situations when a repository contains many
+unreachable objects which have not yet left the grace period. Having large
+directories in the shards of `.git/objects` can lead to decreased performance in
+the repository. Worse, given enough unreachable objects, this can lead to inode
+starvation and degrade the performance of the whole system. Because we can never
+pack those objects, these repositories also often take up a large amount of disk
+space: the objects can only be zlib-compressed, not stored in delta chains.
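To get a rough sense of how many loose files a repository is carrying, one can count the entries in each shard directory. The helper below is a minimal sketch; `loose_object_counts` is a hypothetical name, and it assumes that every regular file inside a two-hex-digit directory is a loose object (so stray temporary files would be counted too).

```python
import os
import re

# Shard directories under .git/objects are named with two hex digits.
SHARD = re.compile(r"^[0-9a-f]{2}$")


def loose_object_counts(objects_dir: str) -> dict:
    """Return {shard name: number of regular files} for each shard directory."""
    counts = {}
    for entry in sorted(os.listdir(objects_dir)):
        path = os.path.join(objects_dir, entry)
        # Skip `pack/`, `info/`, and anything else that is not a shard.
        if SHARD.match(entry) and os.path.isdir(path):
            counts[entry] = sum(
                1
                for name in os.listdir(path)
                if os.path.isfile(os.path.join(path, name))
            )
    return counts
```

For example, `sum(loose_object_counts(".git/objects").values())` gives a total that should be comparable to the `count` line printed by `git count-objects -v`, and the largest per-shard value hints at how big individual directories have grown.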
+
+== Cruft packs
+
+A cruft pack eliminates the need for storing unreachable objects in a loose
+state by including the per-object mtimes in a separate file alongside a single
+pack containing all of the objects that would otherwise be stored loose.
+
+A cruft pack is written by `git repack --cruft` when generating a new pack,
+using linkgit:git-pack-objects[1]'s `--cruft` option. Note that
+`git repack --cruft` is a classic all-into-one repack, meaning that everything
+in the resulting pack is reachable, and everything else is unreachable. Once
+that pack is written, the `--cruft` option instructs `git repack` to generate
+another pack containing only objects not packed in the previous step (which
+equates to packing all unreachable objects together). This progresses as
+follows:
+
+1. Enumerate every object, marking as a traversal tip any object which (a) is
+   not contained in a kept pack, and (b) has an mtime within the grace period.
+
+2. Perform a reachability traversal based on the tips gathered in the previous
+   step, adding every object along the way to the pack.
+
+3. Write the pack out, along with a `.mtimes` file that records the per-object
+   timestamps.
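The three steps above can be sketched over a toy object graph. This is an illustrative model under stated assumptions, not Git's implementation: `plan_cruft_pack` and its parameters are hypothetical, objects are plain strings, and the sketch assumes that objects already in a kept pack are skipped during the traversal.

```python
def plan_cruft_pack(objects, kept, mtimes, now, grace_period):
    """Return {oid: mtime} for the objects that would go into the cruft pack.

    objects: {oid: [oids it references]} for every known object
    kept:    set of oids stored in kept packs (assumed reachable)
    mtimes:  {oid: mtime} for every object not in a kept pack
    """
    # Step 1: tips are objects outside kept packs that are still within
    # the grace period.
    tips = [
        oid
        for oid in objects
        if oid not in kept and now - mtimes[oid] <= grace_period
    ]

    # Step 2: walk everything reachable from the tips, collecting objects.
    # Assumption: objects already in a kept pack are not packed again.
    packed = set()
    stack = list(tips)
    while stack:
        oid = stack.pop()
        if oid in packed or oid in kept:
            continue
        packed.add(oid)
        stack.extend(objects.get(oid, ()))

    # Step 3: the returned mapping stands in for the pack contents plus the
    # per-object timestamps that the `.mtimes` file would record.
    return {oid: mtimes[oid] for oid in sorted(packed)}
```

In this model an old object that a recent object still references is carried into the plan with its own, older mtime, while an old object that nothing recent references is left out.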
+
+This mode is invoked internally by linkgit:git-repack[1] when instructed to
+write a cruft pack. Crucially, the set of in-core kept packs is exactly the set
+of packs which will not be deleted by the repack; in other words, they contain
+all of the repository's reachable objects.
+
+When a repository already has a cruft pack, `git repack --cruft` typically only
+adds objects to it. An exception to this is when `git repack` is given the
+`--cruft-expiration` option, which allows the generated cruft pack to omit
+expired objects instead of waiting for linkgit:git-gc[1] to expire those objects
+later on.
+
+It is linkgit:git-gc[1] that is typically responsible for removing expired
+unreachable objects.
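The difference between the default behavior and an expiring repack can be modeled as a merge of mtime tables. This is a deliberately simplified sketch: `next_cruft_mtimes` is a hypothetical function, keeping the newest mtime for an object seen twice is an assumption of this model, and the filter ignores the case where an expired object is still referenced by a recent one.

```python
from typing import Dict, Optional


def next_cruft_mtimes(
    existing: Dict[str, int],
    newly_unreachable: Dict[str, int],
    now: int,
    expiration: Optional[int] = None,
) -> Dict[str, int]:
    """Model the mtime table of the cruft pack produced by the next repack."""
    merged = dict(existing)
    for oid, mtime in newly_unreachable.items():
        # Assumption: when an object is seen twice, keep its newest mtime.
        merged[oid] = max(mtime, merged.get(oid, 0))

    if expiration is None:
        # Default behavior in this model: the cruft pack only grows.
        return merged

    # With an expiration, objects older than the window are omitted.
    return {oid: t for oid, t in merged.items() if now - t <= expiration}
```

Calling it without `expiration` returns every object from both inputs, while passing a window drops the entries whose age exceeds it.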
+
+== Caution for mixed-version environments
+
+Repositories that have cruft packs in them will continue to work with any older
+version of Git. Note, however, that previous versions of Git which do not
+understand the `.mtimes` file will use the cruft pack's mtime as the mtime for
+all of the objects in it. In other words, do not expect older (pre-cruft pack)
+versions of Git to interpret or even read the contents of the `.mtimes` file.
+
+Note that having mixed versions of Git GC-ing the same repository can lead to
+unreachable objects never being completely pruned. This can happen under the
+following circumstances:
+
+- An older version of Git running GC explodes the contents of an existing
+  cruft pack loose, using the cruft pack's mtime.
+- A newer version running GC collects those loose objects into a cruft pack,
+  where the `.mtimes` file reflects the loose objects' actual mtimes, but the
+  cruft pack mtime is "now".
+
+Repeating this process will lead to unreachable objects not getting pruned as a
+result of repeatedly resetting the objects' mtimes to the present time.
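The alternation described above can be traced with a small simulation. It is only a sketch of the described behavior: `old_git_gc`, `new_git_gc`, and `simulate_mixed_gc` are hypothetical names, a single unreachable object stands in for the whole cruft pack, and pruning itself is not modeled (the sketch assumes GC runs more often than the grace period).

```python
def old_git_gc(cruft_pack):
    """Older Git: explode the cruft pack loose; every object inherits the
    pack's mtime because the `.mtimes` file is not read."""
    return {oid: cruft_pack["pack_mtime"] for oid in cruft_pack["mtimes"]}


def new_git_gc(loose, now):
    """Newer Git: collect loose objects into a cruft pack written 'now',
    recording each loose object's mtime in the `.mtimes` data."""
    return {"pack_mtime": now, "mtimes": dict(loose)}


def simulate_mixed_gc(rounds, interval):
    """Alternate old and new GC runs; return (elapsed time, apparent age)."""
    now = 0
    pack = {"pack_mtime": now, "mtimes": {"obj": now}}
    for _ in range(rounds):
        now += interval
        loose = old_git_gc(pack)
        now += interval
        pack = new_git_gc(loose, now)
    return now, now - pack["mtimes"]["obj"]
```

However many rounds are run, the object's apparent age in this model stays at two GC intervals even though the elapsed time keeps growing, which is why it never appears old enough to prune.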
+
+If you are GC-ing repositories in a mixed-version environment, consider omitting
+the `--cruft` option when using linkgit:git-repack[1] and linkgit:git-gc[1], and
+leaving the `gc.cruftPacks` configuration unset until all writers understand
+cruft packs.
+
+== Alternatives
+
+Notable alternatives to this design include:
+
+- Storing the per-object mtime data in a different location, and
+- Storing unreachable objects in multiple cruft packs.
+
+On the location of mtime data, a new auxiliary file tied to the pack was chosen
+to avoid complicating the `.idx` format. If the `.idx` format were ever to gain
+support for optional chunks of data, it may make sense to consolidate the
+`.mtimes` format into the `.idx` itself.
+
+Storing unreachable objects among multiple cruft packs (e.g., creating a new
+cruft pack during each repacking operation including only unreachable objects
+which aren't already stored in an earlier cruft pack) is significantly more
+complicated to construct, and so isn't pursued here. The obvious drawback to
+the current implementation is that the entire cruft pack must be re-written from
+scratch.
