Commit ae30d7b

derrickstolee authored and gitster committed
graph: add commit graph design document
Add Documentation/technical/commit-graph.txt with details of the planned
commit graph feature, including future plans.

Signed-off-by: Derrick Stolee <[email protected]>
Signed-off-by: Junio C Hamano <[email protected]>
1 parent b84f767 commit ae30d7b

1 file changed: +163 -0 lines changed
@@ -0,0 +1,163 @@
Git Commit Graph Design Notes
=============================

Git walks the commit graph for many reasons, including:

  1. Listing and filtering commit history.
  2. Computing merge bases.

These operations can become slow as the commit count grows. The merge
base calculation shows up in many user-facing commands, such as 'merge-base'
or 'status', and can take minutes to compute depending on history shape.

There are two main costs here:

  1. Decompressing and parsing commits.
  2. Walking the entire graph to satisfy topological order constraints.

The commit graph file is a supplemental data structure that accelerates
commit graph walks. If a user downgrades Git or disables the
'core.commitGraph' config setting, then the existing ODB is sufficient.
The file is stored as "commit-graph" either in the .git/objects/info
directory or in the info directory of an alternate.

The commit graph file stores the commit graph structure along with some
extra metadata to speed up graph walks. By listing commit OIDs in
lexicographic order, we can identify an integer position for each commit
and refer to the parents of a commit using those integer positions. We
use binary search to find initial commits and then use the integer
positions for fast lookups during the walk.

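The position scheme described above can be sketched in a few lines of
Python. This is only an illustration under stated assumptions: the short
hex strings stand in for real OIDs, and the names 'history',
'parent_positions' and 'find_position' are hypothetical. None of them
come from Git or from the commit graph file format.

```python
from bisect import bisect_left

# Hypothetical toy history: OID -> list of parent OIDs.
# Short hex strings stand in for full object IDs.
history = {
    "c3a1": ["1f9e", "7b20"],
    "7b20": ["04d5"],
    "1f9e": ["04d5"],
    "04d5": [],
}

# Sorting the OIDs gives every commit a stable integer position.
oids = sorted(history)
position = {oid: i for i, oid in enumerate(oids)}

# Parents are recorded as integer positions instead of OIDs.
parent_positions = [[position[p] for p in history[oid]] for oid in oids]


def find_position(oid):
    """Binary-search the sorted OID list; return None if the OID is absent."""
    i = bisect_left(oids, oid)
    return i if i < len(oids) and oids[i] == oid else None


# One binary search locates the starting commit. After that, the walk
# follows integer positions only.
start = find_position("c3a1")
first_parent = oids[parent_positions[start][0]]
```

Only the starting commit pays for a binary search. Every later step is a
list index, which is the "fast lookup" the paragraph above refers to.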
A consumer may load the following info for a commit from the graph:

  1. The commit OID.
  2. The list of parents, along with their integer positions.
  3. The commit date.
  4. The root tree OID.
  5. The generation number (see definition below).

Values 1-4 satisfy the requirements of parse_commit_gently().

Define the "generation number" of a commit recursively as follows:

 * A commit with no parents (a root commit) has generation number one.

 * A commit with at least one parent has generation number one more than
   the largest generation number among its parents.

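As a minimal sketch of the recursive definition above, the following
Python computes generation numbers for a small history. The function name
'compute_generations' and the five-commit example are assumptions made for
illustration. The sketch uses an explicit stack rather than recursion so
that long histories do not exhaust the call stack.

```python
def compute_generations(parents):
    """Return {commit: generation number} for a {commit: [parents]} mapping."""
    generation = {}
    for tip in parents:
        stack = [tip]
        while stack:
            commit = stack[-1]
            if commit in generation:
                stack.pop()
                continue
            # Parents must be numbered before the commit itself.
            pending = [p for p in parents[commit] if p not in generation]
            if pending:
                stack.extend(pending)
            else:
                # A root commit has no parents, so max() falls back to 0
                # and the commit gets generation number one.
                generation[commit] = 1 + max(
                    (generation[p] for p in parents[commit]), default=0
                )
                stack.pop()
    return generation


# Hypothetical history: D merges B and C, E merges D and the root A.
parents = {"A": [], "B": ["A"], "C": ["A"], "D": ["B", "C"], "E": ["D", "A"]}
generation = compute_generations(parents)
```

Here E has the root A as a direct parent, yet its generation number is
driven by its other parent D, because the definition takes the largest
value among the parents.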
Equivalently, the generation number of a commit A is one more than the
length of a longest path from A to a root commit. The recursive definition
is easier to use for computation and for observing the following property:

    If A and B are distinct commits with generation numbers N and M,
    respectively, and N <= M, then A cannot reach B. That is, we know
    without searching that B is not an ancestor of A because it is at
    least as far from a root commit as A.

Conversely, when checking if A is an ancestor of B, we only need to walk
commits starting from B until all commits on the walk boundary have
generation number at most N. If we walk commits using a priority queue
ordered by generation number, then we always expand the boundary commit
with highest generation number and can easily detect the stopping
condition.

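The stopping condition described above can be sketched as follows. This is
an illustrative Python model, not Git's implementation: the function name
'is_ancestor', the example history and the precomputed 'generation' table
are all assumptions. Python's heapq is a min-heap, so generation numbers
are negated to pop the highest one first.

```python
import heapq

# Hypothetical history and its generation numbers.
parents = {"A": [], "B": ["A"], "C": ["A"], "D": ["B", "C"], "E": ["D", "A"]}
generation = {"A": 1, "B": 2, "C": 2, "D": 3, "E": 4}


def is_ancestor(a, b, parents, generation):
    """Return True if commit `a` can be reached from commit `b`."""
    n = generation[a]
    heap = [(-generation[b], b)]  # negated key: highest generation pops first
    seen = {b}
    while heap:
        neg_gen, commit = heapq.heappop(heap)
        if commit == a:
            return True
        if -neg_gen < n:
            # Every commit left on the boundary is below generation N,
            # so none of them can be `a` or reach `a`.
            return False
        if -neg_gen == n:
            # Same generation as `a` but a different commit: its parents
            # are all below N, so there is no need to expand it.
            continue
        for parent in parents[commit]:
            if parent not in seen:
                seen.add(parent)
                heapq.heappush(heap, (-generation[parent], parent))
    return False
```

With this data, is_ancestor("C", "B", parents, generation) returns False
without visiting any parent of B, because B and C share generation
number 2.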
This property can be used to significantly reduce the time it takes to
walk commits and determine topological relationships. Without generation
numbers, the general heuristic is the following:

    If A and B are commits with commit time X and Y, respectively, and
    X < Y, then A _probably_ cannot reach B.

This heuristic is currently used whenever the computation is allowed to
violate topological relationships due to clock skew (such as "git log"
with default order), but is not used when the topological order is
required (such as merge base calculations, "git log --graph").

In practice, we expect some commits to be created recently and not stored
in the commit graph. We can treat these commits as having "infinite"
generation number and walk until reaching commits with known generation
number.

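One way to picture the "infinite" generation number is the Python sketch
below. The sentinel name 'GENERATION_INFINITY', the helper names and the
example data are assumptions made for illustration. The document does not
specify how the sentinel is represented.

```python
import math

# Hypothetical sentinel for commits that are missing from the graph file.
GENERATION_INFINITY = math.inf

# Generation numbers stored in the (hypothetical) graph file.
graph_generations = {"A": 1, "B": 2}

# N1 and N2 were created after the graph file was written.
parents = {"N2": ["N1"], "N1": ["B"], "B": ["A"], "A": []}


def generation_of(commit):
    """Stored generation number, or infinity for commits not in the graph."""
    return graph_generations.get(commit, GENERATION_INFINITY)


def walk_to_known(tip):
    """Walk from `tip` until every path has reached a commit in the graph."""
    frontier, seen, known = [tip], {tip}, set()
    while frontier:
        commit = frontier.pop()
        if commit in graph_generations:
            known.add(commit)  # stop here: generation numbers take over
            continue
        for parent in parents[commit]:
            if parent not in seen:
                seen.add(parent)
                frontier.append(parent)
    return known
```

Because infinity compares greater than any stored value, a queue ordered
by generation number would visit the unstored commits before any commit
from the graph file.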
Design Details
--------------

- The commit graph file is stored in a file named 'commit-graph' in the
  .git/objects/info directory. This could be stored in the info directory
  of an alternate.

- The core.commitGraph config setting must be on to consume graph files.

- The file format includes parameters for the object ID hash function,
  so a future change of hash algorithm does not require a change in format.

Future Work
-----------

- The commit graph feature currently does not honor commit grafts. This can
  be remedied by duplicating or refactoring the current graft logic.

- The 'commit-graph' subcommand does not have a "verify" mode that is
  necessary for integration with fsck.

- The file format includes room for precomputed generation numbers. These
  are not currently computed, so all generation numbers will be marked as
  0 (or "uncomputed"). A later patch will include this calculation.

- After computing and storing generation numbers, we must make graph
  walks aware of generation numbers to gain the performance benefits they
  enable. This will mostly be accomplished by swapping a commit-date-ordered
  priority queue with one ordered by generation number. The following
  operations are important candidates:

    - paint_down_to_common()
    - 'log --topo-order'

- Currently, parse_commit_gently() requires filling in the root tree
  object for a commit. This passes through lookup_tree() and consequently
  lookup_object(). Also, it calls lookup_commit() when loading the parents.
  These method calls check the ODB for object existence, even if the
  consumer does not need the content. For example, we do not need the
  tree contents when computing merge bases. Now that commit parsing is
  removed from the computation time, these lookup operations are the
  slowest operations keeping graph walks from being fast. Consider
  loading these objects without verifying their existence in the ODB and
  only loading them fully when consumers need them. Consider a method
  such as "ensure_tree_loaded(commit)" that fully loads a tree before
  using commit->tree.

- The current design uses the 'commit-graph' subcommand to generate the graph.
  When this feature stabilizes enough to recommend to most users, we should
  add automatic graph writes to common operations that create many commits.
  For example, one could compute a graph on 'clone', 'fetch', or 'repack'
  commands.

- A server could provide a commit graph file as part of the network protocol
  to avoid extra calculations by clients. This feature is only of benefit if
  the user is willing to trust the file, because verifying the file is correct
  is as hard as computing it from scratch.

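The deferred loading idea from the parse_commit_gently() item above could
look roughly like the following Python model. It is a sketch under stated
assumptions and not Git code: the 'Commit' class, the 'odb_reads' log and
'read_tree_from_odb' are hypothetical stand-ins, and only the name
"ensure_tree_loaded" is taken from the text.

```python
odb_reads = []  # records every simulated object database access


def read_tree_from_odb(tree_oid):
    """Hypothetical stand-in for an expensive object database lookup."""
    odb_reads.append(tree_oid)
    return {"oid": tree_oid, "entries": []}


class Commit:
    def __init__(self, oid, tree_oid):
        self.oid = oid
        self.tree_oid = tree_oid  # known from the commit graph file
        self.tree = None          # not loaded until a consumer asks for it


def ensure_tree_loaded(commit):
    """Load the root tree on first use and reuse it afterwards."""
    if commit.tree is None:
        commit.tree = read_tree_from_odb(commit.tree_oid)
    return commit.tree


commit = Commit("c3a1", "9d41")
reads_before_use = len(odb_reads)   # a merge-base walk would stay here
tree = ensure_tree_loaded(commit)   # a consumer that needs tree contents
tree_again = ensure_tree_loaded(commit)
```

A walk that only follows parents never calls ensure_tree_loaded(), so it
never touches the object database for trees. A consumer that does need the
tree pays for the lookup once.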
Related Links
-------------
[0] https://bugs.chromium.org/p/git/issues/detail?id=8
    Chromium work item for: Serialized Commit Graph

[1] https://public-inbox.org/git/[email protected]/
    An abandoned patch that introduced generation numbers.

[2] https://public-inbox.org/git/[email protected]/
    Discussion about generation numbers on commits and how they interact
    with fsck.

[3] https://public-inbox.org/git/[email protected]/
    More discussion about generation numbers and not storing them inside
    commit objects. A valuable quote:

    "I think we should be moving more in the direction of keeping
     repo-local caches for optimizations. Reachability bitmaps have been
     a big performance win. I think we should be doing the same with our
     properties of commits. Not just generation numbers, but making it
     cheap to access the graph structure without zlib-inflating whole
     commit objects (i.e., packv4 or something like the "metapacks" I
     proposed a few years ago)."

[4] https://public-inbox.org/git/[email protected]/T/#u
    A patch to remove the ahead-behind calculation from 'status'.
