
Commit b10edb2

Merge branch 'ds/commit-graph'
Precompute and store information necessary for ancestry traversal
in a separate file to optimize graph walking.

* ds/commit-graph:
  commit-graph: implement "--append" option
  commit-graph: build graph from starting commits
  commit-graph: read only from specific pack-indexes
  commit: integrate commit graph with commit parsing
  commit-graph: close under reachability
  commit-graph: add core.commitGraph setting
  commit-graph: implement git commit-graph read
  commit-graph: implement git-commit-graph write
  commit-graph: implement write_commit_graph()
  commit-graph: create git-commit-graph builtin
  graph: add commit graph design document
  commit-graph: add format document
  csum-file: refactor finalize_hashfile() method
  csum-file: rename hashclose() to finalize_hashfile()
2 parents 4f4d0b4 + 7547b95 commit b10edb2

30 files changed, +1587 -21 lines changed

.gitignore

Lines changed: 1 addition & 0 deletions
@@ -35,6 +35,7 @@
 /git-clone
 /git-column
 /git-commit
+/git-commit-graph
 /git-commit-tree
 /git-config
 /git-count-objects

Documentation/config.txt

Lines changed: 4 additions & 0 deletions
@@ -898,6 +898,10 @@ core.notesRef::
 This setting defaults to "refs/notes/commits", and it can be overridden by
 the `GIT_NOTES_REF` environment variable. See linkgit:git-notes[1].

+core.commitGraph::
+Enable git commit graph feature. Allows reading from the
+commit-graph file.
+
 core.sparseCheckout::
 Enable "sparse checkout" feature. See section "Sparse checkout" in
 linkgit:git-read-tree[1] for more information.

Documentation/git-commit-graph.txt

Lines changed: 94 additions & 0 deletions
@@ -0,0 +1,94 @@
+git-commit-graph(1)
+===================
+
+NAME
+----
+git-commit-graph - Write and verify Git commit graph files
+
+
+SYNOPSIS
+--------
+[verse]
+'git commit-graph read' [--object-dir <dir>]
+'git commit-graph write' <options> [--object-dir <dir>]
+
+
+DESCRIPTION
+-----------
+
+Manage the serialized commit graph file.
+
+
+OPTIONS
+-------
+--object-dir::
+Use given directory for the location of packfiles and commit graph
+file. This parameter exists to specify the location of an alternate
+that only has the objects directory, not a full .git directory. The
+commit graph file is expected to be at <dir>/info/commit-graph and
+the packfiles are expected to be in <dir>/pack.
+
+
+COMMANDS
+--------
+'write'::
+
+Write a commit graph file based on the commits found in packfiles.
++
+With the `--stdin-packs` option, generate the new commit graph by
+walking objects only in the specified pack-indexes. (Cannot be combined
+with --stdin-commits.)
++
+With the `--stdin-commits` option, generate the new commit graph by
+walking commits starting at the commits specified in stdin as a list
+of OIDs in hex, one OID per line. (Cannot be combined with
+--stdin-packs.)
++
+With the `--append` option, include all commits that are present in the
+existing commit-graph file.
+
+'read'::
+
+Read a graph file given by the commit-graph file and output basic
+details about the graph file. Used for debugging purposes.
+
+
+EXAMPLES
+--------
+
+* Write a commit graph file for the packed commits in your local .git folder.
++
+------------------------------------------------
+$ git commit-graph write
+------------------------------------------------
+
+* Write a graph file, extending the current graph file using commits
+in <pack-index>.
++
+------------------------------------------------
+$ echo <pack-index> | git commit-graph write --stdin-packs
+------------------------------------------------
+
+* Write a graph file containing all reachable commits.
++
+------------------------------------------------
+$ git show-ref -s | git commit-graph write --stdin-commits
+------------------------------------------------
+
+* Write a graph file containing all commits in the current
+commit-graph file along with those reachable from HEAD.
++
+------------------------------------------------
+$ git rev-parse HEAD | git commit-graph write --stdin-commits --append
+------------------------------------------------
+
+* Read basic information from the commit-graph file.
++
+------------------------------------------------
+$ git commit-graph read
+------------------------------------------------
+
+
+GIT
+---
+Part of the linkgit:git[1] suite
Lines changed: 97 additions & 0 deletions
@@ -0,0 +1,97 @@
+Git commit graph format
+=======================
+
+The Git commit graph stores a list of commit OIDs and some associated
+metadata, including:
+
+- The generation number of the commit. Commits with no parents have
+generation number 1; commits with parents have generation number
+one more than the maximum generation number of their parents. We
+reserve zero as special; it can be used to mark a generation
+number invalid or as "not computed".
+
+- The root tree OID.
+
+- The commit date.
+
+- The parents of the commit, stored using positional references within
+the graph file.
+
+These positional references are stored as unsigned 32-bit integers
+corresponding to the array position within the list of commit OIDs. We
+use the most-significant bit for special purposes, so we can store at most
+(1 << 31) - 1 (around 2 billion) commits.
+
+== Commit graph files have the following format:
+
+In order to allow extensions that add extra data to the graph, we organize
+the body into "chunks" and provide a binary lookup table at the beginning
+of the body. The header includes certain values, such as number of chunks
+and hash type.
+
+All 4-byte numbers are in network order.
+
+HEADER:
+
+4-byte signature:
+The signature is: {'C', 'G', 'P', 'H'}
+
+1-byte version number:
+Currently, the only valid version is 1.
+
+1-byte Hash Version (1 = SHA-1)
+We infer the hash length (H) from this value.
+
+1-byte number (C) of "chunks"
+
+1-byte (reserved for later use)
+Current clients should ignore this value.
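The following sketch is not part of this commit. It shows one way a reader could validate the 8-byte header laid out above. The struct and function names (`toy_graph_header`, `toy_parse_graph_header`) are hypothetical, and rejecting any version or hash version other than 1 is an assumption drawn from the values the format text lists as currently valid.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical in-memory view of the header fields described above. */
struct toy_graph_header {
	uint8_t version;      /* only 1 is valid per the format text */
	uint8_t hash_version; /* 1 = SHA-1 */
	uint8_t num_chunks;   /* C in the format text */
};

/*
 * Parse the fixed 8-byte header. Returns 0 on success and -1 if the
 * buffer is too short, the signature is wrong, or a version field is
 * not one this sketch understands.
 */
int toy_parse_graph_header(const unsigned char *buf, size_t len,
			   struct toy_graph_header *out)
{
	if (len < 8 || memcmp(buf, "CGPH", 4) != 0)
		return -1;
	if (buf[4] != 1 || buf[5] != 1)
		return -1;
	out->version = buf[4];
	out->hash_version = buf[5];
	out->num_chunks = buf[6];
	/* buf[7] is reserved; readers are told to ignore it. */
	return 0;
}
```

With a valid header in hand, the chunk lookup table would start at byte offset 8.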
+
+CHUNK LOOKUP:
+
+(C + 1) * 12 bytes listing the table of contents for the chunks:
+First 4 bytes describe the chunk id. Value 0 is a terminating label.
+Other 8 bytes provide the byte-offset in current file for chunk to
+start. (Chunks are ordered contiguously in the file, so you can infer
+the length using the next chunk position if necessary.) Each chunk
+ID appears at most once.
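The following sketch is not part of this commit. It illustrates how the table of contents above could be scanned for one chunk and how the chunk length falls out of the next entry's offset. The names `toy_find_chunk`, `read_be32` and `read_be64` are hypothetical, and reading the 8-byte offset as big-endian is an assumption (the text only states network order for 4-byte numbers).

```c
#include <stddef.h>
#include <stdint.h>

/* Read a 4-byte big-endian (network order) integer. */
static uint32_t read_be32(const unsigned char *p)
{
	return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
	       ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

/* Read an 8-byte big-endian integer (byte order assumed, see lead-in). */
static uint64_t read_be64(const unsigned char *p)
{
	return ((uint64_t)read_be32(p) << 32) | read_be32(p + 4);
}

/*
 * Scan a lookup table of (num_chunks + 1) entries of 12 bytes each.
 * On a match, store the chunk's start offset and its length, inferred
 * from the following entry (the terminating entry for the last chunk).
 * Returns 0 if the chunk id was found, -1 otherwise.
 */
int toy_find_chunk(const unsigned char *table, unsigned num_chunks,
		   uint32_t id, uint64_t *offset, uint64_t *length)
{
	for (unsigned i = 0; i < num_chunks; i++) {
		const unsigned char *entry = table + 12 * (size_t)i;

		if (read_be32(entry) == id) {
			*offset = read_be64(entry + 4);
			*length = read_be64(entry + 12 + 4) - *offset;
			return 0;
		}
	}
	return -1;
}
```

Because every chunk ID appears at most once, returning on the first match is sufficient.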
+
+The remaining data in the body is described one chunk at a time, and
+these chunks may be given in any order. Chunks are required unless
+otherwise specified.
+
+CHUNK DATA:
+
+OID Fanout (ID: {'O', 'I', 'D', 'F'}) (256 * 4 bytes)
+The ith entry, F[i], stores the number of OIDs with first
+byte at most i. Thus F[255] stores the total
+number of commits (N).
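The following sketch is not part of this commit. It shows how the fanout table described above can narrow a binary search over the OID Lookup chunk to the half-open range of positions whose OIDs start with a given byte. The function name `toy_fanout_range` and the helper `read_be32` are hypothetical.

```c
#include <stdint.h>

/* Read a 4-byte big-endian (network order) integer. */
static uint32_t read_be32(const unsigned char *p)
{
	return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
	       ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

/*
 * Given the 256 * 4 byte fanout chunk, compute the range [*lo, *hi) of
 * positions in the sorted OID list whose first byte equals first_byte.
 * F[i] counts OIDs with first byte at most i, so the range starts at
 * F[first_byte - 1] (or 0 for byte 0) and ends at F[first_byte].
 */
void toy_fanout_range(const unsigned char *fanout, unsigned char first_byte,
		      uint32_t *lo, uint32_t *hi)
{
	*lo = first_byte ? read_be32(fanout + 4 * (first_byte - 1)) : 0;
	*hi = read_be32(fanout + 4 * first_byte);
}
```

An empty range (`*lo == *hi`) means no commit in the file has an OID starting with that byte, so the lookup can stop without touching the OID Lookup chunk.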
+
+OID Lookup (ID: {'O', 'I', 'D', 'L'}) (N * H bytes)
+The OIDs for all commits in the graph, sorted in ascending order.
+
+Commit Data (ID: {'C', 'G', 'E', 'T' }) (N * (H + 16) bytes)
+* The first H bytes are for the OID of the root tree.
+* The next 8 bytes are for the positions of the first two parents
+of the ith commit. Stores value 0xffffffff if no parent in that
+position. If there are more than two parents, the second value
+has its most-significant bit on and the other bits store an array
+position into the Large Edge List chunk.
+* The next 8 bytes store the generation number of the commit and
+the commit time in seconds since EPOCH. The generation number
+uses the higher 30 bits of the first 4 bytes, while the commit
+time uses the 32 bits of the second 4 bytes, along with the lowest
+2 bits of the lowest byte, storing the 33rd and 34th bit of the
+commit time.
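The following sketch is not part of this commit. It decodes one Commit Data entry according to the bit layout described above: two 4-byte parent positions, then a 4-byte word holding the generation number in its upper 30 bits and bits 33 and 34 of the commit time in its lower 2 bits, then the low 32 bits of the commit time. The struct and function names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Read a 4-byte big-endian (network order) integer. */
static uint32_t read_be32(const unsigned char *p)
{
	return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
	       ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

/* Hypothetical decoded form of the 16 bytes that follow the tree OID. */
struct toy_commit_data {
	uint32_t parent1;     /* 0xffffffff when there is no first parent */
	uint32_t parent2;     /* 0xffffffff, a position, or MSB + edge index */
	uint32_t generation;  /* 0 means "not computed" */
	uint64_t commit_time; /* 34-bit seconds since the epoch */
};

/* Decode the entry starting at `entry`; hash_len is H (20 for SHA-1). */
void toy_decode_commit_data(const unsigned char *entry, size_t hash_len,
			    struct toy_commit_data *out)
{
	const unsigned char *p = entry + hash_len; /* skip root tree OID */
	uint32_t gen_and_time_hi = read_be32(p + 8);

	out->parent1 = read_be32(p);
	out->parent2 = read_be32(p + 4);
	out->generation = gen_and_time_hi >> 2;
	out->commit_time = ((uint64_t)(gen_and_time_hi & 0x3) << 32) |
			   read_be32(p + 12);
}
```

Packing two extra time bits next to the generation number lets the format represent commit times beyond what fits in 32 bits.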
+
+Large Edge List (ID: {'E', 'D', 'G', 'E'}) [Optional]
+This list of 4-byte values stores the second through nth parents for
+all octopus merges. The second parent value in the commit data stores
+an array position within this list along with the most-significant bit
+on. Starting at that array position, iterate through this list of commit
+positions for the parents until reaching a value with the most-significant
+bit on. The other bits correspond to the position of the last parent.
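The following sketch is not part of this commit. It walks the Large Edge List as described above: the second parent value from the commit data carries the most-significant bit plus a starting index, and the walk ends at the first list value that itself has the most-significant bit set. The function name and the bounds parameters (`num_edges`, `max_parents`) are hypothetical additions for safety.

```c
#include <stddef.h>
#include <stdint.h>

/* Read a 4-byte big-endian (network order) integer. */
static uint32_t read_be32(const unsigned char *p)
{
	return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
	       ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

/*
 * Collect the second through nth parent positions of an octopus merge.
 * second_parent_value is the raw second parent field from Commit Data.
 * Returns the number of positions written to `parents`, or 0 when the
 * most-significant bit is not set (the commit is not an octopus merge).
 */
size_t toy_read_extra_parents(const unsigned char *edge_list, size_t num_edges,
			      uint32_t second_parent_value,
			      uint32_t *parents, size_t max_parents)
{
	size_t count = 0;
	size_t idx;

	if (!(second_parent_value & 0x80000000u))
		return 0;
	idx = second_parent_value & 0x7fffffffu;
	while (idx < num_edges && count < max_parents) {
		uint32_t value = read_be32(edge_list + 4 * idx);

		parents[count++] = value & 0x7fffffffu;
		if (value & 0x80000000u)
			break; /* this was the last parent */
		idx++;
	}
	return count;
}
```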
+
+TRAILER:
+
+H-byte HASH-checksum of all of the above.
Lines changed: 163 additions & 0 deletions
@@ -0,0 +1,163 @@
+Git Commit Graph Design Notes
+=============================
+
+Git walks the commit graph for many reasons, including:
+
+1. Listing and filtering commit history.
+2. Computing merge bases.
+
+These operations can become slow as the commit count grows. The merge
+base calculation shows up in many user-facing commands, such as 'merge-base'
+or 'status' and can take minutes to compute depending on history shape.
+
+There are two main costs here:
+
+1. Decompressing and parsing commits.
+2. Walking the entire graph to satisfy topological order constraints.
+
+The commit graph file is a supplemental data structure that accelerates
+commit graph walks. If a user downgrades or disables the 'core.commitGraph'
+config setting, then the existing ODB is sufficient. The file is stored
+as "commit-graph" either in the .git/objects/info directory or in the info
+directory of an alternate.
+
+The commit graph file stores the commit graph structure along with some
+extra metadata to speed up graph walks. By listing commit OIDs in
+lexicographic order, we can identify an integer position for each commit and
+refer to the parents of a commit using those integer positions. We use
+binary search to find initial commits and then use the integer positions
+for fast lookups during the walk.
+
+A consumer may load the following info for a commit from the graph:
+
+1. The commit OID.
+2. The list of parents, along with their integer position.
+3. The commit date.
+4. The root tree OID.
+5. The generation number (see definition below).
+
+Values 1-4 satisfy the requirements of parse_commit_gently().
+
+Define the "generation number" of a commit recursively as follows:
+
+* A commit with no parents (a root commit) has generation number one.
+
+* A commit with at least one parent has generation number one more than
+the largest generation number among its parents.
+
+Equivalently, the generation number of a commit A is one more than the
+length of a longest path from A to a root commit. The recursive definition
+is easier to use for computation and for observing the following property:
+
+If A and B are commits with generation numbers N and M, respectively,
+and N <= M, then A cannot reach B. That is, we know without searching
+that B is not an ancestor of A because it is further from a root commit
+than A.
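The following sketch is not part of this commit (which, per its own notes, does not yet compute generation numbers). It applies the recursive definition above to a toy in-memory graph and then uses the stated property as a cheap negative reachability test. All names (`toy_commit`, `toy_compute_generation`, `toy_cannot_reach`) are hypothetical, the two-parent limit is a simplification, and the recursion is only suitable for small examples since deep histories would need an iterative walk.

```c
#include <stdint.h>

#define TOY_MAX_PARENTS 2
#define TOY_NO_PARENT UINT32_MAX

/* Toy commit: parents are positions in the same array. */
struct toy_commit {
	uint32_t parents[TOY_MAX_PARENTS];
	uint32_t generation; /* 0 = not computed yet */
};

/* Memoized form of the recursive definition: 1 + max over parents. */
uint32_t toy_compute_generation(struct toy_commit *commits, uint32_t pos)
{
	uint32_t max_parent_gen = 0;

	if (commits[pos].generation)
		return commits[pos].generation;
	for (int i = 0; i < TOY_MAX_PARENTS; i++) {
		uint32_t parent = commits[pos].parents[i];
		uint32_t gen;

		if (parent == TOY_NO_PARENT)
			continue;
		gen = toy_compute_generation(commits, parent);
		if (gen > max_parent_gen)
			max_parent_gen = gen;
	}
	commits[pos].generation = max_parent_gen + 1;
	return commits[pos].generation;
}

/*
 * Returns 1 when generation numbers alone prove that distinct commits
 * a and b satisfy "a cannot reach b"; 0 means a real walk is needed.
 */
int toy_cannot_reach(const struct toy_commit *commits, uint32_t a, uint32_t b)
{
	return a != b &&
	       commits[a].generation != 0 && commits[b].generation != 0 &&
	       commits[a].generation <= commits[b].generation;
}
```

The check is one-sided: a result of 0 says nothing about reachability, only that the shortcut does not apply.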
+
+Conversely, when checking if A is an ancestor of B, then we only need
+to walk commits until all commits on the walk boundary have generation
+number at most N. If we walk commits using a priority queue seeded by
+generation numbers, then we always expand the boundary commit with highest
+generation number and can easily detect the stopping condition.
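The following sketch is not part of this commit. It illustrates the walk described above on a toy graph: starting from B, always expand the boundary commit with the highest generation number, and stop as soon as that maximum is at most the generation number of A. A linear scan stands in for the priority queue to keep the example short. All names are hypothetical, generation numbers are assumed to be already computed and nonzero, and parent positions are assumed to be valid indexes.

```c
#include <stddef.h>
#include <stdint.h>

#define TOY_MAX_COMMITS 64
#define TOY_MAX_PARENTS 2
#define TOY_NO_PARENT UINT32_MAX

struct toy_commit {
	uint32_t parents[TOY_MAX_PARENTS];
	uint32_t generation; /* assumed computed, never 0 */
};

/*
 * Returns 1 if a is b or an ancestor of b, 0 if not, and -1 on bad
 * input. The boundary is the set of queued commits.
 */
int toy_is_ancestor(const struct toy_commit *commits, size_t n,
		    uint32_t a, uint32_t b)
{
	unsigned char queued[TOY_MAX_COMMITS] = {0};

	if (n > TOY_MAX_COMMITS || a >= n || b >= n)
		return -1;
	if (a == b)
		return 1;
	queued[b] = 1;
	for (;;) {
		size_t best = n;

		/* Pick the boundary commit with the highest generation. */
		for (size_t i = 0; i < n; i++)
			if (queued[i] && (best == n ||
			    commits[i].generation > commits[best].generation))
				best = i;
		/*
		 * Stop when the boundary is empty or nothing on it has a
		 * generation above a's: none of those commits can reach a.
		 */
		if (best == n ||
		    commits[best].generation <= commits[a].generation)
			return 0;
		queued[best] = 0;
		for (int k = 0; k < TOY_MAX_PARENTS; k++) {
			uint32_t parent = commits[best].parents[k];

			if (parent == TOY_NO_PARENT)
				continue;
			if (parent == a)
				return 1;
			queued[parent] = 1;
		}
	}
}
```

Because parents always have strictly smaller generation numbers than their children, a commit that has been expanded is never queued again, so the loop terminates.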
+
+This property can be used to significantly reduce the time it takes to
+walk commits and determine topological relationships. Without generation
+numbers, the general heuristic is the following:
+
+If A and B are commits with commit time X and Y, respectively, and
+X < Y, then A _probably_ cannot reach B.
+
+This heuristic is currently used whenever the computation is allowed to
+violate topological relationships due to clock skew (such as "git log"
+with default order), but is not used when the topological order is
+required (such as merge base calculations, "git log --graph").
+
+In practice, we expect some commits to be created recently and not stored
+in the commit graph. We can treat these commits as having "infinite"
+generation number and walk until reaching commits with known generation
+number.
+
+Design Details
+--------------
+
+- The commit graph file is stored in a file named 'commit-graph' in the
+.git/objects/info directory. This could be stored in the info directory
+of an alternate.
+
+- The core.commitGraph config setting must be on to consume graph files.
+
+- The file format includes parameters for the object ID hash function,
+so a future change of hash algorithm does not require a change in format.
+
+Future Work
+-----------
+
+- The commit graph feature currently does not honor commit grafts. This can
+be remedied by duplicating or refactoring the current graft logic.
+
+- The 'commit-graph' subcommand does not have a "verify" mode that is
+necessary for integration with fsck.
+
+- The file format includes room for precomputed generation numbers. These
+are not currently computed, so all generation numbers will be marked as
+0 (or "uncomputed"). A later patch will include this calculation.
+
+- After computing and storing generation numbers, we must make graph
+walks aware of generation numbers to gain the performance benefits they
+enable. This will mostly be accomplished by swapping a commit-date-ordered
+priority queue with one ordered by generation number. The following
+operations are important candidates:
+
+- paint_down_to_common()
+- 'log --topo-order'
+
+- Currently, parse_commit_gently() requires filling in the root tree
+object for a commit. This passes through lookup_tree() and consequently
+lookup_object(). Also, it calls lookup_commit() when loading the parents.
+These method calls check the ODB for object existence, even if the
+consumer does not need the content. For example, we do not need the
+tree contents when computing merge bases. Now that commit parsing is
+removed from the computation time, these lookup operations are the
+slowest operations keeping graph walks from being fast. Consider
+loading these objects without verifying their existence in the ODB and
+only loading them fully when consumers need them. Consider a method
+such as "ensure_tree_loaded(commit)" that fully loads a tree before
+using commit->tree.
+
+- The current design uses the 'commit-graph' subcommand to generate the graph.
+When this feature stabilizes enough to recommend to most users, we should
+add automatic graph writes to common operations that create many commits.
+For example, one could compute a graph on 'clone', 'fetch', or 'repack'
+commands.
+
+- A server could provide a commit graph file as part of the network protocol
+to avoid extra calculations by clients. This feature is only of benefit if
+the user is willing to trust the file, because verifying the file is correct
+is as hard as computing it from scratch.
+
+Related Links
+-------------
+[0] https://bugs.chromium.org/p/git/issues/detail?id=8
+Chromium work item for: Serialized Commit Graph
+
+[1] https://public-inbox.org/git/[email protected]/
+An abandoned patch that introduced generation numbers.
+
+[2] https://public-inbox.org/git/[email protected]/
+Discussion about generation numbers on commits and how they interact
+with fsck.
+
+[3] https://public-inbox.org/git/[email protected]/
+More discussion about generation numbers and not storing them inside
+commit objects. A valuable quote:
+
+"I think we should be moving more in the direction of keeping
+repo-local caches for optimizations. Reachability bitmaps have been
+a big performance win. I think we should be doing the same with our
+properties of commits. Not just generation numbers, but making it
+cheap to access the graph structure without zlib-inflating whole
+commit objects (i.e., packv4 or something like the "metapacks" I
+proposed a few years ago)."
+
+[4] https://public-inbox.org/git/[email protected]/T/#u
+A patch to remove the ahead-behind calculation from 'status'.

Makefile

Lines changed: 2 additions & 0 deletions
@@ -816,6 +816,7 @@ LIB_OBJS += color.o
 LIB_OBJS += column.o
 LIB_OBJS += combine-diff.o
 LIB_OBJS += commit.o
+LIB_OBJS += commit-graph.o
 LIB_OBJS += compat/obstack.o
 LIB_OBJS += compat/terminal.o
 LIB_OBJS += config.o
@@ -995,6 +996,7 @@ BUILTIN_OBJS += builtin/clone.o
 BUILTIN_OBJS += builtin/column.o
 BUILTIN_OBJS += builtin/commit-tree.o
 BUILTIN_OBJS += builtin/commit.o
+BUILTIN_OBJS += builtin/commit-graph.o
 BUILTIN_OBJS += builtin/config.o
 BUILTIN_OBJS += builtin/count-objects.o
 BUILTIN_OBJS += builtin/credential.o

alloc.c

Lines changed: 1 addition & 0 deletions
@@ -93,6 +93,7 @@ void *alloc_commit_node(void)
 	struct commit *c = alloc_node(&commit_state, sizeof(struct commit));
 	c->object.type = OBJ_COMMIT;
 	c->index = alloc_commit_index();
+	c->graph_pos = COMMIT_NOT_FROM_GRAPH;
 	return c;
 }
