| 1 | +Git Commit Graph Design Notes |
| 2 | +============================= |
| 3 | + |
| 4 | +Git walks the commit graph for many reasons, including: |
| 5 | + |
| 6 | +1. Listing and filtering commit history. |
| 7 | +2. Computing merge bases. |
| 8 | + |
| 9 | +These operations can become slow as the commit count grows. The merge |
| 10 | +base calculation shows up in many user-facing commands, such as 'merge-base' |
| 11 | +or 'status', and can take minutes to compute depending on history shape. |
| 12 | + |
| 13 | +There are two main costs here: |
| 14 | + |
| 15 | +1. Decompressing and parsing commits. |
| 16 | +2. Walking the entire graph to satisfy topological order constraints. |
| 17 | + |
| 18 | +The commit graph file is a supplemental data structure that accelerates |
| 19 | +commit graph walks. If a user downgrades or disables the 'core.commitGraph' |
| 20 | +config setting, then the existing ODB is sufficient. The file is stored |
| 21 | +as "commit-graph" either in the .git/objects/info directory or in the info |
| 22 | +directory of an alternate. |
| 23 | + |
| 24 | +The commit graph file stores the commit graph structure along with some |
| 25 | +extra metadata to speed up graph walks. By listing commit OIDs in |
| 26 | +lexicographic order, we can identify an integer position for each commit and |
| 27 | +refer to the parents of a commit using those integer positions. We use |
| 28 | +binary search to find initial commits and then use the integer positions |
| 29 | +for fast lookups during the walk. |
| 30 | + |
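The position scheme described above can be sketched as follows. This is a minimal illustration, not the on-disk format: the names build_graph_table, find_position and walk_ancestors are hypothetical, OIDs are plain strings, and every parent is assumed to be present in the input.

```python
from bisect import bisect_left


def build_graph_table(parents_by_oid):
    """Return (oids, parent_positions) for a dict mapping OID -> parent OIDs.

    Hypothetical helper: assumes every parent OID is also a key of the dict.
    """
    oids = sorted(parents_by_oid)  # lexicographic OID order
    position = {oid: i for i, oid in enumerate(oids)}
    parent_positions = [
        [position[p] for p in parents_by_oid[oid]] for oid in oids
    ]
    return oids, parent_positions


def find_position(oids, oid):
    """Binary-search the sorted OID list; return its position or None."""
    i = bisect_left(oids, oid)
    return i if i < len(oids) and oids[i] == oid else None


def walk_ancestors(oids, parent_positions, start_oid):
    """Yield every OID reachable from start_oid (including itself).

    Only the starting commit needs a binary search; the rest of the walk
    follows integer positions.
    """
    start = find_position(oids, start_oid)
    if start is None:
        return
    seen, stack = {start}, [start]
    while stack:
        pos = stack.pop()
        yield oids[pos]
        for parent in parent_positions[pos]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
```

Only the first lookup pays the logarithmic search cost; each later step is an index into a list.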
| 31 | +A consumer may load the following info for a commit from the graph: |
| 32 | + |
| 33 | +1. The commit OID. |
| 34 | +2. The list of parents, along with their integer position. |
| 35 | +3. The commit date. |
| 36 | +4. The root tree OID. |
| 37 | +5. The generation number (see definition below). |
| 38 | + |
| 39 | +Values 1-4 satisfy the requirements of parse_commit_gently(). |
| 40 | + |
| 41 | +Define the "generation number" of a commit recursively as follows: |
| 42 | + |
| 43 | + * A commit with no parents (a root commit) has generation number one. |
| 44 | + |
| 45 | + * A commit with at least one parent has generation number one more than |
| 46 | + the largest generation number among its parents. |
| 47 | + |
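The recursive definition above can be evaluated without recursion. The sketch below is one possible way to do so, assuming an acyclic history given as a dict from OID to parent OIDs; compute_generations is a hypothetical name, not an existing Git function.

```python
def compute_generations(parents_by_oid):
    """Map each commit to its generation number (root commits get 1).

    Uses an explicit stack so that long histories do not hit the
    interpreter's recursion limit. Assumes the history is acyclic and
    that every parent OID is also a key of parents_by_oid.
    """
    generation = {}
    for tip in parents_by_oid:
        stack = [tip]
        while stack:
            oid = stack[-1]
            if oid in generation:
                stack.pop()
                continue
            pending = [p for p in parents_by_oid[oid] if p not in generation]
            if pending:
                # Parents must be numbered before their child.
                stack.extend(pending)
            else:
                parent_gens = (generation[p] for p in parents_by_oid[oid])
                generation[oid] = 1 + max(parent_gens, default=0)
                stack.pop()
    return generation
```

For a merge commit, the larger of the two parent generations determines the result, which matches the "longest path to a root" reading.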
| 48 | +Equivalently, the generation number of a commit A is one more than the |
| 49 | +length of a longest path from A to a root commit. The recursive definition |
| 50 | +is easier to use for computation and for observing the following property: |
| 51 | + |
| 52 | + If A and B are distinct commits with generation numbers N and M, |
| 53 | + respectively, and N <= M, then A cannot reach B. That is, we know without |
| 54 | + searching that B is not an ancestor of A because it is at least as far |
| 55 | + from a root commit as A. |
| 56 | + |
| 57 | + Conversely, when checking if A is an ancestor of B, then we only need |
| 58 | + to walk commits until all commits on the walk boundary have generation |
| 59 | + number at most N. If we walk commits using a priority queue ordered by |
| 60 | + generation numbers, then we always expand the boundary commit with highest |
| 61 | + generation number and can easily detect the stopping condition. |
| 62 | + |
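The ancestor check described above might look like the following sketch. It assumes generation numbers are already known for every commit involved; is_ancestor, parents_by_oid and generation are illustrative names, not Git's internal API.

```python
import heapq


def is_ancestor(a, b, parents_by_oid, generation):
    """Return True if commit a is reachable from commit b (or a == b).

    The queue always yields the boundary commit with the highest
    generation number. A commit whose generation is at most that of a
    can only match by being a itself, so it is never expanded.
    """
    target = generation[a]
    heap = [(-generation[b], b)]  # negate to get a max-heap from heapq
    seen = {b}
    while heap:
        neg_gen, oid = heapq.heappop(heap)
        if oid == a:
            return True
        if -neg_gen <= target:
            # Cannot reach a; skip its parents entirely.
            continue
        for parent in parents_by_oid[oid]:
            if parent not in seen:
                seen.add(parent)
                heapq.heappush(heap, (-generation[parent], parent))
    return False
```

Once the highest generation on the boundary drops to N or below, the loop only drains the remaining queue entries and never visits older history.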
| 63 | +This property can be used to significantly reduce the time it takes to |
| 64 | +walk commits and determine topological relationships. Without generation |
| 65 | +numbers, the general heuristic is the following: |
| 66 | + |
| 67 | + If A and B are commits with commit time X and Y, respectively, and |
| 68 | + X < Y, then A _probably_ cannot reach B. |
| 69 | + |
| 70 | +This heuristic is currently used whenever the computation is allowed to |
| 71 | +violate topological relationships due to clock skew (such as "git log" |
| 72 | +with default order), but is not used when the topological order is |
| 73 | +required (such as merge base calculations, "git log --graph"). |
| 74 | + |
| 75 | +In practice, we expect some commits to be created recently and not stored |
| 76 | +in the commit graph. We can treat these commits as having "infinite" |
| 77 | +generation number and walk until reaching commits with known generation |
| 78 | +number. |
| 79 | + |
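One way to model the "infinite" generation number is a sentinel that compares greater than every stored value. The sketch below is an assumption-laden illustration: GENERATION_INFINITY, effective_generation and may_prune are hypothetical names, and the pruning rule is deliberately conservative, applying only when both commits have stored generation numbers.

```python
import math

# Hypothetical sentinel for commits missing from the commit graph file.
GENERATION_INFINITY = math.inf


def effective_generation(oid, stored_generation):
    """Return the stored generation number, or infinity if none is stored."""
    return stored_generation.get(oid, GENERATION_INFINITY)


def may_prune(oid, target, stored_generation):
    """Return True if a walk looking for target can skip oid's ancestors.

    Pruning is only applied when both generation numbers are known;
    a commit with infinite generation is always walked.
    """
    g = effective_generation(oid, stored_generation)
    t = effective_generation(target, stored_generation)
    if oid == target or not (math.isfinite(g) and math.isfinite(t)):
        return False
    return g <= t
```

With this rule, recently created commits are walked one by one until the walk reaches commits whose generation numbers are stored.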
| 80 | +Design Details |
| 81 | +-------------- |
| 82 | + |
| 83 | +- The commit graph file is stored in a file named 'commit-graph' in the |
| 84 | + .git/objects/info directory. It may instead be stored in the info directory |
| 85 | + of an alternate. |
| 86 | + |
| 87 | +- The core.commitGraph config setting must be on to consume graph files. |
| 88 | + |
| 89 | +- The file format includes parameters for the object ID hash function, |
| 90 | + so a future change of hash algorithm does not require a change in format. |
| 91 | + |
| 92 | +Future Work |
| 93 | +----------- |
| 94 | + |
| 95 | +- The commit graph feature currently does not honor commit grafts. This can |
| 96 | + be remedied by duplicating or refactoring the current graft logic. |
| 97 | + |
| 98 | +- The 'commit-graph' subcommand does not yet have a "verify" mode, which |
| 99 | + is necessary for integration with fsck. |
| 100 | + |
| 101 | +- The file format includes room for precomputed generation numbers. These |
| 102 | + are not currently computed, so all generation numbers will be marked as |
| 103 | + 0 (or "uncomputed"). A later patch will include this calculation. |
| 104 | + |
| 105 | +- After computing and storing generation numbers, we must make graph |
| 106 | + walks aware of generation numbers to gain the performance benefits they |
| 107 | + enable. This will mostly be accomplished by swapping a commit-date-ordered |
| 108 | + priority queue with one ordered by generation number. The following |
| 109 | + operations are important candidates: |
| 110 | + |
| 111 | + - paint_down_to_common() |
| 112 | + - 'log --topo-order' |
| 113 | + |
| 114 | +- Currently, parse_commit_gently() requires filling in the root tree |
| 115 | + object for a commit. This passes through lookup_tree() and consequently |
| 116 | + lookup_object(). Also, it calls lookup_commit() when loading the parents. |
| 117 | + These function calls check the ODB for object existence, even if the |
| 118 | + consumer does not need the content. For example, we do not need the |
| 119 | + tree contents when computing merge bases. Now that commit parsing is |
| 120 | + removed from the computation time, these lookup operations are the |
| 121 | + slowest operations keeping graph walks from being fast. Consider |
| 122 | + loading these objects without verifying their existence in the ODB and |
| 123 | + only loading them fully when consumers need them. Consider a method |
| 124 | + such as "ensure_tree_loaded(commit)" that fully loads a tree before |
| 125 | + using commit->tree. |
| 126 | + |
| 127 | +- The current design uses the 'commit-graph' subcommand to generate the graph. |
| 128 | + When this feature stabilizes enough to recommend to most users, we should |
| 129 | + add automatic graph writes to common operations that create many commits. |
| 130 | + For example, one could compute a graph on 'clone', 'fetch', or 'repack' |
| 131 | + commands. |
| 132 | + |
| 133 | +- A server could provide a commit graph file as part of the network protocol |
| 134 | + to avoid extra calculations by clients. This feature is only of benefit if |
| 135 | + the user is willing to trust the file, because verifying the file is correct |
| 136 | + is as hard as computing it from scratch. |
| 137 | + |
| 138 | +Related Links |
| 139 | +------------- |
| 140 | +[0] https://bugs.chromium.org/p/git/issues/detail?id=8 |
| 141 | + Chromium work item for: Serialized Commit Graph |
| 142 | + |
| 143 | +[1] https://public-inbox.org/git/ [email protected]/ |
| 144 | + An abandoned patch that introduced generation numbers. |
| 145 | + |
| 146 | +[2] https://public-inbox.org/git/ [email protected]/ |
| 147 | + Discussion about generation numbers on commits and how they interact |
| 148 | + with fsck. |
| 149 | + |
| 150 | +[3] https://public-inbox.org/git/ [email protected]/ |
| 151 | + More discussion about generation numbers and not storing them inside |
| 152 | + commit objects. A valuable quote: |
| 153 | + |
| 154 | + "I think we should be moving more in the direction of keeping |
| 155 | + repo-local caches for optimizations. Reachability bitmaps have been |
| 156 | + a big performance win. I think we should be doing the same with our |
| 157 | + properties of commits. Not just generation numbers, but making it |
| 158 | + cheap to access the graph structure without zlib-inflating whole |
| 159 | + commit objects (i.e., packv4 or something like the "metapacks" I |
| 160 | + proposed a few years ago)." |
| 161 | + |
| 162 | +[4] https://public-inbox.org/git/ [email protected]/T/#u |
| 163 | + A patch to remove the ahead-behind calculation from 'status'. |