Commit 5ddc828

Merge branch 'ds/path-walk' into seen
A new algorithm for object graph traversal to favor visiting the
objects at the same tree path in succession (as opposed to visiting
objects that are different between trees as we walk commit
histories) is introduced to optimize object packing.

* ds/path-walk:
  pack-objects: thread the path-based compression
  pack-objects: refactor path-walk delta phase
  scalar: enable path-walk during push via config
  pack-objects: enable --path-walk via config
  repack: update usage to match docs
  repack: add --path-walk option
  pack-objects: introduce GIT_TEST_PACK_PATH_WALK
  p5313: add performance tests for --path-walk
  pack-objects: update usage to match docs
  pack-objects: add --path-walk option
  pack-objects: extract should_attempt_deltas()
  path-walk: add prune_all_uninteresting option
  revision: create mark_trees_uninteresting_dense()
  path-walk: allow visiting tags
  path-walk: allow consumer to specify object types
  t6601: add helper for testing path-walk API
  path-walk: introduce an object walk by path
2 parents caca564 + f6d0289 commit 5ddc828

31 files changed, 1563 insertions(+), 50 deletions(-)

Documentation/config/feature.txt

Lines changed: 4 additions & 0 deletions
@@ -20,6 +20,10 @@ walking fewer objects.
 +
 * `pack.allowPackReuse=multi` may improve the time it takes to create a pack by
 reusing objects from multiple packs instead of just one.
++
++* `pack.usePathWalk` may speed up packfile creation and make the packfiles be
++significantly smaller in the presence of certain filename collisions with Git's
++default name-hash.

 feature.manyFiles::
 Enable config options that optimize for repos with many files in the

Documentation/config/pack.txt

Lines changed: 8 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -155,6 +155,14 @@ pack.useSparse::
 commits contain certain types of direct renames. Default is
 `true`.

+pack.usePathWalk::
+When true, git will default to using the '--path-walk' option in
+'git pack-objects' when the '--revs' option is present. This
+algorithm groups objects by path to maximize the ability to
+compute delta chains across historical versions of the same
+object. This may disable other options, such as using bitmaps to
+enumerate objects.
+
 pack.preferBitmapTips::
 When selecting which commits will receive bitmaps, prefer a
 commit at the tip of any reference that is a suffix of any value

Documentation/git-pack-objects.txt

Lines changed: 17 additions & 6 deletions
Original file line numberDiff line numberDiff line change
@@ -10,12 +10,13 @@ SYNOPSIS
 --------
 [verse]
 'git pack-objects' [-q | --progress | --all-progress] [--all-progress-implied]
-[--no-reuse-delta] [--delta-base-offset] [--non-empty]
-[--local] [--incremental] [--window=<n>] [--depth=<n>]
-[--revs [--unpacked | --all]] [--keep-pack=<pack-name>]
-[--cruft] [--cruft-expiration=<time>]
-[--stdout [--filter=<filter-spec>] | <base-name>]
-[--shallow] [--keep-true-parents] [--[no-]sparse] < <object-list>
+[--no-reuse-delta] [--delta-base-offset] [--non-empty]
+[--local] [--incremental] [--window=<n>] [--depth=<n>]
+[--revs [--unpacked | --all]] [--keep-pack=<pack-name>]
+[--cruft] [--cruft-expiration=<time>]
+[--stdout [--filter=<filter-spec>] | <base-name>]
+[--shallow] [--keep-true-parents] [--[no-]sparse]
+[--path-walk] < <object-list>


 DESCRIPTION
@@ -345,6 +346,16 @@ raise an error.
 Restrict delta matches based on "islands". See DELTA ISLANDS
 below.

+--path-walk::
+By default, `git pack-objects` walks objects in an order that
+presents trees and blobs in an order unrelated to the path they
+appear relative to a commit's root tree. The `--path-walk` option
+enables a different walking algorithm that organizes trees and
+blobs by path. This has the potential to improve delta compression
+especially in the presence of filenames that cause collisions in
+Git's default name-hash algorithm. Due to changing how the objects
+are walked, this option is not compatible with `--delta-islands`,
+`--shallow`, or `--filter`.

 DELTA ISLANDS
 -------------
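
The "filenames that cause collisions in Git's default name-hash" mentioned in the `--path-walk` text can be illustrated with a small model. The sketch below is a Python transcription of a shift-and-add hash, written on the assumption that it matches the default name-hash in Git's pack-objects code; the function body and the sample paths are illustrative and should be checked against the Git source before being relied on. It is restricted to ASCII paths.

```python
def pack_name_hash(name: str) -> int:
    """Shift-and-add hash over the bytes of a path (ASCII only).

    Assumption: this mirrors Git's default name-hash. Each new byte lands
    in the top 8 bits and older bytes are shifted right by 2, so bytes far
    from the end of the path eventually fall off the low end.
    """
    h = 0
    for c in name.encode("ascii"):
        if c in b" \t\n\v\f\r":   # whitespace is skipped
            continue
        h = ((h >> 2) + (c << 24)) & 0xFFFFFFFF
    return h


# Two different directories, same long trailing file name.
a = pack_name_hash("v1/package-lock.json")
b = pack_name_hash("v2/package-lock.json")
# Same directory, different file name.
c = pack_name_hash("v1/README.md")

print(a == b)  # True: the differing byte has been shifted out entirely
print(a == c)  # False: the trailing bytes dominate the hash
```

When unrelated files share a hash like this, sorting delta candidates by hash alone places them next to each other. Grouping by full path, as `--path-walk` does, keeps versions of one file together regardless of how the hash treats its name.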

Documentation/git-repack.txt

Lines changed: 16 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -9,7 +9,9 @@ git-repack - Pack unpacked objects in a repository
 SYNOPSIS
 --------
 [verse]
-'git repack' [-a] [-A] [-d] [-f] [-F] [-l] [-n] [-q] [-b] [-m] [--window=<n>] [--depth=<n>] [--threads=<n>] [--keep-pack=<pack-name>] [--write-midx]
+'git repack' [-a] [-A] [-d] [-f] [-F] [-l] [-n] [-q] [-b] [-m]
+[--window=<n>] [--depth=<n>] [--threads=<n>] [--keep-pack=<pack-name>]
+[--write-midx] [--path-walk]

 DESCRIPTION
 -----------
@@ -249,6 +251,19 @@ linkgit:git-multi-pack-index[1]).
 Write a multi-pack index (see linkgit:git-multi-pack-index[1])
 containing the non-redundant packs.

+--path-walk::
+This option passes the `--path-walk` option to the underlying
+`git pack-objects` process (see linkgit:git-pack-objects[1]).
+By default, `git pack-objects` walks objects in an order that
+presents trees and blobs in an order unrelated to the path they
+appear relative to a commit's root tree. The `--path-walk` option
+enables a different walking algorithm that organizes trees and
+blobs by path. This has the potential to improve delta compression
+especially in the presence of filenames that cause collisions in
+Git's default name-hash algorithm. Due to changing how the objects
+are walked, this option is not compatible with `--delta-islands`
+or `--filter`.
+
 CONFIGURATION
 -------------

Lines changed: 73 additions & 0 deletions
@@ -0,0 +1,73 @@
+Path-Walk API
+=============
+
+The path-walk API is used to walk reachable objects, but to visit objects
+in batches based on a common path they appear in, or by type.
+
+For example, all reachable commits are visited in a group. All tags are
+visited in a group. Then, all root trees are visited. At some point, all
+blobs reachable via a path `my/dir/to/A` are visited. When there are
+multiple paths possible to reach the same object, then only one of those
+paths is used to visit the object.
+
+When walking a range of commits with some `UNINTERESTING` objects, the
+objects with the `UNINTERESTING` flag are included in these batches. In
+order to walk `UNINTERESTING` objects, the `--boundary` option must be
+used in the commit walk in order to visit `UNINTERESTING` commits.
+
+Basics
+------
+
+To use the path-walk API, include `path-walk.h` and call
+`walk_objects_by_path()` with a customized `path_walk_info` struct. The
+struct is used to set all of the options for how the walk should proceed.
+Let's dig into the different options and their use.
+
+`path_fn` and `path_fn_data`::
+The most important option is the `path_fn` option, which is a
+function pointer to the callback that can execute logic on the
+object IDs for objects grouped by type and path. This function
+also receives a `data` value that corresponds to the
+`path_fn_data` member, for providing custom data structures to
+this callback function.
+
+`revs`::
+To configure the exact details of the reachable set of objects,
+use the `revs` member and initialize it using the revision
+machinery in `revision.h`. Initialize `revs` using calls such as
+`setup_revisions()` or `parse_revision_opt()`. Do not call
+`prepare_revision_walk()`, as that will be called within
+`walk_objects_by_path()`.
++
+It is also important that you do not specify the `--objects` flag for the
+`revs` struct. The revision walk should only be used to walk commits, and
+the objects will be walked in a separate way based on those starting
+commits.
++
+If you want the path-walk API to emit `UNINTERESTING` objects based on the
+commit walk's boundary, be sure to set `revs.boundary` so the boundary
+commits are emitted.
+
+`commits`, `blobs`, `trees`, `tags`::
+By default, these members are enabled and signal that the path-walk
+API should call the `path_fn` on objects of these types. Specialized
+applications could disable some options to make it simpler to walk
+the objects or to have fewer calls to `path_fn`.
++
+While it is possible to walk only commits in this way, consumers would be
+better off using the revision walk API instead.
+
+`prune_all_uninteresting`::
+By default, all reachable paths are emitted by the path-walk API.
+This option allows consumers to declare that they are not
+interested in paths where all included objects are marked with the
+`UNINTERESTING` flag. This requires using the `boundary` option in
+the revision walk so that the walk emits commits marked with the
+`UNINTERESTING` flag.
+
+Examples
+--------
+
+See example usages in:
+`t/helper/test-path-walk.c`,
+`builtin/pack-objects.c`
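
The batching that the Path-Walk API document describes can be modelled outside Git. The sketch below is a toy Python model, not Git's C implementation: `walk_by_path`, the `(type, path, oid)` records, and the sample object IDs are all hypothetical. The emission order (commits, tags, root trees, then remaining paths in sorted order) and the "first path seen wins" rule are assumptions made for the illustration. The document only promises that objects arrive grouped by type and path and that an object reachable through several paths is visited through one of them.

```python
def walk_by_path(objects, path_fn, data=None):
    """Toy model of a path-batched walk (hypothetical helper, not Git's API).

    `objects` is an iterable of (type, path, oid) records in whatever order
    a history walk discovered them. `path_fn(path, oids, type, data)` is
    called once per (type, path) batch.
    """
    seen = set()
    batches = {}
    for obj_type, path, oid in objects:
        if oid in seen:
            continue  # each object is visited once, under the first path seen
        seen.add(oid)
        batches.setdefault((obj_type, path), []).append(oid)

    def rank(key):
        obj_type, path = key
        if obj_type == "commit":
            return (0, path, obj_type)
        if obj_type == "tag":
            return (1, path, obj_type)
        if obj_type == "tree" and path == "":
            return (2, path, obj_type)  # root trees
        return (3, path, obj_type)      # everything else, ordered by path

    for obj_type, path in sorted(batches, key=rank):
        path_fn(path, batches[(obj_type, path)], obj_type, data)


# Discovery order as a commit-by-commit walk might produce it.
history = [
    ("commit", "", "c2"), ("tree", "", "root2"),
    ("blob", "my/dir/to/A", "a2"), ("blob", "my/dir/to/B", "b1"),
    ("commit", "", "c1"), ("tree", "", "root1"),
    ("blob", "my/dir/to/A", "a1"),
    ("blob", "copy/of/A", "a1"),   # same object reached via a second path
    ("tag", "", "t1"),
]

calls = []

def record(path, oids, obj_type, data):
    data.append((obj_type, path, oids))

walk_by_path(history, record, calls)
for call in calls:
    print(call)
```

In this model both versions of `my/dir/to/A` reach the callback in a single batch, which is the property a delta-compression consumer would rely on, and `copy/of/A` produces no batch because its only object was already visited.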

Makefile

Lines changed: 2 additions & 0 deletions
@@ -822,6 +822,7 @@ TEST_BUILTINS_OBJS += test-parse-options.o
 TEST_BUILTINS_OBJS += test-parse-pathspec-file.o
 TEST_BUILTINS_OBJS += test-partial-clone.o
 TEST_BUILTINS_OBJS += test-path-utils.o
+TEST_BUILTINS_OBJS += test-path-walk.o
 TEST_BUILTINS_OBJS += test-pcre2-config.o
 TEST_BUILTINS_OBJS += test-pkt-line.o
 TEST_BUILTINS_OBJS += test-proc-receive.o
@@ -1098,6 +1099,7 @@ LIB_OBJS += parse-options.o
 LIB_OBJS += patch-delta.o
 LIB_OBJS += patch-ids.o
 LIB_OBJS += path.o
+LIB_OBJS += path-walk.o
 LIB_OBJS += pathspec.o
 LIB_OBJS += pkt-line.o
 LIB_OBJS += preload-index.o
