Skip to content

Commit 15afe74

Browse files
committed
Merge branch 'ds/name-hash-tweaks' into jch
"git pack-objects" and its wrapper "git repack" learned an option to use an alternative path-hash function to improve delta-base selection to produce a packfile with deeper history than window size. * ds/name-hash-tweaks: pack-objects: add third name hash version pack-objects: prevent name hash version change test-tool: add helper for name-hash values p5313: add size comparison test pack-objects: add GIT_TEST_NAME_HASH_VERSION repack: add --name-hash-version option pack-objects: add --name-hash-version option pack-objects: create new name-hash function version
2 parents ac5e4c1 + 7a9136a commit 15afe74

23 files changed

+485
-17
lines changed

Documentation/git-pack-objects.txt

Lines changed: 40 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -15,7 +15,8 @@ SYNOPSIS
1515
[--revs [--unpacked | --all]] [--keep-pack=<pack-name>]
1616
[--cruft] [--cruft-expiration=<time>]
1717
[--stdout [--filter=<filter-spec>] | <base-name>]
18-
[--shallow] [--keep-true-parents] [--[no-]sparse] < <object-list>
18+
[--shallow] [--keep-true-parents] [--[no-]sparse]
19+
[--name-hash-version=<n>] < <object-list>
1920

2021

2122
DESCRIPTION
@@ -345,6 +346,44 @@ raise an error.
345346
Restrict delta matches based on "islands". See DELTA ISLANDS
346347
below.
347348

349+
--name-hash-version=<n>::
350+
While performing delta compression, Git groups objects that may be
351+
similar based on heuristics using the path to that object. While
352+
grouping objects by an exact path match is good for paths with
353+
many versions, there are benefits for finding delta pairs across
354+
different full paths. Git collects objects by type and then by a
355+
"name hash" of the path and then by size, hoping to group objects
356+
that will compress well together.
357+
+
358+
The default name hash version is `1`, which prioritizes hash locality by
359+
considering the final bytes of the path as providing the maximum magnitude
360+
to the hash function. This version excels at distinguishing short paths
361+
and finding renames across directories. However, the hash function depends
362+
primarily on the final 16 bytes of the path. If there are many paths in
363+
the repo that have the same final 16 bytes and differ only by parent
364+
directory, then this name-hash may lead to too many collisions and cause
365+
poor results. At the moment, this version is required when writing
366+
reachability bitmap files with `--write-bitmap-index`.
367+
+
368+
The name hash version `2` has similar locality features as version `1`,
369+
except it considers each path component separately and overlays the hashes
370+
with a shift. This still prioritizes the final bytes of the path, but also
371+
"salts" the lower bits of the hash using the parent directory names. This
372+
method allows for some of the locality benefits of version `1` while
373+
breaking most of the collisions from a similarly-named file appearing in
374+
many different directories. At the moment, this version is not allowed
375+
when writing reachability bitmap files with `--write-bitmap-index` and it
376+
will be automatically changed to version `1`.
377+
+
378+
The name hash version `3` abandons the locality features of versions `1`
379+
and `2` in favor of minimizing collisions. The goal here is to separate
380+
objects by their full path and abandon hope for cross-path delta
381+
compression. For this reason, this option is preferred for repacking large
382+
repositories with many versions and many name hash collisions when using
383+
the first two versions. At the moment, this version is not allowed when
384+
writing reachability bitmap files with `--write-bitmap-index` and it will
385+
be automatically changed to version `1`.
386+
348387

349388
DELTA ISLANDS
350389
-------------

Documentation/git-repack.txt

Lines changed: 42 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -9,7 +9,9 @@ git-repack - Pack unpacked objects in a repository
99
SYNOPSIS
1010
--------
1111
[verse]
12-
'git repack' [-a] [-A] [-d] [-f] [-F] [-l] [-n] [-q] [-b] [-m] [--window=<n>] [--depth=<n>] [--threads=<n>] [--keep-pack=<pack-name>] [--write-midx]
12+
'git repack' [-a] [-A] [-d] [-f] [-F] [-l] [-n] [-q] [-b] [-m]
13+
[--window=<n>] [--depth=<n>] [--threads=<n>] [--keep-pack=<pack-name>]
14+
[--write-midx] [--name-hash-version=<n>]
1315

1416
DESCRIPTION
1517
-----------
@@ -249,6 +251,45 @@ linkgit:git-multi-pack-index[1]).
249251
Write a multi-pack index (see linkgit:git-multi-pack-index[1])
250252
containing the non-redundant packs.
251253

254+
--name-hash-version=<n>::
255+
While performing delta compression, Git groups objects that may be
256+
similar based on heuristics using the path to that object. While
257+
grouping objects by an exact path match is good for paths with
258+
many versions, there are benefits for finding delta pairs across
259+
different full paths. Git collects objects by type and then by a
260+
"name hash" of the path and then by size, hoping to group objects
261+
that will compress well together.
262+
+
263+
The default name hash version is `1`, which prioritizes hash locality by
264+
considering the final bytes of the path as providing the maximum magnitude
265+
to the hash function. This version excels at distinguishing short paths
266+
and finding renames across directories. However, the hash function depends
267+
primarily on the final 16 bytes of the path. If there are many paths in
268+
the repo that have the same final 16 bytes and differ only by parent
269+
directory, then this name-hash may lead to too many collisions and cause
270+
poor results. At the moment, this version is required when writing
271+
reachability bitmap files with `--write-bitmap-index`.
272+
+
273+
The name hash version `2` has similar locality features as version `1`,
274+
except it considers each path component separately and overlays the hashes
275+
with a shift. This still prioritizes the final bytes of the path, but also
276+
"salts" the lower bits of the hash using the parent directory names. This
277+
method allows for some of the locality benefits of version `1` while
278+
breaking most of the collisions from a similarly-named file appearing in
279+
many different directories. At the moment, this version is not allowed
280+
when writing reachability bitmap files with `--write-bitmap-index` and it
281+
will be automatically changed to version `1`.
282+
+
283+
The name hash version `3` abandons the locality features of versions `1`
284+
and `2` in favor of minimizing collisions. The goal here is to separate
285+
objects by their full path and abandon hope for cross-path delta
286+
compression. For this reason, this option is preferred for repacking large
287+
repositories with many versions and many name hash collisions when using
288+
the first two versions. At the moment, this version is not allowed when
289+
writing reachability bitmap files with `--write-bitmap-index` and it will
290+
be automatically changed to version `1`.
291+
292+
252293
CONFIGURATION
253294
-------------
254295

Makefile

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -819,6 +819,7 @@ TEST_BUILTINS_OBJS += test-lazy-init-name-hash.o
819819
TEST_BUILTINS_OBJS += test-match-trees.o
820820
TEST_BUILTINS_OBJS += test-mergesort.o
821821
TEST_BUILTINS_OBJS += test-mktemp.o
822+
TEST_BUILTINS_OBJS += test-name-hash.o
822823
TEST_BUILTINS_OBJS += test-online-cpus.o
823824
TEST_BUILTINS_OBJS += test-pack-mtimes.o
824825
TEST_BUILTINS_OBJS += test-parse-options.o

builtin/pack-objects.c

Lines changed: 58 additions & 5 deletions
Original file line numberDiff line numberDiff line change
@@ -269,6 +269,39 @@ struct configured_exclusion {
269269
static struct oidmap configured_exclusions;
270270

271271
static struct oidset excluded_by_config;
272+
static int name_hash_version = -1;
273+
274+
static void validate_name_hash_version(void)
275+
{
276+
if (name_hash_version < 1 || name_hash_version > 3)
277+
die(_("invalid --name-hash-version option: %d"), name_hash_version);
278+
}
279+
280+
static inline uint32_t pack_name_hash_fn(const char *name)
281+
{
282+
static int seen_version = -1;
283+
284+
if (seen_version < 0)
285+
seen_version = name_hash_version;
286+
else if (seen_version != name_hash_version)
287+
BUG("name hash version changed from %d to %d mid-process",
288+
seen_version, name_hash_version);
289+
290+
switch (name_hash_version)
291+
{
292+
case 1:
293+
return pack_name_hash(name);
294+
295+
case 2:
296+
return pack_name_hash_v2(name);
297+
298+
case 3:
299+
return pack_name_hash_v3(name);
300+
301+
default:
302+
BUG("invalid name-hash version: %d", name_hash_version);
303+
}
304+
}
272305

273306
/*
274307
* stats
@@ -1687,7 +1720,7 @@ static int add_object_entry(const struct object_id *oid, enum object_type type,
16871720
return 0;
16881721
}
16891722

1690-
create_object_entry(oid, type, pack_name_hash(name),
1723+
create_object_entry(oid, type, pack_name_hash_fn(name),
16911724
exclude, name && no_try_delta(name),
16921725
found_pack, found_offset);
16931726
return 1;
@@ -1901,7 +1934,7 @@ static void add_preferred_base_object(const char *name)
19011934
{
19021935
struct pbase_tree *it;
19031936
size_t cmplen;
1904-
unsigned hash = pack_name_hash(name);
1937+
unsigned hash = pack_name_hash_fn(name);
19051938

19061939
if (!num_preferred_base || check_pbase_path(hash))
19071940
return;
@@ -3411,7 +3444,7 @@ static void show_object_pack_hint(struct object *object, const char *name,
34113444
* here using a now in order to perhaps improve the delta selection
34123445
* process.
34133446
*/
3414-
oe->hash = pack_name_hash(name);
3447+
oe->hash = pack_name_hash_fn(name);
34153448
oe->no_try_delta = name && no_try_delta(name);
34163449

34173450
stdin_packs_hints_nr++;
@@ -3561,7 +3594,7 @@ static void add_cruft_object_entry(const struct object_id *oid, enum object_type
35613594
entry = packlist_find(&to_pack, oid);
35623595
if (entry) {
35633596
if (name) {
3564-
entry->hash = pack_name_hash(name);
3597+
entry->hash = pack_name_hash_fn(name);
35653598
entry->no_try_delta = no_try_delta(name);
35663599
}
35673600
} else {
@@ -3584,7 +3617,7 @@ static void add_cruft_object_entry(const struct object_id *oid, enum object_type
35843617
return;
35853618
}
35863619

3587-
entry = create_object_entry(oid, type, pack_name_hash(name),
3620+
entry = create_object_entry(oid, type, pack_name_hash_fn(name),
35883621
0, name && no_try_delta(name),
35893622
pack, offset);
35903623
}
@@ -4061,6 +4094,15 @@ static int get_object_list_from_bitmap(struct rev_info *revs)
40614094
if (!(bitmap_git = prepare_bitmap_walk(revs, 0)))
40624095
return -1;
40634096

4097+
/*
4098+
* For now, force the name-hash version to be 1 since that
4099+
* is the version implied by the bitmap format. Later, the
4100+
* format can include this version explicitly in its format,
4101+
* allowing readers to know the version that was used during
4102+
* the bitmap write.
4103+
*/
4104+
name_hash_version = 1;
4105+
40644106
if (pack_options_allow_reuse())
40654107
reuse_partial_packfile_from_bitmap(bitmap_git,
40664108
&reuse_packfiles,
@@ -4436,6 +4478,8 @@ int cmd_pack_objects(int argc,
44364478
OPT_STRING_LIST(0, "uri-protocol", &uri_protocols,
44374479
N_("protocol"),
44384480
N_("exclude any configured uploadpack.blobpackfileuri with this protocol")),
4481+
OPT_INTEGER(0, "name-hash-version", &name_hash_version,
4482+
N_("use the specified name-hash function to group similar objects")),
44394483
OPT_END(),
44404484
};
44414485

@@ -4591,6 +4635,15 @@ int cmd_pack_objects(int argc,
45914635
if (pack_to_stdout || !rev_list_all)
45924636
write_bitmap_index = 0;
45934637

4638+
if (name_hash_version < 0)
4639+
name_hash_version = (int)git_env_ulong("GIT_TEST_NAME_HASH_VERSION", 1);
4640+
4641+
validate_name_hash_version();
4642+
if (write_bitmap_index && name_hash_version != 1) {
4643+
warning(_("currently, --write-bitmap-index requires --name-hash-version=1"));
4644+
name_hash_version = 1;
4645+
}
4646+
45944647
if (use_delta_islands)
45954648
strvec_push(&rp, "--topo-order");
45964649

builtin/repack.c

Lines changed: 8 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -41,7 +41,9 @@ static int run_update_server_info = 1;
4141
static char *packdir, *packtmp_name, *packtmp;
4242

4343
static const char *const git_repack_usage[] = {
44-
N_("git repack [<options>]"),
44+
N_("git repack [-a] [-A] [-d] [-f] [-F] [-l] [-n] [-q] [-b] [-m]\n"
45+
"[--window=<n>] [--depth=<n>] [--threads=<n>] [--keep-pack=<pack-name>]\n"
46+
"[--write-midx] [--name-hash-version=<n>]"),
4547
NULL
4648
};
4749

@@ -60,6 +62,7 @@ struct pack_objects_args {
6062
int no_reuse_object;
6163
int quiet;
6264
int local;
65+
int name_hash_version;
6366
struct list_objects_filter_options filter_options;
6467
};
6568

@@ -308,6 +311,8 @@ static void prepare_pack_objects(struct child_process *cmd,
308311
strvec_pushf(&cmd->args, "--no-reuse-delta");
309312
if (args->no_reuse_object)
310313
strvec_pushf(&cmd->args, "--no-reuse-object");
314+
if (args->name_hash_version)
315+
strvec_pushf(&cmd->args, "--name-hash-version=%d", args->name_hash_version);
311316
if (args->local)
312317
strvec_push(&cmd->args, "--local");
313318
if (args->quiet)
@@ -1205,6 +1210,8 @@ int cmd_repack(int argc,
12051210
N_("pass --no-reuse-delta to git-pack-objects")),
12061211
OPT_BOOL('F', NULL, &po_args.no_reuse_object,
12071212
N_("pass --no-reuse-object to git-pack-objects")),
1213+
OPT_INTEGER(0, "name-hash-version", &po_args.name_hash_version,
1214+
N_("specify the name hash version to use for grouping similar objects by path")),
12081215
OPT_NEGBIT('n', NULL, &run_update_server_info,
12091216
N_("do not run git-update-server-info"), 1),
12101217
OPT__QUIET(&po_args.quiet, N_("be quiet")),

pack-objects.h

Lines changed: 54 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -208,6 +208,60 @@ static inline uint32_t pack_name_hash(const char *name)
208208
return hash;
209209
}
210210

211+
static inline uint32_t pack_name_hash_v2(const char *name)
212+
{
213+
uint32_t hash = 0, base = 0, c;
214+
215+
if (!name)
216+
return 0;
217+
218+
while ((c = *name++)) {
219+
if (isspace(c))
220+
continue;
221+
if (c == '/') {
222+
base = (base >> 6) ^ hash;
223+
hash = 0;
224+
} else {
225+
/*
226+
* 'c' is only a single byte. Reverse it and move
227+
* it to the top of the hash, moving the rest to
228+
* less-significant bits.
229+
*/
230+
c = (c & 0xF0) >> 4 | (c & 0x0F) << 4;
231+
c = (c & 0xCC) >> 2 | (c & 0x33) << 2;
232+
c = (c & 0xAA) >> 1 | (c & 0x55) << 1;
233+
hash = (hash >> 2) + (c << 24);
234+
}
235+
}
236+
return (base >> 6) ^ hash;
237+
}
238+
239+
static inline uint32_t pack_name_hash_v3(const char *name)
240+
{
241+
/*
242+
* This 'bigp' value is a large prime, at least 25% of the max
243+
* value of an uint32_t. Multiplying by this value (modulo 2^32)
244+
* causes the 32 bits to change pseudo-randomly.
245+
*/
246+
const uint32_t bigp = 1234572167U;
247+
uint32_t c, hash = bigp;
248+
249+
if (!name)
250+
return 0;
251+
252+
/*
253+
* Do the simplest thing that will resemble pseudo-randomness: add
254+
* random multiples of a large prime number with a binary shift.
255+
* The goal is not to be cryptographic, but to be generally
256+
* uniformly distributed.
257+
*/
258+
while ((c = *name++) != 0) {
259+
hash += c * bigp;
260+
hash = (hash >> 5) | (hash << 27);
261+
}
262+
return hash;
263+
}
264+
211265
static inline enum object_type oe_type(const struct object_entry *e)
212266
{
213267
return e->type_valid ? e->type_ : OBJ_BAD;

t/README

Lines changed: 4 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -471,6 +471,10 @@ a test and then fails then the whole test run will abort. This can help to make
471471
sure the expected tests are executed and not silently skipped when their
472472
dependency breaks or is simply not present in a new environment.
473473

474+
GIT_TEST_NAME_HASH_VERSION=<int>, when set, causes 'git pack-objects' to
475+
assume '--name-hash-version=<n>'.
476+
477+
474478
Naming Tests
475479
------------
476480

t/helper/meson.build

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -34,6 +34,7 @@ test_tool_sources = [
3434
'test-match-trees.c',
3535
'test-mergesort.c',
3636
'test-mktemp.c',
37+
'test-name-hash.c',
3738
'test-online-cpus.c',
3839
'test-pack-mtimes.c',
3940
'test-parse-options.c',

t/helper/test-name-hash.c

Lines changed: 24 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,24 @@
1+
/*
2+
* test-name-hash.c: Read a list of paths over stdin and report on their
3+
* name-hash and full name-hash.
4+
*/
5+
6+
#include "test-tool.h"
7+
#include "git-compat-util.h"
8+
#include "pack-objects.h"
9+
#include "strbuf.h"
10+
11+
int cmd__name_hash(int argc UNUSED, const char **argv UNUSED)
12+
{
13+
struct strbuf line = STRBUF_INIT;
14+
15+
while (!strbuf_getline(&line, stdin)) {
16+
printf("%10u ", pack_name_hash(line.buf));
17+
printf("%10u ", pack_name_hash_v2(line.buf));
18+
printf("%10u ", pack_name_hash_v3(line.buf));
19+
printf("%s\n", line.buf);
20+
}
21+
22+
strbuf_release(&line);
23+
return 0;
24+
}

t/helper/test-tool.c

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -44,6 +44,7 @@ static struct test_cmd cmds[] = {
4444
{ "match-trees", cmd__match_trees },
4545
{ "mergesort", cmd__mergesort },
4646
{ "mktemp", cmd__mktemp },
47+
{ "name-hash", cmd__name_hash },
4748
{ "online-cpus", cmd__online_cpus },
4849
{ "pack-mtimes", cmd__pack_mtimes },
4950
{ "parse-options", cmd__parse_options },

0 commit comments

Comments
 (0)