Commit 64fd7b3

pack-objects: add third name hash version
The '--name-hash-version=<n>' option in 'git pack-objects' was
introduced to allow for specifying an alternative name hash function
when organizing objects for delta compression. The pack_name_hash_v2()
function was designed to break some collisions while also preserving
some amount of locality for cross-path deltas.

However, in some repositories, that effort to preserve locality results
in enough collisions that it causes issues with full repacks.

Create a third name hash function and extend the '--name-hash-version'
option in 'git pack-objects' and 'git repack' to understand it. This
hash version abandons all efforts for locality and focuses on creating a
somewhat uniformly-distributed hash function to minimize collisions.

We can observe the effect of this collision avoidance in a large
internal monorepo that suffered from collisions in the previous
versions. The updates to p5314-name-hash.sh show these results:

    Test                               this tree
    --------------------------------------------------
    5314.1: paths at head              227.3K
    5314.2: distinct hash value: v1     72.3K
    5314.3: maximum multiplicity: v1    14.4K
    5314.4: distinct hash value: v2    166.5K
    5314.5: maximum multiplicity: v2      138
    5314.6: distinct hash value: v3    227.3K
    5314.7: maximum multiplicity: v3        2

These results demonstrate that of the 227,000+ paths, nearly all of them
find distinct hash values. The maximum multiplicity is 2, improved from
138 in the v2 hash function. The v2 hash function also had only 166K
distinct values, so it had a wide spread of collisions.

A more modest improvement is available in the open source fluentui repo
[1] with these results:

    Test                               this tree
    --------------------------------------------------
    5314.1: paths at head               19.5K
    5314.2: distinct hash value: v1      8.2K
    5314.3: maximum multiplicity: v1      279
    5314.4: distinct hash value: v2     17.8K
    5314.5: maximum multiplicity: v2       44
    5314.6: distinct hash value: v3     19.5K
    5314.7: maximum multiplicity: v3        1

[1] https://github.com/microsoft/fluentui

However, it is important to demonstrate the effectiveness of this
function in the context of compressing a repository. We can use
p5313-pack-objects.sh to measure these changes. I will use a simplified
table summarizing the output of that performance test.

    | Test      | V1 Time | V2 Time | V3 Time | V1 Size | V2 Size | V3 Size |
    |-----------|---------|---------|---------|---------|---------|---------|
    | Thin Pack |  0.37 s |  0.12 s |  0.07 s |   1.2 M |  22.0 K |  20.4 K |
    | Big Pack  |  2.04 s |  2.80 s |  1.40 s |  20.4 M |  25.9 M |  19.2 M |
    | Shallow   |  1.41 s |  1.77 s |  1.27 s |  34.4 M |  33.7 M |  34.8 M |
    | Repack    | 95.70 s | 33.68 s | 20.88 s | 439.3 M | 160.5 M | 169.1 M |

Here, there are some performance improvements on a time basis, and the
thin and big packs are somewhat smaller in v3. The shallow and repacked
packs are somewhat bigger, though, compared to v2.

Two repositories that have very few collisions in the v1 name hash are
the Git and Linux repositories.
Here are their stats for p5313:

Git:

    | Test      | V1 Time | V2 Time | V3 Time | V1 Size | V2 Size | V3 Size |
    |-----------|---------|---------|---------|---------|---------|---------|
    | Thin Pack |  0.02 s |  0.02 s |  0.02 s |   1.1 K |   1.1 K |  15.3 K |
    | Big Pack  |  1.69 s |  1.95 s |  1.67 s |  13.5 M |  14.5 M |  14.9 M |
    | Shallow   |  1.26 s |  1.29 s |  1.16 s |  12.0 M |  12.2 M |  12.5 M |
    | Repack    | 29.51 s | 29.01 s | 29.08 s | 237.7 M | 238.2 M | 237.7 M |

Linux:

    | Test      | V1 Time  | V2 Time  | V3 Time  | V1 Size | V2 Size | V3 Size |
    |-----------|----------|----------|----------|---------|---------|---------|
    | Thin Pack |   0.17 s |   0.07 s |   0.07 s |   4.6 K |   4.6 K |   6.8 K |
    | Big Pack  |  17.88 s |  12.35 s |  12.14 s | 201.1 M | 149.1 M | 160.4 M |
    | Shallow   |  11.05 s |  22.94 s |  22.16 s | 269.2 M | 273.8 M | 271.8 M |
    | Repack    | 727.39 s | 566.95 s | 539.33 s |   2.5 G |   2.5 G |   2.6 G |

These repositories make good use of the cross-path deltas that come
about from the v1 name hash function, so they already had mixed results
with the v2 function. The v3 function is generally worse for these
repositories.

An internal Javascript-based repository with name hash collisions
similar to the fluentui repo has these results:

    | Test      | V1 Time   | V2 Time  | V3 Time  | V1 Size | V2 Size | V3 Size |
    |-----------|-----------|----------|----------|---------|---------|---------|
    | Thin Pack |    8.28 s |   7.28 s |   0.04 s |  16.8 K |  16.8 K |   3.2 K |
    | Big Pack  |   12.81 s |  11.66 s |   2.52 s |  29.1 M |  29.1 M |  30.6 M |
    | Shallow   |    4.86 s |   4.06 s |   3.77 s |  42.5 M |  44.1 M |  45.7 M |
    | Repack    | 3126.50 s | 496.33 s | 306.86 s |   6.2 G | 855.6 M | 838.2 M |

This repository is also listed as "Repo B" in the repacking size table
below, along with other Javascript repos that have many name hash
collisions with the v1 name hash:

    | Repo     | V1 Size   | V2 Size | V3 Size |
    |----------|-----------|---------|---------|
    | fluentui |     440 M |   161 M |   170 M |
    | Repo B   |   6,248 M |   856 M |   840 M |
    | Repo C   |  37,278 M | 6,921 M | 6,755 M |
    | Repo D   | 131,204 M | 7,463 M | 7,124 M |

While the fluentui repo had an increase in size using the v3 name hash,
the others had modest improvements over the v2 name hash. But those
modest improvements are dwarfed by the difference from v1 to v2, so it
is unlikely that the regression seen in the other scenarios (packfiles
that are not from full repacks) will be worth using v3 over v2. That is,
unless there are enough collisions even with v2 that the full repack
scenario has larger improvements than these.

Signed-off-by: Derrick Stolee <[email protected]>
1 parent 3885ef8 commit 64fd7b3

9 files changed

+60
-12
lines changed

Documentation/git-pack-objects.txt

Lines changed: 9 additions & 0 deletions
@@ -374,6 +374,15 @@ breaking most of the collisions from a similarly-named file appearing in
 many different directories. At the moment, this version is not allowed
 when writing reachability bitmap files with `--write-bitmap-index` and it
 will be automatically changed to version `1`.
++
+The name hash version `3` abandons the locality features of versions `1`
+and `2` in favor of minimizing collisions. The goal here is to separate
+objects by their full path and abandon hope for cross-path delta
+compression. For this reason, this option is preferred for repacking large
+repositories with many versions and many name hash collisions when using
+the first two versions. At the moment, this version is not allowed when
+writing reachability bitmap files with `--write-bitmap-index` and it will
+be automatically changed to version `1`.
 
 
 DELTA ISLANDS

Documentation/git-repack.txt

Lines changed: 9 additions & 0 deletions
@@ -279,6 +279,15 @@ breaking most of the collisions from a similarly-named file appearing in
 many different directories. At the moment, this version is not allowed
 when writing reachability bitmap files with `--write-bitmap-index` and it
 will be automatically changed to version `1`.
++
+The name hash version `3` abandons the locality features of versions `1`
+and `2` in favor of minimizing collisions. The goal here is to separate
+objects by their full path and abandon hope for cross-path delta
+compression. For this reason, this option is preferred for repacking large
+repositories with many versions and many name hash collisions when using
+the first two versions. At the moment, this version is not allowed when
+writing reachability bitmap files with `--write-bitmap-index` and it will
+be automatically changed to version `1`.
 
 
 CONFIGURATION

builtin/pack-objects.c

Lines changed: 4 additions & 1 deletion
@@ -270,7 +270,7 @@ static int name_hash_version = -1;
 
 static void validate_name_hash_version(void)
 {
-	if (name_hash_version < 1 || name_hash_version > 2)
+	if (name_hash_version < 1 || name_hash_version > 3)
 		die(_("invalid --name-hash-version option: %d"), name_hash_version);
 }
 
@@ -292,6 +292,9 @@ static inline uint32_t pack_name_hash_fn(const char *name)
 	case 2:
 		return pack_name_hash_v2(name);
 
+	case 3:
+		return pack_name_hash_v3(name);
+
 	default:
 		BUG("invalid name-hash version: %d", name_hash_version);
 	}

pack-objects.h

Lines changed: 26 additions & 0 deletions
@@ -235,6 +235,32 @@ static inline uint32_t pack_name_hash_v2(const char *name)
 	return (base >> 6) ^ hash;
 }
 
+static inline uint32_t pack_name_hash_v3(const char *name)
+{
+	/*
+	 * This 'bigp' value is a large prime, at least 25% of the max
+	 * value of an uint32_t. Multiplying by this value (modulo 2^32)
+	 * causes the 32 bits to change pseudo-randomly.
+	 */
+	const uint32_t bigp = 1234572167U;
+	uint32_t c, hash = bigp;
+
+	if (!name)
+		return 0;
+
+	/*
+	 * Do the simplest thing that will resemble pseudo-randomness: add
+	 * random multiples of a large prime number with a binary shift.
+	 * The goal is not to be cryptographic, but to be generally
+	 * uniformly distributed.
+	 */
+	while ((c = *name++) != 0) {
+		hash += c * bigp;
+		hash = (hash >> 5) | (hash << 27);
+	}
+	return hash;
+}
+
 static inline enum object_type oe_type(const struct object_entry *e)
 {
 	return e->type_valid ? e->type_ : OBJ_BAD;

t/helper/test-name-hash.c

Lines changed: 1 addition & 0 deletions
@@ -15,6 +15,7 @@ int cmd__name_hash(int argc UNUSED, const char **argv UNUSED)
 	while (!strbuf_getline(&line, stdin)) {
 		printf("%10u ", pack_name_hash(line.buf));
 		printf("%10u ", pack_name_hash_v2(line.buf));
+		printf("%10u ", pack_name_hash_v3(line.buf));
 		printf("%s\n", line.buf);
 	}
 

t/perf/p5313-pack-objects.sh

Lines changed: 1 addition & 1 deletion
@@ -25,7 +25,7 @@ test_expect_success 'create rev input' '
 	EOF
 '
 
-for version in 1 2
+for version in 1 2 3
 do
 	export version
 

t/perf/p5314-name-hash.sh

Lines changed: 1 addition & 1 deletion
@@ -14,7 +14,7 @@ test_size 'paths at head' '
 	test-tool name-hash <path-list >name-hashes
 '
 
-for version in 1 2
+for version in 1 2 3
 do
 	test_size "distinct hash value: v$version" '
 		awk "{ print \$$version; }" <name-hashes | sort | \

t/t5300-pack-object.sh

Lines changed: 2 additions & 2 deletions
@@ -680,13 +680,13 @@ test_expect_success 'valid and invalid --name-hash-versions' '
 	# Valid values are hard to verify other than "do not fail".
 	# Performance tests will be more valuable to validate these versions.
 	# Negative values are converted to version 1.
-	for value in -1 1 2
+	for value in -1 1 2 3
 	do
 		git pack-objects base --all --name-hash-version=$value || return 1
 	done &&
 
 	# Invalid values have clear post-conditions.
-	for value in 0 3
+	for value in 0 4
 	do
 		test_must_fail git pack-objects base --all --name-hash-version=$value 2>err &&
 		test_grep "invalid --name-hash-version option" err || return 1

t/t5310-pack-bitmaps.sh

Lines changed: 7 additions & 7 deletions
@@ -45,13 +45,13 @@ test_expect_success 'name-hash value stability' '
 	test-tool name-hash <names >out &&
 
 	cat >expect <<-\EOF &&
-	2582249472 1763573760 first
-	2289942528 1188134912 second
-	2300837888 1130758144 third
-	2544516325 3963087891 a/one-long-enough-for-collisions
-	2544516325 4013419539 b/two-long-enough-for-collisions
-	1420111091 1709547268 many/parts/to/this/path/enough/to/collide/in/v2
-	1420111091 1709547268 enough/parts/to/this/path/enough/to/collide/in/v2
+	2582249472 1763573760 3109209818 first
+	2289942528 1188134912 3781118409 second
+	2300837888 1130758144 3028707182 third
+	2544516325 3963087891 3586976147 a/one-long-enough-for-collisions
+	2544516325 4013419539 1701624798 b/two-long-enough-for-collisions
+	1420111091 1709547268 2676129939 many/parts/to/this/path/enough/to/collide/in/v2
+	1420111091 1709547268 2740459187 enough/parts/to/this/path/enough/to/collide/in/v2
 	EOF
 
 	test_cmp expect out
