Commit 7f98707

derrickstolee authored and gitster committed
test-tool: add helper for name-hash values
Add a new test-tool helper, name-hash, to output the value of the
name-hash algorithms for the input list of strings, one per line.

Since the name-hash values can be stored in the .bitmap files, it is
important that these hash functions do not change across Git versions.
Add a simple test to t5310-pack-bitmaps.sh to provide some testing of
the current values. Due to how these functions are implemented, it would
be difficult to change them without disturbing these values. The paths
used for this test are carefully selected to demonstrate some of the
behavior differences of the two current name hash versions, including
which conditions will cause them to collide.

Create a performance test that uses test_size to demonstrate how
collisions occur for these hash algorithms. This test helps inform
someone as to the behavior of the name-hash algorithms for their repo
based on the paths at HEAD.

My copy of the Git repository shows modest statistics around the
collisions of the default name-hash algorithm:

Test                                   this tree
------------------------------------------------
5314.1: paths at head                       4.5K
5314.2: distinct hash value: v1             4.1K
5314.3: maximum multiplicity: v1              13
5314.4: distinct hash value: v2             4.2K
5314.5: maximum multiplicity: v2               9

Here, the maximum collision multiplicity is 13, but around 10% of paths
have a collision with another path.

In a more interesting example, the microsoft/fluentui [1] repo had these
statistics at time of committing:

Test                                   this tree
------------------------------------------------
5314.1: paths at head                      19.5K
5314.2: distinct hash value: v1             8.2K
5314.3: maximum multiplicity: v1             279
5314.4: distinct hash value: v2            17.8K
5314.5: maximum multiplicity: v2              44

[1] https://github.com/microsoft/fluentui

That demonstrates that of the nearly twenty thousand path names, they
are assigned around eight thousand distinct values. 279 paths are
assigned to a single value, leading the packing algorithm to sort
objects from those paths together, by size.

With the v2 name hash function, the maximum multiplicity lowers to 44,
leaving some room for further improvement.

In a more extreme example, an internal monorepo had a much worse
collision rate:

Test                                   this tree
------------------------------------------------
5314.1: paths at head                     227.3K
5314.2: distinct hash value: v1            72.3K
5314.3: maximum multiplicity: v1           14.4K
5314.4: distinct hash value: v2           166.5K
5314.5: maximum multiplicity: v2             138

Here, we can see that the v2 name hash function provides some
improvements, but there are still a number of collisions that could lead
to repacking problems at this scale.

Signed-off-by: Derrick Stolee <[email protected]>
Signed-off-by: Junio C Hamano <[email protected]>
1 parent 30696be commit 7f98707

6 files changed, +87 -0 lines changed

Makefile

Lines changed: 1 addition & 0 deletions
@@ -816,6 +816,7 @@ TEST_BUILTINS_OBJS += test-lazy-init-name-hash.o
 TEST_BUILTINS_OBJS += test-match-trees.o
 TEST_BUILTINS_OBJS += test-mergesort.o
 TEST_BUILTINS_OBJS += test-mktemp.o
+TEST_BUILTINS_OBJS += test-name-hash.o
 TEST_BUILTINS_OBJS += test-online-cpus.o
 TEST_BUILTINS_OBJS += test-pack-mtimes.o
 TEST_BUILTINS_OBJS += test-parse-options.o

t/helper/test-name-hash.c

Lines changed: 23 additions & 0 deletions
@@ -0,0 +1,23 @@
+/*
+ * test-name-hash.c: Read a list of paths over stdin and report on their
+ * name-hash and full name-hash.
+ */
+
+#include "test-tool.h"
+#include "git-compat-util.h"
+#include "pack-objects.h"
+#include "strbuf.h"
+
+int cmd__name_hash(int argc UNUSED, const char **argv UNUSED)
+{
+	struct strbuf line = STRBUF_INIT;
+
+	while (!strbuf_getline(&line, stdin)) {
+		printf("%10u ", pack_name_hash(line.buf));
+		printf("%10u ", pack_name_hash_v2((unsigned const char *)line.buf));
+		printf("%s\n", line.buf);
+	}
+
+	strbuf_release(&line);
+	return 0;
+}
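
The helper only prints what pack_name_hash() returns; the function itself lives in pack-objects.h and is not part of this diff. As a rough illustration of what the first column contains, here is a minimal POSIX shell sketch, assuming the v1 hash is the shift-and-add loop over non-whitespace bytes that keeps only the influence of the last sixteen characters. The function name name_hash_v1 is hypothetical, and the sketch handles ASCII input only.

```shell
#!/bin/sh
# Sketch only: an ASCII-only re-implementation of the assumed v1 name hash.
# name_hash_v1 is a hypothetical name, not something Git provides.
name_hash_v1 () {
	rest=$1
	hash=0
	while [ -n "$rest" ]
	do
		tail=${rest#?}        # everything after the first character
		ch=${rest%"$tail"}    # the first character itself
		rest=$tail
		code=$(printf '%d' "'$ch")   # numeric value of the character
		case $code in
		9|10|13|32) continue ;;      # tab, LF, CR, space are skipped
		esac
		# Shift the old hash right by two bits, put the new byte on top,
		# and truncate to 32 bits.
		hash=$(( ((hash >> 2) + (code << 24)) & 0xFFFFFFFF ))
	done
	echo "$hash"
}

for name in first a/one-long-enough-for-collisions b/two-long-enough-for-collisions
do
	printf '%10s %s\n' "$(name_hash_v1 "$name")" "$name"
done
```

Under that assumption, "first" yields 2582249472, matching the first column of the expected output in t5310, and the two "long-enough-for-collisions" paths get the same value because they share their last sixteen characters.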

t/helper/test-tool.c

Lines changed: 1 addition & 0 deletions
@@ -44,6 +44,7 @@ static struct test_cmd cmds[] = {
 	{ "match-trees", cmd__match_trees },
 	{ "mergesort", cmd__mergesort },
 	{ "mktemp", cmd__mktemp },
+	{ "name-hash", cmd__name_hash },
 	{ "online-cpus", cmd__online_cpus },
 	{ "pack-mtimes", cmd__pack_mtimes },
 	{ "parse-options", cmd__parse_options },

t/helper/test-tool.h

Lines changed: 1 addition & 0 deletions
@@ -37,6 +37,7 @@ int cmd__lazy_init_name_hash(int argc, const char **argv);
 int cmd__match_trees(int argc, const char **argv);
 int cmd__mergesort(int argc, const char **argv);
 int cmd__mktemp(int argc, const char **argv);
+int cmd__name_hash(int argc, const char **argv);
 int cmd__online_cpus(int argc, const char **argv);
 int cmd__pack_mtimes(int argc, const char **argv);
 int cmd__parse_options(int argc, const char **argv);

t/perf/p5314-name-hash.sh

Lines changed: 31 additions & 0 deletions
@@ -0,0 +1,31 @@
+#!/bin/sh
+
+test_description='Tests pack performance using bitmaps'
+. ./perf-lib.sh
+
+GIT_TEST_PASSING_SANITIZE_LEAK=0
+export GIT_TEST_PASSING_SANITIZE_LEAK
+
+test_perf_large_repo
+
+test_size 'paths at head' '
+	git ls-tree -r --name-only HEAD >path-list &&
+	wc -l <path-list &&
+	test-tool name-hash <path-list >name-hashes
+'
+
+for version in 1 2
+do
+	test_size "distinct hash value: v$version" '
+		awk "{ print \$$version; }" <name-hashes | sort | \
+			uniq -c >name-hash-count &&
+		wc -l <name-hash-count
+	'
+
+	test_size "maximum multiplicity: v$version" '
+		sort -nr <name-hash-count | head -n 1 | \
+			awk "{ print \$1; }"
+	'
+done
+
+test_done
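
To see what the two test_size measurements report without a Git build or the perf harness, the same awk/sort/uniq pipeline can be run on a small sample. The sketch below uses the seven rows from the expected output in t5310 as input; the variable name_hashes and the three function names are hypothetical stand-ins for the files the perf script writes.

```shell
#!/bin/sh
# Sample rows (v1 hash, v2 hash, path) copied from the t5310 expected output.
name_hashes='2582249472 1763573760 first
2289942528 1188134912 second
2300837888 1130758144 third
2544516325 3963087891 a/one-long-enough-for-collisions
2544516325 4013419539 b/two-long-enough-for-collisions
1420111091 1709547268 many/parts/to/this/path/enough/to/collide/in/v2
1420111091 1709547268 enough/parts/to/this/path/enough/to/collide/in/v2'

# Count how many paths share each hash value in column $1 (1 = v1, 2 = v2).
hash_counts () {
	printf '%s\n' "$name_hashes" | awk -v col="$1" '{ print $col }' |
	sort | uniq -c
}

# Number of distinct hash values: one uniq -c line per value.
distinct_values () {
	hash_counts "$1" | wc -l | tr -d ' '
}

# Largest number of paths that share a single hash value.
max_multiplicity () {
	hash_counts "$1" | sort -nr | head -n 1 | awk '{ print $1 }'
}

for version in 1 2
do
	echo "v$version: $(distinct_values "$version") distinct," \
		"max multiplicity $(max_multiplicity "$version")"
done
```

For this sample, v1 has 5 distinct values and v2 has 6, with a maximum multiplicity of 2 in both cases.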

t/t5310-pack-bitmaps.sh

Lines changed: 30 additions & 0 deletions
@@ -27,6 +27,36 @@ has_any () {
 	grep -Ff "$1" "$2"
 }
 
+# Since name-hash values are stored in the .bitmap files, add a test
+# that checks that the name-hash calculations are stable across versions.
+# Not exhaustive, but these hashing algorithms would be hard to change
+# without causing deviations here.
+test_expect_success 'name-hash value stability' '
+	cat >names <<-\EOF &&
+	first
+	second
+	third
+	a/one-long-enough-for-collisions
+	b/two-long-enough-for-collisions
+	many/parts/to/this/path/enough/to/collide/in/v2
+	enough/parts/to/this/path/enough/to/collide/in/v2
+	EOF
+
+	test-tool name-hash <names >out &&
+
+	cat >expect <<-\EOF &&
+	2582249472 1763573760 first
+	2289942528 1188134912 second
+	2300837888 1130758144 third
+	2544516325 3963087891 a/one-long-enough-for-collisions
+	2544516325 4013419539 b/two-long-enough-for-collisions
+	1420111091 1709547268 many/parts/to/this/path/enough/to/collide/in/v2
+	1420111091 1709547268 enough/parts/to/this/path/enough/to/collide/in/v2
+	EOF
+
+	test_cmp expect out
+'
+
 test_bitmap_cases () {
 	writeLookupTable=false
 	for i in "$@"
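
The expected values show the a/ and b/ paths colliding only in the first column, while the two deep paths collide in both. One way to see why is the following POSIX shell sketch of the second column. It assumes that v2 reverses the bits of each byte, restarts the running hash at every "/", and folds earlier directory components in with a 6-bit right shift; that is one reading of pack_name_hash_v2() in pack-objects.h, which this diff does not show. The function name name_hash_v2 is hypothetical, and the sketch handles ASCII input only.

```shell
#!/bin/sh
# Sketch only: an ASCII-only re-implementation of the assumed v2 name hash.
# name_hash_v2 is a hypothetical name, not something Git provides.
name_hash_v2 () {
	rest=$1
	hash=0
	base=0
	while [ -n "$rest" ]
	do
		tail=${rest#?}        # everything after the first character
		ch=${rest%"$tail"}    # the first character itself
		rest=$tail
		code=$(printf '%d' "'$ch")
		case $code in
		9|10|13|32)           # tab, LF, CR, space are skipped
			continue ;;
		47)                   # "/": fold the finished component into base
			base=$(( (base >> 6) ^ hash ))
			hash=0
			continue ;;
		esac
		# Reverse the eight bits of the byte.
		code=$(( ((code & 0xF0) >> 4) | ((code & 0x0F) << 4) ))
		code=$(( ((code & 0xCC) >> 2) | ((code & 0x33) << 2) ))
		code=$(( ((code & 0xAA) >> 1) | ((code & 0x55) << 1) ))
		# Same shift-and-add step as v1, truncated to 32 bits.
		hash=$(( ((hash >> 2) + (code << 24)) & 0xFFFFFFFF ))
	done
	echo $(( (base >> 6) ^ hash ))
}

while read -r name
do
	printf '%10s %s\n' "$(name_hash_v2 "$name")" "$name"
done <<\EOF
first
a/one-long-enough-for-collisions
b/two-long-enough-for-collisions
many/parts/to/this/path/enough/to/collide/in/v2
enough/parts/to/this/path/enough/to/collide/in/v2
EOF
```

Under these assumptions, a leading directory is shifted right by six bits for every later "/", so after enough components its contribution falls off the 32-bit value. That would explain why the two nine-slash paths still collide while a/ and b/, one level deep, no longer do.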
