Skip to content

Commit 5421e7c

Browse files
ttaylorrgitster
authored andcommitted
commit-graph: reuse existing Bloom filters where possible
In an earlier commit, a bug was described where it's possible for Git to produce non-murmur3 hashes when the platform's "char" type is signed, and there are paths with characters whose highest bit is set (i.e. all characters >= 0x80). That patch allows the caller to control which version of Bloom filters are read and written. However, even on platforms with a signed "char" type, it is possible to reuse existing Bloom filters if and only if there are no changed paths in any commit's first parent tree-diff whose characters have their highest bit set. When this is the case, we can reuse the existing filter without having to compute a new one. This is done by marking trees which are known to have (or not have) any such paths. When a commit's root tree is verified to not have any such paths, we mark it as such and declare that the commit's Bloom filter is reusable. Note that this heuristic only goes in one direction. If neither a commit nor its first parent have any paths in their trees with non-ASCII characters, then we know for certain that a path with non-ASCII characters will not appear in a tree-diff against that commit's first parent. The reverse isn't necessarily true: just because the tree-diff doesn't contain any such paths does not imply that no such paths exist in either tree. So we end up recomputing some Bloom filters that we don't strictly have to (i.e. their bits are the same no matter which version of murmur3 we use). But culling these out is impossible, since we'd have to perform the full tree-diff, which is the same effort as computing the Bloom filter from scratch. But because we can cache our results in each tree's flag bits, we can often avoid recomputing many filters, thereby reducing the time it takes to run $ git commit-graph write --changed-paths --reachable when upgrading from v1 to v2 Bloom filters. To benchmark this, let's generate a commit-graph in linux.git with v1 changed-paths in generation order[^1]: $ git clone [email protected]:torvalds/linux.git $ cd linux $ git commit-graph write --reachable --changed-paths $ graph=".git/objects/info/commit-graph" $ mv $graph{,.bak} Then let's time how long it takes to go from v1 to v2 filters (with and without the upgrade path enabled), resetting the state of the commit-graph each time: $ git config commitGraph.changedPathsVersion 2 $ hyperfine -p 'cp -f $graph.bak $graph' -L v 0,1 \ 'GIT_TEST_UPGRADE_BLOOM_FILTERS={v} git.compile commit-graph write --reachable --changed-paths' On linux.git (where there aren't any non-ASCII paths), the timings indicate that this patch represents a speed-up over recomputing all Bloom filters from scratch: Benchmark 1: GIT_TEST_UPGRADE_BLOOM_FILTERS=0 git.compile commit-graph write --reachable --changed-paths Time (mean ± σ): 124.873 s ± 0.316 s [User: 124.081 s, System: 0.643 s] Range (min … max): 124.621 s … 125.227 s 3 runs Benchmark 2: GIT_TEST_UPGRADE_BLOOM_FILTERS=1 git.compile commit-graph write --reachable --changed-paths Time (mean ± σ): 79.271 s ± 0.163 s [User: 74.611 s, System: 4.521 s] Range (min … max): 79.112 s … 79.437 s 3 runs Summary 'GIT_TEST_UPGRADE_BLOOM_FILTERS=1 git.compile commit-graph write --reachable --changed-paths' ran 1.58 ± 0.01 times faster than 'GIT_TEST_UPGRADE_BLOOM_FILTERS=0 git.compile commit-graph write --reachable --changed-paths' On git.git, we do have some non-ASCII paths, giving us a more modest improvement from 4.163 seconds to 3.348 seconds, for a 1.24x speed-up. On my machine, the stats for git.git are: - 8,285 Bloom filters computed from scratch - 10 Bloom filters generated as empty - 4 Bloom filters generated as truncated due to too many changed paths - 65,114 Bloom filters were reused when transitioning from v1 to v2. [^1]: Note that this is is important, since `--stdin-packs` or `--stdin-commits` orders commits in the commit-graph by their pack position (with `--stdin-packs`) or in the raw input (with `--stdin-commits`). Since we compute Bloom filters in the same order that commits appear in the graph, we must see a commit's (first) parent before we process the commit itself. This is only guaranteed to happen when sorting commits by their generation number. Signed-off-by: Taylor Blau <[email protected]> Signed-off-by: Junio C Hamano <[email protected]>
1 parent df3df2d commit 5421e7c

File tree

5 files changed

+132
-4
lines changed

5 files changed

+132
-4
lines changed

bloom.c

Lines changed: 87 additions & 3 deletions
Original file line numberDiff line numberDiff line change
@@ -6,6 +6,9 @@
66
#include "commit-graph.h"
77
#include "commit.h"
88
#include "commit-slab.h"
9+
#include "tree.h"
10+
#include "tree-walk.h"
11+
#include "config.h"
912

1013
define_commit_slab(bloom_filter_slab, struct bloom_filter);
1114

@@ -283,6 +286,73 @@ static void init_truncated_large_filter(struct bloom_filter *filter,
283286
filter->version = version;
284287
}
285288

289+
#define VISITED (1u<<21)
290+
#define HIGH_BITS (1u<<22)
291+
292+
static int has_entries_with_high_bit(struct repository *r, struct tree *t)
293+
{
294+
if (parse_tree(t))
295+
return 1;
296+
297+
if (!(t->object.flags & VISITED)) {
298+
struct tree_desc desc;
299+
struct name_entry entry;
300+
301+
init_tree_desc(&desc, &t->object.oid, t->buffer, t->size);
302+
while (tree_entry(&desc, &entry)) {
303+
size_t i;
304+
for (i = 0; i < entry.pathlen; i++) {
305+
if (entry.path[i] & 0x80) {
306+
t->object.flags |= HIGH_BITS;
307+
goto done;
308+
}
309+
}
310+
311+
if (S_ISDIR(entry.mode)) {
312+
struct tree *sub = lookup_tree(r, &entry.oid);
313+
if (sub && has_entries_with_high_bit(r, sub)) {
314+
t->object.flags |= HIGH_BITS;
315+
goto done;
316+
}
317+
}
318+
319+
}
320+
321+
done:
322+
t->object.flags |= VISITED;
323+
}
324+
325+
return !!(t->object.flags & HIGH_BITS);
326+
}
327+
328+
static int commit_tree_has_high_bit_paths(struct repository *r,
329+
struct commit *c)
330+
{
331+
struct tree *t;
332+
if (repo_parse_commit(r, c))
333+
return 1;
334+
t = repo_get_commit_tree(r, c);
335+
if (!t)
336+
return 1;
337+
return has_entries_with_high_bit(r, t);
338+
}
339+
340+
static struct bloom_filter *upgrade_filter(struct repository *r, struct commit *c,
341+
struct bloom_filter *filter,
342+
int hash_version)
343+
{
344+
struct commit_list *p = c->parents;
345+
if (commit_tree_has_high_bit_paths(r, c))
346+
return NULL;
347+
348+
if (p && commit_tree_has_high_bit_paths(r, p->item))
349+
return NULL;
350+
351+
filter->version = hash_version;
352+
353+
return filter;
354+
}
355+
286356
struct bloom_filter *get_bloom_filter(struct repository *r, struct commit *c)
287357
{
288358
struct bloom_filter *filter;
@@ -325,9 +395,23 @@ struct bloom_filter *get_or_compute_bloom_filter(struct repository *r,
325395
filter, graph_pos);
326396
}
327397

328-
if ((filter->data && filter->len) &&
329-
(!settings || settings->hash_version == filter->version))
330-
return filter;
398+
if (filter->data && filter->len) {
399+
struct bloom_filter *upgrade;
400+
if (!settings || settings->hash_version == filter->version)
401+
return filter;
402+
403+
/* version mismatch, see if we can upgrade */
404+
if (compute_if_not_present &&
405+
git_env_bool("GIT_TEST_UPGRADE_BLOOM_FILTERS", 1)) {
406+
upgrade = upgrade_filter(r, c, filter,
407+
settings->hash_version);
408+
if (upgrade) {
409+
if (computed)
410+
*computed |= BLOOM_UPGRADED;
411+
return upgrade;
412+
}
413+
}
414+
}
331415
if (!compute_if_not_present)
332416
return NULL;
333417

bloom.h

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -102,6 +102,7 @@ enum bloom_filter_computed {
102102
BLOOM_COMPUTED = (1 << 1),
103103
BLOOM_TRUNC_LARGE = (1 << 2),
104104
BLOOM_TRUNC_EMPTY = (1 << 3),
105+
BLOOM_UPGRADED = (1 << 4),
105106
};
106107

107108
struct bloom_filter *get_or_compute_bloom_filter(struct repository *r,

commit-graph.c

Lines changed: 5 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -1165,6 +1165,7 @@ struct write_commit_graph_context {
11651165
int count_bloom_filter_not_computed;
11661166
int count_bloom_filter_trunc_empty;
11671167
int count_bloom_filter_trunc_large;
1168+
int count_bloom_filter_upgraded;
11681169
};
11691170

11701171
static int write_graph_chunk_fanout(struct hashfile *f,
@@ -1772,6 +1773,8 @@ static void trace2_bloom_filter_write_statistics(struct write_commit_graph_conte
17721773
ctx->count_bloom_filter_trunc_empty);
17731774
trace2_data_intmax("commit-graph", ctx->r, "filter-trunc-large",
17741775
ctx->count_bloom_filter_trunc_large);
1776+
trace2_data_intmax("commit-graph", ctx->r, "filter-upgraded",
1777+
ctx->count_bloom_filter_upgraded);
17751778
}
17761779

17771780
static void compute_bloom_filters(struct write_commit_graph_context *ctx)
@@ -1813,6 +1816,8 @@ static void compute_bloom_filters(struct write_commit_graph_context *ctx)
18131816
ctx->count_bloom_filter_trunc_empty++;
18141817
if (computed & BLOOM_TRUNC_LARGE)
18151818
ctx->count_bloom_filter_trunc_large++;
1819+
} else if (computed & BLOOM_UPGRADED) {
1820+
ctx->count_bloom_filter_upgraded++;
18161821
} else if (computed & BLOOM_NOT_COMPUTED)
18171822
ctx->count_bloom_filter_not_computed++;
18181823
ctx->total_bloom_filter_data_size += filter

object.h

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -75,6 +75,7 @@ void object_array_init(struct object_array *array);
7575
* commit-reach.c: 16-----19
7676
* sha1-name.c: 20
7777
* list-objects-filter.c: 21
78+
* bloom.c: 2122
7879
* builtin/fsck.c: 0--3
7980
* builtin/gc.c: 0
8081
* builtin/index-pack.c: 2021

t/t4216-log-bloom.sh

Lines changed: 38 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -222,6 +222,10 @@ test_filter_trunc_large () {
222222
grep "\"key\":\"filter-trunc-large\",\"value\":\"$1\"" $2
223223
}
224224

225+
test_filter_upgraded () {
226+
grep "\"key\":\"filter-upgraded\",\"value\":\"$1\"" $2
227+
}
228+
225229
test_expect_success 'correctly report changes over limit' '
226230
git init limits &&
227231
(
@@ -667,7 +671,14 @@ test_expect_success 'when writing another commit graph, preserve existing versio
667671
test_expect_success 'when writing commit graph, do not reuse changed-path of another version' '
668672
git init doublewrite &&
669673
test_commit -C doublewrite c "$CENT" &&
674+
670675
git -C doublewrite config --add commitGraph.changedPathsVersion 1 &&
676+
>trace2.txt &&
677+
GIT_TRACE2_EVENT="$(pwd)/trace2.txt" \
678+
git -C doublewrite commit-graph write --reachable --changed-paths &&
679+
test_filter_computed 1 trace2.txt &&
680+
test_filter_upgraded 0 trace2.txt &&
681+
671682
git -C doublewrite commit-graph write --reachable --changed-paths &&
672683
for v in -2 3
673684
do
@@ -678,8 +689,14 @@ test_expect_success 'when writing commit graph, do not reuse changed-path of ano
678689
EOF
679690
test_cmp expect err || return 1
680691
done &&
692+
681693
git -C doublewrite config --add commitGraph.changedPathsVersion 2 &&
682-
git -C doublewrite commit-graph write --reachable --changed-paths &&
694+
>trace2.txt &&
695+
GIT_TRACE2_EVENT="$(pwd)/trace2.txt" \
696+
git -C doublewrite commit-graph write --reachable --changed-paths &&
697+
test_filter_computed 1 trace2.txt &&
698+
test_filter_upgraded 0 trace2.txt &&
699+
683700
(
684701
cd doublewrite &&
685702
echo "c01f" >expect &&
@@ -688,6 +705,26 @@ test_expect_success 'when writing commit graph, do not reuse changed-path of ano
688705
)
689706
'
690707

708+
test_expect_success 'when writing commit graph, reuse changed-path of another version where possible' '
709+
git init upgrade &&
710+
711+
test_commit -C upgrade base no-high-bits &&
712+
713+
git -C upgrade config --add commitGraph.changedPathsVersion 1 &&
714+
>trace2.txt &&
715+
GIT_TRACE2_EVENT="$(pwd)/trace2.txt" \
716+
git -C upgrade commit-graph write --reachable --changed-paths &&
717+
test_filter_computed 1 trace2.txt &&
718+
test_filter_upgraded 0 trace2.txt &&
719+
720+
git -C upgrade config --add commitGraph.changedPathsVersion 2 &&
721+
>trace2.txt &&
722+
GIT_TRACE2_EVENT="$(pwd)/trace2.txt" \
723+
git -C upgrade commit-graph write --reachable --changed-paths &&
724+
test_filter_computed 0 trace2.txt &&
725+
test_filter_upgraded 1 trace2.txt
726+
'
727+
691728
corrupt_graph () {
692729
test_when_finished "rm -rf $graph" &&
693730
git commit-graph write --reachable --changed-paths &&

0 commit comments

Comments
 (0)