Skip to content

Commit 0906ac2

Browse files
derrickstoleegitster
authored andcommitted
blame: use changed-path Bloom filters
The changed-path Bloom filters help reduce the amount of tree parsing required during history queries. Before calculating a diff, we can ask the filter if a path changed between a commit and its first parent. If the filter says "no" then we can move on without parsing trees. If the filter says "maybe" then we parse trees to discover if the answer is actually "yes" or "no". When computing a blame, there is a section in find_origin() that computes a diff between a commit and one of its parents. When this is the first parent, we can check the Bloom filters before calling diff_tree_oid(). In order to make this work with the blame machinery, we need to initialize a struct bloom_key with the initial path. But also, we need to add more keys to a list if a rename is detected. We then check to see if _any_ of these keys answer "maybe" in the diff. During development, I purposefully left out this "add a new key when a rename is detected" to see if the test suite would catch my error. That is how I discovered the issues with GIT_TEST_COMMIT_GRAPH_CHANGED_PATHS from the previous change. With that change, we can feel some confidence in the coverage of this change. If a user requests copy detection using "git blame -C", then there are more places where the set of "important" files can expand. I do not know enough about how this happens in the blame machinery. Thus, the Bloom filter integration is explicitly disabled in this mode. A later change could expand the bloom_key data with an appropriate call (or calls) to add_bloom_key(). If we did not disable this mode, then the following tests would fail: t8003-blame-corner-cases.sh t8011-blame-split-file.sh Generally, this is a performance enhancement and should not change the behavior of 'git blame' in any way. If a repo has a commit-graph file with computed changed-path Bloom filters, then they should notice improved performance for their 'git blame' commands. Here are some example timings that I found by blaming some paths in the Linux kernel repository: git blame arch/x86/kernel/topology.c >/dev/null Before: 0.83s After: 0.24s git blame kernel/time/time.c >/dev/null Before: 0.72s After: 0.24s git blame tools/perf/ui/stdio/hist.c >/dev/null Before: 0.27s After: 0.11s I specifically looked for "deep" paths that were also edited many times. As a counterpoint, the MAINTAINERS file was edited many times but is located in the root tree. This means that the cost of computing a diff relative to the pathspec is very small. Here are the timings for that command: git blame MAINTAINERS >/dev/null Before: 20.1s After: 18.0s These timings are the best of five. The worst-case runs were on the order of 2.5 minutes for both cases. Note that the MAINTAINERS file has 18,740 lines across 17,000+ commits. This happens to be one of the cases where this change provides the least improvement. The lack of improvement for the MAINTAINERS file and the relatively modest improvement for the other examples can be easily explained. The blame machinery needs to compute line-level diffs to determine which lines were changed by each commit. That makes up a large proportion of the computation time, and this change does not attempt to improve on that section of the algorithm. The MAINTAINERS file is large and changed often, so it takes time to determine which lines were updated by which commit. In contrast, the code files are much smaller, and it takes longer to comute the line-by-line diff for a single patch on the Linux mailing lists. Outside of the "-C" integration, I believe there is little more to gain from the changed-path Bloom filters for 'git blame' after this patch. Signed-off-by: Derrick Stolee <[email protected]> Signed-off-by: Junio C Hamano <[email protected]>
1 parent b23ea97 commit 0906ac2

File tree

3 files changed

+146
-9
lines changed

3 files changed

+146
-9
lines changed

blame.c

Lines changed: 130 additions & 9 deletions
Original file line numberDiff line numberDiff line change
@@ -9,6 +9,8 @@
99
#include "blame.h"
1010
#include "alloc.h"
1111
#include "commit-slab.h"
12+
#include "bloom.h"
13+
#include "commit-graph.h"
1214

1315
define_commit_slab(blame_suspects, struct blame_origin *);
1416
static struct blame_suspects blame_suspects;
@@ -1246,13 +1248,75 @@ static int fill_blob_sha1_and_mode(struct repository *r,
12461248
return -1;
12471249
}
12481250

1251+
struct blame_bloom_data {
1252+
/*
1253+
* Changed-path Bloom filter keys. These can help prevent
1254+
* computing diffs against first parents, but we need to
1255+
* expand the list as code is moved or files are renamed.
1256+
*/
1257+
struct bloom_filter_settings *settings;
1258+
struct bloom_key **keys;
1259+
int nr;
1260+
int alloc;
1261+
};
1262+
1263+
static int bloom_count_queries = 0;
1264+
static int bloom_count_no = 0;
1265+
static int maybe_changed_path(struct repository *r,
1266+
struct commit *parent,
1267+
struct blame_origin *origin,
1268+
struct blame_bloom_data *bd)
1269+
{
1270+
int i;
1271+
struct bloom_filter *filter;
1272+
1273+
if (!bd)
1274+
return 1;
1275+
1276+
if (origin->commit->generation == GENERATION_NUMBER_INFINITY)
1277+
return 1;
1278+
1279+
filter = get_bloom_filter(r, origin->commit, 0);
1280+
1281+
if (!filter)
1282+
return 1;
1283+
1284+
bloom_count_queries++;
1285+
for (i = 0; i < bd->nr; i++) {
1286+
if (bloom_filter_contains(filter,
1287+
bd->keys[i],
1288+
bd->settings))
1289+
return 1;
1290+
}
1291+
1292+
bloom_count_no++;
1293+
return 0;
1294+
}
1295+
1296+
static void add_bloom_key(struct blame_bloom_data *bd,
1297+
const char *path)
1298+
{
1299+
if (!bd)
1300+
return;
1301+
1302+
if (bd->nr >= bd->alloc) {
1303+
bd->alloc *= 2;
1304+
REALLOC_ARRAY(bd->keys, bd->alloc);
1305+
}
1306+
1307+
bd->keys[bd->nr] = xmalloc(sizeof(struct bloom_key));
1308+
fill_bloom_key(path, strlen(path), bd->keys[bd->nr], bd->settings);
1309+
bd->nr++;
1310+
}
1311+
12491312
/*
12501313
* We have an origin -- check if the same path exists in the
12511314
* parent and return an origin structure to represent it.
12521315
*/
12531316
static struct blame_origin *find_origin(struct repository *r,
12541317
struct commit *parent,
1255-
struct blame_origin *origin)
1318+
struct blame_origin *origin,
1319+
struct blame_bloom_data *bd)
12561320
{
12571321
struct blame_origin *porigin;
12581322
struct diff_options diff_opts;
@@ -1286,10 +1350,19 @@ static struct blame_origin *find_origin(struct repository *r,
12861350

12871351
if (is_null_oid(&origin->commit->object.oid))
12881352
do_diff_cache(get_commit_tree_oid(parent), &diff_opts);
1289-
else
1290-
diff_tree_oid(get_commit_tree_oid(parent),
1291-
get_commit_tree_oid(origin->commit),
1292-
"", &diff_opts);
1353+
else {
1354+
int compute_diff = 1;
1355+
if (origin->commit->parents &&
1356+
!oidcmp(&parent->object.oid,
1357+
&origin->commit->parents->item->object.oid))
1358+
compute_diff = maybe_changed_path(r, parent,
1359+
origin, bd);
1360+
1361+
if (compute_diff)
1362+
diff_tree_oid(get_commit_tree_oid(parent),
1363+
get_commit_tree_oid(origin->commit),
1364+
"", &diff_opts);
1365+
}
12931366
diffcore_std(&diff_opts);
12941367

12951368
if (!diff_queued_diff.nr) {
@@ -1341,7 +1414,8 @@ static struct blame_origin *find_origin(struct repository *r,
13411414
*/
13421415
static struct blame_origin *find_rename(struct repository *r,
13431416
struct commit *parent,
1344-
struct blame_origin *origin)
1417+
struct blame_origin *origin,
1418+
struct blame_bloom_data *bd)
13451419
{
13461420
struct blame_origin *porigin = NULL;
13471421
struct diff_options diff_opts;
@@ -1366,6 +1440,7 @@ static struct blame_origin *find_rename(struct repository *r,
13661440
struct diff_filepair *p = diff_queued_diff.queue[i];
13671441
if ((p->status == 'R' || p->status == 'C') &&
13681442
!strcmp(p->two->path, origin->path)) {
1443+
add_bloom_key(bd, p->one->path);
13691444
porigin = get_origin(parent, p->one->path);
13701445
oidcpy(&porigin->blob_oid, &p->one->oid);
13711446
porigin->mode = p->one->mode;
@@ -2332,6 +2407,11 @@ static void distribute_blame(struct blame_scoreboard *sb, struct blame_entry *bl
23322407

23332408
#define MAXSG 16
23342409

2410+
typedef struct blame_origin *(*blame_find_alg)(struct repository *,
2411+
struct commit *,
2412+
struct blame_origin *,
2413+
struct blame_bloom_data *);
2414+
23352415
static void pass_blame(struct blame_scoreboard *sb, struct blame_origin *origin, int opt)
23362416
{
23372417
struct rev_info *revs = sb->revs;
@@ -2356,8 +2436,7 @@ static void pass_blame(struct blame_scoreboard *sb, struct blame_origin *origin,
23562436
* common cases, then we look for renames in the second pass.
23572437
*/
23582438
for (pass = 0; pass < 2 - sb->no_whole_file_rename; pass++) {
2359-
struct blame_origin *(*find)(struct repository *, struct commit *, struct blame_origin *);
2360-
find = pass ? find_rename : find_origin;
2439+
blame_find_alg find = pass ? find_rename : find_origin;
23612440

23622441
for (i = 0, sg = first_scapegoat(revs, commit, sb->reverse);
23632442
i < num_sg && sg;
@@ -2369,7 +2448,7 @@ static void pass_blame(struct blame_scoreboard *sb, struct blame_origin *origin,
23692448
continue;
23702449
if (parse_commit(p))
23712450
continue;
2372-
porigin = find(sb->repo, p, origin);
2451+
porigin = find(sb->repo, p, origin, sb->bloom_data);
23732452
if (!porigin)
23742453
continue;
23752454
if (oideq(&porigin->blob_oid, &origin->blob_oid)) {
@@ -2809,3 +2888,45 @@ struct blame_entry *blame_entry_prepend(struct blame_entry *head,
28092888
blame_origin_incref(o);
28102889
return new_head;
28112890
}
2891+
2892+
void setup_blame_bloom_data(struct blame_scoreboard *sb,
2893+
const char *path)
2894+
{
2895+
struct blame_bloom_data *bd;
2896+
2897+
if (!sb->repo->objects->commit_graph)
2898+
return;
2899+
2900+
if (!sb->repo->objects->commit_graph->bloom_filter_settings)
2901+
return;
2902+
2903+
bd = xmalloc(sizeof(struct blame_bloom_data));
2904+
2905+
bd->settings = sb->repo->objects->commit_graph->bloom_filter_settings;
2906+
2907+
bd->alloc = 4;
2908+
bd->nr = 0;
2909+
ALLOC_ARRAY(bd->keys, bd->alloc);
2910+
2911+
add_bloom_key(bd, path);
2912+
2913+
sb->bloom_data = bd;
2914+
}
2915+
2916+
void cleanup_scoreboard(struct blame_scoreboard *sb)
2917+
{
2918+
if (sb->bloom_data) {
2919+
int i;
2920+
for (i = 0; i < sb->bloom_data->nr; i++) {
2921+
free(sb->bloom_data->keys[i]->hashes);
2922+
free(sb->bloom_data->keys[i]);
2923+
}
2924+
free(sb->bloom_data->keys);
2925+
FREE_AND_NULL(sb->bloom_data);
2926+
2927+
trace2_data_intmax("blame", sb->repo,
2928+
"bloom/queries", bloom_count_queries);
2929+
trace2_data_intmax("blame", sb->repo,
2930+
"bloom/response-no", bloom_count_no);
2931+
}
2932+
}

blame.h

Lines changed: 6 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -100,6 +100,8 @@ struct blame_entry {
100100
int unblamable;
101101
};
102102

103+
struct blame_bloom_data;
104+
103105
/*
104106
* The current state of the blame assignment.
105107
*/
@@ -156,6 +158,7 @@ struct blame_scoreboard {
156158
void(*found_guilty_entry)(struct blame_entry *, void *);
157159

158160
void *found_guilty_entry_data;
161+
struct blame_bloom_data *bloom_data;
159162
};
160163

161164
/*
@@ -180,6 +183,9 @@ void init_scoreboard(struct blame_scoreboard *sb);
180183
void setup_scoreboard(struct blame_scoreboard *sb,
181184
const char *path,
182185
struct blame_origin **orig);
186+
void setup_blame_bloom_data(struct blame_scoreboard *sb,
187+
const char *path);
188+
void cleanup_scoreboard(struct blame_scoreboard *sb);
183189

184190
struct blame_entry *blame_entry_prepend(struct blame_entry *head,
185191
long start, long end,

builtin/blame.c

Lines changed: 10 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -1061,6 +1061,14 @@ int cmd_blame(int argc, const char **argv, const char *prefix)
10611061
string_list_clear(&ignore_revs_file_list, 0);
10621062
string_list_clear(&ignore_rev_list, 0);
10631063
setup_scoreboard(&sb, path, &o);
1064+
1065+
/*
1066+
* Changed-path Bloom filters are disabled when looking
1067+
* for copies.
1068+
*/
1069+
if (!(opt & PICKAXE_BLAME_COPY))
1070+
setup_blame_bloom_data(&sb, path);
1071+
10641072
lno = sb.num_lines;
10651073

10661074
if (lno && !range_list.nr)
@@ -1164,5 +1172,7 @@ int cmd_blame(int argc, const char **argv, const char *prefix)
11641172
printf("num get patch: %d\n", sb.num_get_patch);
11651173
printf("num commits: %d\n", sb.num_commits);
11661174
}
1175+
1176+
cleanup_scoreboard(&sb);
11671177
return 0;
11681178
}

0 commit comments

Comments
 (0)