Skip to content

Commit 37a2514

Browse files
newrengitster
authored andcommitted
diffcore-rename: use directory rename guided basename comparisons
A previous commit noted that it is very common for people to move files across directories while keeping their filename the same. The last few commits took advantage of this and showed that we can accelerate rename detection significantly using basenames; since files with the same basename serve as likely rename candidates, we can check those first and remove them from the rename candidate pool if they are sufficiently similar. Unfortunately, the previous optimization was limited by the fact that the remaining basenames after exact rename detection are not always unique. Many repositories have hundreds of build files with the same name (e.g. Makefile, .gitignore, build.gradle, etc.), and may even have hundreds of source files with the same name. (For example, the linux kernel has 100 setup.c, 87 irq.c, and 112 core.c files. A repository at $DAYJOB has a lot of ObjectFactory.java and Plugin.java files). For these files with non-unique basenames, we are faced with the task of attempting to determine or guess which directory they may have been relocated to. Such a task is precisely the job of directory rename detection. However, there are two catches: (1) the directory rename detection code has traditionally been part of the merge machinery rather than diffcore-rename.c, and (2) directory rename detection currently runs after regular rename detection is complete. The 1st catch is just an implementation issue that can be overcome by some code shuffling. The 2nd requires us to add a further approximation: we only have access to exact renames at this point, so we need to do directory rename detection based on just exact renames. In some cases we won't have exact renames, in which case this extra optimization won't apply. We also choose to not apply the optimization unless we know that the underlying directory was removed, which will require extra data to be passed in to diffcore_rename_extended(). Also, even if we get a prediction about which directory a file may have relocated to, we will still need to check to see if there is a file in the predicted directory, and then compare the two files to see if they meet the higher min_basename_score threshold required for marking the two files as renames. This commit introduces an idx_possible_rename() function which will do this directory rename detection for us and give us the index within rename_dst of the resulting filename. For now, this function is hardcoded to return -1 (not found) and just hooks up how its results would be used once we have a more complete implementation in place. Reviewed-by: Derrick Stolee <[email protected]> Signed-off-by: Elijah Newren <[email protected]> Signed-off-by: Junio C Hamano <[email protected]>
1 parent f78cf97 commit 37a2514

File tree

2 files changed

+35
-9
lines changed

2 files changed

+35
-9
lines changed

Documentation/gitdiffcore.txt

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -186,7 +186,7 @@ mark a file pair as a rename and stop considering other candidates for
186186
better matches. At most, one comparison is done per file in this
187187
preliminary pass; so if there are several remaining ext.txt files
188188
throughout the directory hierarchy after exact rename detection, this
189-
preliminary step will be skipped for those files.
189+
preliminary step may be skipped for those files.
190190

191191
Note. When the "-C" option is used with `--find-copies-harder`
192192
option, 'git diff-{asterisk}' commands feed unmodified filepairs to

diffcore-rename.c

Lines changed: 34 additions & 8 deletions
Original file line numberDiff line numberDiff line change
@@ -379,6 +379,12 @@ static const char *get_basename(const char *filename)
379379
return base ? base + 1 : filename;
380380
}
381381

382+
static int idx_possible_rename(char *filename)
383+
{
384+
/* Unconditionally return -1, "not found", for now */
385+
return -1;
386+
}
387+
382388
static int find_basename_matches(struct diff_options *options,
383389
int minimum_score)
384390
{
@@ -415,8 +421,6 @@ static int find_basename_matches(struct diff_options *options,
415421
int i, renames = 0;
416422
struct strintmap sources;
417423
struct strintmap dests;
418-
struct hashmap_iter iter;
419-
struct strmap_entry *entry;
420424

421425
/*
422426
* The prefeteching stuff wants to know if it can skip prefetching
@@ -466,17 +470,39 @@ static int find_basename_matches(struct diff_options *options,
466470
}
467471

468472
/* Now look for basename matchups and do similarity estimation */
469-
strintmap_for_each_entry(&sources, &iter, entry) {
470-
const char *base = entry->key;
471-
intptr_t src_index = (intptr_t)entry->value;
473+
for (i = 0; i < rename_src_nr; ++i) {
474+
char *filename = rename_src[i].p->one->path;
475+
const char *base = NULL;
476+
intptr_t src_index;
472477
intptr_t dst_index;
473-
if (src_index == -1)
474-
continue;
475478

476-
if (0 <= (dst_index = strintmap_get(&dests, base))) {
479+
/*
480+
* If the basename is unique among remaining sources, then
481+
* src_index will equal 'i' and we can attempt to match it
482+
* to a unique basename in the destinations. Otherwise,
483+
* use directory rename heuristics, if possible.
484+
*/
485+
base = get_basename(filename);
486+
src_index = strintmap_get(&sources, base);
487+
assert(src_index == -1 || src_index == i);
488+
489+
if (strintmap_contains(&dests, base)) {
477490
struct diff_filespec *one, *two;
478491
int score;
479492

493+
/* Find a matching destination, if possible */
494+
dst_index = strintmap_get(&dests, base);
495+
if (src_index == -1 || dst_index == -1) {
496+
src_index = i;
497+
dst_index = idx_possible_rename(filename);
498+
}
499+
if (dst_index == -1)
500+
continue;
501+
502+
/* Ignore this dest if already used in a rename */
503+
if (rename_dst[dst_index].is_rename)
504+
continue; /* already used previously */
505+
480506
/* Estimate the similarity */
481507
one = rename_src[src_index].p->one;
482508
two = rename_dst[dst_index].p->two;

0 commit comments

Comments
 (0)