Skip to content

Commit 8d92fb2

Browse files
newrengitster
authored andcommitted
dir: replace exponential algorithm with a linear one
dir's read_directory_recursive() naturally operates recursively in order to walk the directory tree. Treating of directories is sometimes weird because there are so many different permutations about how to handle directories. Some examples: * 'git ls-files -o --directory' only needs to know that a directory itself is untracked; it doesn't need to recurse into it to see what is underneath. * 'git status' needs to recurse into an untracked directory, but only to determine whether or not it is empty. If there are no files underneath, the directory itself will be omitted from the output. If it is not empty, only the directory will be listed. * 'git status --ignored' needs to recurse into untracked directories and report all the ignored entries and then report the directory as untracked -- UNLESS all the entries under the directory are ignored, in which case we don't print any of the entries under the directory and just report the directory itself as ignored. (Note that although this forces us to walk all untracked files underneath the directory as well, we strip them from the output, except for users like 'git clean' who also set DIR_KEEP_TRACKED_CONTENTS.) * For 'git clean', we may need to recurse into a directory that doesn't match any specified pathspecs, if it's possible that there is an entry underneath the directory that can match one of the pathspecs. In such a case, we need to be careful to omit the directory itself from the list of paths (see commit 404ebce ("dir: also check directories for matching pathspecs", 2019-09-17)) Part of the tension noted above is that the treatment of a directory can change based on the files within it, and based on the various settings in dir->flags. Trying to keep this in mind while reading over the code, it is easy to think in terms of "treat_directory() tells us what to do with a directory, and read_directory_recursive() is the thing that recurses". Since we need to look into a directory to know how to treat it, though, it is quite easy to decide to (also) recurse into the directory from treat_directory() by adding a read_directory_recursive() call. Adding such a call is actually fine, IF we make sure that read_directory_recursive() does not also recurse into that same directory. Unfortunately, commit df5bcdf ("dir: recurse into untracked dirs for ignored files", 2017-05-18), added exactly such a case to the code, meaning we'd have two calls to read_directory_recursive() for an untracked directory. So, if we had a file named one/two/three/four/five/somefile.txt and nothing in one/ was tracked, then 'git status --ignored' would call read_directory_recursive() twice on the directory 'one/', and each of those would call read_directory_recursive() twice on the directory 'one/two/', and so on until read_directory_recursive() was called 2^5 times for 'one/two/three/four/five/'. Avoid calling read_directory_recursive() twice per level by moving a lot of the special logic into treat_directory(). Since dir.c is somewhat complex, extra cruft built up around this over time. While trying to unravel it, I noticed several instances where the first call to read_directory_recursive() would return e.g. path_untracked for some directory and a later one would return e.g. path_none, despite the fact that the directory clearly should have been considered untracked. The code happened to work due to the side-effect from the first invocation of adding untracked entries to dir->entries; this allowed it to get the correct output despite the supposed override in return value by the later call. I am somewhat concerned that there are still bugs and maybe even testcases with the wrong expectation. I have tried to carefully document treat_directory() since it becomes more complex after this change (though much of this complexity came from elsewhere that probably deserved better comments to begin with). However, much of my work felt more like a game of whackamole while attempting to make the code match the existing regression tests than an attempt to create an implementation that matched some clear design. That seems wrong to me, but the rules of existing behavior had so many special cases that I had a hard time coming up with some overarching rules about what correct behavior is for all cases, forcing me to hope that the regression tests are correct and sufficient. Such a hope seems likely to be ill-founded, given my experience with dir.c-related testcases in the last few months: Examples where the documentation was hard to parse or even just wrong: * 3aca580 (git-clean.txt: do not claim we will delete files with -n/--dry-run, 2019-09-17) * 09487f2 (clean: avoid removing untracked files in a nested git repository, 2019-09-17) * e86bbcf (clean: disambiguate the definition of -d, 2019-09-17) Examples where testcases were declared wrong and changed: * 09487f2 (clean: avoid removing untracked files in a nested git repository, 2019-09-17) * e86bbcf (clean: disambiguate the definition of -d, 2019-09-17) * a2b1336 (Revert "dir.c: make 'git-status --ignored' work within leading directories", 2019-12-10) Examples where testcases were clearly inadequate: * 502c386 (t7300-clean: demonstrate deleting nested repo with an ignored file breakage, 2019-08-25) * 7541cc5 (t7300: add testcases showing failure to clean specified pathspecs, 2019-09-17) * a5e916c (dir: fix off-by-one error in match_pathspec_item, 2019-09-17) * 404ebce (dir: also check directories for matching pathspecs, 2019-09-17) * 09487f2 (clean: avoid removing untracked files in a nested git repository, 2019-09-17) * e86bbcf (clean: disambiguate the definition of -d, 2019-09-17) * 452efd1 (t3011: demonstrate directory traversal failures, 2019-12-10) * b9670c1 (dir: fix checks on common prefix directory, 2019-12-19) Examples where "correct behavior" was unclear to everyone: https://lore.kernel.org/git/[email protected]/ Other commits of note: * 902b90c (clean: fix theoretical path corruption, 2019-09-17) However, on the positive side, it does make the code much faster. For the following simple shell loop in an empty repository: for depth in $(seq 10 25) do dirs=$(for i in $(seq 1 $depth) ; do printf 'dir/' ; done) rm -rf dir mkdir -p $dirs >$dirs/untracked-file /usr/bin/time --format="$depth: %e" git status --ignored >/dev/null done I saw the following timings, in seconds (note that the numbers are a little noisy from run-to-run, but the trend is very clear with every run): 10: 0.03 11: 0.05 12: 0.08 13: 0.19 14: 0.29 15: 0.50 16: 1.05 17: 2.11 18: 4.11 19: 8.60 20: 17.55 21: 33.87 22: 68.71 23: 140.05 24: 274.45 25: 551.15 For the above run, using strace I can look for the number of untracked directories opened and can verify that it matches the expected 2^($depth+1)-2 (the sum of 2^1 + 2^2 + 2^3 + ... + 2^$depth). After this fix, with strace I can verify that the number of untracked directories that are opened drops to just $depth, and the timings all drop to 0.00. In fact, it isn't until a depth of 190 nested directories that it sometimes starts reporting a time of 0.01 seconds and doesn't consistently report 0.01 seconds until there are 240 nested directories. The previous code would have taken 17.55 * 2^220 / (60*60*24*365) = 9.4 * 10^59 YEARS to have completed the 240 nested directories case. It's not often that you get to speed something up by a factor of 3*10^69. Signed-off-by: Elijah Newren <[email protected]> Signed-off-by: Junio C Hamano <[email protected]>
1 parent 0bbd0e8 commit 8d92fb2

File tree

1 file changed

+147
-63
lines changed

1 file changed

+147
-63
lines changed

dir.c

Lines changed: 147 additions & 63 deletions
Original file line numberDiff line numberDiff line change
@@ -1659,7 +1659,13 @@ static enum path_treatment treat_directory(struct dir_struct *dir,
16591659
const char *dirname, int len, int baselen, int excluded,
16601660
const struct pathspec *pathspec)
16611661
{
1662-
int nested_repo = 0;
1662+
/*
1663+
* WARNING: From this function, you can return path_recurse or you
1664+
* can call read_directory_recursive() (or neither), but
1665+
* you CAN'T DO BOTH.
1666+
*/
1667+
enum path_treatment state;
1668+
int nested_repo = 0, old_ignored_nr, check_only, stop_early;
16631669
/* The "len-1" is to strip the final '/' */
16641670
enum exist_status status = directory_exists_in_index(istate, dirname, len-1);
16651671

@@ -1711,18 +1717,135 @@ static enum path_treatment treat_directory(struct dir_struct *dir,
17111717

17121718
/* This is the "show_other_directories" case */
17131719

1714-
if (!(dir->flags & DIR_HIDE_EMPTY_DIRECTORIES))
1720+
/*
1721+
* If we have a pathspec which could match something _below_ this
1722+
* directory (e.g. when checking 'subdir/' having a pathspec like
1723+
* 'subdir/some/deep/path/file' or 'subdir/widget-*.c'), then we
1724+
* need to recurse.
1725+
*/
1726+
if (pathspec) {
1727+
int ret = do_match_pathspec(istate, pathspec, dirname, len,
1728+
0 /* prefix */, NULL /* seen */,
1729+
DO_MATCH_LEADING_PATHSPEC);
1730+
if (ret == MATCHED_RECURSIVELY_LEADING_PATHSPEC)
1731+
return path_recurse;
1732+
}
1733+
1734+
/*
1735+
* Other than the path_recurse case immediately above, we only need
1736+
* to recurse into untracked/ignored directories if either of the
1737+
* following bits is set:
1738+
* - DIR_SHOW_IGNORED_TOO (because then we need to determine if
1739+
* there are ignored directories below)
1740+
* - DIR_HIDE_EMPTY_DIRECTORIES (because we have to determine if
1741+
* the directory is empty)
1742+
*/
1743+
if (!(dir->flags & (DIR_SHOW_IGNORED_TOO | DIR_HIDE_EMPTY_DIRECTORIES)))
17151744
return excluded ? path_excluded : path_untracked;
17161745

1746+
/*
1747+
* ...and even if DIR_SHOW_IGNORED_TOO is set, we can still avoid
1748+
* recursing into ignored directories if the path is excluded and
1749+
* DIR_SHOW_IGNORED_TOO_MODE_MATCHING is also set.
1750+
*/
1751+
if (excluded &&
1752+
(dir->flags & DIR_SHOW_IGNORED_TOO) &&
1753+
(dir->flags & DIR_SHOW_IGNORED_TOO_MODE_MATCHING))
1754+
return path_excluded;
1755+
1756+
/*
1757+
* If we have we don't want to know the all the paths under an
1758+
* untracked or ignored directory, we still need to go into the
1759+
* directory to determine if it is empty (because an empty directory
1760+
* should be path_none instead of path_excluded or path_untracked).
1761+
*/
1762+
check_only = ((dir->flags & DIR_HIDE_EMPTY_DIRECTORIES) &&
1763+
!(dir->flags & DIR_SHOW_IGNORED_TOO));
1764+
1765+
/*
1766+
* However, there's another optimization possible as a subset of
1767+
* check_only, based on the cases we have to consider:
1768+
* A) Directory matches no exclude patterns:
1769+
* * Directory is empty => path_none
1770+
* * Directory has an untracked file under it => path_untracked
1771+
* * Directory has only ignored files under it => path_excluded
1772+
* B) Directory matches an exclude pattern:
1773+
* * Directory is empty => path_none
1774+
* * Directory has an untracked file under it => path_excluded
1775+
* * Directory has only ignored files under it => path_excluded
1776+
* In case A, we can exit as soon as we've found an untracked
1777+
* file but otherwise have to walk all files. In case B, though,
1778+
* we can stop at the first file we find under the directory.
1779+
*/
1780+
stop_early = check_only && excluded;
1781+
1782+
/*
1783+
* If /every/ file within an untracked directory is ignored, then
1784+
* we want to treat the directory as ignored (for e.g. status
1785+
* --porcelain), without listing the individual ignored files
1786+
* underneath. To do so, we'll save the current ignored_nr, and
1787+
* pop all the ones added after it if it turns out the entire
1788+
* directory is ignored.
1789+
*/
1790+
old_ignored_nr = dir->ignored_nr;
1791+
1792+
/* Actually recurse into dirname now, we'll fixup the state later. */
17171793
untracked = lookup_untracked(dir->untracked, untracked,
17181794
dirname + baselen, len - baselen);
1795+
state = read_directory_recursive(dir, istate, dirname, len, untracked,
1796+
check_only, stop_early, pathspec);
1797+
1798+
/* There are a variety of reasons we may need to fixup the state... */
1799+
if (state == path_excluded) {
1800+
/* state == path_excluded implies all paths under
1801+
* dirname were ignored...
1802+
*
1803+
* if running e.g. `git status --porcelain --ignored=matching`,
1804+
* then we want to see the subpaths that are ignored.
1805+
*
1806+
* if running e.g. just `git status --porcelain`, then
1807+
* we just want the directory itself to be listed as ignored
1808+
* and not the individual paths underneath.
1809+
*/
1810+
int want_ignored_subpaths =
1811+
((dir->flags & DIR_SHOW_IGNORED_TOO) &&
1812+
(dir->flags & DIR_SHOW_IGNORED_TOO_MODE_MATCHING));
1813+
1814+
if (want_ignored_subpaths) {
1815+
/*
1816+
* with --ignored=matching, we want the subpaths
1817+
* INSTEAD of the directory itself.
1818+
*/
1819+
state = path_none;
1820+
} else {
1821+
int i;
1822+
for (i = old_ignored_nr + 1; i<dir->ignored_nr; ++i)
1823+
FREE_AND_NULL(dir->ignored[i]);
1824+
dir->ignored_nr = old_ignored_nr;
1825+
}
1826+
}
17191827

17201828
/*
1721-
* If this is an excluded directory, then we only need to check if
1722-
* the directory contains any files.
1829+
* If there is nothing under the current directory and we are not
1830+
* hiding empty directories, then we need to report on the
1831+
* untracked or ignored status of the directory itself.
17231832
*/
1724-
return read_directory_recursive(dir, istate, dirname, len,
1725-
untracked, 1, excluded, pathspec);
1833+
if (state == path_none && !(dir->flags & DIR_HIDE_EMPTY_DIRECTORIES))
1834+
state = excluded ? path_excluded : path_untracked;
1835+
1836+
/*
1837+
* We can recurse into untracked directories that don't match any
1838+
* of the given pathspecs when some file underneath the directory
1839+
* might match one of the pathspecs. If so, we should make sure
1840+
* to note that the directory itself did not match.
1841+
*/
1842+
if (pathspec &&
1843+
!match_pathspec(istate, pathspec, dirname, len,
1844+
0 /* prefix */, NULL,
1845+
0 /* do NOT special case dirs */))
1846+
state = path_none;
1847+
1848+
return state;
17261849
}
17271850

17281851
/*
@@ -1870,6 +1993,11 @@ static enum path_treatment treat_path_fast(struct dir_struct *dir,
18701993
int baselen,
18711994
const struct pathspec *pathspec)
18721995
{
1996+
/*
1997+
* WARNING: From this function, you can return path_recurse or you
1998+
* can call read_directory_recursive() (or neither), but
1999+
* you CAN'T DO BOTH.
2000+
*/
18732001
strbuf_setlen(path, baselen);
18742002
if (!cdir->ucd) {
18752003
strbuf_addstr(path, cdir->file);
@@ -1904,7 +2032,6 @@ static enum path_treatment treat_path(struct dir_struct *dir,
19042032
const struct pathspec *pathspec)
19052033
{
19062034
int has_path_in_index, dtype, excluded;
1907-
enum path_treatment path_treatment;
19082035

19092036
if (!cdir->d_name)
19102037
return treat_path_fast(dir, untracked, cdir, istate, path,
@@ -1961,24 +2088,16 @@ static enum path_treatment treat_path(struct dir_struct *dir,
19612088
default:
19622089
return path_none;
19632090
case DT_DIR:
1964-
strbuf_addch(path, '/');
1965-
path_treatment = treat_directory(dir, istate, untracked,
1966-
path->buf, path->len,
1967-
baselen, excluded, pathspec);
19682091
/*
1969-
* If 1) we only want to return directories that
1970-
* match an exclude pattern and 2) this directory does
1971-
* not match an exclude pattern but all of its
1972-
* contents are excluded, then indicate that we should
1973-
* recurse into this directory (instead of marking the
1974-
* directory itself as an ignored path).
2092+
* WARNING: Do not ignore/amend the return value from
2093+
* treat_directory(), and especially do not change it to return
2094+
* path_recurse as that can cause exponential slowdown.
2095+
* Instead, modify treat_directory() to return the right value.
19752096
*/
1976-
if (!excluded &&
1977-
path_treatment == path_excluded &&
1978-
(dir->flags & DIR_SHOW_IGNORED_TOO) &&
1979-
(dir->flags & DIR_SHOW_IGNORED_TOO_MODE_MATCHING))
1980-
return path_recurse;
1981-
return path_treatment;
2097+
strbuf_addch(path, '/');
2098+
return treat_directory(dir, istate, untracked,
2099+
path->buf, path->len,
2100+
baselen, excluded, pathspec);
19822101
case DT_REG:
19832102
case DT_LNK:
19842103
return excluded ? path_excluded : path_untracked;
@@ -2175,14 +2294,10 @@ static enum path_treatment read_directory_recursive(struct dir_struct *dir,
21752294
int stop_at_first_file, const struct pathspec *pathspec)
21762295
{
21772296
/*
2178-
* WARNING WARNING WARNING:
2179-
*
2180-
* Any updates to the traversal logic here may need corresponding
2181-
* updates in treat_leading_path(). See the commit message for the
2182-
* commit adding this warning as well as the commit preceding it
2183-
* for details.
2297+
* WARNING: Do NOT recurse unless path_recurse is returned from
2298+
* treat_path(). Recursing on any other return value
2299+
* can result in exponential slowdown.
21842300
*/
2185-
21862301
struct cached_dir cdir;
21872302
enum path_treatment state, subdir_state, dir_state = path_none;
21882303
struct strbuf path = STRBUF_INIT;
@@ -2204,13 +2319,7 @@ static enum path_treatment read_directory_recursive(struct dir_struct *dir,
22042319
dir_state = state;
22052320

22062321
/* recurse into subdir if instructed by treat_path */
2207-
if ((state == path_recurse) ||
2208-
((state == path_untracked) &&
2209-
(resolve_dtype(cdir.d_type, istate, path.buf, path.len) == DT_DIR) &&
2210-
((dir->flags & DIR_SHOW_IGNORED_TOO) ||
2211-
(pathspec &&
2212-
do_match_pathspec(istate, pathspec, path.buf, path.len,
2213-
baselen, NULL, DO_MATCH_LEADING_PATHSPEC) == MATCHED_RECURSIVELY_LEADING_PATHSPEC)))) {
2322+
if (state == path_recurse) {
22142323
struct untracked_cache_dir *ud;
22152324
ud = lookup_untracked(dir->untracked, untracked,
22162325
path.buf + baselen,
@@ -2294,15 +2403,6 @@ static int treat_leading_path(struct dir_struct *dir,
22942403
const char *path, int len,
22952404
const struct pathspec *pathspec)
22962405
{
2297-
/*
2298-
* WARNING WARNING WARNING:
2299-
*
2300-
* Any updates to the traversal logic here may need corresponding
2301-
* updates in read_directory_recursive(). See 777b420347 (dir:
2302-
* synchronize treat_leading_path() and read_directory_recursive(),
2303-
* 2019-12-19) and its parent commit for details.
2304-
*/
2305-
23062406
struct strbuf sb = STRBUF_INIT;
23072407
struct strbuf subdir = STRBUF_INIT;
23082408
int prevlen, baselen;
@@ -2353,23 +2453,7 @@ static int treat_leading_path(struct dir_struct *dir,
23532453
strbuf_reset(&subdir);
23542454
strbuf_add(&subdir, path+prevlen, baselen-prevlen);
23552455
cdir.d_name = subdir.buf;
2356-
state = treat_path(dir, NULL, &cdir, istate, &sb, prevlen,
2357-
pathspec);
2358-
if (state == path_untracked &&
2359-
resolve_dtype(cdir.d_type, istate, sb.buf, sb.len) == DT_DIR &&
2360-
(dir->flags & DIR_SHOW_IGNORED_TOO ||
2361-
do_match_pathspec(istate, pathspec, sb.buf, sb.len,
2362-
baselen, NULL, DO_MATCH_LEADING_PATHSPEC) == MATCHED_RECURSIVELY_LEADING_PATHSPEC)) {
2363-
if (!match_pathspec(istate, pathspec, sb.buf, sb.len,
2364-
0 /* prefix */, NULL,
2365-
0 /* do NOT special case dirs */))
2366-
state = path_none;
2367-
add_path_to_appropriate_result_list(dir, NULL, &cdir,
2368-
istate,
2369-
&sb, baselen,
2370-
pathspec, state);
2371-
state = path_recurse;
2372-
}
2456+
state = treat_path(dir, NULL, &cdir, istate, &sb, prevlen, pathspec);
23732457

23742458
if (state != path_recurse)
23752459
break; /* do not recurse into it */

0 commit comments

Comments
 (0)