-
Notifications
You must be signed in to change notification settings - Fork 157
Introduce git-blame-tree(1) command #1894
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
This is yet another attempt to upstream the builtin command `git-blame-tree(1)`. This command is similar to git-blame(1) and shows the most recent modification to paths in a tree.. The last attempt (I'm aware of) was made by Ævar in 2023[1]. That series was based of patches by Peff written in 2011[2]. In contrary to Ævar's attempt, my series implements the `git-blame-tree(1)` standalone command. Ævar suggested to only implement the blame-tree subsystem, and test it through test-tool. But since the command line options of that test-tool subcommand were very similar to the standalone command we want to land, I decided transfer the test-tool subcommand into a real Git builtin. Also, nowadays if we want to test the blame-tree subsystem independently, we should use clar unit tests instead of test-tool. This series brings back the --max-depth changes. I know Ævar removed them because there was controversy, but I think they are relevant to include. And I want to open up the conversation again. Also extra patches from Peff are included to implement pathspec tries. I've taken them from Peff's fork on GitHub[3], but they also have been posted to this mailing list[4]. Other improvements I've made: * Rebase onto current master. * Remove the use of USE_THE_REPOSITORY_VARIABLE. * Fix various memory leaks. * Moved code from blame-tree.c to pathspec.c, as suggested in Peff's last patch on GitHub. * Updated the benchmark results in the last commit message, although the numbers were roughly the same. I don't expect this code to be deemed ready to merge, but I mainly want to gather as much feedback as possible and use it to iterate toward actually getting it merged. There are some things I know that are missing, for example `--porcelain`, but those can be added in a follow-up. At the moment there's one test marked `test_expect_failure`, but I was not able yet identify that bug. So, let me know what you think? -- Toon [1]: https://lore.kernel.org/git/[email protected]/ [2]: https://lore.kernel.org/git/[email protected]/ [3]: https://github.com/peff/git/tree/jk/faster-blame-tree-wip [4]: https://lore.kernel.org/git/[email protected]/ Cc: Jeff King <[email protected]> Cc: Patrick Steinhardt <[email protected]> --- b4-submit-tracking --- # This section is used internally by b4 prep for tracking purposes. { "series": { "revision": 1, "change-id": "20250326-toon-blame-tree-2ca74d7712ac", "prefixes": [] } }
In commit 25e5e2b (combine-diff: support format_callback, 2011-08-19), the combined-diff code learned how to make a multi-sourced diff_filepair to pass to a diff callback. When we create each filepair, we do not bother to fill in many of the fields, because they would make no sense (e.g., there can be no rename score or broken_pair flag because we do not go through the diffcore filters). However, we did not even bother to zero them, leading to random values. Let's make sure everything is blank with xcalloc, just as the regular diff code does. We would potentially want to set the "status" flag to something non-zero, but it is not clear to what. Possibly a new DIFF_STATUS_COMBINED would make sense, as this is not strictly a modification, nor does it fit any other category. Since it is not yet clear what callers would want, this patch simply leaves it as "0", the same empty flag that is seen when diffcore_std is not used at all. Signed-off-by: Jeff King <[email protected]>
The within_depth function is used to check whether pathspecs limited by a max-depth parameter are acceptable. It takes a path to check, a maximum depth, and a "base" depth. It counts the components in the path (by counting slashes), adds them to the base, and compare them to the maximum. However, if the base does not have any slashes at all, we always return "true". If the base depth is 0, then this is correct; no matter what the maximum is, we are always within it. However, if the base depth is greater than 0, then we might return an erroneous result. This ends up not causing any user-visible bugs in the current code. The call sites in dir.c always pass a base depth of 0, so are unaffected. But tree_entry_interesting uses this function differently: it will pass the prefix of the current entry, along with a "1" if the entry is a directory, in essence checking whether items inside the entry would be of interest. It turns out not to make a difference in behavior, but the reasoning is complex. Given a tree like: file a/file a/b/file walking the tree and calling tree_entry_interesting will yield the following results: (with max_depth=0): file: yes a: yes a/file: no a/b: no (with max_depth=1): file: yes a: yes a/file: yes a/b: no So we have inconsistent behavior in considering directories interesting. If they are at the edge of our depth but at the root, we will recurse into them, but then find all of their entries uninteresting (e.g., in the first case, we will look at "a" but find "a/*" uninteresting). But if they are at the edge of our depth and not at the root, then we will not recurse (in the second example, we do not even bother entering "a/b"). This turns out not to matter because the only caller which uses max-depth pathspecs is cmd_grep, which only cares about blob entries. From its perspective, it is exactly the same to not recurse into a subtree, or to recurse and find that it contains no matching entries. Not recursing is merely an optimization. It is debatable whether tree_entry_interesting should consider such an entry interesting. The only caller does not care if it sees the tree itself, and can benefit from the optimization. But if we add a "max-depth" limiter to regular diffs, then a diff with DIFF_OPT_TREE_IN_RECURSIVE would probably want to show the tree itself, but not what it contains. This patch just fixes within_depth, which means we consider such entries uninteresting (and makes the current caller happy). If we want to change that in the future, then this fix is still the correct first step, as the current behavior is simply inconsistent. Signed-off-by: Jeff King <[email protected]>
When you are doing a tree-diff, there are basically two options: do not recurse into subtrees at all, or recurse indefinitely. While most callers would want to always recurse and see full pathnames, some may want the efficiency of looking only at a particular level of the tree. This is currently easy to do for the top-level (just turn off recursion), but you cannot say "show me what changed in subdir/, but do not recurse". This patch adds a max-depth parameter which is measured from the closest pathspec match, so that you can do: git log --raw --max-depth=1 a/b/c and see the contents of a/b/c/, but not those of a/b/c/d/ (you would see the --raw entry for a/b/c/d in such a case). Signed-off-by: Jeff King <[email protected]>
Similar to git-blame(1), introduce a new subcommand git-blame-tree(1). This command shows the most recent modification to paths in a tree. It does so by expanding the tree at a given commit, taking note of the current state of each path, and then walking backwards through history looking for commits where each path changed into its final commit ID. Based-on-a-patch-by: Jeff King <[email protected]> Improved-by: "Ævar Arnfjörð Bjarmason" <[email protected]> Signed-off-by: Toon Claes <[email protected]>
When checking if a path matches our pathspec list, in the worst case we walk the whole list linearly to check each pathspec. As a result, an operation like a tree diff does `O(paths * pathspecs)` comparisons. If we assume that the number of pathspecs may approach the number of paths, this is `O(n^2)` in the number of paths. We have to do it this way in the general case because pathspecs may be arbitrary fnmatch expressions, and we cannot create any meaningful index. However, it is common that all of the pathspecs are vanilla prefixes (e.g., "git.c" or "Documentation/") that do not use any wildcards or magic features. In such a case, we can index the pathspec list to give us faster lookups. The simplest solution would be to keep the pathspec list sorted, and use a binary search over the entries for each path. This turns out to be rather tricky, though, as we are not looking for an exact match in the list. We must also find prefix matches, and take care to handle trailing slashes for directories. Instead, this patch introduces a pathspec_trie struct, which represents the pathspecs as a graph of entries. For example, the pathspec list ["foo/bar", "foo/baz/"] would be represented as: (bar, terminal) / (foo) \ (baz, terminal, must_be_dir) To see if a path "foo/eek" is covered by the pathspec, we walk the trie, matching "foo" but ultimately seeing that "eek" is not mentioned. This provides us with two optimizations: 1. The children of each trie node are simple strings representing directory components. This means we can sort and binary-search them, giving us logarithmic lookups (but note that rather than one log lookup on the whole pathname, it is a sequence of smaller log lookups on components of the pathname). 2. Many users of the pathspec code do not feed full pathnames, but instead are walking a tree hierarchy themselves, and limiting the walk as they go. They can walk the trie at the same time, meanining they avoid looking repeatedly at parts of the pathspec we already know are matched. The current code, by contrast, will repeatedly match "foo/" against each pathspec when looking at "foo/bar", "foo/baz", etc. Note that this patch just introduces the data structure; the code is not used anywhere yet. Signed-off-by: Jeff King <[email protected]>
An earlier commit introduced pathspec_tries, but we did not actually generate them by default. This patch causes us to do so when it is possible (i.e., when no wildcards or other pathspec magic are in use). This doesn't actually do anything yet, though, as none of the pathspec users have learned to make use of the tries. We embed the pathspec_trie directly inside the "struct pathspec". This is not strictly necessary, as once created, the trie does not depend on the original pathspec. However, since the intended use is to optimize existing pathspec callers, passing the trie around as part of the pathspec will minimize disruption to the call chain. Signed-off-by: Jeff King <[email protected]>
The tree-diff currently matches each pathspec against every entry of the tree. For the common case of a handful of pathspecs, this is not a big deal. However, if you have a large number of pathspecs, it gets noticeably slow. Now that we have the pathspec_trie optimization, we can do much better (at least for simple cases without wildcards). Here are numbers for running "git rev-list" with limiting pathspecs of varying sizes, both before and after this patch: Test origin HEAD --------------------------------------------------------------- 4001.2: size=1 0.12(0.11+0.00) 0.12(0.12+0.00) +0.0% 4001.3: size=2 0.17(0.16+0.00) 0.16(0.15+0.01) -5.9% 4001.4: size=4 0.17(0.17+0.00) 0.17(0.17+0.00) +0.0% 4001.5: size=8 0.21(0.20+0.00) 0.20(0.20+0.00) -4.8% 4001.6: size=16 0.25(0.24+0.00) 0.21(0.20+0.00) -16.0% 4001.7: size=32 0.31(0.31+0.00) 0.21(0.20+0.00) -32.3% 4001.8: size=64 0.43(0.41+0.01) 0.21(0.21+0.00) -51.2% 4001.9: size=128 0.73(0.72+0.00) 0.22(0.21+0.00) -69.9% 4001.10: size=256 2.02(2.02+0.00) 0.37(0.36+0.00) -81.7% 4001.11: size=512 6.78(6.78+0.00) 0.64(0.64+0.00) -90.6% 4001.12: size=1024 23.67(23.67+0.02) 1.22(1.20+0.01) -94.8% For small pathspecs, we produce no real difference (which is good; we know we are asymptotically better, but we have not regressed our constant factor). Between 16 and 32 pathspecs we start to see some small improvement, and the benefit keeps growing with the number of pathspecs. Obviously these large-pathspec cases are unusual. But you might use them, for example, if the pathspecs were generated programatically (e.g., if you want the history of all files that are in Documentation/ _now_, not what was historically ever there, you would expand the pathspec at the current tree, and feed the result to rev-list). Signed-off-by: Jeff King <[email protected]>
As we find out the "last touched" for each path, we cease to care about that path. However, we traditionally have not told the revision-walking machinery about this, for two reasons: 1. Munging the pathspecs mid-traversal is not an officially supported thing to be doing. It does seem to work, though, and with Duy's recent pathspec refactoring, it's fairly straightforward. 2. The pathspec machinery is slow. We know have to look at each pathspec to see if each tree entry is interesting. And since our cost to handle a diff_filepair is so cheap, it doesn't really help us much. It turns out that (2) is not quite right, though. For the diffs themselves, limiting the pathspec doesn't help at all. But it does mean that we can simplify history more aggressively, which may mean avoiding whole swaths of history that don't touch interesting paths. This happens especially near the end of the traversal, when we are only interested in a few paths, and there are many side branches that do not touch any of them. And as of the last commit, the pathspec machinery is no longer quadratic, so the cost to use it is easily worth paying, even for large trees. Here are runs of `git blame-tree --max-depth=0` for torvalds/linux (a complex history with a small root tree) and git/git (a flat hierarchy with a large root tree): For git/git: Benchmark 1: ./git.old blame-tree --max-depth=0 Time (mean ± σ): 2.223 s ± 0.360 s [User: 1.808 s, System: 0.369 s] Range (min … max): 1.866 s … 2.927 s 10 runs Benchmark 2: ./git.new blame-tree --max-depth=0 Time (mean ± σ): 945.0 ms ± 93.4 ms [User: 783.6 ms, System: 131.0 ms] Range (min … max): 846.4 ms … 1071.5 ms 10 runs Summary ./git.new blame-tree --max-depth=0 ran 2.35 ± 0.45 times faster than git.old blame-tree --max-depth=0 For torvalds/linux: Benchmark 1: ./git.old blame-tree --max-depth=0 Time (mean ± σ): 13.504 s ± 0.545 s [User: 12.605 s, System: 0.583 s] Range (min … max): 12.947 s … 14.488 s 10 runs Benchmark 2: ./git.new blame-tree --max-depth=0 Time (mean ± σ): 622.8 ms ± 39.2 ms [User: 522.1 ms, System: 95.7 ms] Range (min … max): 556.6 ms … 681.6 ms 10 runs Summary ./git.new blame-tree --max-depth=0 ran 21.68 ± 1.62 times faster than ./git.old blame-tree --max-depth=0 Note that the output may change slightly in some cases due to the history pruning. This should be acceptable, as either output is correct in these cases (e.g., scenarios like a cherry-picked commit on two parallel branches). Signed-off-by: Jeff King <[email protected]>
Welcome to GitGitGadgetHi @To1ne, and welcome to GitGitGadget, the GitHub App to send patch series to the Git mailing list from GitHub Pull Requests. Please make sure that either:
You can CC potential reviewers by adding a footer to the PR description with the following syntax:
NOTE: DO NOT copy/paste your CC list from a previous GGG PR's description, Also, it is a good idea to review the commit messages one last time, as the Git project expects them in a quite specific form:
It is in general a good idea to await the automated test ("Checks") in this Pull Request before contributing the patches, e.g. to avoid trivial issues such as unportable code. Contributing the patchesBefore you can contribute the patches, your GitHub username needs to be added to the list of permitted users. Any already-permitted user can do that, by adding a comment to your PR of the form Both the person who commented An alternative is the channel
Once on the list of permitted usernames, you can contribute the patches to the Git mailing list by adding a PR comment If you want to see what email(s) would be sent for a After you submit, GitGitGadget will respond with another comment that contains the link to the cover letter mail in the Git mailing list archive. Please make sure to monitor the discussion in that thread and to address comments and suggestions (while the comments and suggestions will be mirrored into the PR by GitGitGadget, you will still want to reply via mail). If you do not want to subscribe to the Git mailing list just to be able to respond to a mail, you can download the mbox from the Git mailing list archive (click the curl -g --user "<EMailAddress>:<Password>" \
--url "imaps://imap.gmail.com/INBOX" -T /path/to/raw.txt To iterate on your change, i.e. send a revised patch or patch series, you will first want to (force-)push to the same branch. You probably also want to modify your Pull Request description (or title). It is a good idea to summarize the revision by adding something like this to the cover letter (read: by editing the first comment on the PR, i.e. the PR description):
To send a new iteration, just add another PR comment with the contents: Need help?New contributors who want advice are encouraged to join [email protected], where volunteers who regularly contribute to Git are willing to answer newbie questions, give advice, or otherwise provide mentoring to interested contributors. You must join in order to post or view messages, but anyone can join. You may also be able to find help in real time in the developer IRC channel, |
There are issues in commit 0f30e62: |
/allow |
User To1ne is now allowed to use GitGitGadget. |
This is yet another attempt to upstream the builtin command
git-blame-tree(1)
. This command is similar to git-blame(1) and showsthe most recent modification to paths in a tree..
The last attempt (I'm aware of) was made by Ævar in 20231. That
series was based of patches by Peff written in 20112.
In contrary to Ævar's attempt, my series implements the
git-blame-tree(1)
standalone command. Ævar suggested to only implementthe blame-tree subsystem, and test it through test-tool. But since the
command line options of that test-tool subcommand were very similar to
the standalone command we want to land, I decided transfer the test-tool
subcommand into a real Git builtin. Also, nowadays if we want to test
the blame-tree subsystem independently, we should use clar unit tests
instead of test-tool.
This series brings back the --max-depth changes. I know Ævar removed
them because there was controversy, but I think they are relevant to
include. And I want to open up the conversation again.
Also extra patches from Peff are included to implement pathspec tries.
I've taken them from Peff's fork on GitHub3, but they also have been
posted to this mailing list4.
Other improvements I've made:
Rebase onto current master.
Remove the use of USE_THE_REPOSITORY_VARIABLE.
Fix various memory leaks.
Moved code from blame-tree.c to pathspec.c, as suggested in Peff's
last patch on GitHub.
Updated the benchmark results in the last commit message, although the
numbers were roughly the same.
I don't expect this code to be deemed ready to merge, but I mainly want
to gather as much feedback as possible and use it to iterate toward
actually getting it merged.
There are some things I know that are missing, for example
--porcelain
, but those can be added in a follow-up. At the momentthere's one test marked
test_expect_failure
, but I was not able yetidentify that bug.
So, let me know what you think?
--
Toon