Introduce git-blame-tree(1) command #1894

To1ne · 2025-03-26T19:44:16Z

This is yet another attempt to upstream the builtin command
git-blame-tree(1). This command is similar to git-blame(1) and shows
the most recent modification to paths in a tree..

The last attempt (I'm aware of) was made by Ævar in 20231. That
series was based of patches by Peff written in 20112.

In contrary to Ævar's attempt, my series implements the
git-blame-tree(1) standalone command. Ævar suggested to only implement
the blame-tree subsystem, and test it through test-tool. But since the
command line options of that test-tool subcommand were very similar to
the standalone command we want to land, I decided transfer the test-tool
subcommand into a real Git builtin. Also, nowadays if we want to test
the blame-tree subsystem independently, we should use clar unit tests
instead of test-tool.

This series brings back the --max-depth changes. I know Ævar removed
them because there was controversy, but I think they are relevant to
include. And I want to open up the conversation again.

Also extra patches from Peff are included to implement pathspec tries.
I've taken them from Peff's fork on GitHub3, but they also have been
posted to this mailing list4.

Other improvements I've made:

Rebase onto current master.
Remove the use of USE_THE_REPOSITORY_VARIABLE.
Fix various memory leaks.
Moved code from blame-tree.c to pathspec.c, as suggested in Peff's
last patch on GitHub.
Updated the benchmark results in the last commit message, although the
numbers were roughly the same.

I don't expect this code to be deemed ready to merge, but I mainly want
to gather as much feedback as possible and use it to iterate toward
actually getting it merged.

There are some things I know that are missing, for example
--porcelain, but those can be added in a follow-up. At the moment
there's one test marked test_expect_failure, but I was not able yet
identify that bug.

So, let me know what you think?

--
Toon

This is yet another attempt to upstream the builtin command `git-blame-tree(1)`. This command is similar to git-blame(1) and shows the most recent modification to paths in a tree.. The last attempt (I'm aware of) was made by Ævar in 2023[1]. That series was based of patches by Peff written in 2011[2]. In contrary to Ævar's attempt, my series implements the `git-blame-tree(1)` standalone command. Ævar suggested to only implement the blame-tree subsystem, and test it through test-tool. But since the command line options of that test-tool subcommand were very similar to the standalone command we want to land, I decided transfer the test-tool subcommand into a real Git builtin. Also, nowadays if we want to test the blame-tree subsystem independently, we should use clar unit tests instead of test-tool. This series brings back the --max-depth changes. I know Ævar removed them because there was controversy, but I think they are relevant to include. And I want to open up the conversation again. Also extra patches from Peff are included to implement pathspec tries. I've taken them from Peff's fork on GitHub[3], but they also have been posted to this mailing list[4]. Other improvements I've made: * Rebase onto current master. * Remove the use of USE_THE_REPOSITORY_VARIABLE. * Fix various memory leaks. * Moved code from blame-tree.c to pathspec.c, as suggested in Peff's last patch on GitHub. * Updated the benchmark results in the last commit message, although the numbers were roughly the same. I don't expect this code to be deemed ready to merge, but I mainly want to gather as much feedback as possible and use it to iterate toward actually getting it merged. There are some things I know that are missing, for example `--porcelain`, but those can be added in a follow-up. At the moment there's one test marked `test_expect_failure`, but I was not able yet identify that bug. So, let me know what you think? -- Toon [1]: https://lore.kernel.org/git/[email protected]/ [2]: https://lore.kernel.org/git/[email protected]/ [3]: https://github.com/peff/git/tree/jk/faster-blame-tree-wip [4]: https://lore.kernel.org/git/[email protected]/ Cc: Jeff King <[email protected]> Cc: Patrick Steinhardt <[email protected]> --- b4-submit-tracking --- # This section is used internally by b4 prep for tracking purposes. { "series": { "revision": 1, "change-id": "20250326-toon-blame-tree-2ca74d7712ac", "prefixes": [] } }

In commit 25e5e2b (combine-diff: support format_callback, 2011-08-19), the combined-diff code learned how to make a multi-sourced diff_filepair to pass to a diff callback. When we create each filepair, we do not bother to fill in many of the fields, because they would make no sense (e.g., there can be no rename score or broken_pair flag because we do not go through the diffcore filters). However, we did not even bother to zero them, leading to random values. Let's make sure everything is blank with xcalloc, just as the regular diff code does. We would potentially want to set the "status" flag to something non-zero, but it is not clear to what. Possibly a new DIFF_STATUS_COMBINED would make sense, as this is not strictly a modification, nor does it fit any other category. Since it is not yet clear what callers would want, this patch simply leaves it as "0", the same empty flag that is seen when diffcore_std is not used at all. Signed-off-by: Jeff King <[email protected]>

The within_depth function is used to check whether pathspecs limited by a max-depth parameter are acceptable. It takes a path to check, a maximum depth, and a "base" depth. It counts the components in the path (by counting slashes), adds them to the base, and compare them to the maximum. However, if the base does not have any slashes at all, we always return "true". If the base depth is 0, then this is correct; no matter what the maximum is, we are always within it. However, if the base depth is greater than 0, then we might return an erroneous result. This ends up not causing any user-visible bugs in the current code. The call sites in dir.c always pass a base depth of 0, so are unaffected. But tree_entry_interesting uses this function differently: it will pass the prefix of the current entry, along with a "1" if the entry is a directory, in essence checking whether items inside the entry would be of interest. It turns out not to make a difference in behavior, but the reasoning is complex. Given a tree like: file a/file a/b/file walking the tree and calling tree_entry_interesting will yield the following results: (with max_depth=0): file: yes a: yes a/file: no a/b: no (with max_depth=1): file: yes a: yes a/file: yes a/b: no So we have inconsistent behavior in considering directories interesting. If they are at the edge of our depth but at the root, we will recurse into them, but then find all of their entries uninteresting (e.g., in the first case, we will look at "a" but find "a/*" uninteresting). But if they are at the edge of our depth and not at the root, then we will not recurse (in the second example, we do not even bother entering "a/b"). This turns out not to matter because the only caller which uses max-depth pathspecs is cmd_grep, which only cares about blob entries. From its perspective, it is exactly the same to not recurse into a subtree, or to recurse and find that it contains no matching entries. Not recursing is merely an optimization. It is debatable whether tree_entry_interesting should consider such an entry interesting. The only caller does not care if it sees the tree itself, and can benefit from the optimization. But if we add a "max-depth" limiter to regular diffs, then a diff with DIFF_OPT_TREE_IN_RECURSIVE would probably want to show the tree itself, but not what it contains. This patch just fixes within_depth, which means we consider such entries uninteresting (and makes the current caller happy). If we want to change that in the future, then this fix is still the correct first step, as the current behavior is simply inconsistent. Signed-off-by: Jeff King <[email protected]>

When you are doing a tree-diff, there are basically two options: do not recurse into subtrees at all, or recurse indefinitely. While most callers would want to always recurse and see full pathnames, some may want the efficiency of looking only at a particular level of the tree. This is currently easy to do for the top-level (just turn off recursion), but you cannot say "show me what changed in subdir/, but do not recurse". This patch adds a max-depth parameter which is measured from the closest pathspec match, so that you can do: git log --raw --max-depth=1 a/b/c and see the contents of a/b/c/, but not those of a/b/c/d/ (you would see the --raw entry for a/b/c/d in such a case). Signed-off-by: Jeff King <[email protected]>

Similar to git-blame(1), introduce a new subcommand git-blame-tree(1). This command shows the most recent modification to paths in a tree. It does so by expanding the tree at a given commit, taking note of the current state of each path, and then walking backwards through history looking for commits where each path changed into its final commit ID. Based-on-a-patch-by: Jeff King <[email protected]> Improved-by: "Ævar Arnfjörð Bjarmason" <[email protected]> Signed-off-by: Toon Claes <[email protected]>

When checking if a path matches our pathspec list, in the worst case we walk the whole list linearly to check each pathspec. As a result, an operation like a tree diff does `O(paths * pathspecs)` comparisons. If we assume that the number of pathspecs may approach the number of paths, this is `O(n^2)` in the number of paths. We have to do it this way in the general case because pathspecs may be arbitrary fnmatch expressions, and we cannot create any meaningful index. However, it is common that all of the pathspecs are vanilla prefixes (e.g., "git.c" or "Documentation/") that do not use any wildcards or magic features. In such a case, we can index the pathspec list to give us faster lookups. The simplest solution would be to keep the pathspec list sorted, and use a binary search over the entries for each path. This turns out to be rather tricky, though, as we are not looking for an exact match in the list. We must also find prefix matches, and take care to handle trailing slashes for directories. Instead, this patch introduces a pathspec_trie struct, which represents the pathspecs as a graph of entries. For example, the pathspec list ["foo/bar", "foo/baz/"] would be represented as: (bar, terminal) / (foo) \ (baz, terminal, must_be_dir) To see if a path "foo/eek" is covered by the pathspec, we walk the trie, matching "foo" but ultimately seeing that "eek" is not mentioned. This provides us with two optimizations: 1. The children of each trie node are simple strings representing directory components. This means we can sort and binary-search them, giving us logarithmic lookups (but note that rather than one log lookup on the whole pathname, it is a sequence of smaller log lookups on components of the pathname). 2. Many users of the pathspec code do not feed full pathnames, but instead are walking a tree hierarchy themselves, and limiting the walk as they go. They can walk the trie at the same time, meanining they avoid looking repeatedly at parts of the pathspec we already know are matched. The current code, by contrast, will repeatedly match "foo/" against each pathspec when looking at "foo/bar", "foo/baz", etc. Note that this patch just introduces the data structure; the code is not used anywhere yet. Signed-off-by: Jeff King <[email protected]>

An earlier commit introduced pathspec_tries, but we did not actually generate them by default. This patch causes us to do so when it is possible (i.e., when no wildcards or other pathspec magic are in use). This doesn't actually do anything yet, though, as none of the pathspec users have learned to make use of the tries. We embed the pathspec_trie directly inside the "struct pathspec". This is not strictly necessary, as once created, the trie does not depend on the original pathspec. However, since the intended use is to optimize existing pathspec callers, passing the trie around as part of the pathspec will minimize disruption to the call chain. Signed-off-by: Jeff King <[email protected]>

The tree-diff currently matches each pathspec against every entry of the tree. For the common case of a handful of pathspecs, this is not a big deal. However, if you have a large number of pathspecs, it gets noticeably slow. Now that we have the pathspec_trie optimization, we can do much better (at least for simple cases without wildcards). Here are numbers for running "git rev-list" with limiting pathspecs of varying sizes, both before and after this patch: Test origin HEAD --------------------------------------------------------------- 4001.2: size=1 0.12(0.11+0.00) 0.12(0.12+0.00) +0.0% 4001.3: size=2 0.17(0.16+0.00) 0.16(0.15+0.01) -5.9% 4001.4: size=4 0.17(0.17+0.00) 0.17(0.17+0.00) +0.0% 4001.5: size=8 0.21(0.20+0.00) 0.20(0.20+0.00) -4.8% 4001.6: size=16 0.25(0.24+0.00) 0.21(0.20+0.00) -16.0% 4001.7: size=32 0.31(0.31+0.00) 0.21(0.20+0.00) -32.3% 4001.8: size=64 0.43(0.41+0.01) 0.21(0.21+0.00) -51.2% 4001.9: size=128 0.73(0.72+0.00) 0.22(0.21+0.00) -69.9% 4001.10: size=256 2.02(2.02+0.00) 0.37(0.36+0.00) -81.7% 4001.11: size=512 6.78(6.78+0.00) 0.64(0.64+0.00) -90.6% 4001.12: size=1024 23.67(23.67+0.02) 1.22(1.20+0.01) -94.8% For small pathspecs, we produce no real difference (which is good; we know we are asymptotically better, but we have not regressed our constant factor). Between 16 and 32 pathspecs we start to see some small improvement, and the benefit keeps growing with the number of pathspecs. Obviously these large-pathspec cases are unusual. But you might use them, for example, if the pathspecs were generated programatically (e.g., if you want the history of all files that are in Documentation/ _now_, not what was historically ever there, you would expand the pathspec at the current tree, and feed the result to rev-list). Signed-off-by: Jeff King <[email protected]>

As we find out the "last touched" for each path, we cease to care about that path. However, we traditionally have not told the revision-walking machinery about this, for two reasons: 1. Munging the pathspecs mid-traversal is not an officially supported thing to be doing. It does seem to work, though, and with Duy's recent pathspec refactoring, it's fairly straightforward. 2. The pathspec machinery is slow. We know have to look at each pathspec to see if each tree entry is interesting. And since our cost to handle a diff_filepair is so cheap, it doesn't really help us much. It turns out that (2) is not quite right, though. For the diffs themselves, limiting the pathspec doesn't help at all. But it does mean that we can simplify history more aggressively, which may mean avoiding whole swaths of history that don't touch interesting paths. This happens especially near the end of the traversal, when we are only interested in a few paths, and there are many side branches that do not touch any of them. And as of the last commit, the pathspec machinery is no longer quadratic, so the cost to use it is easily worth paying, even for large trees. Here are runs of `git blame-tree --max-depth=0` for torvalds/linux (a complex history with a small root tree) and git/git (a flat hierarchy with a large root tree): For git/git: Benchmark 1: ./git.old blame-tree --max-depth=0 Time (mean ± σ): 2.223 s ± 0.360 s [User: 1.808 s, System: 0.369 s] Range (min … max): 1.866 s … 2.927 s 10 runs Benchmark 2: ./git.new blame-tree --max-depth=0 Time (mean ± σ): 945.0 ms ± 93.4 ms [User: 783.6 ms, System: 131.0 ms] Range (min … max): 846.4 ms … 1071.5 ms 10 runs Summary ./git.new blame-tree --max-depth=0 ran 2.35 ± 0.45 times faster than git.old blame-tree --max-depth=0 For torvalds/linux: Benchmark 1: ./git.old blame-tree --max-depth=0 Time (mean ± σ): 13.504 s ± 0.545 s [User: 12.605 s, System: 0.583 s] Range (min … max): 12.947 s … 14.488 s 10 runs Benchmark 2: ./git.new blame-tree --max-depth=0 Time (mean ± σ): 622.8 ms ± 39.2 ms [User: 522.1 ms, System: 95.7 ms] Range (min … max): 556.6 ms … 681.6 ms 10 runs Summary ./git.new blame-tree --max-depth=0 ran 21.68 ± 1.62 times faster than ./git.old blame-tree --max-depth=0 Note that the output may change slightly in some cases due to the history pruning. This should be acceptable, as either output is correct in these cases (e.g., scenarios like a cherry-picked commit on two parallel branches). Signed-off-by: Jeff King <[email protected]>

gitgitgadget · 2025-03-26T19:45:18Z

Welcome to GitGitGadget

Hi @To1ne, and welcome to GitGitGadget, the GitHub App to send patch series to the Git mailing list from GitHub Pull Requests.

Please make sure that either:

Your Pull Request has a good description, if it consists of multiple commits, as it will be used as cover letter.
Your Pull Request description is empty, if it consists of a single commit, as the commit message should be descriptive enough by itself.

You can CC potential reviewers by adding a footer to the PR description with the following syntax:

CC: Revi Ewer <[email protected]>, Ill Takalook <[email protected]>

NOTE: DO NOT copy/paste your CC list from a previous GGG PR's description,
because it will result in a malformed CC list on the mailing list. See
example.

Also, it is a good idea to review the commit messages one last time, as the Git project expects them in a quite specific form:

the lines should not exceed 76 columns,
the first line should be like a header and typically start with a prefix like "tests:" or "revisions:" to state which subsystem the change is about, and
the commit messages' body should be describing the "why?" of the change.
Finally, the commit messages should end in a Signed-off-by: line matching the commits' author.

It is in general a good idea to await the automated test ("Checks") in this Pull Request before contributing the patches, e.g. to avoid trivial issues such as unportable code.

Contributing the patches

Before you can contribute the patches, your GitHub username needs to be added to the list of permitted users. Any already-permitted user can do that, by adding a comment to your PR of the form /allow. A good way to find other contributors is to locate recent pull requests where someone has been /allowed:

Search: is:pr is:open "/allow"

Both the person who commented /allow and the PR author are able to /allow you.

An alternative is the channel #git-devel on the Libera Chat IRC network:

<newcontributor> I've just created my first PR, could someone please /allow me? https://github.com/gitgitgadget/git/pull/12345
<veteran> newcontributor: it is done
<newcontributor> thanks!

Once on the list of permitted usernames, you can contribute the patches to the Git mailing list by adding a PR comment /submit.

If you want to see what email(s) would be sent for a /submit request, add a PR comment /preview to have the email(s) sent to you. You must have a public GitHub email address for this. Note that any reviewers CC'd via the list in the PR description will not actually be sent emails.

After you submit, GitGitGadget will respond with another comment that contains the link to the cover letter mail in the Git mailing list archive. Please make sure to monitor the discussion in that thread and to address comments and suggestions (while the comments and suggestions will be mirrored into the PR by GitGitGadget, you will still want to reply via mail).

If you do not want to subscribe to the Git mailing list just to be able to respond to a mail, you can download the mbox from the Git mailing list archive (click the (raw) link), then import it into your mail program. If you use GMail, you can do this via:

curl -g --user "<EMailAddress>:<Password>" \
    --url "imaps://imap.gmail.com/INBOX" -T /path/to/raw.txt

To iterate on your change, i.e. send a revised patch or patch series, you will first want to (force-)push to the same branch. You probably also want to modify your Pull Request description (or title). It is a good idea to summarize the revision by adding something like this to the cover letter (read: by editing the first comment on the PR, i.e. the PR description):

Changes since v1:
- Fixed a typo in the commit message (found by ...)
- Added a code comment to ... as suggested by ...
...

To send a new iteration, just add another PR comment with the contents: /submit.

Need help?

New contributors who want advice are encouraged to join [email protected], where volunteers who regularly contribute to Git are willing to answer newbie questions, give advice, or otherwise provide mentoring to interested contributors. You must join in order to post or view messages, but anyone can join.

You may also be able to find help in real time in the developer IRC channel, #git-devel on Libera Chat. Remember that IRC does not support offline messaging, so if you send someone a private message and log out, they cannot respond to you. The scrollback of #git-devel is archived, though.

gitgitgadget · 2025-03-26T19:45:19Z

There are issues in commit 0f30e62:
Introduce git-blame-tree(1) command
Commit not signed off

dscho · 2025-03-26T20:33:08Z

/allow

gitgitgadget · 2025-03-26T20:33:56Z

User To1ne is now allowed to use GitGitGadget.

To1ne and others added 9 commits March 26, 2025 17:24

gitgitgadget bot added the new user label Mar 26, 2025

To1ne closed this Mar 26, 2025

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Introduce git-blame-tree(1) command #1894

Introduce git-blame-tree(1) command #1894

Uh oh!

To1ne commented Mar 26, 2025

Uh oh!

gitgitgadget bot commented Mar 26, 2025

Uh oh!

gitgitgadget bot commented Mar 26, 2025

Uh oh!

dscho commented Mar 26, 2025

Uh oh!

gitgitgadget bot commented Mar 26, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Introduce git-blame-tree(1) command #1894

Introduce git-blame-tree(1) command #1894

Uh oh!

Conversation

To1ne commented Mar 26, 2025

Uh oh!

gitgitgadget bot commented Mar 26, 2025

Welcome to GitGitGadget

Contributing the patches

Need help?

Uh oh!

gitgitgadget bot commented Mar 26, 2025

Uh oh!

dscho commented Mar 26, 2025

Uh oh!

gitgitgadget bot commented Mar 26, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants