Skip to content

Conversation

To1ne
Copy link

@To1ne To1ne commented Mar 26, 2025

This is yet another attempt to upstream the builtin command
git-blame-tree(1). This command is similar to git-blame(1) and shows
the most recent modification to paths in a tree..

The last attempt (I'm aware of) was made by Ævar in 20231. That
series was based of patches by Peff written in 20112.

In contrary to Ævar's attempt, my series implements the
git-blame-tree(1) standalone command. Ævar suggested to only implement
the blame-tree subsystem, and test it through test-tool. But since the
command line options of that test-tool subcommand were very similar to
the standalone command we want to land, I decided transfer the test-tool
subcommand into a real Git builtin. Also, nowadays if we want to test
the blame-tree subsystem independently, we should use clar unit tests
instead of test-tool.

This series brings back the --max-depth changes. I know Ævar removed
them because there was controversy, but I think they are relevant to
include. And I want to open up the conversation again.

Also extra patches from Peff are included to implement pathspec tries.
I've taken them from Peff's fork on GitHub3, but they also have been
posted to this mailing list4.

Other improvements I've made:

  • Rebase onto current master.

  • Remove the use of USE_THE_REPOSITORY_VARIABLE.

  • Fix various memory leaks.

  • Moved code from blame-tree.c to pathspec.c, as suggested in Peff's
    last patch on GitHub.

  • Updated the benchmark results in the last commit message, although the
    numbers were roughly the same.

I don't expect this code to be deemed ready to merge, but I mainly want
to gather as much feedback as possible and use it to iterate toward
actually getting it merged.

There are some things I know that are missing, for example
--porcelain, but those can be added in a follow-up. At the moment
there's one test marked test_expect_failure, but I was not able yet
identify that bug.

So, let me know what you think?

--
Toon

To1ne and others added 9 commits March 26, 2025 17:24
This is yet another attempt to upstream the builtin command
`git-blame-tree(1)`. This command is similar to git-blame(1) and shows
the most recent modification to paths in a tree..

The last attempt (I'm aware of) was made by Ævar in 2023[1]. That
series was based of patches by Peff written in 2011[2].

In contrary to Ævar's attempt, my series implements the
`git-blame-tree(1)` standalone command. Ævar suggested to only implement
the blame-tree subsystem, and test it through test-tool. But since the
command line options of that test-tool subcommand were very similar to
the standalone command we want to land, I decided transfer the test-tool
subcommand into a real Git builtin. Also, nowadays if we want to test
the blame-tree subsystem independently, we should use clar unit tests
instead of test-tool.

This series brings back the --max-depth changes. I know Ævar removed
them because there was controversy, but I think they are relevant to
include. And I want to open up the conversation again.

Also extra patches from Peff are included to implement pathspec tries.
I've taken them from Peff's fork on GitHub[3], but they also have been
posted to this mailing list[4].

Other improvements I've made:

* Rebase onto current master.

* Remove the use of USE_THE_REPOSITORY_VARIABLE.

* Fix various memory leaks.

* Moved code from blame-tree.c to pathspec.c, as suggested in Peff's
  last patch on GitHub.

* Updated the benchmark results in the last commit message, although the
  numbers were roughly the same.

I don't expect this code to be deemed ready to merge, but I mainly want
to gather as much feedback as possible and use it to iterate toward
actually getting it merged.

There are some things I know that are missing, for example
`--porcelain`, but those can be added in a follow-up. At the moment
there's one test marked `test_expect_failure`, but I was not able yet
identify that bug.

So, let me know what you think?

--
Toon

[1]: https://lore.kernel.org/git/[email protected]/
[2]: https://lore.kernel.org/git/[email protected]/
[3]: https://github.com/peff/git/tree/jk/faster-blame-tree-wip
[4]: https://lore.kernel.org/git/[email protected]/

Cc: Jeff King <[email protected]>
Cc: Patrick Steinhardt <[email protected]>

--- b4-submit-tracking ---
# This section is used internally by b4 prep for tracking purposes.
{
  "series": {
    "revision": 1,
    "change-id": "20250326-toon-blame-tree-2ca74d7712ac",
    "prefixes": []
  }
}
In commit 25e5e2b (combine-diff: support format_callback,
2011-08-19), the combined-diff code learned how to make a multi-sourced
diff_filepair to pass to a diff callback. When we create each filepair,
we do not bother to fill in many of the fields, because they would make
no sense (e.g., there can be no rename score or broken_pair flag because
we do not go through the diffcore filters). However, we did not even
bother to zero them, leading to random values. Let's make sure
everything is blank with xcalloc, just as the regular diff code does.

We would potentially want to set the "status" flag to
something non-zero, but it is not clear to what. Possibly a
new DIFF_STATUS_COMBINED would make sense, as this is not
strictly a modification, nor does it fit any other category.

Since it is not yet clear what callers would want, this
patch simply leaves it as "0", the same empty flag that is
seen when diffcore_std is not used at all.

Signed-off-by: Jeff King <[email protected]>
The within_depth function is used to check whether pathspecs
limited by a max-depth parameter are acceptable. It takes a
path to check, a maximum depth, and a "base" depth. It
counts the components in the path (by counting slashes),
adds them to the base, and compare them to the maximum.

However, if the base does not have any slashes at all, we
always return "true". If the base depth is 0, then this is
correct; no matter what the maximum is, we are always within
it. However, if the base depth is greater than 0, then we
might return an erroneous result.

This ends up not causing any user-visible bugs in the
current code. The call sites in dir.c always pass a base
depth of 0, so are unaffected. But tree_entry_interesting
uses this function differently: it will pass the prefix of
the current entry, along with a "1" if the entry is a
directory, in essence checking whether items inside the
entry would be of interest. It turns out not to make a
difference in behavior, but the reasoning is complex.

Given a tree like:

  file
  a/file
  a/b/file

walking the tree and calling tree_entry_interesting will
yield the following results:

  (with max_depth=0):
      file: yes
         a: yes
    a/file: no
       a/b: no

  (with max_depth=1):
      file: yes
         a: yes
    a/file: yes
       a/b: no

So we have inconsistent behavior in considering directories
interesting. If they are at the edge of our depth but at the
root, we will recurse into them, but then find all of their
entries uninteresting (e.g., in the first case, we will look
at "a" but find "a/*" uninteresting). But if they are at the
edge of our depth and not at the root, then we will not
recurse (in the second example, we do not even bother
entering "a/b").

This turns out not to matter because the only caller which
uses max-depth pathspecs is cmd_grep, which only cares about
blob entries. From its perspective, it is exactly the same
to not recurse into a subtree, or to recurse and find that
it contains no matching entries. Not recursing is merely an
optimization.

It is debatable whether tree_entry_interesting should
consider such an entry interesting. The only caller does not
care if it sees the tree itself, and can benefit from the
optimization. But if we add a "max-depth" limiter to regular
diffs, then a diff with DIFF_OPT_TREE_IN_RECURSIVE would
probably want to show the tree itself, but not what it
contains.

This patch just fixes within_depth, which means we consider
such entries uninteresting (and makes the current caller
happy).  If we want to change that in the future, then this
fix is still the correct first step, as the current behavior is
simply inconsistent.

Signed-off-by: Jeff King <[email protected]>
When you are doing a tree-diff, there are basically two
options: do not recurse into subtrees at all, or recurse
indefinitely.  While most callers would want to always
recurse and see full pathnames, some may want the efficiency
of looking only at a particular level of the tree. This is
currently easy to do for the top-level (just turn off
recursion), but you cannot say "show me what changed in
subdir/, but do not recurse".

This patch adds a max-depth parameter which is measured from
the closest pathspec match, so that you can do:

  git log --raw --max-depth=1 a/b/c

and see the contents of a/b/c/, but not those of a/b/c/d/
(you would see the --raw entry for a/b/c/d in such a case).

Signed-off-by: Jeff King <[email protected]>
Similar to git-blame(1), introduce a new subcommand git-blame-tree(1).
This command shows the most recent modification to paths in a tree. It
does so by expanding the tree at a given commit, taking note of the
current state of each path, and then walking backwards through history
looking for commits where each path changed into its final commit ID.

Based-on-a-patch-by: Jeff King <[email protected]>
Improved-by: "Ævar Arnfjörð Bjarmason" <[email protected]>
Signed-off-by: Toon Claes <[email protected]>
When checking if a path matches our pathspec list, in the
worst case we walk the whole list linearly to check each
pathspec.  As a result, an operation like a tree diff does
`O(paths * pathspecs)` comparisons. If we assume that the
number of pathspecs may approach the number of paths, this
is `O(n^2)` in the number of paths.  We have to do it this
way in the general case because pathspecs may be arbitrary
fnmatch expressions, and we cannot create any meaningful
index.

However, it is common that all of the pathspecs are vanilla
prefixes (e.g., "git.c" or "Documentation/") that do not use
any wildcards or magic features. In such a case, we can
index the pathspec list to give us faster lookups.

The simplest solution would be to keep the pathspec list
sorted, and use a binary search over the entries for each
path. This turns out to be rather tricky, though, as we are
not looking for an exact match in the list. We must also
find prefix matches, and take care to handle trailing
slashes for directories.

Instead, this patch introduces a pathspec_trie struct, which
represents the pathspecs as a graph of entries. For
example, the pathspec list ["foo/bar", "foo/baz/"] would be
represented as:

        (bar, terminal)
       /
  (foo)
       \
        (baz, terminal, must_be_dir)

To see if a path "foo/eek" is covered by the pathspec, we
walk the trie, matching "foo" but ultimately seeing that
"eek" is not mentioned.

This provides us with two optimizations:

  1. The children of each trie node are simple strings
     representing directory components. This means we can sort
     and binary-search them, giving us logarithmic lookups
     (but note that rather than one log lookup on the whole
     pathname, it is a sequence of smaller log lookups on
     components of the pathname).

  2. Many users of the pathspec code do not feed full
     pathnames, but instead are walking a tree hierarchy
     themselves, and limiting the walk as they go. They can
     walk the trie at the same time, meanining they avoid
     looking repeatedly at parts of the pathspec we already
     know are matched. The current code, by contrast, will
     repeatedly match "foo/" against each pathspec when
     looking at "foo/bar", "foo/baz", etc.

Note that this patch just introduces the data structure; the
code is not used anywhere yet.

Signed-off-by: Jeff King <[email protected]>
An earlier commit introduced pathspec_tries, but we did not
actually generate them by default. This patch causes us to
do so when it is possible (i.e., when no wildcards or other
pathspec magic are in use). This doesn't actually do
anything yet, though, as none of the pathspec users have
learned to make use of the tries.

We embed the pathspec_trie directly inside the "struct
pathspec". This is not strictly necessary, as once created,
the trie does not depend on the original pathspec. However,
since the intended use is to optimize existing pathspec
callers, passing the trie around as part of the pathspec
will minimize disruption to the call chain.

Signed-off-by: Jeff King <[email protected]>
The tree-diff currently matches each pathspec against every
entry of the tree. For the common case of a handful of
pathspecs, this is not a big deal. However, if you have a
large number of pathspecs, it gets noticeably slow.

Now that we have the pathspec_trie optimization, we can do
much better (at least for simple cases without wildcards).
Here are numbers for running "git rev-list" with limiting
pathspecs of varying sizes, both before and after this
patch:

Test                 origin              HEAD
---------------------------------------------------------------
4001.2: size=1       0.12(0.11+0.00)     0.12(0.12+0.00) +0.0%
4001.3: size=2       0.17(0.16+0.00)     0.16(0.15+0.01) -5.9%
4001.4: size=4       0.17(0.17+0.00)     0.17(0.17+0.00) +0.0%
4001.5: size=8       0.21(0.20+0.00)     0.20(0.20+0.00) -4.8%
4001.6: size=16      0.25(0.24+0.00)     0.21(0.20+0.00) -16.0%
4001.7: size=32      0.31(0.31+0.00)     0.21(0.20+0.00) -32.3%
4001.8: size=64      0.43(0.41+0.01)     0.21(0.21+0.00) -51.2%
4001.9: size=128     0.73(0.72+0.00)     0.22(0.21+0.00) -69.9%
4001.10: size=256    2.02(2.02+0.00)     0.37(0.36+0.00) -81.7%
4001.11: size=512    6.78(6.78+0.00)     0.64(0.64+0.00) -90.6%
4001.12: size=1024   23.67(23.67+0.02)   1.22(1.20+0.01) -94.8%

For small pathspecs, we produce no real difference (which is
good; we know we are asymptotically better, but we have not
regressed our constant factor). Between 16 and 32 pathspecs we
start to see some small improvement, and the benefit keeps
growing with the number of pathspecs.

Obviously these large-pathspec cases are unusual. But you
might use them, for example, if the pathspecs were generated
programatically (e.g., if you want the history of all files
that are in Documentation/ _now_, not what was historically
ever there, you would expand the pathspec at the current
tree, and feed the result to rev-list).

Signed-off-by: Jeff King <[email protected]>
As we find out the "last touched" for each path, we cease to
care about that path. However, we traditionally have not
told the revision-walking machinery about this, for two
reasons:

  1. Munging the pathspecs mid-traversal is not an
     officially supported thing to be doing. It does seem to
     work, though, and with Duy's recent pathspec
     refactoring, it's fairly straightforward.

  2. The pathspec machinery is slow. We know have to look at
     each pathspec to see if each tree entry is interesting.
     And since our cost to handle a diff_filepair is so
     cheap, it doesn't really help us much.

It turns out that (2) is not quite right, though. For the
diffs themselves, limiting the pathspec doesn't help at all.
But it does mean that we can simplify history more
aggressively, which may mean avoiding whole swaths of
history that don't touch interesting paths. This happens
especially near the end of the traversal, when we are only
interested in a few paths, and there are many side branches
that do not touch any of them.

And as of the last commit, the pathspec machinery is no
longer quadratic, so the cost to use it is easily worth
paying, even for large trees.

Here are runs of `git blame-tree --max-depth=0` for
torvalds/linux (a complex history with a small root tree)
and git/git (a flat hierarchy with a large root tree):

For git/git:

    Benchmark 1: ./git.old blame-tree --max-depth=0
      Time (mean ± σ):      2.223 s ±  0.360 s    [User: 1.808 s, System: 0.369 s]
      Range (min … max):    1.866 s …  2.927 s    10 runs

    Benchmark 2: ./git.new blame-tree --max-depth=0
      Time (mean ± σ):     945.0 ms ±  93.4 ms    [User: 783.6 ms, System: 131.0 ms]
      Range (min … max):   846.4 ms … 1071.5 ms    10 runs

    Summary
      ./git.new blame-tree --max-depth=0 ran
        2.35 ± 0.45 times faster than git.old blame-tree --max-depth=0

For torvalds/linux:

    Benchmark 1: ./git.old blame-tree --max-depth=0
      Time (mean ± σ):     13.504 s ±  0.545 s    [User: 12.605 s, System: 0.583 s]
      Range (min … max):   12.947 s … 14.488 s    10 runs

    Benchmark 2: ./git.new blame-tree --max-depth=0
      Time (mean ± σ):     622.8 ms ±  39.2 ms    [User: 522.1 ms, System: 95.7 ms]
      Range (min … max):   556.6 ms … 681.6 ms    10 runs

    Summary
      ./git.new blame-tree --max-depth=0 ran
       21.68 ± 1.62 times faster than ./git.old blame-tree --max-depth=0

Note that the output may change slightly in some cases due
to the history pruning. This should be acceptable, as either
output is correct in these cases (e.g., scenarios like a
cherry-picked commit on two parallel branches).

Signed-off-by: Jeff King <[email protected]>
Copy link

gitgitgadget bot commented Mar 26, 2025

Welcome to GitGitGadget

Hi @To1ne, and welcome to GitGitGadget, the GitHub App to send patch series to the Git mailing list from GitHub Pull Requests.

Please make sure that either:

  • Your Pull Request has a good description, if it consists of multiple commits, as it will be used as cover letter.
  • Your Pull Request description is empty, if it consists of a single commit, as the commit message should be descriptive enough by itself.

You can CC potential reviewers by adding a footer to the PR description with the following syntax:

CC: Revi Ewer <[email protected]>, Ill Takalook <[email protected]>

NOTE: DO NOT copy/paste your CC list from a previous GGG PR's description,
because it will result in a malformed CC list on the mailing list. See
example.

Also, it is a good idea to review the commit messages one last time, as the Git project expects them in a quite specific form:

  • the lines should not exceed 76 columns,
  • the first line should be like a header and typically start with a prefix like "tests:" or "revisions:" to state which subsystem the change is about, and
  • the commit messages' body should be describing the "why?" of the change.
  • Finally, the commit messages should end in a Signed-off-by: line matching the commits' author.

It is in general a good idea to await the automated test ("Checks") in this Pull Request before contributing the patches, e.g. to avoid trivial issues such as unportable code.

Contributing the patches

Before you can contribute the patches, your GitHub username needs to be added to the list of permitted users. Any already-permitted user can do that, by adding a comment to your PR of the form /allow. A good way to find other contributors is to locate recent pull requests where someone has been /allowed:

Both the person who commented /allow and the PR author are able to /allow you.

An alternative is the channel #git-devel on the Libera Chat IRC network:

<newcontributor> I've just created my first PR, could someone please /allow me? https://github.com/gitgitgadget/git/pull/12345
<veteran> newcontributor: it is done
<newcontributor> thanks!

Once on the list of permitted usernames, you can contribute the patches to the Git mailing list by adding a PR comment /submit.

If you want to see what email(s) would be sent for a /submit request, add a PR comment /preview to have the email(s) sent to you. You must have a public GitHub email address for this. Note that any reviewers CC'd via the list in the PR description will not actually be sent emails.

After you submit, GitGitGadget will respond with another comment that contains the link to the cover letter mail in the Git mailing list archive. Please make sure to monitor the discussion in that thread and to address comments and suggestions (while the comments and suggestions will be mirrored into the PR by GitGitGadget, you will still want to reply via mail).

If you do not want to subscribe to the Git mailing list just to be able to respond to a mail, you can download the mbox from the Git mailing list archive (click the (raw) link), then import it into your mail program. If you use GMail, you can do this via:

curl -g --user "<EMailAddress>:<Password>" \
    --url "imaps://imap.gmail.com/INBOX" -T /path/to/raw.txt

To iterate on your change, i.e. send a revised patch or patch series, you will first want to (force-)push to the same branch. You probably also want to modify your Pull Request description (or title). It is a good idea to summarize the revision by adding something like this to the cover letter (read: by editing the first comment on the PR, i.e. the PR description):

Changes since v1:
- Fixed a typo in the commit message (found by ...)
- Added a code comment to ... as suggested by ...
...

To send a new iteration, just add another PR comment with the contents: /submit.

Need help?

New contributors who want advice are encouraged to join [email protected], where volunteers who regularly contribute to Git are willing to answer newbie questions, give advice, or otherwise provide mentoring to interested contributors. You must join in order to post or view messages, but anyone can join.

You may also be able to find help in real time in the developer IRC channel, #git-devel on Libera Chat. Remember that IRC does not support offline messaging, so if you send someone a private message and log out, they cannot respond to you. The scrollback of #git-devel is archived, though.

Copy link

gitgitgadget bot commented Mar 26, 2025

There are issues in commit 0f30e62:
Introduce git-blame-tree(1) command
Commit not signed off

@To1ne To1ne closed this Mar 26, 2025
@dscho
Copy link
Member

dscho commented Mar 26, 2025

/allow

Copy link

gitgitgadget bot commented Mar 26, 2025

User To1ne is now allowed to use GitGitGadget.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants