- 
                Notifications
    You must be signed in to change notification settings 
- Fork 2.7k
[WORK-IN-PROGRESS] Introduce the path walk API into Git for Windows #5146
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
In anticipation of a few planned applications, introduce the most basic form of a path-walk API. It currently assumes that there are no UNINTERESTING objects, and does not include any complicated filters. It calls a function pointer on groups of tree and blob objects as grouped by path. This only includes objects the first time they are discovered, so an object that appears at multiple paths will not be included in two batches. There are many future adaptations that could be made, but they are left for future updates when consumers are ready to take advantage of those features. RFC TODO: It would be helpful to create a test-tool that allows printing of each batch for strong testing. Signed-off-by: Derrick Stolee <[email protected]>
In anticipation of implementing 'git backfill', populate the necessary files with the boilerplate of a new builtin. RFC TODO: When preparing this for a full implementation, make sure it is based on the newest standards introduced by [1]. [1] https://lore.kernel.org/git/[email protected]/T/#m606036ea2e75a6d6819d6b5c90e729643b0ff7f7 [PATCH 1/3] builtin: add a repository parameter for builtin functions Signed-off-by: Derrick Stolee <[email protected]>
The default behavior of 'git backfill' is to fetch all missing blobs that are reachable from HEAD. Document and test this behavior. The implementation is a very simple use of the path-walk API, initializing the revision walk at HEAD to start the path-walk from all commits reachable from HEAD. Ignore the object arrays that correspond to tree entries, assuming that they are all present already. Signed-off-by: Derrick Stolee <[email protected]>
Users may want to specify a minimum batch size for their needs. This is only a minimum: the path-walk API provides a list of OIDs that correspond to the same path, and thus it is optimal to allow delta compression across those objects in a single server request. We could consider limiting the request to have a maximum batch size in the future. Signed-off-by: Derrick Stolee <[email protected]>
One way to significantly reduce the cost of a Git clone and later fetches is to use a blobless partial clone and combine that with a sparse-checkout that reduces the paths that need to be populated in the working directory. Not only does this reduce the cost of clones and fetches, the sparse-checkout reduces the number of objects needed to download from a promisor remote. However, history investigations can be expensie as computing blob diffs will trigger promisor remote requests for one object at a time. This can be avoided by downloading the blobs needed for the given sparse-checkout using 'git backfill' and its new '--sparse' mode, at a time that the user is willing to pay that extra cost. Note that this is distinctly different from the '--filter=sparse:<oid>' option, as this assumes that the partial clone has all reachable trees and we are using client-side logic to avoid downloading blobs outside of the sparse-checkout cone. This avoids the server-side cost of walking trees while also achieving a similar goal. It also downloads in batches based on similar path names, presenting a resumable download if things are interrupted. This augments the path-walk API to have a possibly-NULL 'pl' member that may point to a 'struct pattern_list'. This could be more general than the sparse-checkout definition at HEAD, but 'git backfill --sparse' is currently the only consumer. Be sure to test this in both cone mode and not cone mode. Cone mode has the benefit that the path-walk can skip certain paths once they would expand beyond the sparse-checkout. Signed-off-by: Derrick Stolee <[email protected]>
This adds the ability to ask for the commits as a single list. This will also reduce the calls in 'git backfill' to be a BUG() statement if called with anything other than blobs. Signed-off-by: Derrick Stolee <[email protected]>
The previous change introduced the '--[no-]sparse' option for the 'git backfill' command, but did not assume it as enabled by default. However, this is likely the behavior that users will most often want to happen. Without this default, users with a small sparse-checkout may be confused when 'git backfill' downloads every version of every object in the full history. However, this is left as a separate change so this decision can be reviewed independently of the value of the '--[no-]sparse' option. Add a test of adding the '--sparse' option to a repo without sparse-checkout to make it clear that supplying it without a sparse-checkout is an error. Signed-off-by: Derrick Stolee <[email protected]>
In anticipation of using the path-walk API to analyze tags or include them in a pack-file, add the ability to walk the tags that were included in the revision walk. Signed-off-by: Derrick Stolee <[email protected]>
Start work on a new `git survey` command to scan the repository for monorepo performance and scaling problems. The goal is to measure the various known "dimensions of scale" and serve as a foundation for adding additional measurements as we learn more about Git monorepo scaling problems. The initial goal is to complement the scanning and analysis performed by the GO-based `git-sizer` (https://github.com/github/git-sizer) tool. It is hoped that by creating a builtin command, we may be able to take advantage of internal Git data structures and code that is not accessible from GO to gain further insight into potential scaling problems. RFC TODO: Adapt this boilerplat to match the upcoming changes to builtin methods that include a 'struct repository' pointer. Co-authored-by: Derrick Stolee <[email protected]> Signed-off-by: Jeff Hostetler <[email protected]> Signed-off-by: Derrick Stolee <[email protected]>
By default we will scan all references in "refs/heads/", "refs/tags/" and "refs/remotes/". Add command line opts let the use ask for all refs or a subset of them and to include a detached HEAD. Signed-off-by: Jeff Hostetler <[email protected]> Signed-off-by: Derrick Stolee <[email protected]>
Collect the set of requested branches, tags, and etc into a ref_array and collect the set of requested patterns into a strvec. RFC TODO: This patch has some changes that should be in the previous patch, to make the diff look a lot better. Co-authored-by: Derrick Stolee <[email protected]> Signed-off-by: Jeff Hostetler <[email protected]> Signed-off-by: Derrick Stolee <[email protected]>
When 'git survey' provides information to the user, this will be presented in one of two formats: plaintext and JSON. The JSON implementation will be delayed until the functionality is complete for the plaintext format. The most important parts of the plaintext format are headers specifying the different sections of the report and tables providing concreted data. Create a custom table data structure that allows specifying a list of strings for the row values. When printing the table, check each column for the maximum width so we can create a table of the correct size from the start. The table structure is designed to be flexible to the different kinds of output that will be implemented in future changes. Signed-off-by: Derrick Stolee <[email protected]>
At the moment, nothing is obvious about the reason for the use of the
path-walk API, but this will become more prevelant in future iterations. For
now, use the path-walk API to sum up the counts of each kind of object.
For example, this is the reachable object summary output for my local repo:
REACHABLE OBJECT SUMMARY
========================
Object Type |  Count
------------+-------
       Tags |      0
    Commits | 178573
      Trees | 312745
      Blobs | 183035
(Note: the "Tags" are zero right now because the path-walk API has not been
integrated to walk tags yet. This will be fixed in a later change.)
RFC TODO: make sure tags are walked before this change.
Signed-off-by: Derrick Stolee <[email protected]>
    The sparse tree walk algorithm was created in d5d2e93 (revision: implement sparse algorithm, 2019-01-16) and involves using the mark_trees_uninteresting_sparse() method. This method takes a repository and an oidset of tree IDs, some of which have the UNINTERESTING flag and some of which do not. Create a method that has an equivalent set of preconditions but uses a "dense" walk (recursively visits all reachable trees, as long as they have not previously been marked UNINTERESTING). This is an important difference from mark_tree_uninteresting(), which short-circuits if the given tree has the UNINTERESTING flag. A use of this method will be added in a later change, with a condition set whether the sparse or dense approach should be used. Signed-off-by: Derrick Stolee <[email protected]>
Now that we have explored objects by count, we can expand that a bit more to summarize the data for the on-disk and inflated size of those objects. This information is helpful for diagnosing both why disk space (and perhaps clone or fetch times) is growing but also why certain operations are slow because the inflated size of the abstract objects that must be processed is so large. Signed-off-by: Derrick Stolee <[email protected]>
This option causes the path-walk API to act like the sparse tree-walk algorithm implemented by mark_trees_uninteresting_sparse() in list-objects.c. Starting from the commits marked as UNINTERESTING, their root trees and all objects reachable from those trees are UNINTERSTING, at least as we walk path-by-path. When we reach a path where all objects associated with that path are marked UNINTERESTING, then do no continue walking the children of that path. We need to be careful to pass the UNINTERESTING flag in a deep way on the UNINTERESTING objects before we start the path-walk, or else the depth-first search for the path-walk API may accidentally report some objects as interesting. Signed-off-by: Derrick Stolee <[email protected]>
Signed-off-by: Derrick Stolee <[email protected]>
In order to more easily compute delta bases among objects that appear at the exact same path, add a --path-walk option to 'git pack-objects'. This option will use the path-walk API instead of the object walk given by the revision machinery. Since objects will be provided in batches representing a common path, those objects can be tested for delta bases immediately instead of waiting for a sort of the full object list by name-hash. This has multiple benefits, including avoiding collisions by name-hash. The objects marked as UNINTERESTING are included in these batches, so we are guaranteeing some locality to find good delta bases. After the individual passes are done on a per-path basis, the default name-hash is used to find other opportunistic delta bases that did not match exactly by the full path name. RFC TODO: It is important to note that this option is inherently incompatible with using a bitmap index. This walk probably also does not work with other advanced features, such as delta islands. Getting ahead of myself, this option compares well with --full-name-hash when the packfile is large enough, but also performs at least as well as the default in all cases that I've seen. RFC TODO: this should probably be recording the batch locations to another list so they could be processed in a second phase using threads. RFC TODO: list some examples of how this outperforms previous pack-objects strategies. (This is coming in later commits that include performance test changes.) Signed-off-by: Derrick Stolee <[email protected]>
In future changes, we will make use of these methods. The intention is to keep track of the top contributors according to some metric. We don't want to store all of the entries and do a sort at the end, so track a constant-size table and remove rows that get pushed out depending on the chosen sorting algorithm. Co-authored-by: Jeff Hostetler <[email protected]> Signed-off-by; Jeff Hostetler <[email protected]> Signed-off-by: Derrick Stolee <[email protected]>
Signed-off-by: Derrick Stolee <[email protected]>
Since we are already walking our reachable objects using the path-walk API, let's now collect lists of the paths that contribute most to different metrics. Specifically, we care about * Number of versions. * Total size on disk. * Total inflated size (no delta or zlib compression). This information can be critical to discovering which parts of the repository are causing the most growth, especially on-disk size. Different packing strategies might help compress data more efficiently, but the toal inflated size is a representation of the raw size of all snapshots of those paths. Even when stored efficiently on disk, that size represents how much information must be processed to complete a command such as 'git blame'. Since the on-disk size is likely to be fragile, stop testing the exact output of 'git survey' and check that the correct set of headers is output. Signed-off-by: Derrick Stolee <[email protected]>
There are many tests that validate whether 'git pack-objects' works as expected. Instead of duplicating these tests, add a new test environment variable, GIT_TEST_PACK_PATH_WALK, that implies --path-walk by default when specified. This was useful in testing the implementation of the --path-walk implementation, especially in conjunction with test such as: - t5322-pack-objects-sparse.sh : This demonstrates the effectiveness of the --sparse option and how it combines with --path-walk. RFC TODO: list other helpful test cases, as well as the ones where the behavior breaks if this is enabled... Signed-off-by: Derrick Stolee <[email protected]>
To test the benefits of the new --path-walk option in 'git pack-objects', create a performance test that times the process but also compares the size of the output. Against the microsoft/fluentui repo [1] against a particular commit [2], this has reproducible results of a similar scale: Test this tree --------------------------------------------------------------- 5313.2: thin pack 0.39(0.48+0.03) 5313.3: thin pack size 1.2M 5313.4: thin pack with --path-walk 0.09(0.07+0.01) 5313.5: thin pack size with --path-walk 20.8K 5313.6: big recent pack 2.13(8.29+0.26) 5313.7: big recent pack size 17.7M 5313.8: big recent pack with --path-walk 3.18(4.21+0.22) 5313.9: big recent pack size with --path-walk 15.0M [1] https://github.com/microsoft/reactui [2] e70848ebac1cd720875bccaa3026f4a9ed700e08 RFC TODO: Note that the path-walk version is slower for the big case, but the delta calculation is single-threaded with the current implementation! It's still faster for the small case that mimics a typical push. Signed-off-by: Derrick Stolee <[email protected]>
Since 'git pack-objects' supports a --path-walk option, allow passing it through in 'git repack'. This presents interesting testing opportunities for comparing the different repacking strategies against each other. For the microsoft/fluentui repo [1], the results are very interesting: Test this tree ------------------------------------------------------------------- 5313.10: full repack 97.91(663.47+2.83) 5313.11: full repack size 449.1K 5313.12: full repack with --path-walk 105.42(120.49+0.95) 5313.13: full repack size with --path-walk 159.1K [1] https://github.com/microsoft/fluentui This repo suffers from having a lot of paths that collide in the name hash, so examining them in groups by path leads to better deltas. Also, in this case, the single-threaded implementation is competitive with the full repack. This is saving time diffing files that have significant differences from each other. Signed-off-by: Derrick Stolee <[email protected]>
Users may want to enable the --path-walk option for 'git pack-objects' by default, especially underneath commands like 'git push' or 'git repack'. This should be limited to client repositories, since the --path-walk option disables bitmap walks, so would be bad to include in Git servers when serving fetches and clones. There is potential that it may be helpful to consider when repacking the repository, to take advantage of improved deltas across historical versions of the same files. Much like how "pack.useSparse" was introduced and included in "feature.experimental" before being enabled by default, use the repository settings infrastructure to make the new "pack.usePathWalk" config enabled by "feature.experimental" and "feature.manyFiles". Signed-off-by: Derrick Stolee <[email protected]>
RFC NOTE: this is essentially the same as the patch introduced independently of the RFC, but now is on top of the --path-walk option instead. This is included in the RFC for comparison purposes. RFC NOTE: As you can see from the details below, the --full-name-hash option essentially attempts to do similar things as the --path-walk option, but sometimes misses the mark. Collisions still happen with the --full-name-hash option, leading to some misses. However, in cases where the default name-hash algorithm has low collision rates and deltas are actually desired across objects with similar names but different full names, the --path-walk option can still take advantage of the default name hash approach. Here are the new performance details simulating a single push in an internal monorepo using a lot of paths that collide in the default name hash. We can see that --full-name-hash gets close to the --path-walk option's size. Test this tree -------------------------------------------------------------- 5313.2: thin pack 2.43(2.92+0.14) 5313.3: thin pack size 4.5M 5313.4: thin pack with --full-name-hash 0.31(0.49+0.12) 5313.5: thin pack size with --full-name-hash 15.5K 5313.6: thin pack with --path-walk 0.35(0.31+0.04) 5313.7: thin pack size with --path-walk 14.2K However, when simulating pushes on repositories that do not have issues with name-hash collisions, the --full-name-hash option presents a potential of worse delta calculations, such as this example using my local Git repository: Test this tree -------------------------------------------------------------- 5313.2: thin pack 0.03(0.01+0.01) 5313.3: thin pack size 475 5313.4: thin pack with --full-name-hash 0.02(0.01+0.01) 5313.5: thin pack size with --full-name-hash 14.8K 5313.6: thin pack with --path-walk 0.02(0.01+0.01) 5313.7: thin pack size with --path-walk 475 Note that the path-walk option found the same delta bases as the default options in this case. In the full repack case, the --full-name-hash option may be preferable because it interacts well with other advanced features, such as using bitmap indexes and tracking delta islands. Signed-off-by: Derrick Stolee <[email protected]>
Using this tool, we can count how many distinct name-hash values exist
within a list of paths. Examples include
 git ls-tree -r --name-only HEAD | \
	     test-tool name-hash | \
  	      awk "{print \$1;}" | \
  		 sort -ns | uniq | wc -l
which outputs the number of distinct name-hash values that appear at
HEAD. Or, the following which presents the resulting name-hash values of
maximum multiplicity:
 git ls-tree -r --name-only HEAD | \
	     test-tool name-hash | \
	      awk "{print \$1;}" | \
	       sort -n | uniq -c | sort -nr | head -n 25
For an internal monorepo with around a quarter million paths at HEAD,
the highest multiplicity for the standard name-hash function was 14,424
while the full name-hash algorithm had only seven hash values with any
collision, with a maximum multiplicity of two.
Signed-off-by: Derrick Stolee <[email protected]>
    This test helps inform someone as to the behavior of the name-hash algorithms for their repo based on the paths at HEAD. For example, the microsoft/fluentui repo had these statistics at time of committing: Test this tree ----------------------------------------------------------------- 5314.1: paths at head 19.6K 5314.2: number of distinct name-hashes 8.2K 5314.3: number of distinct full-name-hashes 19.6K 5314.4: maximum multiplicity of name-hashes 279 5314.5: maximum multiplicity of fullname-hashes 1 That demonstrates that of the nearly twenty thousand path names, they are assigned around eight thousand distinct values. 279 paths are assigned to a single value, leading the packing algorithm to sort objects from those paths together, by size. In this repository, no collisions occur for the full-name-hash algorithm. In a more extreme example, an internal monorepo had a much worse collision rate: Test this tree ----------------------------------------------------------------- 5314.1: paths at head 221.6K 5314.2: number of distinct name-hashes 72.0K 5314.3: number of distinct full-name-hashes 221.6K 5314.4: maximum multiplicity of name-hashes 14.4K 5314.5: maximum multiplicity of fullname-hashes 2 Even in this repository with many more paths at HEAD, the collision rate was low and the maximum number of paths being grouped into a single bucket by the full-path-name algorithm was two. Signed-off-by: Derrick Stolee <[email protected]>
Repositories registered with Scalar are expected to be client-only repositories that are rather large. This means that they are more likely to be good candidates for using the --path-walk option when running 'git pack-objects', especially under the hood of 'git push'. Enable this config in Scalar repositories. Signed-off-by: Derrick Stolee <[email protected]>
In order to debug what is going on during delta calculations, add a --debug-file=<file> option to 'git pack-objects'. This leads to sending a JSON-formatted description of the delta information to that file. Signed-off-by: Derrick Stolee <[email protected]>
These patches introduce the 'git backfill' builtin including its '--batch-size' and '--sparse' options. While this is the first and simplest application, it is also the lowest priority in terms of user need.
These patches reimplement a subset of the functionality of 'git survey' as well as generalize some of the data structures that Jeff's implementation made. The flexibility hopefully comes through to show the potential for future extensions. These patches are quite rough and will need more attention before they can be sent for full review.
Here is where I think the meat of the RFC really lies. There are still some rough edges, but the data will show that 'git pack-objects --path-walk' has the potential to be an extremely effective way to pack objects. (Caveats will come later in the analysis section.)
This is a simplified version of the patch series that was split out by itself earlier for full review. This was split out on its own partly because it doesn't actually use the path-walk API. This has benefits and drawbacks, but it seems like a quick win for many scenarios.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I only got around to looking at the backfill part yet, will continue reviewing later.
| list->type = type; | ||
| strmap_put(&ctx->paths_to_lists, path.buf, list); | ||
| string_list_append(&ctx->path_stack, path.buf); | ||
| } | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We might want to add this, out of an abundance of caution:
else if (list->type != type)
    warning("both blob and tree objects found at '%s'", path.buf);
| for (size_t i = 0; i < list->oids.nr; i++) { | ||
| ret |= add_children(ctx, | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Continuing on errors may be the right thing to do always. On the off-chance that a user may want to return early on error, we may want to add a flag for that.
| } | ||
|  | ||
| oid_array_clear(&list->oids); | ||
| strmap_remove(&ctx->paths_to_lists, path, 1); | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder whether this is safe. If the same path appears multiple times on the stack, e.g. if two merged branches separately removed the same file, wouldn't we then remove the paths_to_lists item prematurely? This might be intentional, though, as we stop once we're hitting a blob that we'd already seen. Therefore in a history like this:
... - A - B - C
        \   /
          D
where file.txt was present in A, deleted in B and D, and missing from C, we would probably hit the file in B first, walk its history all the way until the commit that added it, then remove the item from paths_to_lists, then re-add it when encountering it in D, see that we handled it in A already, and re-remove it. That might be what we need here, but I have not been able to wrap my head around this enough yet to be confident.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is important to realize that a path appears at most once. The walk goes like this:
- Walk all commits that we are going to walk. Initialize the "stack" to include only the root path ("")
- Walk all root trees that we are going to walk, initializing the "stack" with the list of all paths that appear as entries of those trees.
- Continue walking the stack, knowing that we visit each path only once.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, the crucial part I was missing is that we first walk all commits, build up the huge, giant list for "" (i.e. all the top-level trees). Then we iterate over all of those before going any further. Crucially, no single tree walk happens until every last of the commits has been walked.
For some reason, I got the impression that we're doing a depth-first search where we immediately iterate into the full tree of the first commit, until we exhausted all of the blobs reachable from that root tree, and only then walk to the next commit. But that's not what we do ;-)
We basically walk all the paths in a depth-first manner, but not in alphabetical order because we end up plucking off the paths from the top of the stack (i.e. the "alphabetically last" item seen so far, but that only holds true if the root tree has no child trees, otherwise the order gets messy real quickly). So the order is somewhat jumbled, I guess ;-)
Thank you for explaining this so patiently. I think my confusion might be a good motivation to add a little primer about the order to the commit message of the "path-walk: introduce an object walk by path" commit.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was working on a technical doc for the API but don't have it committed or pushed anywhere. This confirms that it is necessary.
| { | ||
| struct tree_desc desc; | ||
| struct name_entry entry; | ||
| struct strbuf path = STRBUF_INIT; | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For performance reasons, it might make sense to pass in base_path as a strbuf, so that only one scratch buffer is used for the entirety of the path walk (at least in single-threaded mode). But we can do that later.
| int ret; | ||
|  | ||
| if (ctx->sparse) { | ||
| CALLOC_ARRAY(info.pl, 1); | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we have to release the allocated memory somewhere?
This is currently primarily here for me to test out a couple of ideas or to investigate how this path walk API thing works. Signed-off-by: Johannes Schindelin <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another batch of my review.
| } | ||
| } | ||
|  | ||
| info->path_fn("initial", &tags, OBJ_TAG, info->path_fn_data); | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should probably say "tags" instead of "initial"? And likewise the other path_fn("initial", ...) call may want to say "commits" instead?
And maybe even "tags" is misleading, because there could be blobs tracked under the file name tags? It is a valid path name, after all. We could play games by using a path that is invalid, such as .git/refs/tags (but what to do for commits? .git/refs/heads is wrong, they are not necessarily branch tips). Maybe better just use "".
The same goes for tagged-trees and tagged-blobs, too.
Also, do we want to prevent the function from being called with zero tags in case the caller asked for tags to be included but there were none to be found?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This "initial" string comes from the way that this is emitted in the other object walk. I'll try to find pointers to that or reasoning for that to be important.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So that means that you have to look at the object type first in the callback, and only interpret the path parameter as actual path in the OBJ_BLOB case?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Now, we want the path to matter for trees, too. When considering deltas in the pack-objects case, these paths don't matter for commits or tags because everything is sorted by type before compared by name-hash, so the set of deltas that are considered are the same in each case.
| /* | ||
| * Walk any pending objects at this point, but they should only | ||
| * be tags. | ||
| */ | ||
| for (size_t i = 0; i < info->revs->pending.nr; i++) { | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I fear that we have to handle tags in revs->pending first, accumulating the list before walking the commits, but then only processing them afterward (to avoid double-accounting for blobs that are both tagged directly and that are reachable via a tree).
I've just added a test helper in 489ce0c and used the left-over of t8100-*.sh -d:
$ ./bin-wrappers/test-tool -C ./t/trash\ directory.t8100-git-survey/ path-walk --no-blobs --tags --no-commits -v four
visiting path 'initial' (count: 0) of type 'tag'
$ git -C ./t/trash\ directory.t8100-git-survey/ rev-parse four four^0
dcd3534d11452b820a49532b41e7494f3ddfa2a8
5e53728740668d567dad45ca14d258b8f75fb7c1
$ ./bin-wrappers/test-tool -C ./t/trash\ directory.t8100-git-survey/ path-walk --no-blobs --tags --commits -v four
visiting path 'initial' (count: 7) of type 'commit'
        5e53728740668d567dad45ca14d258b8f75fb7c1
        6cf975cad5d4c3afe9d9c1beb519506ab26f2968
        5e51a08864ae54b185d9ccb441e3b84272a1d962
        bd080f6764f92d2e5f3c14d9ab1241495b919820
        123d744603d8c13997f7dd86b4f066cc68305762
        446a0a30049ffec3f6a1af16c07e8c0941b11660
        d8452bb98e031b9f6d1e3804a8cecc14e3124c9e
visiting path 'initial' (count: 0) of type 'tag'As you can see, there is a tag object four but it is not reported because the revision walk removes it from the pending list.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This tag stuff is the weakest part of the path-walk API and demonstrates the need for a git rev-list-like test helper that checks which objects are collected for each path and object type.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hopefully the test helper I implemented is a good start?
|  | ||
| int cmd_survey(int argc, const char **argv, const char *prefix) | ||
| { | ||
| if (argc == 2 && !strcmp(argv[1], "-h")) | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Speaking from experience, we must prefix this with a message that the command is experimental and that its options and its output and operation are subject to frequent change. I.e. something like this.
This is a huge feature, and it would be good to let users play with it. To make it safer to play with, it needs to be guarded behind command-line/config options.