Commit e565f37

Merge branch 'ds/backfill'
Lazy-loading missing files in a blobless clone on demand is costly
as it tends to be one-blob-at-a-time. "git backfill" is introduced
to help bulk-download necessary files beforehand.

* ds/backfill:
  backfill: assume --sparse when sparse-checkout is enabled
  backfill: add --sparse option
  backfill: add --min-batch-size=<n> option
  backfill: basic functionality and tests
  backfill: add builtin boilerplate
2 parents 0394451 + 85127bc commit e565f37

18 files changed, +540 -14 lines changed

.gitignore

Lines changed: 1 addition & 0 deletions
@@ -19,6 +19,7 @@
 /git-apply
 /git-archimport
 /git-archive
+/git-backfill
 /git-bisect
 /git-blame
 /git-branch

Documentation/git-backfill.adoc

Lines changed: 71 additions & 0 deletions
@@ -0,0 +1,71 @@
+git-backfill(1)
+===============
+
+NAME
+----
+git-backfill - Download missing objects in a partial clone
+
+
+SYNOPSIS
+--------
+[synopsis]
+git backfill [--min-batch-size=<n>] [--[no-]sparse]
+
+DESCRIPTION
+-----------
+
+Blobless partial clones are created using `git clone --filter=blob:none`
+and then configure the local repository such that the Git client avoids
+downloading blob objects unless they are required for a local operation.
+This initially means that the clone and later fetches download reachable
+commits and trees but no blobs. Later operations that change the `HEAD`
+pointer, such as `git checkout` or `git merge`, may need to download
+missing blobs in order to complete their operation.
+
+In the worst cases, commands that compute blob diffs, such as `git blame`,
+become very slow as they download the missing blobs in single-blob
+requests to satisfy the missing object as the Git command needs it. This
+leads to multiple download requests and no ability for the Git server to
+provide delta compression across those objects.
+
+The `git backfill` command provides a way for the user to request that
+Git downloads the missing blobs (with optional filters) such that the
+missing blobs representing historical versions of files can be downloaded
+in batches. The `backfill` command attempts to optimize the request by
+grouping blobs that appear at the same path, hopefully leading to good
+delta compression in the packfile sent by the server.
+
+In this way, `git backfill` provides a mechanism to break a large clone
+into smaller chunks. Starting with a blobless partial clone with `git
+clone --filter=blob:none` and then running `git backfill` in the local
+repository provides a way to download all reachable objects in several
+smaller network calls than downloading the entire repository at clone
+time.
+
+By default, `git backfill` downloads all blobs reachable from the `HEAD`
+commit. This set can be restricted or expanded using various options.
+
+THIS COMMAND IS EXPERIMENTAL. ITS BEHAVIOR MAY CHANGE IN THE FUTURE.
+
+
+OPTIONS
+-------
+
+`--min-batch-size=<n>`::
+	Specify a minimum size for a batch of missing objects to request
+	from the server. This size may be exceeded by the last set of
+	blobs seen at a given path. The default minimum batch size is
+	50,000.
+
+`--[no-]sparse`::
+	Only download objects if they appear at a path that matches the
+	current sparse-checkout. If the sparse-checkout feature is enabled,
+	then `--sparse` is assumed and can be disabled with `--no-sparse`.
+
+SEE ALSO
+--------
+linkgit:git-clone[1].
+
+GIT
+---
+Part of the linkgit:git[1] suite
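
The `--min-batch-size=<n>` text above says a batch may exceed the minimum because of the last set of blobs seen at a path. A minimal C sketch of that rule follows. It is an illustration only, not the code in `builtin/backfill.c`: `struct batcher`, `batcher_add_path`, and `batcher_flush` are hypothetical names, and a "download" is simulated by incrementing a counter.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the batch state; not a Git type. */
struct batcher {
	size_t min_batch_size; /* flush threshold, like --min-batch-size */
	size_t pending;        /* objects queued but not yet requested */
	size_t requests;       /* number of simulated download requests */
	size_t largest;        /* size of the largest batch sent so far */
};

/* Simulate one request to the server and reset the pending count. */
static void batcher_flush(struct batcher *b)
{
	if (!b->pending)
		return; /* assumption: nothing queued means nothing to send */
	b->requests++;
	if (b->pending > b->largest)
		b->largest = b->pending;
	b->pending = 0;
}

/*
 * Queue all missing blobs found at one path, then check the threshold
 * once. Because the check happens only after the whole path has been
 * queued, a batch can end up larger than min_batch_size.
 */
static void batcher_add_path(struct batcher *b, size_t missing_at_path)
{
	b->pending += missing_at_path;
	if (b->pending >= b->min_batch_size)
		batcher_flush(b);
}
```

With a minimum of 5, one path with 3 missing blobs followed by one with 4 produces a single request of 7 objects. A final `batcher_flush()` after the walk sends whatever is left over.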

Documentation/meson.build

Lines changed: 1 addition & 0 deletions
@@ -6,6 +6,7 @@ manpages = {
   'git-apply.adoc' : 1,
   'git-archimport.adoc' : 1,
   'git-archive.adoc' : 1,
+  'git-backfill.adoc' : 1,
   'git-bisect.adoc' : 1,
   'git-blame.adoc' : 1,
   'git-branch.adoc' : 1,

Documentation/technical/api-path-walk.adoc

Lines changed: 10 additions & 1 deletion
@@ -56,8 +56,17 @@ better off using the revision walk API instead.
 	the revision walk so that the walk emits commits marked with the
 	`UNINTERESTING` flag.
 
+`pl`::
+	This pattern list pointer allows focusing the path-walk search to
+	a set of patterns, only emitting paths that match the given
+	patterns. See linkgit:gitignore[5] or
+	linkgit:git-sparse-checkout[1] for details about pattern lists.
+	When the pattern list uses cone-mode patterns, then the path-walk
+	API can prune the set of paths it walks to improve performance.
+
 Examples
 --------
 
 See example usages in:
-	`t/helper/test-path-walk.c`
+	`t/helper/test-path-walk.c`,
+	`builtin/backfill.c`
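
The `pl` description above notes that cone-mode patterns let the path walk prune directories. The sketch below shows one way such pruning could work under a strong simplification: the cone is modeled as a plain list of directory prefixes, each ending in `/`. It is not Git's pattern-matching code, it ignores the cone-mode rule about files directly inside parent directories, and `is_dir_prefix` and `should_walk_dir` are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Return 1 if 'prefix' is a leading directory of 'path' or equal to it.
 * Both strings are assumed to end with '/', so "src/" does not match
 * "srcs/". The empty string stands for the repository root.
 */
static int is_dir_prefix(const char *prefix, const char *path)
{
	return strncmp(prefix, path, strlen(prefix)) == 0;
}

/*
 * Decide whether a directory is worth descending into: either it lies
 * inside a cone directory, or it is an ancestor of one and must be
 * traversed to reach it. Anything else can be skipped entirely.
 */
static int should_walk_dir(const char *dir,
			   const char *const *cone, size_t nr)
{
	for (size_t i = 0; i < nr; i++) {
		if (is_dir_prefix(cone[i], dir) ||
		    is_dir_prefix(dir, cone[i]))
			return 1;
	}
	return 0;
}
```

For a cone of `src/lib/` and `docs/`, the walk would descend into `src/` and `src/lib/util/` but skip `tests/` without reading its trees.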

Makefile

Lines changed: 1 addition & 0 deletions
@@ -1212,6 +1212,7 @@ BUILTIN_OBJS += builtin/am.o
 BUILTIN_OBJS += builtin/annotate.o
 BUILTIN_OBJS += builtin/apply.o
 BUILTIN_OBJS += builtin/archive.o
+BUILTIN_OBJS += builtin/backfill.o
 BUILTIN_OBJS += builtin/bisect.o
 BUILTIN_OBJS += builtin/blame.o
 BUILTIN_OBJS += builtin/branch.o

builtin.h

Lines changed: 1 addition & 0 deletions
@@ -120,6 +120,7 @@ int cmd_am(int argc, const char **argv, const char *prefix, struct repository *r
 int cmd_annotate(int argc, const char **argv, const char *prefix, struct repository *repo);
 int cmd_apply(int argc, const char **argv, const char *prefix, struct repository *repo);
 int cmd_archive(int argc, const char **argv, const char *prefix, struct repository *repo);
+int cmd_backfill(int argc, const char **argv, const char *prefix, struct repository *repo);
 int cmd_bisect(int argc, const char **argv, const char *prefix, struct repository *repo);
 int cmd_blame(int argc, const char **argv, const char *prefix, struct repository *repo);
 int cmd_branch(int argc, const char **argv, const char *prefix, struct repository *repo);

builtin/backfill.c

Lines changed: 147 additions & 0 deletions
@@ -0,0 +1,147 @@
+/* We need this macro to access core_apply_sparse_checkout */
+#define USE_THE_REPOSITORY_VARIABLE
+
+#include "builtin.h"
+#include "git-compat-util.h"
+#include "config.h"
+#include "parse-options.h"
+#include "repository.h"
+#include "commit.h"
+#include "dir.h"
+#include "environment.h"
+#include "hex.h"
+#include "tree.h"
+#include "tree-walk.h"
+#include "object.h"
+#include "object-store-ll.h"
+#include "oid-array.h"
+#include "oidset.h"
+#include "promisor-remote.h"
+#include "strmap.h"
+#include "string-list.h"
+#include "revision.h"
+#include "trace2.h"
+#include "progress.h"
+#include "packfile.h"
+#include "path-walk.h"
+
+static const char * const builtin_backfill_usage[] = {
+	N_("git backfill [--min-batch-size=<n>] [--[no-]sparse]"),
+	NULL
+};
+
+struct backfill_context {
+	struct repository *repo;
+	struct oid_array current_batch;
+	size_t min_batch_size;
+	int sparse;
+};
+
+static void backfill_context_clear(struct backfill_context *ctx)
+{
+	oid_array_clear(&ctx->current_batch);
+}
+
+static void download_batch(struct backfill_context *ctx)
+{
+	promisor_remote_get_direct(ctx->repo,
+				   ctx->current_batch.oid,
+				   ctx->current_batch.nr);
+	oid_array_clear(&ctx->current_batch);
+
+	/*
+	 * We likely have a new packfile. Add it to the packed list to
+	 * avoid possible duplicate downloads of the same objects.
+	 */
+	reprepare_packed_git(ctx->repo);
+}
+
+static int fill_missing_blobs(const char *path UNUSED,
+			      struct oid_array *list,
+			      enum object_type type,
+			      void *data)
+{
+	struct backfill_context *ctx = data;
+
+	if (type != OBJ_BLOB)
+		return 0;
+
+	for (size_t i = 0; i < list->nr; i++) {
+		if (!has_object(ctx->repo, &list->oid[i],
+				OBJECT_INFO_FOR_PREFETCH))
+			oid_array_append(&ctx->current_batch, &list->oid[i]);
+	}
+
+	if (ctx->current_batch.nr >= ctx->min_batch_size)
+		download_batch(ctx);
+
+	return 0;
+}
+
+static int do_backfill(struct backfill_context *ctx)
+{
+	struct rev_info revs;
+	struct path_walk_info info = PATH_WALK_INFO_INIT;
+	int ret;
+
+	if (ctx->sparse) {
+		CALLOC_ARRAY(info.pl, 1);
+		if (get_sparse_checkout_patterns(info.pl)) {
+			path_walk_info_clear(&info);
+			return error(_("problem loading sparse-checkout"));
+		}
+	}
+
+	repo_init_revisions(ctx->repo, &revs, "");
+	handle_revision_arg("HEAD", &revs, 0, 0);
+
+	info.blobs = 1;
+	info.tags = info.commits = info.trees = 0;
+
+	info.revs = &revs;
+	info.path_fn = fill_missing_blobs;
+	info.path_fn_data = ctx;
+
+	ret = walk_objects_by_path(&info);
+
+	/* Download the objects that did not fill a batch. */
+	if (!ret)
+		download_batch(ctx);
+
+	path_walk_info_clear(&info);
+	release_revisions(&revs);
+	return ret;
+}
+
+int cmd_backfill(int argc, const char **argv, const char *prefix, struct repository *repo)
+{
+	int result;
+	struct backfill_context ctx = {
+		.repo = repo,
+		.current_batch = OID_ARRAY_INIT,
+		.min_batch_size = 50000,
+		.sparse = 0,
+	};
+	struct option options[] = {
+		OPT_INTEGER(0, "min-batch-size", &ctx.min_batch_size,
+			    N_("Minimum number of objects to request at a time")),
+		OPT_BOOL(0, "sparse", &ctx.sparse,
+			 N_("Restrict the missing objects to the current sparse-checkout")),
+		OPT_END(),
+	};
+
+	show_usage_with_options_if_asked(argc, argv,
+					 builtin_backfill_usage, options);
+
+	argc = parse_options(argc, argv, prefix, options, builtin_backfill_usage,
+			     0);
+
+	repo_config(repo, git_default_config, NULL);
+
+	if (ctx.sparse < 0)
+		ctx.sparse = core_apply_sparse_checkout;
+
+	result = do_backfill(&ctx);
+	backfill_context_clear(&ctx);
+	return result;
+}
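
The end of `cmd_backfill()` resolves the `--[no-]sparse` default from the sparse-checkout setting. This is a tri-state pattern: a negative value means the user gave no option, so the default comes from configuration. The hunk as shown initializes `.sparse = 0`, and `OPT_BOOL` only stores 0 or 1, so the `ctx.sparse < 0` test would not fire as written. The sketch below therefore assumes the intended starting value is a negative sentinel. `SPARSE_UNSET` and `resolve_sparse` are hypothetical names, not part of Git.

```c
#include <assert.h>

/* Assumed sentinel meaning "neither --sparse nor --no-sparse was given". */
enum { SPARSE_UNSET = -1 };

/*
 * Resolve the effective sparse setting.
 *   cli_value:               SPARSE_UNSET, 0 (--no-sparse) or 1 (--sparse)
 *   sparse_checkout_enabled: non-zero if sparse-checkout is on in config
 * An explicit command-line choice always wins over the configuration.
 */
static int resolve_sparse(int cli_value, int sparse_checkout_enabled)
{
	if (cli_value < 0)
		return sparse_checkout_enabled ? 1 : 0;
	return cli_value;
}
```

Under this assumption, a repository with sparse-checkout enabled gets `--sparse` behavior by default, while an explicit `--no-sparse` still requests blobs outside the sparse-checkout, matching the option text in `git-backfill.adoc`.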

command-list.txt

Lines changed: 1 addition & 0 deletions
@@ -60,6 +60,7 @@ git-annotate ancillaryinterrogators
 git-apply plumbingmanipulators complete
 git-archimport foreignscminterface
 git-archive mainporcelain
+git-backfill mainporcelain history
 git-bisect mainporcelain info
 git-blame ancillaryinterrogators complete
 git-branch mainporcelain history

dir.c

Lines changed: 3 additions & 7 deletions
@@ -1093,10 +1093,6 @@ static void invalidate_directory(struct untracked_cache *uc,
 		dir->dirs[i]->recurse = 0;
 }
 
-static int add_patterns_from_buffer(char *buf, size_t size,
-				    const char *base, int baselen,
-				    struct pattern_list *pl);
-
 /* Flags for add_patterns() */
 #define PATTERN_NOFOLLOW (1<<0)
 
@@ -1186,9 +1182,9 @@ static int add_patterns(const char *fname, const char *base, int baselen,
 	return 0;
 }
 
-static int add_patterns_from_buffer(char *buf, size_t size,
-				    const char *base, int baselen,
-				    struct pattern_list *pl)
+int add_patterns_from_buffer(char *buf, size_t size,
+			     const char *base, int baselen,
+			     struct pattern_list *pl)
 {
 	char *orig = buf;
 	int i, lineno = 1;

dir.h

Lines changed: 3 additions & 0 deletions
@@ -467,6 +467,9 @@ void add_patterns_from_file(struct dir_struct *, const char *fname);
 int add_patterns_from_blob_to_list(struct object_id *oid,
 				   const char *base, int baselen,
 				   struct pattern_list *pl);
+int add_patterns_from_buffer(char *buf, size_t size,
+			     const char *base, int baselen,
+			     struct pattern_list *pl);
 void parse_path_pattern(const char **string, int *patternlen, unsigned *flags, int *nowildcardlen);
 void add_pattern(const char *string, const char *base,
 		 int baselen, struct pattern_list *pl, int srcpos);
