Commit 37dc6d8

ttaylorr authored and gitster committed
builtin/repack.c: implement support for --max-cruft-size
Cruft packs are an alternative mechanism for storing a collection of unreachable objects whose mtimes are recent enough to avoid being pruned out of the repository.

When cruft packs were first introduced back in b757353 (builtin/pack-objects.c: --cruft without expiration, 2022-05-20) and a7d4938 (builtin/pack-objects.c: --cruft with expiration, 2022-05-20), the recommended workflow consisted of:

  - Repacking periodically, either by packing anything loose in the repository (via `git repack -d`) or producing a geometric sequence of packs (via `git repack --geometric=<d> -d`).

  - Every so often, splitting the repository into two packs, one cruft to store the unreachable objects, and another non-cruft pack to store the reachable objects.

Repositories may (out of band with the above) choose periodically to prune out some unreachable objects which have aged out of the grace period by generating a pack with `--cruft-expiration=<approxidate>`.

This allowed repositories to maintain relatively few packs on average, and quarantine unreachable objects together in a cruft pack, avoiding the pitfalls of holding unreachable objects as loose while they age out (for more, see some of the details in 3d89a8c (Documentation/technical: add cruft-packs.txt, 2022-05-20)).

This all works, but can be costly from an I/O-perspective when frequently repacking a repository that has many unreachable objects. This problem is exacerbated when those unreachable objects are rarely (if ever) pruned.

Since there is at most one cruft pack in the above scheme, each time we update the cruft pack it must be rewritten from scratch. Because much of the pack is reused, this is a relatively inexpensive operation from a CPU-perspective, but is very costly in terms of I/O since we end up rewriting basically the same pack (plus any new unreachable objects that have entered the repository since the last time a cruft pack was generated).

At the time, we decided against implementing more robust support for multiple cruft packs. This patch implements that support which we were lacking.

Introduce a new option `--max-cruft-size` which allows repositories to accumulate cruft packs up to a given size, after which point a new generation of cruft packs can accumulate until it reaches the maximum size, and so on. To generate a new cruft pack, the process works like so:

  - Sort a list of any existing cruft packs in ascending order of pack size.

  - Starting from the beginning of the list, group cruft packs together while the accumulated size is smaller than the maximum specified pack size.

  - Combine the objects in these cruft packs together into a new cruft pack, along with any other unreachable objects which have since entered the repository.

Once a cruft pack grows beyond the size specified via `--max-cruft-size` the pack is effectively frozen. This limits the I/O churn up to a quadratic function of the value specified by the `--max-cruft-size` option, instead of behaving quadratically in the number of total unreachable objects.

When pruning unreachable objects, we bypass the new code paths which combine small cruft packs together, and instead start from scratch, passing in the appropriate `--max-pack-size` down to `pack-objects`, putting it in charge of keeping the resulting set of cruft packs sized correctly.

This may seem like further I/O churn, but in practice it isn't so bad. We could prune old cruft packs for whom all or most objects are removed, and then generate a new cruft pack with just the remaining set of objects. But this additional complexity buys us relatively little, because most objects end up being pruned anyway, so the I/O churn is well contained.

Signed-off-by: Taylor Blau <[email protected]>
Signed-off-by: Junio C Hamano <[email protected]>
1 parent b5b1f4c commit 37dc6d8
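
The three selection steps in the commit message (sort ascending, accumulate while under the limit, combine the selected packs) can be sketched on plain byte counts. This is a minimal illustration, not the code from this patch: `select_cruft_packs`, `size_cmp`, and the `combine` output array are hypothetical names, and packs are reduced to their sizes only.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Compare two pack sizes for qsort(), smallest first. */
static int size_cmp(const void *va, const void *vb)
{
	uint64_t a = *(const uint64_t *)va;
	uint64_t b = *(const uint64_t *)vb;

	return (a > b) - (a < b);
}

/*
 * Hypothetical helper: sort `sizes` in ascending order, then set
 * combine[i] to 1 for each pack that would be folded into the new
 * cruft pack and to 0 for each pack that would be left alone.
 * Returns the number of packs selected for combining.
 */
static size_t select_cruft_packs(uint64_t *sizes, size_t nr,
				 uint64_t max_size, int *combine)
{
	uint64_t total = 0;
	size_t i, selected = 0;

	qsort(sizes, nr, sizeof(*sizes), size_cmp);

	for (i = 0; i < nr; i++) {
		/*
		 * total never exceeds max_size, so this comparison is
		 * the overflow-safe form of total + sizes[i] <= max_size.
		 */
		if (sizes[i] <= max_size - total) {
			total += sizes[i];
			combine[i] = 1;
			selected++;
		} else {
			combine[i] = 0;
		}
	}
	return selected;
}
```

With sizes 700, 100, 300, 200 and a limit of 650, the packs of 100, 200, and 300 bytes are selected and the 700-byte pack is left untouched, which is the "frozen" behavior the message describes.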

7 files changed, +426 -11 lines changed

Documentation/config/gc.txt

Lines changed: 6 additions & 0 deletions
@@ -86,6 +86,12 @@ gc.cruftPacks::
 	linkgit:git-repack[1]) instead of as loose objects. The default
 	is `true`.
 
+gc.maxCruftSize::
+	Limit the size of new cruft packs when repacking. When
+	specified in addition to `--max-cruft-size`, the command line
+	option takes priority. See the `--max-cruft-size` option of
+	linkgit:git-repack[1].
+
 gc.pruneExpire::
 	When 'git gc' is run, it will call 'prune --expire 2.weeks.ago'
 	(and 'repack --cruft --cruft-expiration 2.weeks.ago' if using

Documentation/git-gc.txt

Lines changed: 7 additions & 0 deletions
@@ -59,6 +59,13 @@ be performed as well.
 	cruft pack instead of storing them as loose objects. `--cruft`
 	is on by default.
 
+--max-cruft-size=<n>::
+	When packing unreachable objects into a cruft pack, limit the
+	size of new cruft packs to be at most `<n>` bytes. Overrides any
+	value specified via the `gc.maxCruftSize` configuration. See
+	the `--max-cruft-size` option of linkgit:git-repack[1] for
+	more.
+
 --prune=<date>::
 	Prune loose objects older than date (default is 2 weeks ago,
 	overridable by the config variable `gc.pruneExpire`).

Documentation/git-repack.txt

Lines changed: 11 additions & 0 deletions
@@ -74,6 +74,17 @@ to the new separate pack will be written.
 	immediately instead of waiting for the next `git gc` invocation.
 	Only useful with `--cruft -d`.
 
+--max-cruft-size=<n>::
+	Repack cruft objects into packs as large as `<n>` bytes before
+	creating new packs. As long as there are enough cruft packs
+	smaller than `<n>`, repacking will cause a new cruft pack to
+	be created containing objects from any combined cruft packs,
+	along with any new unreachable objects. Cruft packs larger than
+	`<n>` will not be modified. When the new cruft pack is larger
+	than `<n>` bytes, it will be split into multiple packs, all of
+	which are guaranteed to be at most `<n>` bytes in size. Only
+	useful with `--cruft -d`.
+
 --expire-to=<dir>::
 	Write a cruft pack containing pruned objects (if any) to the
 	directory `<dir>`. This option is useful for keeping a copy of

builtin/gc.c

Lines changed: 7 additions & 0 deletions
@@ -52,6 +52,7 @@ static const char * const builtin_gc_usage[] = {
 static int pack_refs = 1;
 static int prune_reflogs = 1;
 static int cruft_packs = 1;
+static unsigned long max_cruft_size;
 static int aggressive_depth = 50;
 static int aggressive_window = 250;
 static int gc_auto_threshold = 6700;
@@ -163,6 +164,7 @@ static void gc_config(void)
 	git_config_get_int("gc.autopacklimit", &gc_auto_pack_limit);
 	git_config_get_bool("gc.autodetach", &detach_auto);
 	git_config_get_bool("gc.cruftpacks", &cruft_packs);
+	git_config_get_ulong("gc.maxcruftsize", &max_cruft_size);
 	git_config_get_expiry("gc.pruneexpire", &prune_expire);
 	git_config_get_expiry("gc.worktreepruneexpire", &prune_worktrees_expire);
 	git_config_get_expiry("gc.logexpiry", &gc_log_expire);
@@ -347,6 +349,9 @@ static void add_repack_all_option(struct string_list *keep_pack)
 		strvec_push(&repack, "--cruft");
 		if (prune_expire)
 			strvec_pushf(&repack, "--cruft-expiration=%s", prune_expire);
+		if (max_cruft_size)
+			strvec_pushf(&repack, "--max-cruft-size=%lu",
+				     max_cruft_size);
 	} else {
 		strvec_push(&repack, "-A");
 		if (prune_expire)
@@ -575,6 +580,8 @@ int cmd_gc(int argc, const char **argv, const char *prefix)
 			N_("prune unreferenced objects"),
 			PARSE_OPT_OPTARG, NULL, (intptr_t)prune_expire },
 		OPT_BOOL(0, "cruft", &cruft_packs, N_("pack unreferenced objects separately")),
+		OPT_MAGNITUDE(0, "max-cruft-size", &max_cruft_size,
+			      N_("with --cruft, limit the size of new cruft packs")),
 		OPT_BOOL(0, "aggressive", &aggressive, N_("be more thorough (increased runtime)")),
 		OPT_BOOL_F(0, "auto", &auto_gc, N_("enable auto-gc mode"),
 			   PARSE_OPT_NOCOMPLETE),
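
The hunks above store both the `gc.maxCruftSize` configuration value and the `--max-cruft-size` command-line value in the same `max_cruft_size` variable, so whichever is written last wins. Assuming the configuration is read before the command line is parsed (which is what the documented precedence implies), the effect can be sketched as follows. `format_max_cruft_arg` and its parameters are hypothetical names for illustration; git itself builds the argument with `strvec_pushf()`.

```c
#include <stddef.h>
#include <stdio.h>

/*
 * Hypothetical helper: compute the limit that would be forwarded to
 * `git repack` and format it into `out`. Returns 1 if an argument was
 * written, or 0 if no limit is in effect (nothing is forwarded).
 */
static int format_max_cruft_arg(char *out, size_t outsz,
				unsigned long config_value,
				int have_cli_value,
				unsigned long cli_value)
{
	/* Step 1: the configuration value seeds the variable. */
	unsigned long max_cruft_size = config_value;

	/* Step 2: a command-line value, if given, overwrites it. */
	if (have_cli_value)
		max_cruft_size = cli_value;

	/* Step 3: a zero limit means the option is not passed along. */
	if (!max_cruft_size)
		return 0;

	snprintf(out, outsz, "--max-cruft-size=%lu", max_cruft_size);
	return 1;
}
```

With a configured value of 2097152 and a command-line value of 3145728, the formatted argument is `--max-cruft-size=3145728`, which is the combination exercised by the last test in t/t6500-gc.sh.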

builtin/repack.c

Lines changed: 123 additions & 11 deletions
@@ -27,6 +27,7 @@
 #define PACK_CRUFT 4
 
 #define DELETE_PACK 1
+#define RETAIN_PACK 2
 
 static int pack_everything;
 static int delta_base_offset = 1;
@@ -116,11 +117,26 @@ static void pack_mark_for_deletion(struct string_list_item *item)
 	item->util = (void*)((uintptr_t)item->util | DELETE_PACK);
 }
 
+static void pack_unmark_for_deletion(struct string_list_item *item)
+{
+	item->util = (void*)((uintptr_t)item->util & ~DELETE_PACK);
+}
+
 static int pack_is_marked_for_deletion(struct string_list_item *item)
 {
 	return (uintptr_t)item->util & DELETE_PACK;
 }
 
+static void pack_mark_retained(struct string_list_item *item)
+{
+	item->util = (void*)((uintptr_t)item->util | RETAIN_PACK);
+}
+
+static int pack_is_retained(struct string_list_item *item)
+{
+	return (uintptr_t)item->util & RETAIN_PACK;
+}
+
 static void mark_packs_for_deletion_1(struct string_list *names,
 				      struct string_list *list)
 {
@@ -133,17 +149,39 @@ static void mark_packs_for_deletion_1(struct string_list *names,
 		if (len < hexsz)
 			continue;
 		sha1 = item->string + len - hexsz;
-		/*
-		 * Mark this pack for deletion, which ensures that this
-		 * pack won't be included in a MIDX (if `--write-midx`
-		 * was given) and that we will actually delete this pack
-		 * (if `-d` was given).
-		 */
-		if (!string_list_has_string(names, sha1))
+
+		if (pack_is_retained(item)) {
+			pack_unmark_for_deletion(item);
+		} else if (!string_list_has_string(names, sha1)) {
+			/*
+			 * Mark this pack for deletion, which ensures
+			 * that this pack won't be included in a MIDX
+			 * (if `--write-midx` was given) and that we
+			 * will actually delete this pack (if `-d` was
+			 * given).
+			 */
 			pack_mark_for_deletion(item);
+		}
 	}
 }
 
+static void retain_cruft_pack(struct existing_packs *existing,
+			      struct packed_git *cruft)
+{
+	struct strbuf buf = STRBUF_INIT;
+	struct string_list_item *item;
+
+	strbuf_addstr(&buf, pack_basename(cruft));
+	strbuf_strip_suffix(&buf, ".pack");
+
+	item = string_list_lookup(&existing->cruft_packs, buf.buf);
+	if (!item)
+		BUG("could not find cruft pack '%s'", pack_basename(cruft));
+
+	pack_mark_retained(item);
+	strbuf_release(&buf);
+}
+
 static void mark_packs_for_deletion(struct existing_packs *existing,
 				    struct string_list *names)
 

@@ -225,6 +263,8 @@ static void collect_pack_filenames(struct existing_packs *existing,
 	}
 
 	string_list_sort(&existing->kept_packs);
+	string_list_sort(&existing->non_kept_packs);
+	string_list_sort(&existing->cruft_packs);
 	strbuf_release(&buf);
 }
 
@@ -806,6 +846,72 @@ static void remove_redundant_bitmaps(struct string_list *include,
 	strbuf_release(&path);
 }
 
+static int existing_cruft_pack_cmp(const void *va, const void *vb)
+{
+	struct packed_git *a = *(struct packed_git **)va;
+	struct packed_git *b = *(struct packed_git **)vb;
+
+	if (a->pack_size < b->pack_size)
+		return -1;
+	if (a->pack_size > b->pack_size)
+		return 1;
+	return 0;
+}
+
+static void collapse_small_cruft_packs(FILE *in, size_t max_size,
+				       struct existing_packs *existing)
+{
+	struct packed_git **existing_cruft, *p;
+	struct strbuf buf = STRBUF_INIT;
+	size_t total_size = 0;
+	size_t existing_cruft_nr = 0;
+	size_t i;
+
+	ALLOC_ARRAY(existing_cruft, existing->cruft_packs.nr);
+
+	for (p = get_all_packs(the_repository); p; p = p->next) {
+		if (!(p->is_cruft && p->pack_local))
+			continue;
+
+		strbuf_reset(&buf);
+		strbuf_addstr(&buf, pack_basename(p));
+		strbuf_strip_suffix(&buf, ".pack");
+
+		if (!string_list_has_string(&existing->cruft_packs, buf.buf))
+			continue;
+
+		if (existing_cruft_nr >= existing->cruft_packs.nr)
+			BUG("too many cruft packs (found %"PRIuMAX", but knew "
+			    "of %"PRIuMAX")",
+			    (uintmax_t)existing_cruft_nr + 1,
+			    (uintmax_t)existing->cruft_packs.nr);
+		existing_cruft[existing_cruft_nr++] = p;
+	}
+
+	QSORT(existing_cruft, existing_cruft_nr, existing_cruft_pack_cmp);
+
+	for (i = 0; i < existing_cruft_nr; i++) {
+		size_t proposed;
+
+		p = existing_cruft[i];
+		proposed = st_add(total_size, p->pack_size);
+
+		if (proposed <= max_size) {
+			total_size = proposed;
+			fprintf(in, "-%s\n", pack_basename(p));
+		} else {
+			retain_cruft_pack(existing, p);
+			fprintf(in, "%s\n", pack_basename(p));
+		}
+	}
+
+	for (i = 0; i < existing->non_kept_packs.nr; i++)
+		fprintf(in, "-%s.pack\n",
+			existing->non_kept_packs.items[i].string);
+
+	strbuf_release(&buf);
+}
+
 static int write_cruft_pack(const struct pack_objects_args *args,
 			    const char *destination,
 			    const char *pack_prefix,
@@ -853,10 +959,14 @@ static int write_cruft_pack(const struct pack_objects_args *args,
 	in = xfdopen(cmd.in, "w");
 	for_each_string_list_item(item, names)
 		fprintf(in, "%s-%s.pack\n", pack_prefix, item->string);
-	for_each_string_list_item(item, &existing->non_kept_packs)
-		fprintf(in, "-%s.pack\n", item->string);
-	for_each_string_list_item(item, &existing->cruft_packs)
-		fprintf(in, "-%s.pack\n", item->string);
+	if (args->max_pack_size && !cruft_expiration) {
+		collapse_small_cruft_packs(in, args->max_pack_size, existing);
+	} else {
+		for_each_string_list_item(item, &existing->non_kept_packs)
+			fprintf(in, "-%s.pack\n", item->string);
+		for_each_string_list_item(item, &existing->cruft_packs)
+			fprintf(in, "-%s.pack\n", item->string);
+	}
 	for_each_string_list_item(item, &existing->kept_packs)
 		fprintf(in, "%s.pack\n", item->string);
 	fclose(in);
@@ -919,6 +1029,8 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
 				PACK_CRUFT),
 		OPT_STRING(0, "cruft-expiration", &cruft_expiration, N_("approxidate"),
 				N_("with --cruft, expire objects older than this")),
+		OPT_MAGNITUDE(0, "max-cruft-size", &cruft_po_args.max_pack_size,
+				N_("with --cruft, limit the size of new cruft packs")),
 		OPT_BOOL('d', NULL, &delete_redundant,
 				N_("remove redundant packs, and run git-prune-packed")),
 		OPT_BOOL('f', NULL, &po_args.no_reuse_delta,
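
The `DELETE_PACK` and `RETAIN_PACK` helpers in this file pack two independent flags into the `util` pointer of each list item, and `mark_packs_for_deletion_1()` lets "retained" override "deleted". The standalone sketch below reproduces that interplay outside of git. `struct item` is a stand-in for `struct string_list_item`, and `decide()` is a hypothetical condensation of the loop body, with `in_new_names` standing in for the `string_list_has_string()` lookup.

```c
#include <stdint.h>

#define DELETE_PACK 1
#define RETAIN_PACK 2

/* Stand-in for git's `struct string_list_item`; only `util` matters. */
struct item {
	void *util;
};

static void mark_for_deletion(struct item *it)
{
	it->util = (void *)((uintptr_t)it->util | DELETE_PACK);
}

static void unmark_for_deletion(struct item *it)
{
	/* Clear only the DELETE_PACK bit; RETAIN_PACK is left intact. */
	it->util = (void *)((uintptr_t)it->util & ~(uintptr_t)DELETE_PACK);
}

static void mark_retained(struct item *it)
{
	it->util = (void *)((uintptr_t)it->util | RETAIN_PACK);
}

static int is_marked_for_deletion(const struct item *it)
{
	return ((uintptr_t)it->util & DELETE_PACK) != 0;
}

static int is_retained(const struct item *it)
{
	return ((uintptr_t)it->util & RETAIN_PACK) != 0;
}

/*
 * Hypothetical condensation of the per-pack decision: a retained pack
 * is never deleted; otherwise a pack that is not among the newly
 * written pack names is marked for deletion.
 */
static void decide(struct item *it, int in_new_names)
{
	if (is_retained(it))
		unmark_for_deletion(it);
	else if (!in_new_names)
		mark_for_deletion(it);
}
```

Because the two flags occupy separate bits, a pack that was marked for deletion earlier and then retained by `retain_cruft_pack()` ends up with only the retain bit set after the decision step.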

t/t6500-gc.sh

Lines changed: 27 additions & 0 deletions
@@ -303,6 +303,33 @@ test_expect_success 'gc.bigPackThreshold ignores cruft packs' '
 	)
 '
 
+cruft_max_size_opts="git repack -d -l --cruft --cruft-expiration=2.weeks.ago"
+
+test_expect_success 'setup for --max-cruft-size tests' '
+	git init cruft--max-size &&
+	(
+		cd cruft--max-size &&
+		prepare_cruft_history
+	)
+'
+
+test_expect_success '--max-cruft-size sets appropriate repack options' '
+	GIT_TRACE2_EVENT=$(pwd)/trace2.txt git -C cruft--max-size \
+		gc --cruft --max-cruft-size=1M &&
+	test_subcommand $cruft_max_size_opts --max-cruft-size=1048576 <trace2.txt
+'
+
+test_expect_success 'gc.maxCruftSize sets appropriate repack options' '
+	GIT_TRACE2_EVENT=$(pwd)/trace2.txt \
+		git -C cruft--max-size -c gc.maxCruftSize=2M gc --cruft &&
+	test_subcommand $cruft_max_size_opts --max-cruft-size=2097152 <trace2.txt &&
+
+	GIT_TRACE2_EVENT=$(pwd)/trace2.txt \
+		git -C cruft--max-size -c gc.maxCruftSize=2M gc --cruft \
+		--max-cruft-size=3M &&
+	test_subcommand $cruft_max_size_opts --max-cruft-size=3145728 <trace2.txt
+'
+
 run_and_wait_for_auto_gc () {
 	# We read stdout from gc for the side effect of waiting until the
 	# background gc process exits, closing its fd 9. Furthermore, the
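
The tests above pass sizes such as `1M` and expect `git repack` to receive the value in bytes (1048576). The sketch below shows the kind of suffix scaling that produces those numbers. It is not git's parser: `parse_magnitude` is a hypothetical name, and the accepted suffixes (`k`, `m`, `g`, case-insensitive, powers of 1024) are an assumption chosen to be consistent with the values the tests expect.

```c
#include <ctype.h>
#include <errno.h>
#include <limits.h>
#include <stdlib.h>

/*
 * Hypothetical helper: parse "<digits>[k|m|g]" into a byte count.
 * Returns 0 on success and -1 on malformed input or overflow.
 */
static int parse_magnitude(const char *s, unsigned long *out)
{
	char *end;
	unsigned long val, factor = 1;

	/* Reject empty input, signs, and leading whitespace. */
	if (!isdigit((unsigned char)*s))
		return -1;

	errno = 0;
	val = strtoul(s, &end, 10);
	if (errno == ERANGE)
		return -1;

	/* An optional one-letter suffix scales by a power of 1024. */
	switch (tolower((unsigned char)*end)) {
	case 'k':
		factor = 1024UL;
		end++;
		break;
	case 'm':
		factor = 1024UL * 1024;
		end++;
		break;
	case 'g':
		factor = 1024UL * 1024 * 1024;
		end++;
		break;
	}

	/* Anything left after the suffix is an error. */
	if (*end)
		return -1;
	if (val > ULONG_MAX / factor)
		return -1;

	*out = val * factor;
	return 0;
}
```

Under these assumptions `2M` scales to 2097152 and `3M` to 3145728, matching the `--max-cruft-size` values that `test_subcommand` looks for in the trace output.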
