Skip to content

Commit 5ee86c2

Browse files
ttaylorrgitster
authored andcommitted
repack: exclude cruft pack(s) from the MIDX where possible
In ddee370 (builtin/repack.c: add cruft packs to MIDX during geometric repack, 2022-05-20), repack began adding cruft pack(s) to the MIDX with '--write-midx' to ensure that the resulting MIDX was always closed under reachability in order to generate reachability bitmaps. While the previous patch added the '--stdin-packs=follow' option to pack-objects, it is not yet on by default. Given that, suppose you have a once-unreachable object packed in a cruft pack, which later becomes reachable from one or more objects in a geometrically repacked pack. That once-unreachable object *won't* appear in the new pack, since the cruft pack was not specified as included or excluded when the geometrically repacked pack was created with 'pack-objects --stdin-packs' (*not* '--stdin-packs=follow', which is not on). If that new pack is included in a MIDX without the cruft pack, then trying to generate bitmaps for that MIDX may fail. This happens when the bitmap selection process picks one or more commits which reach the once-unreachable objects. To mitigate this failure mode, commit ddee370 ensures that the MIDX will be closed under reachability by including cruft pack(s). If cruft pack(s) were not included, we would fail to generate a MIDX bitmap. But ddee370 alludes to the fact that this is sub-optimal by saying [...] it's desirable to avoid including cruft packs in the MIDX because it causes the MIDX to store a bunch of objects which are likely to get thrown away. , which is true, but hides an even larger problem. If repositories rarely prune their unreachable objects and/or have many of them, the MIDX must keep track of a large number of objects which bloats the MIDX and slows down object lookup. This is doubly unfortunate because the vast majority of objects in cruft pack(s) are unlikely to be read. But any object lookups that go through the MIDX must binary search over them anyway, slowing down object lookups using the MIDX. This patch causes geometrically-repacked packs to contain a copy of any once-unreachable object(s) with 'git pack-objects --stdin-packs=follow', allowing us to avoid including any cruft packs in the MIDX. This is because a sequence of geometrically-repacked packs that were all generated with '--stdin-packs=follow' are guaranteed to have their union be closed under reachability. Note that you cannot guarantee that a collection of packs is closed under reachability if not all of them were generated with "following" as above. One tell-tale sign that not all geometrically-repacked packs in the MIDX were generated with "following" is to see if there is a pack in the existing MIDX that is not going to be somehow represented (either verbatim or as part of a geometric rollup) in the new MIDX. If there is, then starting to generate packs with "following" during geometric repacking won't work, since it's open to the same race as described above. But if you're starting from scratch (e.g., building the first MIDX after an all-into-one '--cruft' repack), then you can guarantee that the union of subsequently generated packs from geometric repacking *is* closed under reachability. (One exception here is when "starting from scratch" results in a noop repack, e.g., because the non-cruft pack(s) in a repository already form a geometric progression. Since we can't tell whether or not those were generated with '--stdin-packs=follow', they may depend on once-unreachable objects, so we have to include the cruft pack in the MIDX in this case.) Detect when this is the case and avoid including cruft packs in the MIDX where possible. The existing behavior remains the default, and the new behavior is available with the config 'repack.midxMustIncludeCruft' set to 'false'. Signed-off-by: Taylor Blau <[email protected]> Signed-off-by: Junio C Hamano <[email protected]>
1 parent cd846ba commit 5ee86c2

File tree

3 files changed

+319
-20
lines changed

3 files changed

+319
-20
lines changed

Documentation/config/repack.adoc

Lines changed: 7 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -39,3 +39,10 @@ repack.cruftThreads::
3939
a cruft pack and the respective parameters are not given over
4040
the command line. See similarly named `pack.*` configuration
4141
variables for defaults and meaning.
42+
43+
repack.midxMustContainCruft::
44+
When set to true, linkgit:git-repack[1] will unconditionally include
45+
cruft pack(s), if any, in the multi-pack index when invoked with
46+
`--write-midx`. When false, cruft packs are only included in the MIDX
47+
when necessary (e.g., because they might be required to form a
48+
reachability closure with MIDX bitmaps). Defaults to true.

builtin/repack.c

Lines changed: 167 additions & 20 deletions
Original file line numberDiff line numberDiff line change
@@ -39,6 +39,7 @@ static int write_bitmaps = -1;
3939
static int use_delta_islands;
4040
static int run_update_server_info = 1;
4141
static char *packdir, *packtmp_name, *packtmp;
42+
static int midx_must_contain_cruft = 1;
4243

4344
static const char *const git_repack_usage[] = {
4445
N_("git repack [-a] [-A] [-d] [-f] [-F] [-l] [-n] [-q] [-b] [-m]\n"
@@ -108,6 +109,10 @@ static int repack_config(const char *var, const char *value,
108109
free(cruft_po_args->threads);
109110
return git_config_string(&cruft_po_args->threads, var, value);
110111
}
112+
if (!strcmp(var, "repack.midxmustcontaincruft")) {
113+
midx_must_contain_cruft = git_config_bool(var, value);
114+
return 0;
115+
}
111116
return git_default_config(var, value, ctx, cb);
112117
}
113118

@@ -690,6 +695,77 @@ static void free_pack_geometry(struct pack_geometry *geometry)
690695
free(geometry->pack);
691696
}
692697

698+
static int midx_has_unknown_packs(char **midx_pack_names,
699+
size_t midx_pack_names_nr,
700+
struct string_list *include,
701+
struct pack_geometry *geometry,
702+
struct existing_packs *existing)
703+
{
704+
size_t i;
705+
706+
string_list_sort(include);
707+
708+
for (i = 0; i < midx_pack_names_nr; i++) {
709+
const char *pack_name = midx_pack_names[i];
710+
711+
/*
712+
* Determine whether or not each MIDX'd pack from the existing
713+
* MIDX (if any) is represented in the new MIDX. For each pack
714+
* in the MIDX, it must either be:
715+
*
716+
* - In the "include" list of packs to be included in the new
717+
* MIDX. Note this function is called before the include
718+
* list is populated with any cruft pack(s).
719+
*
720+
* - Below the geometric split line (if using pack geometry),
721+
* indicating that the pack won't be included in the new
722+
* MIDX, but its contents were rolled up as part of the
723+
* geometric repack.
724+
*
725+
* - In the existing non-kept packs list (if not using pack
726+
* geometry), and marked as non-deleted.
727+
*/
728+
if (string_list_has_string(include, pack_name)) {
729+
continue;
730+
} else if (geometry) {
731+
struct strbuf buf = STRBUF_INIT;
732+
uint32_t j;
733+
734+
for (j = 0; j < geometry->split; j++) {
735+
strbuf_reset(&buf);
736+
strbuf_addstr(&buf, pack_basename(geometry->pack[j]));
737+
strbuf_strip_suffix(&buf, ".pack");
738+
strbuf_addstr(&buf, ".idx");
739+
740+
if (!strcmp(pack_name, buf.buf)) {
741+
strbuf_release(&buf);
742+
break;
743+
}
744+
}
745+
746+
strbuf_release(&buf);
747+
748+
if (j < geometry->split)
749+
continue;
750+
} else {
751+
struct string_list_item *item;
752+
753+
item = string_list_lookup(&existing->non_kept_packs,
754+
pack_name);
755+
if (item && !pack_is_marked_for_deletion(item))
756+
continue;
757+
}
758+
759+
/*
760+
* If we got to this point, the MIDX includes some pack that we
761+
* don't know about.
762+
*/
763+
return 1;
764+
}
765+
766+
return 0;
767+
}
768+
693769
struct midx_snapshot_ref_data {
694770
struct tempfile *f;
695771
struct oidset seen;
@@ -758,6 +834,8 @@ static void midx_snapshot_refs(struct tempfile *f)
758834

759835
static void midx_included_packs(struct string_list *include,
760836
struct existing_packs *existing,
837+
char **midx_pack_names,
838+
size_t midx_pack_names_nr,
761839
struct string_list *names,
762840
struct pack_geometry *geometry)
763841
{
@@ -811,26 +889,56 @@ static void midx_included_packs(struct string_list *include,
811889
}
812890
}
813891

814-
for_each_string_list_item(item, &existing->cruft_packs) {
892+
if (midx_must_contain_cruft ||
893+
midx_has_unknown_packs(midx_pack_names, midx_pack_names_nr,
894+
include, geometry, existing)) {
815895
/*
816-
* When doing a --geometric repack, there is no need to check
817-
* for deleted packs, since we're by definition not doing an
818-
* ALL_INTO_ONE repack (hence no packs will be deleted).
819-
* Otherwise we must check for and exclude any packs which are
820-
* enqueued for deletion.
896+
* If there are one or more unknown pack(s) present (see
897+
* midx_has_unknown_packs() for what makes a pack
898+
* "unknown") in the MIDX before the repack, keep them
899+
* as they may be required to form a reachability
900+
* closure if the MIDX is bitmapped.
821901
*
822-
* So we could omit the conditional below in the --geometric
823-
* case, but doing so is unnecessary since no packs are marked
824-
* as pending deletion (since we only call
825-
* `mark_packs_for_deletion()` when doing an all-into-one
826-
* repack).
902+
* For example, a cruft pack can be required to form a
903+
* reachability closure if the MIDX is bitmapped and one
904+
* or more of the bitmap's selected commits reaches a
905+
* once-cruft object that was later made reachable.
827906
*/
828-
if (pack_is_marked_for_deletion(item))
829-
continue;
907+
for_each_string_list_item(item, &existing->cruft_packs) {
908+
/*
909+
* When doing a --geometric repack, there is no
910+
* need to check for deleted packs, since we're
911+
* by definition not doing an ALL_INTO_ONE
912+
* repack (hence no packs will be deleted).
913+
* Otherwise we must check for and exclude any
914+
* packs which are enqueued for deletion.
915+
*
916+
* So we could omit the conditional below in the
917+
* --geometric case, but doing so is unnecessary
918+
* since no packs are marked as pending
919+
* deletion (since we only call
920+
* `mark_packs_for_deletion()` when doing an
921+
* all-into-one repack).
922+
*/
923+
if (pack_is_marked_for_deletion(item))
924+
continue;
830925

831-
strbuf_reset(&buf);
832-
strbuf_addf(&buf, "%s.idx", item->string);
833-
string_list_insert(include, buf.buf);
926+
strbuf_reset(&buf);
927+
strbuf_addf(&buf, "%s.idx", item->string);
928+
string_list_insert(include, buf.buf);
929+
}
930+
} else {
931+
/*
932+
* Modern versions of Git (with the appropriate
933+
* configuration setting) will write new copies of
934+
* once-cruft objects when doing a --geometric repack.
935+
*
936+
* If the MIDX has no cruft pack, new packs written
937+
* during a --geometric repack will not rely on the
938+
* cruft pack to form a reachability closure, so we can
939+
* avoid including them in the MIDX in that case.
940+
*/
941+
;
834942
}
835943

836944
strbuf_release(&buf);
@@ -1145,6 +1253,8 @@ int cmd_repack(int argc,
11451253
struct tempfile *refs_snapshot = NULL;
11461254
int i, ext, ret;
11471255
int show_progress;
1256+
char **midx_pack_names = NULL;
1257+
size_t midx_pack_names_nr = 0;
11481258

11491259
/* variables to be filled by option parsing */
11501260
int delete_redundant = 0;
@@ -1361,7 +1471,10 @@ int cmd_repack(int argc,
13611471
!(pack_everything & PACK_CRUFT))
13621472
strvec_push(&cmd.args, "--pack-loose-unreachable");
13631473
} else if (geometry.split_factor) {
1364-
strvec_push(&cmd.args, "--stdin-packs");
1474+
if (midx_must_contain_cruft)
1475+
strvec_push(&cmd.args, "--stdin-packs");
1476+
else
1477+
strvec_push(&cmd.args, "--stdin-packs=follow");
13651478
strvec_push(&cmd.args, "--unpacked");
13661479
} else {
13671480
strvec_push(&cmd.args, "--unpacked");
@@ -1401,8 +1514,25 @@ int cmd_repack(int argc,
14011514
if (ret)
14021515
goto cleanup;
14031516

1404-
if (!names.nr && !po_args.quiet)
1405-
printf_ln(_("Nothing new to pack."));
1517+
if (!names.nr) {
1518+
if (!po_args.quiet)
1519+
printf_ln(_("Nothing new to pack."));
1520+
/*
1521+
* If we didn't write any new packs, the non-cruft packs
1522+
* may refer to once-unreachable objects in the cruft
1523+
* pack(s).
1524+
*
1525+
* If there isn't already a MIDX, the one we write
1526+
* must include the cruft pack(s), in case the
1527+
* non-cruft pack(s) refer to once-cruft objects.
1528+
*
1529+
* If there is already a MIDX, we can punt here, since
1530+
* midx_has_unknown_packs() will make the decision for
1531+
* us.
1532+
*/
1533+
if (!get_local_multi_pack_index(the_repository))
1534+
midx_must_contain_cruft = 1;
1535+
}
14061536

14071537
if (pack_everything & PACK_CRUFT) {
14081538
const char *pack_prefix = find_pack_prefix(packdir, packtmp);
@@ -1483,6 +1613,19 @@ int cmd_repack(int argc,
14831613

14841614
string_list_sort(&names);
14851615

1616+
if (get_local_multi_pack_index(the_repository)) {
1617+
struct multi_pack_index *m =
1618+
get_local_multi_pack_index(the_repository);
1619+
1620+
ALLOC_ARRAY(midx_pack_names,
1621+
m->num_packs + m->num_packs_in_base);
1622+
1623+
for (; m; m = m->base_midx)
1624+
for (uint32_t i = 0; i < m->num_packs; i++)
1625+
midx_pack_names[midx_pack_names_nr++] =
1626+
xstrdup(m->pack_names[i]);
1627+
}
1628+
14861629
close_object_store(the_repository->objects);
14871630

14881631
/*
@@ -1524,7 +1667,8 @@ int cmd_repack(int argc,
15241667

15251668
if (write_midx) {
15261669
struct string_list include = STRING_LIST_INIT_DUP;
1527-
midx_included_packs(&include, &existing, &names, &geometry);
1670+
midx_included_packs(&include, &existing, midx_pack_names,
1671+
midx_pack_names_nr, &names, &geometry);
15281672

15291673
ret = write_midx_included_packs(&include, &geometry, &names,
15301674
refs_snapshot ? get_tempfile_path(refs_snapshot) : NULL,
@@ -1575,6 +1719,9 @@ int cmd_repack(int argc,
15751719
string_list_clear(&names, 1);
15761720
existing_packs_release(&existing);
15771721
free_pack_geometry(&geometry);
1722+
for (size_t i = 0; i < midx_pack_names_nr; i++)
1723+
free(midx_pack_names[i]);
1724+
free(midx_pack_names);
15781725
pack_objects_args_release(&po_args);
15791726
pack_objects_args_release(&cruft_po_args);
15801727

0 commit comments

Comments
 (0)