Skip to content

Commit 5c16683

Browse files
ttaylorrgitster
authored andcommitted
repack: exclude cruft pack(s) from the MIDX where possible
In ddee370 (builtin/repack.c: add cruft packs to MIDX during geometric repack, 2022-05-20), repack began adding cruft pack(s) to the MIDX with '--write-midx' to ensure that the resulting MIDX was always closed under reachability in order to generate reachability bitmaps. While the previous patch added the '--stdin-packs=follow' option to pack-objects, it is not yet on by default. Given that, suppose you have a once-unreachable object packed in a cruft pack, which later becomes reachable from one or more objects in a geometrically repacked pack. That once-unreachable object *won't* appear in the new pack, since the cruft pack was not specified as included or excluded when the geometrically repacked pack was created with 'pack-objects --stdin-packs' (*not* '--stdin-packs=follow', which is not on). If that new pack is included in a MIDX without the cruft pack, then trying to generate bitmaps for that MIDX may fail. This happens when the bitmap selection process picks one or more commits which reach the once-unreachable objects. To mitigate this failure mode, commit ddee370 ensures that the MIDX will be closed under reachability by including cruft pack(s). If cruft pack(s) were not included, we would fail to generate a MIDX bitmap. But ddee370 alludes to the fact that this is sub-optimal by saying [...] it's desirable to avoid including cruft packs in the MIDX because it causes the MIDX to store a bunch of objects which are likely to get thrown away. , which is true, but hides an even larger problem. If repositories rarely prune their unreachable objects and/or have many of them, the MIDX must keep track of a large number of objects which bloats the MIDX and slows down object lookup. This is doubly unfortunate because the vast majority of objects in cruft pack(s) are unlikely to be read. But any object lookups that go through the MIDX must binary search over them anyway, slowing down object lookups using the MIDX. This patch causes geometrically-repacked packs to contain a copy of any once-unreachable object(s) with 'git pack-objects --stdin-packs=follow', allowing us to avoid including any cruft packs in the MIDX. This is because a sequence of geometrically-repacked packs that were all generated with '--stdin-packs=follow' are guaranteed to have their union be closed under reachability. Note that you cannot guarantee that a collection of packs is closed under reachability if not all of them were generated with "following" as above. One tell-tale sign that not all geometrically-repacked packs in the MIDX were generated with "following" is to see if there is a pack in the existing MIDX that is not going to be somehow represented (either verbatim or as part of a geometric rollup) in the new MIDX. If there is, then starting to generate packs with "following" during geometric repacking won't work, since it's open to the same race as described above. But if you're starting from scratch (e.g., building the first MIDX after an all-into-one '--cruft' repack), then you can guarantee that the union of subsequently generated packs from geometric repacking *is* closed under reachability. Detect when this is the case and avoid including cruft packs in the MIDX where possible. The existing behavior remains the default, and the new behavior is available with the config 'repack.midxMustIncludeCruft' set to 'false'. Signed-off-by: Taylor Blau <[email protected]> Signed-off-by: Junio C Hamano <[email protected]>
1 parent 27f9425 commit 5c16683

File tree

3 files changed

+242
-18
lines changed

3 files changed

+242
-18
lines changed

Documentation/config/repack.adoc

Lines changed: 7 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -39,3 +39,10 @@ repack.cruftThreads::
3939
a cruft pack and the respective parameters are not given over
4040
the command line. See similarly named `pack.*` configuration
4141
variables for defaults and meaning.
42+
43+
repack.midxMustContainCruft::
44+
When set to true, linkgit:git-repack[1] will unconditionally include
45+
cruft pack(s), if any, in the multi-pack index when invoked with
46+
`--write-midx`. When false, cruft packs are only included in the MIDX
47+
when necessary (e.g., because they might be required to form a
48+
reachability closure with MIDX bitmaps). Defaults to true.

builtin/repack.c

Lines changed: 145 additions & 18 deletions
Original file line numberDiff line numberDiff line change
@@ -39,6 +39,7 @@ static int write_bitmaps = -1;
3939
static int use_delta_islands;
4040
static int run_update_server_info = 1;
4141
static char *packdir, *packtmp_name, *packtmp;
42+
static int midx_must_contain_cruft = 1;
4243

4344
static const char *const git_repack_usage[] = {
4445
N_("git repack [-a] [-A] [-d] [-f] [-F] [-l] [-n] [-q] [-b] [-m]\n"
@@ -107,6 +108,10 @@ static int repack_config(const char *var, const char *value,
107108
free(cruft_po_args->threads);
108109
return git_config_string(&cruft_po_args->threads, var, value);
109110
}
111+
if (!strcmp(var, "repack.midxmustcontaincruft")) {
112+
midx_must_contain_cruft = git_config_bool(var, value);
113+
return 0;
114+
}
110115
return git_default_config(var, value, ctx, cb);
111116
}
112117

@@ -687,6 +692,77 @@ static void free_pack_geometry(struct pack_geometry *geometry)
687692
free(geometry->pack);
688693
}
689694

695+
static int midx_has_unknown_packs(char **midx_pack_names,
696+
size_t midx_pack_names_nr,
697+
struct string_list *include,
698+
struct pack_geometry *geometry,
699+
struct existing_packs *existing)
700+
{
701+
size_t i;
702+
703+
string_list_sort(include);
704+
705+
for (i = 0; i < midx_pack_names_nr; i++) {
706+
const char *pack_name = midx_pack_names[i];
707+
708+
/*
709+
* Determine whether or not each MIDX'd pack from the existing
710+
* MIDX (if any) is represented in the new MIDX. For each pack
711+
* in the MIDX, it must either be:
712+
*
713+
* - In the "include" list of packs to be included in the new
714+
* MIDX. Note this function is called before the include
715+
* list is populated with any cruft pack(s).
716+
*
717+
* - Below the geometric split line (if using pack geometry),
718+
* indicating that the pack won't be included in the new
719+
* MIDX, but its contents were rolled up as part of the
720+
* geometric repack.
721+
*
722+
* - In the existing non-kept packs list (if not using pack
723+
* geometry), and marked as non-deleted.
724+
*/
725+
if (string_list_has_string(include, pack_name)) {
726+
continue;
727+
} else if (geometry) {
728+
struct strbuf buf = STRBUF_INIT;
729+
uint32_t j;
730+
731+
for (j = 0; j < geometry->split; j++) {
732+
strbuf_reset(&buf);
733+
strbuf_addstr(&buf, pack_basename(geometry->pack[j]));
734+
strbuf_strip_suffix(&buf, ".pack");
735+
strbuf_addstr(&buf, ".idx");
736+
737+
if (!strcmp(pack_name, buf.buf)) {
738+
strbuf_release(&buf);
739+
break;
740+
}
741+
}
742+
743+
strbuf_release(&buf);
744+
745+
if (j < geometry->split)
746+
continue;
747+
} else {
748+
struct string_list_item *item;
749+
750+
item = string_list_lookup(&existing->non_kept_packs,
751+
pack_name);
752+
if (item && !pack_is_marked_for_deletion(item))
753+
continue;
754+
}
755+
756+
/*
757+
* If we got to this point, the MIDX includes some pack that we
758+
* don't know about.
759+
*/
760+
return 1;
761+
}
762+
763+
return 0;
764+
}
765+
690766
struct midx_snapshot_ref_data {
691767
struct tempfile *f;
692768
struct oidset seen;
@@ -755,6 +831,8 @@ static void midx_snapshot_refs(struct tempfile *f)
755831

756832
static void midx_included_packs(struct string_list *include,
757833
struct existing_packs *existing,
834+
char **midx_pack_names,
835+
size_t midx_pack_names_nr,
758836
struct string_list *names,
759837
struct pack_geometry *geometry)
760838
{
@@ -808,26 +886,56 @@ static void midx_included_packs(struct string_list *include,
808886
}
809887
}
810888

811-
for_each_string_list_item(item, &existing->cruft_packs) {
889+
if (midx_must_contain_cruft ||
890+
midx_has_unknown_packs(midx_pack_names, midx_pack_names_nr,
891+
include, geometry, existing)) {
812892
/*
813-
* When doing a --geometric repack, there is no need to check
814-
* for deleted packs, since we're by definition not doing an
815-
* ALL_INTO_ONE repack (hence no packs will be deleted).
816-
* Otherwise we must check for and exclude any packs which are
817-
* enqueued for deletion.
893+
* If there are one or more unknown pack(s) present (see
894+
* midx_has_unknown_packs() for what makes a pack
895+
* "unknown") in the MIDX before the repack, keep them
896+
* as they may be required to form a reachability
897+
* closure if the MIDX is bitmapped.
818898
*
819-
* So we could omit the conditional below in the --geometric
820-
* case, but doing so is unnecessary since no packs are marked
821-
* as pending deletion (since we only call
822-
* `mark_packs_for_deletion()` when doing an all-into-one
823-
* repack).
899+
* For example, a cruft pack can be required to form a
900+
* reachability closure if the MIDX is bitmapped and one
901+
* or more of the bitmap's selected commits reaches a
902+
* once-cruft object that was later made reachable.
824903
*/
825-
if (pack_is_marked_for_deletion(item))
826-
continue;
904+
for_each_string_list_item(item, &existing->cruft_packs) {
905+
/*
906+
* When doing a --geometric repack, there is no
907+
* need to check for deleted packs, since we're
908+
* by definition not doing an ALL_INTO_ONE
909+
* repack (hence no packs will be deleted).
910+
* Otherwise we must check for and exclude any
911+
* packs which are enqueued for deletion.
912+
*
913+
* So we could omit the conditional below in the
914+
* --geometric case, but doing so is unnecessary
915+
* since no packs are marked as pending
916+
* deletion (since we only call
917+
* `mark_packs_for_deletion()` when doing an
918+
* all-into-one repack).
919+
*/
920+
if (pack_is_marked_for_deletion(item))
921+
continue;
827922

828-
strbuf_reset(&buf);
829-
strbuf_addf(&buf, "%s.idx", item->string);
830-
string_list_insert(include, buf.buf);
923+
strbuf_reset(&buf);
924+
strbuf_addf(&buf, "%s.idx", item->string);
925+
string_list_insert(include, buf.buf);
926+
}
927+
} else {
928+
/*
929+
* Modern versions of Git (with the appropriate
930+
* configuration setting) will write new copies of
931+
* once-cruft objects when doing a --geometric repack.
932+
*
933+
* If the MIDX has no cruft pack, new packs written
934+
* during a --geometric repack will not rely on the
935+
* cruft pack to form a reachability closure, so we can
936+
* avoid including them in the MIDX in that case.
937+
*/
938+
;
831939
}
832940

833941
strbuf_release(&buf);
@@ -1142,6 +1250,8 @@ int cmd_repack(int argc,
11421250
struct tempfile *refs_snapshot = NULL;
11431251
int i, ext, ret;
11441252
int show_progress;
1253+
char **midx_pack_names = NULL;
1254+
size_t midx_pack_names_nr = 0;
11451255

11461256
/* variables to be filled by option parsing */
11471257
int delete_redundant = 0;
@@ -1356,7 +1466,10 @@ int cmd_repack(int argc,
13561466
!(pack_everything & PACK_CRUFT))
13571467
strvec_push(&cmd.args, "--pack-loose-unreachable");
13581468
} else if (geometry.split_factor) {
1359-
strvec_push(&cmd.args, "--stdin-packs");
1469+
if (midx_must_contain_cruft)
1470+
strvec_push(&cmd.args, "--stdin-packs");
1471+
else
1472+
strvec_push(&cmd.args, "--stdin-packs=follow");
13601473
strvec_push(&cmd.args, "--unpacked");
13611474
} else {
13621475
strvec_push(&cmd.args, "--unpacked");
@@ -1478,6 +1591,16 @@ int cmd_repack(int argc,
14781591

14791592
string_list_sort(&names);
14801593

1594+
if (get_local_multi_pack_index(the_repository)) {
1595+
uint32_t i;
1596+
struct multi_pack_index *m =
1597+
get_local_multi_pack_index(the_repository);
1598+
1599+
ALLOC_ARRAY(midx_pack_names, m->num_packs);
1600+
for (i = 0; i < m->num_packs; i++)
1601+
midx_pack_names[midx_pack_names_nr++] = xstrdup(m->pack_names[i]);
1602+
}
1603+
14811604
close_object_store(the_repository->objects);
14821605

14831606
/*
@@ -1519,7 +1642,8 @@ int cmd_repack(int argc,
15191642

15201643
if (write_midx) {
15211644
struct string_list include = STRING_LIST_INIT_DUP;
1522-
midx_included_packs(&include, &existing, &names, &geometry);
1645+
midx_included_packs(&include, &existing, midx_pack_names,
1646+
midx_pack_names_nr, &names, &geometry);
15231647

15241648
ret = write_midx_included_packs(&include, &geometry, &names,
15251649
refs_snapshot ? get_tempfile_path(refs_snapshot) : NULL,
@@ -1570,6 +1694,9 @@ int cmd_repack(int argc,
15701694
string_list_clear(&names, 1);
15711695
existing_packs_release(&existing);
15721696
free_pack_geometry(&geometry);
1697+
for (size_t i = 0; i < midx_pack_names_nr; i++)
1698+
free(midx_pack_names[i]);
1699+
free(midx_pack_names);
15731700
pack_objects_args_release(&po_args);
15741701
pack_objects_args_release(&cruft_po_args);
15751702

t/t7704-repack-cruft.sh

Lines changed: 90 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -724,4 +724,94 @@ test_expect_success 'cruft repack respects --quiet' '
724724
)
725725
'
726726

727+
setup_cruft_exclude_tests() {
728+
git init "$1" &&
729+
(
730+
cd "$1" &&
731+
732+
git config repack.midxMustContainCruft false &&
733+
734+
test_commit one &&
735+
736+
test_commit --no-tag two &&
737+
two="$(git rev-parse HEAD)" &&
738+
test_commit --no-tag three &&
739+
three="$(git rev-parse HEAD)" &&
740+
git reset --hard one &&
741+
git reflog expire --all --expire=all &&
742+
743+
GIT_TEST_MULTI_PACK_INDEX=0 git repack --cruft -d &&
744+
745+
git merge $two &&
746+
test_commit four
747+
)
748+
}
749+
750+
test_expect_success 'repack --write-midx excludes cruft where possible' '
751+
setup_cruft_exclude_tests exclude-cruft-when-possible &&
752+
(
753+
cd exclude-cruft-when-possible &&
754+
755+
GIT_TEST_MULTI_PACK_INDEX=0 \
756+
git repack -d --geometric=2 --write-midx --write-bitmap-index &&
757+
758+
test-tool read-midx --show-objects $objdir >midx &&
759+
cruft="$(ls $packdir/*.mtimes)" &&
760+
test_grep ! "$(basename "$cruft" .mtimes).idx" midx &&
761+
762+
git rev-list --all --objects --no-object-names >reachable.raw &&
763+
sort reachable.raw >reachable.objects &&
764+
awk "/\.pack$/ { print \$1 }" <midx | sort >midx.objects &&
765+
766+
test_cmp reachable.objects midx.objects
767+
)
768+
'
769+
770+
test_expect_success 'repack --write-midx includes cruft when instructed' '
771+
setup_cruft_exclude_tests exclude-cruft-when-instructed &&
772+
(
773+
cd exclude-cruft-when-instructed &&
774+
775+
GIT_TEST_MULTI_PACK_INDEX=0 \
776+
git -c repack.midxMustContainCruft=true repack \
777+
-d --geometric=2 --write-midx --write-bitmap-index &&
778+
779+
test-tool read-midx --show-objects $objdir >midx &&
780+
cruft="$(ls $packdir/*.mtimes)" &&
781+
test_grep "$(basename "$cruft" .mtimes).idx" midx &&
782+
783+
git cat-file --batch-check="%(objectname)" --batch-all-objects \
784+
>all.objects &&
785+
awk "/\.pack$/ { print \$1 }" <midx | sort >midx.objects &&
786+
787+
test_cmp all.objects midx.objects
788+
)
789+
'
790+
791+
test_expect_success 'repack --write-midx includes cruft when necessary' '
792+
setup_cruft_exclude_tests exclude-cruft-when-necessary &&
793+
(
794+
cd exclude-cruft-when-necessary &&
795+
796+
test_path_is_file $(ls $packdir/pack-*.mtimes) &&
797+
ls $packdir/pack-*.idx | sort >packs.all &&
798+
grep -o "pack-.*\.idx$" packs.all >in &&
799+
800+
git multi-pack-index write --stdin-packs --bitmap <in &&
801+
802+
test_commit five &&
803+
GIT_TEST_MULTI_PACK_INDEX=0 \
804+
git repack -d --geometric=2 --write-midx --write-bitmap-index &&
805+
806+
test-tool read-midx --show-objects $objdir >midx &&
807+
awk "/\.pack$/ { print \$1 }" <midx | sort >midx.objects &&
808+
git cat-file --batch-all-objects --batch-check="%(objectname)" \
809+
>expect.objects &&
810+
test_cmp expect.objects midx.objects &&
811+
812+
grep "^pack-" midx >midx.packs &&
813+
test_line_count = "$(($(wc -l <packs.all) + 1))" midx.packs
814+
)
815+
'
816+
727817
test_done

0 commit comments

Comments
 (0)