Commit 52fe41f

derrickstolee authored and gitster committed
maintenance: add incremental-repack task
The previous change cleaned up loose objects using the 'loose-objects' task, which can be run safely in the background. Add a similar job that performs similar cleanups for pack-files.

One issue with running 'git repack' is that it is designed to repack all pack-files into a single pack-file. While this is the most space-efficient way to store object data, it is not time or memory efficient. This becomes extremely important if the repo is so large that a user struggles to store two copies of the pack on their disk.

Instead, perform an "incremental" repack by collecting a few small pack-files into a new pack-file. The multi-pack-index facilitates this process ever since 'git multi-pack-index expire' was added in 19575c7 (multi-pack-index: implement 'expire' subcommand, 2019-06-10) and 'git multi-pack-index repack' was added in ce1e4a1 (midx: implement midx_repack(), 2019-06-10).

The 'incremental-repack' task runs the following steps:

1. 'git multi-pack-index write' creates a multi-pack-index file if one did not exist, and otherwise will update the multi-pack-index with any new pack-files that appeared since the last write. This is particularly relevant with the background fetch job.

   When the multi-pack-index sees two copies of the same object, it stores the offset data into the newer pack-file. This means that some old pack-files could become "unreferenced" which I will use to mean "a pack-file that is in the pack-file list of the multi-pack-index but none of the objects in the multi-pack-index reference a location inside that pack-file."

2. 'git multi-pack-index expire' deletes any unreferenced pack-files and updates the multi-pack-index to drop those pack-files from the list. This is safe to do as concurrent Git processes will see the multi-pack-index and not open those packs when looking for object contents. (Similar to the 'loose-objects' job, there are some Git commands that open pack-files regardless of the multi-pack-index, but they are rarely used. Further, a user that self-selects to use background operations would likely refrain from using those commands.)

3. 'git multi-pack-index repack --batch-size=<size>' collects a set of pack-files that are listed in the multi-pack-index and creates a new pack-file containing the objects whose offsets are listed by the multi-pack-index to be in those pack-files. The set of pack-files is selected greedily by sorting the pack-files by modified time and adding a pack-file to the set if its "expected size" is smaller than the batch size until the total expected size of the selected pack-files is at least the batch size. The "expected size" is calculated by taking the size of the pack-file divided by the number of objects in the pack-file and multiplied by the number of objects from the multi-pack-index with offset in that pack-file. The expected size approximates how much data from that pack-file will contribute to the resulting pack-file size. The intention is that the resulting pack-file will be close in size to the provided batch size.

   The next run of the incremental-repack task will delete these repacked pack-files during the 'expire' step.

   In this version, the batch size is set to "0" which ignores the size restrictions when selecting the pack-files. It instead selects all pack-files and repacks all packed objects into a single pack-file. This will be updated in the next change, but it requires doing some calculations that are better isolated to a separate change.

These steps are based on a similar background maintenance step in Scalar (and VFS for Git) [1]. This was incredibly effective for users of the Windows OS repository. After using the same VFS for Git repository for over a year, some users had _thousands_ of pack-files that combined to up to 250 GB of data. We noticed a few users were running into the open file descriptor limits (due in part to a bug in the multi-pack-index fixed by af96fe3 (midx: add packs to packed_git linked list, 2019-04-29)).

These pack-files were mostly small since they contained the commits and trees that were pushed to the origin in a given hour. The GVFS protocol includes a "prefetch" step that asks for pre-computed pack-files containing commits and trees by timestamp. These pack-files were grouped into "daily" pack-files once a day for up to 30 days. If a user did not request prefetch packs for over 30 days, then they would get the entire history of commits and trees in a new, large pack-file. This led to a large number of pack-files that had poor delta compression.

By running this pack-file maintenance step once per day, these repos with thousands of packs spanning 200+ GB dropped to dozens of pack-files spanning 30-50 GB. This was done all without removing objects from the system and using a constant batch size of two gigabytes. Once the work was done to reduce the pack-files to small sizes, the batch size of two gigabytes means that not every run triggers a repack operation, so the following run will not expire a pack-file. This has kept these repos in a "clean" state.

[1] https://github.com/microsoft/scalar/blob/master/Scalar.Common/Maintenance/PackfileMaintenanceStep.cs

Signed-off-by: Derrick Stolee <[email protected]>
Signed-off-by: Junio C Hamano <[email protected]>
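
The "expected size" formula and the greedy selection described in step 3 can be sketched in standalone C. This is only an illustration of the prose above, not the code in midx.c: the `pack_summary` struct, its field names, and both function names are hypothetical, and sorting oldest-first is an assumption (the message only says "by modified time").

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical per-pack summary; the field names are illustrative only. */
struct pack_summary {
	uint64_t size_bytes;    /* on-disk size of the pack-file */
	uint32_t total_objects; /* objects stored in the pack-file */
	uint32_t referenced;    /* objects the multi-pack-index resolves here */
	int64_t mtime;          /* modified time */
	int selected;           /* set by select_batch() */
};

/*
 * size / total_objects * referenced, computed as quotient and remainder
 * parts so the intermediate product cannot overflow 64 bits.
 */
static uint64_t expected_size(const struct pack_summary *p)
{
	assert(p->total_objects > 0 && p->referenced <= p->total_objects);
	return (p->size_bytes / p->total_objects) * p->referenced +
	       (p->size_bytes % p->total_objects) * p->referenced /
		       p->total_objects;
}

static int by_mtime(const void *a, const void *b)
{
	const struct pack_summary *pa = a, *pb = b;
	return (pa->mtime > pb->mtime) - (pa->mtime < pb->mtime);
}

/*
 * Sort by mtime (oldest first: an assumption), then add each pack whose
 * expected size is below batch_size until the running total reaches
 * batch_size. A batch_size of 0 selects every pack.
 * Returns the total expected size of the selected packs.
 */
static uint64_t select_batch(struct pack_summary *packs, size_t n,
			     uint64_t batch_size)
{
	uint64_t total = 0;
	size_t i;

	qsort(packs, n, sizeof(*packs), by_mtime);
	for (i = 0; i < n; i++) {
		uint64_t e = expected_size(&packs[i]);

		packs[i].selected = 0;
		if (batch_size == 0 ||
		    (total < batch_size && e < batch_size)) {
			packs[i].selected = 1;
			total += e;
		}
	}
	return total;
}
```

For example, a 1000-byte pack holding 10 objects, of which 5 are referenced by the multi-pack-index, has an expected size of 500 bytes, so it is a candidate for any batch size above 500.
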
1 parent efdd2f0 commit 52fe41f

4 files changed

+133
-0
lines changed

Documentation/git-maintenance.txt

Lines changed: 18 additions & 0 deletions
@@ -85,6 +85,24 @@ loose-objects::
 	advisable to enable both the `loose-objects` and `gc` tasks at the
 	same time.
 
+incremental-repack::
+	The `incremental-repack` job repacks the object directory
+	using the `multi-pack-index` feature. In order to prevent race
+	conditions with concurrent Git commands, it follows a two-step
+	process. First, it calls `git multi-pack-index expire` to delete
+	pack-files unreferenced by the `multi-pack-index` file. Second, it
+	calls `git multi-pack-index repack` to select several small
+	pack-files and repack them into a bigger one, and then update the
+	`multi-pack-index` entries that refer to the small pack-files to
+	refer to the new pack-file. This prepares those small pack-files
+	for deletion upon the next run of `git multi-pack-index expire`.
+	The selection of the small pack-files is such that the expected
+	size of the big pack-file is at least the batch size; see the
+	`--batch-size` option for the `repack` subcommand in
+	linkgit:git-multi-pack-index[1]. The default batch-size is zero,
+	which is a special case that attempts to repack all pack-files
+	into a single pack-file.
+
 OPTIONS
 -------
 --auto::
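
The two-step process documented above (expire first, then repack) means a repacked pack-file is only deleted on the following run. A toy model in C can make that visible. This is a sketch under stated assumptions: `toy_midx` and every function name here are hypothetical, the model tracks only reference counts rather than real objects, and the "fewer than two packs, nothing to combine" rule is inferred from the test comment in t/t7900-maintenance.sh rather than from the Git source.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_PACKS 16

/* Toy model: each slot is one pack-file listed in the multi-pack-index. */
struct toy_midx {
	unsigned referenced[MAX_PACKS]; /* objects the index resolves to this pack */
	int on_disk[MAX_PACKS];         /* 1 while the pack-file still exists */
	size_t nr;
};

static size_t packs_on_disk(const struct toy_midx *m)
{
	size_t i, count = 0;

	for (i = 0; i < m->nr; i++)
		if (m->on_disk[i])
			count++;
	return count;
}

/* 'expire': delete pack-files that no object offset points into. */
static void toy_expire(struct toy_midx *m)
{
	size_t i;

	for (i = 0; i < m->nr; i++)
		if (m->on_disk[i] && m->referenced[i] == 0)
			m->on_disk[i] = 0;
}

/*
 * 'repack' with batch size 0: write one new pack holding every referenced
 * object and repoint the index at it. The old pack-files stay on disk.
 */
static void toy_repack_all(struct toy_midx *m)
{
	size_t i, live = 0;
	unsigned moved = 0;

	for (i = 0; i < m->nr; i++)
		if (m->on_disk[i] && m->referenced[i])
			live++;
	if (live < 2)
		return; /* assumed: a single pack is left alone */

	assert(m->nr < MAX_PACKS);
	for (i = 0; i < m->nr; i++) {
		if (!m->on_disk[i])
			continue;
		moved += m->referenced[i];
		m->referenced[i] = 0; /* now unreferenced, but not deleted */
	}
	m->referenced[m->nr] = moved;
	m->on_disk[m->nr] = 1;
	m->nr++;
}

/* One run of the task in this model: expire, then repack. */
static void toy_incremental_repack(struct toy_midx *m)
{
	toy_expire(m);
	toy_repack_all(m);
}
```

Starting from three packs, the first run leaves four pack-files on disk (three unreferenced, one new) and the second run leaves one, which is the 3, 4, 1 sequence the 'incremental-repack task' test checks with `test_line_count`.
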

builtin/gc.c

Lines changed: 76 additions & 0 deletions
@@ -1001,6 +1001,77 @@ static int maintenance_task_loose_objects(struct maintenance_run_opts *opts)
 	return prune_packed(opts) || pack_loose(opts);
 }
 
+static int multi_pack_index_write(struct maintenance_run_opts *opts)
+{
+	struct child_process child = CHILD_PROCESS_INIT;
+
+	child.git_cmd = 1;
+	strvec_pushl(&child.args, "multi-pack-index", "write", NULL);
+
+	if (opts->quiet)
+		strvec_push(&child.args, "--no-progress");
+
+	if (run_command(&child))
+		return error(_("failed to write multi-pack-index"));
+
+	return 0;
+}
+
+static int multi_pack_index_expire(struct maintenance_run_opts *opts)
+{
+	struct child_process child = CHILD_PROCESS_INIT;
+
+	child.git_cmd = 1;
+	strvec_pushl(&child.args, "multi-pack-index", "expire", NULL);
+
+	if (opts->quiet)
+		strvec_push(&child.args, "--no-progress");
+
+	close_object_store(the_repository->objects);
+
+	if (run_command(&child))
+		return error(_("'git multi-pack-index expire' failed"));
+
+	return 0;
+}
+
+static int multi_pack_index_repack(struct maintenance_run_opts *opts)
+{
+	struct child_process child = CHILD_PROCESS_INIT;
+
+	child.git_cmd = 1;
+	strvec_pushl(&child.args, "multi-pack-index", "repack", NULL);
+
+	if (opts->quiet)
+		strvec_push(&child.args, "--no-progress");
+
+	strvec_push(&child.args, "--batch-size=0");
+
+	close_object_store(the_repository->objects);
+
+	if (run_command(&child))
+		return error(_("'git multi-pack-index repack' failed"));
+
+	return 0;
+}
+
+static int maintenance_task_incremental_repack(struct maintenance_run_opts *opts)
+{
+	prepare_repo_settings(the_repository);
+	if (!the_repository->settings.core_multi_pack_index) {
+		warning(_("skipping incremental-repack task because core.multiPackIndex is disabled"));
+		return 0;
+	}
+
+	if (multi_pack_index_write(opts))
+		return 1;
+	if (multi_pack_index_expire(opts))
+		return 1;
+	if (multi_pack_index_repack(opts))
+		return 1;
+	return 0;
+}
+
 typedef int maintenance_task_fn(struct maintenance_run_opts *opts);
 
 /*
@@ -1023,6 +1094,7 @@ struct maintenance_task {
 enum maintenance_task_label {
 	TASK_PREFETCH,
 	TASK_LOOSE_OBJECTS,
+	TASK_INCREMENTAL_REPACK,
 	TASK_GC,
 	TASK_COMMIT_GRAPH,
 
@@ -1040,6 +1112,10 @@ static struct maintenance_task tasks[] = {
 		maintenance_task_loose_objects,
 		loose_object_auto_condition,
 	},
+	[TASK_INCREMENTAL_REPACK] = {
+		"incremental-repack",
+		maintenance_task_incremental_repack,
+	},
 	[TASK_GC] = {
 		"gc",
 		maintenance_task_gc,
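
The control flow in `maintenance_task_incremental_repack()` is three child commands run in order, stopping at the first failure. The sketch below mirrors that shape outside of Git so it can be compiled on its own. It does not use Git's internal `child_process` or `strvec` API: `build_midx_argv`, `run_incremental_repack`, the `step_fn` callback type, and the `recording_runner` stand-in are all hypothetical names introduced for illustration. A real runner would spawn the command, for instance with fork() and execvp().

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_ARGS 8

/*
 * Fill argv with "git multi-pack-index <subcommand> [--no-progress] [extra]"
 * followed by a NULL terminator. Returns the argument count without the NULL.
 */
static size_t build_midx_argv(const char **argv, const char *subcommand,
			      int quiet, const char *extra)
{
	size_t n = 0;

	assert(subcommand);
	argv[n++] = "git";
	argv[n++] = "multi-pack-index";
	argv[n++] = subcommand;
	if (quiet)
		argv[n++] = "--no-progress";
	if (extra)
		argv[n++] = extra;
	argv[n] = NULL;
	return n;
}

/* Hypothetical runner: returns 0 on success, non-zero on failure. */
typedef int (*step_fn)(const char **argv);

/* Run write, expire, repack in order; return 1 at the first failure. */
static int run_incremental_repack(step_fn run, int quiet)
{
	static const struct {
		const char *subcommand;
		const char *extra;
	} steps[] = {
		{ "write", NULL },
		{ "expire", NULL },
		{ "repack", "--batch-size=0" },
	};
	const char *argv[MAX_ARGS];
	size_t i;

	for (i = 0; i < sizeof(steps) / sizeof(steps[0]); i++) {
		build_midx_argv(argv, steps[i].subcommand, quiet,
				steps[i].extra);
		if (run(argv))
			return 1;
	}
	return 0;
}

/* Stand-in runner for demonstration: logs the subcommand, spawns nothing. */
static char step_log[64];
static const char *fail_on; /* subcommand that should "fail", or NULL */

static int recording_runner(const char **argv)
{
	strcat(step_log, argv[2]);
	strcat(step_log, ";");
	return fail_on && !strcmp(argv[2], fail_on);
}
```

Passing the runner as a callback keeps the ordering logic testable without a repository; with `fail_on` set to "expire", the repack step is never attempted, matching the early `return 1` in the diff above.
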

t/t5319-multi-pack-index.sh

Lines changed: 1 addition & 0 deletions
@@ -3,6 +3,7 @@
 test_description='multi-pack-indexes'
 . ./test-lib.sh
 
+GIT_TEST_MULTI_PACK_INDEX=0
 objdir=.git/objects
 
 midx_read_expect () {

t/t7900-maintenance.sh

Lines changed: 38 additions & 0 deletions
@@ -5,6 +5,7 @@ test_description='git maintenance builtin'
 . ./test-lib.sh
 
 GIT_TEST_COMMIT_GRAPH=0
+GIT_TEST_MULTI_PACK_INDEX=0
 
 test_expect_success 'help text' '
 	test_expect_code 129 git maintenance -h 2>err &&
@@ -149,4 +150,41 @@ test_expect_success 'maintenance.loose-objects.auto' '
 	test_subcommand git prune-packed --quiet <trace-loC
 '
 
+test_expect_success 'incremental-repack task' '
+	packDir=.git/objects/pack &&
+	for i in $(test_seq 1 5)
+	do
+		test_commit $i || return 1
+	done &&
+
+	# Create three disjoint pack-files with size BIG, small, small.
+	echo HEAD~2 | git pack-objects --revs $packDir/test-1 &&
+	test_tick &&
+	git pack-objects --revs $packDir/test-2 <<-\EOF &&
+	HEAD~1
+	^HEAD~2
+	EOF
+	test_tick &&
+	git pack-objects --revs $packDir/test-3 <<-\EOF &&
+	HEAD
+	^HEAD~1
+	EOF
+	rm -f $packDir/pack-* &&
+	rm -f $packDir/loose-* &&
+	ls $packDir/*.pack >packs-before &&
+	test_line_count = 3 packs-before &&
+
+	# the job repacks the two into a new pack, but does not
+	# delete the old ones.
+	git maintenance run --task=incremental-repack &&
+	ls $packDir/*.pack >packs-between &&
+	test_line_count = 4 packs-between &&
+
+	# the job deletes the two old packs, and does not write
+	# a new one because only one pack remains.
+	git maintenance run --task=incremental-repack &&
+	ls .git/objects/pack/*.pack >packs-after &&
+	test_line_count = 1 packs-after
+'
+
 test_done
