
Commit aaf8122

chiyutianyi authored and gitster committed
unpack-objects: use stream_loose_object() to unpack large objects
Make use of the stream_loose_object() function introduced in the
preceding commit to unpack large objects. Before this we'd need to
malloc() the size of the blob before unpacking it, which could cause
OOM with very large blobs.

We could use the new streaming interface to unpack all blobs, but doing
so would be much slower, as demonstrated e.g. with this benchmark using
git-hyperfine[0]:

	rm -rf /tmp/scalar.git &&
	git clone --bare https://github.com/Microsoft/scalar.git /tmp/scalar.git &&
	mv /tmp/scalar.git/objects/pack/*.pack /tmp/scalar.git/my.pack &&
	git hyperfine \
		-r 2 --warmup 1 \
		-L rev origin/master,HEAD -L v "10,512,1k,1m" \
		-s 'make' \
		-p 'git init --bare dest.git' \
		-c 'rm -rf dest.git' \
		'./git -C dest.git -c core.bigFileThreshold={v} unpack-objects </tmp/scalar.git/my.pack'

Here we'll perform worse with lower core.bigFileThreshold settings with
this change in terms of speed, but we're getting lower memory use in
return:

	Summary
	  './git -C dest.git -c core.bigFileThreshold=10 unpack-objects </tmp/scalar.git/my.pack' in 'origin/master' ran
	    1.01 ± 0.01 times faster than './git -C dest.git -c core.bigFileThreshold=1k unpack-objects </tmp/scalar.git/my.pack' in 'origin/master'
	    1.01 ± 0.01 times faster than './git -C dest.git -c core.bigFileThreshold=1m unpack-objects </tmp/scalar.git/my.pack' in 'origin/master'
	    1.01 ± 0.02 times faster than './git -C dest.git -c core.bigFileThreshold=1m unpack-objects </tmp/scalar.git/my.pack' in 'HEAD'
	    1.02 ± 0.00 times faster than './git -C dest.git -c core.bigFileThreshold=512 unpack-objects </tmp/scalar.git/my.pack' in 'origin/master'
	    1.09 ± 0.01 times faster than './git -C dest.git -c core.bigFileThreshold=1k unpack-objects </tmp/scalar.git/my.pack' in 'HEAD'
	    1.10 ± 0.00 times faster than './git -C dest.git -c core.bigFileThreshold=512 unpack-objects </tmp/scalar.git/my.pack' in 'HEAD'
	    1.11 ± 0.00 times faster than './git -C dest.git -c core.bigFileThreshold=10 unpack-objects </tmp/scalar.git/my.pack' in 'HEAD'

A better benchmark for demonstrating the benefit of this change is this
one, which creates an artificial repo with a 1, 25, 50, 75 and 100MB
blob:

	rm -rf /tmp/repo &&
	git init /tmp/repo &&
	(
		cd /tmp/repo &&
		for i in 1 25 50 75 100
		do
			dd if=/dev/urandom of=blob.$i count=$(($i*1024)) bs=1024
		done &&
		git add blob.* &&
		git commit -mblobs &&
		git gc &&
		PACK=$(echo .git/objects/pack/pack-*.pack) &&
		cp "$PACK" my.pack
	) &&
	git hyperfine \
		--show-output \
		-L rev origin/master,HEAD -L v "512,50m,100m" \
		-s 'make' \
		-p 'git init --bare dest.git' \
		-c 'rm -rf dest.git' \
		'/usr/bin/time -v ./git -C dest.git -c core.bigFileThreshold={v} unpack-objects </tmp/repo/my.pack 2>&1 | grep Maximum'

Using this test we'll always use >100MB of memory on origin/master
(around ~105MB), but max out at e.g. ~55MB if we set
core.bigFileThreshold=50m.

The relevant "Maximum resident set size" lines were manually added
below the relevant benchmark:

	'/usr/bin/time -v ./git -C dest.git -c core.bigFileThreshold=50m unpack-objects </tmp/repo/my.pack 2>&1 | grep Maximum' in 'origin/master' ran
		Maximum resident set size (kbytes): 107080
	    1.02 ± 0.78 times faster than '/usr/bin/time -v ./git -C dest.git -c core.bigFileThreshold=512 unpack-objects </tmp/repo/my.pack 2>&1 | grep Maximum' in 'origin/master'
		Maximum resident set size (kbytes): 106968
	    1.09 ± 0.79 times faster than '/usr/bin/time -v ./git -C dest.git -c core.bigFileThreshold=100m unpack-objects </tmp/repo/my.pack 2>&1 | grep Maximum' in 'origin/master'
		Maximum resident set size (kbytes): 107032
	    1.42 ± 1.07 times faster than '/usr/bin/time -v ./git -C dest.git -c core.bigFileThreshold=100m unpack-objects </tmp/repo/my.pack 2>&1 | grep Maximum' in 'HEAD'
		Maximum resident set size (kbytes): 107072
	    1.83 ± 1.02 times faster than '/usr/bin/time -v ./git -C dest.git -c core.bigFileThreshold=50m unpack-objects </tmp/repo/my.pack 2>&1 | grep Maximum' in 'HEAD'
		Maximum resident set size (kbytes): 55704
	    2.16 ± 1.19 times faster than '/usr/bin/time -v ./git -C dest.git -c core.bigFileThreshold=512 unpack-objects </tmp/repo/my.pack 2>&1 | grep Maximum' in 'HEAD'
		Maximum resident set size (kbytes): 4564

This shows that if you have enough memory this new streaming method is
slower the lower you set the streaming threshold, but the benefit is
more bounded memory use.

An earlier version of this patch introduced a new
"core.bigFileStreamingThreshold" instead of re-using the existing
"core.bigFileThreshold" variable[1]. As noted in a detailed overview of
its users in [2], that variable already has several different meanings.
Still, we consider it good enough to simply re-use it.

While it's possible that someone might want to e.g. consider objects
"small" for the purposes of diffing but "big" for the purposes of
writing them, such use-cases are probably too obscure to worry about.
We can always split up "core.bigFileThreshold" in the future if there's
a need for that.

0. https://github.com/avar/git-hyperfine/
1. https://lore.kernel.org/git/[email protected]/
2. https://lore.kernel.org/git/[email protected]/

Helped-by: Ævar Arnfjörð Bjarmason <[email protected]>
Helped-by: Derrick Stolee <[email protected]>
Helped-by: Jiang Xin <[email protected]>
Signed-off-by: Han Xin <[email protected]>
Signed-off-by: Ævar Arnfjörð Bjarmason <[email protected]>
Signed-off-by: Junio C Hamano <[email protected]>
Parent: 3c3ca0b

3 files changed, 109 insertions(+), 7 deletions(-)

Documentation/config/core.txt (2 additions, 2 deletions)

@@ -468,8 +468,8 @@ usage, at the slight expense of increased disk usage.
 * Will generally be streamed when written, which avoids excessive
 memory usage, at the cost of some fixed overhead. Commands that make
 use of this include linkgit:git-archive[1],
-linkgit:git-fast-import[1], linkgit:git-index-pack[1] and
-linkgit:git-fsck[1].
+linkgit:git-fast-import[1], linkgit:git-index-pack[1],
+linkgit:git-unpack-objects[1] and linkgit:git-fsck[1].
 
 core.excludesFile::
 	Specifies the pathname to the file that contains patterns to
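With this documentation change, core.bigFileThreshold also governs when
git-unpack-objects streams an object. As a purely illustrative config
fragment (the 50m value is made up, not a default):

```ini
# .git/config fragment: objects larger than 50 MiB are streamed
# rather than buffered wholesale by the commands listed above.
[core]
	bigFileThreshold = 50m
```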

builtin/unpack-objects.c (68 additions, 1 deletion)

@@ -351,6 +351,68 @@ static void unpack_non_delta_entry(enum object_type type, unsigned long size,
 	write_object(nr, type, buf, size);
 }
 
+struct input_zstream_data {
+	git_zstream *zstream;
+	unsigned char buf[8192];
+	int status;
+};
+
+static const void *feed_input_zstream(struct input_stream *in_stream,
+				      unsigned long *readlen)
+{
+	struct input_zstream_data *data = in_stream->data;
+	git_zstream *zstream = data->zstream;
+	void *in = fill(1);
+
+	if (in_stream->is_finished) {
+		*readlen = 0;
+		return NULL;
+	}
+
+	zstream->next_out = data->buf;
+	zstream->avail_out = sizeof(data->buf);
+	zstream->next_in = in;
+	zstream->avail_in = len;
+
+	data->status = git_inflate(zstream, 0);
+
+	in_stream->is_finished = data->status != Z_OK;
+	use(len - zstream->avail_in);
+	*readlen = sizeof(data->buf) - zstream->avail_out;
+
+	return data->buf;
+}
+
+static void stream_blob(unsigned long size, unsigned nr)
+{
+	git_zstream zstream = { 0 };
+	struct input_zstream_data data = { 0 };
+	struct input_stream in_stream = {
+		.read = feed_input_zstream,
+		.data = &data,
+	};
+	struct obj_info *info = &obj_list[nr];
+
+	data.zstream = &zstream;
+	git_inflate_init(&zstream);
+
+	if (stream_loose_object(&in_stream, size, &info->oid))
+		die(_("failed to write object in stream"));
+
+	if (data.status != Z_STREAM_END)
+		die(_("inflate returned (%d)"), data.status);
+	git_inflate_end(&zstream);
+
+	if (strict) {
+		struct blob *blob = lookup_blob(the_repository, &info->oid);
+
+		if (!blob)
+			die(_("invalid blob object from stream"));
+		blob->object.flags |= FLAG_WRITTEN;
+	}
+	info->obj = NULL;
+}
+
 static int resolve_against_held(unsigned nr, const struct object_id *base,
 				void *delta_data, unsigned long delta_size)
 {
@@ -483,9 +545,14 @@ static void unpack_one(unsigned nr)
 	}
 
 	switch (type) {
+	case OBJ_BLOB:
+		if (!dry_run && size > big_file_threshold) {
+			stream_blob(size, nr);
+			return;
+		}
+		/* fallthrough */
 	case OBJ_COMMIT:
 	case OBJ_TREE:
-	case OBJ_BLOB:
 	case OBJ_TAG:
 		unpack_non_delta_entry(type, size, nr);
 		return;

t/t5351-unpack-large-objects.sh (39 additions, 4 deletions)

@@ -9,15 +9,19 @@ test_description='git unpack-objects with large objects'
 
 prepare_dest () {
 	test_when_finished "rm -rf dest.git" &&
-	git init --bare dest.git
+	git init --bare dest.git &&
+	git -C dest.git config core.bigFileThreshold "$1"
 }
 
 test_expect_success "create large objects (1.5 MB) and PACK" '
 	test-tool genrandom foo 1500000 >big-blob &&
 	test_commit --append foo big-blob &&
 	test-tool genrandom bar 1500000 >big-blob &&
 	test_commit --append bar big-blob &&
-	PACK=$(echo HEAD | git pack-objects --revs pack)
+	PACK=$(echo HEAD | git pack-objects --revs pack) &&
+	git verify-pack -v pack-$PACK.pack >out &&
+	sed -n -e "s/^\([0-9a-f][0-9a-f]*\).*\(commit\|tree\|blob\).*/\1/p" \
+		<out >obj-list
 '
 
 test_expect_success 'set memory limitation to 1MB' '
@@ -26,16 +30,47 @@ test_expect_success 'set memory limitation to 1MB' '
 '
 
 test_expect_success 'unpack-objects failed under memory limitation' '
-	prepare_dest &&
+	prepare_dest 2m &&
 	test_must_fail git -C dest.git unpack-objects <pack-$PACK.pack 2>err &&
 	grep "fatal: attempting to allocate" err
 '
 
 test_expect_success 'unpack-objects works with memory limitation in dry-run mode' '
-	prepare_dest &&
+	prepare_dest 2m &&
 	git -C dest.git unpack-objects -n <pack-$PACK.pack &&
 	test_stdout_line_count = 0 find dest.git/objects -type f &&
 	test_dir_is_empty dest.git/objects/pack
 '
 
+test_expect_success 'unpack big object in stream' '
+	prepare_dest 1m &&
+	git -C dest.git unpack-objects <pack-$PACK.pack &&
+	test_dir_is_empty dest.git/objects/pack
+'
+
+BATCH_CONFIGURATION='-c core.fsync=loose-object -c core.fsyncmethod=batch'
+
+test_expect_success 'unpack big object in stream (core.fsyncmethod=batch)' '
+	prepare_dest 1m &&
+	GIT_TRACE2_EVENT="$(pwd)/trace2.txt" \
+	git -C dest.git $BATCH_CONFIGURATION unpack-objects <pack-$PACK.pack &&
+	grep fsync/hardware-flush trace2.txt &&
+	test_dir_is_empty dest.git/objects/pack &&
+	git -C dest.git cat-file --batch-check="%(objectname)" <obj-list >current &&
+	cmp obj-list current
+'
+
+test_expect_success 'do not unpack existing large objects' '
+	prepare_dest 1m &&
+	git -C dest.git index-pack --stdin <pack-$PACK.pack &&
+	git -C dest.git unpack-objects <pack-$PACK.pack &&
+
+	# The destination came up with the exact same pack...
+	DEST_PACK=$(echo dest.git/objects/pack/pack-*.pack) &&
+	test_cmp pack-$PACK.pack $DEST_PACK &&
+
+	# ...and wrote no loose objects
+	test_stdout_line_count = 0 find dest.git/objects -type f ! -name "pack-*"
+'
+
 test_done
