Commit 171bdac

peff authored and gitster committed
sha1-lookup: handle duplicate keys with GIT_USE_LOOKUP
The sha1_entry_pos function tries to be smart about selecting the middle of a range for its binary search by looking at the value differences between the "lo" and "hi" constraints. However, it is unable to cope with entries with duplicate keys in the sorted list.

We may hit a point in the search where both our "lo" and "hi" point to the same key. In this case, the range of values between our endpoints is 0, and trying to scale the difference between our key and the endpoints over that range is undefined (i.e., a divide by zero). The current code catches this with an "assert(lov < hiv)". Moreover, after seeing that the first 20 bytes of the key are the same, we would go on to try to establish a value from the 21st byte, which is nonsensical.

Instead, we can detect the case that we are in a run of duplicates, and simply do a final comparison against any one of them (since they are all the same, it does not matter which). If the keys match, we have found our entry (or one of them, anyway). If not, then we know that we do not need to look further, as every entry in the run carries that same non-matching key.

Signed-off-by: Jeff King <[email protected]>
Acked-by: Nicolas Pitre <[email protected]>
Signed-off-by: Junio C Hamano <[email protected]>
1 parent 54c93cb commit 171bdac
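To make the divide-by-zero concrete, here is a toy model of the midpoint selection described in the message. It is a hypothetical sketch, not the code in sha1-lookup.c: the invented pick_mid_toy() uses a single byte of precision and does no clamping, while the real function uses up to two bytes and keeps "mi" strictly between "lo" and "hi".

/*
 * Toy model of the interpolated midpoint -- NOT the code from
 * sha1-lookup.c.  When lo_key and hi_key are identical (a run of
 * duplicate keys), ofs reaches 20 and hiv == lov, so the scaling step
 * below would divide by zero.  That is the case the patch below
 * short-circuits with a plain memcmp().
 */
#include <stdint.h>

static unsigned pick_mid_toy(const unsigned char *lo_key,
			     const unsigned char *hi_key,
			     const unsigned char *key,
			     unsigned lo, unsigned hi)
{
	unsigned ofs = 0;
	unsigned lov, hiv, kyv;

	while (ofs < 20 && lo_key[ofs] == hi_key[ofs])
		ofs++;
	if (ofs == 20)
		return lo;	/* duplicate run: any entry in it will do */

	lov = lo_key[ofs];
	hiv = hi_key[ofs];
	kyv = key[ofs];	/* invariant: lov <= kyv <= hiv, and lov < hiv here */

	/* scale the key's position in [lov, hiv] onto the index range [lo, hi] */
	return lo + (unsigned)((uint64_t)(kyv - lov) * (hi - lo) / (hiv - lov));
}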

File tree: 3 files changed, +198 -0 lines changed

sha1-lookup.c

Lines changed: 47 additions & 0 deletions
@@ -204,7 +204,54 @@ int sha1_entry_pos(const void *table,
 			 * byte 0 thru (ofs-1) are the same between
 			 * lo and hi; ofs is the first byte that is
 			 * different.
+			 *
+			 * If ofs==20, then no bytes are different,
+			 * meaning we have entries with duplicate
+			 * keys. We know that we are in a solid run
+			 * of this entry (because the entries are
+			 * sorted, and our lo and hi are the same,
+			 * there can be nothing but this single key
+			 * in between). So we can stop the search.
+			 * Either one of these entries is it (and
+			 * we do not care which), or we do not have
+			 * it.
+			 *
+			 * Furthermore, we know that one of our
+			 * endpoints must be the edge of the run of
+			 * duplicates. For example, given this
+			 * sequence:
+			 *
+			 *  idx 0 1 2 3 4 5
+			 *  key A C C C C D
+			 *
+			 * If we are searching for "B", we might
+			 * hit the duplicate run at lo=1, hi=3
+			 * (e.g., by first mi=3, then mi=0). But we
+			 * can never have lo > 1, because B < C.
+			 * That is, if our key is less than the
+			 * run, we know that "lo" is the edge, but
+			 * we can say nothing of "hi". Similarly,
+			 * if our key is greater than the run, we
+			 * know that "hi" is the edge, but we can
+			 * say nothing of "lo".
+			 *
+			 * Therefore if we do not find it, we also
+			 * know where it would go if it did exist:
+			 * just on the far side of the edge that we
+			 * know about.
 			 */
+			if (ofs == 20) {
+				mi = lo;
+				mi_key = base + elem_size * mi + key_offset;
+				cmp = memcmp(mi_key, key, 20);
+				if (!cmp)
+					return mi;
+				if (cmp < 0)
+					return -1 - hi;
+				else
+					return -1 - lo;
+			}
+
 			hiv = hi_key[ofs_0];
 			if (ofs_0 < 19)
 				hiv = (hiv << 8) | hi_key[ofs_0+1];
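For context on the new return values, here is a hypothetical caller sketch, not part of the commit. The sha1_entry_pos() prototype is reproduced from sha1-lookup.h of this era as an assumption, and the find_or_insertion_point() helper is invented for illustration.

/*
 * Hypothetical caller sketch (not part of this commit).  "records" is a
 * sorted table of "nr" fixed-size entries whose 20-byte keys sit at
 * offset 0; [lo, hi) is the slice to search (e.g., one fanout bucket).
 */
#include <stddef.h>

extern int sha1_entry_pos(const void *table,
			  size_t elem_size,
			  size_t key_offset,
			  unsigned lo, unsigned hi, unsigned nr,
			  const unsigned char *key);

static int find_or_insertion_point(const void *records, size_t elem_size,
				   unsigned nr, const unsigned char key[20],
				   unsigned *insert_at)
{
	int pos = sha1_entry_pos(records, elem_size, 0, 0, nr, nr, key);

	if (pos >= 0)
		return pos;		/* index of a matching entry */
	*insert_at = -1 - pos;		/* where the key would be inserted */
	return -1;
}

A non-negative result is the index of a matching entry (any one of a duplicate run); a negative result encodes the insertion point as -1 - pos, which is exactly what the new "return -1 - hi" and "return -1 - lo" branches produce.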

t/lib-pack.sh

Lines changed: 78 additions & 0 deletions
@@ -0,0 +1,78 @@
+#!/bin/sh
+#
+# Support routines for hand-crafting weird or malicious packs.
+#
+# You can make a complete pack like:
+#
+# pack_header 2 >foo.pack &&
+# pack_obj e69de29bb2d1d6434b8b29ae775ad8c2e48c5391 >>foo.pack &&
+# pack_obj e68fe8129b546b101aee9510c5328e7f21ca1d18 >>foo.pack &&
+# pack_trailer foo.pack
+
+# Print the big-endian 4-byte octal representation of $1
+uint32_octal () {
+	n=$1
+	printf '\\%o' $(($n / 16777216)); n=$(($n % 16777216))
+	printf '\\%o' $(($n / 65536));    n=$(($n % 65536))
+	printf '\\%o' $(($n / 256));      n=$(($n % 256))
+	printf '\\%o' $(($n));
+}
+
+# Print the big-endian 4-byte binary representation of $1
+uint32_binary () {
+	printf "$(uint32_octal "$1")"
+}
+
+# Print a pack header, version 2, for a pack with $1 objects
+pack_header () {
+	printf 'PACK' &&
+	printf '\0\0\0\2' &&
+	uint32_binary "$1"
+}
+
+# Print the pack data for object $1, as a delta against object $2 (or as a full
+# object if $2 is missing or empty). The output is suitable for including
+# directly in the packfile, and represents the entirety of the object entry.
+# Doing this on the fly (especially picking your deltas) is quite tricky, so we
+# have hardcoded some well-known objects. See the case statements below for the
+# complete list.
+pack_obj () {
+	case "$1" in
+	# empty blob
+	e69de29bb2d1d6434b8b29ae775ad8c2e48c5391)
+		case "$2" in
+		'')
+			printf '\060\170\234\003\0\0\0\0\1'
+			return
+			;;
+		esac
+		;;
+
+	# blob containing "\7\76"
+	e68fe8129b546b101aee9510c5328e7f21ca1d18)
+		case "$2" in
+		'')
+			printf '\062\170\234\143\267\3\0\0\116\0\106'
+			return
+			;;
+		esac
+		;;
+	esac
+
+	echo >&2 "BUG: don't know how to print $1${2:+ (from $2)}"
+	return 1
+}
+
+# Compute and append pack trailer to "$1"
+pack_trailer () {
+	test-sha1 -b <"$1" >trailer.tmp &&
+	cat trailer.tmp >>"$1" &&
+	rm -f trailer.tmp
+}
+
+# Remove any existing packs to make sure that
+# whatever we index next will be the pack that we
+# actually use.
+clear_packs () {
+	rm -f .git/objects/pack/*
+}
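As a cross-check on what pack_header and uint32_binary emit, here is a small C sketch of the same header. The write_pack_header() helper is hypothetical and not part of the commit: a version-2 pack begins with the 4-byte signature "PACK" followed by the version number and the object count, both as 4-byte big-endian integers, which is the encoding uint32_octal builds one byte at a time.

/*
 * Illustrative C counterpart to pack_header/uint32_binary above -- a
 * hypothetical helper, not part of the commit.
 */
#include <stdint.h>
#include <stdio.h>
#include <arpa/inet.h>

static void write_pack_header(FILE *out, uint32_t nr_objects)
{
	uint32_t version = htonl(2);		/* pack format version 2 */
	uint32_t count = htonl(nr_objects);	/* object count, big-endian */

	fwrite("PACK", 1, 4, out);
	fwrite(&version, sizeof(version), 1, out);
	fwrite(&count, sizeof(count), 1, out);
}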

t/t5308-pack-detect-duplicates.sh

Lines changed: 73 additions & 0 deletions
@@ -0,0 +1,73 @@
+#!/bin/sh
+
+test_description='handling of duplicate objects in incoming packfiles'
+. ./test-lib.sh
+. "$TEST_DIRECTORY"/lib-pack.sh
+
+# The sha1s we have in our pack. It's important that these have the same
+# starting byte, so that they end up in the same fanout section of the index.
+# That lets us make sure we are exercising the binary search with both sets.
+LO_SHA1=e68fe8129b546b101aee9510c5328e7f21ca1d18
+HI_SHA1=e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
+
+# And here's a "missing sha1" which will produce failed lookups. It must also
+# be in the same fanout section, and should be between the two (so that during
+# our binary search, we are sure to end up looking at one or the other of the
+# duplicate runs).
+MISSING_SHA1='e69d000000000000000000000000000000000000'
+
+# git will never intentionally create packfiles with
+# duplicate objects, so we have to construct them by hand.
+#
+# $1 is the name of the packfile to create
+#
+# $2 is the number of times to duplicate each object
+create_pack () {
+	pack_header "$((2 * $2))" >"$1" &&
+	for i in $(test_seq 1 "$2"); do
+		pack_obj $LO_SHA1 &&
+		pack_obj $HI_SHA1
+	done >>"$1" &&
+	pack_trailer "$1"
+}
+
+# double-check that create_pack actually works
+test_expect_success 'pack with no duplicates' '
+	create_pack no-dups.pack 1 &&
+	git index-pack --stdin <no-dups.pack
+'
+
+test_expect_success 'index-pack will allow duplicate objects by default' '
+	clear_packs &&
+	create_pack dups.pack 100 &&
+	git index-pack --stdin <dups.pack
+'
+
+test_expect_success 'create batch-check test vectors' '
+	cat >input <<-EOF &&
+	$LO_SHA1
+	$HI_SHA1
+	$MISSING_SHA1
+	EOF
+	cat >expect <<-EOF
+	$LO_SHA1 blob 2
+	$HI_SHA1 blob 0
+	$MISSING_SHA1 missing
+	EOF
+'
+
+test_expect_success 'lookup in duplicated pack (binary search)' '
+	git cat-file --batch-check <input >actual &&
+	test_cmp expect actual
+'
+
+test_expect_success 'lookup in duplicated pack (GIT_USE_LOOKUP)' '
+	(
+		GIT_USE_LOOKUP=1 &&
+		export GIT_USE_LOOKUP &&
+		git cat-file --batch-check <input >actual
+	) &&
+	test_cmp expect actual
+'
+
+test_done
