Skip to content

Commit 4f22b10

Browse files
peffgitster
authored andcommitted
do not stream large files to pack when filters are in use
Because git's object format requires us to specify the number of bytes in the object in its header, we must know the size before streaming a blob into the object database. This is not a problem when adding a regular file, as we can get the size from stat(). However, when filters are in use (such as autocrlf, or the ident, filter, or eol gitattributes), we have no idea what the ultimate size will be. The current code just punts on the whole issue and ignores filter configuration entirely for files larger than core.bigfilethreshold. This can generate confusing results if you use filters for large binary files, as the filter will suddenly stop working as the file goes over a certain size. Rather than try to handle unknown input sizes with streaming, this patch just turns off the streaming optimization when filters are in use. This has a slight performance regression in a very specific case: if you have autocrlf on, but no gitattributes, a large binary file will avoid the streaming code path because we don't know beforehand whether it will need conversion or not. But if you are handling large binary files, you should be marking them as such via attributes (or at least not using autocrlf, and instead marking your text files as such). And the flip side is that if you have a large _non_-binary file, there is a correctness improvement; before we did not apply the conversion at all. The first half of the new t1051 script covers these failures on input. The second half tests the matching output code paths. These already work correctly, and do not need any adjustment. Signed-off-by: Jeff King <[email protected]> Signed-off-by: Junio C Hamano <[email protected]>
1 parent 4c3b57b commit 4f22b10

File tree

2 files changed

+95
-5
lines changed

2 files changed

+95
-5
lines changed

sha1_file.c

Lines changed: 9 additions & 5 deletions
Original file line numberDiff line numberDiff line change
@@ -2688,10 +2688,13 @@ static int index_core(unsigned char *sha1, int fd, size_t size,
26882688
* This also bypasses the usual "convert-to-git" dance, and that is on
26892689
* purpose. We could write a streaming version of the converting
26902690
* functions and insert that before feeding the data to fast-import
2691-
* (or equivalent in-core API described above), but the primary
2692-
* motivation for trying to stream from the working tree file and to
2693-
* avoid mmaping it in core is to deal with large binary blobs, and
2694-
* by definition they do _not_ want to get any conversion.
2691+
* (or equivalent in-core API described above). However, that is
2692+
* somewhat complicated, as we do not know the size of the filter
2693+
* result, which we need to know beforehand when writing a git object.
2694+
* Since the primary motivation for trying to stream from the working
2695+
* tree file and to avoid mmaping it in core is to deal with large
2696+
* binary blobs, they generally do not want to get any conversion, and
2697+
* callers should avoid this code path when filters are requested.
26952698
*/
26962699
static int index_stream(unsigned char *sha1, int fd, size_t size,
26972700
enum object_type type, const char *path,
@@ -2766,7 +2769,8 @@ int index_fd(unsigned char *sha1, int fd, struct stat *st,
27662769

27672770
if (!S_ISREG(st->st_mode))
27682771
ret = index_pipe(sha1, fd, type, path, flags);
2769-
else if (size <= big_file_threshold || type != OBJ_BLOB)
2772+
else if (size <= big_file_threshold || type != OBJ_BLOB ||
2773+
(path && would_convert_to_git(path, NULL, 0, 0)))
27702774
ret = index_core(sha1, fd, size, type, path, flags);
27712775
else
27722776
ret = index_stream(sha1, fd, size, type, path, flags);

t/t1051-large-conversion.sh

Lines changed: 86 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,86 @@
1+
#!/bin/sh
2+
3+
test_description='test conversion filters on large files'
4+
. ./test-lib.sh
5+
6+
set_attr() {
7+
test_when_finished 'rm -f .gitattributes' &&
8+
echo "* $*" >.gitattributes
9+
}
10+
11+
check_input() {
12+
git read-tree --empty &&
13+
git add small large &&
14+
git cat-file blob :small >small.index &&
15+
git cat-file blob :large | head -n 1 >large.index &&
16+
test_cmp small.index large.index
17+
}
18+
19+
check_output() {
20+
rm -f small large &&
21+
git checkout small large &&
22+
head -n 1 large >large.head &&
23+
test_cmp small large.head
24+
}
25+
26+
test_expect_success 'setup input tests' '
27+
printf "\$Id: foo\$\\r\\n" >small &&
28+
cat small small >large &&
29+
git config core.bigfilethreshold 20 &&
30+
git config filter.test.clean "sed s/.*/CLEAN/"
31+
'
32+
33+
test_expect_success 'autocrlf=true converts on input' '
34+
test_config core.autocrlf true &&
35+
check_input
36+
'
37+
38+
test_expect_success 'eol=crlf converts on input' '
39+
set_attr eol=crlf &&
40+
check_input
41+
'
42+
43+
test_expect_success 'ident converts on input' '
44+
set_attr ident &&
45+
check_input
46+
'
47+
48+
test_expect_success 'user-defined filters convert on input' '
49+
set_attr filter=test &&
50+
check_input
51+
'
52+
53+
test_expect_success 'setup output tests' '
54+
echo "\$Id\$" >small &&
55+
cat small small >large &&
56+
git add small large &&
57+
git config core.bigfilethreshold 7 &&
58+
git config filter.test.smudge "sed s/.*/SMUDGE/"
59+
'
60+
61+
test_expect_success 'autocrlf=true converts on output' '
62+
test_config core.autocrlf true &&
63+
check_output
64+
'
65+
66+
test_expect_success 'eol=crlf converts on output' '
67+
set_attr eol=crlf &&
68+
check_output
69+
'
70+
71+
test_expect_success 'user-defined filters convert on output' '
72+
set_attr filter=test &&
73+
check_output
74+
'
75+
76+
test_expect_success 'ident converts on output' '
77+
set_attr ident &&
78+
rm -f small large &&
79+
git checkout small large &&
80+
sed -n "s/Id: .*/Id: SHA/p" <small >small.clean &&
81+
head -n 1 large >large.head &&
82+
sed -n "s/Id: .*/Id: SHA/p" <large.head >large.clean &&
83+
test_cmp small.clean large.clean
84+
'
85+
86+
test_done

0 commit comments

Comments
 (0)