Commit 8e43090

peff authored and gitster committed
fsck: do not assume NUL-termination of buffers

The fsck code operates on an object buffer represented as a pointer/len
combination. However, the parsing of commits and tags is a little bit
loose; we mostly scan left-to-right through the buffer, without checking
whether we've gone past the length we were given.

This has traditionally been OK because the buffers we feed to fsck
always have an extra NUL after the end of the object content, which ends
any left-to-right scan. That has always been true for objects we read
from the odb, and we made it true for incoming index-pack/unpack-objects
checks in a1e920a (index-pack: terminate object buffers with NUL,
2014-12-08).

However, we recently added an exception: hash-object asks index_fd() to
do fsck checks. That _may_ have an extra NUL (if we read from a pipe
into a strbuf), but it might not (if we read the contents from the
file). Nor can we just teach it to always add a NUL. We may mmap the
on-disk file, which will not have any extra bytes (if it's a multiple of
the page size). Not to mention that this is a rather subtle assumption
for the fsck code to make.

Instead, let's make sure that the fsck parsers don't ever look past the
size of the buffer they've been given. This _almost_ works already,
thanks to earlier work in 4d0d897 (Make sure fsck_commit_buffer() does
not run out of the buffer, 2014-09-11). The theory there is that we
check up front whether we have the end-of-header double-newline
separator. And then any left-to-right scanning we do is OK as long as it
stops when it hits that boundary.

However, we later softened that in 84d18c0 (fsck: it is OK for a tag and
a commit to lack the body, 2015-06-28), which allows the double-newline
header to be missing, but does require that the header ends in a
newline. That was OK back then, because of the NUL-termination
guarantees (including the one from a1e920a mentioned above).

Because 84d18c0 guarantees that any header line does end in a newline,
we are still OK with most of the left-to-right scanning. We only need to
take care after completing a line, to check that there is another line
(and we didn't run out of buffer).

Most of these checks just need to test "buffer < buffer_end" (where
buffer is advanced as we parse) before scanning for the next header
line. But here are a few notes:

  - we don't technically need to check for remaining buffer before
    parsing the very first line ("tree" for a commit, or "object" for a
    tag), because verify_headers() rejects a totally empty buffer. But
    we'll do so in the name of consistency and defensiveness.

  - there are some calls to strchr('\n'). These are actually OK by the
    "the final header line must end in a newline" guarantee from
    verify_headers(). They will always find that rather than run off the
    end of the buffer. Curiously, they do check for a NULL return and
    complain, but I believe that condition can never be reached.

    However, I converted them to use memchr() with a proper size and
    retained the NULL checks. Using memchr() is not much longer and
    makes it more obvious what is going on. Likewise, retaining the NULL
    checks serves as a defensive measure in case my analysis is wrong.

  - commit 9a1a3a4 (mktag: allow omitting the header/body \n separator,
    2021-01-05), does check for the end-of-buffer condition, but does so
    with "!*buffer", relying explicitly on the NUL termination. We can
    accomplish the same thing with a pointer comparison. I also folded
    it into the follow-on conditional that checks the contents of the
    buffer, for consistency with the other checks.

  - fsck_ident() uses parse_timestamp(), which is based on strtoumax().
    That function will happily skip past leading whitespace, including
    newlines, which makes it a risk. We can fix this by scanning to the
    first digit ourselves, and then using parse_timestamp() to do the
    actual numeric conversion.

    Note that as a side effect this fixes the fact that we missed
    zero-padded timestamps preceded by extra whitespace, like
    "<email>  0123" (whereas we would complain about "<email> 0123"). I
    doubt anybody cares, but I mention it here for completeness.

  - fsck_tree() does not need any modifications. It relies on
    decode_tree_entry() to do the actual parsing, and that function
    checks both that there are enough bytes in the buffer to represent
    an entry, and that there is a NUL at the appropriate spot (one
    hash-length from the end; this may not be the NUL for the entry we
    are parsing, but we know that in the worst case, everything from our
    current position to that NUL is a filename, so we won't run out of
    bytes).

In addition to fixing the code itself, we'd like to make sure our rather
subtle assumptions are not violated in the future. So this patch does
two more things:

  - add comments around verify_headers() documenting the link between
    what it checks and the memory safety of the callers. I don't expect
    this code to be modified frequently, but this may keep somebody from
    accidentally breaking things.

  - add a thorough set of tests covering truncations at various key
    spots (e.g., for a "tree $oid" line, in the middle of the word
    "tree", right after it, after the space, in the middle of the $oid,
    and right at the end of the line). Most of these are fine already
    (it is only truncating right at the end of the line that is
    currently broken). And some of them are not even possible with the
    current code (we parse "tree " as a unit, so truncating before the
    space is equivalent). But I aimed here to consider the code a black
    box and look for any truncations that would be a problem for a
    left-to-right parser.

Signed-off-by: Jeff King <[email protected]>
Signed-off-by: Junio C Hamano <[email protected]>

1 parent 69bbbe4 commit 8e43090
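
The invariant described in the commit message (validate the header terminator once, then only check for remaining bytes at the start of each line) can be sketched in isolation. This is a minimal illustration under stated assumptions, not the code from fsck.c: `headers_end_in_newline()` and `count_header_lines()` are hypothetical names, and the error reporting that the real verify_headers() does through report() is omitted.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical stand-in for the up-front check: succeed if the buffer
 * contains a "\n\n" separator, or at least ends in a newline.
 */
static int headers_end_in_newline(const char *buf, size_t len)
{
	size_t i;

	for (i = 0; i + 1 < len; i++)
		if (buf[i] == '\n' && buf[i + 1] == '\n')
			return 1;
	return len > 0 && buf[len - 1] == '\n';
}

/*
 * Count header lines, stopping at the blank separator line (if any).
 * Never reads buf[len], so no trailing NUL is required. Returns -1 if
 * the up-front check fails.
 */
static int count_header_lines(const char *buf, size_t len)
{
	const char *p = buf;
	const char *end = buf + len;
	int n = 0;

	assert(buf != NULL);
	if (!headers_end_in_newline(buf, len))
		return -1;
	/* "p < end" is the per-line check; it must come before "*p" */
	while (p < end && *p != '\n') {
		/* bounded replacement for strchr(p, '\n') */
		const char *eol = memchr(p, '\n', (size_t)(end - p));
		if (!eol)
			return -1; /* defensive; excluded by the check above */
		n++;
		p = eol + 1;
	}
	return n;
}
```

Calling `count_header_lines("tree abc\nauth", 9)` only ever looks at the first nine bytes, so the partial "auth" that follows in memory does not affect the result. That is the property the patch wants from the fsck parsers.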

2 files changed: +194 -13 lines changed

fsck.c

Lines changed: 54 additions & 13 deletions
@@ -748,6 +748,23 @@ static int fsck_tree(const struct object_id *tree_oid,
 	return retval;
 }
 
+/*
+ * Confirm that the headers of a commit or tag object end in a reasonable way,
+ * either with the usual "\n\n" separator, or at least with a trailing newline
+ * on the final header line.
+ *
+ * This property is important for the memory safety of our callers. It allows
+ * them to scan the buffer linewise without constantly checking the remaining
+ * size as long as:
+ *
+ *   - they check that there are bytes left in the buffer at the start of any
+ *     line (i.e., that the last newline they saw was not the final one we
+ *     found here)
+ *
+ *   - any intra-line scanning they do will stop at a newline, which will worst
+ *     case hit the newline we found here as the end-of-header. This makes it
+ *     OK for them to use helpers like parse_oid_hex(), or even skip_prefix().
+ */
 static int verify_headers(const void *data, unsigned long size,
 			  const struct object_id *oid, enum object_type type,
 			  struct fsck_options *options)
@@ -808,6 +825,20 @@ static int fsck_ident(const char **ident,
 	if (*p != ' ')
 		return report(options, oid, type, FSCK_MSG_MISSING_SPACE_BEFORE_DATE, "invalid author/committer line - missing space before date");
 	p++;
+	/*
+	 * Our timestamp parser is based on the C strto*() functions, which
+	 * will happily eat whitespace, including the newline that is supposed
+	 * to prevent us walking past the end of the buffer. So do our own
+	 * scan, skipping linear whitespace but not newlines, and then
+	 * confirming we found a digit. We _could_ be even more strict here,
+	 * as we really expect only a single space, but since we have
+	 * traditionally allowed extra whitespace, we'll continue to do so.
+	 */
+	while (*p == ' ' || *p == '\t')
+		p++;
+	if (!isdigit(*p))
+		return report(options, oid, type, FSCK_MSG_BAD_DATE,
+			      "invalid author/committer line - bad date");
 	if (*p == '0' && p[1] != ' ')
 		return report(options, oid, type, FSCK_MSG_ZERO_PADDED_DATE, "invalid author/committer line - zero-padded date");
 	if (date_overflows(parse_timestamp(p, &end, 10)))
@@ -834,20 +865,26 @@ static int fsck_commit(const struct object_id *oid,
 	unsigned author_count;
 	int err;
 	const char *buffer_begin = buffer;
+	const char *buffer_end = buffer + size;
 	const char *p;
 
+	/*
+	 * We _must_ stop parsing immediately if this reports failure, as the
+	 * memory safety of the rest of the function depends on it. See the
+	 * comment above the definition of verify_headers() for more details.
+	 */
 	if (verify_headers(buffer, size, oid, OBJ_COMMIT, options))
 		return -1;
 
-	if (!skip_prefix(buffer, "tree ", &buffer))
+	if (buffer >= buffer_end || !skip_prefix(buffer, "tree ", &buffer))
 		return report(options, oid, OBJ_COMMIT, FSCK_MSG_MISSING_TREE, "invalid format - expected 'tree' line");
 	if (parse_oid_hex(buffer, &tree_oid, &p) || *p != '\n') {
 		err = report(options, oid, OBJ_COMMIT, FSCK_MSG_BAD_TREE_SHA1, "invalid 'tree' line format - bad sha1");
 		if (err)
 			return err;
 	}
 	buffer = p + 1;
-	while (skip_prefix(buffer, "parent ", &buffer)) {
+	while (buffer < buffer_end && skip_prefix(buffer, "parent ", &buffer)) {
 		if (parse_oid_hex(buffer, &parent_oid, &p) || *p != '\n') {
 			err = report(options, oid, OBJ_COMMIT, FSCK_MSG_BAD_PARENT_SHA1, "invalid 'parent' line format - bad sha1");
 			if (err)
@@ -856,7 +893,7 @@ static int fsck_commit(const struct object_id *oid,
 		buffer = p + 1;
 	}
 	author_count = 0;
-	while (skip_prefix(buffer, "author ", &buffer)) {
+	while (buffer < buffer_end && skip_prefix(buffer, "author ", &buffer)) {
 		author_count++;
 		err = fsck_ident(&buffer, oid, OBJ_COMMIT, options);
 		if (err)
@@ -868,7 +905,7 @@ static int fsck_commit(const struct object_id *oid,
 		err = report(options, oid, OBJ_COMMIT, FSCK_MSG_MULTIPLE_AUTHORS, "invalid format - multiple 'author' lines");
 	if (err)
 		return err;
-	if (!skip_prefix(buffer, "committer ", &buffer))
+	if (buffer >= buffer_end || !skip_prefix(buffer, "committer ", &buffer))
 		return report(options, oid, OBJ_COMMIT, FSCK_MSG_MISSING_COMMITTER, "invalid format - expected 'committer' line");
 	err = fsck_ident(&buffer, oid, OBJ_COMMIT, options);
 	if (err)
@@ -899,13 +936,19 @@ int fsck_tag_standalone(const struct object_id *oid, const char *buffer,
 	int ret = 0;
 	char *eol;
 	struct strbuf sb = STRBUF_INIT;
+	const char *buffer_end = buffer + size;
 	const char *p;
 
+	/*
+	 * We _must_ stop parsing immediately if this reports failure, as the
+	 * memory safety of the rest of the function depends on it. See the
+	 * comment above the definition of verify_headers() for more details.
+	 */
 	ret = verify_headers(buffer, size, oid, OBJ_TAG, options);
 	if (ret)
 		goto done;
 
-	if (!skip_prefix(buffer, "object ", &buffer)) {
+	if (buffer >= buffer_end || !skip_prefix(buffer, "object ", &buffer)) {
 		ret = report(options, oid, OBJ_TAG, FSCK_MSG_MISSING_OBJECT, "invalid format - expected 'object' line");
 		goto done;
 	}
@@ -916,11 +959,11 @@ int fsck_tag_standalone(const struct object_id *oid, const char *buffer,
 	}
 	buffer = p + 1;
 
-	if (!skip_prefix(buffer, "type ", &buffer)) {
+	if (buffer >= buffer_end || !skip_prefix(buffer, "type ", &buffer)) {
 		ret = report(options, oid, OBJ_TAG, FSCK_MSG_MISSING_TYPE_ENTRY, "invalid format - expected 'type' line");
 		goto done;
 	}
-	eol = strchr(buffer, '\n');
+	eol = memchr(buffer, '\n', buffer_end - buffer);
 	if (!eol) {
 		ret = report(options, oid, OBJ_TAG, FSCK_MSG_MISSING_TYPE, "invalid format - unexpected end after 'type' line");
 		goto done;
@@ -932,11 +975,11 @@ int fsck_tag_standalone(const struct object_id *oid, const char *buffer,
 		goto done;
 	buffer = eol + 1;
 
-	if (!skip_prefix(buffer, "tag ", &buffer)) {
+	if (buffer >= buffer_end || !skip_prefix(buffer, "tag ", &buffer)) {
 		ret = report(options, oid, OBJ_TAG, FSCK_MSG_MISSING_TAG_ENTRY, "invalid format - expected 'tag' line");
 		goto done;
 	}
-	eol = strchr(buffer, '\n');
+	eol = memchr(buffer, '\n', buffer_end - buffer);
 	if (!eol) {
 		ret = report(options, oid, OBJ_TAG, FSCK_MSG_MISSING_TAG, "invalid format - unexpected end after 'type' line");
 		goto done;
@@ -952,18 +995,16 @@ int fsck_tag_standalone(const struct object_id *oid, const char *buffer,
 	}
 	buffer = eol + 1;
 
-	if (!skip_prefix(buffer, "tagger ", &buffer)) {
+	if (buffer >= buffer_end || !skip_prefix(buffer, "tagger ", &buffer)) {
 		/* early tags do not contain 'tagger' lines; warn only */
 		ret = report(options, oid, OBJ_TAG, FSCK_MSG_MISSING_TAGGER_ENTRY, "invalid format - expected 'tagger' line");
 		if (ret)
 			goto done;
 	}
 	else
 		ret = fsck_ident(&buffer, oid, OBJ_TAG, options);
-	if (!*buffer)
-		goto done;
 
-	if (!starts_with(buffer, "\n")) {
+	if (buffer < buffer_end && !starts_with(buffer, "\n")) {
 		/*
 		 * The verify_headers() check will allow
 		 * e.g. "[...]tagger <tagger>\nsome
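
The fsck_ident() change in this file guards against strtoumax() skipping over the newline that terminates the line. A small standalone sketch of that idea follows. `parse_ts_on_line()` is a hypothetical helper, not Git's parse_timestamp(), and it assumes (as fsck does after verify_headers() succeeds) that the input line ends in a newline somewhere inside the buffer.

```c
#include <assert.h>
#include <ctype.h>
#include <inttypes.h>
#include <stddef.h>

/*
 * Hypothetical helper: parse a decimal timestamp that must start on the
 * current line. Skips spaces and tabs, but not newlines, before handing
 * off to strtoumax(). Returns 0 on success, -1 if no digit was found.
 */
static int parse_ts_on_line(const char *p, uintmax_t *out, const char **end)
{
	char *num_end;

	while (*p == ' ' || *p == '\t')
		p++;
	/* a newline (or anything else non-numeric) stops us here */
	if (!isdigit((unsigned char)*p))
		return -1;
	*out = strtoumax(p, &num_end, 10);
	*end = num_end;
	return 0;
}
```

Called on `" \n1234\n"`, the helper refuses to parse, whereas a bare `strtoumax(" \n1234\n", NULL, 10)` treats the newline as ordinary leading whitespace and returns 1234 from the following line. In a buffer with no trailing NUL, that skip is what could carry the parser past the final newline.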

t/t1451-fsck-buffer.sh

Lines changed: 140 additions & 0 deletions
@@ -0,0 +1,140 @@
+#!/bin/sh
+
+test_description='fsck on buffers without NUL termination
+
+The goal here is to make sure that the various fsck parsers never look
+past the end of the buffer they are given, even when encountering broken
+or truncated objects.
+
+We have to use "hash-object" for this because most code paths that read objects
+append an extra NUL for safety after the buffer. But hash-object, since it is
+reading straight from a file (and possibly even mmap-ing it) cannot always do
+so.
+
+These tests _might_ catch such overruns in normal use, but should be run with
+ASan or valgrind for more confidence.
+'
+. ./test-lib.sh
+
+# the general idea for tags and commits is to build up the "base" file
+# progressively, and then test new truncations on top of it.
+reset () {
+	test_expect_success 'reset input to empty' '
+		>base
+	'
+}
+
+add () {
+	content="$1"
+	type=${content%% *}
+	test_expect_success "add $type line" '
+		echo "$content" >>base
+	'
+}
+
+check () {
+	type=$1
+	fsck=$2
+	content=$3
+	test_expect_success "truncated $type ($fsck, \"$content\")" '
+		# do not pipe into hash-object here; we want to increase
+		# the chance that it uses a fixed-size buffer or mmap,
+		# and a pipe would be read into a strbuf.
+		{
+			cat base &&
+			echo "$content"
+		} >input &&
+		test_must_fail git hash-object -t "$type" input 2>err &&
+		grep "$fsck" err
+	'
+}
+
+test_expect_success 'create valid objects' '
+	git commit --allow-empty -m foo &&
+	commit=$(git rev-parse --verify HEAD) &&
+	tree=$(git rev-parse --verify HEAD^{tree})
+'
+
+reset
+check commit missingTree ""
+check commit missingTree "tr"
+check commit missingTree "tree"
+check commit badTreeSha1 "tree "
+check commit badTreeSha1 "tree 1234"
+add "tree $tree"
+
+# these expect missingAuthor because "parent" is optional
+check commit missingAuthor ""
+check commit missingAuthor "par"
+check commit missingAuthor "parent"
+check commit badParentSha1 "parent "
+check commit badParentSha1 "parent 1234"
+add "parent $commit"
+
+check commit missingAuthor ""
+check commit missingAuthor "au"
+check commit missingAuthor "author"
+ident_checks () {
+	check $1 missingEmail "$2 "
+	check $1 missingEmail "$2 name"
+	check $1 badEmail "$2 name <"
+	check $1 badEmail "$2 name <email"
+	check $1 missingSpaceBeforeDate "$2 name <email>"
+	check $1 badDate "$2 name <email> "
+	check $1 badDate "$2 name <email> 1234"
+	check $1 badTimezone "$2 name <email> 1234 "
+	check $1 badTimezone "$2 name <email> 1234 +"
+}
+ident_checks commit author
+add "author name <email> 1234 +0000"
+
+check commit missingCommitter ""
+check commit missingCommitter "co"
+check commit missingCommitter "committer"
+ident_checks commit committer
+add "committer name <email> 1234 +0000"
+
+reset
+check tag missingObject ""
+check tag missingObject "obj"
+check tag missingObject "object"
+check tag badObjectSha1 "object "
+check tag badObjectSha1 "object 1234"
+add "object $commit"
+
+check tag missingType ""
+check tag missingType "ty"
+check tag missingType "type"
+check tag badType "type "
+check tag badType "type com"
+add "type commit"
+
+check tag missingTagEntry ""
+check tag missingTagEntry "ta"
+check tag missingTagEntry "tag"
+check tag badTagName "tag "
+add "tag foo"
+
+check tag missingTagger ""
+check tag missingTagger "ta"
+check tag missingTagger "tagger"
+ident_checks tag tagger
+
+# trees are a binary format and can't use our earlier helpers
+test_expect_success 'truncated tree (short hash)' '
+	printf "100644 foo\0\1\1\1\1" >input &&
+	test_must_fail git hash-object -t tree input 2>err &&
+	grep badTree err
+'
+
+test_expect_success 'truncated tree (missing nul)' '
+	# these two things are indistinguishable to the parser. The important
+	# thing about this is example is that there are enough bytes to
+	# make up a hash, and that there is no NUL (and we confirm that the
+	# parser does not walk past the end of the buffer).
+	printf "100644 a long filename, or a hash with missing nul?" >input &&
+	test_must_fail git hash-object -t tree input 2>err &&
+	grep badTree err
+'
+
+test_done
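
The two tree tests above exercise the up-front check that the commit message attributes to decode_tree_entry(): enough bytes for an entry, and a NUL one hash-length from the end. A rough sketch of such a check, assuming it works as described there, might look like the following. `tree_tail_is_safe()` is a hypothetical name, and the hash size is passed in (20 bytes for SHA-1 in the examples) rather than taken from the repository's hash algorithm.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical sketch: decide whether a left-to-right scan for the
 * filename's NUL is guaranteed to stay inside buf[0..size).
 *
 * The smallest possible entry is one mode digit, a space, a NUL and the
 * raw hash, hence "hashsz + 3". If the byte that sits hashsz + 1 from
 * the end is NUL, any scan for a NUL must stop at or before it.
 */
static int tree_tail_is_safe(const char *buf, size_t size, size_t hashsz)
{
	assert(buf != NULL);
	if (size < hashsz + 3)
		return 0;
	return buf[size - (hashsz + 1)] == '\0';
}
```

With a 20-byte hash, the 15-byte input from the "short hash" test fails the size check, and the 51-byte input from the "missing nul" test fails the NUL check, so in neither case would a parser built on this guard start scanning.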
