Initialize a few uninitialized variables #1888

dscho · 2025-03-24T11:13:55Z

When I ran CodeQL on Git's source code, it said that variables might be uninitialized in a few places.

cc: Taylor Blau [email protected]
cc: Jeff King [email protected]

…alized

The large `switch` statement makes it a bit impractical to reason about the code.

One of the code paths can technically lead to using `size` without being initialized: if the `t` case is taken and the type name is set to the empty string, we would actually leave `size` uninitialized right until we use it.

Practically, this cannot happen because the `do_oid_object_info_extended()` function is expected to always populate the `type_name` if asked for. However, it is quite unnecessary to leave the code as unwieldy to reason about: Just initialize the variable to 0 and be done with it.

Signed-off-by: Johannes Schindelin <[email protected]>

In `fsck_commit()`, after counting the authors of a commit, we set the `err` variable either when there was no author, or when there was more than one author recorded. Then we access the `err` variable to figure out whether we should return early. But if there was exactly one author, that variable is still uninitialized.

Let's just initialize the variable.

This issue was pointed out by CodeQL.

Signed-off-by: Johannes Schindelin <[email protected]>

The `revindex_size` value is uninitialized in case the function is erroring out, but we want to assign its value. Let's just initialize it. Signed-off-by: Johannes Schindelin <[email protected]>

The `mtimes_size` variable is uninitialized when the function errors out, yet its value is assigned to another variable. Let's just initialize it. Signed-off-by: Johannes Schindelin <[email protected]>

dscho · 2025-03-27T12:43:02Z

/submit

gitgitgadget · 2025-03-27T12:44:04Z

Submitted as [email protected]

To fetch this version into FETCH_HEAD:

git fetch https://github.com/gitgitgadget/git/ pr-1888/dscho/uninitialized-variables-v1

To fetch this version to local tag pr-1888/dscho/uninitialized-variables-v1:

git fetch --no-tags https://github.com/gitgitgadget/git/ tag pr-1888/dscho/uninitialized-variables-v1

gitgitgadget · 2025-03-27T14:25:32Z

pack-revindex.c

 				   uint32_t num_objects,
 				   const uint32_t **data_p, size_t *len_p)
 {
 	int fd, ret = 0;


On the Git mailing list, Taylor Blau wrote (reply to this):

On Thu, Mar 27, 2025 at 12:43:48PM +0000, Johannes Schindelin via GitGitGadget wrote:
> From: Johannes Schindelin <[email protected]>
>
> The `revindex_size` value is uninitialized in case the function is
> erroring out, but we want to assign its value. Let's just initialize it.
>
> Signed-off-by: Johannes Schindelin <[email protected]>
> ---
>  pack-revindex.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/pack-revindex.c b/pack-revindex.c
> index d3832478d99..3b007d771b3 100644
> --- a/pack-revindex.c
> +++ b/pack-revindex.c
> @@ -208,7 +208,7 @@ static int load_revindex_from_disk(char *revindex_name,
>  	int fd, ret = 0;
>  	struct stat st;
>  	void *data = NULL;
> -	size_t revindex_size;
> +	size_t revindex_size = 0;

I'm certainly not opposed to initializing variables proactively, but in this particular case I don't think it's necessary.

We assign 'revindex_size' out to 'len_p' when we enter the cleanup routine label if 'ret' is zero. We'll use 'revindex_size' in the same label to munmap() when 'ret' is non-zero, but only if 'data' is also initialized.

So there are two conditions where we'll enter the cleanup label before assigning 'revindex_size': when git_open() returns a negative value, or when fstat()ing the descriptor that git_open() gave us returns a non-zero value.

In both of those cases, ret is non-zero (it is assigned to 1 and the return value of error_errno() in those cases, respectively). Since 'data' is also NULL here, this function will terminate without using the uninitialized 'revindex_size'.

If both of those work (i.e., we opened the file and fstat()ed it successfully), then we'll have revindex_size initialized to st.st_size (really the result of calling xsize_t() on it). There are two sanity checks on the size, both of which happen before we have mmap()ed the file, and both sanity checks set 'ret' to a non-zero value upon failure.

So by the time we '*len_p = revindex_size' it is guaranteed to be initialized and not just junk bytes on the stack.

Did this trigger a warning from a static analyzer or something? If so, I'm happy to take this patch to appease it. Perhaps that is what's going on, since I recall you mentioning that you were working on enabling CodeQL in Microsoft's fork of Git. But if not I might suggest dropping this patch for the reasons above.

Thanks,
Taylor
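The argument in the reply above can be illustrated with a small model of the cleanup pattern. This is only a sketch under stated assumptions: `load_model()`, `open_ok`, `stat_ok` and the `backing` buffer are hypothetical stand-ins (for the function under discussion, for `git_open()` and `fstat()` succeeding, and for the mapped file), and the single size check is a simplification of the two sanity checks mentioned. It is not Git's actual code.

```c
#include <stddef.h>

/* Hypothetical stand-in for the memory that mmap() would return. */
static char backing[64];

/*
 * Simplified model of the control flow described in the reply.
 * open_ok and stat_ok simulate git_open() and fstat() succeeding.
 * Returns 0 on success and fills *data_p and *len_p; otherwise
 * returns non-zero and leaves both untouched.
 */
int load_model(int open_ok, int stat_ok, size_t file_size,
	       const void **data_p, size_t *len_p)
{
	int ret = 0;
	void *data = NULL;
	size_t size; /* deliberately left uninitialized, as in the original */

	if (!open_ok) {
		ret = 1; /* models the "could not open" exit */
		goto cleanup;
	}
	if (!stat_ok) {
		ret = -1; /* stands in for the error_errno() result */
		goto cleanup;
	}
	size = file_size; /* assigned before any later check runs */

	if (size < 12 || size > sizeof(backing)) {
		ret = -1; /* sanity check fails before anything is mapped */
		goto cleanup;
	}
	data = backing; /* stands in for a successful mmap() */

cleanup:
	if (ret) {
		if (data) {
			/* munmap(data, size) would go here; data is NULL
			 * on both early exits, so size is not read. */
		}
	} else {
		*len_p = size; /* only reached after size was assigned */
		*data_p = data;
	}
	return ret;
}
```

Every path that reaches `cleanup` with `size` unassigned also has `ret` non-zero and `data` equal to `NULL`, so `size` is never read there. A tool that does not track the relationship between `ret`, `data` and `size` may still flag the read, which is consistent with the false-positive explanation in the thread.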

gitgitgadget · 2025-03-27T14:25:35Z

User Taylor Blau <[email protected]> has been added to the cc: list.

gitgitgadget · 2025-03-27T14:25:36Z

pack-mtimes.c

 				 uint32_t num_objects,
 				 const uint32_t **data_p, size_t *len_p)
 {
 	int fd, ret = 0;


On the Git mailing list, Taylor Blau wrote (reply to this):

On Thu, Mar 27, 2025 at 12:43:49PM +0000, Johannes Schindelin via GitGitGadget wrote:
> From: Johannes Schindelin <[email protected]>
>
> The `mtimes_size` variable is uninitialized when the function errors out,
> yet its value is assigned to another variable. Let's just initialize it.
>
> Signed-off-by: Johannes Schindelin <[email protected]>
> ---
>  pack-mtimes.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/pack-mtimes.c b/pack-mtimes.c
> index cdf30b8d2b0..c1f531d45a0 100644
> --- a/pack-mtimes.c
> +++ b/pack-mtimes.c
> @@ -29,7 +29,7 @@ static int load_pack_mtimes_file(char *mtimes_file,
>  	int fd, ret = 0;
>  	struct stat st;
>  	uint32_t *data = NULL;
> -	size_t mtimes_size, expected_size;
> +	size_t mtimes_size = 0, expected_size;

Hmm. This one follows an identical line of reasoning as in my previous response in the thread. So I think this one is likewise unnecessary (though not harmful, and certainly useful if it appeases static analysis tools, etc).

Thanks,
Taylor

gitgitgadget · 2025-03-28T03:50:45Z

builtin/cat-file.c

 			int unknown_type)
 {
 	int ret;
 	struct object_id oid;


On the Git mailing list, Jeff King wrote (reply to this):

On Thu, Mar 27, 2025 at 12:43:46PM +0000, Johannes Schindelin via GitGitGadget wrote:
> From: Johannes Schindelin <[email protected]>
>
> The large `switch` statement makes it a bit impractical to reason about
> the code.
>
> One of the code paths can technically lead to using `size` without being
> initialized: if the `t` case is taken and the type name is set to the
> empty string, we would actually leave `size` uninitialized right until we
> use it.

I don't think that's quite true. If we have an empty type name we leave the switch and hit these lines:

	if (!buf)
		die("git cat-file %s: bad file", obj_name);

	write_or_die(1, buf, size);

Since we set buf to NULL before the switch and never touch it in the 't' case, we'll always hit that die() call. So this really is a false positive, regardless of what happens to the type name buffer.

I'm a little surprised that CodeQL would get this wrong, just because it is very easy to see that buf is not touched in the 't' case at all (and thus must be NULL). But maybe I'm missing something.

I do agree that the flow through the switch statement (where "break" is good for some cases and a failure mode for others) makes this code rather hard to reason about. I'm sure it could be rewritten, but I'm not sure if it's worth spending time on.

> Practically, this cannot happen because the
> `do_oid_object_info_extended()` function is expected to always populate
> the `type_name` if asked for. However, it is quite unnecessary to leave
> the code as unwieldy to reason about: Just initialize the variable to 0
> and be done with it.

You can trigger the path in question like this:

	oid=$(echo foo | git hash-object --literally --stdin -w -t '')
	git cat-file --allow-unknown -t $oid

which hits the "bad file" message. (Obviously the above is horrible and arguably something we should consider forbidding; I have some patches moving towards ripping out support for non-standard types entirely).

> diff --git a/builtin/cat-file.c b/builtin/cat-file.c
> index b13561cf73b..128c901fa8e 100644
> --- a/builtin/cat-file.c
> +++ b/builtin/cat-file.c
> @@ -104,7 +104,7 @@ static int cat_one_file(int opt, const char *exp_type, const char *obj_name,
>  	struct object_id oid;
>  	enum object_type type;
>  	char *buf;
> -	unsigned long size;
> +	unsigned long size = 0;

So even though I think your analysis above had a few wrong details, I do agree this is a false positive in CodeQL and is probably OK to fix as you do here. Though it might make more sense to do it alongside the assignment to "buf" (or to move the initialization of "buf" up here).

-Peff
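The guard described in the reply above can be modeled in a few lines. The sketch below is an illustration only: `emit_model()`, its parameters and its return convention are hypothetical and not part of Git, and the `'p'` case merely stands in for any case that sets both `buf` and `size`.

```c
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical model of the switch-then-guard flow described above.
 * Returns the number of bytes that would be written, or -1 at the
 * point where the real code would die() with "bad file".
 */
long emit_model(int opt, const char *type_name, const char *contents)
{
	const char *buf = NULL; /* set before the switch */
	unsigned long size;     /* not initialized, like the original */

	switch (opt) {
	case 't':
		if (type_name[0])
			return (long)strlen(type_name); /* type printed, early return */
		break; /* empty type name: buf untouched, size unset */
	case 'p':
		buf = contents;
		size = strlen(contents);
		break;
	default:
		break;
	}

	if (!buf)
		return -1; /* reached before size could ever be read */
	return (long)size; /* stands in for write_or_die(1, buf, size) */
}
```

In this model, `size` is read only when `buf` is non-NULL, and every case that sets `buf` also sets `size`. Calling `emit_model('t', "", "x")` takes the empty-type-name path and stops at the `!buf` guard, which mirrors the "bad file" outcome of the shell reproduction quoted in the reply.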

gitgitgadget · 2025-03-28T03:50:47Z

User Jeff King <[email protected]> has been added to the cc: list.

gitgitgadget · 2025-03-28T04:11:39Z

fsck.c

 static int fsck_commit(const struct object_id *oid,
 		       const char *buffer, unsigned long size,
 		       struct fsck_options *options)
 {


On the Git mailing list, Jeff King wrote (reply to this):

On Thu, Mar 27, 2025 at 12:43:47PM +0000, Johannes Schindelin via GitGitGadget wrote:
> From: Johannes Schindelin <[email protected]>
>
> In `fsck_commit()`, after counting the authors of a commit, we set the
> `err` variable either when there was no author, or when there was more
> than one author recorded. Then we access the `err` variable to figure
> out whether we should return early. But if there was exactly one author,
> that variable is still uninitialized.
>
> Let's just initialize the variable.
>
> This issue was pointed out by CodeQL.

Hmm, I'd think we would hit this case all the time, since commits generally have one author. But I think it's another false positive. The code in question is this:

	author_count = 0;
	while (buffer < buffer_end && skip_prefix(buffer, "author ", &buffer)) {
		author_count++;
		err = fsck_ident(&buffer, oid, OBJ_COMMIT, options);
		if (err)
			return err;
	}
	if (author_count < 1)
		err = report(options, oid, OBJ_COMMIT, FSCK_MSG_MISSING_AUTHOR, "invalid format - expected 'author' line");
	else if (author_count > 1)
		err = report(options, oid, OBJ_COMMIT, FSCK_MSG_MULTIPLE_AUTHORS, "invalid format - multiple 'author' lines");
	if (err)
		return err;

So we set "err" as soon as we find _any_ author (when we check whether it is properly formatted via fsck_ident). And author_count will not be incremented if we did not find one. So either we must have assigned the result of fsck_ident(), or we will hit the "author_count < 1" case and assign there.

It's certainly confusing, though, since "err" gets used in so many spots. I think the whole thing would be easier to understand if we had tighter-scoped single use variables like this:

diff --git a/fsck.c b/fsck.c
index 9fc4c25ffd..ea72b3247d 100644
--- a/fsck.c
+++ b/fsck.c
@@ -925,7 +925,6 @@ static int fsck_commit(const struct object_id *oid,
 {
 	struct object_id tree_oid, parent_oid;
 	unsigned author_count;
-	int err;
 	const char *buffer_begin = buffer;
 	const char *buffer_end = buffer + size;
 	const char *p;
@@ -941,39 +940,44 @@ static int fsck_commit(const struct object_id *oid,
 	if (buffer >= buffer_end || !skip_prefix(buffer, "tree ", &buffer))
 		return report(options, oid, OBJ_COMMIT, FSCK_MSG_MISSING_TREE, "invalid format - expected 'tree' line");
 	if (parse_oid_hex(buffer, &tree_oid, &p) || *p != '\n') {
-		err = report(options, oid, OBJ_COMMIT, FSCK_MSG_BAD_TREE_SHA1, "invalid 'tree' line format - bad sha1");
+		int err = report(options, oid, OBJ_COMMIT, FSCK_MSG_BAD_TREE_SHA1, "invalid 'tree' line format - bad sha1");
 		if (err)
 			return err;
 	}
 	buffer = p + 1;
 	while (buffer < buffer_end && skip_prefix(buffer, "parent ", &buffer)) {
 		if (parse_oid_hex(buffer, &parent_oid, &p) || *p != '\n') {
-			err = report(options, oid, OBJ_COMMIT, FSCK_MSG_BAD_PARENT_SHA1, "invalid 'parent' line format - bad sha1");
+			int err = report(options, oid, OBJ_COMMIT, FSCK_MSG_BAD_PARENT_SHA1, "invalid 'parent' line format - bad sha1");
 			if (err)
 				return err;
 		}
 		buffer = p + 1;
 	}
 	author_count = 0;
 	while (buffer < buffer_end && skip_prefix(buffer, "author ", &buffer)) {
+		int err = fsck_ident(&buffer, oid, OBJ_COMMIT, options);
+		if (err)
+			return err;
 		author_count++;
-		err = fsck_ident(&buffer, oid, OBJ_COMMIT, options);
+	}
+	if (author_count < 1) {
+		int err = report(options, oid, OBJ_COMMIT, FSCK_MSG_MISSING_AUTHOR, "invalid format - expected 'author' line");
+		if (err)
+			return err;
+	} else if (author_count > 1) {
+		int err = report(options, oid, OBJ_COMMIT, FSCK_MSG_MULTIPLE_AUTHORS, "invalid format - multiple 'author' lines");
 		if (err)
 			return err;
 	}
-	if (author_count < 1)
-		err = report(options, oid, OBJ_COMMIT, FSCK_MSG_MISSING_AUTHOR, "invalid format - expected 'author' line");
-	else if (author_count > 1)
-		err = report(options, oid, OBJ_COMMIT, FSCK_MSG_MULTIPLE_AUTHORS, "invalid format - multiple 'author' lines");
-	if (err)
-		return err;
 	if (buffer >= buffer_end || !skip_prefix(buffer, "committer ", &buffer))
 		return report(options, oid, OBJ_COMMIT, FSCK_MSG_MISSING_COMMITTER, "invalid format - expected 'committer' line");
-	err = fsck_ident(&buffer, oid, OBJ_COMMIT, options);
-	if (err)
-		return err;
+	else {
+		int err = fsck_ident(&buffer, oid, OBJ_COMMIT, options);
+		if (err)
+			return err;
+	}
 	if (memchr(buffer_begin, '\0', size)) {
-		err = report(options, oid, OBJ_COMMIT, FSCK_MSG_NUL_IN_COMMIT,
+		int err = report(options, oid, OBJ_COMMIT, FSCK_MSG_NUL_IN_COMMIT,
 				 "NUL byte in the commit object body");
 		if (err)
 			return err;

And then it is obvious that the general pattern is to propagate "err" from individual calls (and the ones that do not stick out like sore thumbs; are those bugs where we should keep going if the user set those message types to warn/ignore?).

You could even wrap the pattern in a macro, though perhaps that is getting too magical. The resulting logic is easier to follow, though, if you can look past the macro:

diff --git a/fsck.c b/fsck.c
index ea72b3247d..8c7ac3c448 100644
--- a/fsck.c
+++ b/fsck.c
@@ -919,6 +919,12 @@ static int fsck_ident(const char **ident,
 	return 0;
 }

+#define MAYBE_RETURN(x) do { \
+	int err = (x); \
+	if (err) \
+		return err; \
+} while (0)
+
 static int fsck_commit(const struct object_id *oid,
 		       const char *buffer, unsigned long size,
 		       struct fsck_options *options)
@@ -939,49 +945,30 @@ static int fsck_commit(const struct object_id *oid,

 	if (buffer >= buffer_end || !skip_prefix(buffer, "tree ", &buffer))
 		return report(options, oid, OBJ_COMMIT, FSCK_MSG_MISSING_TREE, "invalid format - expected 'tree' line");
-	if (parse_oid_hex(buffer, &tree_oid, &p) || *p != '\n') {
-		int err = report(options, oid, OBJ_COMMIT, FSCK_MSG_BAD_TREE_SHA1, "invalid 'tree' line format - bad sha1");
-		if (err)
-			return err;
-	}
+	if (parse_oid_hex(buffer, &tree_oid, &p) || *p != '\n')
+		MAYBE_RETURN(report(options, oid, OBJ_COMMIT, FSCK_MSG_BAD_TREE_SHA1, "invalid 'tree' line format - bad sha1"));
 	buffer = p + 1;
 	while (buffer < buffer_end && skip_prefix(buffer, "parent ", &buffer)) {
-		if (parse_oid_hex(buffer, &parent_oid, &p) || *p != '\n') {
-			int err = report(options, oid, OBJ_COMMIT, FSCK_MSG_BAD_PARENT_SHA1, "invalid 'parent' line format - bad sha1");
-			if (err)
-				return err;
-		}
+		if (parse_oid_hex(buffer, &parent_oid, &p) || *p != '\n')
+			MAYBE_RETURN(report(options, oid, OBJ_COMMIT, FSCK_MSG_BAD_PARENT_SHA1, "invalid 'parent' line format - bad sha1"));
 		buffer = p + 1;
 	}
 	author_count = 0;
 	while (buffer < buffer_end && skip_prefix(buffer, "author ", &buffer)) {
-		int err = fsck_ident(&buffer, oid, OBJ_COMMIT, options);
-		if (err)
-			return err;
+		MAYBE_RETURN(fsck_ident(&buffer, oid, OBJ_COMMIT, options));
 		author_count++;
 	}
-	if (author_count < 1) {
-		int err = report(options, oid, OBJ_COMMIT, FSCK_MSG_MISSING_AUTHOR, "invalid format - expected 'author' line");
-		if (err)
-			return err;
-	} else if (author_count > 1) {
-		int err = report(options, oid, OBJ_COMMIT, FSCK_MSG_MULTIPLE_AUTHORS, "invalid format - multiple 'author' lines");
-		if (err)
-			return err;
-	}
+	if (author_count < 1)
+		MAYBE_RETURN(report(options, oid, OBJ_COMMIT, FSCK_MSG_MISSING_AUTHOR, "invalid format - expected 'author' line"));
+	else if (author_count > 1)
+		MAYBE_RETURN(report(options, oid, OBJ_COMMIT, FSCK_MSG_MULTIPLE_AUTHORS, "invalid format - multiple 'author' lines"));
 	if (buffer >= buffer_end || !skip_prefix(buffer, "committer ", &buffer))
 		return report(options, oid, OBJ_COMMIT, FSCK_MSG_MISSING_COMMITTER, "invalid format - expected 'committer' line");
-	else {
-		int err = fsck_ident(&buffer, oid, OBJ_COMMIT, options);
-		if (err)
-			return err;
-	}
-	if (memchr(buffer_begin, '\0', size)) {
-		int err = report(options, oid, OBJ_COMMIT, FSCK_MSG_NUL_IN_COMMIT,
-				 "NUL byte in the commit object body");
-		if (err)
-			return err;
-	}
+	else
+		MAYBE_RETURN(fsck_ident(&buffer, oid, OBJ_COMMIT, options));
+
+	if (memchr(buffer_begin, '\0', size))
+		MAYBE_RETURN(report(options, oid, OBJ_COMMIT, FSCK_MSG_NUL_IN_COMMIT, "NUL byte in the commit object body"));
 	return 0;
 }

I'd suspect that just the first patch above would fix the CodeQL issue. It's certainly a larger diff, but IMHO the result is less confusing for humans, too.

-Peff
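For readers who want to try the propagate-on-error pattern outside of Git, here is a minimal standalone sketch. Only the `MAYBE_RETURN` macro is taken from the reply above; `check_ident()`, `report_model()`, `check_authors()`, the `strict` flag and the numeric return codes are all hypothetical stand-ins for `fsck_ident()`, `report()` and `fsck_commit()`, chosen purely for illustration.

```c
#include <stddef.h>
#include <string.h>

/* Same shape as the macro in the reply: return early on a non-zero result. */
#define MAYBE_RETURN(x) do { \
	int err = (x); \
	if (err) \
		return err; \
} while (0)

/* Hypothetical ident check: 0 if the text contains '<', -1 otherwise. */
static int check_ident(const char *ident)
{
	return strchr(ident, '<') ? 0 : -1;
}

/*
 * Hypothetical report(): yields the code when problems are treated as
 * errors (strict), and 0 when they are only warned about or ignored.
 */
static int report_model(int strict, int code)
{
	return strict ? code : 0;
}

/*
 * Toy counterpart of the author loop. lines holds n header lines.
 * Returns 0 on success, -1 for a malformed ident, -2 for a missing
 * author and -3 for multiple authors (illustrative codes only).
 */
int check_authors(const char **lines, size_t n, int strict)
{
	unsigned author_count = 0;
	size_t i;

	for (i = 0; i < n && !strncmp(lines[i], "author ", 7); i++) {
		MAYBE_RETURN(check_ident(lines[i] + 7));
		author_count++;
	}
	if (author_count < 1)
		MAYBE_RETURN(report_model(strict, -2));
	else if (author_count > 1)
		MAYBE_RETURN(report_model(strict, -3));
	return 0;
}
```

Because each `err` lives only inside the macro's own block, no function-wide variable is left to be read on a path that never assigned it. The `do { ... } while (0)` wrapper is what lets the macro sit safely in the unbraced `if`/`else if` arms.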

dscho · 2025-07-02T08:51:25Z

This discussion became unproductive a long time ago. It was never my intention to "fix" the code (and I disagree that the awkward macro would make the code better; to the contrary), but just to suppress CodeQL alerts in favor of increasing the SNR.

I'll just suppress the affected CodeQL queries.

This PR adds a [CodeQL](https://codeql.github.com/) workflow to this repository.

CodeQL is touted as a "semantic code analysis engine", i.e. its intention isn't really to be a full industry-grade static code analyzer like Coverity (which we [already run in our CI builds](https://github.com/microsoft/git/actions/workflows/coverity.yml)). This shows in the number of queries we have to suppress because they would result in an unwieldy number of false positives.

Or, one might argue, it shows how convoluted part of Git's logic is, which may confuse not only CodeQL but also human readers. The latter might not be a _direct_ security issue, but as Tony Hoare [said](https://en.wikiquote.org/wiki/C._A._R._Hoare#The_Emperor's_Old_Clothes):

> There are two ways of constructing a software design: One way is to make it so simple that there are obviously no deficiencies, and the other way is to make it so complicated that there are no obvious deficiencies. The first method is far more difficult.

I tried to make the code simpler with gitgitgadget#1888 and with gitgitgadget#1890, but the discussion went astray from my original purpose, instead veering off into the direction of arguing in complicated lines of reasoning that the current code is actually correct. Nevertheless, I do include these patches here, structured via merge commits to make it easier to drop them should too many merge conflicts in rebases to upstream Git versions make that advisable.

However, for most of the alerts, the approach I took is to exclude the queries wholesale when they caused alerts. I could have taken the approach to suppress the alerts in a fine-grained way (via `codeql[<query-name>]`) so that true positives aren't suppressed in addition to false positives. However, I am loath to take that approach because a voice inside me fully expects backlash on the Git mailing list along the lines of: "You're littering the code just to appease CodeQL!".

Theoretically, an alternative exists: To develop modified versions of those CodeQL queries, e.g. to ignore paths inside the `.git/` directory in the `cpp/toctou-race-condition` query. However, CodeQL is a language that is not only (intentionally?) difficult to develop in by virtue of being declarative instead of imperative, its debugging facilities are pretty much non-existent.

Given all of the above, the obvious question arises: Why bother with CodeQL at all? The truthful answer is: It is mandated by the Secure Future Initiative. And who knows, maybe CodeQL will come in handy in the future? After all, it is a framework more than a solution, and should in principle be able to help with answering questions like: "Which call paths into `libgit.a` could result in `die()` being called?".

gitgitgadget · 2025-07-17T16:47:47Z

On the Git mailing list, Johannes Schindelin wrote (reply to this):

Hi,

On Thu, 27 Mar 2025, Johannes Schindelin via GitGitGadget wrote:

> When I ran CodeQL on Git's source code, it said that variables might be
> uninitialized in a few places.

For the record, I am abandoning my efforts to upstream this. I was never
interested in the actual changes, I was interested in getting CodeQL to
run on Git's code base without reporting false positives, so that I could
compare the quality of the reports against Coverity's. Therefore, I was
not really prepared to polish these patches as if they were something
important: They are not, and I cannot justify spending any more time on
these patches. I will carry them as-is in microsoft/git, under the label
'uninitialized-variables', as long as they apply without major merge
conflict headaches.

Ciao,
Johannes

dscho added 4 commits March 21, 2025 11:44

load_revindex_from_disk(): avoid accessing uninitialized data

b990192

The `revindex_size` value is uninitialized in case the function is erroring out, but we want to assign its value. Let's just initialize it. Signed-off-by: Johannes Schindelin <[email protected]>

load_pack_mtimes_file(): avoid accessing uninitialized data

d630e95

The `mtimes_size` variable is uninitialized when the function errors out, yet its value is assigned to another variable. Let's just initialize it. Signed-off-by: Johannes Schindelin <[email protected]>

dscho self-assigned this Mar 24, 2025

gitgitgadget bot reviewed Mar 27, 2025

gitgitgadget bot reviewed Mar 28, 2025

dscho closed this Jul 2, 2025

dscho deleted the uninitialized-variables branch July 2, 2025 08:51

dscho mentioned this pull request Jul 7, 2025

Run CodeQL as part of the CI microsoft/git#771

Merged
