Run CodeQL as part of the CI #771

dscho · 2025-07-07T09:37:24Z

This PR adds a CodeQL workflow to this repository. CodeQL is touted as a "semantic code analysis engine", i.e. its intention isn't really a full industry-grade static code analyzer like Coverity (which we already run in our CI builds). This shows in the number of queries we have to suppress because they would result in an unwieldy number of false positives.

Or, one might argue, it shows how convoluted part of Git's logic is, which may not only confuse CodeQL but also human readers. The latter might not be a direct security issue, but as Tony Hoare said:

There are two ways of constructing a software design: One way is to make it so simple that there are obviously no deficiencies, and the other way is to make it so complicated that there are no obvious deficiencies. The first method is far more difficult.

I tried to make the code simpler with gitgitgadget#1888 and with gitgitgadget#1890, but the discussion went astray from my original purpose, instead veering off into the direction of arguing in complicated lines of reasoning that the current code is actually correct. Nevertheless, I do include these patches here, structured via merge commits to make it easier to drop them should too many merge conflicts in rebases to upstream Git versions make that advisable.

However, for most of the alerts, the approach I took is exclude the queries wholesale, when they caused alerts. I could have taken the approach to suppress the alerts in a fine-grained way (via codeql[<query-name>]) so that true positives aren't suppressed in addition to false positives. However, I am loathe to take that approach because a voice inside me fully expects backlash on the Git mailing list along the lines of :"You're littering the code just to appease CodeQL!".

Theoretically, an alternative exists: To develop modified versions of those CodeQL queries, e.g. to ignore paths inside the .git/ directory in the cpp/toctou-race-condition query. However, CodeQL is a language that is not only (intentionally?) difficult to develop in by virtue of being declarative instead of imperative, its debugging facilities are pretty much non-existent.

Given all of the above, the obvious question begs itself: Why bother with CodeQL at all? The truthful answer is: It is mandated by the Secure Future Initiative. And who knows, maybe CodeQL will come in handy in the future? After all, it is a framework more than a solution, and should in principle be able to help with answering questions like: "Which call paths into libgit.a could result in die() being called?".

When `clar_parse_args()` encounters an unhandled command-line option, it wants to print out an error message that says so, and show the option that it encountered. However, the format claims that it wants to show a string, whereas using `argument[1]` provided only the first character. This was side-stepped in upstream `clar` in 7eaea8b2da (clar: argument parsing improvements, 2025-01-16) by not even showing the option anymore. Let's just fix this in Git until the next time Git synchronizes with an upstream `clar` version again. Signed-off-by: Johannes Schindelin <[email protected]>

When accessing an array element (or, in these instances, characters in a string), the check whether the array index is within bounds should always come before accessing said element. Signed-off-by: Johannes Schindelin <[email protected]>

…alized The large `switch` statement makes it a bit impractical to reason about the code. One of the code paths can technically lead to using `size` without being initialized: if the `t` case is taken and the type name is set to the empty string, we would actually leave `size` unintialized right until we use it. Practically, this cannot happen because the `do_oid_object_info_extended()` function is expected to always populate the `type_name` if asked for. However, it is quite unnecessary to leave the code as unwieldy to reason about: Just initialize the variable to 0 and be done with it. Signed-off-by: Johannes Schindelin <[email protected]>

In `fsck_commit()`, after counting the authors of a commit, we set the `err` variable either when there was no author, or when there were more than two authors recorded. Then we access the `err` variable to figure out whether we should return early. But if there was exactly one author, that variable is still uninitialized. Let's just initialize the variable. This issue was pointed out by CodeQL. Signed-off-by: Johannes Schindelin <[email protected]>

The `revindex_size` value is uninitialized in case the function is erroring out, but we want to assign its value. Let's just initialize it. Signed-off-by: Johannes Schindelin <[email protected]>

The `mtimes_size` variable is uninitialzed when the function errors out, yet its value is assigned to another variable. Let's just initialize it. Signed-off-by: Johannes Schindelin <[email protected]>

On the off chance that `lookup_decoration()` cannot find anything, let `leave_one_treesame_to_parent()` return gracefully instead of crashing. Pointed out by CodeQL. Signed-off-by: Johannes Schindelin <[email protected]>

CodeQL points out that `lookup_commit_reference()` can return NULL values. Signed-off-by: Johannes Schindelin <[email protected]>

CodeQL points out that `parse_object()` can return NULL values. Signed-off-by: Johannes Schindelin <[email protected]>

CodeQL points out that `lookup_commit()` can return NULL values. Signed-off-by: Johannes Schindelin <[email protected]>

CodeQL points out that `parse_tree_indirect()` can return NULL values. Signed-off-by: Johannes Schindelin <[email protected]>

CodeQL points out that `lookup_commit()` can return NULL values. Signed-off-by: Johannes Schindelin <[email protected]>

CodeQL points out that `branch_get()` can return NULL values. Signed-off-by: Johannes Schindelin <[email protected]>

CodeQL points out that `lookup_commit_reference()` can return NULL values. Signed-off-by: Johannes Schindelin <[email protected]>

CodeQL points out that `branch_get()` can return NULL values. Signed-off-by: Johannes Schindelin <[email protected]>

CodeQL is GitHub's native offering of a static code analyzer, and hence integrates with GitHub Actions better than any other static code analyzer. By default, it comes with a large range of "queries" that test for common code patterns that should be avoided. For now, we only target source code written in C, via the `language: cpp` directive. Just in case that other languages should be targeted, too, this GitHub workflow job is set up as a matrix job to make that easier in the future. For full documentation, see https://docs.github.com/en/code-security/code-scanning/introduction-to-code-scanning/about-code-scanning-with-codeql Co-authored-by: Pierre Tempel <[email protected]> Co-authored-by: Arthur Baars <[email protected]> Signed-off-by: Johannes Schindelin <[email protected]>

In some instances, CodeQL's web UI on github.com leaves questions unanswered. For example, in some alerts it is really necessary to follow the entire "taint flow" to understand why something might be an issue. The alerts for the `cpp/uncontrolled-allocation-size` rule, for example, are all false positives, and only when inspecting the exact flow does it become obvious that one alert wants to point out that the size of a binary patch hunk, which is specified in the patch, is then used to determine how much memory to allocate, which may potentially run out of memory (and is hence just Git doing what it is asked to, and does not need to be changed). To help with those issues, publish the `.sarif` file as part of every workflow run; This allows downloading that file and inspecting it e.g. with the SARIF viewer extension in VS Code (for details, see https://marketplace.visualstudio.com/items?itemName=MS-SarifVSCode.sarif-viewer). Signed-off-by: Johannes Schindelin <[email protected]>

…fter release As pointed out by CodeQL, it is a potentially dangerous practice to store local variables' addresses in non-local structs. My original intention was to make sure to clear it out after it was used, and before the function returns (which is when the address would go stale). However, I faced too much resistance in the Git project against such patches, there seemed to always be the overwhelming sentiment that the code isn't broken (even if it requires a complex and demanding analysis to wrap one's head around _that_). Therefore, I will be pragmatic and simply ask CodeQL to hold its peace about this issue forever. Signed-off-by: Johannes Schindelin <[email protected]>

As pointed out by CodeQL, it could be NULL and we usually check for that. Signed-off-by: Johannes Schindelin <[email protected]>

On the off-chance that it's NULL... Signed-off-by: Johannes Schindelin <[email protected]>

As pointed out by CodeQL, `lookup_commit()` can return NULL. Signed-off-by: Johannes Schindelin <[email protected]>

The code is a bit too hard to reason about for CodeQL to figure out whether the `fill_commit_graph_info()` function is at all called after `write_commit_graph()` returns (and hence whether `topo_levels` goes out of context before it is used again). The Git project insists that this is correct (and does not want to make the code more obviously correct), so let's silence CodeQL's complaints in this instance. Signed-off-by: Johannes Schindelin <[email protected]>

A couple of CodeQL's queries are opinionated in a way that is obviously not shared by Git's source code's state, and apparently intentionally so. For example, the "For loop variable changed in body" query as well as the "No trivial switch statements" one result in too many results that are apparently intentional in Git's source code. Let's not worry about those, then. Also, Git has plenty of instances where variables shadow other variables. Other valid yet not quite critical issues identified by CodeQL include complex conditionals and nested switch statements spanning several pages. We probably want to address these issues at some stage, but they are not as critical as other problems pointed out by CodeQL, so let's silence those queries for now and take care of them at a later stage. Signed-off-by: Johannes Schindelin <[email protected]>

The `localtime()` function is inherently thread-unsafe and should not be used anymore. Let's use `localtime_r()` instead. Signed-off-by: Johannes Schindelin <[email protected]>

Some fixes in `clar`, pointed out by CodeQL. Signed-off-by: Johannes Schindelin <[email protected]>

A few places where CodeQL thinks that variables might be uninitialized. Signed-off-by: Johannes Schindelin <[email protected]>

These patches implement some defensive programming to address complaints some static analyzers might have. Signed-off-by: Johannes Schindelin <[email protected]>

CodeQL pointed out a couple of issues, which are addressed in this patch series. Signed-off-by: Johannes Schindelin <[email protected]>

This patch series has been long in the making, ever since Johannes Nicolai and myself spiked this in November/December 2020. Signed-off-by: Johannes Schindelin <[email protected]>

dscho · 2025-07-07T10:15:43Z

Scalar Functional Tests / Scalar Functional Tests (windows-2019, ignored) (pull_request) Cancelled after 5s

That's because windows-2019 was retired; That is fixed in 3710e34 over in #770.

dscho and others added 25 commits July 6, 2025 19:49

load_revindex_from_disk(): avoid accessing uninitialized data

c004e4e

The `revindex_size` value is uninitialized in case the function is erroring out, but we want to assign its value. Let's just initialize it. Signed-off-by: Johannes Schindelin <[email protected]>

load_pack_mtimes_file(): avoid accessing uninitialized data

c94ab2b

The `mtimes_size` variable is uninitialzed when the function errors out, yet its value is assigned to another variable. Let's just initialize it. Signed-off-by: Johannes Schindelin <[email protected]>

revision: defensive programming

d1df00a

On the off chance that `lookup_decoration()` cannot find anything, let `leave_one_treesame_to_parent()` return gracefully instead of crashing. Pointed out by CodeQL. Signed-off-by: Johannes Schindelin <[email protected]>

get_parent(): defensive programming

94c4074

CodeQL points out that `lookup_commit_reference()` can return NULL values. Signed-off-by: Johannes Schindelin <[email protected]>

fetch-pack: defensive programming

6d7c86f

CodeQL points out that `parse_object()` can return NULL values. Signed-off-by: Johannes Schindelin <[email protected]>

unparse_commit(): defensive programming

216c78c

CodeQL points out that `lookup_commit()` can return NULL values. Signed-off-by: Johannes Schindelin <[email protected]>

verify_commit_graph(): defensive programming

499173e

CodeQL points out that `lookup_commit()` can return NULL values. Signed-off-by: Johannes Schindelin <[email protected]>

stash: defensive programming

d9a4c85

CodeQL points out that `parse_tree_indirect()` can return NULL values. Signed-off-by: Johannes Schindelin <[email protected]>

stash: defensive programming

e8f6e01

CodeQL points out that `lookup_commit()` can return NULL values. Signed-off-by: Johannes Schindelin <[email protected]>

push: defensive programming

984e44b

CodeQL points out that `branch_get()` can return NULL values. Signed-off-by: Johannes Schindelin <[email protected]>

fetch: defensive programming

e94d1ae

CodeQL points out that `branch_get()` can return NULL values. Signed-off-by: Johannes Schindelin <[email protected]>

describe: defensive programming

b81a679

CodeQL points out that `lookup_commit_reference()` can return NULL values. Signed-off-by: Johannes Schindelin <[email protected]>

inherit_tracking(): defensive programming

838a96c

CodeQL points out that `branch_get()` can return NULL values. Signed-off-by: Johannes Schindelin <[email protected]>

submodule: check return value of submodule_from_path()

24e8e35

As pointed out by CodeQL, it could be NULL and we usually check for that. Signed-off-by: Johannes Schindelin <[email protected]>

test-tool repository: check return value of lookup_commit()

548b20a

On the off-chance that it's NULL... Signed-off-by: Johannes Schindelin <[email protected]>

shallow: handle missing shallow commits gracefully

3d91638

As pointed out by CodeQL, `lookup_commit()` can return NULL. Signed-off-by: Johannes Schindelin <[email protected]>

dscho requested a review from mjcheetham July 7, 2025 09:37

dscho self-assigned this Jul 7, 2025

dscho added 3 commits July 7, 2025 11:54

clar(clar_summary_testsuite): avoid thread-unsafe localtime()

adfc130

The `localtime()` function is inherently thread-unsafe and should not be used anymore. Let's use `localtime_r()` instead. Signed-off-by: Johannes Schindelin <[email protected]>

Merge branch 'clar'

6e7901c

Some fixes in `clar`, pointed out by CodeQL. Signed-off-by: Johannes Schindelin <[email protected]>

Merge branch 'uninitialized-variables'

789e115

A few places where CodeQL thinks that variables might be uninitialized. Signed-off-by: Johannes Schindelin <[email protected]>

dscho added 4 commits July 7, 2025 11:54

Merge branch 'defensive-programming'

f793bf8

These patches implement some defensive programming to address complaints some static analyzers might have. Signed-off-by: Johannes Schindelin <[email protected]>

Merge branch 'codeql-fixes'

42e298f

CodeQL pointed out a couple of issues, which are addressed in this patch series. Signed-off-by: Johannes Schindelin <[email protected]>

Merge branch 'codeql'

8ed2c85

This patch series has been long in the making, ever since Johannes Nicolai and myself spiked this in November/December 2020. Signed-off-by: Johannes Schindelin <[email protected]>

Merge branch 'codeql'

3e70801

This patch series has been long in the making, ever since Johannes Nicolai and myself spiked this in November/December 2020. Signed-off-by: Johannes Schindelin <[email protected]>

dscho force-pushed the codeql branch from 3bd89d1 to 3e70801 Compare July 7, 2025 09:54

mjcheetham approved these changes Jul 7, 2025

View reviewed changes

dscho merged commit 290f9d0 into vfs-2.50.0 Jul 7, 2025
176 of 178 checks passed

dscho deleted the codeql branch July 7, 2025 14:46

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Run CodeQL as part of the CI #771

Run CodeQL as part of the CI #771

Uh oh!

dscho commented Jul 7, 2025

Uh oh!

dscho commented Jul 7, 2025

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Run CodeQL as part of the CI #771

Run CodeQL as part of the CI #771

Uh oh!

Conversation

dscho commented Jul 7, 2025

Uh oh!

dscho commented Jul 7, 2025

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants