Skip to content

fix(sshutil): add mutex to TOFU known_hosts to prevent race condition#644

Merged
ArangoGutierrez merged 1 commit intoNVIDIA:mainfrom
ArangoGutierrez:fix/tofu-race-condition
Feb 13, 2026
Merged

fix(sshutil): add mutex to TOFU known_hosts to prevent race condition#644
ArangoGutierrez merged 1 commit intoNVIDIA:mainfrom
ArangoGutierrez:fix/tofu-race-condition

Conversation

@ArangoGutierrez
Copy link
Collaborator

Summary

  • Add package-level sync.Mutex to serialize access to the known_hosts file in the TOFU host key callback
  • Prevents TOCTOU race when multiple SSH connections are established concurrently during cluster provisioning

Audit Finding

Finding #13 (MEDIUM): Concurrent SSH connections could race on the known_hosts file read-then-write, causing duplicate entries or inconsistent state.

Test plan

  • Existing TOFU tests pass with -race flag
  • golangci-lint clean
  • go build compiles

Concurrent SSH connections (during cluster provisioning) could race on
the known_hosts file read-then-write, causing duplicate entries or
inconsistent state. Add a package-level mutex around the callback.

Audit finding NVIDIA#13 (MEDIUM).

Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
Copilot AI review requested due to automatic review settings February 12, 2026 19:30
Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds in-process synchronization to Holodeck’s SSH TOFU host key callback to avoid concurrent read/write races on the cached known_hosts file during parallel provisioning.

Changes:

  • Introduce a package-level sync.Mutex to serialize TOFU known_hosts access.
  • Acquire/release the mutex at the start of the ssh.HostKeyCallback to prevent TOCTOU between read and append.

Comment on lines +43 to +45
tofuMu.Lock()
defer tofuMu.Unlock()

Copy link

Copilot AI Feb 12, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The new global mutex is intended to fix a concurrency race, but there’s no test exercising concurrent calls to the callback. Adding a unit test that runs many goroutines calling this callback against the same temp known_hosts path (and asserting the file ends up with a single correct entry and no errors) would prevent regressions and would fail under -race if the locking is removed/broken.

Copilot generated this review using guidance from repository custom instructions.
@coveralls
Copy link

Pull Request Test Coverage Report for Build 21961219156

Details

  • 3 of 3 (100.0%) changed or added relevant lines in 1 file are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage increased (+0.03%) to 47.531%

Totals Coverage Status
Change from base Build 21955389842: 0.03%
Covered Lines: 2503
Relevant Lines: 5266

💛 - Coveralls

@ArangoGutierrez ArangoGutierrez merged commit 38c2b4b into NVIDIA:main Feb 13, 2026
25 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants