mismithhisler (Member) commented Sep 12, 2025

Description

This PR adds tracking of HTTP connections and emits the count as the metric nomad.agent.http.connections. The metric is a gauge so that long-lived connections are tracked as well.

Testing & Reproduction steps

In one terminal, run a dev agent. In another terminal, run nomad monitor. Wait for the in_memory_collection_interval to elapse. In a third terminal, run curl -s localhost:4646/v1/metrics | jq '.Gauges[] | select(.Name | contains("nomad.nomad.agent.http"))'. Observe a value of 1.0 for the total connections.

Links

Docs deploy preview: https://nomad-git-f-add-total-http-connections-metric-hashicorp.vercel.app/nomad/docs/reference/metrics#agent-metrics

Contributor Checklist

  • Changelog Entry If this PR changes user-facing behavior, please generate and add a
    changelog entry using the make cl command.
  • Testing Please add tests to cover any new functionality or to demonstrate bug fixes and
    ensure regressions will be caught.
  • Documentation If the change impacts user-facing functionality such as the CLI, API, UI,
    and job configuration, please update the Nomad website documentation to reflect this. Refer to
    the website README for docs guidelines. Please also consider whether the
    change requires notes within the upgrade guide.

Reviewer Checklist

  • Backport Labels Please add the correct backport labels as described by the internal
    backporting document.
  • Commit Type Ensure the correct merge method is selected which should be "squash and merge"
    in the majority of situations. The main exceptions are long-lived feature branches or merges where
    history should be preserved.
  • Enterprise PRs If this is an enterprise only PR, please add any required changelog entry
    within the public repository.
  • If a change needs to be reverted, we will roll out an update to the code within 7 days.

Changes to Security Controls

Are there any changes to security controls (access controls, encryption, logging) in this pull request? If so, explain.

@mismithhisler mismithhisler self-assigned this Sep 12, 2025
serverInitializationErrors error

connCount int = 0
countMux *sync.Mutex = &sync.Mutex{}
Member Author

Adding a lock to the ConnState callback could introduce contention when there are many incoming requests and a short telemetry.collectionInterval. It may be worth running some benchmarks.

Member Author

I scripted a very basic test that hits the Nomad API concurrently. Compared with a previous version of Nomad without this metric, it showed no meaningful difference in total time. This is not surprising, as connection handling is already bottlenecked by the lock within the connection limiter.

connMux.Unlock()

// Call connection limiter if enabled
if connLimit > 0 {
Member Author

We already close over connCount and the mutex here, so closing over connLimit as well lets us condense the logic a bit.

Member

I had to do a little truth table for myself to verify we hadn't changed the behavior unexpectedly. 😁 I think this holds for both the old and new versions so LGTM

| non-TLS or no handshake timeout | connLimit > 0 | call connLimiter? | set deadlines? |
| --- | --- | --- | --- |
| yes | yes | yes | no |
| yes | no | no | no |
| no | yes | yes | yes |
| no | no | no | yes |
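
The truth table can also be written as a small function, which makes it easy to check each row. This is only a sketch of the decision logic as the table describes it, not the code from the diff; `connActions` and its parameter names are hypothetical.

```go
package main

import "fmt"

// connActions (hypothetical) mirrors the truth table above.
// plainOrNoTimeout is true for a non-TLS listener or when no handshake
// timeout is configured.
func connActions(plainOrNoTimeout bool, connLimit int) (callLimiter, setDeadlines bool) {
	// The limiter is consulted only when a limit is configured.
	callLimiter = connLimit > 0
	// Deadlines are set only for TLS connections with a handshake timeout.
	setDeadlines = !plainOrNoTimeout
	return callLimiter, setDeadlines
}

func main() {
	for _, plain := range []bool{true, false} {
		for _, limit := range []int{100, 0} {
			limiter, deadlines := connActions(plain, limit)
			fmt.Printf("plainOrNoTimeout=%t connLimit=%d callLimiter=%t setDeadlines=%t\n",
				plain, limit, limiter, deadlines)
		}
	}
}
```

The two outputs are independent of each other: the limiter depends only on connLimit, and the deadlines depend only on the TLS and handshake timeout condition.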

@aimeeu aimeeu added the theme/docs Documentation issues and enhancements label Sep 18, 2025
@mismithhisler mismithhisler marked this pull request as ready for review October 28, 2025 15:12
@mismithhisler mismithhisler requested review from a team as code owners October 28, 2025 15:12
@jrasell jrasell added this to the 1.11.x milestone Oct 28, 2025
aimeeu (Contributor) commented Nov 19, 2025

@mismithhisler Since the docs content has migrated to the web-unified-docs repo, the updates in this PR need to be recreated in the other repo. I can do the docs update or help, once Eng decides which release this is going into.


// lock connCount to avoid torn reads, as this is updated by ConnState callbacks
countMux.Lock()
metrics.SetGauge([]string{"nomad", "agent", "http", "connections"}, float32(connCount))
Member Author

For something as simple as this int, would an atomic be better?

Member

Yeah I'd probably make this an atomic, so you never have to worry about holding it open or where to unlock it. Just mind you don't accidentally copy it, same as with the mutex.

tgross (Member) left a comment

Looking good! Don't forget a changelog.

tgross (Member) commented Jan 23, 2026

@mismithhisler what do you think about closing out #7893 while we're here and adding some metrics on rejected connections too?
