-
Notifications
You must be signed in to change notification settings - Fork 2.1k
metrics: add total http connections metric for agent #26756
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
| serverInitializationErrors error | ||
|
|
||
| connCount int = 0 | ||
| countMux *sync.Mutex = &sync.Mutex{} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Adding a lock to the ConnState callback could potentially introduce some contention if there are many incoming requests and a short telemetry.collectionInterval. This may be worth running some benchmarks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I scripted a very basic test hitting the nomad api concurrently. Comparing the times to a previous version of Nomad without this metric, it showed no meaningful difference in total time. This is not surprising as the handling of connections is already bottlenecked by the lock within the connection limiter.
| connMux.Unlock() | ||
|
|
||
| // Call connection limiter if enabled | ||
| if connLimit > 0 { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We're already introducing some closures here with the connCount and mutex, adding the connLimit as one more closure allows us to condense the logic a bit.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I had to do a little truth table for myself to verify we hadn't changed the behavior unexpectedly. 😁 I think this holds for both the old and new versions so LGTM
| non-TLS or no handshake timeout |
connLimit > 0 | call connLimiter? | set deadlines? |
|---|---|---|---|
| yes | yes | yes | no |
| yes | no | no | no |
| no | yes | yes | yes |
| no | no | no | yes |
|
@mismithhisler Since the docs content has migrated to the web-unified-docs repo, the updates in this PR need to be recreated in the other repo. I can do the docs update or help, once Eng decides which release this is going into. |
|
|
||
| // lock connCount to avoid torn reads, as this is updated by ConnState callbacks | ||
| countMux.Lock() | ||
| metrics.SetGauge([]string{"nomad", "agent", "http", "connections"}, float32(connCount)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For something as simple as this int, would an atomic be better?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah I'd probably make this an atomic, so you never have to worry about holding it open or where to unlock it. Just mind you don't accidentally copy it, same as with the mutex.
tgross
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking good! Don't forget a changelog.
|
|
||
| // lock connCount to avoid torn reads, as this is updated by ConnState callbacks | ||
| countMux.Lock() | ||
| metrics.SetGauge([]string{"nomad", "agent", "http", "connections"}, float32(connCount)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah I'd probably make this an atomic, so you never have to worry about holding it open or where to unlock it. Just mind you don't accidentally copy it, same as with the mutex.
| connMux.Unlock() | ||
|
|
||
| // Call connection limiter if enabled | ||
| if connLimit > 0 { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I had to do a little truth table for myself to verify we hadn't changed the behavior unexpectedly. 😁 I think this holds for both the old and new versions so LGTM
| non-TLS or no handshake timeout |
connLimit > 0 | call connLimiter? | set deadlines? |
|---|---|---|---|
| yes | yes | yes | no |
| yes | no | no | no |
| no | yes | yes | yes |
| no | no | no | yes |
|
@mismithhisler what do you think about closing out #7893 while we're here and adding some metrics on rejected connections too? |
Description
This PR adds tracking of http connections and emits those as the metric
nomad.agent.http.connection. We use a gauge for this metric so we can track long lived connections.Testing & Reproduction steps
In one terminal, run a dev agent. In another terminal, run
nomad monitor. Wait thein_memory_collection_interval. In a third terminal, runcurl -s localhost:4646/v1/metrics | jq '.Gauges[] | select(.Name | contains("nomad.nomad.agent.http"))'. Observe the value of1.0for the total connections.Links
Docs deploy preview: https://nomad-git-f-add-total-http-connections-metric-hashicorp.vercel.app/nomad/docs/reference/metrics#agent-metrics
Contributor Checklist
changelog entry using the
make clcommand.ensure regressions will be caught.
and job configuration, please update the Nomad website documentation to reflect this. Refer to
the website README for docs guidelines. Please also consider whether the
change requires notes within the upgrade guide.
Reviewer Checklist
backporting document.
in the majority of situations. The main exceptions are long-lived feature branches or merges where
history should be preserved.
within the public repository.
Changes to Security Controls
Are there any changes to security controls (access controls, encryption, logging) in this pull request? If so, explain.