MWI: Add teleport_bot_instances metric#59774
Conversation
|
Amplify deployment status
|
e19e015 to
578873d
Compare
ec832ae to
e8377fc
Compare
| for version, count := range byVersion { | ||
| gauge.With(prometheus.Labels{ | ||
| teleport.TagVersion: version, | ||
| teleport.TagAutomaticUpdates: "true", | ||
| }).Set(float64(count)) | ||
| } |
There was a problem hiding this comment.
This looks like an infinitely growing number of timeseries, never reset during the process lifetime. I'm not sure how much of an issue this is but this smells. You might want to leave a comment saying that no other labels should be added to this metric without further safeties to mitigate label cardinality.
For reference I wrote but never had time to merge a metrics RFD: https://github.com/gravitational/teleport/blob/92a08edc66a192a161a90ca5cf10162085d63d9c/rfd/0197-prometheus-metrics.md
There was a problem hiding this comment.
This sounds like a problem we should actually address - since if you upgraded all your tbots from one version to another, they'd all still show up under the old version 😓 I think I ran into this in the past year or two - let me see if I can find how I resolved it.
There was a problem hiding this comment.
So I've handled this two different ways previously:
- Just reset the GaugeVec before you write to it. This obviously leaves a very short period where it's state is "weird" but often if you've precomputed the values before you start writing into it - this period is very very short.
- Just directly implemented the Collector interface myself. I actually feel like I tend to prefer this to (1) since it's less hacky but you also benefit less from the guard rails the SDK provides.
There was a problem hiding this comment.
Thanks for linking to the RFD! That's a really useful resource.
On the resetting thing, I'm already calling gauge.Reset on L316, which I think is what @strideynet suggested above. Is that enough to drop the stale series?
e8377fc to
d413d02
Compare
changelog: MWI: Add `teleport_bot_instances` metric
d413d02 to
4b8cc27
Compare
* MWI: Generate `AutoUpdateBotInstanceReport` resource (#59738) * MWI: Add `tctl` get and delete mappings for `AutoUpdateBotInstanceReport` (#60017) * MWI: Add `teleport_bot_instances` metric (#59774) * MWI: Log on `AutoUpdateBotInstanceReport` generation failure (#60191) * Fix passing lock by value * Allow `machineid.AutoUpdateVersionReporter` to shut down correctly (#60219)
* MWI: Generate `AutoUpdateBotInstanceReport` resource (#59738) * MWI: Add `tctl` get and delete mappings for `AutoUpdateBotInstanceReport` (#60017) * MWI: Add `teleport_bot_instances` metric (#59774) * MWI: Log on `AutoUpdateBotInstanceReport` generation failure (#60191) * Allow `machineid.AutoUpdateVersionReporter` to shut down correctly (#60219)
Exposes the number of bot instances by version, so that Teleport Cloud and other operators can take it into account when performing cluster upgrades.
changelog: MWI: Add
teleport_bot_instancesmetric to track the number of bot instances across the cluster, by version