MWI: Add teleport_bot_instances metric #59774

Merged
boxofrad merged 3 commits into master from boxofrad/bot-version-metrics on Oct 8, 2025

@boxofrad commented Sep 30, 2025

Exposes the number of bot instances by version, so that Teleport Cloud and other operators can take it into account when performing cluster upgrades.

changelog: MWI: Add teleport_bot_instances metric to track the number of bot instances across the cluster, by version

github-actions bot commented Sep 30, 2025

Amplify deployment status

| Branch | Commit | Job ID | Status | Preview | Updated (UTC) |
|---|---|---|---|---|---|
| boxofrad/bot-version-metrics | ab16d12 | 5 | ✅ SUCCEED | boxofrad-bot-version-metrics | 2025-10-07 18:12:26 |

Comment on lines +337 to +342
    for version, count := range byVersion {
        gauge.With(prometheus.Labels{
            teleport.TagVersion:          version,
            teleport.TagAutomaticUpdates: "true",
        }).Set(float64(count))
    }

Contributor

This looks like an unbounded, ever-growing number of time series that is never reset during the process lifetime. I'm not sure how much of an issue this is, but it smells. You might want to leave a comment saying that no other labels should be added to this metric without further safeguards to mitigate label cardinality.

For reference, I wrote (but never had time to merge) a metrics RFD: https://github.com/gravitational/teleport/blob/92a08edc66a192a161a90ca5cf10162085d63d9c/rfd/0197-prometheus-metrics.md
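
To make "further safeguards" concrete, here is a minimal sketch of one possible cardinality guard: keep only the most common versions and fold the rest into a single bucket before writing to the gauge. This is not what the PR does. The `capVersions` helper, the `"other"` bucket name, and the version strings are all hypothetical, and the sketch uses only the standard library.

```go
package main

import (
	"fmt"
	"sort"
)

// capVersions (hypothetical helper) keeps the `limit` most common versions
// and folds every remaining version into one "other" bucket, so the number
// of distinct label values is bounded by limit+1.
func capVersions(byVersion map[string]int, limit int) map[string]int {
	if limit < 0 {
		limit = 0
	}
	versions := make([]string, 0, len(byVersion))
	for v := range byVersion {
		versions = append(versions, v)
	}
	// Most instances first; ties are broken by version string so the
	// result is deterministic.
	sort.Slice(versions, func(i, j int) bool {
		a, b := versions[i], versions[j]
		if byVersion[a] != byVersion[b] {
			return byVersion[a] > byVersion[b]
		}
		return a < b
	})
	out := make(map[string]int, limit+1)
	for i, v := range versions {
		if i < limit {
			out[v] = byVersion[v]
		} else {
			out["other"] += byVersion[v]
		}
	}
	return out
}

func main() {
	// Made-up report: four versions, capped to the two most common.
	report := map[string]int{"18.2.0": 5, "18.2.1": 3, "17.7.4": 1, "17.7.3": 1}
	fmt.Println(capVersions(report, 2))
}
```

The trade-off is that rare versions lose their individual counts, which may or may not be acceptable for upgrade planning.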

Contributor

This sounds like a problem we should actually address - since if you upgraded all your tbots from one version to another, they'd all still show up under the old version 😓 I think I ran into this in the past year or two - let me see if I can find how I resolved it.

Contributor

So I've handled this two different ways previously:

  1. Just reset the GaugeVec before you write to it. This obviously leaves a very short period where its state is "weird", but if you've precomputed the values before you start writing into it, that period is very short.
  2. Directly implement the Collector interface yourself. I tend to prefer this to (1) since it's less hacky, but you also benefit less from the guard rails the SDK provides.
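
Option (2) can be sketched roughly as follows. In the Prometheus Go client this would mean implementing the `prometheus.Collector` interface (`Describe` and `Collect`). To stay self-contained, the sketch below only models that idea with the standard library: `botCollector`, `sample`, and the version strings are hypothetical stand-ins, not code from this PR. Series are built from the latest report at scrape time, so nothing from an older report can linger and there is no half-written state.

```go
package main

import (
	"fmt"
	"sort"
	"sync"
)

// sample is one exported series: a version label and its value.
type sample struct {
	Version string
	Value   float64
}

// botCollector is a hypothetical stand-in for a custom collector. It stores
// only the latest report and builds samples on demand.
type botCollector struct {
	mu        sync.RWMutex
	byVersion map[string]int
}

// Update copies the new report and swaps it in under the lock, so readers
// see either the old report or the new one, never a mix.
func (c *botCollector) Update(byVersion map[string]int) {
	snapshot := make(map[string]int, len(byVersion))
	for v, n := range byVersion {
		snapshot[v] = n
	}
	c.mu.Lock()
	c.byVersion = snapshot
	c.mu.Unlock()
}

// Collect models a scrape: one sample per version in the latest report,
// sorted by version for stable output.
func (c *botCollector) Collect() []sample {
	c.mu.RLock()
	defer c.mu.RUnlock()
	out := make([]sample, 0, len(c.byVersion))
	for v, n := range c.byVersion {
		out = append(out, sample{Version: v, Value: float64(n)})
	}
	sort.Slice(out, func(i, j int) bool { return out[i].Version < out[j].Version })
	return out
}

func main() {
	c := &botCollector{}
	c.Update(map[string]int{"18.2.0": 2, "18.2.1": 1})
	fmt.Println(c.Collect())

	// Every bot upgraded: the old version disappears without any reset.
	c.Update(map[string]int{"18.2.1": 3})
	fmt.Println(c.Collect())
}
```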

Contributor Author

Thanks for linking to the RFD! That's a really useful resource.

On the resetting thing, I'm already calling gauge.Reset on L316, which I think is what @strideynet suggested above. Is that enough to drop the stale series?
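
For illustration, the effect of resetting before writing can be modelled with the standard library alone. In the Prometheus Go client, `Reset` on a metric vector deletes all of its series. The `gaugeVec` type below is only a hypothetical stand-in for that behaviour, and `publish` and the version strings are made up for the example.

```go
package main

import (
	"fmt"
	"sort"
)

// gaugeVec is a stand-in for a labelled gauge: one value per version label.
// It is NOT the Prometheus client type, only a model of its behaviour.
type gaugeVec struct {
	series map[string]float64
}

func newGaugeVec() *gaugeVec { return &gaugeVec{series: map[string]float64{}} }

// Set records the value for one label value, creating the series if needed.
func (g *gaugeVec) Set(version string, v float64) { g.series[version] = v }

// Reset drops every series in the vector.
func (g *gaugeVec) Reset() { g.series = map[string]float64{} }

// Versions returns the label values currently exposed, sorted.
func (g *gaugeVec) Versions() []string {
	out := make([]string, 0, len(g.series))
	for v := range g.series {
		out = append(out, v)
	}
	sort.Strings(out)
	return out
}

// publish writes a precomputed report; resetFirst toggles the Reset call.
func publish(g *gaugeVec, byVersion map[string]int, resetFirst bool) {
	if resetFirst {
		g.Reset()
	}
	for version, count := range byVersion {
		g.Set(version, float64(count))
	}
}

func main() {
	before := map[string]int{"18.2.0": 3}
	after := map[string]int{"18.2.1": 3} // every bot upgraded

	stale := newGaugeVec()
	publish(stale, before, false)
	publish(stale, after, false)
	fmt.Println(stale.Versions()) // the old version is still exposed

	fresh := newGaugeVec()
	publish(fresh, before, true)
	publish(fresh, after, true)
	fmt.Println(fresh.Versions()) // only the current version remains
}
```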

Contributor

Oops - I missed that 🙈

@boxofrad boxofrad force-pushed the boxofrad/bot-version-metrics branch from e8377fc to d413d02 Compare October 7, 2025 11:49
Base automatically changed from boxofrad/bot-version-report to master October 7, 2025 14:54
changelog: MWI: Add `teleport_bot_instances` metric
@boxofrad boxofrad force-pushed the boxofrad/bot-version-metrics branch from d413d02 to 4b8cc27 Compare October 7, 2025 15:24
@boxofrad boxofrad added this pull request to the merge queue Oct 7, 2025
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Oct 7, 2025
@boxofrad boxofrad added this pull request to the merge queue Oct 8, 2025
Merged via the queue into master with commit 040caaf Oct 8, 2025
43 checks passed
@boxofrad boxofrad deleted the boxofrad/bot-version-metrics branch October 8, 2025 09:34
boxofrad added a commit that referenced this pull request Oct 13, 2025
boxofrad added a commit that referenced this pull request Oct 14, 2025
boxofrad added a commit that referenced this pull request Oct 15, 2025
boxofrad added a commit that referenced this pull request Oct 15, 2025
boxofrad added a commit that referenced this pull request Oct 15, 2025
github-merge-queue bot pushed a commit that referenced this pull request Oct 15, 2025
* MWI: Generate `AutoUpdateBotInstanceReport` resource (#59738)

* MWI: Add `tctl` get and delete mappings for `AutoUpdateBotInstanceReport` (#60017)

* MWI: Add `teleport_bot_instances` metric (#59774)

* MWI: Log on `AutoUpdateBotInstanceReport` generation failure (#60191)

* Fix passing lock by value

* Allow `machineid.AutoUpdateVersionReporter` to shut down correctly (#60219)
github-merge-queue bot pushed a commit that referenced this pull request Oct 15, 2025
* MWI: Generate `AutoUpdateBotInstanceReport` resource (#59738)

* MWI: Add `tctl` get and delete mappings for `AutoUpdateBotInstanceReport` (#60017)

* MWI: Add `teleport_bot_instances` metric (#59774)

* MWI: Log on `AutoUpdateBotInstanceReport` generation failure (#60191)

* Allow `machineid.AutoUpdateVersionReporter` to shut down correctly (#60219)
rhammonds-teleport pushed a commit that referenced this pull request Nov 6, 2025
* MWI: Add `teleport_bot_instances` metric

changelog: MWI: Add `teleport_bot_instances` metric

* Use `InEpsilon`

* Fix import ordering
mmcallister pushed a commit that referenced this pull request Nov 19, 2025
* MWI: Add `teleport_bot_instances` metric

changelog: MWI: Add `teleport_bot_instances` metric

* Use `InEpsilon`

* Fix import ordering
mmcallister pushed a commit that referenced this pull request Nov 20, 2025
* MWI: Add `teleport_bot_instances` metric

changelog: MWI: Add `teleport_bot_instances` metric

* Use `InEpsilon`

* Fix import ordering

Labels: documentation, machine-id, observability, size/md


3 participants