Merge pull request #1150 from buildkite/docs/add-buildkite-agent-metrics-cli-section

gilesgas · web-flow · commit 83be56b36408 · 2026-03-02T17:16:10.000+11:00
Add buildkite-agent-metrics CLI section to monitoring and observability docs
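
Alongside the docs changes, this diff switches the internal-link check in `external_link.rb` from `include?` to `start_with?`. The sketch below illustrates the difference between substring and prefix matching. It is not the renderer's actual code: the `PREFIXES` list and both helper names are hypothetical, since the real `INTERNAL_LINK_PREFIXES` constant is not shown in this diff.

```ruby
# Hypothetical prefix list; the real INTERNAL_LINK_PREFIXES values are not part of this diff.
PREFIXES = ["/docs", "#"].freeze

# Old behaviour: a prefix appearing anywhere in the href counts as a match.
def internal_by_include?(href)
  PREFIXES.any? { |prefix| href.include?(prefix) }
end

# New behaviour: the href must begin with one of the prefixes.
def internal_by_start_with?(href)
  PREFIXES.any? { |prefix| href.start_with?(prefix) }
end

external = "https://github.com/buildkite/buildkite-agent-metrics?tab=readme-ov-file#buildkite-agent-metrics"
internal = "/docs/apis/agent-api/metrics"

puts internal_by_include?(external)    # true: the "#" fragment matches as a substring
puts internal_by_start_with?(external) # false: the href starts with "https://"
puts internal_by_start_with?(internal) # true: the href starts with "/docs"
```

Under these assumed prefixes, an absolute URL that merely contains a fragment or a `/docs` path segment would have been treated as internal by the substring check, whereas the prefix check only matches hrefs that begin with a listed prefix.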
diff --git a/app/models/page/renderers/external_link.rb b/app/models/page/renderers/external_link.rb
@@ -24,7 +24,7 @@ def initialize(node)
   def external_link?
 
     def has_internal_link_prefix?
-      INTERNAL_LINK_PREFIXES.any? { |prefix| @href.include?(prefix) }
+      INTERNAL_LINK_PREFIXES.any? { |prefix| @href.start_with?(prefix) }
     end
 
     def buildkite_domain?
@@ -41,10 +41,8 @@ def buildkite_domain?
   end
 
   def decorate_external_link_node
-    unless node['class']
-      node.set_attribute('class', 'external-link')
-      node.set_attribute('target', '_blank')
-    end
+    node.set_attribute('class', 'external-link') unless node['class']
+    node.set_attribute('target', '_blank')
   end
 
   def process
diff --git a/pages/agent/self_hosted/monitoring_and_observability.md b/pages/agent/self_hosted/monitoring_and_observability.md
@@ -124,6 +124,112 @@ Once enabled, the agent will generate the following metrics (duration measured i
 - `buildkite.jobs.duration.success.median`
 - `buildkite.jobs.duration.success.95percentile`
 
+## Buildkite agent metrics CLI
+
+The [buildkite-agent-metrics](https://github.com/buildkite/buildkite-agent-metrics) tool is a standalone command-line binary that collects agent and job metrics from the [`metrics` endpoint of the Buildkite agent API](/docs/apis/agent-api/metrics) and publishes these metrics to a monitoring and observability backend of your choice. This tool is particularly useful for enabling autoscaling based on queue depth and agent availability.
+
+The tool supports the following backends:
+
+- [AWS CloudWatch](https://aws.amazon.com/cloudwatch/) (default)
+- [StatsD](https://github.com/etsy/statsd) (including Datadog-compatible tagging)
+- [Prometheus](https://prometheus.io)
+- [Google Cloud Monitoring](https://cloud.google.com/monitoring)
+- [New Relic](https://newrelic.com/products/insights)
+- [OpenTelemetry](https://opentelemetry.io)
+
+### Installing
+
+Download the latest binary from [GitHub Releases](https://github.com/buildkite/buildkite-agent-metrics/releases), or run it as a Docker container:
+
+```shell
+docker run --rm public.ecr.aws/buildkite/agent-metrics:latest \
+  -token "$BUILDKITE_AGENT_TOKEN" \
+  -interval 30s \
+  -queue my-queue
+```
+
+You can also install from source using Go:
+
+```shell
+go install github.com/buildkite/buildkite-agent-metrics/v5@latest
+```
+
+### Running
+
+The tool requires an [agent token](/docs/agent/self-hosted/tokens). This can be the same token used when [assigning the self-hosted agent to a queue](/docs/agent/queues#assigning-a-self-hosted-agent-to-a-queue), or another agent token configured within the same [cluster](/docs/pipelines/security/clusters). The simplest deployment runs the tool as a long-running daemon that collects metrics across all queues in an organization:
+
+```shell
+buildkite-agent-metrics -token "$BUILDKITE_AGENT_TOKEN" -interval 30s
+```
+
+To restrict collection to specific queues, use the `-queue` flag (repeatable):
+
+```shell
+buildkite-agent-metrics -token "$BUILDKITE_AGENT_TOKEN" -interval 30s -queue my-queue
+```
+
+To select a backend, use the `-backend` flag:
+
+```shell
+buildkite-agent-metrics -token "$BUILDKITE_AGENT_TOKEN" -interval 30s -backend statsd
+```
+
+### Collected metrics
+
+The tool collects the following metrics per organization and per queue:
+
+<table class="responsive-table">
+  <thead>
+    <tr>
+      <th style="width:35%">Metric</th>
+      <th>Description</th>
+    </tr>
+  </thead>
+  <tbody>
+    <% [
+      {
+        metric: "`ScheduledJobsCount`",
+        description: "Jobs waiting in the queue for an available agent. This should be close to zero if you have sufficient agent capacity."
+      },
+      {
+        metric: "`RunningJobsCount`",
+        description: "Jobs currently being executed by agents."
+      },
+      {
+        metric: "`WaitingJobsCount`",
+        description: "Jobs that can't be scheduled yet due to dependencies or `wait` steps. Useful for autoscaling, as these jobs represent work that will start soon."
+      },
+      {
+        metric: "`UnfinishedJobsCount`",
+        description: "All jobs that have been scheduled but haven't finished. Includes both running and scheduled jobs."
+      },
+      {
+        metric: "`IdleAgentsCount`",
+        description: "Agents connected but not running a job."
+      },
+      {
+        metric: "`BusyAgentsCount`",
+        description: "Agents currently running a job."
+      },
+      {
+        metric: "`TotalAgentsCount`",
+        description: "Total number of connected agents."
+      },
+      {
+        metric: "`BusyAgentPercentage`",
+        description: "Percentage of agents currently busy."
+      }
+    ].each do |row| %>
+      <tr>
+        <td><%= render_markdown(text: row[:metric]) %></td>
+        <td><%= render_markdown(text: row[:description]) %></td>
+      </tr>
+    <% end %>
+  </tbody>
+</table>
+
+For more details on configuration options, AWS Lambda deployment, and backend-specific settings, see the [buildkite-agent-metrics README](https://github.com/buildkite/buildkite-agent-metrics?tab=readme-ov-file#buildkite-agent-metrics).
+
 ## Tracing
 
 For Datadog APM or OpenTelemetry tracing, see [Tracing in the Buildkite agent](/docs/agent/self-hosted/monitoring-and-observability/tracing).
diff --git a/pages/apis/agent_api.md b/pages/apis/agent_api.md
@@ -4,7 +4,7 @@ The agent REST API is used to retrieve agent metrics, register agents, de-regist
 
 The agent REST API's _publicly_ available endpoints include:
 
-- [`/metrics`](/docs/apis/agent-api/metrics): Used to retrieve information about current self-hosted agents associated with a Buildkite cluster. The [Buildkite Agent Metrics](https://github.com/buildkite/buildkite-agent-metrics) CLI tool uses the data returned by the metrics endpoint for agent autoscaling.
+- [`/metrics`](/docs/apis/agent-api/metrics): Used to retrieve information about current self-hosted agents associated with a Buildkite cluster. The [buildkite-agent-metrics](/docs/agent/self-hosted/monitoring-and-observability#buildkite-agent-metrics-cli) CLI tool uses the data returned by the metrics endpoint for agent autoscaling.
 - [`/stacks`](/docs/apis/agent-api/stacks): Used to implement a _stack_ on a self-hosted queue. A stack is a long-running controller process that watches the queue for jobs, and runs Buildkite agents on demand to run these jobs.
 
 All other endpoints in the agent API are intended only for use by the Buildkite agent, therefore stability and backwards compatibility are not guaranteed, and changes won't be announced.
diff --git a/pages/pipelines/best_practices/agent_management.md b/pages/pipelines/best_practices/agent_management.md
@@ -96,7 +96,7 @@ Learn more about using clusters and queues in [Managing clusters](/docs/pipeline
 
 ## Right-sizing of your agent fleet
 
-- Monitor queue times with [cluster insights](/docs/pipelines/security/clusters#cluster-insights) and [Buildkite agent Metrics](https://github.com/buildkite/buildkite-agent-metrics).
+- Monitor queue times with [cluster insights](/docs/pipelines/security/clusters#cluster-insights) and the [buildkite-agent-metrics](/docs/agent/self-hosted/monitoring-and-observability#buildkite-agent-metrics-cli) tool.
 - Use cloud-based autoscaling ([Elastic CI Stack for AWS](https://github.com/buildkite/elastic-ci-stack-for-aws), [Buildkite agent Scaler](https://github.com/buildkite/buildkite-agent-scaler), [Agent Stack for Kubernetes](/docs/agent/self-hosted/agent-stack-k8s)).
 - Maintain dedicated pools for CPU-intensive, GPU-enabled, or OS-specific workloads.
 - Configure [graceful termination](/docs/agent/lifecycle#signal-handling) to allow jobs to complete.
diff --git a/pages/pipelines/best_practices/parallel_builds.md b/pages/pipelines/best_practices/parallel_builds.md
@@ -147,7 +147,7 @@ In addition to the [Elastic CI Stack for AWS](/docs/agent/self-hosted/aws/elasti
 - [Pipelines REST API](/docs/apis/rest-api/pipelines) and [Agents API](/docs/apis/rest-api/agents) you're able to fetch each pipeline's job count, and information about each agent.
 - [Agent priorities](/docs/agent/self-hosted/prioritization) allow you to define which agents are assigned work first, such as high performance ephemeral agents.
 - [Agent queues](/docs/agent/queues) allow you to divide your agent pools into separate groups for scaling and performance purposes.
-- [buildkite-agent-metrics](https://github.com/buildkite/buildkite-agent-metrics) tool allow you to collect your organization's Buildkite metrics and report them to AWS CloudWatch and StatsD.
+- The [buildkite-agent-metrics](/docs/agent/self-hosted/monitoring-and-observability#buildkite-agent-metrics-cli) tool allows you to collect your organization's Buildkite metrics and report them to a range of backends, including AWS CloudWatch, StatsD, Prometheus, and OpenTelemetry.
 
 Using these tools you can automate your build infrastructure, scale your agents based on demand, and massively reduce build times using job parallelism.
 
diff --git a/spec/models/page/renderer_spec.rb b/spec/models/page/renderer_spec.rb
@@ -169,11 +169,11 @@
 
     it "does not affect links with existing css classes" do
       md = <<~MD
-        <p><a href="https://www.github.com/buildkite/docs" class="Docs__example-repo">Docs repo</a></p>
+        <p><a href="https://www.github.com/buildkite/docs" class="Docs__example-repo" target="_blank">Docs repo</a></p>
       MD
 
       html = <<~HTML
-        <p><a href="https://www.github.com/buildkite/docs" class="Docs__example-repo">Docs repo</a></p>
+        <p><a href="https://www.github.com/buildkite/docs" class="Docs__example-repo" target="_blank">Docs repo</a></p>
       HTML
 
       expect(Page::Renderer.render(md).strip).to eql(html.strip)