Commit 426b54e

yanksyoon and cbartz authored
feat: prometheus metrics (#582)
* add prometheus client to dependency & order deps alphabetically
* add metrics route
* refactor internal function
* add reconcile metrics
* implement reconcile metrics
* initial implementation
* test: add integration test
* test: test flavor label
* fix: enable allure test to collect reports before running tests
* remove dep
* debug: try two tests
* test: mark openstack test
* test: move integration test pre run script
* fix: juju microk8s setup script
* test: dont switch back to original controller
* switch back to original controller git push
* unset juju model env var
* raw json output
* debug
* fix: label name
* test: fix offer name relation
* test: do not use juju controller from env var
* test: do not raise on error while waiting for idle
* test: model integrate & consume
* test: keep microk8s model
* test: deploy cos agent
* test: add grafana agent series
* test: fix typo
* test: assign grafana agent channel
* test: juju ops lib is ...
* test: use series
* test: swap out ops_test with jubilant
* test: generator type hint fix
* test: add prometheus datasource
* test: separate micrlk8s juju from lxd juju
* test: add controller prefix to model
* test: store model & controller name
* test: remove duplicate model naming w/ controller name prefix
* test: add fixture dependency
* test: switch controller fixture for juju offers
* test: switch controller select model
* test: modify juju controller & model env vars
* test: try direct cli_bin call
* test: try no env patch
* test: assert subprocess result
* test: set additional juju envs
* test: use controller & model params
* test: try consume model w/ model params
* test: use direct consume command
* test: setup microstack metallb for ingress
* test: check for metrics availability
* test: add schema to request
* test: fix syntax err
* test: fix prometheus ip
* test: wait for openstack metrics longer
* test: test only metrics that have been generated
* test: restore tests
* chore: move integration test script
* ci: separate tests depending on k8s
* test: fix lints
* fix: lint
* fix: lint
* ci: remove unused setup script
* docs: add changelog & increment version
* ci: separate test names
* chore: refactor the metrics to be fetched without side effects
* chore: increment metrics by 1
* chore: move code
* fix: lint fixe
* fix integration test wf file

---------

Co-authored-by: yanksyoon <[email protected]>
Co-authored-by: Christopher Bartz <[email protected]>
1 parent 1aa16ad commit 426b54e

16 files changed

+341
-621
lines changed


.github/workflows/e2e_test.yaml

Lines changed: 0 additions & 1 deletion
@@ -16,7 +16,6 @@ jobs:
     secrets: inherit
     with:
       juju-channel: 3.6/stable
-      pre-run-script: scripts/setup-integration-tests.sh
       provider: lxd
       test-tox-env: integration-juju3.6
       modules: '["test_e2e"]'

.github/workflows/integration_test.yaml

Lines changed: 9 additions & 10 deletions
@@ -12,35 +12,34 @@ concurrency:
   cancel-in-progress: true

 jobs:
-  openstack-interface-tests-private-endpoint:
-    name: openstack interface test using private-endpoint
+  openstack-integration-tests-private-endpoint:
+    name: Integration test using private-endpoint
     uses: canonical/operator-workflows/.github/workflows/integration_test.yaml@main
     secrets: inherit
     with:
       juju-channel: 3.6/stable
-      pre-run-script: scripts/setup-integration-tests.sh
       provider: lxd
       test-tox-env: integration-juju3.6
-      modules: '["test_runner_manager_openstack"]'
-      extra-arguments: '--log-format="%(asctime)s %(levelname)s %(message)s"'
+      modules: '["test_charm_metrics_failure", "test_charm_metrics_success", "test_charm_fork_repo", "test_charm_fork_path_change", "test_charm_no_runner", "test_charm_runner", "test_debug_ssh", "test_charm_upgrade", "test_reactive", "test_jobmanager_prespawned", "test_jobmanager_reactive"]'
+      extra-arguments: '-m openstack --log-format="%(asctime)s %(levelname)s %(message)s"'
       self-hosted-runner: true
       self-hosted-runner-label: stg-private-endpoint
-  openstack-integration-tests-private-endpoint:
-    name: Integration test using private-endpoint
+  openstack-integration-tests-cross-controller-private-endpoint:
+    name: Cross controller integration test using private-endpoint
     uses: canonical/operator-workflows/.github/workflows/integration_test.yaml@main
     secrets: inherit
     with:
       juju-channel: 3.6/stable
-      pre-run-script: scripts/setup-integration-tests.sh
+      pre-run-script: tests/integration/setup-integration-tests.sh
       provider: lxd
       test-tox-env: integration-juju3.6
-      modules: '["test_charm_metrics_failure", "test_charm_metrics_success", "test_charm_fork_repo", "test_charm_fork_path_change", "test_charm_no_runner", "test_charm_runner", "test_debug_ssh", "test_charm_upgrade", "test_reactive", "test_jobmanager_prespawned", "test_jobmanager_reactive"]'
+      modules: '["test_prometheus_metrics"]'
       extra-arguments: '-m openstack --log-format="%(asctime)s %(levelname)s %(message)s"'
       self-hosted-runner: true
       self-hosted-runner-label: stg-private-endpoint
   allure-report:
     if: ${{ (success() || failure()) && github.event_name == 'schedule' }}
     needs:
-      - openstack-interface-tests-private-endpoint
       - openstack-integration-tests-private-endpoint
+      - openstack-integration-tests-cross-controller-private-endpoint
     uses: canonical/operator-workflows/.github/workflows/allure_report.yaml@main

docs/changelog.md

Lines changed: 1 addition & 0 deletions
@@ -4,6 +4,7 @@ This changelog documents user-relevant changes to the GitHub runner charm.

 ### 2025-06-30
 - New configuration options aproxy-exclude-addresses and aproxy-redirect-ports for allowing aproxy to redirect arbitrary TCP traffic
+- Added prometheus metrics to the GitHub runner manager application.

 ## 2025-06-26

github-runner-manager/pyproject.toml

Lines changed: 1 addition & 1 deletion
@@ -3,7 +3,7 @@

 [project]
 name = "github-runner-manager"
-version = "0.5.0"
+version = "0.6.0"
 authors = [
     { name = "Canonical IS DevOps", email = "[email protected]" },
 ]

github-runner-manager/requirements.txt

Lines changed: 3 additions & 2 deletions
@@ -1,9 +1,10 @@
+click==8.2.1
 fabric >=3,<4
+flask==3.1.1
 ghapi
 jinja2
 kombu==5.5.3
 openstacksdk==4.5.0
+prometheus-client==0.22.1
 pydantic < 2
 pymongo==4.13.0
-click==8.2.1
-flask==3.1.1

github-runner-manager/src/github_runner_manager/http_server.py

Lines changed: 14 additions & 3 deletions
@@ -12,6 +12,7 @@
 from threading import Lock

 from flask import Flask, request
+from prometheus_client import generate_latest

 from github_runner_manager.configuration import ApplicationConfiguration
 from github_runner_manager.errors import CloudError, LockError
@@ -44,7 +45,7 @@ def check_runner() -> tuple[str, int]:
     Returns:
         Information on the runners in JSON format.
     """
-    app_config = app.config[APP_CONFIG_NAME]
+    app_config: ApplicationConfiguration = app.config[APP_CONFIG_NAME]
     app.logger.info("Checking runners...")
     runner_scaler = get_runner_scaler(app_config)
     try:
@@ -72,7 +73,7 @@ def flush_runner() -> tuple[str, int]:
     if flush_busy_str in ("True", "true"):
         flush_busy = True

-    lock = get_lock()
+    lock = _get_lock()
     with lock:
         app.logger.info("Flushing runners...")
         runner_scaler = get_runner_scaler(app_config)
@@ -87,7 +88,7 @@ def flush_runner() -> tuple[str, int]:
     return ("", 204)


-def get_lock() -> Lock:
+def _get_lock() -> Lock:
     """Get the lock representing modification access to the set of runners.

     Raises:
@@ -103,6 +104,16 @@ def get_lock() -> Lock:
     raise LockError("Lock not configured")


+@app.route("/metrics", methods=["GET"])
+def metrics() -> bytes:
+    """Return prometheus metrics from default registry.
+
+    Returns:
+        The latest metrics from the default Prometheus registry.
+    """
+    return generate_latest()
+
+
 @dataclass
 class FlaskArgs:
     """Arguments for Flask HTTP server.
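The new `/metrics` route returns the raw bytes from `generate_latest()`, so Flask picks the response content type. As a minimal sketch (not the commit's code), the same route could return an explicit `Response` carrying `CONTENT_TYPE_LATEST`. The names `demo_app`, `demo_registry`, `demo_gauge` and `demo_metrics` are hypothetical stand-ins, and a private registry is used here so the example does not touch the default one.

```python
from flask import Flask, Response
from prometheus_client import (
    CONTENT_TYPE_LATEST,
    CollectorRegistry,
    Gauge,
    generate_latest,
)

# Hypothetical stand-ins for illustration only; the commit uses the
# module-level `app` and the default Prometheus registry instead.
demo_registry = CollectorRegistry()
demo_gauge = Gauge(
    "demo_idle_runners_count",
    "Number of idle runners (demo)",
    labelnames=["flavor"],
    registry=demo_registry,
)
demo_app = Flask(__name__)


@demo_app.route("/metrics", methods=["GET"])
def demo_metrics() -> Response:
    """Serve the demo registry in the Prometheus text exposition format."""
    return Response(generate_latest(demo_registry), content_type=CONTENT_TYPE_LATEST)


# Record one value, then scrape the endpoint through Flask's test client.
demo_gauge.labels("small").set(3)
with demo_app.test_client() as client:
    demo_response = client.get("/metrics")
    demo_body = demo_response.get_data(as_text=True)
```

Whether the explicit content type is needed depends on the scraper; this sketch only shows one way to set it.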

github-runner-manager/src/github_runner_manager/manager/runner_manager.py

Lines changed: 2 additions & 0 deletions
@@ -23,6 +23,7 @@
 from github_runner_manager.metrics import events as metric_events
 from github_runner_manager.metrics import github as github_metrics
 from github_runner_manager.metrics import runner as runner_metrics
+from github_runner_manager.metrics.reconcile import CLEANED_RUNNERS_TOTAL
 from github_runner_manager.metrics.runner import RunnerMetrics
 from github_runner_manager.openstack_cloud.constants import CREATE_SERVER_TIMEOUT
 from github_runner_manager.platform.platform_provider import (
@@ -348,6 +349,7 @@ def _delete_cloud_runners(

             logging.info("Delete runner in cloud: %s", cloud_runner.instance_id)
             runner_metric = self._cloud.delete_runner(cloud_runner.instance_id)
+            CLEANED_RUNNERS_TOTAL.labels(self.manager_name).inc(1)
             if not runner_metric:
                 logger.error("No metrics returned after deleting %s", cloud_runner.instance_id)
             else:
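The added line increments a per-flavor child of `CLEANED_RUNNERS_TOTAL` once for each deleted runner. The sketch below reproduces that pattern against a private registry and reads the value back with `CollectorRegistry.get_sample_value`, which is one way a unit test could check the increment. The helper `delete_runners_demo` and the flavor names are hypothetical and are not part of the commit.

```python
from prometheus_client import CollectorRegistry, Gauge

# Private registry so the sketch does not collide with the default one.
cleanup_registry = CollectorRegistry()
cleaned_runners = Gauge(
    name="cleaned_runners_total",
    documentation="Total number of runners cleaned up",
    labelnames=["flavor"],
    registry=cleanup_registry,
)


def delete_runners_demo(manager_name: str, instance_ids: list[str]) -> None:
    """Hypothetical stand-in for the deletion loop: one increment per runner."""
    for _instance_id in instance_ids:
        cleaned_runners.labels(manager_name).inc(1)


delete_runners_demo("small", ["runner-a", "runner-b"])
delete_runners_demo("large", ["runner-c"])

# Each label value is tracked as an independent time series.
small_cleaned = cleanup_registry.get_sample_value(
    "cleaned_runners_total", {"flavor": "small"}
)
large_cleaned = cleanup_registry.get_sample_value(
    "cleaned_runners_total", {"flavor": "large"}
)
# A label value that was never used has no sample yet.
unknown_cleaned = cleanup_registry.get_sample_value(
    "cleaned_runners_total", {"flavor": "unknown"}
)
```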

github-runner-manager/src/github_runner_manager/manager/runner_scaler.py

Lines changed: 18 additions & 6 deletions
@@ -8,10 +8,7 @@
 from dataclasses import dataclass

 import github_runner_manager.reactive.runner_manager as reactive_runner_manager
-from github_runner_manager.configuration import (
-    ApplicationConfiguration,
-    UserInfo,
-)
+from github_runner_manager.configuration import ApplicationConfiguration, UserInfo
 from github_runner_manager.constants import GITHUB_SELF_HOSTED_ARCH_LABELS
 from github_runner_manager.errors import (
     CloudError,
@@ -28,6 +25,12 @@
     RunnerMetadata,
 )
 from github_runner_manager.metrics import events as metric_events
+from github_runner_manager.metrics.reconcile import (
+    BUSY_RUNNERS_COUNT,
+    EXPECTED_RUNNERS_COUNT,
+    IDLE_RUNNERS_COUNT,
+    RECONCILE_DURATION_SECONDS,
+)
 from github_runner_manager.openstack_cloud.models import OpenStackServerConfig
 from github_runner_manager.openstack_cloud.openstack_runner_manager import (
     OpenStackRunnerManager,
@@ -216,6 +219,8 @@ def __init__( # pylint: disable=too-many-arguments, too-many-positional-argumen
         self._platform_name = platform_name
         self._python_path = python_path

+        EXPECTED_RUNNERS_COUNT.labels(self._manager.manager_name).set(self._base_quantity)
+
     def get_runner_info(self) -> RunnerInfo:
         """Get information on the runners.

@@ -321,7 +326,10 @@ def reconcile(self) -> int:
                 flavor=self._manager.manager_name,
                 expected_runner_quantity=expected_runner_quantity,
             )
-            _issue_reconciliation_metric(reconcile_metric_data)
+            RECONCILE_DURATION_SECONDS.labels(self._manager.manager_name).observe(
+                end_timestamp - start_timestamp
+            )
+            _issue_reconciliation_metric(reconcile_metric_data, self._manager.manager_name)

         logger.info("Finished reconciliation.")

@@ -403,12 +411,13 @@ def _log_runners(runner_list: tuple[RunnerInstance]) -> None:


 def _issue_reconciliation_metric(
-    reconcile_metric_data: _ReconcileMetricData,
+    reconcile_metric_data: _ReconcileMetricData, manager_name: str
 ) -> None:
     """Issue the reconciliation metric.

     Args:
         reconcile_metric_data: The data used to issue the reconciliation metric.
+        manager_name: The name of the manager.
     """
     idle_runners = {
         runner.name
@@ -431,6 +440,9 @@ def _issue_reconciliation_metric(
     logger.info("Current available runners (idle + healthy offline): %s", available_runners)
     logger.info("Current active runners: %s", active_runners)

+    BUSY_RUNNERS_COUNT.labels(manager_name).set(len(active_runners))
+    IDLE_RUNNERS_COUNT.labels(manager_name).set(len(idle_runners))
+
     try:

         metric_events.issue_event(
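The hunks above record three kinds of values per manager name: a histogram observation for the reconcile duration and two gauges for the busy and idle runner counts. A self-contained sketch of that flow follows. It assumes wall-clock timestamps from `time.time()` (the diff does not show how `start_timestamp` and `end_timestamp` are obtained), and the function `reconcile_once_demo` and its input dictionary are hypothetical.

```python
import time

from prometheus_client import CollectorRegistry, Gauge, Histogram

# Private registry so the sketch does not collide with the default one.
reconcile_registry = CollectorRegistry()
reconcile_duration = Histogram(
    name="reconcile_duration_seconds",
    documentation="Duration of reconciliation (seconds)",
    labelnames=["flavor"],
    registry=reconcile_registry,
)
busy_runners_count = Gauge(
    name="busy_runners_count",
    documentation="Number of busy runners",
    labelnames=["flavor"],
    registry=reconcile_registry,
)
idle_runners_count = Gauge(
    name="idle_runners_count",
    documentation="Number of idle runners",
    labelnames=["flavor"],
    registry=reconcile_registry,
)


def reconcile_once_demo(manager_name: str, runner_busy: dict[str, bool]) -> None:
    """Hypothetical reconcile pass: time the work, then publish the counts."""
    start_timestamp = time.time()  # assumption: wall-clock timestamps
    active_runners = {name for name, busy in runner_busy.items() if busy}
    idle_runners = set(runner_busy) - active_runners
    end_timestamp = time.time()

    # A histogram accumulates observations; gauges hold the latest snapshot.
    reconcile_duration.labels(manager_name).observe(end_timestamp - start_timestamp)
    busy_runners_count.labels(manager_name).set(len(active_runners))
    idle_runners_count.labels(manager_name).set(len(idle_runners))


reconcile_once_demo("small", {"r1": True, "r2": False, "r3": False})

observed_reconciles = reconcile_registry.get_sample_value(
    "reconcile_duration_seconds_count", {"flavor": "small"}
)
busy_now = reconcile_registry.get_sample_value("busy_runners_count", {"flavor": "small"})
idle_now = reconcile_registry.get_sample_value("idle_runners_count", {"flavor": "small"})
```

Using `set()` rather than `inc()` for the counts means each reconcile overwrites the previous snapshot, which matches how the diff publishes the busy and idle totals.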
Lines changed: 34 additions & 0 deletions
@@ -0,0 +1,34 @@
+# Copyright 2025 Canonical Ltd.
+# See LICENSE file for licensing details.
+
+"""Module for collecting metrics related to the reconciliation process."""
+
+from prometheus_client import Gauge, Histogram
+
+LABEL_FLAVOR = "flavor"
+
+RECONCILE_DURATION_SECONDS = Histogram(
+    name="reconcile_duration_seconds",
+    documentation="Duration of reconciliation (seconds)",
+    labelnames=[LABEL_FLAVOR],
+)
+EXPECTED_RUNNERS_COUNT = Gauge(
+    name="expected_runners_count",
+    documentation="Expected number of runners",
+    labelnames=[LABEL_FLAVOR],
+)
+BUSY_RUNNERS_COUNT = Gauge(
+    name="busy_runners_count",
+    documentation="Number of busy runners",
+    labelnames=[LABEL_FLAVOR],
+)
+IDLE_RUNNERS_COUNT = Gauge(
+    name="idle_runners_count",
+    documentation="Number of idle runners",
+    labelnames=[LABEL_FLAVOR],
+)
+CLEANED_RUNNERS_TOTAL = Gauge(
+    name="cleaned_runners_total",
+    documentation="Total number of runners cleaned up",
+    labelnames=[LABEL_FLAVOR],
+)
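`CLEANED_RUNNERS_TOTAL` is declared as a `Gauge`, although it is only ever incremented and carries the `_total` suffix that Prometheus naming conventions associate with counters. For comparison only, and not as a description of what the commit does, here is how the same series might look as a `Counter`. The variable names are hypothetical. Note that `prometheus_client` appends `_total` to a counter's sample name itself, so the sketch passes the base name `cleaned_runners`.

```python
from prometheus_client import CollectorRegistry, Counter, generate_latest

# Private registry so the sketch does not collide with the default one.
counter_registry = CollectorRegistry()
cleaned_runners_counter = Counter(
    name="cleaned_runners",  # exposed as cleaned_runners_total
    documentation="Total number of runners cleaned up",
    labelnames=["flavor"],
    registry=counter_registry,
)

cleaned_runners_counter.labels("small").inc()
cleaned_runners_counter.labels("small").inc()

# Counters reject negative increments, unlike gauges.
try:
    cleaned_runners_counter.labels("small").inc(-1)
    negative_rejected = False
except ValueError:
    negative_rejected = True

counter_value = counter_registry.get_sample_value(
    "cleaned_runners_total", {"flavor": "small"}
)
counter_exposition = generate_latest(counter_registry).decode("utf-8")
```

A counter lets queries such as `rate()` handle process restarts as resets, whereas a gauge works as written in the commit but is interpreted by Prometheus as a value that may go up or down.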

scripts/setup-integration-tests.sh

Lines changed: 0 additions & 6 deletions
This file was deleted.
