chore: refactor metrics classes [DO NOT MERGE WITHOUT CHANGING BASE TO MAIN] #613

Open

yanksyoon wants to merge 6 commits into chore/rename-path-constants from chore/refactor-metrics-classes

Member

yanksyoon commented Aug 6, 2025

Applicable spec:

Overview

Make RunnerMetrics into a read-only interface dataclass using Protocol + @property
Refactor complex _pull_runner_metrics function
Merge parsing + instantiating in PulledMetrics class
Rename runner_installed & other data attributes to be more descriptive
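A minimal sketch of the read-only interface idea from the first bullet, assuming Python's `typing.Protocol` with `@property` members and a frozen dataclass as one conforming implementation. `RunnerMetricsLike`, `FrozenPulledMetrics`, and `installation_duration` are hypothetical names for illustration, not the PR's actual classes; only the two attribute names are taken from the diff under review.

```python
from dataclasses import FrozenInstanceError, dataclass
from typing import Optional, Protocol


class RunnerMetricsLike(Protocol):
    """Hypothetical read-only interface: only getters are declared."""

    @property
    def installation_start_timestamp(self) -> float: ...

    @property
    def installation_end_timestamp(self) -> Optional[float]: ...


@dataclass(frozen=True)
class FrozenPulledMetrics:
    """Hypothetical implementation; frozen=True rejects attribute assignment."""

    installation_start_timestamp: float
    installation_end_timestamp: Optional[float] = None


def installation_duration(metrics: RunnerMetricsLike) -> Optional[float]:
    """Return the setup duration, or None when the end timestamp is missing."""
    if metrics.installation_end_timestamp is None:
        return None
    return metrics.installation_end_timestamp - metrics.installation_start_timestamp


metrics = FrozenPulledMetrics(
    installation_start_timestamp=100.0, installation_end_timestamp=130.0
)
try:
    # Assignment on a frozen dataclass raises FrozenInstanceError.
    metrics.installation_end_timestamp = 0.0  # type: ignore[misc]
    mutated = True
except FrozenInstanceError:
    mutated = False
```

Consumers typed against the protocol only see getters, so a type checker flags writes, while the frozen dataclass also rejects them at runtime.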

Rationale

This allows capturing accurate data for Prometheus metrics and disallows mutating data

Juju Events Changes

Module Changes

Library Changes

Checklist

The charm style guide was applied.
The contributing guide was applied.
The changes are compliant with ISD054 - Managing Charm Complexity
The documentation for charmhub is updated.
The PR is tagged with appropriate label (urgent, trivial, complex).
The changelog is updated with changes that affect the users of the charm.
The application version number is updated in github-runner-manager/pyproject.toml.

No user facing changes

yanksyoon added 6 commits

August 6, 2025 08:36


          chore: refactor RunnerMetrics to be an interface(protocol)

dbee048


          chore: refactor PulledMetrics to adhere to protocol

d036728


          chore: remove PulledMetrics conversion

e6f61b4


          test: add factories for dataclasses

52198ba


          test: refactor test to interface + data tests

5b8c68a


          test: remove dataclass conversion

4a25772

yanksyoon requested a review from cbartz as a code owner

August 6, 2025 08:53

yanksyoon added the complex label

yanksyoon requested review from yhaliaw and javierdelapuente as code owners

August 6, 2025 08:53

yanksyoon changed the title ~~chore: refactor metrics classes~~ chore: refactor metrics classes [DO NOT MERGE WITHOUT CHANGING BASE TO MAIN]

Contributor

github-actions bot commented Aug 6, 2025

Test results for commit `4a25772`

Test coverage for 4a25772

Wrote XML report to coverage/coverage.xml

Static code analysis report

Run started:2025-08-06 08:57:59.773666

Test results:
  No issues identified.

Code scanned:
  Total lines of code: 2026
  Total lines skipped (#nosec): 2
  Total potential issues skipped due to specifically being disabled (e.g., #nosec BXXX): 1

Run metrics:
  Total issues (by severity):
  	Undefined: 0
  	Low: 0
  	Medium: 0
  	High: 0
  Total issues (by confidence):
  	Undefined: 0
  	Low: 0
  	Medium: 0
  	High: 0
Files skipped (0):

cbartz reviewed

Collaborator

cbartz left a comment

Thanks! Some questions/remarks about semantic changes, otherwise it looks good.

github-runner-manager/src/github_runner_manager/manager/vm_manager.py

Comment on lines +214 to +218

+                  def installation_start_timestamp(self) -> NonNegativeFloat:  # type: ignore
+                      """UNIX timestamp of in which the VM setup started."""
+                  @property
+                  def installation_end_timestamp(self) -> NonNegativeFloat | None:

Collaborator

cbartz Aug 6, 2025

Before it was

    installation_start_timestamp: NonNegativeFloat | None
    installed_timestamp: NonNegativeFloat

so it seems it should be

Suggested change

-                  def installation_start_timestamp(self) -> NonNegativeFloat:  # type: ignore
-                      """UNIX timestamp of in which the VM setup started."""
-                  @property
-                  def installation_end_timestamp(self) -> NonNegativeFloat | None:
+                  def installation_start_timestamp(self) -> NonNegativeFloat | None:  # type: ignore
+                      """UNIX timestamp of in which the VM setup started."""
+                  @property
+                  def installation_end_timestamp(self) -> NonNegativeFloat:

Member Author

yanksyoon Aug 7, 2025

I think the previous implementation was wrong (diff here).

The installation start timestamp refers to the OpenStack VM created_at timestamp, and its type hint suggests that it cannot be None.
The installation end timestamp, however, is a metric pulled over an SSH connection, so it could be None due to failures.

One question I have is whether the installation start timestamp (the created_at timestamp) from the OpenStack VM might be None, which I don't think it should be.

github-runner-manager/src/github_runner_manager/metrics/runner.py

-                          )
-                          return None
+                  @property
+                  def installation_end_timestamp(self) -> NonNegativeFloat | None:

Collaborator

cbartz Aug 6, 2025

Suggested change

-                  def installation_end_timestamp(self) -> NonNegativeFloat | None:
+                  def installation_end_timestamp(self) -> NonNegativeFloat:

If I am correct with my other comment above

github-runner-manager/src/github_runner_manager/metrics/runner.py

@@ -385,12 +401,12 @@ def _issue_runner_installed(
                   Returns:
                       The type of the issued event.
                   """
+                  installation_end_timestamp = runner_metrics.installation_end_timestamp or 0

Collaborator

cbartz Aug 6, 2025

Suggested change

-                  installation_end_timestamp = runner_metrics.installation_end_timestamp or 0
+                  installation_end_timestamp = runner_metrics.installation_end_timestamp

if I am correct with my other comment above

github-runner-manager/src/github_runner_manager/metrics/runner.py

                       flavor=flavor,
                       # the installation_start_timestamp should be present
-                      duration=runner_metrics.installed_timestamp  # type: ignore
-                      - runner_metrics.installation_start_timestamp,  # type: ignore
+                      duration=max(installation_end_timestamp - runner_metrics.installation_start_timestamp, 0),

Collaborator

cbartz Aug 6, 2025

why the max change here, which looks like a semantic change?
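One way to read the `max` in the diff: combined with the `or 0` fallback introduced earlier in the same function, a missing end timestamp would make the subtraction negative, and the clamp turns that into 0. The sketch below mirrors that expression in a hypothetical helper, `install_duration`, which is not part of the PR.

```python
from typing import Optional


def install_duration(start: float, end: Optional[float]) -> float:
    """Mirror the diff's expression: a missing end becomes 0, the result is clamped at 0."""
    installation_end_timestamp = end or 0
    return max(installation_end_timestamp - start, 0)


normal = install_duration(start=1000.0, end=1060.0)      # both timestamps present
missing_end = install_duration(start=1000.0, end=None)   # end timestamp was not pulled
end_before_start = install_duration(start=1000.0, end=990.0)  # end earlier than start
```

Under this reading the old expression would have produced a negative duration (or a type error on None) in the last two cases, so the clamp does change what gets reported.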

github-runner-manager/src/github_runner_manager/metrics/runner.py

    
@@ -466,15 +482,17 @@ def _create_runner_start(
                   # might be higher than the pre-job timestamp. This is due to the fact that we issue the runner
                   # installed timestamp for Openstack after waiting with delays for the runner to be ready.
                   # We set the idle_duration to 0 in this case.
-                  if pre_job_metrics.timestamp < runner_metrics.installed_timestamp:
+                  if pre_job_metrics.timestamp < (runner_metrics.installation_end_timestamp or 0):

Collaborator

cbartz Aug 6, 2025

Suggested change

-                  if pre_job_metrics.timestamp < (runner_metrics.installation_end_timestamp or 0):
+                  if pre_job_metrics.timestamp < runner_metrics.installation_end_timestamp:

if I am correct with my comment above that the timestamp should always be present

github-runner-manager/src/github_runner_manager/metrics/runner.py

Comment on lines +191 to +196

+                  if timestamp := metrics_contents_map.get(RUNNER_INSTALLED_TS_FILE_PATH, None):
+                      try:
+                          runner_installed_timestamp = float(timestamp)
+                      except ValueError:
+                          logger.warning("Corrupt runner installed timestamp: %s", timestamp)

Collaborator

cbartz Aug 6, 2025

This seems to be a semantic change: before the PR, if the timestamp was not present, we would not issue any metrics at all

        if self.runner_installed is None:
            logger.error(
                "Invalid pulled metrics. No runner_installed information for %s.", instance_id
            )
            return None

Member Author

yanksyoon Aug 7, 2025

Leaving a comment after discussion:

This semantics change is proposed to convey metrics for instances that may have failed to initialize.
Conveying None values for all metrics would signal that some step has gone wrong, and would let us decide whether to put it in the +inf bucket.
This metric would show us how many times we were failing to either instantiate or SSH into the machine to fetch the metrics, helping us decide the next move.
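The tolerant behaviour discussed here can be sketched as a parser that returns None instead of aborting when the timestamp file is missing or corrupt. This is an illustration under assumptions, not the PR's code: `parse_installed_timestamp` and the `INSTALLED_TS_KEY` value are hypothetical, since the value of `RUNNER_INSTALLED_TS_FILE_PATH` is not shown in this thread.

```python
import logging
from typing import Mapping, Optional

logger = logging.getLogger(__name__)

# Hypothetical key; stands in for the RUNNER_INSTALLED_TS_FILE_PATH constant.
INSTALLED_TS_KEY = "runner-installed.timestamp"


def parse_installed_timestamp(contents: Mapping[str, str]) -> Optional[float]:
    """Return the installed timestamp, or None if it is absent or not a number."""
    timestamp = contents.get(INSTALLED_TS_KEY)
    if not timestamp:
        # File was never pulled (for example, the SSH connection failed).
        return None
    try:
        return float(timestamp)
    except ValueError:
        logger.warning("Corrupt runner installed timestamp: %s", timestamp)
        return None
```

Callers then still receive a metrics object for the instance and can decide how to report the None, rather than losing the instance entirely.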

github-runner-manager/tests/unit/metrics/test_runner.py

                   ],
               )
-              def test_issue_events_partial_metrics(

Collaborator

cbartz Aug 6, 2025

This test seems gone, are these cases covered in the new tests?

Member Author

yanksyoon Aug 7, 2025

Yup, they have been better parametrized by the "state" input, rather than testing for the internal function calls.
The parametrized tests are:

no instances
single instance, no metrics
single instance, partial metrics (POST_JOB_METRICS_FILE_PATH): post job only
single instance, partial metrics (PRE_JOB_METRICS_FILE_PATH): pre job only
single instance, partial metrics (RUNNER_INSTALLED_TS_FILE_PATH): runner installed timestamp only
single instance, all metrics
multi instance, all metrics

These cases cover the existing tests and go beyond them by extending the combinations tested.
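As an illustration of parametrizing by input state, the seven cases listed above can be written as a table of per-instance file maps. Everything below is a hypothetical sketch: the file keys, the `Case` tuple, and the `count_complete` function under test are stand-ins, not the PR's tests. With pytest, the same table could be passed to `pytest.mark.parametrize`.

```python
from typing import Dict, List, NamedTuple

# Hypothetical stand-ins for the three *_FILE_PATH constants named above.
PRE_JOB, POST_JOB, INSTALLED_TS = "pre-job", "post-job", "installed-ts"
ALL = {PRE_JOB: "{}", POST_JOB: "{}", INSTALLED_TS: "1.0"}


class Case(NamedTuple):
    name: str
    instances: List[Dict[str, str]]  # one file-name -> contents map per instance
    expected_complete: int  # instances for which every file was pulled


CASES = [
    Case("no instances", [], 0),
    Case("single instance, no metrics", [{}], 0),
    Case("single instance, post job only", [{POST_JOB: "{}"}], 0),
    Case("single instance, pre job only", [{PRE_JOB: "{}"}], 0),
    Case("single instance, installed timestamp only", [{INSTALLED_TS: "1.0"}], 0),
    Case("single instance, all metrics", [dict(ALL)], 1),
    Case("multi instance, all metrics", [dict(ALL), dict(ALL)], 2),
]


def count_complete(instances: List[Dict[str, str]]) -> int:
    """Hypothetical function under test: count instances with all three files."""
    return sum(1 for files in instances if set(ALL) <= set(files))


# Run every case against the function and record the observed result by name.
results = {case.name: count_complete(case.instances) for case in CASES}
```

Each case describes only the observable input state, so the test does not depend on which internal helpers are called.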
