Skip to content

chore: refactor metrics classes [DO NOT MERGE WITHOUT CHANGING BASE TO MAIN] #613

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 6 commits into
base: chore/rename-path-constants
Choose a base branch
from

Conversation

yanksyoon
Copy link
Member

Applicable spec:

Overview

  • Make RunnerMetrics into an interface READONLY dataclass using Protocol + @Property
  • Refactor complex _pull_runner_metrics function
  • Merge parsing + instantiating in PulledMetrics class
  • Rename runner_installed & other data attributes to be more descriptive

Rationale

  • This allows capturing accurate data for Prometheus metrics and disallows mutating data

Juju Events Changes

Module Changes

Library Changes

Checklist

  • The charm style guide was applied.
  • The contributing guide was applied.
  • The changes are compliant with ISD054 - Managing Charm Complexity
  • The documentation for charmhub is updated.
  • The PR is tagged with appropriate label (urgent, trivial, complex).
  • The changelog is updated with changes that affects the users of the charm.
  • The application version number is updated in github-runner-manager/pyproject.toml.

No user facing changes

@yanksyoon yanksyoon requested a review from cbartz as a code owner August 6, 2025 08:53
@yanksyoon yanksyoon changed the title chore: refactor metrics classes chore: refactor metrics classes [DO NOT MERGE WITHOUT CHANGING BASE TO MAIN] Aug 6, 2025
Copy link
Contributor

github-actions bot commented Aug 6, 2025

Test results for commit 4a25772

Test coverage for 4a25772

Wrote XML report to coverage/coverage.xml

Static code analysis report

Run started:2025-08-06 08:57:59.773666

Test results:
  No issues identified.

Code scanned:
  Total lines of code: 2026
  Total lines skipped (#nosec): 2
  Total potential issues skipped due to specifically being disabled (e.g., #nosec BXXX): 1

Run metrics:
  Total issues (by severity):
  	Undefined: 0
  	Low: 0
  	Medium: 0
  	High: 0
  Total issues (by confidence):
  	Undefined: 0
  	Low: 0
  	Medium: 0
  	High: 0
Files skipped (0):

Copy link
Collaborator

@cbartz cbartz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! Some questions/remarks about semantic changes, otherwise it looks good.

Comment on lines +214 to +218
def installation_start_timestamp(self) -> NonNegativeFloat: # type: ignore
"""UNIX timestamp of in which the VM setup started."""

@property
def installation_end_timestamp(self) -> NonNegativeFloat | None:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Before it was

    installation_start_timestamp: NonNegativeFloat | None
    installed_timestamp: NonNegativeFloat

so it seems it should be

Suggested change
def installation_start_timestamp(self) -> NonNegativeFloat: # type: ignore
"""UNIX timestamp of in which the VM setup started."""
@property
def installation_end_timestamp(self) -> NonNegativeFloat | None:
def installation_start_timestamp(self) -> NonNegativeFloat | None: # type: ignore
"""UNIX timestamp of in which the VM setup started."""
@property
def installation_end_timestamp(self) -> NonNegativeFloat:

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the previous implementation as wrong (diff here).

Installation start timestamp is referring to the OpenStack VM created_at timestamp which the type hint suggests that it cannot be None.
Installation end timestamp, however, is a metric that is pulled over SSH connection which could be None due to some failures.

One question I have is whether Installation start timestamp (created_at timestamp) from OpenStack VM might be None which I don't think it shoud?

)
return None
@property
def installation_end_timestamp(self) -> NonNegativeFloat | None:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
def installation_end_timestamp(self) -> NonNegativeFloat | None:
def installation_end_timestamp(self) -> NonNegativeFloat:

If I am correct with my other comment above

@@ -385,12 +401,12 @@ def _issue_runner_installed(
Returns:
The type of the issued event.
"""
installation_end_timestamp = runner_metrics.installation_end_timestamp or 0
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
installation_end_timestamp = runner_metrics.installation_end_timestamp or 0
installation_end_timestamp = runner_metrics.installation_end_timestamp

if I am correct with my other comment above

flavor=flavor,
# the installation_start_timestamp should be present
duration=runner_metrics.installed_timestamp # type: ignore
- runner_metrics.installation_start_timestamp, # type: ignore
duration=max(installation_end_timestamp - runner_metrics.installation_start_timestamp, 0),
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why the max change here, which looks like a semantic change?

@@ -466,15 +482,17 @@ def _create_runner_start(
# might be higher than the pre-job timestamp. This is due to the fact that we issue the runner
# installed timestamp for Openstack after waiting with delays for the runner to be ready.
# We set the idle_duration to 0 in this case.
if pre_job_metrics.timestamp < runner_metrics.installed_timestamp:
if pre_job_metrics.timestamp < (runner_metrics.installation_end_timestamp or 0):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
if pre_job_metrics.timestamp < (runner_metrics.installation_end_timestamp or 0):
if pre_job_metrics.timestamp < runner_metrics.installation_end_timestamp:

if I am correct with my comment above that the timestamp should always be present

Comment on lines +191 to +196
if timestamp := metrics_contents_map.get(RUNNER_INSTALLED_TS_FILE_PATH, None):
try:
runner_installed_timestamp = float(timestamp)
except ValueError:
logger.warning("Corrupt runner installed timestamp: %s", timestamp)

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems to be a semantic change: Before the PR , if the timestamp was not present, we would not issue any metrics at all

        if self.runner_installed is None:
            logger.error(
                "Invalid pulled metrics. No runner_installed information for %s.", instance_id
            )
            return None

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Leaving a comment after discussion:

This semantics change is proposed to convey metrics for instances that may have failed to initialize.
Conveying None values for all metrics would suggest that some step has gone wrong and let us decide to perhaps put it in +inf bucket.
This metric would show us how many times we were failing to either instantiate or SSH into the machine to fetch the metrics, helping us decide the next move.

],
)
def test_issue_events_partial_metrics(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test seems gone, are these cases covered in the new tests?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yup, they have been better parametrized by the "state" input, rather than testing for the internal function calls.
The parametrized tests are:

  • no instances
  • single instance, no metrics
  • single instance, partial metrics(POST_JOB_METRICS_FILE_PATH): post job only
  • single instance, partial metrics(PRE_JOB_METRICS_FILE_PATH): pre job only
  • single instance, partial metrics(RUNNER_INSTALLED_TS_FILE_PATH): runner installed timestamp only
  • single instance, all metrics
  • multi instance, all metrics

These cases more than cover the existing tests, but go beyond and extend the combinatorics.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants