
fix: add missing Spark import/export support for metrics - part 2 #246

Merged
richm merged 1 commit into linux-system-roles:main from
richm:fix-spark-2
Jul 24, 2025

@richm
Collaborator

@richm richm commented Jul 22, 2025

add openmetrics to metrics domains

Signed-off-by: Rich Megginson <rmeggins@redhat.com>

Summary by Sourcery

Bug Fixes:

  • Include 'openmetrics' in __metrics_domains when metrics_from_spark is enabled

@richm richm requested a review from natoscott as a code owner July 22, 2025 01:49
@sourcery-ai

sourcery-ai bot commented Jul 22, 2025

Reviewer's Guide

Extends the Ansible metrics domain pipeline to support OpenMetrics by conditionally appending 'openmetrics' to the __metrics_domains list based on the new metrics_from_spark flag.

Class diagram for __metrics_domains update logic

classDiagram
    class MetricsDomainPipeline {
        +list __metrics_domains
        +bool metrics_from_elasticsearch
        +bool metrics_from_spark
        +bool metrics_from_mssql
        +update_domains()
    }
    MetricsDomainPipeline : +update_domains()
    MetricsDomainPipeline : __metrics_domains += 'elasticsearch' if metrics_from_elasticsearch
    MetricsDomainPipeline : __metrics_domains += 'openmetrics' if metrics_from_spark
    MetricsDomainPipeline : __metrics_domains += 'mssql' if metrics_from_mssql

Flow diagram for metrics domain extension with OpenMetrics

flowchart TD
    A[Start] --> B{metrics_from_elasticsearch?}
    B -- Yes --> C[Add 'elasticsearch' to __metrics_domains]
    B -- No --> D
    C --> D{metrics_from_spark?}
    D -- Yes --> E[Add 'openmetrics' to __metrics_domains]
    D -- No --> F
    E --> F{metrics_from_mssql?}
    F -- Yes --> G[Add 'mssql' to __metrics_domains]
    F -- No --> H[End]
    G --> H

File-Level Changes

| Change | Details | Files |
| --- | --- | --- |
| Add OpenMetrics to metrics domain list | Insert task that appends 'openmetrics' to __metrics_domains; conditionally run this step when metrics_from_spark is true | tasks/main.yml |
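
For illustration, the task described above might look roughly like the sketch below. This is an assumption based on the summary, not the actual diff from tasks/main.yml: the task name is made up, and it assumes `__metrics_domains` is already defined as a list earlier in the play.

```yaml
# Sketch only: append 'openmetrics' to the domain list when Spark
# metrics are requested. Assumes __metrics_domains is already a list.
- name: Add OpenMetrics to metrics domain list  # hypothetical task name
  set_fact:
    __metrics_domains: "{{ __metrics_domains + ['openmetrics'] }}"
  when: metrics_from_spark | bool
```

The `| bool` filter guards against the flag being passed as a string such as "true" on the command line.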


@sourcery-ai sourcery-ai bot left a comment

Hey @richm - I've reviewed your changes and they look great!


@richm
Collaborator Author

richm commented Jul 22, 2025

[citest]

@richm
Collaborator Author

richm commented Jul 22, 2025

@natoscott still not working :-(

TASK [Check if OpenMetrics PMDA has Spark metrics registered] ******************
task path: /tmp/collections-LAW/ansible_collections/fedora/linux_system_roles/tests/metrics/check_from_spark.yml:3
Monday 21 July 2025  22:01:04 -0400 (0:00:00.033)       0:00:30.342 *********** 
FAILED - RETRYING: [managed-node2]: Check if OpenMetrics PMDA has Spark metrics registered (10 retries left).
FAILED - RETRYING: [managed-node2]: Check if OpenMetrics PMDA has Spark metrics registered (9 retries left).
FAILED - RETRYING: [managed-node2]: Check if OpenMetrics PMDA has Spark metrics registered (8 retries left).
FAILED - RETRYING: [managed-node2]: Check if OpenMetrics PMDA has Spark metrics registered (7 retries left).
FAILED - RETRYING: [managed-node2]: Check if OpenMetrics PMDA has Spark metrics registered (6 retries left).
FAILED - RETRYING: [managed-node2]: Check if OpenMetrics PMDA has Spark metrics registered (5 retries left).
FAILED - RETRYING: [managed-node2]: Check if OpenMetrics PMDA has Spark metrics registered (4 retries left).
FAILED - RETRYING: [managed-node2]: Check if OpenMetrics PMDA has Spark metrics registered (3 retries left).
FAILED - RETRYING: [managed-node2]: Check if OpenMetrics PMDA has Spark metrics registered (2 retries left).
FAILED - RETRYING: [managed-node2]: Check if OpenMetrics PMDA has Spark metrics registered (1 retries left).
fatal: [managed-node2]: FAILED! => {
    "attempts": 10,
    "changed": false,
    "cmd": [
        "pmprobe",
        "-I",
        "openmetrics.control.status"
    ],
    "delta": "0:00:00.012465",
    "end": "2025-07-21 22:01:18.079597",
    "rc": 0,
    "start": "2025-07-21 22:01:18.067132"
}

STDOUT:

openmetrics.control.status 4 "control" "grafana" "kepler" "vllm"

Running it locally - I never even get this far - I still get an error, no matter how many retries I have.

There is something really weird going on.
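
The log above suggests a retried `pmprobe` check along the following lines. This is only a sketch reconstructed from the task name, command, and attempt count in the output. The register variable, the delay, and the `spark` instance name in the `until` condition are assumptions, not taken from check_from_spark.yml.

```yaml
# Sketch only: retry until the OpenMetrics PMDA lists a Spark endpoint.
- name: Check if OpenMetrics PMDA has Spark metrics registered
  command: pmprobe -I openmetrics.control.status
  register: __spark_probe            # hypothetical variable name
  until: "'spark' in __spark_probe.stdout"  # instance name is an assumption
  retries: 10
  delay: 1                           # assumed value
  changed_when: false
```

With a condition like this, a zero exit code is not enough: the task keeps failing while the instance list only contains the default endpoints, which matches the output shown.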

@natoscott
Collaborator

> @natoscott still not working :-(

OK - this part:

STDOUT:

openmetrics.control.status 4 "control" "grafana" "kepler" "vllm"

says that pmdaopenmetrics is installed but just with the default endpoints. Spark port/url should be added to this list via these tasks in metrics/roles/spark/tasks/main.yml -

- name: Ensure PCP OpenMetrics agent is configured for Spark
  [...]
- name: Ensure PCP OpenMetrics agent is enabled with Spark endpoint
  [...]

I wonder if there's an ordering problem? If pmdaopenmetrics was being started now before the /etc config file and /var symlink are established, this could behave in the way we're seeing.

@richm
Collaborator Author

richm commented Jul 22, 2025

> I wonder if there's an ordering problem?

I think the problem is ordering and/or timing, which would explain why we see different behavior in each test environment.

> If pmdaopenmetrics was being started now before the /etc config file and /var symlink are established, this could behave in the way we're seeing.

How can we verify that?

@natoscott
Collaborator

> > I wonder if there's an ordering problem?
>
> I think the problem is ordering and/or timing, which would explain why we see different behavior in each test environment.
>
> > If pmdaopenmetrics was being started now before the /etc config file and /var symlink are established, this could behave in the way we're seeing.
>
> How can we verify that?

We need to make sure the two tasks above (Spark openmetrics config and symlink) happen before the final two steps in roles/pcp/tasks/pmcd.yml, which cause pmcd to be started/restarted. The ducks that need to be lined up before the pmcd service re/start are these Spark tasks and the first two/three tasks in the pmcd.yml file (especially "- name: Ensure optional metric collection agents are enabled").

@richm
Collaborator Author

richm commented Jul 23, 2025

@natoscott The pmcd service was getting a permission denied error on spark.url - I changed it to 0644 to match the permissions on the other files in /etc/pcp/openmetrics/ - they are all 0644 - now it is working. Which leaves me really confused as to how this could have possibly worked?

If you approve, I'll go back and make this change in ansible-pcp
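
A minimal sketch of the permission change described above, assuming the file lives at /etc/pcp/openmetrics/spark.url (the directory and file name come from this comment; the task name is made up, and the real change in ansible-pcp may be expressed differently, for example as a `mode` on the task that creates the file):

```yaml
# Sketch only: make spark.url world-readable like the other files
# in /etc/pcp/openmetrics/.
- name: Ensure Spark OpenMetrics URL file is world-readable  # hypothetical task name
  file:
    path: /etc/pcp/openmetrics/spark.url
    mode: "0644"
```

Quoting the mode keeps YAML from reading it as an integer, which would otherwise produce unexpected permissions.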

@richm
Collaborator Author

richm commented Jul 23, 2025

[citest]

@richm
Collaborator Author

richm commented Jul 23, 2025

This looks much better - but still two issues

  • centos-7 ansible 2.9 - Error, could not touch target: [Errno 2] No such file or directory: '/var/lib/pcp/pmdas/openmetrics/.NeedInstall' - maybe the name or location of the file is different in el7?
  • fedora-42 - bpftrace agent does not show up in pmprobe -I pmcd.agent.status - pmcd.agent.status 11 "root" "pmcd" "proc" "pmproxy" "xfs" "linux" "nfsclient" "mmv" "kvm" "jbd2" "dm"
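
One way to narrow down the centos-7 failure would be to check whether the PMDA directory exists at all before the .NeedInstall file is touched. The sketch below is only a diagnostic idea. The path is taken from the error message, while the task names and register variable are hypothetical.

```yaml
# Sketch only: report whether the directory from the error message exists.
- name: Check whether the OpenMetrics PMDA directory exists
  stat:
    path: /var/lib/pcp/pmdas/openmetrics
  register: __openmetrics_pmda_dir   # hypothetical variable name

- name: Report whether the directory was found
  debug:
    msg: "OpenMetrics PMDA directory present: {{ __openmetrics_pmda_dir.stat.exists }}"
```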

@natoscott
Collaborator

> @natoscott The pmcd service was getting a permission denied error on spark.url - I changed it to 0644 to match the permissions on the other files in /etc/pcp/openmetrics/ - they are all 0644 - now it is working.

Excellent!

> Which leaves me really confused as to how this could have possibly worked?

pmdaopenmetrics was recently changed to run 'unprivileged' (under the 'pcp' user account) - previously it ran as root - this is some unintended fallout (and unforeseen, I'm wondering if that change might affect anything else now - it's the first diagnosed case of this I've seen fortunately, so hopefully this isn't common).

> If you approve, I'll go back and make this change in ansible-pcp

Yes please, thanks for fixing this!
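
Since the agent now runs under the 'pcp' account, a test could check readability as that user rather than as root. This is a sketch under assumptions: the file path is inferred from the earlier comment, the task name is made up, and it presumes privilege escalation to the `pcp` user works on the managed node.

```yaml
# Sketch only: fail if the pcp user cannot read the Spark endpoint file.
- name: Verify the pcp user can read spark.url  # hypothetical task name
  command: test -r /etc/pcp/openmetrics/spark.url
  become: true
  become_user: pcp
  changed_when: false
```

A check like this would have failed with the old file mode, whereas the same command run as root would pass either way.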

@richm
Collaborator Author

richm commented Jul 23, 2025

> > @natoscott The pmcd service was getting a permission denied error on spark.url - I changed it to 0644 to match the permissions on the other files in /etc/pcp/openmetrics/ - they are all 0644 - now it is working.
>
> Excellent!
>
> > Which leaves me really confused as to how this could have possibly worked?
>
> pmdaopenmetrics was recently changed to run 'unprivileged' (under the 'pcp' user account) - previously it ran as root - this is some unintended fallout (and unforeseen, I'm wondering if that change might affect anything else now - it's the first diagnosed case of this I've seen fortunately, so hopefully this isn't common).
>
> > If you approve, I'll go back and make this change in ansible-pcp
>
> Yes please, thanks for fixing this!

performancecopilot/ansible-pcp#86

@richm
Collaborator Author

richm commented Jul 24, 2025

[citest]

@richm
Collaborator Author

richm commented Jul 24, 2025

[citest]

@richm
Collaborator Author

richm commented Jul 24, 2025

[citest]

add openmetrics to metrics domains

Signed-off-by: Rich Megginson <rmeggins@redhat.com>
@richm richm merged commit 4a93647 into linux-system-roles:main Jul 24, 2025
15 of 19 checks passed
@richm richm deleted the fix-spark-2 branch July 24, 2025 14:31