
[llm-d] Keep working#915

Closed
kpouget wants to merge 4 commits into openshift-psap:main from kpouget:download-model3

Conversation

@kpouget
Contributor

@kpouget kpouget commented Mar 28, 2026

Summary by CodeRabbit

  • New Features

    • Added capture_servicemonitor_metrics command to capture ServiceMonitor/PodMonitor YAML and metrics for specified services with configurable capture frequency and options.
    • Automated metrics capture for LLM inference service tests.
  • Documentation

    • Added documentation for the new metrics capture command.

@openshift-ci

openshift-ci bot commented Mar 28, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign albertoperdomo2 for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@coderabbitai

coderabbitai bot commented Mar 28, 2026

📝 Walkthrough


A new Cluster.capture_servicemonitor_metrics method and Ansible role were added to capture ServiceMonitor/PodMonitor YAML and metrics from Kubernetes services. The implementation includes configuration parameters, task files for both monitor types, a metrics capture pod template, and integration into the LLM-D test framework for automated metrics collection.

Changes

Cohort / File(s) — Summary

  • Documentation (docs/toolbox.generated/Cluster.capture_servicemonitor_metrics.rst, docs/toolbox.generated/index.rst) — New auto-generated documentation for the Cluster.capture_servicemonitor_metrics command, with parameter descriptions and a command index entry.
  • Cluster API Method (projects/cluster/toolbox/cluster.py) — Added the Cluster.capture_servicemonitor_metrics() method with @AnsibleRole and @AnsibleMappedParams decorators to wire the ServiceMonitor/PodMonitor metrics capture invocation.
  • Ansible Role Configuration (projects/cluster/toolbox/cluster_capture_servicemonitor_metrics/defaults/main/config.yml) — Defined default configuration variables for the service name, namespace detection, capture frequency (60 s default), monitor type selection, finalization mode, and Linux OS family.
  • Ansible Role Tasks (projects/cluster/toolbox/cluster_capture_servicemonitor_metrics/tasks/main.yml, tasks/servicemonitor.yml, tasks/podmonitor.yml) — Implemented task orchestration for namespace resolution, finalization logic, conditional ServiceMonitor vs. PodMonitor processing, pod deployment, and metrics URL extraction with optional auth secret handling.
  • Metrics Capture Pod Template (projects/cluster/toolbox/cluster_capture_servicemonitor_metrics/templates/metrics_capture_pod.yaml.j2) — Kubernetes Pod manifest template that deploys a metrics collection container with bash-based curl loops, TLS verification control, Bearer token authentication, and a configurable capture frequency.
  • LLM-D Test Integration (projects/llm-d/testing/config.yaml, projects/llm-d/testing/test_llmd.py) — Added metrics capture configuration keys for ServiceMonitor/PodMonitor names; implemented start_metrics_capture() and stop_metrics_capture() helpers that gate captures by flavor and handle finalization.
  • Visualization Parser (projects/llm-d/visualizations/llmd_inference/store/parsers.py) — Extended IMPORTANT_FILES to include the GuideLLM benchmark artifacts (benchmarks.json and guidellm_benchmark_job.logs).

Sequence Diagram(s)

sequenceDiagram
    participant Test as Test Framework
    participant Cluster as Cluster API
    participant Ansible as Ansible Role
    participant K8s as Kubernetes API
    participant Pod as Metrics Capture Pod

    Test->>Cluster: capture_servicemonitor_metrics(service_name, namespace)
    Cluster->>Ansible: RunAnsibleRole(cluster_capture_servicemonitor_metrics)
    Ansible->>K8s: oc project (determine namespace)
    Ansible->>K8s: Get ServiceMonitor/PodMonitor YAML
    Ansible->>K8s: Query matching services/pods
    Ansible->>K8s: Extract endpoint config (scheme, port)
    Ansible->>K8s: Build metrics URLs
    Ansible->>K8s: Create metrics capture pod (pod.yaml)
    K8s->>Pod: Deploy pod with metrics URLs
    Pod->>Pod: Loop: fetch metrics via curl (every capture_frequency)
    Pod->>Pod: Apply auth token if available
    Ansible->>K8s: Wait for pod Ready state
    Test->>Cluster: capture_servicemonitor_metrics(..., finalize=True)
    Cluster->>Ansible: RunAnsibleRole (finalize mode)
    Ansible->>K8s: Get pod logs
    Ansible->>K8s: Delete metrics capture pod
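The pod-side behavior in the diagram (implemented in the actual template as a bash + curl loop) can be sketched in Python. This is an illustrative sketch only: capture_loop, fetch, and the parameter names are hypothetical stand-ins, not the template's real interface.

```python
import time

def capture_loop(urls, fetch, capture_frequency=60, iterations=3, token=None):
    # Hypothetical sketch of the loop the metrics-capture pod runs:
    # poll every metrics URL once per interval, attach a Bearer token
    # when an auth secret is mounted, and keep the raw samples.
    samples = []
    for i in range(iterations):
        headers = {"Authorization": f"Bearer {token}"} if token else {}
        for url in urls:
            samples.append((i, url, fetch(url, headers)))
        if i < iterations - 1:
            time.sleep(capture_frequency)
    return samples
```

In the real pod the fetcher is curl with TLS-verification control, and the loop runs until the finalize step deletes the pod; here iterations is bounded only so the sketch terminates.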

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

  • [llm-d] Keep working #907 — Modifies the same llm-d test harness files (test_llmd.py, config.yaml) to integrate metrics/benchmark capture behaviors alongside this PR's metrics capture infrastructure.
  • [llmd] Keep working #902 — Updates llm-d testing and metrics integration by adding/consuming the same tests.llmd.inference_service.metrics configuration keys introduced in this PR.
  • [llm-d] Keep working #906 — Directly overlaps with this PR's changes to LLM-D test files and introduces related metrics capture wiring in the same test framework locations.

Poem

🐰 Metrics hop and pod monitors dance,
ServiceMonitor guides them in their prance,
With Ansible's grace and bash loops so tight,
We capture all the metrics through the night! ✨
From Kubernetes clouds to artifacts we race,
Monitoring magic in every place!

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 inconclusive)

  • Title check — ❓ Inconclusive. The title '[llm-d] Keep working' is vague and does not clearly convey the actual changes, which add ServiceMonitor/PodMonitor metrics capture functionality and related test infrastructure. Resolution: replace it with a descriptive title such as 'Add ServiceMonitor/PodMonitor metrics capture functionality for llm-d tests'.

✅ Passed checks (2 passed)

  • Description check — ✅ Passed. Check skipped: CodeRabbit’s high-level summary is enabled.
  • Docstring coverage — ✅ Passed. Docstring coverage is 100.00%, which exceeds the required threshold of 80.00%.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.


Comment @coderabbitai help to get the list of available commands and usage tips.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 9

🧹 Nitpick comments (2)
projects/cluster/toolbox/cluster.py (1)

561-565: Fix the documented default for capture_frequency.

The function default is 60, but the docstring still says default: 15, and the generated RST now shows both values.

Proposed fix
-            capture_frequency: How often to capture metrics in seconds (default: 15)
+            capture_frequency: How often to capture metrics in seconds (default: 60)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@projects/cluster/toolbox/cluster.py` around lines 561 - 565, Update the
docstring for the parameter capture_frequency to match the function default (60
seconds) — currently it incorrectly says "default: 15". Locate the docstring
block in projects/cluster/toolbox/cluster.py that documents capture_frequency
(the function where the parameter capture_frequency is defined) and change the
text "default: 15" to "default: 60" so generated RST and docs reflect the actual
default.
projects/cluster/toolbox/cluster_capture_servicemonitor_metrics/tasks/main.yml (1)

56-60: Secret existence check fails the play if secret is missing.

If auth_secret_name is set but the secret doesn't exist in the namespace, oc get secret returns non-zero and the play fails. If this is a valid scenario (e.g., optional auth), add error handling:

Proposed fix to handle missing secrets gracefully
 - name: Check if auth secret exists
   shell: |
     oc get secret {{ auth_secret_name }} -n {{ target_namespace }} --no-headers -o name
   register: auth_secret_exists
+  failed_when: false
   when: auth_secret_name != ""
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@projects/cluster/toolbox/cluster_capture_servicemonitor_metrics/tasks/main.yml`
around lines 56 - 60, The "Check if auth secret exists" shell task fails the
play when the secret is missing; change the task (named "Check if auth secret
exists") to treat a non-zero oc exit code as non-fatal by adding failed_when:
false (or ignore_errors: true) so the play continues, keep register:
auth_secret_exists, and then use auth_secret_exists.rc or
auth_secret_exists.stdout to conditionally act later (e.g., only create
resources when auth_secret_exists.rc == 0) so missing optional secrets are
handled gracefully.
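The docstring/default drift flagged in the first nitpick is mechanically detectable. A minimal sketch, assuming the repo's convention of writing "(default: N)" in parameter docstrings — docstring_default_mismatches and the sample capture function are hypothetical, not part of the toolbox:

```python
import inspect
import re

def docstring_default_mismatches(func):
    # Return parameter names whose "(default: X)" docstring note
    # disagrees with the actual default in the function signature.
    doc = inspect.getdoc(func) or ""
    mismatches = []
    for name, p in inspect.signature(func).parameters.items():
        if p.default is inspect.Parameter.empty:
            continue
        m = re.search(rf"{re.escape(name)}:.*\(default:\s*([^)]+)\)", doc)
        if m and m.group(1).strip() != str(p.default):
            mismatches.append(name)
    return mismatches

def capture(capture_frequency=60):
    """
    capture_frequency: How often to capture metrics in seconds (default: 15)
    """
```

Run against the sample function above, the checker reports capture_frequency, which is exactly the mismatch the review asks to fix in cluster.py.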
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In
`@projects/cluster/toolbox/cluster_capture_servicemonitor_metrics/defaults/main/config.yml`:
- Around line 13-14: The inline comment for
cluster_capture_servicemonitor_metrics_capture_frequency says "(default: 15)"
but the variable is set to 60; make them consistent by updating the comment to
"(default: 60)" to match
cluster_capture_servicemonitor_metrics_capture_frequency, or if you intended the
default to be 15, change the variable value to 15 instead—adjust the comment and
value together so cluster_capture_servicemonitor_metrics_capture_frequency and
its descriptive comment match.

In
`@projects/cluster/toolbox/cluster_capture_servicemonitor_metrics/tasks/main.yml`:
- Around line 20-27: The "Check if pod exists for finalization" task should not
fail the play when the pod is absent: add failed_when: false (and optionally
changed_when: false) to that task so oc get returns won't abort the run, and
then guard the "Capture pod logs" task with when: pod_exists_check.rc == 0 (or
when: pod_exists_check.stdout != '' ) so logs are only collected if the pod
actually exists; reference the task names "Check if pod exists for finalization"
and "Capture pod logs" and the registered variable pod_exists_check when making
these changes.
- Around line 17-35: The "Delete metrics capture pod" task currently runs
unconditionally; ensure it only runs in finalize mode by either moving the task
into the existing finalize block (between the "Capture pod logs" task and the
meta: end_play) or by adding the when:
cluster_capture_servicemonitor_metrics_finalize guard to the "Delete metrics
capture pod" task; reference the task name "Delete metrics capture pod", the
condition variable cluster_capture_servicemonitor_metrics_finalize, and the
finalize block (meta: end_play) when making the change so the delete only
executes during finalization.

In
`@projects/cluster/toolbox/cluster_capture_servicemonitor_metrics/tasks/podmonitor.yml`:
- Around line 12-23: The tasks "Capture pod targets by PodMonitor selector",
"Capture pod targets by PodMonitor selector YAML" and "Get all target pod IPs
and names" currently use a hardcoded component list; change them to derive the
Pod label selector from the actual PodMonitor resource instead. Query the
PodMonitor in the target_namespace (e.g. oc get podmonitor <name> -n {{
target_namespace }} -o yaml) and extract .spec.selector.matchLabels (or convert
matchExpressions to label selector) into a variable, then use that label
selector in the oc get pod -l "<derived-selector>" commands (and in the -oyaml
and custom-columns calls) instead of the hardcoded app.kubernetes.io/component
list; ensure you reference the same selector encoding used by existing ISVC
lookups (app.kubernetes.io/name when applicable) and add a safe fallback to the
previous selector only if the PodMonitor has no selector.

In
`@projects/cluster/toolbox/cluster_capture_servicemonitor_metrics/tasks/servicemonitor.yml`:
- Around line 12-24: The tasks currently hardcode the scheduler labels and use
head -1 which can pick an arbitrary Service; instead, read the ServiceMonitor
named by the incoming monitor variable (the ServiceMonitor resource referenced
by the task that dumps YAML/status) and extract its spec.selector.matchLabels,
then use those labels to query Services to determine the target; replace the
hardcoded label selector in the "Capture service target by ServiceMonitor
selector" and "Get target service name" tasks with a step that fetches the
ServiceMonitor, builds the label selector string from spec.selector.matchLabels,
uses that selector for oc get service, and change the "Get target service name"
registration (target_service_name) to fail if zero matches or to
deterministically select a single match (e.g., error on multiple matches or pick
the service with a specific annotation/port) instead of using head -1.

In
`@projects/cluster/toolbox/cluster_capture_servicemonitor_metrics/templates/metrics_capture_pod.yaml.j2`:
- Around line 61-67: The template currently hardcodes reading
/var/run/secrets/auth/token and mounts the whole Secret; update the flow to
extract authorization.credentials.key from the ServiceMonitor/PodMonitor and
pass it into the template (e.g. as auth_secret_key), change the Secret volume
mount in the Pod template to use items and mount only that key (so the secret
key filename equals the passed-in key name), and update the runtime logic (the
fetch_metrics invocation/if block using auth_secret_name and fetch_metrics) to
read the token from /var/run/secrets/auth/<auth_secret_key> (or use a variable
TOKEN_FILE constructed from auth_secret_key) before calling fetch_metrics so the
correct key is mounted and read when credentials.key is not "token".

In `@projects/cluster/toolbox/cluster.py`:
- Around line 553-566: The function capture_servicemonitor_metrics currently
takes a parameter named service_name but treats it as the monitor resource name;
either rename the parameter to monitor_name (update the signature, docstring,
and all call sites where capture_servicemonitor_metrics is invoked) to reflect
that it expects a ServiceMonitor/PodMonitor name, or add logic inside
capture_servicemonitor_metrics to accept an actual Service name: when the
provided name is not found as a ServiceMonitor/PodMonitor, query the Service (oc
get service <name>) and locate the matching ServiceMonitor/PodMonitor by
selector/labels and use its name; update the docstring to state the accepted
inputs and adjust callers accordingly (refer to capture_servicemonitor_metrics
and parameter service_name).

In `@projects/llm-d/testing/test_llmd.py`:
- Around line 1019-1024: The helper functions start_metrics_capture and
stop_metrics_capture currently return when
tests.llmd.inference_service.metrics.manual_capture is false, which
inadvertently disables the automatic toolbox capture under shipped configs;
change the guard so the functions return only when manual_capture is true (i.e.,
invert the boolean check), and apply the same fix to both start_metrics_capture
and stop_metrics_capture so the automatic capture runs unless manual_capture is
explicitly enabled.

In `@projects/llm-d/visualizations/llmd_inference/store/parsers.py`:
- Around line 34-35: The important-file globs use
artifact_dirnames.GUIDELLM_BENCHMARK_DIR (exact match) while
find_guidellm_benchmark_directories() discovers with a wildcard suffix; update
the two globs that build
f"{artifact_dirnames.GUIDELLM_BENCHMARK_DIR}/artifacts/results/benchmarks.json"
and
f"{artifact_dirnames.GUIDELLM_BENCHMARK_DIR}/artifacts/guidellm_benchmark_job.logs"
to align with discovery by appending a wildcard to the benchmark dir portion
(e.g., use GUIDELLM_BENCHMARK_DIR + "*" or equivalent) so multi-rate/suffixed
directories are matched and registered/cached consistently.
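The glob mismatch described above can be illustrated with fnmatch-style matching (the actual parser may use a different glob engine; "guidellm-benchmark" stands in for artifact_dirnames.GUIDELLM_BENCHMARK_DIR, whose real value is an assumption here):

```python
from fnmatch import fnmatch

# Assumed stand-in for artifact_dirnames.GUIDELLM_BENCHMARK_DIR.
BENCH_DIR = "guidellm-benchmark"

# Exact-prefix glob (current code) vs wildcard-suffixed glob
# (matching how find_guidellm_benchmark_directories() discovers dirs).
exact = f"{BENCH_DIR}/artifacts/results/benchmarks.json"
wild = f"{BENCH_DIR}*/artifacts/results/benchmarks.json"

# A multi-rate run lands in a suffixed directory: the exact prefix
# misses it, while the wildcard pattern still matches.
path = f"{BENCH_DIR}_rate-10/artifacts/results/benchmarks.json"
```

The same wildcard also matches the unsuffixed directory, so aligning the two globs with the discovery logic loses nothing.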

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 071699d5-aa7b-423d-bdef-219be891108d

📥 Commits

Reviewing files that changed from the base of the PR and between 559da40 and 3c414e9.

📒 Files selected for processing (11)
  • docs/toolbox.generated/Cluster.capture_servicemonitor_metrics.rst
  • docs/toolbox.generated/index.rst
  • projects/cluster/toolbox/cluster.py
  • projects/cluster/toolbox/cluster_capture_servicemonitor_metrics/defaults/main/config.yml
  • projects/cluster/toolbox/cluster_capture_servicemonitor_metrics/tasks/main.yml
  • projects/cluster/toolbox/cluster_capture_servicemonitor_metrics/tasks/podmonitor.yml
  • projects/cluster/toolbox/cluster_capture_servicemonitor_metrics/tasks/servicemonitor.yml
  • projects/cluster/toolbox/cluster_capture_servicemonitor_metrics/templates/metrics_capture_pod.yaml.j2
  • projects/llm-d/testing/config.yaml
  • projects/llm-d/testing/test_llmd.py
  • projects/llm-d/visualizations/llmd_inference/store/parsers.py

Comment on lines +13 to +14
# How often to capture metrics in seconds (default: 15)
cluster_capture_servicemonitor_metrics_capture_frequency: 60

⚠️ Potential issue | 🟡 Minor

Comment/default value mismatch.

The comment states (default: 15) but the actual default is 60. Update either the comment or the value to be consistent.

Proposed fix
-# How often to capture metrics in seconds (default: 15)
-cluster_capture_servicemonitor_metrics_capture_frequency: 60
+# How often to capture metrics in seconds (default: 60)
+cluster_capture_servicemonitor_metrics_capture_frequency: 60

Comment on lines +17 to +35

- name: "[Finalize mode] capture logs and delete existing pod"
  when: cluster_capture_servicemonitor_metrics_finalize
  block:
  - name: Check if pod exists for finalization
    shell: |
      oc get pod topsail-metrics-capture-{{ cluster_capture_servicemonitor_metrics_service_name }} -n {{ target_namespace }} --no-headers -o name
    register: pod_exists_check

  - name: Capture pod logs
    shell: |
      oc logs topsail-metrics-capture-{{ cluster_capture_servicemonitor_metrics_service_name }} -n {{ target_namespace }} > "{{ artifact_extra_logs_dir }}/artifacts/metrics_capture_logs.txt"

- name: Delete metrics capture pod
  shell: |
    oc delete pod topsail-metrics-capture-{{ cluster_capture_servicemonitor_metrics_service_name }} -n {{ target_namespace }} --grace-period=0 --ignore-not-found

- name: "[Finalize mode] capture logs and delete existing pod"
  when: cluster_capture_servicemonitor_metrics_finalize
  meta: end_play

⚠️ Potential issue | 🔴 Critical

Critical: Delete task runs unconditionally outside the finalize block.

The "Delete metrics capture pod" task (lines 29-31) is outside the block structure and has no when condition, so it executes in both finalize and normal modes. In normal deployment mode, this will delete any existing capture pod before the play continues to create a new one.

If this is intentional cleanup behavior, move it after the end_play guard so it's clear. If it should only run in finalize mode, move it inside the block.

Proposed fix (assuming delete should be finalize-only)
 - name: "[Finalize mode] capture logs and delete existing pod"
   when: cluster_capture_servicemonitor_metrics_finalize
   block:
   - name: Check if pod exists for finalization
     shell: |
       oc get pod topsail-metrics-capture-{{ cluster_capture_servicemonitor_metrics_service_name }} -n {{ target_namespace }} --no-headers -o name
     register: pod_exists_check
+    failed_when: false

   - name: Capture pod logs
     shell: |
       oc logs topsail-metrics-capture-{{ cluster_capture_servicemonitor_metrics_service_name }} -n {{ target_namespace }} > "{{ artifact_extra_logs_dir }}/artifacts/metrics_capture_logs.txt"
+    when: pod_exists_check.rc == 0

-- name: Delete metrics capture pod
-  shell: |
-    oc delete pod topsail-metrics-capture-{{ cluster_capture_servicemonitor_metrics_service_name }} -n {{ target_namespace }}  --grace-period=0 --ignore-not-found
+  - name: Delete metrics capture pod
+    shell: |
+      oc delete pod topsail-metrics-capture-{{ cluster_capture_servicemonitor_metrics_service_name }} -n {{ target_namespace }}  --grace-period=0 --ignore-not-found

 - name: "[Finalize mode] capture logs and delete existing pod"
   when: cluster_capture_servicemonitor_metrics_finalize
   meta: end_play
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@projects/cluster/toolbox/cluster_capture_servicemonitor_metrics/tasks/main.yml`
around lines 17 - 35, The "Delete metrics capture pod" task currently runs
unconditionally; ensure it only runs in finalize mode by either moving the task
into the existing finalize block (between the "Capture pod logs" task and the
meta: end_play) or by adding the when:
cluster_capture_servicemonitor_metrics_finalize guard to the "Delete metrics
capture pod" task; reference the task name "Delete metrics capture pod", the
condition variable cluster_capture_servicemonitor_metrics_finalize, and the
finalize block (meta: end_play) when making the change so the delete only
executes during finalization.

Comment on lines +20 to +27

- name: Check if pod exists for finalization
  shell: |
    oc get pod topsail-metrics-capture-{{ cluster_capture_servicemonitor_metrics_service_name }} -n {{ target_namespace }} --no-headers -o name
  register: pod_exists_check

- name: Capture pod logs
  shell: |
    oc logs topsail-metrics-capture-{{ cluster_capture_servicemonitor_metrics_service_name }} -n {{ target_namespace }} > "{{ artifact_extra_logs_dir }}/artifacts/metrics_capture_logs.txt"

⚠️ Potential issue | 🟠 Major

Missing error handling in finalize flow.

If the capture pod doesn't exist when finalize mode runs, oc get pod (line 22) fails with a non-zero exit code, causing the play to fail. The registered pod_exists_check variable is never used to conditionally skip the logs capture.

Consider adding failed_when: false to the check and using the result to guard subsequent tasks:

Proposed fix
   - name: Check if pod exists for finalization
     shell: |
       oc get pod topsail-metrics-capture-{{ cluster_capture_servicemonitor_metrics_service_name }} -n {{ target_namespace }} --no-headers -o name
     register: pod_exists_check
+    failed_when: false

   - name: Capture pod logs
     shell: |
       oc logs topsail-metrics-capture-{{ cluster_capture_servicemonitor_metrics_service_name }} -n {{ target_namespace }} > "{{ artifact_extra_logs_dir }}/artifacts/metrics_capture_logs.txt"
+    when: pod_exists_check.rc == 0

Comment on lines +12 to +23

- name: Capture pod targets by PodMonitor selector
  shell: |
    oc get pod -l "app.kubernetes.io/component in (llminferenceservice-workload,llminferenceservice-workload-prefill,llminferenceservice-workload-worker,llminferenceservice-workload-leader,llminferenceservice-workload-leader-prefill,llminferenceservice-workload-worker-prefill),app.kubernetes.io/part-of=llminferenceservice" -n {{ target_namespace }} > "{{ artifact_extra_logs_dir }}/artifacts/target_pods.status"

- name: Capture pod targets by PodMonitor selector YAML
  shell: |
    oc get pod -l "app.kubernetes.io/component in (llminferenceservice-workload,llminferenceservice-workload-prefill,llminferenceservice-workload-worker,llminferenceservice-workload-leader,llminferenceservice-workload-leader-prefill,llminferenceservice-workload-worker-prefill),app.kubernetes.io/part-of=llminferenceservice" -n {{ target_namespace }} -oyaml > "{{ artifact_extra_logs_dir }}/artifacts/target_pods.yaml"

- name: Get all target pod IPs and names
  shell: |
    oc get pod -l "app.kubernetes.io/component in (llminferenceservice-workload,llminferenceservice-workload-prefill,llminferenceservice-workload-worker,llminferenceservice-workload-leader,llminferenceservice-workload-leader-prefill,llminferenceservice-workload-worker-prefill),app.kubernetes.io/part-of=llminferenceservice" -n {{ target_namespace }} --no-headers -o custom-columns=":metadata.name,:status.podIP"
  register: target_pods_info

⚠️ Potential issue | 🟠 Major

Resolve pods from the PodMonitor you were asked to capture.

After fetching the named PodMonitor, the role ignores its selector and scans every llminferenceservice workload pod in the namespace via a hardcoded component list. That will mix captures across services when multiple LLMInferenceServices share a namespace, and it will break as soon as the PodMonitor selector changes. The existing ISVC state capture already keys pod lookup off app.kubernetes.io/name; this should derive its selector from the PodMonitor too.

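The selector derivation suggested in this comment can be sketched as a small helper that turns a PodMonitor spec's selector into an `oc get pod -l` string. This is a hypothetical sketch, not code from the PR; the spec dict shape follows the Prometheus Operator's matchLabels/matchExpressions convention:

```python
def selector_from_podmonitor(spec):
    # Build a label-selector string from a PodMonitor spec's selector,
    # covering matchLabels and the four matchExpressions operators,
    # instead of hardcoding a component list.
    sel = spec.get("selector", {})
    parts = [f"{k}={v}" for k, v in sorted(sel.get("matchLabels", {}).items())]
    for expr in sel.get("matchExpressions", []):
        key, op = expr["key"], expr["operator"]
        if op in ("In", "NotIn"):
            values = ",".join(expr["values"])
            parts.append(f"{key} {'in' if op == 'In' else 'notin'} ({values})")
        elif op == "Exists":
            parts.append(key)
        elif op == "DoesNotExist":
            parts.append(f"!{key}")
    return ",".join(parts)
```

An empty result would signal the "PodMonitor has no selector" case, where the review suggests falling back to the previous hardcoded selector.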

Comment on lines +12 to +24

- name: Capture service target by ServiceMonitor selector
  shell: |
    oc get service -l "app.kubernetes.io/component=llminferenceservice-router-scheduler,app.kubernetes.io/part-of=llminferenceservice" -n {{ target_namespace }} > "{{ artifact_extra_logs_dir }}/artifacts/target_service.status"

- name: Capture service target by ServiceMonitor selector YAML
  shell: |
    oc get service -l "app.kubernetes.io/component=llminferenceservice-router-scheduler,app.kubernetes.io/part-of=llminferenceservice" -n {{ target_namespace }} -oyaml > "{{ artifact_extra_logs_dir }}/artifacts/target_service.yaml"

- name: Get target service name
  shell: |
    set -o pipefail;
    oc get service -l "app.kubernetes.io/component=llminferenceservice-router-scheduler,app.kubernetes.io/part-of=llminferenceservice" -n {{ target_namespace }} --no-headers -o custom-columns=":metadata.name" | head -1
  register: target_service_name

⚠️ Potential issue | 🟠 Major

Derive the target Service from the ServiceMonitor instead of hardcoding the scheduler labels.

These tasks only use the requested monitor name for the YAML/status dump; the actual lookup is always pinned to llminferenceservice-router-scheduler. If this command is pointed at any other ServiceMonitor, or if that monitor’s selector changes, target_service_name and the generated metrics URL will be wrong. head -1 also makes the selected Service arbitrary when more than one matches.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@projects/cluster/toolbox/cluster_capture_servicemonitor_metrics/tasks/servicemonitor.yml`
around lines 12 - 24, The tasks currently hardcode the scheduler labels and use
head -1 which can pick an arbitrary Service; instead, read the ServiceMonitor
named by the incoming monitor variable (the ServiceMonitor resource referenced
by the task that dumps YAML/status) and extract its spec.selector.matchLabels,
then use those labels to query Services to determine the target; replace the
hardcoded label selector in the "Capture service target by ServiceMonitor
selector" and "Get target service name" tasks with a step that fetches the
ServiceMonitor, builds the label selector string from spec.selector.matchLabels,
uses that selector for oc get service, and change the "Get target service name"
registration (target_service_name) to fail if zero matches or to
deterministically select a single match (e.g., error on multiple matches or pick
the service with a specific annotation/port) instead of using head -1.
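The deterministic selection suggested above could look like this — a sketch (pick_target_service is a hypothetical helper operating on the `--no-headers -o custom-columns` output) that replaces `head -1` with an explicit zero/many check:

```python
def pick_target_service(names_output: str) -> str:
    """Select the single Service matched by the ServiceMonitor selector.

    Fails loudly on zero or multiple matches instead of silently
    taking the first line like `head -1`."""
    names = [line.strip() for line in names_output.splitlines() if line.strip()]
    if not names:
        raise RuntimeError("no Service matches the ServiceMonitor selector")
    if len(names) > 1:
        raise RuntimeError("ambiguous selector, matched: " + ", ".join(names))
    return names[0]
```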

Comment on lines +61 to +67
{% if auth_secret_name %}
if [ -f /var/run/secrets/auth/token ]; then
TOKEN=$(cat /var/run/secrets/auth/token)
fetch_metrics "$url" "$TOKEN"
else
echo "Auth token file not found"
fi

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Scripts executed against openshift-psap/topsail:

find projects/cluster/toolbox/cluster_capture_servicemonitor_metrics -name "*.yaml.j2" -o -name "*.yaml"
cat projects/cluster/toolbox/cluster_capture_servicemonitor_metrics/templates/metrics_capture_pod.yaml.j2
rg "auth_secret" projects/cluster/toolbox/cluster_capture_servicemonitor_metrics/ -A 2 -B 2
find projects/cluster/toolbox/cluster_capture_servicemonitor_metrics -name "*.yml" -o -name "*.yaml" | xargs rg "auth_secret" -l
cat projects/cluster/toolbox/cluster_capture_servicemonitor_metrics/defaults/main.yml
cat projects/cluster/toolbox/cluster_capture_servicemonitor_metrics/tasks/servicemonitor.yml
cat projects/cluster/toolbox/cluster_capture_servicemonitor_metrics/tasks/podmonitor.yml
cat projects/cluster/toolbox/cluster_capture_servicemonitor_metrics/tasks/main.yml | head -80


Thread the secret key name through the role and mount only the selected key.

The template mounts the entire Secret without specifying which key to use (lines 83–94), then hardcodes reading /var/run/secrets/auth/token (lines 61–67). When a monitor's authorization.credentials points to a different key name, this path won't exist and authentication silently fails.

Extract authorization.credentials.key from the ServiceMonitor/PodMonitor alongside the secret name, pass it through to the template, mount only that key using the items field, and read the correct filename.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@projects/cluster/toolbox/cluster_capture_servicemonitor_metrics/templates/metrics_capture_pod.yaml.j2`
around lines 61 - 67, The template currently hardcodes reading
/var/run/secrets/auth/token and mounts the whole Secret; update the flow to
extract authorization.credentials.key from the ServiceMonitor/PodMonitor and
pass it into the template (e.g. as auth_secret_key), change the Secret volume
mount in the Pod template to use items and mount only that key (so the secret
key filename equals the passed-in key name), and update the runtime logic (the
fetch_metrics invocation/if block using auth_secret_name and fetch_metrics) to
read the token from /var/run/secrets/auth/<auth_secret_key> (or use a variable
TOKEN_FILE constructed from auth_secret_key) before calling fetch_metrics so the
correct key is mounted and read when credentials.key is not "token".

Comment on lines +553 to +566
def capture_servicemonitor_metrics(self, service_name, namespace="", capture_frequency=60, is_podmonitor=False, finalize=False):
"""
Captures ServiceMonitor or PodMonitor YAML and status for a given service

Captures the ServiceMonitor/PodMonitor configuration and status information for
a specific service in a namespace, including related service/pod and
endpoints information for troubleshooting monitoring setup.

Args:
service_name: Name of the service to capture ServiceMonitor/PodMonitor metrics for
namespace: Namespace where the service and ServiceMonitor/PodMonitor are located (empty string auto-detects current namespace)
capture_frequency: How often to capture metrics in seconds (default: 60)
is_podmonitor: Whether to use PodMonitor instead of ServiceMonitor (default: False)
finalize: Whether to finalize (capture logs and delete) an existing pod instead of creating new one (default: False)

🛠️ Refactor suggestion | 🟠 Major

Rename service_name or make the role accept actual Services.

The public command and generated docs say this parameter is a Service name, but the role uses it as oc get servicemonitor|podmonitor <name>. Every new caller in this PR therefore passes a monitor resource name instead. As-is, a CLI user following the docs will hit NotFound unless the monitor happens to share the Service name.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@projects/cluster/toolbox/cluster.py` around lines 553 - 566, The function
capture_servicemonitor_metrics currently takes a parameter named service_name
but treats it as the monitor resource name; either rename the parameter to
monitor_name (update the signature, docstring, and all call sites where
capture_servicemonitor_metrics is invoked) to reflect that it expects a
ServiceMonitor/PodMonitor name, or add logic inside
capture_servicemonitor_metrics to accept an actual Service name: when the
provided name is not found as a ServiceMonitor/PodMonitor, query the Service (oc
get service <name>) and locate the matching ServiceMonitor/PodMonitor by
selector/labels and use its name; update the docstring to state the accepted
inputs and adjust callers accordingly (refer to capture_servicemonitor_metrics
and parameter service_name).
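The fallback option in the prompt above could be sketched as follows (resolve_monitor and the dict shapes are illustrative — in practice the labels would come from `oc get` output): try the name as a monitor first, then fall back to matching each monitor's selector against the named Service's labels.

```python
def resolve_monitor(name: str, monitors: dict, services: dict) -> str:
    """Resolve `name` to a ServiceMonitor/PodMonitor name.

    monitors: monitor name -> spec.selector.matchLabels dict
    services: Service name -> metadata.labels dict
    """
    if name in monitors:
        return name
    labels = services.get(name)
    if labels is None:
        raise KeyError(f"{name} is neither a monitor nor a Service")
    for monitor, selector in monitors.items():
        # a monitor selects the Service when its matchLabels are a
        # subset of the Service's labels
        if all(labels.get(k) == v for k, v in selector.items()):
            return monitor
    raise KeyError(f"no monitor selects Service {name}")
```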

Comment on lines +1019 to +1024
def start_metrics_capture(flavor):
"""
Starts metrics capture for both ServiceMonitor and PodMonitor if enabled
"""
if not config.project.get_config("tests.llmd.inference_service.metrics.manual_capture"):
return

⚠️ Potential issue | 🔴 Critical

manual_capture currently disables the new automatic capture path.

Both helpers return when tests.llmd.inference_service.metrics.manual_capture is false, but this PR sets that flag to false in the base config and in the cks preset. So start_metrics_capture() / stop_metrics_capture() never invoke the new toolbox command under the shipped configs.

Proposed fix
 def start_metrics_capture(flavor):
@@
-    if not config.project.get_config("tests.llmd.inference_service.metrics.manual_capture"):
+    if config.project.get_config("tests.llmd.inference_service.metrics.manual_capture"):
         return
@@
 def stop_metrics_capture(flavor):
@@
-    if not config.project.get_config("tests.llmd.inference_service.metrics.manual_capture"):
+    if config.project.get_config("tests.llmd.inference_service.metrics.manual_capture"):
         return

Also applies to: 1057-1062

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@projects/llm-d/testing/test_llmd.py` around lines 1019 - 1024, The helper
functions start_metrics_capture and stop_metrics_capture currently return when
tests.llmd.inference_service.metrics.manual_capture is false, which
inadvertently disables the automatic toolbox capture under shipped configs;
change the guard so the functions return only when manual_capture is true (i.e.,
invert the boolean check), and apply the same fix to both start_metrics_capture
and stop_metrics_capture so the automatic capture runs unless manual_capture is
explicitly enabled.

Comment on lines +34 to +35
f"{artifact_dirnames.GUIDELLM_BENCHMARK_DIR}/artifacts/results/benchmarks.json",
f"{artifact_dirnames.GUIDELLM_BENCHMARK_DIR}/artifacts/guidellm_benchmark_job.logs",

⚠️ Potential issue | 🟠 Major

Align GuideLLM important-file glob with multi-rate directory discovery.

Line 34 and Line 35 use artifact_dirnames.GUIDELLM_BENCHMARK_DIR (*__llmd__run_guidellm_benchmark) but discovery in find_guidellm_benchmark_directories() uses *__llmd__run_guidellm_benchmark*.
This mismatch can miss registration/caching for suffixed multi-rate directories.

Proposed fix
-artifact_dirnames.GUIDELLM_BENCHMARK_DIR = "*__llmd__run_guidellm_benchmark"
+artifact_dirnames.GUIDELLM_BENCHMARK_DIR = "*__llmd__run_guidellm_benchmark*"
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@projects/llm-d/visualizations/llmd_inference/store/parsers.py` around lines
34 - 35, The important-file globs use artifact_dirnames.GUIDELLM_BENCHMARK_DIR
(exact match) while find_guidellm_benchmark_directories() discovers with a
wildcard suffix; update the two globs that build
f"{artifact_dirnames.GUIDELLM_BENCHMARK_DIR}/artifacts/results/benchmarks.json"
and
f"{artifact_dirnames.GUIDELLM_BENCHMARK_DIR}/artifacts/guidellm_benchmark_job.logs"
to align with discovery by appending a wildcard to the benchmark dir portion
(e.g., use GUIDELLM_BENCHMARK_DIR + "*" or equivalent) so multi-rate/suffixed
directories are matched and registered/cached consistently.

@kpouget kpouget closed this Mar 29, 2026