
[llm-d] Keep working#915

Closed
kpouget wants to merge 4 commits into openshift-psap:main from kpouget:download-model3

Conversation

@kpouget
Contributor

@kpouget kpouget commented Mar 28, 2026

Summary by CodeRabbit

  • New Features

    • Added capture_servicemonitor_metrics command to capture ServiceMonitor/PodMonitor YAML and metrics for specified services with configurable capture frequency and options.
    • Automated metrics capture for LLM inference service tests.
  • Documentation

    • Added documentation for the new metrics capture command.

@openshift-ci

openshift-ci bot commented Mar 28, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign albertoperdomo2 for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@coderabbitai

coderabbitai bot commented Mar 28, 2026

📝 Walkthrough


A new Cluster.capture_servicemonitor_metrics method and Ansible role were added to capture ServiceMonitor/PodMonitor YAML and metrics from Kubernetes services. The implementation includes configuration parameters, task files for both monitor types, a metrics capture pod template, and integration into the LLM-D test framework for automated metrics collection.

Changes

Cohort / File(s) — Summary

  • Documentation (docs/toolbox.generated/Cluster.capture_servicemonitor_metrics.rst, docs/toolbox.generated/index.rst) — New auto-generated documentation for the Cluster.capture_servicemonitor_metrics command, with parameter descriptions and a command index entry.
  • Cluster API Method (projects/cluster/toolbox/cluster.py) — Added the Cluster.capture_servicemonitor_metrics() method with @AnsibleRole and @AnsibleMappedParams decorators to wire the ServiceMonitor/PodMonitor metrics capture invocation.
  • Ansible Role Configuration (projects/cluster/toolbox/cluster_capture_servicemonitor_metrics/defaults/main/config.yml) — Defined default configuration variables for the service name, namespace detection, capture frequency (60 s default), monitor type selection, finalization mode, and Linux OS family.
  • Ansible Role Tasks (projects/cluster/toolbox/cluster_capture_servicemonitor_metrics/tasks/main.yml, tasks/servicemonitor.yml, tasks/podmonitor.yml) — Implemented task orchestration for namespace resolution, finalization logic, conditional ServiceMonitor vs. PodMonitor processing, pod deployment, and metrics URL extraction with optional auth secret handling.
  • Metrics Capture Pod Template (projects/cluster/toolbox/cluster_capture_servicemonitor_metrics/templates/metrics_capture_pod.yaml.j2) — Kubernetes Pod manifest template that deploys a metrics collection container with bash-based curl loops, TLS verification control, Bearer token authentication, and a configurable capture frequency.
  • LLM-D Test Integration (projects/llm-d/testing/config.yaml, projects/llm-d/testing/test_llmd.py) — Added metrics capture configuration keys for ServiceMonitor/PodMonitor names; implemented start_metrics_capture() and stop_metrics_capture() helpers that gate captures by flavor and handle finalization.
  • Visualization Parser (projects/llm-d/visualizations/llmd_inference/store/parsers.py) — Extended IMPORTANT_FILES to include the GuideLLM benchmark artifacts (benchmarks.json and guidellm_benchmark_job.logs).

Sequence Diagram(s)

sequenceDiagram
    participant Test as Test Framework
    participant Cluster as Cluster API
    participant Ansible as Ansible Role
    participant K8s as Kubernetes API
    participant Pod as Metrics Capture Pod

    Test->>Cluster: capture_servicemonitor_metrics(service_name, namespace)
    Cluster->>Ansible: RunAnsibleRole(cluster_capture_servicemonitor_metrics)
    Ansible->>K8s: oc project (determine namespace)
    Ansible->>K8s: Get ServiceMonitor/PodMonitor YAML
    Ansible->>K8s: Query matching services/pods
    Ansible->>K8s: Extract endpoint config (scheme, port)
    Ansible->>K8s: Build metrics URLs
    Ansible->>K8s: Create metrics capture pod (pod.yaml)
    K8s->>Pod: Deploy pod with metrics URLs
    Pod->>Pod: Loop: fetch metrics via curl (every capture_frequency)
    Pod->>Pod: Apply auth token if available
    Ansible->>K8s: Wait for pod Ready state
    Test->>Cluster: capture_servicemonitor_metrics(..., finalize=True)
    Cluster->>Ansible: RunAnsibleRole (finalize mode)
    Ansible->>K8s: Get pod logs
    Ansible->>K8s: Delete metrics capture pod
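The pod-side behavior in the diagram (implemented in the actual template as a bash + curl loop) can be sketched in Python. This is an illustrative sketch only: capture_loop, fetch, and the parameter names are hypothetical stand-ins, not the template's real interface.

```python
import time

def capture_loop(urls, fetch, capture_frequency=60, iterations=3, token=None):
    # Hypothetical sketch of the loop the metrics-capture pod runs:
    # poll every metrics URL once per interval, attach a Bearer token
    # when an auth secret is mounted, and keep the raw samples.
    samples = []
    for i in range(iterations):
        headers = {"Authorization": f"Bearer {token}"} if token else {}
        for url in urls:
            samples.append((i, url, fetch(url, headers)))
        if i < iterations - 1:
            time.sleep(capture_frequency)
    return samples
```

In the real pod the fetcher is curl with TLS-verification control, and the loop runs until the finalize step deletes the pod; here iterations is bounded only so the sketch terminates.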

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

  • [llm-d] Keep working #907 — Modifies the same llm-d test harness files (test_llmd.py, config.yaml) to integrate metrics/benchmark capture behaviors alongside this PR's metrics capture infrastructure.
  • [llmd] Keep working #902 — Updates llm-d testing and metrics integration by adding/consuming the same tests.llmd.inference_service.metrics configuration keys introduced in this PR.
  • [llm-d] Keep working #906 — Directly overlaps with this PR's changes to LLM-D test files and introduces related metrics capture wiring in the same test framework locations.

Poem

🐰 Metrics hop and pod monitors dance,
ServiceMonitor guides them in their prance,
With Ansible's grace and bash loops so tight,
We capture all the metrics through the night! ✨
From Kubernetes clouds to artifacts we race,
Monitoring magic in every place!

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 inconclusive)

  • Title check — ❓ Inconclusive. The title '[llm-d] Keep working' is vague and does not clearly convey the actual changes, which add ServiceMonitor/PodMonitor metrics capture functionality and related test infrastructure. Resolution: replace it with a descriptive title such as 'Add ServiceMonitor/PodMonitor metrics capture functionality for llm-d tests'.

✅ Passed checks (2 passed)

  • Description check — ✅ Passed. Check skipped: CodeRabbit’s high-level summary is enabled.
  • Docstring coverage — ✅ Passed. Docstring coverage is 100.00%, which exceeds the required threshold of 80.00%.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.


Comment @coderabbitai help to get the list of available commands and usage tips.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 9

🧹 Nitpick comments (2)
projects/cluster/toolbox/cluster.py (1)

561-565: Fix the documented default for capture_frequency.

The function default is 60, but the docstring still says default: 15, and the generated RST now shows both values.

Proposed fix
-            capture_frequency: How often to capture metrics in seconds (default: 15)
+            capture_frequency: How often to capture metrics in seconds (default: 60)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@projects/cluster/toolbox/cluster.py` around lines 561 - 565, Update the
docstring for the parameter capture_frequency to match the function default (60
seconds) — currently it incorrectly says "default: 15". Locate the docstring
block in projects/cluster/toolbox/cluster.py that documents capture_frequency
(the function where the parameter capture_frequency is defined) and change the
text "default: 15" to "default: 60" so generated RST and docs reflect the actual
default.
projects/cluster/toolbox/cluster_capture_servicemonitor_metrics/tasks/main.yml (1)

56-60: Secret existence check fails the play if secret is missing.

If auth_secret_name is set but the secret doesn't exist in the namespace, oc get secret returns non-zero and the play fails. If this is a valid scenario (e.g., optional auth), add error handling:

Proposed fix to handle missing secrets gracefully
 - name: Check if auth secret exists
   shell: |
     oc get secret {{ auth_secret_name }} -n {{ target_namespace }} --no-headers -o name
   register: auth_secret_exists
+  failed_when: false
   when: auth_secret_name != ""
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@projects/cluster/toolbox/cluster_capture_servicemonitor_metrics/tasks/main.yml`
around lines 56 - 60, The "Check if auth secret exists" shell task fails the
play when the secret is missing; change the task (named "Check if auth secret
exists") to treat a non-zero oc exit code as non-fatal by adding failed_when:
false (or ignore_errors: true) so the play continues, keep register:
auth_secret_exists, and then use auth_secret_exists.rc or
auth_secret_exists.stdout to conditionally act later (e.g., only create
resources when auth_secret_exists.rc == 0) so missing optional secrets are
handled gracefully.
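The docstring/default drift flagged in the first nitpick is mechanically detectable. A minimal sketch, assuming the repo's convention of writing "(default: N)" in parameter docstrings — docstring_default_mismatches and the sample capture function are hypothetical, not part of the toolbox:

```python
import inspect
import re

def docstring_default_mismatches(func):
    # Return parameter names whose "(default: X)" docstring note
    # disagrees with the actual default in the function signature.
    doc = inspect.getdoc(func) or ""
    mismatches = []
    for name, p in inspect.signature(func).parameters.items():
        if p.default is inspect.Parameter.empty:
            continue
        m = re.search(rf"{re.escape(name)}:.*\(default:\s*([^)]+)\)", doc)
        if m and m.group(1).strip() != str(p.default):
            mismatches.append(name)
    return mismatches

def capture(capture_frequency=60):
    """
    capture_frequency: How often to capture metrics in seconds (default: 15)
    """
```

Run against the sample function above, the checker reports capture_frequency, which is exactly the mismatch the review asks to fix in cluster.py.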
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In
`@projects/cluster/toolbox/cluster_capture_servicemonitor_metrics/defaults/main/config.yml`:
- Around line 13-14: The inline comment for
cluster_capture_servicemonitor_metrics_capture_frequency says "(default: 15)"
but the variable is set to 60; make them consistent by updating the comment to
"(default: 60)" to match
cluster_capture_servicemonitor_metrics_capture_frequency, or if you intended the
default to be 15, change the variable value to 15 instead—adjust the comment and
value together so cluster_capture_servicemonitor_metrics_capture_frequency and
its descriptive comment match.

In
`@projects/cluster/toolbox/cluster_capture_servicemonitor_metrics/tasks/main.yml`:
- Around line 20-27: The "Check if pod exists for finalization" task should not
fail the play when the pod is absent: add failed_when: false (and optionally
changed_when: false) to that task so oc get returns won't abort the run, and
then guard the "Capture pod logs" task with when: pod_exists_check.rc == 0 (or
when: pod_exists_check.stdout != '' ) so logs are only collected if the pod
actually exists; reference the task names "Check if pod exists for finalization"
and "Capture pod logs" and the registered variable pod_exists_check when making
these changes.
- Around line 17-35: The "Delete metrics capture pod" task currently runs
unconditionally; ensure it only runs in finalize mode by either moving the task
into the existing finalize block (between the "Capture pod logs" task and the
meta: end_play) or by adding the when:
cluster_capture_servicemonitor_metrics_finalize guard to the "Delete metrics
capture pod" task; reference the task name "Delete metrics capture pod", the
condition variable cluster_capture_servicemonitor_metrics_finalize, and the
finalize block (meta: end_play) when making the change so the delete only
executes during finalization.

In
`@projects/cluster/toolbox/cluster_capture_servicemonitor_metrics/tasks/podmonitor.yml`:
- Around line 12-23: The tasks "Capture pod targets by PodMonitor selector",
"Capture pod targets by PodMonitor selector YAML" and "Get all target pod IPs
and names" currently use a hardcoded component list; change them to derive the
Pod label selector from the actual PodMonitor resource instead. Query the
PodMonitor in the target_namespace (e.g. oc get podmonitor <name> -n {{
target_namespace }} -o yaml) and extract .spec.selector.matchLabels (or convert
matchExpressions to label selector) into a variable, then use that label
selector in the oc get pod -l "<derived-selector>" commands (and in the -oyaml
and custom-columns calls) instead of the hardcoded app.kubernetes.io/component
list; ensure you reference the same selector encoding used by existing ISVC
lookups (app.kubernetes.io/name when applicable) and add a safe fallback to the
previous selector only if the PodMonitor has no selector.

In
`@projects/cluster/toolbox/cluster_capture_servicemonitor_metrics/tasks/servicemonitor.yml`:
- Around line 12-24: The tasks currently hardcode the scheduler labels and use
head -1 which can pick an arbitrary Service; instead, read the ServiceMonitor
named by the incoming monitor variable (the ServiceMonitor resource referenced
by the task that dumps YAML/status) and extract its spec.selector.matchLabels,
then use those labels to query Services to determine the target; replace the
hardcoded label selector in the "Capture service target by ServiceMonitor
selector" and "Get target service name" tasks with a step that fetches the
ServiceMonitor, builds the label selector string from spec.selector.matchLabels,
uses that selector for oc get service, and change the "Get target service name"
registration (target_service_name) to fail if zero matches or to
deterministically select a single match (e.g., error on multiple matches or pick
the service with a specific annotation/port) instead of using head -1.

In
`@projects/cluster/toolbox/cluster_capture_servicemonitor_metrics/templates/metrics_capture_pod.yaml.j2`:
- Around line 61-67: The template currently hardcodes reading
/var/run/secrets/auth/token and mounts the whole Secret; update the flow to
extract authorization.credentials.key from the ServiceMonitor/PodMonitor and
pass it into the template (e.g. as auth_secret_key), change the Secret volume
mount in the Pod template to use items and mount only that key (so the secret
key filename equals the passed-in key name), and update the runtime logic (the
fetch_metrics invocation/if block using auth_secret_name and fetch_metrics) to
read the token from /var/run/secrets/auth/<auth_secret_key> (or use a variable
TOKEN_FILE constructed from auth_secret_key) before calling fetch_metrics so the
correct key is mounted and read when credentials.key is not "token".

In `@projects/cluster/toolbox/cluster.py`:
- Around line 553-566: The function capture_servicemonitor_metrics currently
takes a parameter named service_name but treats it as the monitor resource name;
either rename the parameter to monitor_name (update the signature, docstring,
and all call sites where capture_servicemonitor_metrics is invoked) to reflect
that it expects a ServiceMonitor/PodMonitor name, or add logic inside
capture_servicemonitor_metrics to accept an actual Service name: when the
provided name is not found as a ServiceMonitor/PodMonitor, query the Service (oc
get service <name>) and locate the matching ServiceMonitor/PodMonitor by
selector/labels and use its name; update the docstring to state the accepted
inputs and adjust callers accordingly (refer to capture_servicemonitor_metrics
and parameter service_name).

In `@projects/llm-d/testing/test_llmd.py`:
- Around line 1019-1024: The helper functions start_metrics_capture and
stop_metrics_capture currently return when
tests.llmd.inference_service.metrics.manual_capture is false, which
inadvertently disables the automatic toolbox capture under shipped configs;
change the guard so the functions return only when manual_capture is true (i.e.,
invert the boolean check), and apply the same fix to both start_metrics_capture
and stop_metrics_capture so the automatic capture runs unless manual_capture is
explicitly enabled.

In `@projects/llm-d/visualizations/llmd_inference/store/parsers.py`:
- Around line 34-35: The important-file globs use
artifact_dirnames.GUIDELLM_BENCHMARK_DIR (exact match) while
find_guidellm_benchmark_directories() discovers with a wildcard suffix; update
the two globs that build
f"{artifact_dirnames.GUIDELLM_BENCHMARK_DIR}/artifacts/results/benchmarks.json"
and
f"{artifact_dirnames.GUIDELLM_BENCHMARK_DIR}/artifacts/guidellm_benchmark_job.logs"
to align with discovery by appending a wildcard to the benchmark dir portion
(e.g., use GUIDELLM_BENCHMARK_DIR + "*" or equivalent) so multi-rate/suffixed
directories are matched and registered/cached consistently.
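The glob mismatch described above can be illustrated with fnmatch-style matching (the actual parser may use a different glob engine; "guidellm-benchmark" stands in for artifact_dirnames.GUIDELLM_BENCHMARK_DIR, whose real value is an assumption here):

```python
from fnmatch import fnmatch

# Assumed stand-in for artifact_dirnames.GUIDELLM_BENCHMARK_DIR.
BENCH_DIR = "guidellm-benchmark"

# Exact-prefix glob (current code) vs wildcard-suffixed glob
# (matching how find_guidellm_benchmark_directories() discovers dirs).
exact = f"{BENCH_DIR}/artifacts/results/benchmarks.json"
wild = f"{BENCH_DIR}*/artifacts/results/benchmarks.json"

# A multi-rate run lands in a suffixed directory: the exact prefix
# misses it, while the wildcard pattern still matches.
path = f"{BENCH_DIR}_rate-10/artifacts/results/benchmarks.json"
```

The same wildcard also matches the unsuffixed directory, so aligning the two globs with the discovery logic loses nothing.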

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 071699d5-aa7b-423d-bdef-219be891108d

📥 Commits

Reviewing files that changed from the base of the PR and between 559da40 and 3c414e9.

📒 Files selected for processing (11)
  • docs/toolbox.generated/Cluster.capture_servicemonitor_metrics.rst
  • docs/toolbox.generated/index.rst
  • projects/cluster/toolbox/cluster.py
  • projects/cluster/toolbox/cluster_capture_servicemonitor_metrics/defaults/main/config.yml
  • projects/cluster/toolbox/cluster_capture_servicemonitor_metrics/tasks/main.yml
  • projects/cluster/toolbox/cluster_capture_servicemonitor_metrics/tasks/podmonitor.yml
  • projects/cluster/toolbox/cluster_capture_servicemonitor_metrics/tasks/servicemonitor.yml
  • projects/cluster/toolbox/cluster_capture_servicemonitor_metrics/templates/metrics_capture_pod.yaml.j2
  • projects/llm-d/testing/config.yaml
  • projects/llm-d/testing/test_llmd.py
  • projects/llm-d/visualizations/llmd_inference/store/parsers.py

Comment on lines +13 to +14
# How often to capture metrics in seconds (default: 15)
cluster_capture_servicemonitor_metrics_capture_frequency: 60

⚠️ Potential issue | 🟡 Minor

Comment/default value mismatch.

The comment states (default: 15) but the actual default is 60. Update either the comment or the value to be consistent.

Proposed fix
-# How often to capture metrics in seconds (default: 15)
-cluster_capture_servicemonitor_metrics_capture_frequency: 60
+# How often to capture metrics in seconds (default: 60)
+cluster_capture_servicemonitor_metrics_capture_frequency: 60

Comment on lines +17 to +35

- name: "[Finalize mode] capture logs and delete existing pod"
  when: cluster_capture_servicemonitor_metrics_finalize
  block:
  - name: Check if pod exists for finalization
    shell: |
      oc get pod topsail-metrics-capture-{{ cluster_capture_servicemonitor_metrics_service_name }} -n {{ target_namespace }} --no-headers -o name
    register: pod_exists_check

  - name: Capture pod logs
    shell: |
      oc logs topsail-metrics-capture-{{ cluster_capture_servicemonitor_metrics_service_name }} -n {{ target_namespace }} > "{{ artifact_extra_logs_dir }}/artifacts/metrics_capture_logs.txt"

- name: Delete metrics capture pod
  shell: |
    oc delete pod topsail-metrics-capture-{{ cluster_capture_servicemonitor_metrics_service_name }} -n {{ target_namespace }} --grace-period=0 --ignore-not-found

- name: "[Finalize mode] capture logs and delete existing pod"
  when: cluster_capture_servicemonitor_metrics_finalize
  meta: end_play

⚠️ Potential issue | 🔴 Critical

Critical: Delete task runs unconditionally outside the finalize block.

The "Delete metrics capture pod" task (lines 29-31) is outside the block structure and has no when condition, so it executes in both finalize and normal modes. In normal deployment mode, this will delete any existing capture pod before the play continues to create a new one.

If this is intentional cleanup behavior, move it after the end_play guard so it's clear. If it should only run in finalize mode, move it inside the block.

Proposed fix (assuming delete should be finalize-only)
 - name: "[Finalize mode] capture logs and delete existing pod"
   when: cluster_capture_servicemonitor_metrics_finalize
   block:
   - name: Check if pod exists for finalization
     shell: |
       oc get pod topsail-metrics-capture-{{ cluster_capture_servicemonitor_metrics_service_name }} -n {{ target_namespace }} --no-headers -o name
     register: pod_exists_check
+    failed_when: false

   - name: Capture pod logs
     shell: |
       oc logs topsail-metrics-capture-{{ cluster_capture_servicemonitor_metrics_service_name }} -n {{ target_namespace }} > "{{ artifact_extra_logs_dir }}/artifacts/metrics_capture_logs.txt"
+    when: pod_exists_check.rc == 0

-- name: Delete metrics capture pod
-  shell: |
-    oc delete pod topsail-metrics-capture-{{ cluster_capture_servicemonitor_metrics_service_name }} -n {{ target_namespace }}  --grace-period=0 --ignore-not-found
+  - name: Delete metrics capture pod
+    shell: |
+      oc delete pod topsail-metrics-capture-{{ cluster_capture_servicemonitor_metrics_service_name }} -n {{ target_namespace }}  --grace-period=0 --ignore-not-found

 - name: "[Finalize mode] capture logs and delete existing pod"
   when: cluster_capture_servicemonitor_metrics_finalize
   meta: end_play
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@projects/cluster/toolbox/cluster_capture_servicemonitor_metrics/tasks/main.yml`
around lines 17 - 35, The "Delete metrics capture pod" task currently runs
unconditionally; ensure it only runs in finalize mode by either moving the task
into the existing finalize block (between the "Capture pod logs" task and the
meta: end_play) or by adding the when:
cluster_capture_servicemonitor_metrics_finalize guard to the "Delete metrics
capture pod" task; reference the task name "Delete metrics capture pod", the
condition variable cluster_capture_servicemonitor_metrics_finalize, and the
finalize block (meta: end_play) when making the change so the delete only
executes during finalization.

Comment on lines +20 to +27

- name: Check if pod exists for finalization
  shell: |
    oc get pod topsail-metrics-capture-{{ cluster_capture_servicemonitor_metrics_service_name }} -n {{ target_namespace }} --no-headers -o name
  register: pod_exists_check

- name: Capture pod logs
  shell: |
    oc logs topsail-metrics-capture-{{ cluster_capture_servicemonitor_metrics_service_name }} -n {{ target_namespace }} > "{{ artifact_extra_logs_dir }}/artifacts/metrics_capture_logs.txt"

⚠️ Potential issue | 🟠 Major

Missing error handling in finalize flow.

If the capture pod doesn't exist when finalize mode runs, oc get pod (line 22) fails with a non-zero exit code, causing the play to fail. The registered pod_exists_check variable is never used to conditionally skip the logs capture.

Consider adding failed_when: false to the check and using the result to guard subsequent tasks:

Proposed fix
   - name: Check if pod exists for finalization
     shell: |
       oc get pod topsail-metrics-capture-{{ cluster_capture_servicemonitor_metrics_service_name }} -n {{ target_namespace }} --no-headers -o name
     register: pod_exists_check
+    failed_when: false

   - name: Capture pod logs
     shell: |
       oc logs topsail-metrics-capture-{{ cluster_capture_servicemonitor_metrics_service_name }} -n {{ target_namespace }} > "{{ artifact_extra_logs_dir }}/artifacts/metrics_capture_logs.txt"
+    when: pod_exists_check.rc == 0

Comment on lines +12 to +23

- name: Capture pod targets by PodMonitor selector
  shell: |
    oc get pod -l "app.kubernetes.io/component in (llminferenceservice-workload,llminferenceservice-workload-prefill,llminferenceservice-workload-worker,llminferenceservice-workload-leader,llminferenceservice-workload-leader-prefill,llminferenceservice-workload-worker-prefill),app.kubernetes.io/part-of=llminferenceservice" -n {{ target_namespace }} > "{{ artifact_extra_logs_dir }}/artifacts/target_pods.status"

- name: Capture pod targets by PodMonitor selector YAML
  shell: |
    oc get pod -l "app.kubernetes.io/component in (llminferenceservice-workload,llminferenceservice-workload-prefill,llminferenceservice-workload-worker,llminferenceservice-workload-leader,llminferenceservice-workload-leader-prefill,llminferenceservice-workload-worker-prefill),app.kubernetes.io/part-of=llminferenceservice" -n {{ target_namespace }} -oyaml > "{{ artifact_extra_logs_dir }}/artifacts/target_pods.yaml"

- name: Get all target pod IPs and names
  shell: |
    oc get pod -l "app.kubernetes.io/component in (llminferenceservice-workload,llminferenceservice-workload-prefill,llminferenceservice-workload-worker,llminferenceservice-workload-leader,llminferenceservice-workload-leader-prefill,llminferenceservice-workload-worker-prefill),app.kubernetes.io/part-of=llminferenceservice" -n {{ target_namespace }} --no-headers -o custom-columns=":metadata.name,:status.podIP"
  register: target_pods_info

⚠️ Potential issue | 🟠 Major

Resolve pods from the PodMonitor you were asked to capture.

After fetching the named PodMonitor, the role ignores its selector and scans every llminferenceservice workload pod in the namespace via a hardcoded component list. That will mix captures across services when multiple LLMInferenceServices share a namespace, and it will break as soon as the PodMonitor selector changes. The existing ISVC state capture already keys pod lookup off app.kubernetes.io/name; this should derive its selector from the PodMonitor too.

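The selector derivation suggested in this comment can be sketched as a small helper that turns a PodMonitor spec's selector into an `oc get pod -l` string. This is a hypothetical sketch, not code from the PR; the spec dict shape follows the Prometheus Operator's matchLabels/matchExpressions convention:

```python
def selector_from_podmonitor(spec):
    # Build a label-selector string from a PodMonitor spec's selector,
    # covering matchLabels and the four matchExpressions operators,
    # instead of hardcoding a component list.
    sel = spec.get("selector", {})
    parts = [f"{k}={v}" for k, v in sorted(sel.get("matchLabels", {}).items())]
    for expr in sel.get("matchExpressions", []):
        key, op = expr["key"], expr["operator"]
        if op in ("In", "NotIn"):
            values = ",".join(expr["values"])
            parts.append(f"{key} {'in' if op == 'In' else 'notin'} ({values})")
        elif op == "Exists":
            parts.append(key)
        elif op == "DoesNotExist":
            parts.append(f"!{key}")
    return ",".join(parts)
```

An empty result would signal the "PodMonitor has no selector" case, where the review suggests falling back to the previous hardcoded selector.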

Comment on lines +12 to +24

- name: Capture service target by ServiceMonitor selector
  shell: |
    oc get service -l "app.kubernetes.io/component=llminferenceservice-router-scheduler,app.kubernetes.io/part-of=llminferenceservice" -n {{ target_namespace }} > "{{ artifact_extra_logs_dir }}/artifacts/target_service.status"

- name: Capture service target by ServiceMonitor selector YAML
  shell: |
    oc get service -l "app.kubernetes.io/component=llminferenceservice-router-scheduler,app.kubernetes.io/part-of=llminferenceservice" -n {{ target_namespace }} -oyaml > "{{ artifact_extra_logs_dir }}/artifacts/target_service.yaml"

- name: Get target service name
  shell: |
    set -o pipefail;
    oc get service -l "app.kubernetes.io/component=llminferenceservice-router-scheduler,app.kubernetes.io/part-of=llminferenceservice" -n {{ target_namespace }} --no-headers -o custom-columns=":metadata.name" | head -1
  register: target_service_name

⚠️ Potential issue | 🟠 Major

Derive the target Service from the ServiceMonitor instead of hardcoding the scheduler labels.

These tasks only use the requested monitor name for the YAML/status dump; the actual lookup is always pinned to llminferenceservice-router-scheduler. If this command is pointed at any other ServiceMonitor, or if that monitor’s selector changes, target_service_name and the generated metrics URL will be wrong. head -1 also makes the selected Service arbitrary when more than one matches.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@projects/cluster/toolbox/cluster_capture_servicemonitor_metrics/tasks/servicemonitor.yml`
around lines 12 - 24, The tasks currently hardcode the scheduler labels and use
head -1 which can pick an arbitrary Service; instead, read the ServiceMonitor
named by the incoming monitor variable (the ServiceMonitor resource referenced
by the task that dumps YAML/status) and extract its spec.selector.matchLabels,
then use those labels to query Services to determine the target; replace the
hardcoded label selector in the "Capture service target by ServiceMonitor
selector" and "Get target service name" tasks with a step that fetches the
ServiceMonitor, builds the label selector string from spec.selector.matchLabels,
uses that selector for oc get service, and change the "Get target service name"
registration (target_service_name) to fail if zero matches or to
deterministically select a single match (e.g., error on multiple matches or pick
the service with a specific annotation/port) instead of using head -1.
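The deterministic selection suggested above could look like this — a sketch (pick_target_service is a hypothetical helper operating on the `--no-headers -o custom-columns` output) that replaces `head -1` with an explicit zero/many check:

```python
def pick_target_service(names_output: str) -> str:
    """Select the single Service matched by the ServiceMonitor selector.

    Fails loudly on zero or multiple matches instead of silently
    taking the first line like `head -1`."""
    names = [line.strip() for line in names_output.splitlines() if line.strip()]
    if not names:
        raise RuntimeError("no Service matches the ServiceMonitor selector")
    if len(names) > 1:
        raise RuntimeError("ambiguous selector, matched: " + ", ".join(names))
    return names[0]
```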

Comment on lines +61 to +67
{% if auth_secret_name %}
if [ -f /var/run/secrets/auth/token ]; then
TOKEN=$(cat /var/run/secrets/auth/token)
fetch_metrics "$url" "$TOKEN"
else
echo "Auth token file not found"
fi

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Scripts executed against openshift-psap/topsail:

find projects/cluster/toolbox/cluster_capture_servicemonitor_metrics -name "*.yaml.j2" -o -name "*.yaml"
cat projects/cluster/toolbox/cluster_capture_servicemonitor_metrics/templates/metrics_capture_pod.yaml.j2
rg "auth_secret" projects/cluster/toolbox/cluster_capture_servicemonitor_metrics/ -A 2 -B 2
find projects/cluster/toolbox/cluster_capture_servicemonitor_metrics -name "*.yml" -o -name "*.yaml" | xargs rg "auth_secret" -l
cat projects/cluster/toolbox/cluster_capture_servicemonitor_metrics/defaults/main.yml
cat projects/cluster/toolbox/cluster_capture_servicemonitor_metrics/tasks/servicemonitor.yml
cat projects/cluster/toolbox/cluster_capture_servicemonitor_metrics/tasks/podmonitor.yml
cat projects/cluster/toolbox/cluster_capture_servicemonitor_metrics/tasks/main.yml | head -80


Thread the secret key name through the role and mount only the selected key.

The template mounts the entire Secret without specifying which key to use (lines 83–94), then hardcodes reading /var/run/secrets/auth/token (lines 61–67). When a monitor's authorization.credentials points to a different key name, this path won't exist and authentication silently fails.

Extract authorization.credentials.key from the ServiceMonitor/PodMonitor alongside the secret name, pass it through to the template, mount only that key using the items field, and read the correct filename.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@projects/cluster/toolbox/cluster_capture_servicemonitor_metrics/templates/metrics_capture_pod.yaml.j2`
around lines 61 - 67, The template currently hardcodes reading
/var/run/secrets/auth/token and mounts the whole Secret; update the flow to
extract authorization.credentials.key from the ServiceMonitor/PodMonitor and
pass it into the template (e.g. as auth_secret_key), change the Secret volume
mount in the Pod template to use items and mount only that key (so the secret
key filename equals the passed-in key name), and update the runtime logic (the
fetch_metrics invocation/if block using auth_secret_name and fetch_metrics) to
read the token from /var/run/secrets/auth/<auth_secret_key> (or use a variable
TOKEN_FILE constructed from auth_secret_key) before calling fetch_metrics so the
correct key is mounted and read when credentials.key is not "token".

Comment on lines +553 to +566
def capture_servicemonitor_metrics(self, service_name, namespace="", capture_frequency=60, is_podmonitor=False, finalize=False):
"""
Captures ServiceMonitor or PodMonitor YAML and status for a given service

Captures the ServiceMonitor/PodMonitor configuration and status information for
a specific service in a namespace, including related service/pod and
endpoints information for troubleshooting monitoring setup.

Args:
service_name: Name of the service to capture ServiceMonitor/PodMonitor metrics for
namespace: Namespace where the service and ServiceMonitor/PodMonitor are located (empty string auto-detects current namespace)
capture_frequency: How often to capture metrics in seconds (default: 60)
is_podmonitor: Whether to use PodMonitor instead of ServiceMonitor (default: False)
finalize: Whether to finalize (capture logs and delete) an existing pod instead of creating new one (default: False)

🛠️ Refactor suggestion | 🟠 Major

Rename service_name or make the role accept actual Services.

The public command and generated docs say this parameter is a Service name, but the role uses it as oc get servicemonitor|podmonitor <name>. Every new caller in this PR therefore passes a monitor resource name instead. As-is, a CLI user following the docs will hit NotFound unless the monitor happens to share the Service name.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@projects/cluster/toolbox/cluster.py` around lines 553 - 566, The function
capture_servicemonitor_metrics currently takes a parameter named service_name
but treats it as the monitor resource name; either rename the parameter to
monitor_name (update the signature, docstring, and all call sites where
capture_servicemonitor_metrics is invoked) to reflect that it expects a
ServiceMonitor/PodMonitor name, or add logic inside
capture_servicemonitor_metrics to accept an actual Service name: when the
provided name is not found as a ServiceMonitor/PodMonitor, query the Service (oc
get service <name>) and locate the matching ServiceMonitor/PodMonitor by
selector/labels and use its name; update the docstring to state the accepted
inputs and adjust callers accordingly (refer to capture_servicemonitor_metrics
and parameter service_name).
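The fallback option in the prompt above could be sketched as follows (resolve_monitor and the dict shapes are illustrative — in practice the labels would come from `oc get` output): try the name as a monitor first, then fall back to matching each monitor's selector against the named Service's labels.

```python
def resolve_monitor(name: str, monitors: dict, services: dict) -> str:
    """Resolve `name` to a ServiceMonitor/PodMonitor name.

    monitors: monitor name -> spec.selector.matchLabels dict
    services: Service name -> metadata.labels dict
    """
    if name in monitors:
        return name
    labels = services.get(name)
    if labels is None:
        raise KeyError(f"{name} is neither a monitor nor a Service")
    for monitor, selector in monitors.items():
        # a monitor selects the Service when its matchLabels are a
        # subset of the Service's labels
        if all(labels.get(k) == v for k, v in selector.items()):
            return monitor
    raise KeyError(f"no monitor selects Service {name}")
```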

Comment on lines +1019 to +1024
def start_metrics_capture(flavor):
"""
Starts metrics capture for both ServiceMonitor and PodMonitor if enabled
"""
if not config.project.get_config("tests.llmd.inference_service.metrics.manual_capture"):
return

⚠️ Potential issue | 🔴 Critical

manual_capture currently disables the new automatic capture path.

Both helpers return when tests.llmd.inference_service.metrics.manual_capture is false, but this PR sets that flag to false in the base config and in the cks preset. So start_metrics_capture() / stop_metrics_capture() never invoke the new toolbox command under the shipped configs.

Proposed fix
 def start_metrics_capture(flavor):
@@
-    if not config.project.get_config("tests.llmd.inference_service.metrics.manual_capture"):
+    if config.project.get_config("tests.llmd.inference_service.metrics.manual_capture"):
         return
@@
 def stop_metrics_capture(flavor):
@@
-    if not config.project.get_config("tests.llmd.inference_service.metrics.manual_capture"):
+    if config.project.get_config("tests.llmd.inference_service.metrics.manual_capture"):
         return

Also applies to: 1057-1062

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@projects/llm-d/testing/test_llmd.py` around lines 1019 - 1024, The helper
functions start_metrics_capture and stop_metrics_capture currently return when
tests.llmd.inference_service.metrics.manual_capture is false, which
inadvertently disables the automatic toolbox capture under shipped configs;
change the guard so the functions return only when manual_capture is true (i.e.,
invert the boolean check), and apply the same fix to both start_metrics_capture
and stop_metrics_capture so the automatic capture runs unless manual_capture is
explicitly enabled.

Comment on lines +34 to +35
f"{artifact_dirnames.GUIDELLM_BENCHMARK_DIR}/artifacts/results/benchmarks.json",
f"{artifact_dirnames.GUIDELLM_BENCHMARK_DIR}/artifacts/guidellm_benchmark_job.logs",

⚠️ Potential issue | 🟠 Major

Align GuideLLM important-file glob with multi-rate directory discovery.

Line 34 and Line 35 use artifact_dirnames.GUIDELLM_BENCHMARK_DIR (*__llmd__run_guidellm_benchmark) but discovery in find_guidellm_benchmark_directories() uses *__llmd__run_guidellm_benchmark*.
This mismatch can miss registration/caching for suffixed multi-rate directories.

Proposed fix
-artifact_dirnames.GUIDELLM_BENCHMARK_DIR = "*__llmd__run_guidellm_benchmark"
+artifact_dirnames.GUIDELLM_BENCHMARK_DIR = "*__llmd__run_guidellm_benchmark*"
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@projects/llm-d/visualizations/llmd_inference/store/parsers.py` around lines
34 - 35, The important-file globs use artifact_dirnames.GUIDELLM_BENCHMARK_DIR
(exact match) while find_guidellm_benchmark_directories() discovers with a
wildcard suffix; update the two globs that build
f"{artifact_dirnames.GUIDELLM_BENCHMARK_DIR}/artifacts/results/benchmarks.json"
and
f"{artifact_dirnames.GUIDELLM_BENCHMARK_DIR}/artifacts/guidellm_benchmark_job.logs"
to align with discovery by appending a wildcard to the benchmark dir portion
(e.g., use GUIDELLM_BENCHMARK_DIR + "*" or equivalent) so multi-rate/suffixed
directories are matched and registered/cached consistently.

@kpouget kpouget closed this Mar 29, 2026