
[llm-d] Keep working#914

Closed
kpouget wants to merge 5 commits into openshift-psap:main from kpouget:download-model2

Conversation

@kpouget
Contributor

@kpouget kpouget commented Mar 28, 2026

Summary by CodeRabbit

  • New Features
    • Automated metrics capture for Kubernetes ServiceMonitor/PodMonitor with configurable polling and finalize options.
  • Tests
    • Tests now start/stop per-flavor metrics capture automatically and tolerate shutdown errors.
  • Documentation
    • Added user docs and index entry for the new metrics-capture tooling.
  • Chores
    • Updated default GuideLLM benchmark image/version and added benchmark artifact tracking for results/logs.

@coderabbitai

coderabbitai bot commented Mar 28, 2026

📝 Walkthrough


Adds a new Cluster.capture_servicemonitor_metrics method and its Ansible role, which capture ServiceMonitor/PodMonitor YAML, status, and periodic metrics via a templated pod; also updates the docs, wires the test harness to start/stop capture, adds benchmark artifact paths to the visualization parser, and bumps the default benchmark image/version.

Changes

  • Documentation: docs/toolbox.generated/Cluster.capture_servicemonitor_metrics.rst, docs/toolbox.generated/index.rst
    New auto-generated RST page and index entry documenting the Cluster.capture_servicemonitor_metrics API and parameters.
  • Cluster toolbox API: projects/cluster/toolbox/cluster.py
    Added the public method capture_servicemonitor_metrics(...), decorated with @AnsibleRole/@AnsibleMappedParams, which invokes the new Ansible role.
  • Ansible role defaults: projects/cluster/toolbox/cluster_capture_servicemonitor_metrics/defaults/main/config.yml
    Added defaults for namespace, capture_frequency, is_podmonitor, finalize, and ansible_os_family.
  • Ansible role tasks (core): projects/cluster/toolbox/cluster_capture_servicemonitor_metrics/tasks/main.yml
    Main role workflow: namespace resolution, artifacts dir, finalize flow (capture logs + delete pod), include of servicemonitor/podmonitor tasks, auth handling, render/create pod, wait for Ready.
  • Ansible role tasks (monitor-specific): projects/cluster/toolbox/cluster_capture_servicemonitor_metrics/tasks/servicemonitor.yml, projects/cluster/toolbox/cluster_capture_servicemonitor_metrics/tasks/podmonitor.yml
    ServiceMonitor and PodMonitor capture logic: fetch YAML/status, determine endpoints/ports/scheme, build metrics URLs, extract the auth secret name, write artifact metadata.
  • Pod template: projects/cluster/toolbox/cluster_capture_servicemonitor_metrics/templates/metrics_capture_pod.yaml.j2
    Jinja2 pod manifest that loops over METRICS_URLS with curl (optionally using a mounted Bearer token), filters the responses, and sleeps capture_frequency seconds between rounds.
  • Test integration: projects/llm-d/testing/config.yaml, projects/llm-d/testing/test_llmd.py
    Disabled manual capture in the preset, added benchmark image/version overrides, added start_metrics_capture(flavor) and stop_metrics_capture(flavor), and wired capture start/stop around deployments.
  • Benchmark defaults & docs: projects/llm-d/toolbox/llmd.py, projects/llm-d/toolbox/llmd_run_guidellm_benchmark/defaults/main/config.yml, docs/toolbox.generated/Llmd.run_guidellm_benchmark.rst
    Updated the default GuideLLM image and version (image → ghcr.io/albertoperdomo2/guidellm, version → nightly) in code, role defaults, and generated docs.
  • Visualization parser: projects/llm-d/visualizations/llmd_inference/store/parsers.py
    Added GuideLLM benchmark artifact paths (artifacts/results/benchmarks.json, artifacts/guidellm_benchmark_job.logs) to IMPORTANT_FILES.
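The capture loop that the pod template implements (shell inside a Jinja2 manifest, per the summary above) can be sketched in Python for illustration. This is a hedged sketch only: the helper names `capture_loop` and the `rounds` parameter are hypothetical, and only `fetch_metrics`, the Bearer-token header, and the sleep between rounds reflect behavior described in this PR.

```python
import time
import urllib.request

def fetch_metrics(url, token=None):
    """Fetch one metrics endpoint, optionally with a Bearer token
    (mirrors the template's curl + Authorization header)."""
    req = urllib.request.Request(url)
    if token:
        req.add_header("Authorization", f"Bearer {token}")
    with urllib.request.urlopen(req) as resp:
        return resp.read().decode()

def capture_loop(urls, frequency=60, token=None, rounds=None, fetch=fetch_metrics):
    """Scrape every URL, sleep `frequency` seconds, repeat.
    rounds=None loops forever like the real pod; a finite value
    lets the sketch be exercised in a test."""
    captured = []
    done = 0
    while rounds is None or done < rounds:
        for url in urls:
            captured.append((url, fetch(url, token)))
        done += 1
        if rounds is None or done < rounds:
            time.sleep(frequency)
    return captured
```

The injectable `fetch` argument is only there so the loop can be exercised without a live endpoint; the real pod simply shells out to curl.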

Sequence Diagram(s)

sequenceDiagram
    participant Test as Test Code
    participant Cluster as Cluster.capture_servicemonitor_metrics
    participant Ansible as Ansible Role
    participant K8s as Kubernetes Cluster
    participant Pod as Metrics Capture Pod
    participant Endpoint as Metrics Endpoint

    Test->>Cluster: capture_servicemonitor_metrics(service_name, namespace, ...)
    Cluster->>Ansible: RunAnsibleRole(locals())

    rect rgba(100,150,200,0.5)
    Ansible->>K8s: Resolve namespace (oc project -q if empty)
    Ansible->>K8s: Query ServiceMonitor/PodMonitor YAML & status
    Ansible->>K8s: Query target Service/Pods, extract ports/scheme
    end

    rect rgba(150,200,100,0.5)
    Ansible->>K8s: Check for auth secret
    Ansible->>Ansible: Render metrics_capture_pod.yaml (with auth mount if present)
    Ansible->>K8s: Create pod, wait for Ready
    K8s->>Pod: Start pod with METRICS_URLS env
    Pod->>Endpoint: Periodic curl (+Bearer token if mounted)
    Endpoint-->>Pod: Metrics response
    Pod->>K8s: Write metrics to artifact logs
    end

    Test->>Cluster: capture_servicemonitor_metrics(..., finalize=True)
    Cluster->>Ansible: RunAnsibleRole(locals())
    rect rgba(200,150,100,0.5)
    Ansible->>K8s: Find existing capture pod
    Ansible->>K8s: Dump pod logs to metrics_capture_logs.txt
    Ansible->>K8s: Delete pod (grace 0)
    end
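The first two arrows of the diagram (the toolbox method handing locals() to the Ansible role) follow a pattern that can be sketched as below. The @AnsibleRole/@AnsibleMappedParams decorators and RunAnsibleRole are TOPSAIL internals; the stand-ins here only illustrate the pattern, and the method signature is an assumption based on the parameters named in this PR, not the actual code.

```python
# Illustrative stand-ins for TOPSAIL's decorators; the real ones differ.
def AnsibleRole(role_name):
    def wrap(fn):
        fn.ansible_role = role_name  # record which role the method drives
        return fn
    return wrap

def AnsibleMappedParams(fn):
    fn.ansible_mapped_params = True  # marker: map Python params to role vars
    return fn

class RunAnsibleRole:
    """Stand-in that records the variables that would be passed to the role."""
    def __init__(self, variables):
        self.variables = {k: v for k, v in variables.items() if k != "self"}

class Cluster:
    @AnsibleRole("cluster_capture_servicemonitor_metrics")
    @AnsibleMappedParams
    def capture_servicemonitor_metrics(self, service_name, namespace="",
                                       capture_frequency=60, is_podmonitor=False,
                                       finalize=False):
        """Start (or, with finalize=True, tear down) the metrics-capture pod."""
        return RunAnsibleRole(locals())
```

Calling the method once starts capture and calling it again with finalize=True stops it, matching the two invocations in the sequence diagram.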

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

  • [llm-d] Keep working #907: Modifies llm-d testing/toolbox code (test_llmd, config) and benchmark configuration—overlaps metrics capture and defaults changes.
  • [llm-d] Keep working #910: Changes Llmd.run_guidellm_benchmark API/defaults (adds parameters) and touches the same toolbox function updated here.
  • [llm-d] Keep working #912: Updates llm-d visualization parsers—related to the IMPORTANT_FILES additions in this PR.

Suggested labels

lgtm

Poem

🐰 I hopped into logs at the break of day,
Captured metrics the curious way,
ServiceMonitors sang, pods hummed along,
Artifacts saved—oh what a song!
Hop, fetch, and stash — a rabbit's dev ballet.

🚥 Pre-merge checks | ✅ 2 passed | ❌ 1 inconclusive

❌ Failed checks (1 inconclusive)

• Title check: ❓ Inconclusive. The title '[llm-d] Keep working' is vague and generic, and does not convey the changeset's actual purpose or scope. Resolution: replace the generic title with a specific, descriptive summary of the main changes (e.g., 'Add ServiceMonitor/PodMonitor metrics capture and update GuideLLM defaults').
✅ Passed checks (2 passed)
• Description check: ✅ Passed. Check skipped because CodeRabbit's high-level summary is enabled.
• Docstring coverage: ✅ Passed. Docstring coverage is 100.00%, above the required threshold of 80.00%.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

Comment @coderabbitai help to get the list of available commands and usage tips.

@openshift-ci

openshift-ci bot commented Mar 28, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign kpouget for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@kpouget
Contributor Author

kpouget commented Mar 28, 2026

/test jump-ci llm-d cks multi-flavor guidellm_heterogeneous_eval
/cluster cks
/only test_ci
/var tests.llmd.inference_service.metrics.manual_capture: false


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 7

🧹 Nitpick comments (3)
projects/cluster/toolbox/cluster_capture_servicemonitor_metrics/tasks/servicemonitor.yml (2)

45-47: Potential empty port value when fallback task is skipped.

If named_port_result.stdout is an empty string and the first_port_result task was skipped (line 43 condition not met because named_port_result.stdout != ""), the first_port_result.stdout may be undefined, causing final_port to resolve incorrectly.

However, reviewing the logic: if named_port_result.stdout == "", line 43's when condition is true, so first_port_result will run. The current logic appears correct, but the complex conditional on line 47 could be simplified for clarity.

Suggested simplification
 - name: Set final port number
   set_fact:
-    final_port: "{{ named_port_result.stdout if named_port_result.stdout != '' else first_port_result.stdout | default('9090') }}"
+    final_port: "{{ named_port_result.stdout | default(first_port_result.stdout | default('9090'), true) }}"

Using Jinja2's default filter with true as the second argument handles both undefined and empty string cases.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@projects/cluster/toolbox/cluster_capture_servicemonitor_metrics/tasks/servicemonitor.yml`
around lines 45 - 47, The current set_fact for final_port uses a complex
conditional that can be simplified and made safe for undefined or empty values:
update the final_port assignment (set_fact final_port) to prefer
named_port_result.stdout, then fall back to first_port_result.stdout, and
finally to '9090' using Jinja2's default filter with the second argument true to
treat empty strings as undefined (e.g. use "{{ named_port_result.stdout |
default(first_port_result.stdout | default('9090', true), true) }}"),
referencing the final_port variable and the named_port_result and
first_port_result task results.

12-24: Hardcoded label selectors reduce reusability.

The label selectors app.kubernetes.io/component=llminferenceservice-router-scheduler,app.kubernetes.io/part-of=llminferenceservice are hardcoded for a specific use case. Consider parameterizing these if ServiceMonitor capture needs to support other service types in the future.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@projects/cluster/toolbox/cluster_capture_servicemonitor_metrics/tasks/servicemonitor.yml`
around lines 12 - 24, The three tasks ("Capture service target by ServiceMonitor
selector", "Capture service target by ServiceMonitor selector YAML", and "Get
target service name") use a hardcoded label selector string; parameterize that
selector (e.g., introduce a variable like service_label_selector or
serviceMonitor_label_selector with a sensible default) and replace the literal
selector in each oc command with the new variable, ensuring the variable is
documented/defaulted so other service types can be targeted without editing the
task bodies; keep existing variables target_namespace, artifact_extra_logs_dir
and the register target_service_name unchanged.
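A sketch of what that parameterization could look like, following the role's apparent `cluster_capture_servicemonitor_metrics_*` naming convention (the variable name below is hypothetical, and the default simply preserves the current selector):

```yaml
# defaults/main/config.yml (sketch): hypothetical variable, defaulting
# to the selector currently hardcoded in the tasks
cluster_capture_servicemonitor_metrics_label_selector: >-
  app.kubernetes.io/component=llminferenceservice-router-scheduler,app.kubernetes.io/part-of=llminferenceservice
```

and in the task, the literal selector becomes a variable reference:

```yaml
- name: Get target service name
  shell: |
    oc get svc -n {{ target_namespace }} \
      -l "{{ cluster_capture_servicemonitor_metrics_label_selector }}" \
      -o jsonpath='{.items[0].metadata.name}'
  register: target_service_name
```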
projects/cluster/toolbox/cluster_capture_servicemonitor_metrics/tasks/podmonitor.yml (1)

12-23: Hardcoded label selectors match servicemonitor.yml pattern.

Similar to servicemonitor.yml, these label selectors are hardcoded for the llminferenceservice workload types. This is consistent but limits reusability for other PodMonitor scenarios.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@projects/cluster/toolbox/cluster_capture_servicemonitor_metrics/tasks/podmonitor.yml`
around lines 12 - 23, The three tasks named "Capture pod targets by PodMonitor
selector", "Capture pod targets by PodMonitor selector YAML" and "Get all target
pod IPs and names" use a hardcoded label selector string for
llminferenceservice; replace that hardcoded selector with a reusable variable
(e.g., podmonitor_label_selector or podmonitor_labels) and reference the
variable in each oc get command, and provide a sensible default value matching
the current selector so existing behavior is preserved; update any
documentation/vars file where defaults live and ensure the variable is used
consistently across those tasks.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: c84a89bf-7c83-4b96-9888-094f17d0f6b5

📥 Commits

Reviewing files that changed from the base of the PR and between 559da40 and 3c414e9.

📒 Files selected for processing (11)
  • docs/toolbox.generated/Cluster.capture_servicemonitor_metrics.rst
  • docs/toolbox.generated/index.rst
  • projects/cluster/toolbox/cluster.py
  • projects/cluster/toolbox/cluster_capture_servicemonitor_metrics/defaults/main/config.yml
  • projects/cluster/toolbox/cluster_capture_servicemonitor_metrics/tasks/main.yml
  • projects/cluster/toolbox/cluster_capture_servicemonitor_metrics/tasks/podmonitor.yml
  • projects/cluster/toolbox/cluster_capture_servicemonitor_metrics/tasks/servicemonitor.yml
  • projects/cluster/toolbox/cluster_capture_servicemonitor_metrics/templates/metrics_capture_pod.yaml.j2
  • projects/llm-d/testing/config.yaml
  • projects/llm-d/testing/test_llmd.py
  • projects/llm-d/visualizations/llmd_inference/store/parsers.py

Comment on lines +33 to +37
``capture_frequency``

* How often to capture metrics in seconds (default: 15)

* default value: ``60``

⚠️ Potential issue | 🟡 Minor

Same documentation mismatch: "default: 15" but actual value is 60.

This inconsistency originates from the docstring in projects/cluster/toolbox/cluster.py line 564. Fix the source docstring to correct both this RST file and the Ansible defaults file upon regeneration.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/toolbox.generated/Cluster.capture_servicemonitor_metrics.rst` around
lines 33 - 37, The docstring for the capture_frequency parameter is incorrect —
update the docstring for capture_frequency inside the
capture_servicemonitor_metrics method (in projects/cluster/toolbox/cluster.py)
to state the correct default value of 60 seconds (replace "default: 15" with
"default: 60") and ensure the inline description reflects "How often to capture
metrics in seconds (default: 60)"; after that regenerate the docs so
docs/toolbox.generated/Cluster.capture_servicemonitor_metrics.rst and the
Ansible defaults output reflect the corrected default.

Comment on lines +13 to +14
# How often to capture metrics in seconds (default: 15)
cluster_capture_servicemonitor_metrics_capture_frequency: 60

⚠️ Potential issue | 🟡 Minor

Documentation mismatch: Comment says "default: 15" but actual default is 60.

The comment on line 13 states (default: 15) but the actual default value on line 14 is 60. This inconsistency is propagated from the source docstring in projects/cluster/toolbox/cluster.py line 564.

Proposed fix
-# How often to capture metrics in seconds (default: 15)
+# How often to capture metrics in seconds (default: 60)
 cluster_capture_servicemonitor_metrics_capture_frequency: 60
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@projects/cluster/toolbox/cluster_capture_servicemonitor_metrics/defaults/main/config.yml`
around lines 13 - 14, The inline comment for
cluster_capture_servicemonitor_metrics_capture_frequency says "(default: 15)"
but the actual default is 60; update the comment here and the corresponding
docstring in cluster.py (the docstring/source for
cluster_capture_servicemonitor_metrics_capture_frequency around the docstring at
the earlier definition) so both state "(default: 60)" to keep docs and code
consistent.

Comment on lines +20 to +27
- name: Check if pod exists for finalization
shell: |
oc get pod topsail-metrics-capture-{{ cluster_capture_servicemonitor_metrics_service_name }} -n {{ target_namespace }} --no-headers -o name
register: pod_exists_check

- name: Capture pod logs
shell: |
oc logs topsail-metrics-capture-{{ cluster_capture_servicemonitor_metrics_service_name }} -n {{ target_namespace }} > "{{ artifact_extra_logs_dir }}/artifacts/metrics_capture_logs.txt"

⚠️ Potential issue | 🟠 Major

Make finalize mode idempotent when the capture pod never existed.

stop_metrics_capture() runs from a finally: block in projects/llm-d/testing/test_llmd.py, so this role also executes after early startup failures. Right now oc get pod ... and oc logs ... fail hard when the pod is absent, which turns best-effort teardown into a secondary exception. Make the probe non-fatal and only fetch logs when the pod is there.

Suggested fix
   - name: Check if pod exists for finalization
     shell: |
       oc get pod topsail-metrics-capture-{{ cluster_capture_servicemonitor_metrics_service_name }} -n {{ target_namespace }} --no-headers -o name
     register: pod_exists_check
+    failed_when: false
+    changed_when: false

   - name: Capture pod logs
     shell: |
       oc logs topsail-metrics-capture-{{ cluster_capture_servicemonitor_metrics_service_name }} -n {{ target_namespace }} > "{{ artifact_extra_logs_dir }}/artifacts/metrics_capture_logs.txt"
+    when: pod_exists_check.rc == 0
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@projects/cluster/toolbox/cluster_capture_servicemonitor_metrics/tasks/main.yml`
around lines 20 - 27, The finalize probe currently fails if the capture pod
doesn't exist; make the pod-existence check non-fatal and only run the log
capture when the pod is present by adding failed_when: false (and optionally
changed_when: false) to the "Check if pod exists for finalization" task that
registers pod_exists_check, and guard the "Capture pod logs" task with a when
that checks pod_exists_check (e.g., pod_exists_check.stdout is defined and not
empty or pod_exists_check.rc == 0). Reference the task names "Check if pod
exists for finalization" and "Capture pod logs" and the variables
cluster_capture_servicemonitor_metrics_service_name, target_namespace, and
artifact_extra_logs_dir when making the changes.

Comment on lines +46 to +49
# Ensure auth_secret_name is always defined with proper structure (fallback for edge cases)
- name: Set default auth secret name if not defined
set_fact:
auth_secret_name: "{% if auth_secret_name_cmd is defined %}{{ auth_secret_name_cmd.stdout }}{% endif %}"

⚠️ Potential issue | 🟠 Major

This normalization step clears PodMonitor auth secrets.

The ServiceMonitor path registers auth_secret_name_cmd, but projects/cluster/toolbox/cluster_capture_servicemonitor_metrics/tasks/podmonitor.yml:72-75 registers auth_secret_name. When the PodMonitor branch runs, this set_fact rewrites the resolved secret name to "", so the later secret handling, mount, and bearer auth are all skipped.

Suggested fix
-# Ensure auth_secret_name is always defined with proper structure (fallback for edge cases)
-- name: Set default auth secret name if not defined
-  set_fact:
-    auth_secret_name: "{% if auth_secret_name_cmd is defined %}{{ auth_secret_name_cmd.stdout }}{% endif %}"
+# Normalize the extracted auth secret name from either monitor path
+- name: Normalize auth secret name
+  set_fact:
+    auth_secret_name: >-
+      {{ auth_secret_name_cmd.stdout if auth_secret_name_cmd is defined
+         else (auth_secret_name.stdout if auth_secret_name is defined else '') }}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@projects/cluster/toolbox/cluster_capture_servicemonitor_metrics/tasks/main.yml`
around lines 46 - 49, The current set_fact in tasks/main.yml unconditionally
overwrites auth_secret_name with an empty string when auth_secret_name_cmd is
undefined, clearing PodMonitor-registered secrets; change the set_fact logic so
it only assigns auth_secret_name when auth_secret_name_cmd is defined and
non-empty (e.g., keep existing auth_secret_name otherwise), ensuring the task
that sets auth_secret_name (from
projects/cluster/toolbox/cluster_capture_servicemonitor_metrics/tasks/podmonitor.yml)
is not clobbered and subsequent secret handling, mounting, and bearer auth
remain intact.

Comment on lines +56 to +60
- name: Check if auth secret exists
shell: |
oc get secret {{ auth_secret_name }} -n {{ target_namespace }} --no-headers -o name
register: auth_secret_exists
when: auth_secret_name != ""

⚠️ Potential issue | 🟠 Major

The auth-secret probe is still fatal.

The next task branches on auth_secret_exists.rc, but oc get secret ... will already fail the play when the secret is absent. That makes the "Secret exists: no" path below unreachable and turns what looks like an informational probe into an early abort.

Suggested fix
 - name: Check if auth secret exists
   shell: |
     oc get secret {{ auth_secret_name }} -n {{ target_namespace }} --no-headers -o name
   register: auth_secret_exists
   when: auth_secret_name != ""
+  failed_when: false
+  changed_when: false

Also applies to: 62-80

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@projects/cluster/toolbox/cluster_capture_servicemonitor_metrics/tasks/main.yml`
around lines 56 - 60, The "Check if auth secret exists" task uses "oc get secret
..." and registers auth_secret_exists but will abort the play when the secret is
missing; change the task so the command does not fail the play by adding either
failed_when: false (and optionally changed_when: false) or ignore_errors: true
to the task that runs "oc get secret {{ auth_secret_name }} -n {{
target_namespace }} --no-headers -o name", so auth_secret_exists.rc can be
inspected later; apply the same change to the other similar probe tasks that
register values (the tasks that set auth_secret_exists and the equivalent checks
in the 62-80 range) to make the "Secret exists: no" path reachable.

Comment on lines +61 to +67
{% if auth_secret_name %}
if [ -f /var/run/secrets/auth/token ]; then
TOKEN=$(cat /var/run/secrets/auth/token)
fetch_metrics "$url" "$TOKEN"
else
echo "Auth token file not found"
fi

⚠️ Potential issue | 🟠 Major

Don't hard-code the auth secret filename to token.

This template only receives a secret name, not the selected credential key. With the current /var/run/secrets/auth/token read, any ServiceMonitor/PodMonitor whose authorization secret uses a different key will mount successfully but still scrape unauthenticated. The extractor tasks only capture .name today in projects/cluster/toolbox/cluster_capture_servicemonitor_metrics/tasks/servicemonitor.yml:73-76 and projects/cluster/toolbox/cluster_capture_servicemonitor_metrics/tasks/podmonitor.yml:72-75. Pass the key through as well and derive the file path from it.

Also applies to: 83-93

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@projects/cluster/toolbox/cluster_capture_servicemonitor_metrics/templates/metrics_capture_pod.yaml.j2`
around lines 61 - 67, The template hardcodes /var/run/secrets/auth/token and
must instead use the secret key provided by the tasks; update the
ServiceMonitor/PodMonitor extractor tasks (referenced in
projects/cluster/toolbox/cluster_capture_servicemonitor_metrics/tasks/servicemonitor.yml
and podmonitor.yml) to also capture the authorization secret key (e.g.,
auth_secret_key) and pass it into the template, then change the template logic
around fetch_metrics and the second block (both the 61–67 and 83–93 areas) to
derive the file path from that key (e.g., /var/run/secrets/auth/{{
auth_secret_key }}) rather than using the literal "token" so fetch_metrics is
called with the actual mounted key file when present.
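A sketch of the shape such a change could take (`auth_secret_key` is a hypothetical variable, and the jsonpath assumes the Prometheus Operator's `authorization.credentials` SecretKeySelector fields; verify both against the actual tasks and CRDs):

```yaml
# servicemonitor.yml (sketch): extract the credential key alongside the name
- name: Get auth secret name and key
  shell: |
    oc get servicemonitor {{ cluster_capture_servicemonitor_metrics_service_name }} \
      -n {{ target_namespace }} \
      -o jsonpath='{.spec.endpoints[0].authorization.credentials.name}{" "}{.spec.endpoints[0].authorization.credentials.key}'
  register: auth_secret_info
```

and in the pod template, derive the mounted file path from that key instead of the literal `token`:

```
{% if auth_secret_name %}
TOKEN_FILE="/var/run/secrets/auth/{{ auth_secret_key | default('token') }}"
if [ -f "$TOKEN_FILE" ]; then
  fetch_metrics "$url" "$(cat "$TOKEN_FILE")"
else
  echo "Auth token file not found"
fi
{% endif %}
```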

Args:
service_name: Name of the service to capture ServiceMonitor/PodMonitor metrics for
namespace: Namespace where the service and ServiceMonitor/PodMonitor are located (empty string auto-detects current namespace)
capture_frequency: How often to capture metrics in seconds (default: 15)

⚠️ Potential issue | 🟡 Minor

Docstring default value is incorrect: says 15 but actual default is 60.

The docstring states (default: 15) but the method signature on line 553 shows capture_frequency=60. This is the root cause of the documentation inconsistencies in the generated files.

Proposed fix
-            capture_frequency: How often to capture metrics in seconds (default: 15)
+            capture_frequency: How often to capture metrics in seconds (default: 60)
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
-            capture_frequency: How often to capture metrics in seconds (default: 15)
+            capture_frequency: How often to capture metrics in seconds (default: 60)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@projects/cluster/toolbox/cluster.py` at line 564, The docstring for the
function that defines the capture_frequency parameter (the function with
parameter capture_frequency=60) incorrectly states "(default: 15)"; update the
docstring to reflect the actual default of 60 by changing the text to "(default:
60)" wherever capture_frequency is documented (e.g., in the docstring block
around the function that includes capture_frequency) so the documentation
matches the function signature.
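The essence of the fix is keeping the docstring in sync with the signature. A minimal self-contained illustration (the function body and names here are hypothetical stand-ins, not the actual `cluster.py` code):

```python
import inspect

def capture_servicemonitor_metrics(service_name, namespace="", capture_frequency=60):
    """Capture ServiceMonitor/PodMonitor metrics for a service.

    Args:
        service_name: Name of the service to capture metrics for
        namespace: Namespace of the service (empty string auto-detects)
        capture_frequency: How often to capture metrics in seconds (default: 60)
    """
    ...

# The docstring now agrees with the signature default.
sig = inspect.signature(capture_servicemonitor_metrics)
assert sig.parameters["capture_frequency"].default == 60
assert "(default: 60)" in capture_servicemonitor_metrics.__doc__
```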

@topsail-bot

topsail-bot bot commented Mar 28, 2026

🔴 Test of 'llm-d test test_ci' failed after 00 hours 34 minutes 18 seconds. 🔴

• Link to the test results.

• No reports index generated...

Test configuration:

PR_POSITIONAL_ARGS: jump-ci
PR_POSITIONAL_ARG_0: jump-ci
PR_POSITIONAL_ARG_1: cks
PR_POSITIONAL_ARG_2: multi-flavor
PR_POSITIONAL_ARG_3: guidellm_heterogeneous_eval
tests.llmd.inference_service.metrics.manual_capture: false

Failure indicator: Empty. (See run.log)

@kpouget
Contributor Author

kpouget commented Mar 28, 2026

/test jump-ci llm-d cks multi-flavor guidellm_heterogeneous_eval
/cluster cks
/only test_ci
/var tests.llmd.inference_service.metrics.manual_capture: false

@topsail-bot

topsail-bot bot commented Mar 29, 2026

🟢 Test of 'llm-d test test_ci' succeeded after 03 hours 25 minutes 12 seconds. 🟢

• Link to the test results.

• Link to the reports index.

Test configuration:

PR_POSITIONAL_ARGS: jump-ci
PR_POSITIONAL_ARG_0: jump-ci
PR_POSITIONAL_ARG_1: cks
PR_POSITIONAL_ARG_2: multi-flavor
PR_POSITIONAL_ARG_3: guidellm_heterogeneous_eval
tests.llmd.inference_service.metrics.manual_capture: false

@kpouget
Contributor Author

kpouget commented Mar 29, 2026

/test jump-ci llm-d cks multi-flavor guidellm_heterogeneous_eval gpt-oss
/cluster cks
/only test_ci
/var tests.llmd.inference_service.metrics.manual_capture: false

@kpouget
Contributor Author

kpouget commented Mar 29, 2026

/test jump-ci llm-d cks multi-flavor guidellm_heterogeneous_eval gpt-oss
/cluster cks
/only test_ci
/var tests.llmd.inference_service.metrics.manual_capture: false


@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@projects/llm-d/testing/config.yaml`:
- Around line 297-300: The run_guidellm_benchmark call currently uses hardcoded
image defaults; update test_llmd.py so run_guidellm_benchmark reads the nested
config values using
config.project.get_config("tests.llmd.benchmarks.guidellm.image.name") and
config.project.get_config("tests.llmd.benchmarks.guidellm.image.version")
(assign to image_name and image_version) and pass them into run.run_toolbox via
the image= and version= parameters (alongside existing args like endpoint_url,
name, namespace, timeout, guidellm_args, run_as_root, artifact_dir_suffix) so
the config-driven image/version override the defaults.
- Around line 51-53: The gateway.name in config
(tests.llmd.inference_service.gateway.name) must match an actual
LLMInferenceService.status.addresses[*].name or the test will raise
RuntimeError—update the config value to an existing address name or add a
pre-check in the test to validate existence; ensure monitor names used by
capture_servicemonitor_metrics are validated before invoking oc get to avoid
capture failures; and fix the image schema mismatch by either flattening config
keys to benchmarks.guidellm.image and benchmarks.guidellm.version or change
run_guidellm_benchmark to read the nested benchmarks.guidellm.image.name and
benchmarks.guidellm.image.version and pass them into the toolbox call instead of
the hardcoded defaults (image="ghcr.io/albertoperdomo2/guidellm",
version="nightly").

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 474e4624-b358-4d5c-a103-edfd7162d477

📥 Commits

Reviewing files that changed from the base of the PR and between 3c414e9 and 3175abb.

📒 Files selected for processing (5)
  • docs/toolbox.generated/Llmd.run_guidellm_benchmark.rst
  • projects/llm-d/testing/config.yaml
  • projects/llm-d/testing/test_llmd.py
  • projects/llm-d/toolbox/llmd.py
  • projects/llm-d/toolbox/llmd_run_guidellm_benchmark/defaults/main/config.yml
✅ Files skipped from review due to trivial changes (3)
  • projects/llm-d/toolbox/llmd.py
  • docs/toolbox.generated/Llmd.run_guidellm_benchmark.rst
  • projects/llm-d/toolbox/llmd_run_guidellm_benchmark/defaults/main/config.yml
🚧 Files skipped from review as they are similar to previous changes (1)
  • projects/llm-d/testing/test_llmd.py

@topsail-bot

topsail-bot bot commented Mar 29, 2026

🟢 Test of 'llm-d test test_ci' succeeded after 03 hours 29 minutes 19 seconds. 🟢

• Link to the test results.

• Link to the reports index.

Test configuration:

PR_POSITIONAL_ARGS: jump-ci
PR_POSITIONAL_ARG_0: jump-ci
PR_POSITIONAL_ARG_1: cks
PR_POSITIONAL_ARG_2: multi-flavor
PR_POSITIONAL_ARG_3: guidellm_heterogeneous_eval
PR_POSITIONAL_ARG_4: gpt-oss
tests.llmd.inference_service.metrics.manual_capture: false

@kpouget
Contributor Author

kpouget commented Mar 30, 2026

/test jump-ci llm-d psap_h200 multi-flavor guidellm_heterogeneous_eval gpt-oss
/cluster psap-h200
/only test_ci

@openshift-psap openshift-psap deleted a comment from openshift-ci bot Mar 30, 2026
@topsail-bot

topsail-bot bot commented Mar 30, 2026

🔴 Test of 'llm-d test test_ci' failed after 02 hours 26 minutes 23 seconds. 🔴

• Link to the test results.

• Link to the reports index.

Test configuration:

PR_POSITIONAL_ARGS: jump-ci
PR_POSITIONAL_ARG_0: jump-ci
PR_POSITIONAL_ARG_1: psap_h200
PR_POSITIONAL_ARG_2: multi-flavor
PR_POSITIONAL_ARG_3: guidellm_heterogeneous_eval
PR_POSITIONAL_ARG_4: gpt-oss

Failure indicator: Empty. (See run.log)

@openshift-ci

openshift-ci bot commented Mar 30, 2026

@kpouget: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name: ci/prow/jump-ci
Commit: 3175abb
Details: link
Required: true
Rerun command: /test jump-ci

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@kpouget kpouget closed this Mar 31, 2026