
[llm-d] Keep working#914

Closed
kpouget wants to merge 5 commits into openshift-psap:main from kpouget:download-model2

Conversation

@kpouget
Contributor

@kpouget kpouget commented Mar 28, 2026

Summary by CodeRabbit

  • New Features
    • Automated metrics capture for Kubernetes ServiceMonitor/PodMonitor with configurable polling and finalize options.
  • Tests
    • Tests now start/stop per-flavor metrics capture automatically and tolerate shutdown errors.
  • Documentation
    • Added user docs and index entry for the new metrics-capture tooling.
  • Chores
    • Updated default GuideLLM benchmark image/version and added benchmark artifact tracking for results/logs.

@coderabbitai

coderabbitai bot commented Mar 28, 2026

📝 Walkthrough


Adds a new Cluster.capture_servicemonitor_metrics method and its Ansible role, which capture ServiceMonitor/PodMonitor YAML, status, and periodic metrics via a templated pod; also updates the docs, wires the test harness to start/stop capture, adds benchmark artifact paths to the visualization parser, and bumps the default benchmark image/version.

Changes

  • Documentation: docs/toolbox.generated/Cluster.capture_servicemonitor_metrics.rst, docs/toolbox.generated/index.rst
    New auto-generated RST page and index entry documenting the Cluster.capture_servicemonitor_metrics API and parameters.
  • Cluster toolbox API: projects/cluster/toolbox/cluster.py
    Added the public method capture_servicemonitor_metrics(...), decorated with @AnsibleRole/@AnsibleMappedParams, which invokes the new Ansible role.
  • Ansible role defaults: projects/cluster/toolbox/cluster_capture_servicemonitor_metrics/defaults/main/config.yml
    Added defaults for namespace, capture_frequency, is_podmonitor, finalize, and ansible_os_family.
  • Ansible role tasks (core): projects/cluster/toolbox/cluster_capture_servicemonitor_metrics/tasks/main.yml
    Main role workflow: namespace resolution, artifacts dir, finalize flow (capture logs + delete pod), include of servicemonitor/podmonitor tasks, auth handling, render/create pod, wait for Ready.
  • Ansible role tasks (monitor-specific): projects/cluster/toolbox/cluster_capture_servicemonitor_metrics/tasks/servicemonitor.yml, projects/cluster/toolbox/cluster_capture_servicemonitor_metrics/tasks/podmonitor.yml
    ServiceMonitor and PodMonitor capture logic: fetch YAML/status, determine endpoints/ports/scheme, build metrics URLs, extract the auth secret name, write artifact metadata.
  • Pod template: projects/cluster/toolbox/cluster_capture_servicemonitor_metrics/templates/metrics_capture_pod.yaml.j2
    Jinja2 pod manifest that loops over METRICS_URLS with curl (optionally using a mounted Bearer token), filters the responses, and sleeps capture_frequency seconds between rounds.
  • Test integration: projects/llm-d/testing/config.yaml, projects/llm-d/testing/test_llmd.py
    Disabled manual capture in the preset, added benchmark image/version overrides, added start_metrics_capture(flavor) and stop_metrics_capture(flavor), and wired capture start/stop around deployments.
  • Benchmark defaults & docs: projects/llm-d/toolbox/llmd.py, projects/llm-d/toolbox/llmd_run_guidellm_benchmark/defaults/main/config.yml, docs/toolbox.generated/Llmd.run_guidellm_benchmark.rst
    Updated the default GuideLLM image and version (image → ghcr.io/albertoperdomo2/guidellm, version → nightly) in code, role defaults, and generated docs.
  • Visualization parser: projects/llm-d/visualizations/llmd_inference/store/parsers.py
    Added GuideLLM benchmark artifact paths (artifacts/results/benchmarks.json, artifacts/guidellm_benchmark_job.logs) to IMPORTANT_FILES.
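The capture loop that the pod template implements (shell inside a Jinja2 manifest, per the summary above) can be sketched in Python for illustration. This is a hedged sketch only: the helper names `capture_loop` and the `rounds` parameter are hypothetical, and only `fetch_metrics`, the Bearer-token header, and the sleep between rounds reflect behavior described in this PR.

```python
import time
import urllib.request

def fetch_metrics(url, token=None):
    """Fetch one metrics endpoint, optionally with a Bearer token
    (mirrors the template's curl + Authorization header)."""
    req = urllib.request.Request(url)
    if token:
        req.add_header("Authorization", f"Bearer {token}")
    with urllib.request.urlopen(req) as resp:
        return resp.read().decode()

def capture_loop(urls, frequency=60, token=None, rounds=None, fetch=fetch_metrics):
    """Scrape every URL, sleep `frequency` seconds, repeat.
    rounds=None loops forever like the real pod; a finite value
    lets the sketch be exercised in a test."""
    captured = []
    done = 0
    while rounds is None or done < rounds:
        for url in urls:
            captured.append((url, fetch(url, token)))
        done += 1
        if rounds is None or done < rounds:
            time.sleep(frequency)
    return captured
```

The injectable `fetch` argument is only there so the loop can be exercised without a live endpoint; the real pod simply shells out to curl.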

Sequence Diagram(s)

sequenceDiagram
    participant Test as Test Code
    participant Cluster as Cluster.capture_servicemonitor_metrics
    participant Ansible as Ansible Role
    participant K8s as Kubernetes Cluster
    participant Pod as Metrics Capture Pod
    participant Endpoint as Metrics Endpoint

    Test->>Cluster: capture_servicemonitor_metrics(service_name, namespace, ...)
    Cluster->>Ansible: RunAnsibleRole(locals())

    rect rgba(100,150,200,0.5)
    Ansible->>K8s: Resolve namespace (oc project -q if empty)
    Ansible->>K8s: Query ServiceMonitor/PodMonitor YAML & status
    Ansible->>K8s: Query target Service/Pods, extract ports/scheme
    end

    rect rgba(150,200,100,0.5)
    Ansible->>K8s: Check for auth secret
    Ansible->>Ansible: Render metrics_capture_pod.yaml (with auth mount if present)
    Ansible->>K8s: Create pod, wait for Ready
    K8s->>Pod: Start pod with METRICS_URLS env
    Pod->>Endpoint: Periodic curl (+Bearer token if mounted)
    Endpoint-->>Pod: Metrics response
    Pod->>K8s: Write metrics to artifact logs
    end

    Test->>Cluster: capture_servicemonitor_metrics(..., finalize=True)
    Cluster->>Ansible: RunAnsibleRole(locals())
    rect rgba(200,150,100,0.5)
    Ansible->>K8s: Find existing capture pod
    Ansible->>K8s: Dump pod logs to metrics_capture_logs.txt
    Ansible->>K8s: Delete pod (grace 0)
    end
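The first two arrows of the diagram (the toolbox method handing locals() to the Ansible role) follow a pattern that can be sketched as below. The @AnsibleRole/@AnsibleMappedParams decorators and RunAnsibleRole are TOPSAIL internals; the stand-ins here only illustrate the pattern, and the method signature is an assumption based on the parameters named in this PR, not the actual code.

```python
# Illustrative stand-ins for TOPSAIL's decorators; the real ones differ.
def AnsibleRole(role_name):
    def wrap(fn):
        fn.ansible_role = role_name  # record which role the method drives
        return fn
    return wrap

def AnsibleMappedParams(fn):
    fn.ansible_mapped_params = True  # marker: map Python params to role vars
    return fn

class RunAnsibleRole:
    """Stand-in that records the variables that would be passed to the role."""
    def __init__(self, variables):
        self.variables = {k: v for k, v in variables.items() if k != "self"}

class Cluster:
    @AnsibleRole("cluster_capture_servicemonitor_metrics")
    @AnsibleMappedParams
    def capture_servicemonitor_metrics(self, service_name, namespace="",
                                       capture_frequency=60, is_podmonitor=False,
                                       finalize=False):
        """Start (or, with finalize=True, tear down) the metrics-capture pod."""
        return RunAnsibleRole(locals())
```

Calling the method once starts capture and calling it again with finalize=True stops it, matching the two invocations in the sequence diagram.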

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

  • [llm-d] Keep working #907: Modifies llm-d testing/toolbox code (test_llmd, config) and benchmark configuration—overlaps metrics capture and defaults changes.
  • [llm-d] Keep working #910: Changes Llmd.run_guidellm_benchmark API/defaults (adds parameters) and touches the same toolbox function updated here.
  • [llm-d] Keep working #912: Updates llm-d visualization parsers—related to the IMPORTANT_FILES additions in this PR.

Suggested labels

lgtm

Poem

🐰 I hopped into logs at the break of day,
Captured metrics the curious way,
ServiceMonitors sang, pods hummed along,
Artifacts saved—oh what a song!
Hop, fetch, and stash — a rabbit's dev ballet.

🚥 Pre-merge checks | ✅ 2 passed | ❌ 1 inconclusive

❌ Failed checks (1 inconclusive)

• Title check: ❓ Inconclusive. The title '[llm-d] Keep working' is vague and generic, and does not convey the changeset's actual purpose or scope. Resolution: replace the generic title with a specific, descriptive summary of the main changes (e.g., 'Add ServiceMonitor/PodMonitor metrics capture and update GuideLLM defaults').
✅ Passed checks (2 passed)
• Description check: ✅ Passed. Check skipped because CodeRabbit's high-level summary is enabled.
• Docstring coverage: ✅ Passed. Docstring coverage is 100.00%, above the required threshold of 80.00%.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

Comment @coderabbitai help to get the list of available commands and usage tips.

@openshift-ci

openshift-ci bot commented Mar 28, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign kpouget for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@kpouget
Contributor Author

kpouget commented Mar 28, 2026

/test jump-ci llm-d cks multi-flavor guidellm_heterogeneous_eval
/cluster cks
/only test_ci
/var tests.llmd.inference_service.metrics.manual_capture: false


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 7

🧹 Nitpick comments (3)
projects/cluster/toolbox/cluster_capture_servicemonitor_metrics/tasks/servicemonitor.yml (2)

45-47: Potential empty port value when fallback task is skipped.

If named_port_result.stdout is an empty string and the first_port_result task was skipped (line 43 condition not met because named_port_result.stdout != ""), the first_port_result.stdout may be undefined, causing final_port to resolve incorrectly.

However, reviewing the logic: if named_port_result.stdout == "", line 43's when condition is true, so first_port_result will run. The current logic appears correct, but the complex conditional on line 47 could be simplified for clarity.

Suggested simplification
 - name: Set final port number
   set_fact:
-    final_port: "{{ named_port_result.stdout if named_port_result.stdout != '' else first_port_result.stdout | default('9090') }}"
+    final_port: "{{ named_port_result.stdout | default(first_port_result.stdout | default('9090'), true) }}"

Using Jinja2's default filter with true as the second argument handles both undefined and empty string cases.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@projects/cluster/toolbox/cluster_capture_servicemonitor_metrics/tasks/servicemonitor.yml`
around lines 45 - 47, The current set_fact for final_port uses a complex
conditional that can be simplified and made safe for undefined or empty values:
update the final_port assignment (set_fact final_port) to prefer
named_port_result.stdout, then fall back to first_port_result.stdout, and
finally to '9090' using Jinja2's default filter with the second argument true to
treat empty strings as undefined (e.g. use "{{ named_port_result.stdout |
default(first_port_result.stdout | default('9090', true), true) }}"),
referencing the final_port variable and the named_port_result and
first_port_result task results.

12-24: Hardcoded label selectors reduce reusability.

The label selectors app.kubernetes.io/component=llminferenceservice-router-scheduler,app.kubernetes.io/part-of=llminferenceservice are hardcoded for a specific use case. Consider parameterizing these if ServiceMonitor capture needs to support other service types in the future.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@projects/cluster/toolbox/cluster_capture_servicemonitor_metrics/tasks/servicemonitor.yml`
around lines 12 - 24, The three tasks ("Capture service target by ServiceMonitor
selector", "Capture service target by ServiceMonitor selector YAML", and "Get
target service name") use a hardcoded label selector string; parameterize that
selector (e.g., introduce a variable like service_label_selector or
serviceMonitor_label_selector with a sensible default) and replace the literal
selector in each oc command with the new variable, ensuring the variable is
documented/defaulted so other service types can be targeted without editing the
task bodies; keep existing variables target_namespace, artifact_extra_logs_dir
and the register target_service_name unchanged.
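A sketch of what that parameterization could look like, following the role's apparent `cluster_capture_servicemonitor_metrics_*` naming convention (the variable name below is hypothetical, and the default simply preserves the current selector):

```yaml
# defaults/main/config.yml (sketch): hypothetical variable, defaulting
# to the selector currently hardcoded in the tasks
cluster_capture_servicemonitor_metrics_label_selector: >-
  app.kubernetes.io/component=llminferenceservice-router-scheduler,app.kubernetes.io/part-of=llminferenceservice
```

and in the task, the literal selector becomes a variable reference:

```yaml
- name: Get target service name
  shell: |
    oc get svc -n {{ target_namespace }} \
      -l "{{ cluster_capture_servicemonitor_metrics_label_selector }}" \
      -o jsonpath='{.items[0].metadata.name}'
  register: target_service_name
```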
projects/cluster/toolbox/cluster_capture_servicemonitor_metrics/tasks/podmonitor.yml (1)

12-23: Hardcoded label selectors match servicemonitor.yml pattern.

Similar to servicemonitor.yml, these label selectors are hardcoded for the llminferenceservice workload types. This is consistent but limits reusability for other PodMonitor scenarios.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@projects/cluster/toolbox/cluster_capture_servicemonitor_metrics/tasks/podmonitor.yml`
around lines 12 - 23, The three tasks named "Capture pod targets by PodMonitor
selector", "Capture pod targets by PodMonitor selector YAML" and "Get all target
pod IPs and names" use a hardcoded label selector string for
llminferenceservice; replace that hardcoded selector with a reusable variable
(e.g., podmonitor_label_selector or podmonitor_labels) and reference the
variable in each oc get command, and provide a sensible default value matching
the current selector so existing behavior is preserved; update any
documentation/vars file where defaults live and ensure the variable is used
consistently across those tasks.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: c84a89bf-7c83-4b96-9888-094f17d0f6b5

📥 Commits

Reviewing files that changed from the base of the PR and between 559da40 and 3c414e9.

📒 Files selected for processing (11)
  • docs/toolbox.generated/Cluster.capture_servicemonitor_metrics.rst
  • docs/toolbox.generated/index.rst
  • projects/cluster/toolbox/cluster.py
  • projects/cluster/toolbox/cluster_capture_servicemonitor_metrics/defaults/main/config.yml
  • projects/cluster/toolbox/cluster_capture_servicemonitor_metrics/tasks/main.yml
  • projects/cluster/toolbox/cluster_capture_servicemonitor_metrics/tasks/podmonitor.yml
  • projects/cluster/toolbox/cluster_capture_servicemonitor_metrics/tasks/servicemonitor.yml
  • projects/cluster/toolbox/cluster_capture_servicemonitor_metrics/templates/metrics_capture_pod.yaml.j2
  • projects/llm-d/testing/config.yaml
  • projects/llm-d/testing/test_llmd.py
  • projects/llm-d/visualizations/llmd_inference/store/parsers.py

Comment on lines +33 to +37
``capture_frequency``

* How often to capture metrics in seconds (default: 15)

* default value: ``60``

⚠️ Potential issue | 🟡 Minor

Same documentation mismatch: "default: 15" but actual value is 60.

This inconsistency originates from the docstring in projects/cluster/toolbox/cluster.py line 564. Fix the source docstring to correct both this RST file and the Ansible defaults file upon regeneration.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/toolbox.generated/Cluster.capture_servicemonitor_metrics.rst` around
lines 33 - 37, The docstring for the capture_frequency parameter is incorrect —
update the docstring for capture_frequency inside the
capture_servicemonitor_metrics method (in projects/cluster/toolbox/cluster.py)
to state the correct default value of 60 seconds (replace "default: 15" with
"default: 60") and ensure the inline description reflects "How often to capture
metrics in seconds (default: 60)"; after that regenerate the docs so
docs/toolbox.generated/Cluster.capture_servicemonitor_metrics.rst and the
Ansible defaults output reflect the corrected default.

Comment on lines +13 to +14
# How often to capture metrics in seconds (default: 15)
cluster_capture_servicemonitor_metrics_capture_frequency: 60

⚠️ Potential issue | 🟡 Minor

Documentation mismatch: Comment says "default: 15" but actual default is 60.

The comment on line 13 states (default: 15) but the actual default value on line 14 is 60. This inconsistency is propagated from the source docstring in projects/cluster/toolbox/cluster.py line 564.

Proposed fix
-# How often to capture metrics in seconds (default: 15)
+# How often to capture metrics in seconds (default: 60)
 cluster_capture_servicemonitor_metrics_capture_frequency: 60
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@projects/cluster/toolbox/cluster_capture_servicemonitor_metrics/defaults/main/config.yml`
around lines 13 - 14, The inline comment for
cluster_capture_servicemonitor_metrics_capture_frequency says "(default: 15)"
but the actual default is 60; update the comment here and the corresponding
docstring in cluster.py (the docstring/source for
cluster_capture_servicemonitor_metrics_capture_frequency around the docstring at
the earlier definition) so both state "(default: 60)" to keep docs and code
consistent.

Comment on lines +20 to +27
- name: Check if pod exists for finalization
shell: |
oc get pod topsail-metrics-capture-{{ cluster_capture_servicemonitor_metrics_service_name }} -n {{ target_namespace }} --no-headers -o name
register: pod_exists_check

- name: Capture pod logs
shell: |
oc logs topsail-metrics-capture-{{ cluster_capture_servicemonitor_metrics_service_name }} -n {{ target_namespace }} > "{{ artifact_extra_logs_dir }}/artifacts/metrics_capture_logs.txt"

⚠️ Potential issue | 🟠 Major

Make finalize mode idempotent when the capture pod never existed.

stop_metrics_capture() runs from a finally: block in projects/llm-d/testing/test_llmd.py, so this role also executes after early startup failures. Right now oc get pod ... and oc logs ... fail hard when the pod is absent, which turns best-effort teardown into a secondary exception. Make the probe non-fatal and only fetch logs when the pod is there.

Suggested fix
   - name: Check if pod exists for finalization
     shell: |
       oc get pod topsail-metrics-capture-{{ cluster_capture_servicemonitor_metrics_service_name }} -n {{ target_namespace }} --no-headers -o name
     register: pod_exists_check
+    failed_when: false
+    changed_when: false

   - name: Capture pod logs
     shell: |
       oc logs topsail-metrics-capture-{{ cluster_capture_servicemonitor_metrics_service_name }} -n {{ target_namespace }} > "{{ artifact_extra_logs_dir }}/artifacts/metrics_capture_logs.txt"
+    when: pod_exists_check.rc == 0
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@projects/cluster/toolbox/cluster_capture_servicemonitor_metrics/tasks/main.yml`
around lines 20 - 27, The finalize probe currently fails if the capture pod
doesn't exist; make the pod-existence check non-fatal and only run the log
capture when the pod is present by adding failed_when: false (and optionally
changed_when: false) to the "Check if pod exists for finalization" task that
registers pod_exists_check, and guard the "Capture pod logs" task with a when
that checks pod_exists_check (e.g., pod_exists_check.stdout is defined and not
empty or pod_exists_check.rc == 0). Reference the task names "Check if pod
exists for finalization" and "Capture pod logs" and the variables
cluster_capture_servicemonitor_metrics_service_name, target_namespace, and
artifact_extra_logs_dir when making the changes.

Comment on lines +46 to +49
# Ensure auth_secret_name is always defined with proper structure (fallback for edge cases)
- name: Set default auth secret name if not defined
set_fact:
auth_secret_name: "{% if auth_secret_name_cmd is defined %}{{ auth_secret_name_cmd.stdout }}{% endif %}"

⚠️ Potential issue | 🟠 Major

This normalization step clears PodMonitor auth secrets.

The ServiceMonitor path registers auth_secret_name_cmd, but projects/cluster/toolbox/cluster_capture_servicemonitor_metrics/tasks/podmonitor.yml:72-75 registers auth_secret_name. When the PodMonitor branch runs, this set_fact rewrites the resolved secret name to "", so the later secret handling, mount, and bearer auth are all skipped.

Suggested fix
-# Ensure auth_secret_name is always defined with proper structure (fallback for edge cases)
-- name: Set default auth secret name if not defined
-  set_fact:
-    auth_secret_name: "{% if auth_secret_name_cmd is defined %}{{ auth_secret_name_cmd.stdout }}{% endif %}"
+# Normalize the extracted auth secret name from either monitor path
+- name: Normalize auth secret name
+  set_fact:
+    auth_secret_name: >-
+      {{ auth_secret_name_cmd.stdout if auth_secret_name_cmd is defined
+         else (auth_secret_name.stdout if auth_secret_name is defined else '') }}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@projects/cluster/toolbox/cluster_capture_servicemonitor_metrics/tasks/main.yml`
around lines 46 - 49, The current set_fact in tasks/main.yml unconditionally
overwrites auth_secret_name with an empty string when auth_secret_name_cmd is
undefined, clearing PodMonitor-registered secrets; change the set_fact logic so
it only assigns auth_secret_name when auth_secret_name_cmd is defined and
non-empty (e.g., keep existing auth_secret_name otherwise), ensuring the task
that sets auth_secret_name (from
projects/cluster/toolbox/cluster_capture_servicemonitor_metrics/tasks/podmonitor.yml)
is not clobbered and subsequent secret handling, mounting, and bearer auth
remain intact.

Comment on lines +56 to +60
- name: Check if auth secret exists
shell: |
oc get secret {{ auth_secret_name }} -n {{ target_namespace }} --no-headers -o name
register: auth_secret_exists
when: auth_secret_name != ""

⚠️ Potential issue | 🟠 Major

The auth-secret probe is still fatal.

The next task branches on auth_secret_exists.rc, but oc get secret ... will already fail the play when the secret is absent. That makes the "Secret exists: no" path below unreachable and turns what looks like an informational probe into an early abort.

Suggested fix
 - name: Check if auth secret exists
   shell: |
     oc get secret {{ auth_secret_name }} -n {{ target_namespace }} --no-headers -o name
   register: auth_secret_exists
   when: auth_secret_name != ""
+  failed_when: false
+  changed_when: false

Also applies to: 62-80

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@projects/cluster/toolbox/cluster_capture_servicemonitor_metrics/tasks/main.yml`
around lines 56 - 60, The "Check if auth secret exists" task uses "oc get secret
..." and registers auth_secret_exists but will abort the play when the secret is
missing; change the task so the command does not fail the play by adding either
failed_when: false (and optionally changed_when: false) or ignore_errors: true
to the task that runs "oc get secret {{ auth_secret_name }} -n {{
target_namespace }} --no-headers -o name", so auth_secret_exists.rc can be
inspected later; apply the same change to the other similar probe tasks that
register values (the tasks that set auth_secret_exists and the equivalent checks
in the 62-80 range) to make the "Secret exists: no" path reachable.

Comment on lines +61 to +67
{% if auth_secret_name %}
if [ -f /var/run/secrets/auth/token ]; then
TOKEN=$(cat /var/run/secrets/auth/token)
fetch_metrics "$url" "$TOKEN"
else
echo "Auth token file not found"
fi

⚠️ Potential issue | 🟠 Major

Don't hard-code the auth secret filename to token.

This template only receives a secret name, not the selected credential key. With the current /var/run/secrets/auth/token read, any ServiceMonitor/PodMonitor whose authorization secret uses a different key will mount successfully but still scrape unauthenticated. The extractor tasks only capture .name today in projects/cluster/toolbox/cluster_capture_servicemonitor_metrics/tasks/servicemonitor.yml:73-76 and projects/cluster/toolbox/cluster_capture_servicemonitor_metrics/tasks/podmonitor.yml:72-75. Pass the key through as well and derive the file path from it.

Also applies to: 83-93

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@projects/cluster/toolbox/cluster_capture_servicemonitor_metrics/templates/metrics_capture_pod.yaml.j2`
around lines 61 - 67, The template hardcodes /var/run/secrets/auth/token and
must instead use the secret key provided by the tasks; update the
ServiceMonitor/PodMonitor extractor tasks (referenced in
projects/cluster/toolbox/cluster_capture_servicemonitor_metrics/tasks/servicemonitor.yml
and podmonitor.yml) to also capture the authorization secret key (e.g.,
auth_secret_key) and pass it into the template, then change the template logic
around fetch_metrics and the second block (both the 61–67 and 83–93 areas) to
derive the file path from that key (e.g., /var/run/secrets/auth/{{
auth_secret_key }}) rather than using the literal "token" so fetch_metrics is
called with the actual mounted key file when present.
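A sketch of the shape such a change could take (`auth_secret_key` is a hypothetical variable, and the jsonpath assumes the Prometheus Operator's `authorization.credentials` SecretKeySelector fields; verify both against the actual tasks and CRDs):

```yaml
# servicemonitor.yml (sketch): extract the credential key alongside the name
- name: Get auth secret name and key
  shell: |
    oc get servicemonitor {{ cluster_capture_servicemonitor_metrics_service_name }} \
      -n {{ target_namespace }} \
      -o jsonpath='{.spec.endpoints[0].authorization.credentials.name}{" "}{.spec.endpoints[0].authorization.credentials.key}'
  register: auth_secret_info
```

and in the pod template, derive the mounted file path from that key instead of the literal `token`:

```
{% if auth_secret_name %}
TOKEN_FILE="/var/run/secrets/auth/{{ auth_secret_key | default('token') }}"
if [ -f "$TOKEN_FILE" ]; then
  fetch_metrics "$url" "$(cat "$TOKEN_FILE")"
else
  echo "Auth token file not found"
fi
{% endif %}
```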

Args:
service_name: Name of the service to capture ServiceMonitor/PodMonitor metrics for
namespace: Namespace where the service and ServiceMonitor/PodMonitor are located (empty string auto-detects current namespace)
capture_frequency: How often to capture metrics in seconds (default: 15)

⚠️ Potential issue | 🟡 Minor

Docstring default value is incorrect: says 15 but actual default is 60.

The docstring states (default: 15) but the method signature on line 553 shows capture_frequency=60. This is the root cause of the documentation inconsistencies in the generated files.

Proposed fix
-            capture_frequency: How often to capture metrics in seconds (default: 15)
+            capture_frequency: How often to capture metrics in seconds (default: 60)
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
-            capture_frequency: How often to capture metrics in seconds (default: 15)
+            capture_frequency: How often to capture metrics in seconds (default: 60)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@projects/cluster/toolbox/cluster.py` at line 564, The docstring for the
function that defines the capture_frequency parameter (the function with
parameter capture_frequency=60) incorrectly states "(default: 15)"; update the
docstring to reflect the actual default of 60 by changing the text to "(default:
60)" wherever capture_frequency is documented (e.g., in the docstring block
around the function that includes capture_frequency) so the documentation
matches the function signature.
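The essence of the fix is keeping the docstring in sync with the signature. A minimal self-contained illustration (the function body and names here are hypothetical stand-ins, not the actual `cluster.py` code):

```python
import inspect

def capture_servicemonitor_metrics(service_name, namespace="", capture_frequency=60):
    """Capture ServiceMonitor/PodMonitor metrics for a service.

    Args:
        service_name: Name of the service to capture metrics for
        namespace: Namespace of the service (empty string auto-detects)
        capture_frequency: How often to capture metrics in seconds (default: 60)
    """
    ...

# The docstring now agrees with the signature default.
sig = inspect.signature(capture_servicemonitor_metrics)
assert sig.parameters["capture_frequency"].default == 60
assert "(default: 60)" in capture_servicemonitor_metrics.__doc__
```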

@topsail-bot

topsail-bot bot commented Mar 28, 2026

🔴 Test of 'llm-d test test_ci' failed after 00 hours 34 minutes 18 seconds. 🔴

• Link to the test results.

• No reports index generated...

Test configuration:

PR_POSITIONAL_ARGS: jump-ci
PR_POSITIONAL_ARG_0: jump-ci
PR_POSITIONAL_ARG_1: cks
PR_POSITIONAL_ARG_2: multi-flavor
PR_POSITIONAL_ARG_3: guidellm_heterogeneous_eval
tests.llmd.inference_service.metrics.manual_capture: false

Failure indicator: Empty. (See run.log)

@kpouget
Contributor Author

kpouget commented Mar 28, 2026

/test jump-ci llm-d cks multi-flavor guidellm_heterogeneous_eval
/cluster cks
/only test_ci
/var tests.llmd.inference_service.metrics.manual_capture: false

@topsail-bot

topsail-bot bot commented Mar 29, 2026

🟢 Test of 'llm-d test test_ci' succeeded after 03 hours 25 minutes 12 seconds. 🟢

• Link to the test results.

• Link to the reports index.

Test configuration:

PR_POSITIONAL_ARGS: jump-ci
PR_POSITIONAL_ARG_0: jump-ci
PR_POSITIONAL_ARG_1: cks
PR_POSITIONAL_ARG_2: multi-flavor
PR_POSITIONAL_ARG_3: guidellm_heterogeneous_eval
tests.llmd.inference_service.metrics.manual_capture: false

@kpouget
Contributor Author

kpouget commented Mar 29, 2026

/test jump-ci llm-d cks multi-flavor guidellm_heterogeneous_eval gpt-oss
/cluster cks
/only test_ci
/var tests.llmd.inference_service.metrics.manual_capture: false

@kpouget
Contributor Author

kpouget commented Mar 29, 2026

/test jump-ci llm-d cks multi-flavor guidellm_heterogeneous_eval gpt-oss
/cluster cks
/only test_ci
/var tests.llmd.inference_service.metrics.manual_capture: false


@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@projects/llm-d/testing/config.yaml`:
- Around line 297-300: The run_guidellm_benchmark call currently uses hardcoded
image defaults; update test_llmd.py so run_guidellm_benchmark reads the nested
config values using
config.project.get_config("tests.llmd.benchmarks.guidellm.image.name") and
config.project.get_config("tests.llmd.benchmarks.guidellm.image.version")
(assign to image_name and image_version) and pass them into run.run_toolbox via
the image= and version= parameters (alongside existing args like endpoint_url,
name, namespace, timeout, guidellm_args, run_as_root, artifact_dir_suffix) so
the config-driven image/version override the defaults.
- Around line 51-53: The gateway.name in config
(tests.llmd.inference_service.gateway.name) must match an actual
LLMInferenceService.status.addresses[*].name or the test will raise
RuntimeError—update the config value to an existing address name or add a
pre-check in the test to validate existence; ensure monitor names used by
capture_servicemonitor_metrics are validated before invoking oc get to avoid
capture failures; and fix the image schema mismatch by either flattening config
keys to benchmarks.guidellm.image and benchmarks.guidellm.version or change
run_guidellm_benchmark to read the nested benchmarks.guidellm.image.name and
benchmarks.guidellm.image.version and pass them into the toolbox call instead of
the hardcoded defaults (image="ghcr.io/albertoperdomo2/guidellm",
version="nightly").

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 474e4624-b358-4d5c-a103-edfd7162d477

📥 Commits

Reviewing files that changed from the base of the PR and between 3c414e9 and 3175abb.

📒 Files selected for processing (5)
  • docs/toolbox.generated/Llmd.run_guidellm_benchmark.rst
  • projects/llm-d/testing/config.yaml
  • projects/llm-d/testing/test_llmd.py
  • projects/llm-d/toolbox/llmd.py
  • projects/llm-d/toolbox/llmd_run_guidellm_benchmark/defaults/main/config.yml
✅ Files skipped from review due to trivial changes (3)
  • projects/llm-d/toolbox/llmd.py
  • docs/toolbox.generated/Llmd.run_guidellm_benchmark.rst
  • projects/llm-d/toolbox/llmd_run_guidellm_benchmark/defaults/main/config.yml
🚧 Files skipped from review as they are similar to previous changes (1)
  • projects/llm-d/testing/test_llmd.py

@topsail-bot

topsail-bot bot commented Mar 29, 2026

🟢 Test of 'llm-d test test_ci' succeeded after 03 hours 29 minutes 19 seconds. 🟢

• Link to the test results.

• Link to the reports index.

Test configuration:

PR_POSITIONAL_ARGS: jump-ci
PR_POSITIONAL_ARG_0: jump-ci
PR_POSITIONAL_ARG_1: cks
PR_POSITIONAL_ARG_2: multi-flavor
PR_POSITIONAL_ARG_3: guidellm_heterogeneous_eval
PR_POSITIONAL_ARG_4: gpt-oss
tests.llmd.inference_service.metrics.manual_capture: false

@kpouget
Contributor Author

kpouget commented Mar 30, 2026

/test jump-ci llm-d psap_h200 multi-flavor guidellm_heterogeneous_eval gpt-oss
/cluster psap-h200
/only test_ci

@openshift-psap openshift-psap deleted a comment from openshift-ci bot Mar 30, 2026
@topsail-bot

topsail-bot bot commented Mar 30, 2026

🔴 Test of 'llm-d test test_ci' failed after 02 hours 26 minutes 23 seconds. 🔴

• Link to the test results.

• Link to the reports index.

Test configuration:

PR_POSITIONAL_ARGS: jump-ci
PR_POSITIONAL_ARG_0: jump-ci
PR_POSITIONAL_ARG_1: psap_h200
PR_POSITIONAL_ARG_2: multi-flavor
PR_POSITIONAL_ARG_3: guidellm_heterogeneous_eval
PR_POSITIONAL_ARG_4: gpt-oss

Failure indicator: Empty. (See run.log)

@openshift-ci

openshift-ci bot commented Mar 30, 2026

@kpouget: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name: ci/prow/jump-ci
Commit: 3175abb
Details: link
Required: true
Rerun command: /test jump-ci

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@kpouget kpouget closed this Mar 31, 2026