📝 Walkthrough

Adds a new Cluster.capture_servicemonitor_metrics method and its Ansible role to capture ServiceMonitor/PodMonitor YAML, status, and periodic metrics via a templated pod; updates docs, the test harness to start/stop capture, visualization parser artifacts, and the default benchmark image/version.
Sequence Diagram(s)

sequenceDiagram
    participant Test as Test Code
    participant Cluster as Cluster.capture_servicemonitor_metrics
    participant Ansible as Ansible Role
    participant K8s as Kubernetes Cluster
    participant Pod as Metrics Capture Pod
    participant Endpoint as Metrics Endpoint
    Test->>Cluster: capture_servicemonitor_metrics(service_name, namespace, ...)
    Cluster->>Ansible: RunAnsibleRole(locals())
    rect rgba(100,150,200,0.5)
        Ansible->>K8s: Resolve namespace (oc project -q if empty)
        Ansible->>K8s: Query ServiceMonitor/PodMonitor YAML & status
        Ansible->>K8s: Query target Service/Pods, extract ports/scheme
    end
    rect rgba(150,200,100,0.5)
        Ansible->>K8s: Check for auth secret
        Ansible->>Ansible: Render metrics_capture_pod.yaml (with auth mount if present)
        Ansible->>K8s: Create pod, wait for Ready
        K8s->>Pod: Start pod with METRICS_URLS env
        Pod->>Endpoint: Periodic curl (+Bearer token if mounted)
        Endpoint-->>Pod: Metrics response
        Pod->>K8s: Write metrics to artifact logs
    end
    Test->>Cluster: capture_servicemonitor_metrics(..., finalize=True)
    Cluster->>Ansible: RunAnsibleRole(locals())
    rect rgba(200,150,100,0.5)
        Ansible->>K8s: Find existing capture pod
        Ansible->>K8s: Dump pod logs to metrics_capture_logs.txt
        Ansible->>K8s: Delete pod (grace 0)
    end
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
🚥 Pre-merge checks: ✅ 2 passed | ❌ 1 failed (1 inconclusive)
[APPROVALNOTIFIER] This PR is NOT APPROVED.

This pull-request has been approved by: (no approvers listed). The full list of commands accepted by this bot can be found here.

Details: Needs approval from an approver in each of these files. Approvers can indicate their approval by writing the approval command in a comment.
/test jump-ci llm-d cks multi-flavor guidellm_heterogeneous_eval
Actionable comments posted: 7
🧹 Nitpick comments (3)
projects/cluster/toolbox/cluster_capture_servicemonitor_metrics/tasks/servicemonitor.yml (2)
45-47: Potential empty port value when fallback task is skipped.

If `named_port_result.stdout` is an empty string and the `first_port_result` task was skipped (line 43's condition not met because `named_port_result.stdout != ""`), then `first_port_result.stdout` may be undefined, causing `final_port` to resolve incorrectly.

However, reviewing the logic: if `named_port_result.stdout == ""`, line 43's `when` condition is true, so `first_port_result` will run. The current logic appears correct, but the complex conditional on line 47 could be simplified for clarity.

Suggested simplification:

- name: Set final port number
  set_fact:
-   final_port: "{{ named_port_result.stdout if named_port_result.stdout != '' else first_port_result.stdout | default('9090') }}"
+   final_port: "{{ named_port_result.stdout | default(first_port_result.stdout | default('9090'), true) }}"

Using Jinja2's `default` filter with `true` as the second argument handles both undefined and empty string cases.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@projects/cluster/toolbox/cluster_capture_servicemonitor_metrics/tasks/servicemonitor.yml` around lines 45 - 47, The current set_fact for final_port uses a complex conditional that can be simplified and made safe for undefined or empty values: update the final_port assignment (set_fact final_port) to prefer named_port_result.stdout, then fall back to first_port_result.stdout, and finally to '9090' using Jinja2's default filter with the second argument true to treat empty strings as undefined (e.g. use "{{ named_port_result.stdout | default(first_port_result.stdout | default('9090', true), true) }}"), referencing the final_port variable and the named_port_result and first_port_result task results.
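The precedence the reviewer suggests (named port, else first port, else '9090', with empty strings treated as missing) can be sketched outside Ansible. This is a plain-Python illustration of how Jinja2's `default(..., true)` behaves; `resolve_port` is a hypothetical helper name, not a function in the role:

```python
def resolve_port(named_port, first_port, fallback="9090"):
    """Mimic Jinja2's default(..., true): both None (undefined) and ""
    (empty stdout) fall through to the next candidate."""
    for candidate in (named_port, first_port):
        if candidate:  # skips None and ""
            return candidate
    return fallback

# Named port wins when present; empty string falls back to the first port;
# when both probes came back empty, the hardcoded fallback is used.
port = resolve_port("", "8080")
```

The key point is the boolean second argument: `default(x)` alone only replaces *undefined* values, so an empty-but-defined `stdout` would slip through without `true`.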
12-24: Hardcoded label selectors reduce reusability.

The label selectors `app.kubernetes.io/component=llminferenceservice-router-scheduler` and `app.kubernetes.io/part-of=llminferenceservice` are hardcoded for a specific use case. Consider parameterizing these if ServiceMonitor capture needs to support other service types in the future.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@projects/cluster/toolbox/cluster_capture_servicemonitor_metrics/tasks/servicemonitor.yml` around lines 12 - 24, the three tasks ("Capture service target by ServiceMonitor selector", "Capture service target by ServiceMonitor selector YAML", and "Get target service name") use a hardcoded label selector string; parameterize that selector (e.g., introduce a variable like service_label_selector or serviceMonitor_label_selector with a sensible default) and replace the literal selector in each oc command with the new variable, ensuring the variable is documented/defaulted so other service types can be targeted without editing the task bodies; keep the existing variables target_namespace, artifact_extra_logs_dir and the register target_service_name unchanged.

projects/cluster/toolbox/cluster_capture_servicemonitor_metrics/tasks/podmonitor.yml (1)
12-23: Hardcoded label selectors match the servicemonitor.yml pattern.

Similar to servicemonitor.yml, these label selectors are hardcoded for the llminferenceservice workload types. This is consistent but limits reusability for other PodMonitor scenarios.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@projects/cluster/toolbox/cluster_capture_servicemonitor_metrics/tasks/podmonitor.yml` around lines 12 - 23, the three tasks named "Capture pod targets by PodMonitor selector", "Capture pod targets by PodMonitor selector YAML" and "Get all target pod IPs and names" use a hardcoded label selector string for llminferenceservice; replace that hardcoded selector with a reusable variable (e.g., podmonitor_label_selector or podmonitor_labels) and reference the variable in each oc get command, and provide a sensible default value matching the current selector so existing behavior is preserved; update any documentation/vars file where defaults live and ensure the variable is used consistently across those tasks.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@docs/toolbox.generated/Cluster.capture_servicemonitor_metrics.rst`:
- Around line 33-37: The docstring for the capture_frequency parameter is
incorrect — update the docstring for capture_frequency inside the
capture_servicemonitor_metrics method (in projects/cluster/toolbox/cluster.py)
to state the correct default value of 60 seconds (replace "default: 15" with
"default: 60") and ensure the inline description reflects "How often to capture
metrics in seconds (default: 60)"; after that regenerate the docs so
docs/toolbox.generated/Cluster.capture_servicemonitor_metrics.rst and the
Ansible defaults output reflect the corrected default.
In
`@projects/cluster/toolbox/cluster_capture_servicemonitor_metrics/defaults/main/config.yml`:
- Around line 13-14: The inline comment for
cluster_capture_servicemonitor_metrics_capture_frequency says "(default: 15)"
but the actual default is 60; update the comment here and the corresponding
docstring in cluster.py (the docstring/source for
cluster_capture_servicemonitor_metrics_capture_frequency around the docstring at
the earlier definition) so both state "(default: 60)" to keep docs and code
consistent.
In
`@projects/cluster/toolbox/cluster_capture_servicemonitor_metrics/tasks/main.yml`:
- Around line 56-60: The "Check if auth secret exists" task uses "oc get secret
..." and registers auth_secret_exists but will abort the play when the secret is
missing; change the task so the command does not fail the play by adding either
failed_when: false (and optionally changed_when: false) or ignore_errors: true
to the task that runs "oc get secret {{ auth_secret_name }} -n {{
target_namespace }} --no-headers -o name", so auth_secret_exists.rc can be
inspected later; apply the same change to the other similar probe tasks that
register values (the tasks that set auth_secret_exists and the equivalent checks
in the 62-80 range) to make the "Secret exists: no" path reachable.
- Around line 20-27: The finalize probe currently fails if the capture pod
doesn't exist; make the pod-existence check non-fatal and only run the log
capture when the pod is present by adding failed_when: false (and optionally
changed_when: false) to the "Check if pod exists for finalization" task that
registers pod_exists_check, and guard the "Capture pod logs" task with a when
that checks pod_exists_check (e.g., pod_exists_check.stdout is defined and not
empty or pod_exists_check.rc == 0). Reference the task names "Check if pod
exists for finalization" and "Capture pod logs" and the variables
cluster_capture_servicemonitor_metrics_service_name, target_namespace, and
artifact_extra_logs_dir when making the changes.
- Around line 46-49: The current set_fact in tasks/main.yml unconditionally
overwrites auth_secret_name with an empty string when auth_secret_name_cmd is
undefined, clearing PodMonitor-registered secrets; change the set_fact logic so
it only assigns auth_secret_name when auth_secret_name_cmd is defined and
non-empty (e.g., keep existing auth_secret_name otherwise), ensuring the task
that sets auth_secret_name (from
projects/cluster/toolbox/cluster_capture_servicemonitor_metrics/tasks/podmonitor.yml)
is not clobbered and subsequent secret handling, mounting, and bearer auth
remain intact.
In
`@projects/cluster/toolbox/cluster_capture_servicemonitor_metrics/templates/metrics_capture_pod.yaml.j2`:
- Around line 61-67: The template hardcodes /var/run/secrets/auth/token and must
instead use the secret key provided by the tasks; update the
ServiceMonitor/PodMonitor extractor tasks (referenced in
projects/cluster/toolbox/cluster_capture_servicemonitor_metrics/tasks/servicemonitor.yml
and podmonitor.yml) to also capture the authorization secret key (e.g.,
auth_secret_key) and pass it into the template, then change the template logic
around fetch_metrics and the second block (both the 61–67 and 83–93 areas) to
derive the file path from that key (e.g., /var/run/secrets/auth/{{
auth_secret_key }}) rather than using the literal "token" so fetch_metrics is
called with the actual mounted key file when present.
In `@projects/cluster/toolbox/cluster.py`:
- Line 564: The docstring for the function that defines the capture_frequency
parameter (the function with parameter capture_frequency=60) incorrectly states
"(default: 15)"; update the docstring to reflect the actual default of 60 by
changing the text to "(default: 60)" wherever capture_frequency is documented
(e.g., in the docstring block around the function that includes
capture_frequency) so the documentation matches the function signature.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: c84a89bf-7c83-4b96-9888-094f17d0f6b5
📒 Files selected for processing (11)
- docs/toolbox.generated/Cluster.capture_servicemonitor_metrics.rst
- docs/toolbox.generated/index.rst
- projects/cluster/toolbox/cluster.py
- projects/cluster/toolbox/cluster_capture_servicemonitor_metrics/defaults/main/config.yml
- projects/cluster/toolbox/cluster_capture_servicemonitor_metrics/tasks/main.yml
- projects/cluster/toolbox/cluster_capture_servicemonitor_metrics/tasks/podmonitor.yml
- projects/cluster/toolbox/cluster_capture_servicemonitor_metrics/tasks/servicemonitor.yml
- projects/cluster/toolbox/cluster_capture_servicemonitor_metrics/templates/metrics_capture_pod.yaml.j2
- projects/llm-d/testing/config.yaml
- projects/llm-d/testing/test_llmd.py
- projects/llm-d/visualizations/llmd_inference/store/parsers.py
``capture_frequency``

* How often to capture metrics in seconds (default: 15)
* default value: ``60``
Same documentation mismatch: "default: 15" but actual value is 60.
This inconsistency originates from the docstring in projects/cluster/toolbox/cluster.py line 564. Fix the source docstring to correct both this RST file and the Ansible defaults file upon regeneration.
# How often to capture metrics in seconds (default: 15)
cluster_capture_servicemonitor_metrics_capture_frequency: 60
Documentation mismatch: Comment says "default: 15" but actual default is 60.
The comment on line 13 states (default: 15) but the actual default value on line 14 is 60. This inconsistency is propagated from the source docstring in projects/cluster/toolbox/cluster.py line 564.
Proposed fix
-# How often to capture metrics in seconds (default: 15)
+# How often to capture metrics in seconds (default: 60)
 cluster_capture_servicemonitor_metrics_capture_frequency: 60
cluster_capture_servicemonitor_metrics_capture_frequency: 60🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In
`@projects/cluster/toolbox/cluster_capture_servicemonitor_metrics/defaults/main/config.yml`
around lines 13 - 14, The inline comment for
cluster_capture_servicemonitor_metrics_capture_frequency says "(default: 15)"
but the actual default is 60; update the comment here and the corresponding
docstring in cluster.py (the docstring/source for
cluster_capture_servicemonitor_metrics_capture_frequency around the docstring at
the earlier definition) so both state "(default: 60)" to keep docs and code
consistent.
- name: Check if pod exists for finalization
  shell: |
    oc get pod topsail-metrics-capture-{{ cluster_capture_servicemonitor_metrics_service_name }} -n {{ target_namespace }} --no-headers -o name
  register: pod_exists_check

- name: Capture pod logs
  shell: |
    oc logs topsail-metrics-capture-{{ cluster_capture_servicemonitor_metrics_service_name }} -n {{ target_namespace }} > "{{ artifact_extra_logs_dir }}/artifacts/metrics_capture_logs.txt"
Make finalize mode idempotent when the capture pod never existed.
stop_metrics_capture() runs from a finally: block in projects/llm-d/testing/test_llmd.py, so this role also executes after early startup failures. Right now oc get pod ... and oc logs ... fail hard when the pod is absent, which turns best-effort teardown into a secondary exception. Make the probe non-fatal and only fetch logs when the pod is there.
Suggested fix
- name: Check if pod exists for finalization
shell: |
oc get pod topsail-metrics-capture-{{ cluster_capture_servicemonitor_metrics_service_name }} -n {{ target_namespace }} --no-headers -o name
register: pod_exists_check
+ failed_when: false
+ changed_when: false
- name: Capture pod logs
shell: |
oc logs topsail-metrics-capture-{{ cluster_capture_servicemonitor_metrics_service_name }} -n {{ target_namespace }} > "{{ artifact_extra_logs_dir }}/artifacts/metrics_capture_logs.txt"
+ when: pod_exists_check.rc == 0
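The `failed_when: false` pattern recommended above amounts to running a command, suppressing its failure, and branching on the return code afterwards. A minimal Python analogue under stated assumptions (`probe` is an illustrative helper, and the real role would probe `oc get pod ...` with `oc` on PATH):

```python
import subprocess

def probe(cmd):
    """Best-effort existence check: never raise on a nonzero exit code.

    check=False plays the role of Ansible's `failed_when: false`; the
    caller inspects the return code, like the `when: ...rc == 0` guard.
    """
    result = subprocess.run(cmd, capture_output=True, text=True, check=False)
    return result.returncode == 0, result.stdout.strip()

# Stand-ins for "pod exists" / "pod missing"; the role would run
#   oc get pod topsail-metrics-capture-<service> -n <ns> --no-headers -o name
exists, _ = probe(["true"])
missing, _ = probe(["sh", "-c", "exit 1"])
```

Only when the probe reports success would the follow-up step (here, dumping pod logs) run, which keeps the finalize path idempotent after early startup failures.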
# Ensure auth_secret_name is always defined with proper structure (fallback for edge cases)
- name: Set default auth secret name if not defined
  set_fact:
    auth_secret_name: "{% if auth_secret_name_cmd is defined %}{{ auth_secret_name_cmd.stdout }}{% endif %}"
This normalization step clears PodMonitor auth secrets.
The ServiceMonitor path registers auth_secret_name_cmd, but projects/cluster/toolbox/cluster_capture_servicemonitor_metrics/tasks/podmonitor.yml:72-75 registers auth_secret_name. When the PodMonitor branch runs, this set_fact rewrites the resolved secret name to "", so the later secret handling, mount, and bearer auth are all skipped.
Suggested fix
-# Ensure auth_secret_name is always defined with proper structure (fallback for edge cases)
-- name: Set default auth secret name if not defined
- set_fact:
- auth_secret_name: "{% if auth_secret_name_cmd is defined %}{{ auth_secret_name_cmd.stdout }}{% endif %}"
+# Normalize the extracted auth secret name from either monitor path
+- name: Normalize auth secret name
+ set_fact:
+ auth_secret_name: >-
+ {{ auth_secret_name_cmd.stdout if auth_secret_name_cmd is defined
else (auth_secret_name.stdout if auth_secret_name is defined else '') }}
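The non-clobbering normalization the reviewer asks for boils down to "only overwrite when the new value is defined and non-empty". A hypothetical Python rendering of that precedence (function and parameter names are illustrative, not part of the role):

```python
def normalize_auth_secret(current, from_servicemonitor_cmd):
    """Prefer the ServiceMonitor-extracted name, but never erase an
    already-registered (e.g. PodMonitor) secret with an empty result."""
    if from_servicemonitor_cmd:          # defined and non-empty wins
        return from_servicemonitor_cmd.strip()
    return current or ""                 # keep the existing value, else empty
```

With this precedence, a PodMonitor-registered secret survives when the ServiceMonitor command variable is undefined or empty, so the later secret handling, mount, and bearer auth still run.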
- name: Check if auth secret exists
  shell: |
    oc get secret {{ auth_secret_name }} -n {{ target_namespace }} --no-headers -o name
  register: auth_secret_exists
  when: auth_secret_name != ""
The auth-secret probe is still fatal.
The next task branches on auth_secret_exists.rc, but oc get secret ... will already fail the play when the secret is absent. That makes the "Secret exists: no" path below unreachable and turns what looks like an informational probe into an early abort.
Suggested fix
- name: Check if auth secret exists
shell: |
oc get secret {{ auth_secret_name }} -n {{ target_namespace }} --no-headers -o name
register: auth_secret_exists
when: auth_secret_name != ""
+ failed_when: false
+ changed_when: false

Also applies to: 62-80
{% if auth_secret_name %}
if [ -f /var/run/secrets/auth/token ]; then
  TOKEN=$(cat /var/run/secrets/auth/token)
  fetch_metrics "$url" "$TOKEN"
else
  echo "Auth token file not found"
fi
Don't hard-code the auth secret filename to token.
This template only receives a secret name, not the selected credential key. With the current /var/run/secrets/auth/token read, any ServiceMonitor/PodMonitor whose authorization secret uses a different key will mount successfully but still scrape unauthenticated. The extractor tasks only capture .name today in projects/cluster/toolbox/cluster_capture_servicemonitor_metrics/tasks/servicemonitor.yml:73-76 and projects/cluster/toolbox/cluster_capture_servicemonitor_metrics/tasks/podmonitor.yml:72-75. Pass the key through as well and derive the file path from it.
Also applies to: 83-93
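Deriving the mounted file path from the monitor's credential key, instead of hardcoding `token`, is a one-liner. This sketch assumes the tasks are extended to extract a variable like `auth_secret_key` (they do not today):

```python
from pathlib import Path

AUTH_MOUNT_DIR = "/var/run/secrets/auth"

def token_file(auth_secret_key):
    """Secret keys are mounted as files named after the key; 'token' is
    only a conventional default, not guaranteed by the monitor spec."""
    return Path(AUTH_MOUNT_DIR) / (auth_secret_key or "token")
```

In the Jinja2 template this would correspond to reading `/var/run/secrets/auth/{{ auth_secret_key }}` so that `fetch_metrics` receives the actual mounted key file when present.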
Args:
    service_name: Name of the service to capture ServiceMonitor/PodMonitor metrics for
    namespace: Namespace where the service and ServiceMonitor/PodMonitor are located (empty string auto-detects current namespace)
    capture_frequency: How often to capture metrics in seconds (default: 15)
Docstring default value is incorrect: says 15 but actual default is 60.
The docstring states (default: 15) but the method signature on line 553 shows capture_frequency=60. This is the root cause of the documentation inconsistencies in the generated files.
Proposed fix
- capture_frequency: How often to capture metrics in seconds (default: 15)
+ capture_frequency: How often to capture metrics in seconds (default: 60)

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
- capture_frequency: How often to capture metrics in seconds (default: 15)
+ capture_frequency: How often to capture metrics in seconds (default: 60)
🔴 Test of 'llm-d test test_ci' failed after 00 hours 34 minutes 18 seconds. 🔴 • Link to the test results. • No reports index generated... Test configuration: Failure indicator: Empty. (See run.log)
/test jump-ci llm-d cks multi-flavor guidellm_heterogeneous_eval
🟢 Test of 'llm-d test test_ci' succeeded after 03 hours 25 minutes 12 seconds. 🟢 • Link to the test results. • Link to the reports index. Test configuration:
Force-pushed from 3c414e9 to 03fbb4c (Compare)
/test jump-ci llm-d cks multi-flavor guidellm_heterogeneous_eval gpt-oss
Force-pushed from 03fbb4c to 3175abb (Compare)
/test jump-ci llm-d cks multi-flavor guidellm_heterogeneous_eval gpt-oss
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@projects/llm-d/testing/config.yaml`:
- Around line 297-300: The run_guidellm_benchmark call currently uses hardcoded
image defaults; update test_llmd.py so run_guidellm_benchmark reads the nested
config values using
config.project.get_config("tests.llmd.benchmarks.guidellm.image.name") and
config.project.get_config("tests.llmd.benchmarks.guidellm.image.version")
(assign to image_name and image_version) and pass them into run.run_toolbox via
the image= and version= parameters (alongside existing args like endpoint_url,
name, namespace, timeout, guidellm_args, run_as_root, artifact_dir_suffix) so
the config-driven image/version override the defaults.
- Around line 51-53: The gateway.name in config
(tests.llmd.inference_service.gateway.name) must match an actual
LLMInferenceService.status.addresses[*].name or the test will raise
RuntimeError—update the config value to an existing address name or add a
pre-check in the test to validate existence; ensure monitor names used by
capture_servicemonitor_metrics are validated before invoking oc get to avoid
capture failures; and fix the image schema mismatch by either flattening config
keys to benchmarks.guidellm.image and benchmarks.guidellm.version or change
run_guidellm_benchmark to read the nested benchmarks.guidellm.image.name and
benchmarks.guidellm.image.version and pass them into the toolbox call instead of
the hardcoded defaults (image="ghcr.io/albertoperdomo2/guidellm",
version="nightly").
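Reading the nested keys this comment refers to (tests.llmd.benchmarks.guidellm.image.name and .version) amounts to a dotted-path lookup with a fallback. A generic sketch that assumes nothing about the real config.project.get_config API (`get_nested` is an illustrative helper):

```python
def get_nested(config, dotted_key, default=None):
    """Walk a dict along a dotted key; return default when any segment is missing."""
    node = config
    for part in dotted_key.split("."):
        if not isinstance(node, dict) or part not in node:
            return default
        node = node[part]
    return node

# Toy config mirroring the nested schema discussed in the review.
cfg = {"tests": {"llmd": {"benchmarks": {"guidellm": {
    "image": {"name": "ghcr.io/albertoperdomo2/guidellm", "version": "nightly"},
}}}}}
image_name = get_nested(cfg, "tests.llmd.benchmarks.guidellm.image.name")
```

The point of the fix is to pass such config-resolved values into the toolbox call so they override the hardcoded image/version defaults.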
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 474e4624-b358-4d5c-a103-edfd7162d477
📒 Files selected for processing (5)
- docs/toolbox.generated/Llmd.run_guidellm_benchmark.rst
- projects/llm-d/testing/config.yaml
- projects/llm-d/testing/test_llmd.py
- projects/llm-d/toolbox/llmd.py
- projects/llm-d/toolbox/llmd_run_guidellm_benchmark/defaults/main/config.yml
✅ Files skipped from review due to trivial changes (3)
- projects/llm-d/toolbox/llmd.py
- docs/toolbox.generated/Llmd.run_guidellm_benchmark.rst
- projects/llm-d/toolbox/llmd_run_guidellm_benchmark/defaults/main/config.yml
🚧 Files skipped from review as they are similar to previous changes (1)
- projects/llm-d/testing/test_llmd.py
🟢 Test of 'llm-d test test_ci' succeeded after 03 hours 29 minutes 19 seconds. 🟢 • Link to the test results. • Link to the reports index. Test configuration:
/test jump-ci llm-d psap_h200 multi-flavor guidellm_heterogeneous_eval gpt-oss
🔴 Test of 'llm-d test test_ci' failed after 02 hours 26 minutes 23 seconds. 🔴 • Link to the test results. • Link to the reports index. Test configuration: Failure indicator: Empty. (See run.log)
@kpouget: The following test failed.

Full PR test history. Your PR dashboard.

Details: Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.