Conversation
[APPROVALNOTIFIER] This PR is NOT APPROVED.

This pull-request has been approved by:
The full list of commands accepted by this bot can be found here.

Details: Needs approval from an approver in each of these files. Approvers can indicate their approval by writing …
📝 Walkthrough
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Test as Test Framework
    participant Cluster as Cluster API
    participant Ansible as Ansible Role
    participant K8s as Kubernetes API
    participant Pod as Metrics Capture Pod
    Test->>Cluster: capture_servicemonitor_metrics(service_name, namespace)
    Cluster->>Ansible: RunAnsibleRole(cluster_capture_servicemonitor_metrics)
    Ansible->>K8s: oc project (determine namespace)
    Ansible->>K8s: Get ServiceMonitor/PodMonitor YAML
    Ansible->>K8s: Query matching services/pods
    Ansible->>K8s: Extract endpoint config (scheme, port)
    Ansible->>K8s: Build metrics URLs
    Ansible->>K8s: Create metrics capture pod (pod.yaml)
    K8s->>Pod: Deploy pod with metrics URLs
    Pod->>Pod: Loop: fetch metrics via curl (every capture_frequency)
    Pod->>Pod: Apply auth token if available
    Ansible->>K8s: Wait for pod Ready state
    Test->>Cluster: capture_servicemonitor_metrics(..., finalize=True)
    Cluster->>Ansible: RunAnsibleRole (finalize mode)
    Ansible->>K8s: Get pod logs
    Ansible->>K8s: Delete metrics capture pod
```
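The diagram shows the capture pod looping over metrics URLs and attaching an auth token only when one is available. As a minimal Python sketch of that header-building step (the helper name and header layout are hypothetical; the real pod runs an equivalent shell loop with curl):

```python
from pathlib import Path

def build_fetch_headers(token_path="/var/run/secrets/auth/token"):
    """Return HTTP headers for one metrics fetch, adding a Bearer token
    only when the mounted token file exists (hypothetical helper that
    mirrors the capture pod's "apply auth token if available" step)."""
    headers = {"Accept": "text/plain"}
    token_file = Path(token_path)
    if token_file.is_file():
        headers["Authorization"] = f"Bearer {token_file.read_text().strip()}"
    return headers
```

When no Secret is mounted, the fetch simply proceeds unauthenticated instead of failing.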
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs
🚥 Pre-merge checks: ✅ 2 passed | ❌ 1 failed (inconclusive)
Actionable comments posted: 9
🧹 Nitpick comments (2)
projects/cluster/toolbox/cluster.py (1)

561-565: Fix the documented default for `capture_frequency`.

The function default is 60, but the docstring still says `default: 15`, and the generated RST now shows both values.

Proposed fix:

```diff
- capture_frequency: How often to capture metrics in seconds (default: 15)
+ capture_frequency: How often to capture metrics in seconds (default: 60)
```

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@projects/cluster/toolbox/cluster.py` around lines 561 - 565: Update the docstring for the parameter capture_frequency to match the function default (60 seconds) — currently it incorrectly says "default: 15". Locate the docstring block in projects/cluster/toolbox/cluster.py that documents capture_frequency (the function where the parameter capture_frequency is defined) and change the text "default: 15" to "default: 60" so generated RST and docs reflect the actual default.

projects/cluster/toolbox/cluster_capture_servicemonitor_metrics/tasks/main.yml (1)

56-60: Secret existence check fails the play if secret is missing.

If `auth_secret_name` is set but the secret doesn't exist in the namespace, `oc get secret` returns non-zero and the play fails. If this is a valid scenario (e.g., optional auth), add error handling.

Proposed fix to handle missing secrets gracefully:

```diff
 - name: Check if auth secret exists
   shell: |
     oc get secret {{ auth_secret_name }} -n {{ target_namespace }} --no-headers -o name
   register: auth_secret_exists
+  failed_when: false
   when: auth_secret_name != ""
```

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@projects/cluster/toolbox/cluster_capture_servicemonitor_metrics/tasks/main.yml` around lines 56 - 60, The "Check if auth secret exists" shell task fails the play when the secret is missing; change the task (named "Check if auth secret exists") to treat a non-zero oc exit code as non-fatal by adding failed_when: false (or ignore_errors: true) so the play continues, keep register: auth_secret_exists, and then use auth_secret_exists.rc or auth_secret_exists.stdout to conditionally act later (e.g., only create resources when auth_secret_exists.rc == 0) so missing optional secrets are handled gracefully.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In
`@projects/cluster/toolbox/cluster_capture_servicemonitor_metrics/defaults/main/config.yml`:
- Around line 13-14: The inline comment for
cluster_capture_servicemonitor_metrics_capture_frequency says "(default: 15)"
but the variable is set to 60; make them consistent by updating the comment to
"(default: 60)" to match
cluster_capture_servicemonitor_metrics_capture_frequency, or if you intended the
default to be 15, change the variable value to 15 instead—adjust the comment and
value together so cluster_capture_servicemonitor_metrics_capture_frequency and
its descriptive comment match.
In
`@projects/cluster/toolbox/cluster_capture_servicemonitor_metrics/tasks/main.yml`:
- Around line 20-27: The "Check if pod exists for finalization" task should not
fail the play when the pod is absent: add failed_when: false (and optionally
changed_when: false) to that task so oc get returns won't abort the run, and
then guard the "Capture pod logs" task with when: pod_exists_check.rc == 0 (or
when: pod_exists_check.stdout != '' ) so logs are only collected if the pod
actually exists; reference the task names "Check if pod exists for finalization"
and "Capture pod logs" and the registered variable pod_exists_check when making
these changes.
- Around line 17-35: The "Delete metrics capture pod" task currently runs
unconditionally; ensure it only runs in finalize mode by either moving the task
into the existing finalize block (between the "Capture pod logs" task and the
meta: end_play) or by adding the when:
cluster_capture_servicemonitor_metrics_finalize guard to the "Delete metrics
capture pod" task; reference the task name "Delete metrics capture pod", the
condition variable cluster_capture_servicemonitor_metrics_finalize, and the
finalize block (meta: end_play) when making the change so the delete only
executes during finalization.
In
`@projects/cluster/toolbox/cluster_capture_servicemonitor_metrics/tasks/podmonitor.yml`:
- Around line 12-23: The tasks "Capture pod targets by PodMonitor selector",
"Capture pod targets by PodMonitor selector YAML" and "Get all target pod IPs
and names" currently use a hardcoded component list; change them to derive the
Pod label selector from the actual PodMonitor resource instead. Query the
PodMonitor in the target_namespace (e.g. oc get podmonitor <name> -n {{
target_namespace }} -o yaml) and extract .spec.selector.matchLabels (or convert
matchExpressions to label selector) into a variable, then use that label
selector in the oc get pod -l "<derived-selector>" commands (and in the -oyaml
and custom-columns calls) instead of the hardcoded app.kubernetes.io/component
list; ensure you reference the same selector encoding used by existing ISVC
lookups (app.kubernetes.io/name when applicable) and add a safe fallback to the
previous selector only if the PodMonitor has no selector.
In
`@projects/cluster/toolbox/cluster_capture_servicemonitor_metrics/tasks/servicemonitor.yml`:
- Around line 12-24: The tasks currently hardcode the scheduler labels and use
head -1 which can pick an arbitrary Service; instead, read the ServiceMonitor
named by the incoming monitor variable (the ServiceMonitor resource referenced
by the task that dumps YAML/status) and extract its spec.selector.matchLabels,
then use those labels to query Services to determine the target; replace the
hardcoded label selector in the "Capture service target by ServiceMonitor
selector" and "Get target service name" tasks with a step that fetches the
ServiceMonitor, builds the label selector string from spec.selector.matchLabels,
uses that selector for oc get service, and change the "Get target service name"
registration (target_service_name) to fail if zero matches or to
deterministically select a single match (e.g., error on multiple matches or pick
the service with a specific annotation/port) instead of using head -1.
In
`@projects/cluster/toolbox/cluster_capture_servicemonitor_metrics/templates/metrics_capture_pod.yaml.j2`:
- Around line 61-67: The template currently hardcodes reading
/var/run/secrets/auth/token and mounts the whole Secret; update the flow to
extract authorization.credentials.key from the ServiceMonitor/PodMonitor and
pass it into the template (e.g. as auth_secret_key), change the Secret volume
mount in the Pod template to use items and mount only that key (so the secret
key filename equals the passed-in key name), and update the runtime logic (the
fetch_metrics invocation/if block using auth_secret_name and fetch_metrics) to
read the token from /var/run/secrets/auth/<auth_secret_key> (or use a variable
TOKEN_FILE constructed from auth_secret_key) before calling fetch_metrics so the
correct key is mounted and read when credentials.key is not "token".
In `@projects/cluster/toolbox/cluster.py`:
- Around line 553-566: The function capture_servicemonitor_metrics currently
takes a parameter named service_name but treats it as the monitor resource name;
either rename the parameter to monitor_name (update the signature, docstring,
and all call sites where capture_servicemonitor_metrics is invoked) to reflect
that it expects a ServiceMonitor/PodMonitor name, or add logic inside
capture_servicemonitor_metrics to accept an actual Service name: when the
provided name is not found as a ServiceMonitor/PodMonitor, query the Service (oc
get service <name>) and locate the matching ServiceMonitor/PodMonitor by
selector/labels and use its name; update the docstring to state the accepted
inputs and adjust callers accordingly (refer to capture_servicemonitor_metrics
and parameter service_name).
In `@projects/llm-d/testing/test_llmd.py`:
- Around line 1019-1024: The helper functions start_metrics_capture and
stop_metrics_capture currently return when
tests.llmd.inference_service.metrics.manual_capture is false, which
inadvertently disables the automatic toolbox capture under shipped configs;
change the guard so the functions return only when manual_capture is true (i.e.,
invert the boolean check), and apply the same fix to both start_metrics_capture
and stop_metrics_capture so the automatic capture runs unless manual_capture is
explicitly enabled.
In `@projects/llm-d/visualizations/llmd_inference/store/parsers.py`:
- Around line 34-35: The important-file globs use
artifact_dirnames.GUIDELLM_BENCHMARK_DIR (exact match) while
find_guidellm_benchmark_directories() discovers with a wildcard suffix; update
the two globs that build
f"{artifact_dirnames.GUIDELLM_BENCHMARK_DIR}/artifacts/results/benchmarks.json"
and
f"{artifact_dirnames.GUIDELLM_BENCHMARK_DIR}/artifacts/guidellm_benchmark_job.logs"
to align with discovery by appending a wildcard to the benchmark dir portion
(e.g., use GUIDELLM_BENCHMARK_DIR + "*" or equivalent) so multi-rate/suffixed
directories are matched and registered/cached consistently.
---
Nitpick comments:
In
`@projects/cluster/toolbox/cluster_capture_servicemonitor_metrics/tasks/main.yml`:
- Around line 56-60: The "Check if auth secret exists" shell task fails the play
when the secret is missing; change the task (named "Check if auth secret
exists") to treat a non-zero oc exit code as non-fatal by adding failed_when:
false (or ignore_errors: true) so the play continues, keep register:
auth_secret_exists, and then use auth_secret_exists.rc or
auth_secret_exists.stdout to conditionally act later (e.g., only create
resources when auth_secret_exists.rc == 0) so missing optional secrets are
handled gracefully.
In `@projects/cluster/toolbox/cluster.py`:
- Around line 561-565: Update the docstring for the parameter capture_frequency
to match the function default (60 seconds) — currently it incorrectly says
"default: 15". Locate the docstring block in projects/cluster/toolbox/cluster.py
that documents capture_frequency (the function where the parameter
capture_frequency is defined) and change the text "default: 15" to "default: 60"
so generated RST and docs reflect the actual default.
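The parsers.py finding above (the GUIDELLM_BENCHMARK_DIR globs) can be illustrated with a small Python sketch. The directory name and glob layout below are hypothetical stand-ins for the real constants; the point is that appending a wildcard to the directory portion makes the important-file globs match the same suffixed directories that discovery finds:

```python
from fnmatch import fnmatch

# Hypothetical constant mirroring artifact_dirnames.GUIDELLM_BENCHMARK_DIR.
GUIDELLM_BENCHMARK_DIR = "guidellm-benchmark"

def important_file_globs():
    # The "*" after the dir portion matches multi-rate/suffixed
    # directories the same way wildcard discovery does.
    return [
        f"{GUIDELLM_BENCHMARK_DIR}*/artifacts/results/benchmarks.json",
        f"{GUIDELLM_BENCHMARK_DIR}*/artifacts/guidellm_benchmark_job.logs",
    ]

def is_important(path):
    """Return True when the path matches one of the registered globs."""
    return any(fnmatch(path, glob) for glob in important_file_globs())
```

With the exact-match globs, only the bare directory name would match; with the wildcard, a suffixed run directory is registered too.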
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 071699d5-aa7b-423d-bdef-219be891108d
📒 Files selected for processing (11)
- docs/toolbox.generated/Cluster.capture_servicemonitor_metrics.rst
- docs/toolbox.generated/index.rst
- projects/cluster/toolbox/cluster.py
- projects/cluster/toolbox/cluster_capture_servicemonitor_metrics/defaults/main/config.yml
- projects/cluster/toolbox/cluster_capture_servicemonitor_metrics/tasks/main.yml
- projects/cluster/toolbox/cluster_capture_servicemonitor_metrics/tasks/podmonitor.yml
- projects/cluster/toolbox/cluster_capture_servicemonitor_metrics/tasks/servicemonitor.yml
- projects/cluster/toolbox/cluster_capture_servicemonitor_metrics/templates/metrics_capture_pod.yaml.j2
- projects/llm-d/testing/config.yaml
- projects/llm-d/testing/test_llmd.py
- projects/llm-d/visualizations/llmd_inference/store/parsers.py
```yaml
# How often to capture metrics in seconds (default: 15)
cluster_capture_servicemonitor_metrics_capture_frequency: 60
```

Comment/default value mismatch.

The comment states `(default: 15)` but the actual default is 60. Update either the comment or the value to be consistent.

Proposed fix:

```diff
-# How often to capture metrics in seconds (default: 15)
+# How often to capture metrics in seconds (default: 60)
 cluster_capture_servicemonitor_metrics_capture_frequency: 60
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In
`@projects/cluster/toolbox/cluster_capture_servicemonitor_metrics/defaults/main/config.yml`
around lines 13 - 14, The inline comment for
cluster_capture_servicemonitor_metrics_capture_frequency says "(default: 15)"
but the variable is set to 60; make them consistent by updating the comment to
"(default: 60)" to match
cluster_capture_servicemonitor_metrics_capture_frequency, or if you intended the
default to be 15, change the variable value to 15 instead—adjust the comment and
value together so cluster_capture_servicemonitor_metrics_capture_frequency and
its descriptive comment match.
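This class of comment/default drift can be caught mechanically. A minimal Python sketch (the dummy function below is a stand-in, not the real toolbox code) checks that a "(default: N)" annotation in a docstring agrees with the signature's actual default:

```python
import inspect
import re

def capture_servicemonitor_metrics(service_name, capture_frequency=60):
    """Capture metrics for a monitor (illustrative stand-in function).

    Args:
        capture_frequency: How often to capture metrics in seconds (default: 60)
    """

def docstring_default_matches(func, param):
    """Return True when the '(default: N)' text documented for `param`
    equals the default declared in the function signature."""
    sig_default = inspect.signature(func).parameters[param].default
    doc = inspect.getdoc(func) or ""
    match = re.search(rf"{re.escape(param)}:.*\(default:\s*(\d+)\)", doc)
    return match is not None and int(match.group(1)) == sig_default
```

A check like this could run in a docs-generation step so the RST never ships two conflicting values.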
| - name: "[Finalize mode] capture logs and delete existing pod" | ||
| when: cluster_capture_servicemonitor_metrics_finalize | ||
| block: | ||
| - name: Check if pod exists for finalization | ||
| shell: | | ||
| oc get pod topsail-metrics-capture-{{ cluster_capture_servicemonitor_metrics_service_name }} -n {{ target_namespace }} --no-headers -o name | ||
| register: pod_exists_check | ||
|
|
||
| - name: Capture pod logs | ||
| shell: | | ||
| oc logs topsail-metrics-capture-{{ cluster_capture_servicemonitor_metrics_service_name }} -n {{ target_namespace }} > "{{ artifact_extra_logs_dir }}/artifacts/metrics_capture_logs.txt" | ||
|
|
||
| - name: Delete metrics capture pod | ||
| shell: | | ||
| oc delete pod topsail-metrics-capture-{{ cluster_capture_servicemonitor_metrics_service_name }} -n {{ target_namespace }} --grace-period=0 --ignore-not-found | ||
|
|
||
| - name: "[Finalize mode] capture logs and delete existing pod" | ||
| when: cluster_capture_servicemonitor_metrics_finalize | ||
| meta: end_play |
Critical: Delete task runs unconditionally outside the finalize block.

The "Delete metrics capture pod" task (lines 29-31) is outside the block structure and has no `when` condition, so it executes in both finalize and normal modes. In normal deployment mode, this will delete any existing capture pod before the play continues to create a new one.

If this is intentional cleanup behavior, move it after the `end_play` guard so it's clear. If it should only run in finalize mode, move it inside the block.

Proposed fix (assuming delete should be finalize-only):

```diff
 - name: "[Finalize mode] capture logs and delete existing pod"
   when: cluster_capture_servicemonitor_metrics_finalize
   block:
   - name: Check if pod exists for finalization
     shell: |
       oc get pod topsail-metrics-capture-{{ cluster_capture_servicemonitor_metrics_service_name }} -n {{ target_namespace }} --no-headers -o name
     register: pod_exists_check
+    failed_when: false

   - name: Capture pod logs
     shell: |
       oc logs topsail-metrics-capture-{{ cluster_capture_servicemonitor_metrics_service_name }} -n {{ target_namespace }} > "{{ artifact_extra_logs_dir }}/artifacts/metrics_capture_logs.txt"
+    when: pod_exists_check.rc == 0

-- name: Delete metrics capture pod
-  shell: |
-    oc delete pod topsail-metrics-capture-{{ cluster_capture_servicemonitor_metrics_service_name }} -n {{ target_namespace }} --grace-period=0 --ignore-not-found
+  - name: Delete metrics capture pod
+    shell: |
+      oc delete pod topsail-metrics-capture-{{ cluster_capture_servicemonitor_metrics_service_name }} -n {{ target_namespace }} --grace-period=0 --ignore-not-found

 - name: "[Finalize mode] capture logs and delete existing pod"
   when: cluster_capture_servicemonitor_metrics_finalize
   meta: end_play
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In
`@projects/cluster/toolbox/cluster_capture_servicemonitor_metrics/tasks/main.yml`
around lines 17 - 35, The "Delete metrics capture pod" task currently runs
unconditionally; ensure it only runs in finalize mode by either moving the task
into the existing finalize block (between the "Capture pod logs" task and the
meta: end_play) or by adding the when:
cluster_capture_servicemonitor_metrics_finalize guard to the "Delete metrics
capture pod" task; reference the task name "Delete metrics capture pod", the
condition variable cluster_capture_servicemonitor_metrics_finalize, and the
finalize block (meta: end_play) when making the change so the delete only
executes during finalization.
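The intended control flow described above can be sketched as a tiny pure-Python model (the task names are taken from the role; the function itself is hypothetical): delete only happens in finalize mode, and logs are captured only when the pod exists.

```python
def plan_finalize_tasks(finalize, pod_exists):
    """Model of the corrected task ordering: returns the list of task
    names that should actually run for a given mode/pod state."""
    tasks = []
    if not finalize:
        # Normal deployment mode: leave any existing capture pod alone.
        return tasks
    if pod_exists:
        tasks.append("Capture pod logs")
    # Safe even when the pod is gone, thanks to --ignore-not-found.
    tasks.append("Delete metrics capture pod")
    return tasks
```

This makes the reviewer's point concrete: in the current YAML, `plan_finalize_tasks(False, True)` would wrongly include the delete.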
```yaml
- name: Check if pod exists for finalization
  shell: |
    oc get pod topsail-metrics-capture-{{ cluster_capture_servicemonitor_metrics_service_name }} -n {{ target_namespace }} --no-headers -o name
  register: pod_exists_check

- name: Capture pod logs
  shell: |
    oc logs topsail-metrics-capture-{{ cluster_capture_servicemonitor_metrics_service_name }} -n {{ target_namespace }} > "{{ artifact_extra_logs_dir }}/artifacts/metrics_capture_logs.txt"
```
Missing error handling in finalize flow.

If the capture pod doesn't exist when finalize mode runs, `oc get pod` (line 22) fails with a non-zero exit code, causing the play to fail. The registered `pod_exists_check` variable is never used to conditionally skip the logs capture.

Consider adding `failed_when: false` to the check and using the result to guard subsequent tasks:

Proposed fix:

```diff
 - name: Check if pod exists for finalization
   shell: |
     oc get pod topsail-metrics-capture-{{ cluster_capture_servicemonitor_metrics_service_name }} -n {{ target_namespace }} --no-headers -o name
   register: pod_exists_check
+  failed_when: false

 - name: Capture pod logs
   shell: |
     oc logs topsail-metrics-capture-{{ cluster_capture_servicemonitor_metrics_service_name }} -n {{ target_namespace }} > "{{ artifact_extra_logs_dir }}/artifacts/metrics_capture_logs.txt"
+  when: pod_exists_check.rc == 0
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In
`@projects/cluster/toolbox/cluster_capture_servicemonitor_metrics/tasks/main.yml`
around lines 20 - 27, The "Check if pod exists for finalization" task should not
fail the play when the pod is absent: add failed_when: false (and optionally
changed_when: false) to that task so oc get returns won't abort the run, and
then guard the "Capture pod logs" task with when: pod_exists_check.rc == 0 (or
when: pod_exists_check.stdout != '' ) so logs are only collected if the pod
actually exists; reference the task names "Check if pod exists for finalization"
and "Capture pod logs" and the registered variable pod_exists_check when making
these changes.
```yaml
- name: Capture pod targets by PodMonitor selector
  shell: |
    oc get pod -l "app.kubernetes.io/component in (llminferenceservice-workload,llminferenceservice-workload-prefill,llminferenceservice-workload-worker,llminferenceservice-workload-leader,llminferenceservice-workload-leader-prefill,llminferenceservice-workload-worker-prefill),app.kubernetes.io/part-of=llminferenceservice" -n {{ target_namespace }} > "{{ artifact_extra_logs_dir }}/artifacts/target_pods.status"

- name: Capture pod targets by PodMonitor selector YAML
  shell: |
    oc get pod -l "app.kubernetes.io/component in (llminferenceservice-workload,llminferenceservice-workload-prefill,llminferenceservice-workload-worker,llminferenceservice-workload-leader,llminferenceservice-workload-leader-prefill,llminferenceservice-workload-worker-prefill),app.kubernetes.io/part-of=llminferenceservice" -n {{ target_namespace }} -oyaml > "{{ artifact_extra_logs_dir }}/artifacts/target_pods.yaml"

- name: Get all target pod IPs and names
  shell: |
    oc get pod -l "app.kubernetes.io/component in (llminferenceservice-workload,llminferenceservice-workload-prefill,llminferenceservice-workload-worker,llminferenceservice-workload-leader,llminferenceservice-workload-leader-prefill,llminferenceservice-workload-worker-prefill),app.kubernetes.io/part-of=llminferenceservice" -n {{ target_namespace }} --no-headers -o custom-columns=":metadata.name,:status.podIP"
  register: target_pods_info
```
Resolve pods from the PodMonitor you were asked to capture.

After fetching the named PodMonitor, the role ignores its selector and scans every llminferenceservice workload pod in the namespace via a hardcoded component list. That will mix captures across services when multiple LLMInferenceServices share a namespace, and it will break as soon as the PodMonitor selector changes. The existing ISVC state capture already keys pod lookup off `app.kubernetes.io/name`; this role should derive its selector from the PodMonitor too.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In
`@projects/cluster/toolbox/cluster_capture_servicemonitor_metrics/tasks/podmonitor.yml`
around lines 12 - 23, The tasks "Capture pod targets by PodMonitor selector",
"Capture pod targets by PodMonitor selector YAML" and "Get all target pod IPs
and names" currently use a hardcoded component list; change them to derive the
Pod label selector from the actual PodMonitor resource instead. Query the
PodMonitor in the target_namespace (e.g. oc get podmonitor <name> -n {{
target_namespace }} -o yaml) and extract .spec.selector.matchLabels (or convert
matchExpressions to label selector) into a variable, then use that label
selector in the oc get pod -l "<derived-selector>" commands (and in the -oyaml
and custom-columns calls) instead of the hardcoded app.kubernetes.io/component
list; ensure you reference the same selector encoding used by existing ISVC
lookups (app.kubernetes.io/name when applicable) and add a safe fallback to the
previous selector only if the PodMonitor has no selector.
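Deriving the `-l` argument from the PodMonitor's `spec.selector`, as the prompt above suggests, amounts to converting `matchLabels`/`matchExpressions` into the string label-selector syntax `oc get pod -l` accepts. A minimal Python sketch of that conversion (the function name is hypothetical):

```python
def build_label_selector(selector):
    """Convert a monitor's spec.selector (matchLabels and
    matchExpressions) into the string form accepted by `-l`."""
    parts = []
    # Equality requirements from matchLabels, sorted for determinism.
    for key, value in sorted((selector.get("matchLabels") or {}).items()):
        parts.append(f"{key}={value}")
    # Set-based requirements from matchExpressions.
    for expr in selector.get("matchExpressions") or []:
        op = expr["operator"]
        if op in ("In", "NotIn"):
            keyword = "in" if op == "In" else "notin"
            values = ",".join(expr["values"])
            parts.append(f"{expr['key']} {keyword} ({values})")
        elif op == "Exists":
            parts.append(expr["key"])
        elif op == "DoesNotExist":
            parts.append(f"!{expr['key']}")
    return ",".join(parts)
```

An empty selector yields an empty string, which is the case where the role would fall back to its previous hardcoded list.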
```yaml
- name: Capture service target by ServiceMonitor selector
  shell: |
    oc get service -l "app.kubernetes.io/component=llminferenceservice-router-scheduler,app.kubernetes.io/part-of=llminferenceservice" -n {{ target_namespace }} > "{{ artifact_extra_logs_dir }}/artifacts/target_service.status"

- name: Capture service target by ServiceMonitor selector YAML
  shell: |
    oc get service -l "app.kubernetes.io/component=llminferenceservice-router-scheduler,app.kubernetes.io/part-of=llminferenceservice" -n {{ target_namespace }} -oyaml > "{{ artifact_extra_logs_dir }}/artifacts/target_service.yaml"

- name: Get target service name
  shell: |
    set -o pipefail;
    oc get service -l "app.kubernetes.io/component=llminferenceservice-router-scheduler,app.kubernetes.io/part-of=llminferenceservice" -n {{ target_namespace }} --no-headers -o custom-columns=":metadata.name" | head -1
  register: target_service_name
```
Derive the target Service from the ServiceMonitor instead of hardcoding the scheduler labels.

These tasks only use the requested monitor name for the YAML/status dump; the actual lookup is always pinned to `llminferenceservice-router-scheduler`. If this command is pointed at any other ServiceMonitor, or if that monitor's selector changes, `target_service_name` and the generated metrics URL will be wrong. `head -1` also makes the selected Service arbitrary when more than one matches.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In
`@projects/cluster/toolbox/cluster_capture_servicemonitor_metrics/tasks/servicemonitor.yml`
around lines 12 - 24, The tasks currently hardcode the scheduler labels and use
head -1 which can pick an arbitrary Service; instead, read the ServiceMonitor
named by the incoming monitor variable (the ServiceMonitor resource referenced
by the task that dumps YAML/status) and extract its spec.selector.matchLabels,
then use those labels to query Services to determine the target; replace the
hardcoded label selector in the "Capture service target by ServiceMonitor
selector" and "Get target service name" tasks with a step that fetches the
ServiceMonitor, builds the label selector string from spec.selector.matchLabels,
uses that selector for oc get service, and change the "Get target service name"
registration (target_service_name) to fail if zero matches or to
deterministically select a single match (e.g., error on multiple matches or pick
the service with a specific annotation/port) instead of using head -1.
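The "fail or select deterministically" behavior requested above can be sketched in Python (function name hypothetical; in the role this would consume the output of the `oc get service` query):

```python
def pick_target_service(names, monitor_name):
    """Deterministically choose the single Service matched by a
    ServiceMonitor selector: error on zero matches, and error loudly on
    ambiguity instead of silently taking the first line like `head -1`."""
    names = [n for n in names if n]  # drop empty lines from CLI output
    if not names:
        raise RuntimeError(
            f"no Service matches the selector of ServiceMonitor {monitor_name}")
    if len(names) > 1:
        raise RuntimeError(
            f"ambiguous Services for ServiceMonitor {monitor_name}: {sorted(names)}")
    return names[0]
```

Failing fast here surfaces a mis-scoped selector immediately, instead of producing a metrics URL that points at an arbitrary Service.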
```sh
{% if auth_secret_name %}
if [ -f /var/run/secrets/auth/token ]; then
  TOKEN=$(cat /var/run/secrets/auth/token)
  fetch_metrics "$url" "$TOKEN"
else
  echo "Auth token file not found"
fi
```
🧩 Analysis chain

🏁 Script executed:

```shell
find projects/cluster/toolbox/cluster_capture_servicemonitor_metrics -name "*.yaml.j2" -o -name "*.yaml"
```

Repository: openshift-psap/topsail
Length of output: 167

🏁 Script executed:

```shell
cat projects/cluster/toolbox/cluster_capture_servicemonitor_metrics/templates/metrics_capture_pod.yaml.j2
```

Repository: openshift-psap/topsail
Length of output: 2801

🏁 Script executed:

```shell
cat projects/cluster/toolbox/cluster_capture_servicemonitor_metrics/templates/metrics_capture_pod.yaml.j2 | head -100
```

Repository: openshift-psap/topsail
Length of output: 2801

🏁 Script executed:

```shell
rg "auth_secret" projects/cluster/toolbox/cluster_capture_servicemonitor_metrics/ -A 2 -B 2
```

Repository: openshift-psap/topsail
Length of output: 7737

🏁 Script executed:

```shell
find projects/cluster/toolbox/cluster_capture_servicemonitor_metrics -name "*.yml" -o -name "*.yaml" | xargs rg "auth_secret" -l
```

Repository: openshift-psap/topsail
Length of output: 318

🏁 Script executed:

```shell
cat projects/cluster/toolbox/cluster_capture_servicemonitor_metrics/defaults/main.yml
```

Repository: openshift-psap/topsail
Length of output: 179

🏁 Script executed:

```shell
cat projects/cluster/toolbox/cluster_capture_servicemonitor_metrics/tasks/servicemonitor.yml
```

Repository: openshift-psap/topsail
Length of output: 4093

🏁 Script executed:

```shell
cat projects/cluster/toolbox/cluster_capture_servicemonitor_metrics/tasks/podmonitor.yml
```

Repository: openshift-psap/topsail
Length of output: 4618

🏁 Script executed:

```shell
cat projects/cluster/toolbox/cluster_capture_servicemonitor_metrics/tasks/main.yml | head -80
```

Repository: openshift-psap/topsail
Length of output: 3716
Thread the secret key name through the role and mount only the selected key.
The template mounts the entire Secret without specifying which key to use (lines 83–94), then hardcodes reading /var/run/secrets/auth/token (lines 61–67). When a monitor's authorization.credentials points to a different key name, this path won't exist and authentication silently fails.
Extract authorization.credentials.key from the ServiceMonitor/PodMonitor alongside the secret name, pass it through to the template, mount only that key using the items field, and read the correct filename.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In
`@projects/cluster/toolbox/cluster_capture_servicemonitor_metrics/templates/metrics_capture_pod.yaml.j2`
around lines 61 - 67, The template currently hardcodes reading
/var/run/secrets/auth/token and mounts the whole Secret; update the flow to
extract authorization.credentials.key from the ServiceMonitor/PodMonitor and
pass it into the template (e.g. as auth_secret_key), change the Secret volume
mount in the Pod template to use items and mount only that key (so the secret
key filename equals the passed-in key name), and update the runtime logic (the
fetch_metrics invocation/if block using auth_secret_name and fetch_metrics) to
read the token from /var/run/secrets/auth/<auth_secret_key> (or use a variable
TOKEN_FILE constructed from auth_secret_key) before calling fetch_metrics so the
correct key is mounted and read when credentials.key is not "token".
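Threading the secret key name through the role, as the prompt above describes, means the Secret volume projection and the token path must both be built from the same `auth_secret_key` value. A small Python sketch of that pairing (the helper and the `auth_secret_key` variable are hypothetical names the role would introduce):

```python
def auth_volume_spec(auth_secret_name, auth_secret_key="token"):
    """Build the Pod's Secret volume definition and the matching token
    path, mounting only the selected key via `items` so the filename
    under /var/run/secrets/auth always equals the key named by the
    monitor's authorization.credentials."""
    volume = {
        "name": "auth",
        "secret": {
            "secretName": auth_secret_name,
            # Project only this key, under its own name.
            "items": [{"key": auth_secret_key, "path": auth_secret_key}],
        },
    }
    token_file = f"/var/run/secrets/auth/{auth_secret_key}"
    return volume, token_file
```

Because both values come from one place, a monitor whose credentials key is not `token` can no longer silently fail authentication.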
```python
def capture_servicemonitor_metrics(self, service_name, namespace="", capture_frequency=60, is_podmonitor=False, finalize=False):
    """
    Captures ServiceMonitor or PodMonitor YAML and status for a given service

    Captures the ServiceMonitor/PodMonitor configuration and status information for
    a specific service in a namespace, including related service/pod and
    endpoints information for troubleshooting monitoring setup.

    Args:
      service_name: Name of the service to capture ServiceMonitor/PodMonitor metrics for
      namespace: Namespace where the service and ServiceMonitor/PodMonitor are located (empty string auto-detects current namespace)
      capture_frequency: How often to capture metrics in seconds (default: 15)
      is_podmonitor: Whether to use PodMonitor instead of ServiceMonitor (default: False)
      finalize: Whether to finalize (capture logs and delete) an existing pod instead of creating new one (default: False)
    """
```
🛠️ Refactor suggestion | 🟠 Major
Rename `service_name` or make the role accept actual Services.
The public command and generated docs say this parameter is a Service name, but the role uses it as `oc get servicemonitor|podmonitor <name>`. Every new caller in this PR therefore passes a monitor resource name instead. As-is, a CLI user following the docs will hit NotFound unless the monitor happens to share the Service name.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@projects/cluster/toolbox/cluster.py` around lines 553 - 566, The function
capture_servicemonitor_metrics currently takes a parameter named service_name
but treats it as the monitor resource name; either rename the parameter to
monitor_name (update the signature, docstring, and all call sites where
capture_servicemonitor_metrics is invoked) to reflect that it expects a
ServiceMonitor/PodMonitor name, or add logic inside
capture_servicemonitor_metrics to accept an actual Service name: when the
provided name is not found as a ServiceMonitor/PodMonitor, query the Service (oc
get service <name>) and locate the matching ServiceMonitor/PodMonitor by
selector/labels and use its name; update the docstring to state the accepted
inputs and adjust callers accordingly (refer to capture_servicemonitor_metrics
and parameter service_name).
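If the role is to accept real Service names, the label-matching fallback could look roughly like the following. This is a sketch under the assumption that monitors and services arrive as parsed `oc get ... -ojson` dicts; `find_monitor_for_service` is a hypothetical helper, not code from the PR.

```python
# Hypothetical fallback: resolve a ServiceMonitor from a Service name by
# matching spec.selector.matchLabels against the Service's labels, following
# the Prometheus Operator CRD structure.

def find_monitor_for_service(service, monitors):
    """Return the name of the first monitor whose matchLabels all appear on the service."""
    labels = service.get("metadata", {}).get("labels", {})
    for monitor in monitors:
        match_labels = monitor["spec"].get("selector", {}).get("matchLabels", {})
        # A monitor selects the service when every matchLabel is present with
        # the same value; an empty selector is skipped rather than matching all.
        if match_labels and all(labels.get(k) == v for k, v in match_labels.items()):
            return monitor["metadata"]["name"]
    return None
```

The role would try `oc get servicemonitor <name>` first and fall back to this resolution only on NotFound, keeping existing callers working.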
```python
def start_metrics_capture(flavor):
    """
    Starts metrics capture for both ServiceMonitor and PodMonitor if enabled
    """
    if not config.project.get_config("tests.llmd.inference_service.metrics.manual_capture"):
        return
```
`manual_capture` currently disables the new automatic capture path.
Both helpers return when `tests.llmd.inference_service.metrics.manual_capture` is false, but this PR sets that flag to false in the base config and in the `cks` preset. So `start_metrics_capture()` / `stop_metrics_capture()` never invoke the new toolbox command under the shipped configs.
Proposed fix
```diff
 def start_metrics_capture(flavor):
@@
-    if not config.project.get_config("tests.llmd.inference_service.metrics.manual_capture"):
+    if config.project.get_config("tests.llmd.inference_service.metrics.manual_capture"):
         return
@@
 def stop_metrics_capture(flavor):
@@
-    if not config.project.get_config("tests.llmd.inference_service.metrics.manual_capture"):
+    if config.project.get_config("tests.llmd.inference_service.metrics.manual_capture"):
         return
```

Also applies to: 1057-1062
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@projects/llm-d/testing/test_llmd.py` around lines 1019 - 1024, The helper
functions start_metrics_capture and stop_metrics_capture currently return when
tests.llmd.inference_service.metrics.manual_capture is false, which
inadvertently disables the automatic toolbox capture under shipped configs;
change the guard so the functions return only when manual_capture is true (i.e.,
invert the boolean check), and apply the same fix to both start_metrics_capture
and stop_metrics_capture so the automatic capture runs unless manual_capture is
explicitly enabled.
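The effect of the inverted guard can be reduced to a pair of predicates (hypothetical helper names, for illustration only): with the shipped `manual_capture: false`, the current check skips capture entirely while the proposed one runs it.

```python
# Hypothetical predicates illustrating the guard inversion; the real code
# returns early from start/stop_metrics_capture instead of returning a bool.

def capture_runs_current(manual_capture):
    # current: `if not manual_capture: return` -> capture only runs when True
    return bool(manual_capture)

def capture_runs_fixed(manual_capture):
    # proposed: `if manual_capture: return` -> capture runs unless the user
    # has opted into handling it manually
    return not manual_capture
```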
```python
f"{artifact_dirnames.GUIDELLM_BENCHMARK_DIR}/artifacts/results/benchmarks.json",
f"{artifact_dirnames.GUIDELLM_BENCHMARK_DIR}/artifacts/guidellm_benchmark_job.logs",
```
Align GuideLLM important-file glob with multi-rate directory discovery.
Line 34 and Line 35 use `artifact_dirnames.GUIDELLM_BENCHMARK_DIR` (`*__llmd__run_guidellm_benchmark`), but discovery in `find_guidellm_benchmark_directories()` uses `*__llmd__run_guidellm_benchmark*`. This mismatch can miss registration/caching for suffixed multi-rate directories.
Proposed fix

```diff
-artifact_dirnames.GUIDELLM_BENCHMARK_DIR = "*__llmd__run_guidellm_benchmark"
+artifact_dirnames.GUIDELLM_BENCHMARK_DIR = "*__llmd__run_guidellm_benchmark*"
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@projects/llm-d/visualizations/llmd_inference/store/parsers.py` around lines
34 - 35, The important-file globs use artifact_dirnames.GUIDELLM_BENCHMARK_DIR
(exact match) while find_guidellm_benchmark_directories() discovers with a
wildcard suffix; update the two globs that build
f"{artifact_dirnames.GUIDELLM_BENCHMARK_DIR}/artifacts/results/benchmarks.json"
and
f"{artifact_dirnames.GUIDELLM_BENCHMARK_DIR}/artifacts/guidellm_benchmark_job.logs"
to align with discovery by appending a wildcard to the benchmark dir portion
(e.g., use GUIDELLM_BENCHMARK_DIR + "*" or equivalent) so multi-rate/suffixed
directories are matched and registered/cached consistently.
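The mismatch is easy to reproduce with `fnmatch`, the same shell-style matching the directory scan is assumed to use (the directory names below are made up for illustration):

```python
from fnmatch import fnmatch

dirs = [
    "003__llmd__run_guidellm_benchmark",
    "003__llmd__run_guidellm_benchmark_rate_10",  # multi-rate suffixed directory
]

exact = "*__llmd__run_guidellm_benchmark"   # current important-file glob
wild = "*__llmd__run_guidellm_benchmark*"   # pattern used by discovery

# The exact pattern only matches the unsuffixed directory; the wildcard
# pattern matches both, which is what discovery already relies on.
matched_exact = [d for d in dirs if fnmatch(d, exact)]
matched_wild = [d for d in dirs if fnmatch(d, wild)]
```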
Summary by CodeRabbit

**New Features**

- New `capture_servicemonitor_metrics` command to capture ServiceMonitor/PodMonitor YAML and metrics for specified services, with configurable capture frequency and options.

**Documentation**