48 changes: 48 additions & 0 deletions docs/toolbox.generated/Cluster.capture_servicemonitor_metrics.rst
@@ -0,0 +1,48 @@
:orphan:

..
_Auto-generated file, do not edit manually ...
_Toolbox generate command: repo generate_toolbox_rst_documentation
_ Source component: Cluster.capture_servicemonitor_metrics


cluster capture_servicemonitor_metrics
======================================

Captures ServiceMonitor or PodMonitor YAML and status for a given service

Captures the ServiceMonitor/PodMonitor configuration and status information for
a specific service in a namespace, including related service/pod and
endpoints information for troubleshooting monitoring setup.


Parameters
----------


``service_name``

* Name of the service to capture ServiceMonitor/PodMonitor metrics for


``namespace``

* Namespace where the service and ServiceMonitor/PodMonitor are located (empty string auto-detects current namespace)


``capture_frequency``

* How often to capture metrics in seconds (default: 60)

* default value: ``60``


``is_podmonitor``

* Whether to use PodMonitor instead of ServiceMonitor (default: False)


``finalize``

* Whether to finalize (capture logs and delete) an existing pod instead of creating a new one (default: False)

1 change: 1 addition & 0 deletions docs/toolbox.generated/index.rst
Expand Up @@ -14,6 +14,7 @@ Toolbox Documentation

* :doc:`build_push_image <Cluster.build_push_image>` Build and publish an image to quay using either a Dockerfile or git repo.
* :doc:`capture_environment <Cluster.capture_environment>` Captures the cluster environment
* :doc:`capture_servicemonitor_metrics <Cluster.capture_servicemonitor_metrics>` Captures ServiceMonitor or PodMonitor YAML and status for a given service
* :doc:`create_htpasswd_adminuser <Cluster.create_htpasswd_adminuser>` Create an htpasswd admin user.
* :doc:`create_osd <Cluster.create_osd>` Create an OpenShift Dedicated cluster.
* :doc:`deploy_operator <Cluster.deploy_operator>` Deploy an operator from OperatorHub catalog entry.
Expand Down
20 changes: 20 additions & 0 deletions projects/cluster/toolbox/cluster.py
Expand Up @@ -547,3 +547,23 @@ def enable_userworkload_monitoring(self, namespaces: list = []):
"""

return RunAnsibleRole(locals())

@AnsibleRole("cluster_capture_servicemonitor_metrics")
@AnsibleMappedParams
def capture_servicemonitor_metrics(self, service_name, namespace="", capture_frequency=60, is_podmonitor=False, finalize=False):
"""
Captures ServiceMonitor or PodMonitor YAML and status for a given service

Captures the ServiceMonitor/PodMonitor configuration and status information for
a specific service in a namespace, including related service/pod and
endpoints information for troubleshooting monitoring setup.

Args:
service_name: Name of the service to capture ServiceMonitor/PodMonitor metrics for
namespace: Namespace where the service and ServiceMonitor/PodMonitor are located (empty string auto-detects current namespace)
capture_frequency: How often to capture metrics in seconds (default: 60)
is_podmonitor: Whether to use PodMonitor instead of ServiceMonitor (default: False)
finalize: Whether to finalize (capture logs and delete) an existing pod instead of creating a new one (default: False)
Comment on lines +553 to +566

🛠️ Refactor suggestion | 🟠 Major

Rename service_name or make the role accept actual Services.

The public command and generated docs say this parameter is a Service name, but the role uses it as oc get servicemonitor|podmonitor <name>. Every new caller in this PR therefore passes a monitor resource name instead. As-is, a CLI user following the docs will hit NotFound unless the monitor happens to share the Service name.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@projects/cluster/toolbox/cluster.py` around lines 553 - 566, The function
capture_servicemonitor_metrics currently takes a parameter named service_name
but treats it as the monitor resource name; either rename the parameter to
monitor_name (update the signature, docstring, and all call sites where
capture_servicemonitor_metrics is invoked) to reflect that it expects a
ServiceMonitor/PodMonitor name, or add logic inside
capture_servicemonitor_metrics to accept an actual Service name: when the
provided name is not found as a ServiceMonitor/PodMonitor, query the Service (oc
get service <name>) and locate the matching ServiceMonitor/PodMonitor by
selector/labels and use its name; update the docstring to state the accepted
inputs and adjust callers accordingly (refer to capture_servicemonitor_metrics
and parameter service_name).

"""

return RunAnsibleRole(locals())
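
The Service-to-monitor resolution the review suggests reduces to matching a monitor's `spec.selector` against a Service's labels. A minimal sketch of that matching step, assuming the monitor and Service JSON have already been fetched (e.g. via `oc get ... -o json`); `monitor_matches_service` is a hypothetical helper, and only the `matchLabels` plus `In`/`Exists` subset of selector semantics is shown:

```python
def monitor_matches_service(monitor_selector: dict, service_labels: dict) -> bool:
    """Return True when a ServiceMonitor's spec.selector matches a Service's labels.

    Illustrative subset only: handles matchLabels and the In/Exists
    matchExpressions operators, not the full Kubernetes selector semantics.
    """
    # Every matchLabels entry must be present with the exact value.
    for key, value in monitor_selector.get("matchLabels", {}).items():
        if service_labels.get(key) != value:
            return False
    # matchExpressions are ANDed with matchLabels.
    for expr in monitor_selector.get("matchExpressions", []):
        key, op = expr["key"], expr["operator"]
        if op == "In" and service_labels.get(key) not in expr.get("values", []):
            return False
        if op == "Exists" and key not in service_labels:
            return False
    return True
```

With such a check in place, the role could accept a real Service name, list the namespace's ServiceMonitors/PodMonitors, and pick the one whose selector matches the Service's labels.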
@@ -0,0 +1,24 @@
# Auto-generated file, do not edit manually ...
# Toolbox generate command: repo generate_ansible_default_settings
# Source component: Cluster.capture_servicemonitor_metrics

# Parameters
# Name of the service to capture ServiceMonitor/PodMonitor metrics for
# Mandatory value
cluster_capture_servicemonitor_metrics_service_name:

# Namespace where the service and ServiceMonitor/PodMonitor are located (empty string auto-detects current namespace)
cluster_capture_servicemonitor_metrics_namespace:

# How often to capture metrics in seconds (default: 15)
cluster_capture_servicemonitor_metrics_capture_frequency: 60
Comment on lines +13 to +14

⚠️ Potential issue | 🟡 Minor

Comment/default value mismatch.

The comment states (default: 15) but the actual default is 60. Update either the comment or the value to be consistent.

Proposed fix
-# How often to capture metrics in seconds (default: 15)
-cluster_capture_servicemonitor_metrics_capture_frequency: 60
+# How often to capture metrics in seconds (default: 60)
+cluster_capture_servicemonitor_metrics_capture_frequency: 60
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
# How often to capture metrics in seconds (default: 15)
cluster_capture_servicemonitor_metrics_capture_frequency: 60
# How often to capture metrics in seconds (default: 60)
cluster_capture_servicemonitor_metrics_capture_frequency: 60
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@projects/cluster/toolbox/cluster_capture_servicemonitor_metrics/defaults/main/config.yml`
around lines 13 - 14, The inline comment for
cluster_capture_servicemonitor_metrics_capture_frequency says "(default: 15)"
but the variable is set to 60; make them consistent by updating the comment to
"(default: 60)" to match
cluster_capture_servicemonitor_metrics_capture_frequency, or if you intended the
default to be 15, change the variable value to 15 instead—adjust the comment and
value together so cluster_capture_servicemonitor_metrics_capture_frequency and
its descriptive comment match.


# Whether to use PodMonitor instead of ServiceMonitor (default: False)
cluster_capture_servicemonitor_metrics_is_podmonitor: false

# Whether to finalize (capture logs and delete) an existing pod instead of creating a new one (default: False)
cluster_capture_servicemonitor_metrics_finalize: false

# Default Ansible variables
# Default value for ansible_os_family to ensure role remains standalone
ansible_os_family: Linux
@@ -0,0 +1,98 @@
---
- name: Get current namespace if not specified
command: oc project -q
register: current_namespace_result
when: cluster_capture_servicemonitor_metrics_namespace == ""

- name: Set the target namespace
set_fact:
target_namespace: "{{ current_namespace_result.stdout if cluster_capture_servicemonitor_metrics_namespace == '' else cluster_capture_servicemonitor_metrics_namespace }}"

- name: Create capture directory
file:
path: "{{ artifact_extra_logs_dir }}/artifacts"
state: directory
mode: '0755'

- name: "[Finalize mode] capture logs and delete existing pod"
when: cluster_capture_servicemonitor_metrics_finalize
block:
- name: Check if pod exists for finalization
shell: |
oc get pod topsail-metrics-capture-{{ cluster_capture_servicemonitor_metrics_service_name }} -n {{ target_namespace }} --no-headers -o name
register: pod_exists_check

- name: Capture pod logs
shell: |
oc logs topsail-metrics-capture-{{ cluster_capture_servicemonitor_metrics_service_name }} -n {{ target_namespace }} > "{{ artifact_extra_logs_dir }}/artifacts/metrics_capture_logs.txt"
Comment on lines +20 to +27

⚠️ Potential issue | 🟠 Major

Missing error handling in finalize flow.

If the capture pod doesn't exist when finalize mode runs, oc get pod (line 22) fails with a non-zero exit code, causing the play to fail. The registered pod_exists_check variable is never used to conditionally skip the logs capture.

Consider adding failed_when: false to the check and using the result to guard subsequent tasks:

Proposed fix
   - name: Check if pod exists for finalization
     shell: |
       oc get pod topsail-metrics-capture-{{ cluster_capture_servicemonitor_metrics_service_name }} -n {{ target_namespace }} --no-headers -o name
     register: pod_exists_check
+    failed_when: false

   - name: Capture pod logs
     shell: |
       oc logs topsail-metrics-capture-{{ cluster_capture_servicemonitor_metrics_service_name }} -n {{ target_namespace }} > "{{ artifact_extra_logs_dir }}/artifacts/metrics_capture_logs.txt"
+    when: pod_exists_check.rc == 0
📝 Committable suggestion


Suggested change
- name: Check if pod exists for finalization
shell: |
oc get pod topsail-metrics-capture-{{ cluster_capture_servicemonitor_metrics_service_name }} -n {{ target_namespace }} --no-headers -o name
register: pod_exists_check
- name: Capture pod logs
shell: |
oc logs topsail-metrics-capture-{{ cluster_capture_servicemonitor_metrics_service_name }} -n {{ target_namespace }} > "{{ artifact_extra_logs_dir }}/artifacts/metrics_capture_logs.txt"
- name: Check if pod exists for finalization
shell: |
oc get pod topsail-metrics-capture-{{ cluster_capture_servicemonitor_metrics_service_name }} -n {{ target_namespace }} --no-headers -o name
register: pod_exists_check
failed_when: false
- name: Capture pod logs
shell: |
oc logs topsail-metrics-capture-{{ cluster_capture_servicemonitor_metrics_service_name }} -n {{ target_namespace }} > "{{ artifact_extra_logs_dir }}/artifacts/metrics_capture_logs.txt"
when: pod_exists_check.rc == 0
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@projects/cluster/toolbox/cluster_capture_servicemonitor_metrics/tasks/main.yml`
around lines 20 - 27, The "Check if pod exists for finalization" task should not
fail the play when the pod is absent: add failed_when: false (and optionally
changed_when: false) to that task so oc get returns won't abort the run, and
then guard the "Capture pod logs" task with when: pod_exists_check.rc == 0 (or
when: pod_exists_check.stdout != '' ) so logs are only collected if the pod
actually exists; reference the task names "Check if pod exists for finalization"
and "Capture pod logs" and the registered variable pod_exists_check when making
these changes.


- name: Delete metrics capture pod
shell: |
oc delete pod topsail-metrics-capture-{{ cluster_capture_servicemonitor_metrics_service_name }} -n {{ target_namespace }} --grace-period=0 --ignore-not-found

- name: "[Finalize mode] capture logs and delete existing pod"
when: cluster_capture_servicemonitor_metrics_finalize
meta: end_play
Comment on lines +17 to +35

⚠️ Potential issue | 🔴 Critical

Critical: Delete task runs unconditionally outside the finalize block.

The "Delete metrics capture pod" task (lines 29-31) is outside the block structure and has no when condition, so it executes in both finalize and normal modes. In normal deployment mode, this will delete any existing capture pod before the play continues to create a new one.

If this is intentional cleanup behavior, move it after the end_play guard so it's clear. If it should only run in finalize mode, move it inside the block.

Proposed fix (assuming delete should be finalize-only)
 - name: "[Finalize mode] capture logs and delete existing pod"
   when: cluster_capture_servicemonitor_metrics_finalize
   block:
   - name: Check if pod exists for finalization
     shell: |
       oc get pod topsail-metrics-capture-{{ cluster_capture_servicemonitor_metrics_service_name }} -n {{ target_namespace }} --no-headers -o name
     register: pod_exists_check
+    failed_when: false

   - name: Capture pod logs
     shell: |
       oc logs topsail-metrics-capture-{{ cluster_capture_servicemonitor_metrics_service_name }} -n {{ target_namespace }} > "{{ artifact_extra_logs_dir }}/artifacts/metrics_capture_logs.txt"
+    when: pod_exists_check.rc == 0

-- name: Delete metrics capture pod
-  shell: |
-    oc delete pod topsail-metrics-capture-{{ cluster_capture_servicemonitor_metrics_service_name }} -n {{ target_namespace }}  --grace-period=0 --ignore-not-found
+  - name: Delete metrics capture pod
+    shell: |
+      oc delete pod topsail-metrics-capture-{{ cluster_capture_servicemonitor_metrics_service_name }} -n {{ target_namespace }}  --grace-period=0 --ignore-not-found

 - name: "[Finalize mode] capture logs and delete existing pod"
   when: cluster_capture_servicemonitor_metrics_finalize
   meta: end_play
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@projects/cluster/toolbox/cluster_capture_servicemonitor_metrics/tasks/main.yml`
around lines 17 - 35, The "Delete metrics capture pod" task currently runs
unconditionally; ensure it only runs in finalize mode by either moving the task
into the existing finalize block (between the "Capture pod logs" task and the
meta: end_play) or by adding the when:
cluster_capture_servicemonitor_metrics_finalize guard to the "Delete metrics
capture pod" task; reference the task name "Delete metrics capture pod", the
condition variable cluster_capture_servicemonitor_metrics_finalize, and the
finalize block (meta: end_play) when making the change so the delete only
executes during finalization.


# Normal deployment mode: discover resources and create pod
- name: Include ServiceMonitor tasks
include_tasks: servicemonitor.yml
when: not cluster_capture_servicemonitor_metrics_is_podmonitor

- name: Include PodMonitor tasks
include_tasks: podmonitor.yml
when: cluster_capture_servicemonitor_metrics_is_podmonitor

# Ensure auth_secret_name is always defined with proper structure (fallback for edge cases)
- name: Set default auth secret name if not defined
set_fact:
auth_secret_name: "{% if auth_secret_name_cmd is defined %}{{ auth_secret_name_cmd.stdout }}{% endif %}"

# Common tasks for deployment mode
- name: Read metrics URL
shell: cat "{{ artifact_extra_logs_dir }}/artifacts/metrics_url.txt"
register: metrics_url_content

- name: Check if auth secret exists
shell: |
oc get secret {{ auth_secret_name }} -n {{ target_namespace }} --no-headers -o name
register: auth_secret_exists
when: auth_secret_name != ""

- name: Save authentication info
shell: |
{% if cluster_capture_servicemonitor_metrics_is_podmonitor %}
echo "PodMonitor: {{ cluster_capture_servicemonitor_metrics_service_name }}" > "{{ artifact_extra_logs_dir }}/artifacts/auth_info.txt"
{% else %}
echo "ServiceMonitor: {{ cluster_capture_servicemonitor_metrics_service_name }}" > "{{ artifact_extra_logs_dir }}/artifacts/auth_info.txt"
{% endif %}
echo "Auth secret name: {{ auth_secret_name | default('none') }}" >> "{{ artifact_extra_logs_dir }}/artifacts/auth_info.txt"
if [ -n "{{ auth_secret_name }}" ]; then
{% if auth_secret_name != '' and auth_secret_exists.rc == 0 %}
echo "Secret exists: yes" >> "{{ artifact_extra_logs_dir }}/artifacts/auth_info.txt"
echo "Secret will be mounted at: /var/run/secrets/auth/token" >> "{{ artifact_extra_logs_dir }}/artifacts/auth_info.txt"
{% else %}
echo "Secret exists: no" >> "{{ artifact_extra_logs_dir }}/artifacts/auth_info.txt"
echo "WARNING: Secret not found!" >> "{{ artifact_extra_logs_dir }}/artifacts/auth_info.txt"
{% endif %}
else
echo "No authentication required" >> "{{ artifact_extra_logs_dir }}/artifacts/auth_info.txt"
fi

- name: Create metrics capture pod manifest
template:
src: metrics_capture_pod.yaml.j2
dest: "{{ artifact_extra_logs_dir }}/artifacts/metrics_capture_pod.yaml"
mode: '0644'
vars:
metrics_url: "{{ metrics_url_content.stdout }}"
auth_secret_name: "{{ auth_secret_name | default('') }}"
capture_frequency: "{{ cluster_capture_servicemonitor_metrics_capture_frequency }}"

- name: Create metrics capture pod
shell: |
oc create -f "{{ artifact_extra_logs_dir }}/artifacts/metrics_capture_pod.yaml"

- name: Wait for pod to start
shell: |
oc wait --for=condition=Ready pod/topsail-metrics-capture-{{ cluster_capture_servicemonitor_metrics_service_name }} -n {{ target_namespace }} --timeout=60s
@@ -0,0 +1,75 @@
---
# PodMonitor-specific tasks

- name: Capture PodMonitor YAML
shell: |
oc get podmonitor {{ cluster_capture_servicemonitor_metrics_service_name }} -n {{ target_namespace }} -oyaml > "{{ artifact_extra_logs_dir }}/artifacts/podmonitor.yaml"

- name: Get PodMonitor status
shell: |
oc get podmonitor {{ cluster_capture_servicemonitor_metrics_service_name }} -n {{ target_namespace }} > "{{ artifact_extra_logs_dir }}/artifacts/podmonitor.status"

- name: Capture pod targets by PodMonitor selector
shell: |
oc get pod -l "app.kubernetes.io/component in (llminferenceservice-workload,llminferenceservice-workload-prefill,llminferenceservice-workload-worker,llminferenceservice-workload-leader,llminferenceservice-workload-leader-prefill,llminferenceservice-workload-worker-prefill),app.kubernetes.io/part-of=llminferenceservice" -n {{ target_namespace }} > "{{ artifact_extra_logs_dir }}/artifacts/target_pods.status"

- name: Capture pod targets by PodMonitor selector YAML
shell: |
oc get pod -l "app.kubernetes.io/component in (llminferenceservice-workload,llminferenceservice-workload-prefill,llminferenceservice-workload-worker,llminferenceservice-workload-leader,llminferenceservice-workload-leader-prefill,llminferenceservice-workload-worker-prefill),app.kubernetes.io/part-of=llminferenceservice" -n {{ target_namespace }} -oyaml > "{{ artifact_extra_logs_dir }}/artifacts/target_pods.yaml"

- name: Get all target pod IPs and names
shell: |
oc get pod -l "app.kubernetes.io/component in (llminferenceservice-workload,llminferenceservice-workload-prefill,llminferenceservice-workload-worker,llminferenceservice-workload-leader,llminferenceservice-workload-leader-prefill,llminferenceservice-workload-worker-prefill),app.kubernetes.io/part-of=llminferenceservice" -n {{ target_namespace }} --no-headers -o custom-columns=":metadata.name,:status.podIP"
register: target_pods_info
Comment on lines +12 to +23

⚠️ Potential issue | 🟠 Major

Resolve pods from the PodMonitor you were asked to capture.

After fetching the named PodMonitor, the role ignores its selector and scans every llminferenceservice workload pod in the namespace via a hardcoded component list. That will mix captures across services when multiple LLMInferenceServices share a namespace, and it will break as soon as the PodMonitor selector changes. The existing ISVC state capture already keys pod lookup off app.kubernetes.io/name; this should derive its selector from the PodMonitor too.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@projects/cluster/toolbox/cluster_capture_servicemonitor_metrics/tasks/podmonitor.yml`
around lines 12 - 23, The tasks "Capture pod targets by PodMonitor selector",
"Capture pod targets by PodMonitor selector YAML" and "Get all target pod IPs
and names" currently use a hardcoded component list; change them to derive the
Pod label selector from the actual PodMonitor resource instead. Query the
PodMonitor in the target_namespace (e.g. oc get podmonitor <name> -n {{
target_namespace }} -o yaml) and extract .spec.selector.matchLabels (or convert
matchExpressions to label selector) into a variable, then use that label
selector in the oc get pod -l "<derived-selector>" commands (and in the -oyaml
and custom-columns calls) instead of the hardcoded app.kubernetes.io/component
list; ensure you reference the same selector encoding used by existing ISVC
lookups (app.kubernetes.io/name when applicable) and add a safe fallback to the
previous selector only if the PodMonitor has no selector.
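
The selector derivation proposed above amounts to serializing the PodMonitor's `spec.selector` into the string form that `oc get pod -l` accepts. A minimal sketch, assuming the PodMonitor JSON has already been fetched; `selector_to_string` is a hypothetical helper covering the four standard operators:

```python
def selector_to_string(selector: dict) -> str:
    """Convert a spec.selector dict into an `oc get -l` label-selector string."""
    # matchLabels entries become key=value pairs (sorted for determinism).
    parts = [f"{k}={v}" for k, v in sorted(selector.get("matchLabels", {}).items())]
    # matchExpressions map onto Kubernetes set-based selector syntax.
    for expr in selector.get("matchExpressions", []):
        key, op = expr["key"], expr["operator"]
        values = ",".join(expr.get("values", []))
        if op == "In":
            parts.append(f"{key} in ({values})")
        elif op == "NotIn":
            parts.append(f"{key} notin ({values})")
        elif op == "Exists":
            parts.append(key)
        elif op == "DoesNotExist":
            parts.append(f"!{key}")
    return ",".join(parts)
```

The resulting string would replace the hardcoded llminferenceservice component list in the three pod-lookup tasks, so captures follow whatever the named PodMonitor actually selects.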


- name: Extract scheme from PodMonitor
shell: |
oc get podmonitor {{ cluster_capture_servicemonitor_metrics_service_name }} -n {{ target_namespace }} -o jsonpath='{.spec.podMetricsEndpoints[0].scheme}' 2>/dev/null || echo "http"
register: metrics_scheme

- name: Extract target port from PodMonitor
shell: |
oc get podmonitor {{ cluster_capture_servicemonitor_metrics_service_name }} -n {{ target_namespace }} -o jsonpath='{.spec.podMetricsEndpoints[0].targetPort}' 2>/dev/null || echo "9090"
register: metrics_port

- name: Build metrics URLs for all matching pods
shell: |
set -o pipefail;

SCHEME="{{ metrics_scheme.stdout | trim | default('http') }}"
PORT="{{ metrics_port.stdout | trim | default('9090') }}"

# Count total pods and initialize files
TOTAL_PODS=$(echo "{{ target_pods_info.stdout }}" | wc -l)

# Initialize files
echo "" > "{{ artifact_extra_logs_dir }}/artifacts/metrics_url.txt"
echo "PodMonitor target pods (found: $TOTAL_PODS):" > "{{ artifact_extra_logs_dir }}/artifacts/metrics_info.txt"
echo "Scheme: $SCHEME" >> "{{ artifact_extra_logs_dir }}/artifacts/metrics_info.txt"
echo "Port: $PORT" >> "{{ artifact_extra_logs_dir }}/artifacts/metrics_info.txt"
echo "" >> "{{ artifact_extra_logs_dir }}/artifacts/metrics_info.txt"

# Build URL list for each pod
while IFS= read -r line; do
if [ -n "$line" ]; then
POD_NAME=$(echo "$line" | awk '{print $1}')
POD_IP=$(echo "$line" | awk '{print $2}')
if [ -n "$POD_IP" ] && [ "$POD_IP" != "<none>" ]; then
URL="$SCHEME://$POD_IP:$PORT/metrics"
echo "$URL" >> "{{ artifact_extra_logs_dir }}/artifacts/metrics_url.txt"
echo "Pod: $POD_NAME ($POD_IP) -> $URL" >> "{{ artifact_extra_logs_dir }}/artifacts/metrics_info.txt"
else
echo "Pod: $POD_NAME (no IP available)" >> "{{ artifact_extra_logs_dir }}/artifacts/metrics_info.txt"
fi
fi
done <<< "{{ target_pods_info.stdout }}"

# Count valid URLs and add summary
VALID_URLS=$(grep -c "^http" "{{ artifact_extra_logs_dir }}/artifacts/metrics_url.txt" || echo "0")
echo "" >> "{{ artifact_extra_logs_dir }}/artifacts/metrics_info.txt"
echo "Total metrics URLs: $VALID_URLS" >> "{{ artifact_extra_logs_dir }}/artifacts/metrics_info.txt"

- name: Extract authorization secret name from PodMonitor
shell: |
oc get podmonitor {{ cluster_capture_servicemonitor_metrics_service_name }} -n {{ target_namespace }} -o jsonpath='{.spec.podMetricsEndpoints[0].authorization.credentials.name}' 2>/dev/null || echo ""
register: auth_secret_name
@@ -0,0 +1,76 @@
---
# ServiceMonitor-specific tasks

- name: Capture ServiceMonitor YAML
shell: |
oc get servicemonitor {{ cluster_capture_servicemonitor_metrics_service_name }} -n {{ target_namespace }} -oyaml > "{{ artifact_extra_logs_dir }}/artifacts/servicemonitor.yaml"

- name: Get ServiceMonitor status
shell: |
oc get servicemonitor {{ cluster_capture_servicemonitor_metrics_service_name }} -n {{ target_namespace }} > "{{ artifact_extra_logs_dir }}/artifacts/servicemonitor.status"

- name: Capture service target by ServiceMonitor selector
shell: |
oc get service -l "app.kubernetes.io/component=llminferenceservice-router-scheduler,app.kubernetes.io/part-of=llminferenceservice" -n {{ target_namespace }} > "{{ artifact_extra_logs_dir }}/artifacts/target_service.status"

- name: Capture service target by ServiceMonitor selector YAML
shell: |
oc get service -l "app.kubernetes.io/component=llminferenceservice-router-scheduler,app.kubernetes.io/part-of=llminferenceservice" -n {{ target_namespace }} -oyaml > "{{ artifact_extra_logs_dir }}/artifacts/target_service.yaml"

- name: Get target service name
shell: |
set -o pipefail;
oc get service -l "app.kubernetes.io/component=llminferenceservice-router-scheduler,app.kubernetes.io/part-of=llminferenceservice" -n {{ target_namespace }} --no-headers -o custom-columns=":metadata.name" | head -1
register: target_service_name
Comment on lines +12 to +24

⚠️ Potential issue | 🟠 Major

Derive the target Service from the ServiceMonitor instead of hardcoding the scheduler labels.

These tasks only use the requested monitor name for the YAML/status dump; the actual lookup is always pinned to llminferenceservice-router-scheduler. If this command is pointed at any other ServiceMonitor, or if that monitor’s selector changes, target_service_name and the generated metrics URL will be wrong. head -1 also makes the selected Service arbitrary when more than one matches.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@projects/cluster/toolbox/cluster_capture_servicemonitor_metrics/tasks/servicemonitor.yml`
around lines 12 - 24, The tasks currently hardcode the scheduler labels and use
head -1 which can pick an arbitrary Service; instead, read the ServiceMonitor
named by the incoming monitor variable (the ServiceMonitor resource referenced
by the task that dumps YAML/status) and extract its spec.selector.matchLabels,
then use those labels to query Services to determine the target; replace the
hardcoded label selector in the "Capture service target by ServiceMonitor
selector" and "Get target service name" tasks with a step that fetches the
ServiceMonitor, builds the label selector string from spec.selector.matchLabels,
uses that selector for oc get service, and change the "Get target service name"
registration (target_service_name) to fail if zero matches or to
deterministically select a single match (e.g., error on multiple matches or pick
the service with a specific annotation/port) instead of using head -1.
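
The deterministic-selection point above (replacing `head -1`) can be illustrated with a small guard. `pick_target_service` is a hypothetical name, and erroring on ambiguity is just one of the policies the comment proposes:

```python
def pick_target_service(names: list) -> str:
    """Select the single Service matched by the selector, failing loudly
    on zero or multiple matches instead of silently taking the first line."""
    if not names:
        raise RuntimeError("no Service matches the ServiceMonitor selector")
    if len(names) > 1:
        raise RuntimeError(
            f"ambiguous selector, matched: {', '.join(sorted(names))}"
        )
    return names[0]
```

In the Ansible task this would correspond to checking the line count of the `oc get service` output and failing the play (or logging a warning) when it is not exactly one.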


- name: Extract port name from ServiceMonitor
shell: |
oc get servicemonitor {{ cluster_capture_servicemonitor_metrics_service_name }} -n {{ target_namespace }} -o jsonpath='{.spec.endpoints[0].port}' 2>/dev/null || echo "metrics"
register: metrics_port_name

- name: Get port number from Service by name
shell: |
SERVICE_NAME='{{ target_service_name.stdout | trim }}'
PORT_NAME='{{ metrics_port_name.stdout | trim | default("metrics") }}'
oc get service "$SERVICE_NAME" -n {{ target_namespace }} -o jsonpath="{.spec.ports[?(@.name=='$PORT_NAME')].port}" 2>/dev/null || echo ""
register: named_port_result

- name: Get first port as fallback
shell: |
SERVICE_NAME='{{ target_service_name.stdout | trim }}'
oc get service "$SERVICE_NAME" -n {{ target_namespace }} -o jsonpath='{.spec.ports[0].port}' 2>/dev/null || echo "9090"
register: first_port_result
when: named_port_result.stdout == ""

- name: Set final port number
set_fact:
final_port: "{{ named_port_result.stdout if named_port_result.stdout != '' else first_port_result.stdout | default('9090') }}"

- name: Determine scheme from port
set_fact:
final_scheme: >-
{{
'https' if (
final_port in ['443', '8443', '6443'] or
(metrics_port_name.stdout | trim | default('metrics')) is match('.*(https|secure|tls).*')
) else 'http'
}}
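
The Jinja expression above encodes a port-to-scheme heuristic; restated as plain Python for clarity (a hypothetical `infer_scheme` helper mirroring the same rules, where well-known TLS ports or TLS-suggestive port names map to https):

```python
import re

def infer_scheme(port: str, port_name: str = "metrics") -> str:
    """Infer http vs https the same way the set_fact task does."""
    # Well-known TLS ports force https.
    if port in ("443", "8443", "6443"):
        return "https"
    # Port names hinting at TLS (e.g. "https-metrics") also force https.
    if re.search(r"https|secure|tls", port_name):
        return "https"
    return "http"
```

Note the heuristic is best-effort: a ServiceMonitor can also set `spec.endpoints[].scheme` explicitly, which would be the authoritative source when present.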

- name: Build metrics URL for ServiceMonitor
shell: |
SERVICE_NAME='{{ target_service_name.stdout | trim }}'
PORT_NAME='{{ metrics_port_name.stdout | trim | default("metrics") }}'
PORT_NUMBER='{{ final_port }}'
SCHEME='{{ final_scheme }}'

echo "$SCHEME://$SERVICE_NAME.{{ target_namespace }}.svc:$PORT_NUMBER/metrics" > "{{ artifact_extra_logs_dir }}/artifacts/metrics_url.txt"
echo "Service: $SERVICE_NAME" > "{{ artifact_extra_logs_dir }}/artifacts/metrics_info.txt"
echo "Port name: $PORT_NAME" >> "{{ artifact_extra_logs_dir }}/artifacts/metrics_info.txt"
echo "Port number: $PORT_NUMBER" >> "{{ artifact_extra_logs_dir }}/artifacts/metrics_info.txt"
echo "Scheme: $SCHEME" >> "{{ artifact_extra_logs_dir }}/artifacts/metrics_info.txt"
echo "URL: $SCHEME://$SERVICE_NAME.{{ target_namespace }}.svc:$PORT_NUMBER/metrics" >> "{{ artifact_extra_logs_dir }}/artifacts/metrics_info.txt"

- name: Extract authorization secret name from ServiceMonitor
shell: |
oc get servicemonitor {{ cluster_capture_servicemonitor_metrics_service_name }} -n {{ target_namespace }} -o jsonpath='{.spec.endpoints[0].authorization.credentials.name}' 2>/dev/null || echo ""
register: auth_secret_name_cmd