48 changes: 48 additions & 0 deletions docs/toolbox.generated/Cluster.capture_servicemonitor_metrics.rst
@@ -0,0 +1,48 @@
:orphan:

..
_Auto-generated file, do not edit manually ...
_Toolbox generate command: repo generate_toolbox_rst_documentation
_ Source component: Cluster.capture_servicemonitor_metrics


cluster capture_servicemonitor_metrics
======================================

Captures ServiceMonitor or PodMonitor YAML and status for a given service

Captures the ServiceMonitor/PodMonitor configuration and status information for
a specific service in a namespace, including related service/pod and
endpoints information for troubleshooting monitoring setup.


Parameters
----------


``service_name``

* Name of the service to capture ServiceMonitor/PodMonitor metrics for


``namespace``

* Namespace where the service and ServiceMonitor/PodMonitor are located (empty string auto-detects current namespace)


``capture_frequency``

* How often to capture metrics in seconds (default: 15)

* default value: ``60``
Comment on lines +33 to +37
⚠️ Potential issue | 🟡 Minor

Same documentation mismatch: "default: 15" but actual value is 60.

This inconsistency originates from the docstring in projects/cluster/toolbox/cluster.py line 564. Fix the source docstring to correct both this RST file and the Ansible defaults file upon regeneration.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/toolbox.generated/Cluster.capture_servicemonitor_metrics.rst` around
lines 33-37, the docstring for the capture_frequency parameter is incorrect —
update the docstring for capture_frequency inside the
capture_servicemonitor_metrics method (in projects/cluster/toolbox/cluster.py)
to state the correct default value of 60 seconds (replace "default: 15" with
"default: 60") and ensure the inline description reflects "How often to capture
metrics in seconds (default: 60)"; after that regenerate the docs so
docs/toolbox.generated/Cluster.capture_servicemonitor_metrics.rst and the
Ansible defaults output reflect the corrected default.



``is_podmonitor``

* Whether to use PodMonitor instead of ServiceMonitor (default: False)


``finalize``

* Whether to finalize (capture logs and delete) an existing pod instead of creating new one (default: False)

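For context, a typical invocation of this new toolbox entry point might look as follows (a sketch only: the service and namespace names are illustrative, and it assumes topsail's standard `run_toolbox.py` launcher; parameter names follow the Parameters section above):

```shell
# Start a capture pod for a hypothetical service (names are illustrative)
./run_toolbox.py cluster capture_servicemonitor_metrics \
    --service_name=my-llm-service \
    --namespace=my-namespace \
    --capture_frequency=60

# After the test run, collect the capture pod's logs and delete it
./run_toolbox.py cluster capture_servicemonitor_metrics \
    --service_name=my-llm-service \
    --namespace=my-namespace \
    --finalize=true
```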
4 changes: 2 additions & 2 deletions docs/toolbox.generated/Llmd.run_guidellm_benchmark.rst
@@ -39,14 +39,14 @@ Parameters

* Container image for the benchmark

* default value: ``ghcr.io/vllm-project/guidellm``
* default value: ``ghcr.io/albertoperdomo2/guidellm``


``version``

* Version tag for the benchmark image

* default value: ``pr-590``
* default value: ``nightly``


``timeout``
1 change: 1 addition & 0 deletions docs/toolbox.generated/index.rst
@@ -14,6 +14,7 @@ Toolbox Documentation

* :doc:`build_push_image <Cluster.build_push_image>` Build and publish an image to quay using either a Dockerfile or git repo.
* :doc:`capture_environment <Cluster.capture_environment>` Captures the cluster environment
* :doc:`capture_servicemonitor_metrics <Cluster.capture_servicemonitor_metrics>` Captures ServiceMonitor or PodMonitor YAML and status for a given service
* :doc:`create_htpasswd_adminuser <Cluster.create_htpasswd_adminuser>` Create an htpasswd admin user.
* :doc:`create_osd <Cluster.create_osd>` Create an OpenShift Dedicated cluster.
* :doc:`deploy_operator <Cluster.deploy_operator>` Deploy an operator from OperatorHub catalog entry.
20 changes: 20 additions & 0 deletions projects/cluster/toolbox/cluster.py
@@ -547,3 +547,23 @@ def enable_userworkload_monitoring(self, namespaces: list = []):
"""

return RunAnsibleRole(locals())

@AnsibleRole("cluster_capture_servicemonitor_metrics")
@AnsibleMappedParams
def capture_servicemonitor_metrics(self, service_name, namespace="", capture_frequency=60, is_podmonitor=False, finalize=False):
"""
Captures ServiceMonitor or PodMonitor YAML and status for a given service

Captures the ServiceMonitor/PodMonitor configuration and status information for
a specific service in a namespace, including related service/pod and
endpoints information for troubleshooting monitoring setup.

Args:
service_name: Name of the service to capture ServiceMonitor/PodMonitor metrics for
namespace: Namespace where the service and ServiceMonitor/PodMonitor are located (empty string auto-detects current namespace)
capture_frequency: How often to capture metrics in seconds (default: 15)
⚠️ Potential issue | 🟡 Minor

Docstring default value is incorrect: says 15 but actual default is 60.

The docstring states (default: 15) but the method signature on line 553 shows capture_frequency=60. This is the root cause of the documentation inconsistencies in the generated files.

Proposed fix
-            capture_frequency: How often to capture metrics in seconds (default: 15)
+            capture_frequency: How often to capture metrics in seconds (default: 60)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@projects/cluster/toolbox/cluster.py` at line 564, the docstring for the
function that defines the capture_frequency parameter (the function with
parameter capture_frequency=60) incorrectly states "(default: 15)"; update the
docstring to reflect the actual default of 60 by changing the text to "(default:
60)" wherever capture_frequency is documented (e.g., in the docstring block
around the function that includes capture_frequency) so the documentation
matches the function signature.

is_podmonitor: Whether to use PodMonitor instead of ServiceMonitor (default: False)
finalize: Whether to finalize (capture logs and delete) an existing pod instead of creating new one (default: False)
"""

return RunAnsibleRole(locals())
24 changes: 24 additions & 0 deletions projects/cluster/toolbox/cluster_capture_servicemonitor_metrics/defaults/main/config.yml
@@ -0,0 +1,24 @@
# Auto-generated file, do not edit manually ...
# Toolbox generate command: repo generate_ansible_default_settings
# Source component: Cluster.capture_servicemonitor_metrics

# Parameters
# Name of the service to capture ServiceMonitor/PodMonitor metrics for
# Mandatory value
cluster_capture_servicemonitor_metrics_service_name:

# Namespace where the service and ServiceMonitor/PodMonitor are located (empty string auto-detects current namespace)
cluster_capture_servicemonitor_metrics_namespace:

# How often to capture metrics in seconds (default: 15)
cluster_capture_servicemonitor_metrics_capture_frequency: 60
Comment on lines +13 to +14
⚠️ Potential issue | 🟡 Minor

Documentation mismatch: Comment says "default: 15" but actual default is 60.

The comment on line 13 states (default: 15) but the actual default value on line 14 is 60. This inconsistency is propagated from the source docstring in projects/cluster/toolbox/cluster.py line 564.

Proposed fix
-# How often to capture metrics in seconds (default: 15)
+# How often to capture metrics in seconds (default: 60)
 cluster_capture_servicemonitor_metrics_capture_frequency: 60
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@projects/cluster/toolbox/cluster_capture_servicemonitor_metrics/defaults/main/config.yml`
around lines 13-14, the inline comment for
cluster_capture_servicemonitor_metrics_capture_frequency says "(default: 15)"
but the actual default is 60; update the comment here and the corresponding
docstring in cluster.py (the docstring/source for
cluster_capture_servicemonitor_metrics_capture_frequency around the docstring at
the earlier definition) so both state "(default: 60)" to keep docs and code
consistent.


# Whether to use PodMonitor instead of ServiceMonitor (default: False)
cluster_capture_servicemonitor_metrics_is_podmonitor: false

# Whether to finalize (capture logs and delete) an existing pod instead of creating new one (default: False)
cluster_capture_servicemonitor_metrics_finalize: false

# Default Ansible variables
# Default value for ansible_os_family to ensure role remains standalone
ansible_os_family: Linux
98 changes: 98 additions & 0 deletions projects/cluster/toolbox/cluster_capture_servicemonitor_metrics/tasks/main.yml
@@ -0,0 +1,98 @@
---
- name: Get current namespace if not specified
command: oc project -q
register: current_namespace_result
when: cluster_capture_servicemonitor_metrics_namespace == ""

- name: Set the target namespace
set_fact:
target_namespace: "{{ current_namespace_result.stdout if cluster_capture_servicemonitor_metrics_namespace == '' else cluster_capture_servicemonitor_metrics_namespace }}"

- name: Create capture directory
file:
path: "{{ artifact_extra_logs_dir }}/artifacts"
state: directory
mode: '0755'

- name: "[Finalize mode] capture logs and delete existing pod"
when: cluster_capture_servicemonitor_metrics_finalize
block:
- name: Check if pod exists for finalization
shell: |
oc get pod topsail-metrics-capture-{{ cluster_capture_servicemonitor_metrics_service_name }} -n {{ target_namespace }} --no-headers -o name
register: pod_exists_check

- name: Capture pod logs
shell: |
oc logs topsail-metrics-capture-{{ cluster_capture_servicemonitor_metrics_service_name }} -n {{ target_namespace }} > "{{ artifact_extra_logs_dir }}/artifacts/metrics_capture_logs.txt"
Comment on lines +20 to +27
⚠️ Potential issue | 🟠 Major

Make finalize mode idempotent when the capture pod never existed.

stop_metrics_capture() runs from a finally: block in projects/llm-d/testing/test_llmd.py, so this role also executes after early startup failures. Right now oc get pod ... and oc logs ... fail hard when the pod is absent, which turns best-effort teardown into a secondary exception. Make the probe non-fatal and only fetch logs when the pod is there.

Suggested fix
   - name: Check if pod exists for finalization
     shell: |
       oc get pod topsail-metrics-capture-{{ cluster_capture_servicemonitor_metrics_service_name }} -n {{ target_namespace }} --no-headers -o name
     register: pod_exists_check
+    failed_when: false
+    changed_when: false

   - name: Capture pod logs
     shell: |
       oc logs topsail-metrics-capture-{{ cluster_capture_servicemonitor_metrics_service_name }} -n {{ target_namespace }} > "{{ artifact_extra_logs_dir }}/artifacts/metrics_capture_logs.txt"
+    when: pod_exists_check.rc == 0
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@projects/cluster/toolbox/cluster_capture_servicemonitor_metrics/tasks/main.yml`
around lines 20-27, the finalize probe currently fails if the capture pod
doesn't exist; make the pod-existence check non-fatal and only run the log
capture when the pod is present by adding failed_when: false (and optionally
changed_when: false) to the "Check if pod exists for finalization" task that
registers pod_exists_check, and guard the "Capture pod logs" task with a when
that checks pod_exists_check (e.g., pod_exists_check.stdout is defined and not
empty or pod_exists_check.rc == 0). Reference the task names "Check if pod
exists for finalization" and "Capture pod logs" and the variables
cluster_capture_servicemonitor_metrics_service_name, target_namespace, and
artifact_extra_logs_dir when making the changes.
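The non-fatal probe pattern this suggestion describes can also be illustrated outside Ansible (a minimal Python sketch, not part of the role; the commands are placeholders):

```python
import subprocess

def pod_exists(get_pod_cmd):
    """Run a 'get pod'-style command; a non-zero exit code means
    'absent', not 'error', so it must not raise."""
    result = subprocess.run(get_pod_cmd, shell=True,
                            capture_output=True, text=True)
    return result.returncode == 0 and result.stdout.strip() != ""

def finalize(get_pod_cmd, capture_logs):
    # Only fetch logs when the pod is actually there; otherwise the
    # best-effort teardown is a no-op instead of a secondary failure.
    if pod_exists(get_pod_cmd):
        capture_logs()
        return "logs captured"
    return "pod absent, nothing to do"
```

This mirrors the `failed_when: false` + `when: pod_exists_check.rc == 0` combination in the suggested fix: the probe never aborts, and the log capture only runs when the probe succeeded.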


- name: Delete metrics capture pod
shell: |
oc delete pod topsail-metrics-capture-{{ cluster_capture_servicemonitor_metrics_service_name }} -n {{ target_namespace }} --grace-period=0 --ignore-not-found

- name: "[Finalize mode] capture logs and delete existing pod"
when: cluster_capture_servicemonitor_metrics_finalize
meta: end_play

# Normal deployment mode: discover resources and create pod
- name: Include ServiceMonitor tasks
include_tasks: servicemonitor.yml
when: not cluster_capture_servicemonitor_metrics_is_podmonitor

- name: Include PodMonitor tasks
include_tasks: podmonitor.yml
when: cluster_capture_servicemonitor_metrics_is_podmonitor

# Ensure auth_secret_name is always defined with proper structure (fallback for edge cases)
- name: Set default auth secret name if not defined
set_fact:
auth_secret_name: "{% if auth_secret_name_cmd is defined %}{{ auth_secret_name_cmd.stdout }}{% endif %}"
Comment on lines +46 to +49
⚠️ Potential issue | 🟠 Major

This normalization step clears PodMonitor auth secrets.

The ServiceMonitor path registers auth_secret_name_cmd, but projects/cluster/toolbox/cluster_capture_servicemonitor_metrics/tasks/podmonitor.yml:72-75 registers auth_secret_name. When the PodMonitor branch runs, this set_fact rewrites the resolved secret name to "", so the later secret handling, mount, and bearer auth are all skipped.

Suggested fix
-# Ensure auth_secret_name is always defined with proper structure (fallback for edge cases)
-- name: Set default auth secret name if not defined
-  set_fact:
-    auth_secret_name: "{% if auth_secret_name_cmd is defined %}{{ auth_secret_name_cmd.stdout }}{% endif %}"
+# Normalize the extracted auth secret name from either monitor path
+- name: Normalize auth secret name
+  set_fact:
+    auth_secret_name: >-
+      {{ auth_secret_name_cmd.stdout if auth_secret_name_cmd is defined
+         else (auth_secret_name.stdout if auth_secret_name is defined else '') }}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@projects/cluster/toolbox/cluster_capture_servicemonitor_metrics/tasks/main.yml`
around lines 46-49, the current set_fact in tasks/main.yml unconditionally
overwrites auth_secret_name with an empty string when auth_secret_name_cmd is
undefined, clearing PodMonitor-registered secrets; change the set_fact logic so
it only assigns auth_secret_name when auth_secret_name_cmd is defined and
non-empty (e.g., keep existing auth_secret_name otherwise), ensuring the task
that sets auth_secret_name (from
projects/cluster/toolbox/cluster_capture_servicemonitor_metrics/tasks/podmonitor.yml)
is not clobbered and subsequent secret handling, mounting, and bearer auth
remain intact.
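The intended precedence (ServiceMonitor-extracted name first, then the PodMonitor-extracted name, then empty) can be sketched as a plain function (illustrative only, mirroring the Jinja expression in the suggested fix):

```python
def normalize_auth_secret(servicemonitor_stdout=None, podmonitor_stdout=None):
    """Pick the auth secret name from whichever monitor path ran.

    servicemonitor_stdout corresponds to auth_secret_name_cmd.stdout,
    podmonitor_stdout to the PodMonitor-registered auth_secret_name.stdout.
    """
    if servicemonitor_stdout is not None:
        return servicemonitor_stdout.strip()
    if podmonitor_stdout is not None:
        return podmonitor_stdout.strip()
    return ""
```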


# Common tasks for deployment mode
- name: Read metrics URL
shell: cat "{{ artifact_extra_logs_dir }}/artifacts/metrics_url.txt"
register: metrics_url_content

- name: Check if auth secret exists
shell: |
oc get secret {{ auth_secret_name }} -n {{ target_namespace }} --no-headers -o name
register: auth_secret_exists
when: auth_secret_name != ""
Comment on lines +56 to +60
⚠️ Potential issue | 🟠 Major

The auth-secret probe is still fatal.

The next task branches on auth_secret_exists.rc, but oc get secret ... will already fail the play when the secret is absent. That makes the "Secret exists: no" path below unreachable and turns what looks like an informational probe into an early abort.

Suggested fix
 - name: Check if auth secret exists
   shell: |
     oc get secret {{ auth_secret_name }} -n {{ target_namespace }} --no-headers -o name
   register: auth_secret_exists
   when: auth_secret_name != ""
+  failed_when: false
+  changed_when: false

Also applies to: 62-80

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@projects/cluster/toolbox/cluster_capture_servicemonitor_metrics/tasks/main.yml`
around lines 56-60, the "Check if auth secret exists" task uses "oc get secret
..." and registers auth_secret_exists but will abort the play when the secret is
missing; change the task so the command does not fail the play by adding either
failed_when: false (and optionally changed_when: false) or ignore_errors: true
to the task that runs "oc get secret {{ auth_secret_name }} -n {{
target_namespace }} --no-headers -o name", so auth_secret_exists.rc can be
inspected later; apply the same change to the other similar probe tasks that
register values (the tasks that set auth_secret_exists and the equivalent checks
in the 62-80 range) to make the "Secret exists: no" path reachable.


- name: Save authentication info
shell: |
{% if cluster_capture_servicemonitor_metrics_is_podmonitor %}
echo "PodMonitor: {{ cluster_capture_servicemonitor_metrics_service_name }}" > "{{ artifact_extra_logs_dir }}/artifacts/auth_info.txt"
{% else %}
echo "ServiceMonitor: {{ cluster_capture_servicemonitor_metrics_service_name }}" > "{{ artifact_extra_logs_dir }}/artifacts/auth_info.txt"
{% endif %}
echo "Auth secret name: {{ auth_secret_name | default('none') }}" >> "{{ artifact_extra_logs_dir }}/artifacts/auth_info.txt"
if [ -n "{{ auth_secret_name }}" ]; then
{% if auth_secret_name != '' and auth_secret_exists.rc == 0 %}
echo "Secret exists: yes" >> "{{ artifact_extra_logs_dir }}/artifacts/auth_info.txt"
echo "Secret will be mounted at: /var/run/secrets/auth/token" >> "{{ artifact_extra_logs_dir }}/artifacts/auth_info.txt"
{% else %}
echo "Secret exists: no" >> "{{ artifact_extra_logs_dir }}/artifacts/auth_info.txt"
echo "WARNING: Secret not found!" >> "{{ artifact_extra_logs_dir }}/artifacts/auth_info.txt"
{% endif %}
else
echo "No authentication required" >> "{{ artifact_extra_logs_dir }}/artifacts/auth_info.txt"
fi

- name: Create metrics capture pod manifest
template:
src: metrics_capture_pod.yaml.j2
dest: "{{ artifact_extra_logs_dir }}/artifacts/metrics_capture_pod.yaml"
mode: '0644'
vars:
metrics_url: "{{ metrics_url_content.stdout }}"
auth_secret_name: "{{ auth_secret_name | default('') }}"
capture_frequency: "{{ cluster_capture_servicemonitor_metrics_capture_frequency }}"

- name: Create metrics capture pod
shell: |
oc create -f "{{ artifact_extra_logs_dir }}/artifacts/metrics_capture_pod.yaml"

- name: Wait for pod to start
shell: |
oc wait --for=condition=Ready pod/topsail-metrics-capture-{{ cluster_capture_servicemonitor_metrics_service_name }} -n {{ target_namespace }} --timeout=60s
75 changes: 75 additions & 0 deletions projects/cluster/toolbox/cluster_capture_servicemonitor_metrics/tasks/podmonitor.yml
@@ -0,0 +1,75 @@
---
# PodMonitor-specific tasks

- name: Capture PodMonitor YAML
shell: |
oc get podmonitor {{ cluster_capture_servicemonitor_metrics_service_name }} -n {{ target_namespace }} -oyaml > "{{ artifact_extra_logs_dir }}/artifacts/podmonitor.yaml"

- name: Get PodMonitor status
shell: |
oc get podmonitor {{ cluster_capture_servicemonitor_metrics_service_name }} -n {{ target_namespace }} > "{{ artifact_extra_logs_dir }}/artifacts/podmonitor.status"

- name: Capture pod targets by PodMonitor selector
shell: |
oc get pod -l "app.kubernetes.io/component in (llminferenceservice-workload,llminferenceservice-workload-prefill,llminferenceservice-workload-worker,llminferenceservice-workload-leader,llminferenceservice-workload-leader-prefill,llminferenceservice-workload-worker-prefill),app.kubernetes.io/part-of=llminferenceservice" -n {{ target_namespace }} > "{{ artifact_extra_logs_dir }}/artifacts/target_pods.status"

- name: Capture pod targets by PodMonitor selector YAML
shell: |
oc get pod -l "app.kubernetes.io/component in (llminferenceservice-workload,llminferenceservice-workload-prefill,llminferenceservice-workload-worker,llminferenceservice-workload-leader,llminferenceservice-workload-leader-prefill,llminferenceservice-workload-worker-prefill),app.kubernetes.io/part-of=llminferenceservice" -n {{ target_namespace }} -oyaml > "{{ artifact_extra_logs_dir }}/artifacts/target_pods.yaml"

- name: Get all target pod IPs and names
shell: |
oc get pod -l "app.kubernetes.io/component in (llminferenceservice-workload,llminferenceservice-workload-prefill,llminferenceservice-workload-worker,llminferenceservice-workload-leader,llminferenceservice-workload-leader-prefill,llminferenceservice-workload-worker-prefill),app.kubernetes.io/part-of=llminferenceservice" -n {{ target_namespace }} --no-headers -o custom-columns=":metadata.name,:status.podIP"
register: target_pods_info

- name: Extract scheme from PodMonitor
shell: |
oc get podmonitor {{ cluster_capture_servicemonitor_metrics_service_name }} -n {{ target_namespace }} -o jsonpath='{.spec.podMetricsEndpoints[0].scheme}' 2>/dev/null || echo "http"
register: metrics_scheme

- name: Extract target port from PodMonitor
shell: |
oc get podmonitor {{ cluster_capture_servicemonitor_metrics_service_name }} -n {{ target_namespace }} -o jsonpath='{.spec.podMetricsEndpoints[0].targetPort}' 2>/dev/null || echo "9090"
register: metrics_port

- name: Build metrics URLs for all matching pods
shell: |
set -o pipefail;

SCHEME="{{ metrics_scheme.stdout | trim | default('http') }}"
PORT="{{ metrics_port.stdout | trim | default('9090') }}"

# Count total pods and initialize files
TOTAL_PODS=$(echo "{{ target_pods_info.stdout }}" | wc -l)

# Initialize files
echo "" > "{{ artifact_extra_logs_dir }}/artifacts/metrics_url.txt"
echo "PodMonitor target pods (found: $TOTAL_PODS):" > "{{ artifact_extra_logs_dir }}/artifacts/metrics_info.txt"
echo "Scheme: $SCHEME" >> "{{ artifact_extra_logs_dir }}/artifacts/metrics_info.txt"
echo "Port: $PORT" >> "{{ artifact_extra_logs_dir }}/artifacts/metrics_info.txt"
echo "" >> "{{ artifact_extra_logs_dir }}/artifacts/metrics_info.txt"

# Build URL list for each pod
while IFS= read -r line; do
if [ -n "$line" ]; then
POD_NAME=$(echo "$line" | awk '{print $1}')
POD_IP=$(echo "$line" | awk '{print $2}')
if [ -n "$POD_IP" ] && [ "$POD_IP" != "<none>" ]; then
URL="$SCHEME://$POD_IP:$PORT/metrics"
echo "$URL" >> "{{ artifact_extra_logs_dir }}/artifacts/metrics_url.txt"
echo "Pod: $POD_NAME ($POD_IP) -> $URL" >> "{{ artifact_extra_logs_dir }}/artifacts/metrics_info.txt"
else
echo "Pod: $POD_NAME (no IP available)" >> "{{ artifact_extra_logs_dir }}/artifacts/metrics_info.txt"
fi
fi
done <<< "{{ target_pods_info.stdout }}"

# Count valid URLs and add summary
VALID_URLS=$(grep -c "^http" "{{ artifact_extra_logs_dir }}/artifacts/metrics_url.txt" || echo "0")
echo "" >> "{{ artifact_extra_logs_dir }}/artifacts/metrics_info.txt"
echo "Total metrics URLs: $VALID_URLS" >> "{{ artifact_extra_logs_dir }}/artifacts/metrics_info.txt"
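The URL-building loop above boils down to this transformation (a Python sketch of the same logic, for reference; the input format matches the `custom-columns=":metadata.name,:status.podIP"` output registered earlier):

```python
def build_metrics_urls(pod_lines, scheme="http", port="9090"):
    """Turn 'NAME IP' lines (oc custom-columns output) into metrics URLs,
    skipping empty lines and pods without an assigned IP."""
    urls = []
    for line in pod_lines.splitlines():
        fields = line.split()
        if len(fields) < 2:
            continue
        ip = fields[1]
        if ip and ip != "<none>":
            urls.append(f"{scheme}://{ip}:{port}/metrics")
    return urls
```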

- name: Extract authorization secret name from PodMonitor
shell: |
oc get podmonitor {{ cluster_capture_servicemonitor_metrics_service_name }} -n {{ target_namespace }} -o jsonpath='{.spec.podMetricsEndpoints[0].authorization.credentials.name}' 2>/dev/null || echo ""
register: auth_secret_name
76 changes: 76 additions & 0 deletions projects/cluster/toolbox/cluster_capture_servicemonitor_metrics/tasks/servicemonitor.yml
@@ -0,0 +1,76 @@
---
# ServiceMonitor-specific tasks

- name: Capture ServiceMonitor YAML
shell: |
oc get servicemonitor {{ cluster_capture_servicemonitor_metrics_service_name }} -n {{ target_namespace }} -oyaml > "{{ artifact_extra_logs_dir }}/artifacts/servicemonitor.yaml"

- name: Get ServiceMonitor status
shell: |
oc get servicemonitor {{ cluster_capture_servicemonitor_metrics_service_name }} -n {{ target_namespace }} > "{{ artifact_extra_logs_dir }}/artifacts/servicemonitor.status"

- name: Capture service target by ServiceMonitor selector
shell: |
oc get service -l "app.kubernetes.io/component=llminferenceservice-router-scheduler,app.kubernetes.io/part-of=llminferenceservice" -n {{ target_namespace }} > "{{ artifact_extra_logs_dir }}/artifacts/target_service.status"

- name: Capture service target by ServiceMonitor selector YAML
shell: |
oc get service -l "app.kubernetes.io/component=llminferenceservice-router-scheduler,app.kubernetes.io/part-of=llminferenceservice" -n {{ target_namespace }} -oyaml > "{{ artifact_extra_logs_dir }}/artifacts/target_service.yaml"

- name: Get target service name
shell: |
set -o pipefail;
oc get service -l "app.kubernetes.io/component=llminferenceservice-router-scheduler,app.kubernetes.io/part-of=llminferenceservice" -n {{ target_namespace }} --no-headers -o custom-columns=":metadata.name" | head -1
register: target_service_name

- name: Extract port name from ServiceMonitor
shell: |
oc get servicemonitor {{ cluster_capture_servicemonitor_metrics_service_name }} -n {{ target_namespace }} -o jsonpath='{.spec.endpoints[0].port}' 2>/dev/null || echo "metrics"
register: metrics_port_name

- name: Get port number from Service by name
shell: |
SERVICE_NAME='{{ target_service_name.stdout | trim }}'
PORT_NAME='{{ metrics_port_name.stdout | trim | default("metrics") }}'
oc get service "$SERVICE_NAME" -n {{ target_namespace }} -o jsonpath="{.spec.ports[?(@.name=='$PORT_NAME')].port}" 2>/dev/null || echo ""
register: named_port_result

- name: Get first port as fallback
shell: |
SERVICE_NAME='{{ target_service_name.stdout | trim }}'
oc get service "$SERVICE_NAME" -n {{ target_namespace }} -o jsonpath='{.spec.ports[0].port}' 2>/dev/null || echo "9090"
register: first_port_result
when: named_port_result.stdout == ""

- name: Set final port number
set_fact:
final_port: "{{ named_port_result.stdout if named_port_result.stdout != '' else first_port_result.stdout | default('9090') }}"

- name: Determine scheme from port
set_fact:
final_scheme: >-
{{
'https' if (
final_port in ['443', '8443', '6443'] or
(metrics_port_name.stdout | trim | default('metrics')) is match('.*(https|secure|tls).*')
) else 'http'
}}
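Taken together, the port-fallback and scheme-heuristic tasks above behave like this (a Python sketch; the TLS port list and name regex are copied from the Jinja expressions, everything else is illustrative):

```python
import re

def resolve_port(named_port, first_port):
    # Prefer the port matched by the ServiceMonitor's port name; fall back
    # to the service's first port, then to 9090 (the set_fact fallback chain).
    if named_port:
        return named_port
    return first_port or "9090"

def resolve_scheme(port, port_name="metrics"):
    # Well-known TLS ports, or a port name hinting at TLS, imply https.
    if port in ("443", "8443", "6443"):
        return "https"
    if re.search(r"https|secure|tls", port_name or "metrics"):
        return "https"
    return "http"
```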

- name: Build metrics URL for ServiceMonitor
shell: |
SERVICE_NAME='{{ target_service_name.stdout | trim }}'
PORT_NAME='{{ metrics_port_name.stdout | trim | default("metrics") }}'
PORT_NUMBER='{{ final_port }}'
SCHEME='{{ final_scheme }}'

echo "$SCHEME://$SERVICE_NAME.{{ target_namespace }}.svc:$PORT_NUMBER/metrics" > "{{ artifact_extra_logs_dir }}/artifacts/metrics_url.txt"
echo "Service: $SERVICE_NAME" > "{{ artifact_extra_logs_dir }}/artifacts/metrics_info.txt"
echo "Port name: $PORT_NAME" >> "{{ artifact_extra_logs_dir }}/artifacts/metrics_info.txt"
echo "Port number: $PORT_NUMBER" >> "{{ artifact_extra_logs_dir }}/artifacts/metrics_info.txt"
echo "Scheme: $SCHEME" >> "{{ artifact_extra_logs_dir }}/artifacts/metrics_info.txt"
echo "URL: $SCHEME://$SERVICE_NAME.{{ target_namespace }}.svc:$PORT_NUMBER/metrics" >> "{{ artifact_extra_logs_dir }}/artifacts/metrics_info.txt"

- name: Extract authorization secret name from ServiceMonitor
shell: |
oc get servicemonitor {{ cluster_capture_servicemonitor_metrics_service_name }} -n {{ target_namespace }} -o jsonpath='{.spec.endpoints[0].authorization.credentials.name}' 2>/dev/null || echo ""
register: auth_secret_name_cmd