[llm-d] Keep working #915
Changes from all commits
88cd5e8
14105ff
c57b55a
3c414e9
@@ -0,0 +1,48 @@

:orphan:

..
    _Auto-generated file, do not edit manually ...
    _Toolbox generate command: repo generate_toolbox_rst_documentation
    _ Source component: Cluster.capture_servicemonitor_metrics


cluster capture_servicemonitor_metrics
======================================

Captures ServiceMonitor or PodMonitor YAML and status for a given service

Captures the ServiceMonitor/PodMonitor configuration and status information for
a specific service in a namespace, including related service/pod and
endpoints information for troubleshooting monitoring setup.


Parameters
----------

``service_name``

* Name of the service to capture ServiceMonitor/PodMonitor metrics for

``namespace``

* Namespace where the service and ServiceMonitor/PodMonitor are located (empty string auto-detects the current namespace)

``capture_frequency``

* How often to capture metrics, in seconds

* default value: ``60``

``is_podmonitor``

* Whether to use a PodMonitor instead of a ServiceMonitor (default: False)

``finalize``

* Whether to finalize (capture logs and delete) an existing pod instead of creating a new one (default: False)
@@ -0,0 +1,24 @@

# Auto-generated file, do not edit manually ...
# Toolbox generate command: repo generate_ansible_default_settings
# Source component: Cluster.capture_servicemonitor_metrics

# Parameters
# Name of the service to capture ServiceMonitor/PodMonitor metrics for
# Mandatory value
cluster_capture_servicemonitor_metrics_service_name:

# Namespace where the service and ServiceMonitor/PodMonitor are located (empty string auto-detects current namespace)
cluster_capture_servicemonitor_metrics_namespace:

# How often to capture metrics in seconds (default: 15)
cluster_capture_servicemonitor_metrics_capture_frequency: 60
Comment on lines +13 to +14

Comment/default value mismatch. The comment states the default is 15, but the variable is set to 60. Align the comment with the actual value.

Proposed fix:

-# How often to capture metrics in seconds (default: 15)
-cluster_capture_servicemonitor_metrics_capture_frequency: 60
+# How often to capture metrics in seconds (default: 60)
+cluster_capture_servicemonitor_metrics_capture_frequency: 60
# Whether to use PodMonitor instead of ServiceMonitor (default: False)
cluster_capture_servicemonitor_metrics_is_podmonitor: false

# Whether to finalize (capture logs and delete) an existing pod instead of creating new one (default: False)
cluster_capture_servicemonitor_metrics_finalize: false

# Default Ansible variables
# Default value for ansible_os_family to ensure role remains standalone
ansible_os_family: Linux
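For illustration, a caller could override these generated defaults through an extra-vars file — a sketch only; the service name below is a hypothetical placeholder, not one from this PR:

```yaml
# overrides.yaml (sketch) — passed with: ansible-playbook -e @overrides.yaml ...
cluster_capture_servicemonitor_metrics_service_name: my-llm-service  # hypothetical name
cluster_capture_servicemonitor_metrics_namespace: ""                 # empty: auto-detect current namespace
cluster_capture_servicemonitor_metrics_capture_frequency: 30         # capture every 30 seconds
cluster_capture_servicemonitor_metrics_is_podmonitor: true           # target a PodMonitor instead of a ServiceMonitor
```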
@@ -0,0 +1,98 @@

---
- name: Get current namespace if not specified
  command: oc project -q
  register: current_namespace_result
  when: cluster_capture_servicemonitor_metrics_namespace == ""

- name: Set the target namespace
  set_fact:
    target_namespace: "{{ current_namespace_result.stdout if cluster_capture_servicemonitor_metrics_namespace == '' else cluster_capture_servicemonitor_metrics_namespace }}"

- name: Create capture directory
  file:
    path: "{{ artifact_extra_logs_dir }}/artifacts"
    state: directory
    mode: '0755'

- name: "[Finalize mode] capture logs and delete existing pod"
  when: cluster_capture_servicemonitor_metrics_finalize
  block:
  - name: Check if pod exists for finalization
    shell: |
      oc get pod topsail-metrics-capture-{{ cluster_capture_servicemonitor_metrics_service_name }} -n {{ target_namespace }} --no-headers -o name
    register: pod_exists_check

  - name: Capture pod logs
    shell: |
      oc logs topsail-metrics-capture-{{ cluster_capture_servicemonitor_metrics_service_name }} -n {{ target_namespace }} > "{{ artifact_extra_logs_dir }}/artifacts/metrics_capture_logs.txt"
Comment on lines +20 to +27

Missing error handling in finalize flow. If the capture pod doesn't exist when finalize mode runs, the `oc get pod` check fails and aborts the play before anything is cleaned up. Consider adding `failed_when: false` to the check and gating the log capture on its result.

Proposed fix:

 - name: Check if pod exists for finalization
   shell: |
     oc get pod topsail-metrics-capture-{{ cluster_capture_servicemonitor_metrics_service_name }} -n {{ target_namespace }} --no-headers -o name
   register: pod_exists_check
+  failed_when: false

 - name: Capture pod logs
   shell: |
     oc logs topsail-metrics-capture-{{ cluster_capture_servicemonitor_metrics_service_name }} -n {{ target_namespace }} > "{{ artifact_extra_logs_dir }}/artifacts/metrics_capture_logs.txt"
+  when: pod_exists_check.rc == 0
- name: Delete metrics capture pod
  shell: |
    oc delete pod topsail-metrics-capture-{{ cluster_capture_servicemonitor_metrics_service_name }} -n {{ target_namespace }} --grace-period=0 --ignore-not-found

- name: "[Finalize mode] capture logs and delete existing pod"
  when: cluster_capture_servicemonitor_metrics_finalize
  meta: end_play
Comment on lines +17 to +35

Critical: Delete task runs unconditionally outside the finalize block. The "Delete metrics capture pod" task (lines 29-31) is outside the `[Finalize mode]` block, so it also runs in normal deployment mode and would delete the capture pod it is about to create. If this is intentional cleanup behavior, move it after the deployment-mode tasks and document it; otherwise, indent it into the finalize block.

Proposed fix (assuming delete should be finalize-only):

 - name: "[Finalize mode] capture logs and delete existing pod"
   when: cluster_capture_servicemonitor_metrics_finalize
   block:
   - name: Check if pod exists for finalization
     shell: |
       oc get pod topsail-metrics-capture-{{ cluster_capture_servicemonitor_metrics_service_name }} -n {{ target_namespace }} --no-headers -o name
     register: pod_exists_check
+    failed_when: false

   - name: Capture pod logs
     shell: |
       oc logs topsail-metrics-capture-{{ cluster_capture_servicemonitor_metrics_service_name }} -n {{ target_namespace }} > "{{ artifact_extra_logs_dir }}/artifacts/metrics_capture_logs.txt"
+    when: pod_exists_check.rc == 0

-- name: Delete metrics capture pod
-  shell: |
-    oc delete pod topsail-metrics-capture-{{ cluster_capture_servicemonitor_metrics_service_name }} -n {{ target_namespace }} --grace-period=0 --ignore-not-found
+  - name: Delete metrics capture pod
+    shell: |
+      oc delete pod topsail-metrics-capture-{{ cluster_capture_servicemonitor_metrics_service_name }} -n {{ target_namespace }} --grace-period=0 --ignore-not-found

 - name: "[Finalize mode] capture logs and delete existing pod"
   when: cluster_capture_servicemonitor_metrics_finalize
   meta: end_play
# Normal deployment mode: discover resources and create pod
- name: Include ServiceMonitor tasks
  include_tasks: servicemonitor.yml
  when: not cluster_capture_servicemonitor_metrics_is_podmonitor

- name: Include PodMonitor tasks
  include_tasks: podmonitor.yml
  when: cluster_capture_servicemonitor_metrics_is_podmonitor

# Ensure auth_secret_name is always defined with proper structure (fallback for edge cases)
- name: Set default auth secret name if not defined
  set_fact:
    auth_secret_name: "{% if auth_secret_name_cmd is defined %}{{ auth_secret_name_cmd.stdout }}{% endif %}"

# Common tasks for deployment mode
- name: Read metrics URL
  shell: cat "{{ artifact_extra_logs_dir }}/artifacts/metrics_url.txt"
  register: metrics_url_content

- name: Check if auth secret exists
  shell: |
    oc get secret {{ auth_secret_name }} -n {{ target_namespace }} --no-headers -o name
  register: auth_secret_exists
  when: auth_secret_name != ""

- name: Save authentication info
  shell: |
    {% if cluster_capture_servicemonitor_metrics_is_podmonitor %}
    echo "PodMonitor: {{ cluster_capture_servicemonitor_metrics_service_name }}" > "{{ artifact_extra_logs_dir }}/artifacts/auth_info.txt"
    {% else %}
    echo "ServiceMonitor: {{ cluster_capture_servicemonitor_metrics_service_name }}" > "{{ artifact_extra_logs_dir }}/artifacts/auth_info.txt"
    {% endif %}
    echo "Auth secret name: {{ auth_secret_name | default('none') }}" >> "{{ artifact_extra_logs_dir }}/artifacts/auth_info.txt"
    if [ -n "{{ auth_secret_name }}" ]; then
    {% if auth_secret_name != '' and auth_secret_exists.rc == 0 %}
      echo "Secret exists: yes" >> "{{ artifact_extra_logs_dir }}/artifacts/auth_info.txt"
      echo "Secret will be mounted at: /var/run/secrets/auth/token" >> "{{ artifact_extra_logs_dir }}/artifacts/auth_info.txt"
    {% else %}
      echo "Secret exists: no" >> "{{ artifact_extra_logs_dir }}/artifacts/auth_info.txt"
      echo "WARNING: Secret not found!" >> "{{ artifact_extra_logs_dir }}/artifacts/auth_info.txt"
    {% endif %}
    else
      echo "No authentication required" >> "{{ artifact_extra_logs_dir }}/artifacts/auth_info.txt"
    fi

- name: Create metrics capture pod manifest
  template:
    src: metrics_capture_pod.yaml.j2
    dest: "{{ artifact_extra_logs_dir }}/artifacts/metrics_capture_pod.yaml"
    mode: '0644'
  vars:
    metrics_url: "{{ metrics_url_content.stdout }}"
    auth_secret_name: "{{ auth_secret_name | default('') }}"
    capture_frequency: "{{ cluster_capture_servicemonitor_metrics_capture_frequency }}"

- name: Create metrics capture pod
  shell: |
    oc create -f "{{ artifact_extra_logs_dir }}/artifacts/metrics_capture_pod.yaml"

- name: Wait for pod to start
  shell: |
    oc wait --for=condition=Ready pod/topsail-metrics-capture-{{ cluster_capture_servicemonitor_metrics_service_name }} -n {{ target_namespace }} --timeout=60s
@@ -0,0 +1,75 @@

---
# PodMonitor-specific tasks

- name: Capture PodMonitor YAML
  shell: |
    oc get podmonitor {{ cluster_capture_servicemonitor_metrics_service_name }} -n {{ target_namespace }} -oyaml > "{{ artifact_extra_logs_dir }}/artifacts/podmonitor.yaml"

- name: Get PodMonitor status
  shell: |
    oc get podmonitor {{ cluster_capture_servicemonitor_metrics_service_name }} -n {{ target_namespace }} > "{{ artifact_extra_logs_dir }}/artifacts/podmonitor.status"

- name: Capture pod targets by PodMonitor selector
  shell: |
    oc get pod -l "app.kubernetes.io/component in (llminferenceservice-workload,llminferenceservice-workload-prefill,llminferenceservice-workload-worker,llminferenceservice-workload-leader,llminferenceservice-workload-leader-prefill,llminferenceservice-workload-worker-prefill),app.kubernetes.io/part-of=llminferenceservice" -n {{ target_namespace }} > "{{ artifact_extra_logs_dir }}/artifacts/target_pods.status"

- name: Capture pod targets by PodMonitor selector YAML
  shell: |
    oc get pod -l "app.kubernetes.io/component in (llminferenceservice-workload,llminferenceservice-workload-prefill,llminferenceservice-workload-worker,llminferenceservice-workload-leader,llminferenceservice-workload-leader-prefill,llminferenceservice-workload-worker-prefill),app.kubernetes.io/part-of=llminferenceservice" -n {{ target_namespace }} -oyaml > "{{ artifact_extra_logs_dir }}/artifacts/target_pods.yaml"

- name: Get all target pod IPs and names
  shell: |
    oc get pod -l "app.kubernetes.io/component in (llminferenceservice-workload,llminferenceservice-workload-prefill,llminferenceservice-workload-worker,llminferenceservice-workload-leader,llminferenceservice-workload-leader-prefill,llminferenceservice-workload-worker-prefill),app.kubernetes.io/part-of=llminferenceservice" -n {{ target_namespace }} --no-headers -o custom-columns=":metadata.name,:status.podIP"
  register: target_pods_info
Comment on lines +12 to +23

Resolve pods from the PodMonitor you were asked to capture. After fetching the named PodMonitor, the role ignores its selector and scans every llminferenceservice workload pod in the namespace via a hardcoded component list. That will mix captures across services when multiple LLMInferenceServices share a namespace, and it will break as soon as the PodMonitor selector changes. The existing ISVC state capture already keys pod lookup off …
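One way the reviewer's suggestion could look — a sketch, assuming the PodMonitor defines a plain `.spec.selector.matchLabels` (matchExpressions are not handled here) and that `oc`'s jsonpath renders the map as JSON: read the selector off the monitor itself and feed it to the pod lookup instead of the hardcoded label list.

```yaml
# Sketch: derive the pod label selector from the PodMonitor (task names are illustrative)
- name: Extract the PodMonitor's own label selector
  shell: |
    oc get podmonitor {{ cluster_capture_servicemonitor_metrics_service_name }} -n {{ target_namespace }} -o jsonpath='{.spec.selector.matchLabels}'
  register: podmonitor_selector_json

- name: Turn the matchLabels dict into an `oc -l` selector string
  set_fact:
    podmonitor_selector: "{{ sel.keys() | zip(sel.values()) | map('join', '=') | join(',') }}"
  vars:
    sel: "{{ podmonitor_selector_json.stdout | from_json }}"

- name: Get all target pod IPs and names via the monitor's selector
  shell: |
    oc get pod -l '{{ podmonitor_selector }}' -n {{ target_namespace }} --no-headers -o custom-columns=":metadata.name,:status.podIP"
  register: target_pods_info
```

This keeps the capture scoped to whatever the named PodMonitor actually selects, so the role no longer mixes pods across services sharing a namespace.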
- name: Extract scheme from PodMonitor
  shell: |
    oc get podmonitor {{ cluster_capture_servicemonitor_metrics_service_name }} -n {{ target_namespace }} -o jsonpath='{.spec.podMetricsEndpoints[0].scheme}' 2>/dev/null || echo "http"
  register: metrics_scheme

- name: Extract target port from PodMonitor
  shell: |
    oc get podmonitor {{ cluster_capture_servicemonitor_metrics_service_name }} -n {{ target_namespace }} -o jsonpath='{.spec.podMetricsEndpoints[0].targetPort}' 2>/dev/null || echo "9090"
  register: metrics_port

- name: Build metrics URLs for all matching pods
  shell: |
    set -o pipefail;

    SCHEME="{{ metrics_scheme.stdout | trim | default('http') }}"
    PORT="{{ metrics_port.stdout | trim | default('9090') }}"

    # Count total pods and initialize files
    TOTAL_PODS=$(echo "{{ target_pods_info.stdout }}" | wc -l)

    # Initialize files
    echo "" > "{{ artifact_extra_logs_dir }}/artifacts/metrics_url.txt"
    echo "PodMonitor target pods (found: $TOTAL_PODS):" > "{{ artifact_extra_logs_dir }}/artifacts/metrics_info.txt"
    echo "Scheme: $SCHEME" >> "{{ artifact_extra_logs_dir }}/artifacts/metrics_info.txt"
    echo "Port: $PORT" >> "{{ artifact_extra_logs_dir }}/artifacts/metrics_info.txt"
    echo "" >> "{{ artifact_extra_logs_dir }}/artifacts/metrics_info.txt"

    # Build URL list for each pod
    while IFS= read -r line; do
      if [ -n "$line" ]; then
        POD_NAME=$(echo "$line" | awk '{print $1}')
        POD_IP=$(echo "$line" | awk '{print $2}')
        if [ -n "$POD_IP" ] && [ "$POD_IP" != "<none>" ]; then
          URL="$SCHEME://$POD_IP:$PORT/metrics"
          echo "$URL" >> "{{ artifact_extra_logs_dir }}/artifacts/metrics_url.txt"
          echo "Pod: $POD_NAME ($POD_IP) -> $URL" >> "{{ artifact_extra_logs_dir }}/artifacts/metrics_info.txt"
        else
          echo "Pod: $POD_NAME (no IP available)" >> "{{ artifact_extra_logs_dir }}/artifacts/metrics_info.txt"
        fi
      fi
    done <<< "{{ target_pods_info.stdout }}"

    # Count valid URLs and add summary
    VALID_URLS=$(grep -c "^http" "{{ artifact_extra_logs_dir }}/artifacts/metrics_url.txt" || echo "0")
    echo "" >> "{{ artifact_extra_logs_dir }}/artifacts/metrics_info.txt"
    echo "Total metrics URLs: $VALID_URLS" >> "{{ artifact_extra_logs_dir }}/artifacts/metrics_info.txt"

- name: Extract authorization secret name from PodMonitor
  shell: |
    oc get podmonitor {{ cluster_capture_servicemonitor_metrics_service_name }} -n {{ target_namespace }} -o jsonpath='{.spec.podMetricsEndpoints[0].authorization.credentials.name}' 2>/dev/null || echo ""
  register: auth_secret_name
@@ -0,0 +1,76 @@

---
# ServiceMonitor-specific tasks

- name: Capture ServiceMonitor YAML
  shell: |
    oc get servicemonitor {{ cluster_capture_servicemonitor_metrics_service_name }} -n {{ target_namespace }} -oyaml > "{{ artifact_extra_logs_dir }}/artifacts/servicemonitor.yaml"

- name: Get ServiceMonitor status
  shell: |
    oc get servicemonitor {{ cluster_capture_servicemonitor_metrics_service_name }} -n {{ target_namespace }} > "{{ artifact_extra_logs_dir }}/artifacts/servicemonitor.status"

- name: Capture service target by ServiceMonitor selector
  shell: |
    oc get service -l "app.kubernetes.io/component=llminferenceservice-router-scheduler,app.kubernetes.io/part-of=llminferenceservice" -n {{ target_namespace }} > "{{ artifact_extra_logs_dir }}/artifacts/target_service.status"

- name: Capture service target by ServiceMonitor selector YAML
  shell: |
    oc get service -l "app.kubernetes.io/component=llminferenceservice-router-scheduler,app.kubernetes.io/part-of=llminferenceservice" -n {{ target_namespace }} -oyaml > "{{ artifact_extra_logs_dir }}/artifacts/target_service.yaml"

- name: Get target service name
  shell: |
    set -o pipefail;
    oc get service -l "app.kubernetes.io/component=llminferenceservice-router-scheduler,app.kubernetes.io/part-of=llminferenceservice" -n {{ target_namespace }} --no-headers -o custom-columns=":metadata.name" | head -1
  register: target_service_name
Comment on lines +12 to +24

Derive the target Service from the ServiceMonitor instead of hardcoding the scheduler labels. These tasks only use the requested monitor name for the YAML/status dump; the actual Service lookup is always pinned to the `llminferenceservice-router-scheduler` labels, so the role cannot capture any other ServiceMonitor's target.
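A sketch of the suggested approach, under the assumption that the ServiceMonitor carries a plain `.spec.selector.matchLabels` (matchExpressions not handled) and that `oc`'s jsonpath prints the map as JSON: build the label selector from the monitor itself rather than from the hardcoded scheduler labels.

```yaml
# Sketch: look up the target Service via the ServiceMonitor's own selector (task names are illustrative)
- name: Extract the ServiceMonitor's label selector
  shell: |
    oc get servicemonitor {{ cluster_capture_servicemonitor_metrics_service_name }} -n {{ target_namespace }} -o jsonpath='{.spec.selector.matchLabels}'
  register: servicemonitor_selector_json

- name: Get target service name from the monitor's selector
  shell: |
    set -o pipefail;
    oc get service -l '{{ sel.keys() | zip(sel.values()) | map('join', '=') | join(',') }}' -n {{ target_namespace }} --no-headers -o custom-columns=":metadata.name" | head -1
  vars:
    sel: "{{ servicemonitor_selector_json.stdout | from_json }}"
  register: target_service_name
```

With this, the downstream port/scheme discovery keeps working unchanged, but against whichever Service the named monitor actually selects.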
- name: Extract port name from ServiceMonitor
  shell: |
    oc get servicemonitor {{ cluster_capture_servicemonitor_metrics_service_name }} -n {{ target_namespace }} -o jsonpath='{.spec.endpoints[0].port}' 2>/dev/null || echo "metrics"
  register: metrics_port_name

- name: Get port number from Service by name
  shell: |
    SERVICE_NAME='{{ target_service_name.stdout | trim }}'
    PORT_NAME='{{ metrics_port_name.stdout | trim | default("metrics") }}'
    oc get service "$SERVICE_NAME" -n {{ target_namespace }} -o jsonpath="{.spec.ports[?(@.name=='$PORT_NAME')].port}" 2>/dev/null || echo ""
  register: named_port_result

- name: Get first port as fallback
  shell: |
    SERVICE_NAME='{{ target_service_name.stdout | trim }}'
    oc get service "$SERVICE_NAME" -n {{ target_namespace }} -o jsonpath='{.spec.ports[0].port}' 2>/dev/null || echo "9090"
  register: first_port_result
  when: named_port_result.stdout == ""

- name: Set final port number
  set_fact:
    final_port: "{{ named_port_result.stdout if named_port_result.stdout != '' else first_port_result.stdout | default('9090') }}"

- name: Determine scheme from port
  set_fact:
    final_scheme: >-
      {{
        'https' if (
          final_port in ['443', '8443', '6443'] or
          (metrics_port_name.stdout | trim | default('metrics')) is match('.*(https|secure|tls).*')
        ) else 'http'
      }}

- name: Build metrics URL for ServiceMonitor
  shell: |
    SERVICE_NAME='{{ target_service_name.stdout | trim }}'
    PORT_NAME='{{ metrics_port_name.stdout | trim | default("metrics") }}'
    PORT_NUMBER='{{ final_port }}'
    SCHEME='{{ final_scheme }}'

    echo "$SCHEME://$SERVICE_NAME.{{ target_namespace }}.svc:$PORT_NUMBER/metrics" > "{{ artifact_extra_logs_dir }}/artifacts/metrics_url.txt"
    echo "Service: $SERVICE_NAME" > "{{ artifact_extra_logs_dir }}/artifacts/metrics_info.txt"
    echo "Port name: $PORT_NAME" >> "{{ artifact_extra_logs_dir }}/artifacts/metrics_info.txt"
    echo "Port number: $PORT_NUMBER" >> "{{ artifact_extra_logs_dir }}/artifacts/metrics_info.txt"
    echo "Scheme: $SCHEME" >> "{{ artifact_extra_logs_dir }}/artifacts/metrics_info.txt"
    echo "URL: $SCHEME://$SERVICE_NAME.{{ target_namespace }}.svc:$PORT_NUMBER/metrics" >> "{{ artifact_extra_logs_dir }}/artifacts/metrics_info.txt"

- name: Extract authorization secret name from ServiceMonitor
  shell: |
    oc get servicemonitor {{ cluster_capture_servicemonitor_metrics_service_name }} -n {{ target_namespace }} -o jsonpath='{.spec.endpoints[0].authorization.credentials.name}' 2>/dev/null || echo ""
  register: auth_secret_name_cmd
🛠️ Refactor suggestion | 🟠 Major

Rename `service_name` or make the role accept actual Services.

The public command and generated docs say this parameter is a Service name, but the role uses it as `oc get servicemonitor|podmonitor <name>`. Every new caller in this PR therefore passes a monitor resource name instead. As-is, a CLI user following the docs will hit `NotFound` unless the monitor happens to share the Service name.