fix: Look for increases in counter for missing_pods Prometheus alert rule #310

Merged
mvlassis merged 7 commits into main from kf-8299-argo-pods-missing-rule
Feb 12, 2026

@mvlassis mvlassis commented Feb 5, 2026

Fixes canonical/bundle-kubeflow#1380

This PR updates the pods_missing metric in argo-controller to use the new name introduced in Argo 3.7. The changes include:

  • The missing_pods.rule Prometheus alert rule now looks for increases in the missing_pods counter, so that it no longer fires continuously once the counter is above zero.
  • The corresponding Grafana dashboard panel is updated to match.
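
To illustrate the first change, a rule of roughly the following shape fires only while the counter has grown within the lookback window. This is a sketch under assumptions, not the rule from this PR: argo_pod_missing is the counter name used in the linked issue (the PR switches to the newer Argo 3.7 name, which is not shown here), the 10m window mirrors the "past 10 minutes" dashboard panel, and the group name, severity, and summary are placeholders.

```yaml
groups:
  - name: ArgoWorkflowPodsMissing  # placeholder group name
    rules:
      - alert: ArgoWorkflowPodsMissing
        # Assumption: argo_pod_missing is the counter name from the linked issue;
        # substitute the metric name that the deployed Argo version exposes.
        # increase() is positive only if the counter grew during the last 10m,
        # so the alert resolves once no new pods go missing.
        expr: increase(argo_pod_missing[10m]) > 0
        labels:
          severity: warning  # placeholder
        annotations:
          summary: Argo workflow pods went missing in the last 10 minutes
```

By contrast, an expression such as argo_pod_missing > 0 stays true for as long as the counter exists after the first missing pod, which is the behavior the linked issue reports.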

To test

First, deploy all necessary charms:

juju add-model kubeflow
juju deploy opentelemetry-collector-k8s --trust --channel=1/stable

# In the argo-controller directory
charmcraft pack
juju deploy ./argo-controller_ubuntu@24.04-amd64.charm --trust --resource oci-image=docker.io/charmedkubeflow/workflow-controller:3.7.3-be9eb54

juju deploy minio --trust --channel=1.10/stable
juju integrate argo-controller minio

# Deploy cos-lite
juju add-model cos
juju switch cos
juju deploy cos-lite --trust
juju offer cos.prometheus:receive-remote-write prometheus-receive-remote-write
juju offer cos.grafana:grafana-dashboard grafana-dashboards
juju offer cos.loki:logging loki-logging

juju consume -m kubeflow cos.prometheus-receive-remote-write
juju consume -m kubeflow cos.grafana-dashboards
juju consume -m kubeflow cos.loki-logging

juju integrate -m kubeflow opentelemetry-collector-k8s:send-remote-write prometheus-receive-remote-write
juju integrate -m kubeflow opentelemetry-collector-k8s:grafana-dashboards-provider grafana-dashboards

# Relate with argo-controller
juju switch kubeflow
juju integrate argo-controller:metrics-endpoint opentelemetry-collector-k8s:metrics-endpoint
juju integrate argo-controller:grafana-dashboard opentelemetry-collector-k8s:grafana-dashboards-consumer

Then, create a workflow that starts a long-running pod and save it as test-alert.yaml:

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: alert-trigger-
spec:
  entrypoint: main
  templates:
    - name: main
      container:
        image: alpine:latest
        command: [sh, -c]
        args: ["echo 'Starting...'; sleep 900; echo 'Done'"]

Create it in the kubeflow namespace, then delete the workflow's pod:

kubectl create -n kubeflow -f test-alert.yaml

# After 1 minute
kubectl delete pod -n kubeflow <pod-name>

We should see:

  • The "Workflows missing Pods in the past 10 minutes" gauge increasing in the Grafana dashboard
  • The "ArgoWorkflowPodsMissing" alert firing

@mvlassis mvlassis changed the title chore: Look for increases in counter for missing_pods Prometheus alert rule fix: Look for increases in counter for missing_pods Prometheus alert rule Feb 5, 2026
@github-actions github-actions bot added the Libraries: Out of sync The charm libs used are out-of-sync label Feb 5, 2026

@NohaIhab NohaIhab left a comment

Thanks @mvlassis, just one comment on the error persistence duration.

@mvlassis mvlassis requested a review from NohaIhab February 11, 2026 10:22

@NohaIhab NohaIhab left a comment

Thanks for the fix!


@dariofaccin dariofaccin left a comment

LGTM.

@mvlassis mvlassis merged commit fedac8e into main Feb 12, 2026
9 checks passed
@mvlassis mvlassis deleted the kf-8299-argo-pods-missing-rule branch February 12, 2026 11:13
mvlassis added a commit that referenced this pull request Feb 12, 2026
…t rule (#310)

* chore: Look for increases in counter for Prometheus alert rule

* chore: Look for increases in counter for Prometheus alert rule

* chore: Look for increases in counter for Prometheus alert rule

* chore: Look for increases in counter for Prometheus alert rule

* chore: Look for increases in counter for Prometheus alert rule

* Update Grafana dashboard

* Update Prometheus description
mvlassis added a commit that referenced this pull request Feb 12, 2026
… Prometheus alert rules (#312)

This PR is a backport of #310 to `track/3.5`
mvlassis added a commit that referenced this pull request Feb 12, 2026
… alert rules (#311)

This PR is a backport of #310 to `track/3.7`
Development

Successfully merging this pull request may close these issues.

ArgoWorkflowPodsMissing alert should be triggered by argo_pod_missing counter increase and not by counter being > 0

3 participants