pkg/monitortests/clusterversionoperator/legacycvomonitortests: Expand monitoring reason exceptions

wking · wking · commit 5573b55c852a · 2023-11-28T20:58:14.000-08:00
To cover the main hits in:

  $ curl -s 'https://search.ci.openshift.org/search?maxAge=48h&type=junit&name=4.15.*upgrade&context=0&search=clusteroperator/monitoring.*condition/Available.*status/False' | jq -r 'to_entries[].value | to_entries[].value[].context[]' | sed 's|.*clusteroperator/\([^ ]*\) condition/Available reason/\([^ ]*\) status/False[^:]*: \(.*\)|\1 \2 \3|' | sort | uniq -c | sort -n
        1 monitoring PlatformTasksFailed reconciling Console Plugin Deployment failed: updating Deployment object failed: waiting for DeploymentRollout of openshift-monitoring/monitoring-plugin: context deadline exceeded, UpdatingAlertmanager: waiting for Alertmanager object changes failed: waiting for Alertmanager openshift-monitoring/main: context deadline exceeded, UpdatingPrometheusAdapter: reconciling PrometheusAdapter Deployment failed: updating Deployment object failed: waiting for DeploymentRollout of openshift-monitoring/prometheus-adapter: context deadline exceeded, UpdatingThanosQuerier: reconciling Thanos Querier Deployment failed: updating Deployment object failed: waiting for DeploymentRollout of openshift-monitoring/thanos-querier: context deadline exceeded, UpdatingPrometheus: client rate limiter Wait returned an error: rate: Wait(n=1) would exceed context deadline
        1 monitoring PlatformTasksFailed reconciling Console Plugin failed: retrieving ConsolePlugin object failed: conversion webhook for console.openshift.io/v1alpha1, Kind=ConsolePlugin failed: Post "https://webhook.openshift-console-operator.svc:9443/crdconvert?timeout=30s": no endpoints available for service "webhook", UpdatingPrometheus: reconciling Prometheus object failed: updating Prometheus object failed: Operation cannot be fulfilled on prometheuses.monitoring.coreos.com "k8s": the object has been modified; please apply your changes to the latest version and try again
        1 monitoring UpdatingPrometheusOperatorFailed reconciling Prometheus Operator Admission Webhook Deployment failed: updating Deployment object failed: waiting for DeploymentRollout of openshift-monitoring/prometheus-operator-admission-webhook: context deadline exceeded
        3 monitoring UpdatingAlertmanagerFailed reconciling Alertmanager object failed: updating Alertmanager object failed: Operation cannot be fulfilled on alertmanagers.monitoring.coreos.com "main": the object has been modified; please apply your changes to the latest version and try again
        4 monitoring PlatformTasksFailed reconciling Console Plugin failed: retrieving ConsolePlugin object failed: conversion webhook for console.openshift.io/v1alpha1, Kind=ConsolePlugin failed: Post "https://webhook.openshift-console-operator.svc:9443/crdconvert?timeout=30s": no endpoints available for service "webhook", UpdatingPrometheus: client rate limiter Wait returned an error: context deadline exceeded
       15 monitoring UpdatingConsolePluginComponentsFailed reconciling Console Plugin failed: retrieving ConsolePlugin object failed: conversion webhook for console.openshift.io/v1alpha1, Kind=ConsolePlugin failed: Post "https://webhook.openshift-console-operator.svc:9443/crdconvert?timeout=30s": no endpoints available for service "webhook"

I've also commented about these in [1], but I have no problem if folks decide to fork that bug into multiple trackers and have specific exceptions for each tracker.

And I'm including UpdatingPrometheusFailed for [2]:

  : [bz-Monitoring] clusteroperator/monitoring should not change condition/Available	1h54m23s
  {  2 unexpected clusteroperator state transitions during e2e test run.  These did not match any known exceptions, so they cause this test-case to fail:

  Nov 28 20:24:17.720 W clusteroperator/monitoring condition/Available reason/UpdatingPrometheusFailed status/Unknown UpdatingPrometheus: client rate limiter Wait returned an error: context deadline exceeded
  Nov 28 20:24:17.720 - 110s  E clusteroperator/monitoring condition/Available reason/UpdatingPrometheusFailed status/Unknown UpdatingPrometheus: client rate limiter Wait returned an error: context deadline exceeded

  1 unwelcome but acceptable clusteroperator state transitions during e2e test run.  These should not happen, but because they are tied to exceptions, the fact that they did happen is not sufficient to cause this test-case to fail:

  Nov 28 20:26:08.168 W clusteroperator/monitoring condition/Available reason/RollOutDone status/True Successfully rolled out the stack. (exception: Available=True is the happy case)
  }

[1]: https://issues.redhat.com/browse/OCPBUGS-23745
[2]: https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/27231/pull-ci-openshift-origin-master-e2e-gcp-ovn-rt-upgrade/1729564593711222784

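
For readers who would rather do the tally in Go than in a sed/sort/uniq pipeline, here is a rough sketch of the same grouping step. It only counts operator/reason pairs and ignores the message text. The function name `countReasons`, the regular expression, and the sample lines are all illustrative assumptions, not code or data from the repository or from the CI search service.

```go
package main

import (
	"fmt"
	"regexp"
	"sort"
)

// transitionRE is a hypothetical pattern: it captures the operator name and
// the reason from a line describing an Available=False transition.
var transitionRE = regexp.MustCompile(`clusteroperator/(\S+) condition/Available reason/(\S+) status/False`)

// countReasons tallies "operator reason" pairs, roughly what `sort | uniq -c` does.
func countReasons(lines []string) map[string]int {
	counts := map[string]int{}
	for _, line := range lines {
		m := transitionRE.FindStringSubmatch(line)
		if m == nil {
			continue // not an Available=False transition line
		}
		counts[m[1]+" "+m[2]]++
	}
	return counts
}

func main() {
	// Made-up sample lines in the general shape of the monitor output.
	lines := []string{
		"clusteroperator/monitoring condition/Available reason/UpdatingAlertmanagerFailed status/False changed: example message",
		"clusteroperator/monitoring condition/Available reason/PlatformTasksFailed status/False changed: example message",
		"clusteroperator/monitoring condition/Available reason/PlatformTasksFailed status/False changed: another message",
		"clusteroperator/monitoring condition/Available reason/RollOutDone status/True changed: example message",
	}
	counts := countReasons(lines)

	// Sort ascending by count, then by key, similar to `sort -n`.
	keys := make([]string, 0, len(counts))
	for k := range counts {
		keys = append(keys, k)
	}
	sort.Slice(keys, func(i, j int) bool {
		if counts[keys[i]] != counts[keys[j]] {
			return counts[keys[i]] < counts[keys[j]]
		}
		return keys[i] < keys[j]
	})
	for _, k := range keys {
		fmt.Printf("%7d %s\n", counts[k], k)
	}
}
```
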
diff --git a/pkg/monitortests/clusterversionoperator/legacycvomonitortests/operators.go b/pkg/monitortests/clusterversionoperator/legacycvomonitortests/operators.go
@@ -72,7 +72,7 @@ func testUpgradeOperatorStateTransitions(events monitorapi.Intervals) []*junitap
 				return "https://issues.redhat.com/browse/OCPBUGS-22364", nil
 			}
 		case "monitoring":
-			if condition.Type == configv1.OperatorAvailable && condition.Status == configv1.ConditionFalse && condition.Reason == "UpdatingPrometheusK8SFailed" {
+			if condition.Type == configv1.OperatorAvailable && (condition.Status == configv1.ConditionFalse && (condition.Reason == "PlatformTasksFailed" || condition.Reason == "UpdatingAlertmanagerFailed" || condition.Reason == "UpdatingConsolePluginComponentsFailed" || condition.Reason == "UpdatingPrometheusK8SFailed" || condition.Reason == "UpdatingPrometheusOperatorFailed")) || (condition.Status == configv1.ConditionUnknown && condition.Reason == "UpdatingPrometheusFailed") {
 				return "https://issues.redhat.com/browse/OCPBUGS-23745", nil
 			}
 		case "openshift-apiserver":
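
A note on reading the new condition: in Go, `&&` binds tighter than `||`, so the one-line expression above applies the `condition.Type == configv1.OperatorAvailable` check only to the `ConditionFalse` branch. The `ConditionUnknown`/`UpdatingPrometheusFailed` branch matches regardless of condition type. Below is a minimal standalone sketch of one way to spell the predicate out, assuming the intent is to gate both branches on `Available`. The `condition` struct, the string constants, and `isKnownMonitoringException` are hypothetical stand-ins for illustration, not the `configv1` types or any helper in the repository.

```go
package main

import "fmt"

// condition is a hypothetical stand-in for configv1.ClusterOperatorStatusCondition.
type condition struct {
	Type   string
	Status string
	Reason string
}

// availableFalseReasons lists the Available=False reasons named in the diff.
var availableFalseReasons = map[string]bool{
	"PlatformTasksFailed":                   true,
	"UpdatingAlertmanagerFailed":            true,
	"UpdatingConsolePluginComponentsFailed": true,
	"UpdatingPrometheusK8SFailed":           true,
	"UpdatingPrometheusOperatorFailed":      true,
}

// isKnownMonitoringException reports whether c matches one of the excepted
// transitions. Assumption: both the False and Unknown branches are meant to
// apply only to the Available condition.
func isKnownMonitoringException(c condition) bool {
	if c.Type != "Available" {
		return false
	}
	switch c.Status {
	case "False":
		return availableFalseReasons[c.Reason]
	case "Unknown":
		return c.Reason == "UpdatingPrometheusFailed"
	}
	return false
}

func main() {
	samples := []condition{
		{Type: "Available", Status: "False", Reason: "PlatformTasksFailed"},
		{Type: "Available", Status: "Unknown", Reason: "UpdatingPrometheusFailed"},
		{Type: "Degraded", Status: "Unknown", Reason: "UpdatingPrometheusFailed"},
	}
	for _, c := range samples {
		fmt.Printf("%s=%s reason=%s excepted=%v\n", c.Type, c.Status, c.Reason, isKnownMonitoringException(c))
	}
}
```

Keeping the reasons in a set also makes a later split into per-tracker exceptions, as the commit message allows for, a matter of moving entries between sets rather than editing one long boolean expression.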
