Commit cdd10a8

vuvkar and drichards-87 authored
[MOPU-301] AI driven improvement on "What's happening" section for monitor templates (#22628)
* [MOPU-301] What's happening section for monitor templates
* Update nginx/assets/monitors/5xx.json
  Co-authored-by: DeForest Richards <56796055+drichards-87@users.noreply.github.com>
* Update nginx/assets/monitors/upstream_peer_fails.json
  Co-authored-by: DeForest Richards <56796055+drichards-87@users.noreply.github.com>
* [MOPU-301] Reverted monitor files formatting.
* [MOPU-301] Scope generated message within {{is_alert}} conditional variable.
* [MOPU-301] Scope generated message within {{is_alert}} conditional variable.
* [MOPU-301] Manual enhancement over AI-generated messages.
* [MOPU-301] Fix typo.

---------

Co-authored-by: DeForest Richards <56796055+drichards-87@users.noreply.github.com>
1 parent 499ac59 commit cdd10a8

14 files changed: +14 -14 lines changed
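Every file in this commit makes the same one-line change: the previous plain message is replaced by a generated "What's happening?" section scoped inside the {{#is_alert}} / {{/is_alert}} conditional variables, so the extra context is rendered only while the monitor is in an alert state, while any static guidance placed after the closing tag still appears in every notification. The JSON below is a minimal sketch of that shape, not one of the committed files; the description, monitor name, and the {{example_tag.name}} template variable are placeholders.

{
  "description": "Placeholder description of what the monitor tracks.",
  "definition": {
    "message": "{{#is_alert}}\n\n## What's happening?\nAlert-only detail, for example the affected {{example_tag.name}}.\n\n{{/is_alert}}\n\n Static guidance that is shown in every notification.",
    "name": "[Example] Placeholder monitor",
    "options": {
      "escalation_message": ""
    }
  }
}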

kubernetes/assets/monitors/monitor_deployments_replicas.json

Lines changed: 1 addition & 1 deletion
@@ -8,7 +8,7 @@
   ],
   "description": "Kubernetes replicas are clones that facilitate self-healing for pods. Each pod has a desired number of replica Pods that should be running at any given time. This monitor tracks the number of replicas that are failing per deployment.",
   "definition": {
-    "message": "More than one Deployments Replica's pods are down in Deployment {{kube_namespace.name}}/{{kube_deployment.name}}.",
+    "message": "{{#is_alert}}\n\n## What's happening?\nThere are at least 2 missing replicas for Deployment {{kube_namespace.name}}/{{kube_deployment.name}} over the last 15 minutes.\n\n{{/is_alert}}",
     "name": "[Kubernetes] Monitor Kubernetes Deployments Replica Pods",
     "options": {
       "escalation_message": "",

kubernetes/assets/monitors/monitor_node_unavailable.json

Lines changed: 1 addition & 1 deletion
@@ -8,7 +8,7 @@
   ],
   "description": "Kubernetes nodes can either be schedulable or unschedulable. When unschedulable, the node prevents the scheduler from placing new pods onto that node. This monitor tracks the percentage of schedulable nodes.",
   "definition": {
-    "message": "More than 20% of nodes are unschedulable on ({{kube_cluster_name.name}} cluster). \n Keep in mind that this might be expected based on your infrastructure.",
+    "message": "{{#is_alert}}\n\n## What's happening?\nThe percentage of schedulable nodes is below 80% for status:schedulable on ({{kube_cluster_name.name}} cluster) over the last 15 minutes.\n\n{{/is_alert}}\n\n Keep in mind that this might be expected based on your infrastructure.",
     "name": "[Kubernetes] Monitor Unschedulable Kubernetes Nodes",
     "options": {
       "escalation_message": "",

kubernetes/assets/monitors/monitor_pod_crashloopbackoff.json

Lines changed: 1 addition & 1 deletion
@@ -8,7 +8,7 @@
   ],
   "description": "The status CrashloopBackOff means that a container in the Pod is started, crashes, and is restarted, over and over again. This monitor tracks when a pod is in a CrashloopBackOff state for your Kubernetes integration.",
   "definition": {
-    "message": "pod {{pod_name.name}} is in CrashloopBackOff on {{kube_namespace.name}} \n This alert could generate several alerts for a bad deployment. Adjust the thresholds of the query to suit your infrastructure.",
+    "message": "{{#is_alert}}\n\n## What's happening?\nAt least one container in pod {{pod_name.name}} on {{kube_namespace.name}} is in a waiting state due to reason crashloopbackoff in the last 10 minutes.\n\n{{/is_alert}}\n\n This alert could generate several alerts for a bad deployment. Adjust the thresholds of the query to suit your infrastructure.",
     "name": "[Kubernetes] Pod {{pod_name.name}} is CrashloopBackOff on namespace {{kube_namespace.name}}",
     "options": {
       "escalation_message": "",

kubernetes/assets/monitors/monitor_pod_imagepullbackoff.json

Lines changed: 1 addition & 1 deletion
@@ -8,7 +8,7 @@
   ],
   "description": "The status ImagePullBackOff means that a container could not start because Kubernetes could not pull a container image. This monitor tracks when a pod is in an ImagePullBackOff state for your Kubernetes integration.",
   "definition": {
-    "message": "pod {{pod_name.name}} is in ImagePullBackOff on {{kube_namespace.name}} \n This could happen for several reasons, for example a bad image path or tag or if the credentials for pulling images are not configured properly.",
+    "message": "{{#is_alert}}\n\n## What's happening?\nAt least one container in pod {{pod_name.name}} on namespace {{kube_namespace.name}} is in a waiting state due to an ImagePullBackOff error in the last 10 minutes.\n\n{{/is_alert}}\n\n This could happen for several reasons, for example a bad image path or tag or if the credentials for pulling images are not configured properly.",
     "name": "[Kubernetes] Pod {{pod_name.name}} is ImagePullBackOff on namespace {{kube_namespace.name}}",
     "options": {
       "escalation_message": "",

kubernetes/assets/monitors/monitor_pod_oomkilled.json

Lines changed: 1 addition & 1 deletion
@@ -8,7 +8,7 @@
   ],
   "description": "The status OOMKilled means that a container was killed because it exceeded memory limits or the node ran out of available memory. This monitor tracks when a pod is in an OOMKilled state for your Kubernetes integration.",
   "definition": {
-    "message": "pod {{pod_name.name}} is in OOMKilled on {{kube_namespace.name}} \n This could happen for several reasons, for example insufficient memory limits, memory leaks in the application, or the node running out of available memory.",
+    "message": "{{#is_alert}}\n\n## What's happening?\nThere has been at least one container terminated in pod {{pod_name.name}} on namespace {{kube_namespace.name}} with reason oomkilled in the last 10 minutes.\n\n{{/is_alert}}\n\n This could happen for several reasons, for example insufficient memory limits, memory leaks in the application, or the node running out of available memory.",
     "name": "[Kubernetes] Pod {{pod_name.name}} is OOMKilled on namespace {{kube_namespace.name}}",
     "options": {
       "escalation_message": "",

kubernetes/assets/monitors/monitor_pods_failed_state.json

Lines changed: 1 addition & 1 deletion
@@ -8,7 +8,7 @@
   ],
   "description": "When a pod is failing it means the container either exited with non-zero status or was terminated by the system. This monitor tracks when more than 10 pods are failing for a given Kubernetes cluster.",
   "definition": {
-    "message": "More than ten pods are failing in ({{kube_cluster_name.name}} cluster). \n The threshold of ten pods varies depending on your infrastructure. Change the threshold to suit your needs.",
+    "message": "{{#is_alert}}\n\n## What's happening?\nThe number of failed pods has increased by more than 10 in ({{kube_cluster_name.name}} cluster) in the last 5 minutes.\n\n{{/is_alert}}\n\n The threshold of ten pods varies depending on your infrastructure. Change the threshold to suit your needs.",
     "name": "[Kubernetes] Monitor Kubernetes Failed Pods in Namespaces",
     "options": {
       "escalation_message": "",

kubernetes/assets/monitors/monitor_pods_restarting.json

Lines changed: 1 addition & 1 deletion
@@ -8,7 +8,7 @@
   ],
   "description": "Kubernetes pods restart according to the restart policy. A restarting container can indicate problems with memory, CPU usage, or an application exiting prematurely. This monitor tracks when pods are restarting multiple times.",
   "definition": {
-    "message": "Pod {{pod_name.name}} restarted multiple times in the last five minutes.",
+    "message": "{{#is_alert}}\n\n## What's happening?\nThere has been an increase of more than 5 container restarts in the pod {{pod_name.name}} in the last 5 minutes.\n\n{{/is_alert}}",
     "name": "[Kubernetes] Monitor Kubernetes Pods Restarting",
     "options": {
       "escalation_message": "",

kubernetes/assets/monitors/monitor_statefulset_replicas.json

Lines changed: 1 addition & 1 deletion
@@ -8,7 +8,7 @@
   ],
   "description": "Kubernetes replicas are clones that facilitate self-healing for pods. Each pod has a desired number of replica Pods that should be running at any given time. This monitor tracks when the number of replicas per statefulset is falling.",
   "definition": {
-    "message": "More than one Statefulset Replica's pods are down in Statefulset {{kube_namespace.name}}/{{kube_stateful_set.name}}. This might present an unsafe situation for any further manual operations, such as killing other pods.",
+    "message": "{{#is_alert}}\n\n## What's happening?\nThere are at least 2 desired replicas that are not ready for {{kube_namespace.name}}/{{kube_stateful_set.name}} StatefulSet over the last 15 minutes.\n\n{{/is_alert}}\n\n This might present an unsafe situation for any further manual operations, such as killing other pods.",
     "name": "[Kubernetes] Monitor Kubernetes Statefulset Replicas",
     "options": {
       "escalation_message": "",

nginx/assets/monitors/4xx.json

Lines changed: 1 addition & 1 deletion
@@ -8,7 +8,7 @@
   ],
   "description": "NGINX sends requests to upstream peers that can fail eventually. This monitor tracks the count of 4xx HTTP responses to identify issues in the communication between NGINX and the backend servers.",
   "definition": {
-    "message": "Number of 4xx errors on NGINX upstreams is at {{value}} which is higher than usual.",
+    "message": "{{#is_alert}}\n\n## What's happening?\nAn anomaly has been detected in the number of 4xx responses from NGINX upstream peers over the last hour, with a value of {{value}}.\n\n{{/is_alert}}",
     "name": "[NGINX] 4xx Errors higher than usual",
     "options": {
       "escalation_message": "",

nginx/assets/monitors/5xx.json

Lines changed: 1 addition & 1 deletion
@@ -8,7 +8,7 @@
   ],
   "description": "“5xx upstream request errors” are indicating server issues from backend servers. This monitor tracks the count of 5xx responses from NGINX's upstream peers to identify server-related issues in your web or application infrastructure.",
   "definition": {
-    "message": "Number of 5xx errors on NGINX upstreams is at {{value}} which is higher than usual.",
+    "message": "{{#is_alert}}\n\n## What's happening?\nAn anomaly has been detected in the number of 5xx responses from NGINX upstream peers over the last hour, with a value of {{value}}.\n\n{{/is_alert}}\n\n",
     "name": "[NGINX] 5xx Errors higher than usual",
     "options": {
       "escalation_message": "",
