sysdiglabs
diff --git a/apps/consul.yaml b/apps/consul.yaml
Lines changed: 15 additions & 0 deletions
diff --git a/apps/images/consul.png b/apps/images/consul.png
8.94 KB
diff --git a/resources/consul/ALERTS.md b/resources/consul/ALERTS.md
Lines changed: 55 additions & 0 deletions
diff --git a/resources/consul/INSTALL.md b/resources/consul/INSTALL.md
Lines changed: 2 additions & 0 deletions
diff --git a/resources/consul/README.md b/resources/consul/README.md
Lines changed: 13 additions & 0 deletions
diff --git a/resources/consul/alerts.yaml b/resources/consul/alerts.yaml
Lines changed: 162 additions & 0 deletions
diff --git a/resources/consul/dashboards.yaml b/resources/consul/dashboards.yaml
Lines changed: 19 additions & 0 deletions
diff --git a/resources/consul/description.yaml b/resources/consul/description.yaml
Lines changed: 7 additions & 0 deletions
diff --git a/resources/consul/images/consul_sysdig.png b/resources/consul/images/consul_sysdig.png
1.88 MB
@@ -0,0 +1,15 @@
+---
+apiVersion: v1
+kind: App
+name: "Consul"
+keywords: 
+  - Kubernetes
+  - Available
+availableVersions: 
+  - '1.11.1'
+shortDescription: "Consul is a free and open-source service networking platform developed by HashiCorp."
+description: |
+  Consul uses service identity with automated networking to help organizations securely connect applications running in any environment.
+icon: https://raw.githubusercontent.com/sysdiglabs/promcat-resources/master/apps/images/consul.svg
+website: consul.io
+available: true
@@ -0,0 +1,55 @@
+# Alerts
+## KV Store update time anomaly
+Consul KV Store update time had noticeable deviations from baseline over the previous hour.
+
+## Transaction time anomaly
+Consul Transaction time had noticeable deviations from baseline over the previous hour.
+
+## Raft transactions count anomaly
+Consul Raft transactions count rate had noticeable deviations from baseline over the previous hour.
+
+## Raft commit time anomaly
+Consul Raft commit time had noticeable deviations from baseline over the previous hour.
+
+## Leader time to contact followers too high
+Consul Leader time to contact followers was greater than 200ms.
+
+## Flapping leadership
+There are too many leadership changes.
+
+## Too many elections
+There are too many elections for leadership.
+
+## Server cluster unhealthy
+One or more Consul servers in the cluster are unhealthy.
+
+## Zero failure tolerance
+There is no failure tolerance in case one Consul server goes down.
+
+## Client RPC requests anomaly
+Consul Client RPC requests had noticeable deviations from baseline over the previous hour.
+
+## Client RPC requests rate limit exceeded
+Over 10% of Consul Client RPC requests have exceeded the rate limit.
+
+## Client RPC requests failed
+Over 10% of Consul Client RPC requests are failing.
+
+## License Expiry
+Consul license will expire in less than 30 days.
+
+## Garbage Collection pause high
+Garbage Collection stop-the-world pauses were greater than 2 seconds per minute.
+
+## Garbage Collection pause too high
+Garbage Collection stop-the-world pauses were greater than 5 seconds per minute.
+
+## Raft restore duration too high
+The Raft FSM restore duration is too close to the Raft leader oldest log age.
+
+## RPC requests error rate is high
+RPC requests error rate is higher than 5%.
+
+## Cache hit rate is low
+Cache hit rate is lower than 95%.
+
@@ -0,0 +1,2 @@
+# Prerequisites
+Consul instruments Prometheus metrics and annotates the pods with Prometheus annotations. 
@@ -0,0 +1,13 @@
+# Consul
+Consul is a service mesh solution providing a full-featured control plane with service discovery, configuration, and segmentation functionality. Each of these features can be used individually as needed, or they can be used together to build a full service mesh. Consul requires a data plane and supports both a proxy and a native integration model. Consul ships with a simple built-in proxy so that everything works out of the box, but it also supports third-party proxy integrations such as Envoy.
+
+# Prometheus and exporters
+Consul already exposes a Prometheus endpoint with all of its metrics on port 8500, and the Envoy proxies expose theirs on port 20200. In Kubernetes the pods are already annotated, so the Sysdig agent can scrape these endpoints right away.
+
+# Metrics
+- Consul server statistics
+- Consul client statistics
+- Envoy metrics
+
+# Attributions
+Configuration files, dashboards, and alerts are maintained by the [Sysdig team](https://sysdig.com/).
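The README hunk above says the pods are already annotated for scraping. As a rough illustration of what such annotations usually look like, here is a minimal sketch using the common `prometheus.io/*` annotation convention. The pod names and the exact keys and paths are assumptions, not taken from this PR; only the ports 8500 and 20200 come from the README text.

```yaml
# Hypothetical pod metadata, shown only to illustrate the annotation convention.
# Consul server pod: metrics served by the agent HTTP API on port 8500.
apiVersion: v1
kind: Pod
metadata:
  name: consul-server-0              # hypothetical name
  annotations:
    prometheus.io/scrape: "true"
    prometheus.io/port: "8500"       # port named in the README
    prometheus.io/path: "/v1/agent/metrics"   # assumed path; the agent serves Prometheus format with ?format=prometheus
---
# Application pod with an Envoy sidecar: metrics on port 20200.
apiVersion: v1
kind: Pod
metadata:
  name: example-app                  # hypothetical name
  annotations:
    prometheus.io/scrape: "true"
    prometheus.io/port: "20200"      # port named in the README
    prometheus.io/path: "/metrics"   # assumed path
```

If the pods in a given cluster carry different annotation keys or paths, the scrape configuration has to follow those instead.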
@@ -0,0 +1,162 @@
+apiVersion: v1
+kind: Alert
+app: Consul
+version: 1.0.0
+appVersion:
+- 1.11.1
+descriptionFile: ALERTS.md
+configurations:
+- kind: Prometheus
+  data: |-
+    groups:
+    - name: Consul
+      rules:
+      - alert: KV Store update time anomaly
+        expr: |
+          avg(rate(consul_kvs_apply_sum{kube_pod_label_component="server"}[1m]) > 0)>(avg_over_time(rate(consul_kvs_apply_sum{kube_pod_label_component="server"}[1m]) [1h:1m])
+          + 2* stddev_over_time(rate(consul_kvs_apply_sum{kube_pod_label_component="server"}[1m]) [1h:1m]))
+        for: 5m
+        labels:
+          severity: medium
+        annotations:
+          description: Consul KV Store update time had noticeable deviations from baseline over the previous hour.
+      - alert: Transaction time anomaly
+        expr: |
+          avg(rate(consul_txn_apply_sum{kube_pod_label_component="server"}[1m]) > 0)>(avg_over_time(rate(consul_txn_apply_sum{kube_pod_label_component="server"}[1m]) [1h:1m])
+          + 2* stddev_over_time(rate(consul_txn_apply_sum{kube_pod_label_component="server"}[1m]) [1h:1m]))
+        for: 5m
+        labels:
+          severity: medium
+        annotations:
+          description: Consul Transaction time had noticeable deviations from baseline over the previous hour.
+      - alert: Raft transactions count anomaly
+        expr: |
+          avg(rate(consul_raft_apply{kube_pod_label_component="server"}[1m]) > 0)>(avg_over_time(rate(consul_raft_apply{kube_pod_label_component="server"}[1m]) [1h:1m])
+          + 2* stddev_over_time(rate(consul_raft_apply{kube_pod_label_component="server"}[1m]) [1h:1m]))
+        for: 5m
+        labels:
+          severity: medium
+        annotations:
+          description: Consul Raft transactions count rate had noticeable deviations from baseline over the previous hour.
+      - alert: Raft commit time anomaly
+        expr: |
+          avg(rate(consul_raft_commitTime_sum{kube_pod_label_component="server"}[1m]) > 0)>(avg_over_time(rate(consul_raft_commitTime_sum{kube_pod_label_component="server"}[1m]) [1h:1m])
+          + 2* stddev_over_time(rate(consul_raft_commitTime_sum{kube_pod_label_component="server"}[1m]) [1h:1m]))
+        for: 5m
+        labels:
+          severity: medium
+        annotations:
+          description: Consul Raft commit time had noticeable deviations from baseline over the previous hour.
+      - alert: Leader time to contact followers too high
+        expr: |
+          consul_raft_leader_lastContact{quantile="0.9"} > 200
+        for: 5m
+        labels:
+          severity: medium
+        annotations:
+          description: Consul Leader time to contact followers was greater than 200ms.
+      - alert: Flapping leadership
+        expr: |
+          sum(rate(consul_raft_state_leader{kube_pod_label_component="server"}[1m])) > 0
+        for: 5m
+        labels:
+          severity: medium
+        annotations:
+          description: There are too many leadership changes.
+      - alert: Too many elections
+        expr: |
+          sum(rate(consul_raft_state_candidate{kube_pod_label_component="server"}[1m])) > 0
+        for: 5m
+        labels:
+          severity: medium
+        annotations:
+          description: There are too many elections for leadership.
+      - alert: Server cluster unhealthy
+        expr: |
+          consul_per_server_autopilot_healthy == 0
+        for: 5m
+        labels:
+          severity: high
+        annotations:
+          description: One or more Consul servers in the cluster are unhealthy.
+      - alert: Zero failure tolerance
+        expr: |
+          consul_per_server_autopilot_failure_tolerance == 0
+        for: 5m
+        labels:
+          severity: medium
+        annotations:
+          description: There is no failure tolerance in case one Consul server goes down.
+      - alert: Client RPC requests anomaly
+        expr: |
+          avg(rate(consul_client_rpc[1m]) > 0) > (avg_over_time(rate(consul_client_rpc[1m]) [1h:1m])+ 2* stddev_over_time(rate(consul_client_rpc[1m]) [1h:1m]) )
+        for: 5m
+        labels:
+          severity: medium
+        annotations:
+          description: Consul Client RPC requests had noticeable deviations from baseline over the previous hour.
+      - alert: Client RPC requests rate limit exceeded
+        expr: |
+          rate(consul_client_rpc_exceeded[1m]) / rate(consul_client_rpc[1m]) > 0.1
+        for: 5m
+        labels:
+          severity: medium
+        annotations:
+          description: Over 10% of Consul Client RPC requests have exceeded the rate limit.
+      - alert: Client RPC requests failed
+        expr: |
+          rate(consul_client_rpc_failed[1m]) / rate(consul_client_rpc[1m]) > 0.1
+        for: 5m
+        labels:
+          severity: medium
+        annotations:
+          description: Over 10% of Consul Client RPC requests are failing.
+      - alert: License Expiry
+        expr: |
+          consul_system_licenseExpiration / 24 < 30
+        for: 5m
+        labels:
+          severity: medium
+        annotations:
+          description: Consul license will expire in less than 30 days.
+      - alert: Garbage Collection pause high
+        expr: |
+          (increase(consul_runtime_gc_pause_ns_sum[1m])) / (1000000000) > 2
+        for: 5m
+        labels:
+          severity: medium
+        annotations:
+          description: Garbage Collection stop-the-world pauses were greater than 2 seconds per minute.
+      - alert: Garbage Collection pause too high
+        expr: |
+          (min(increase(consul_runtime_gc_pause_ns_sum[1m]))) / (1000000000) > 5
+        for: 5m
+        labels:
+          severity: high
+        annotations:
+          description: Garbage Collection stop-the-world pauses were greater than 5 seconds per minute.
+      - alert: Raft restore duration too high
+        expr: |
+          consul_per_server_raft_leader_oldestLogAge < 2* max(consul_raft_fsm_lastRestoreDuration{kube_pod_label_component="server"})
+        for: 5m
+        labels:
+          severity: medium
+        annotations:
+          description: The Raft FSM restore duration is too close to the Raft leader oldest log age.
+      - alert: RPC requests error rate is high
+        expr: |
+          sum(rate(consul_rpc_request_error[1m])) / sum(rate(consul_rpc_request[1m])) > 0.05
+        for: 5m
+        labels:
+          severity: medium
+        annotations:
+          description: RPC requests error rate is higher than 5%.
+      - alert: Cache hit rate is low
+        expr: |
+          consul_consul_cache_fetch_success / (consul_consul_cache_fetch_success + ignoring(result_not_modified) group_left consul_consul_cache_fetch_error) < 0.95
+        for: 5m
+        labels:
+          severity: medium
+        annotations:
+          description: Cache hit rate is lower than 95%.
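Rules like the ones in this hunk can be checked before merging with a `promtool` unit test. The following is a minimal sketch for the `Zero failure tolerance` alert only. It assumes the `groups:` block under `data:` has been saved on its own as a plain Prometheus rule file; the file names and the `instance` label value are hypothetical.

```yaml
# consul_alerts_test.yaml (hypothetical file name)
rule_files:
  - consul_rules.yaml   # hypothetical: the `groups:` block from alerts.yaml saved as a plain rule file
evaluation_interval: 1m
tests:
  - interval: 1m
    input_series:
      # Failure tolerance stays at 0 for the whole test window.
      - series: 'consul_per_server_autopilot_failure_tolerance{instance="consul-server-0"}'
        values: '0x15'
    alert_rule_test:
      # The rule has `for: 5m`, so it is expected to be firing by the 10-minute mark.
      - eval_time: 10m
        alertname: Zero failure tolerance
        exp_alerts:
          - exp_labels:
              severity: medium
              instance: consul-server-0
            exp_annotations:
              description: There is no failure tolerance in case one Consul server goes down.
```

The test would be run with `promtool test rules consul_alerts_test.yaml`. The same pattern extends to the other threshold-based alerts by swapping the input series and the expected labels.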
@@ -0,0 +1,19 @@
+apiVersion: v1
+kind: Dashboard
+app: Consul
+version: 1.0.0
+appVersion:
+- '1.11.1'
+configurations:
+- name: Overview
+  kind: Sysdig
+  image: consul/images/consul_sysdig.png
+  description: |
+    This dashboard offers information on:
+    * Overview
+    * Health
+    * Transaction
+    * Leadership
+    * Network
+    * Cache
+  file: include/consul_sysdig.json
@@ -0,0 +1,7 @@
+apiVersion: v1
+kind: Description
+app: Consul
+version: 1.0.0
+appVersion:
+- '1.11.1'
+descriptionFile: README.md