Commit aef920a

Rabbitmq (#181)
* Add rabbitmq resource
* Add dashboards
* Change the test
1 parent d5a330e commit aef920a

14 files changed, +3917 -3 lines changed

apps/rabbitmq.yaml

Lines changed: 2 additions & 2 deletions
@@ -4,12 +4,12 @@ kind: App
 name: "rabbitmq"
 keywords:
   - Message-broker
-  - Coming soon
+  - Available
 availableVersions:
   - '3.8'
 shortDescription: "RabbitMQ is the most widely deployed open source message broker."
 description: |
   RabbitMQ is the most widely deployed open source message broker.
 icon: https://upload.wikimedia.org/wikipedia/commons/7/71/RabbitMQ_logo.svg
 website: https://www.rabbitmq.com/
-available: false
+available: true

resources/rabbitmq/ALERTS.md

Lines changed: 22 additions & 0 deletions
@@ -0,0 +1,22 @@
+# Alerts
+## RabbitMQClusterOperatorUnavailableReplicas
+There are pods that are either running but not yet available or pods that still have not been created.
+## InsufficientEstablishedErlangDistributionLinks
+There are only `{{ $value }}` established Erlang distribution links
+## LowDiskWatermarkPredicted
+The predicted free disk space in 24 hours from now is `{{ $value | humanize1024 }}B`
+## HighConnectionChurn
+Over the last 5 minutes, `{{ $value | humanizePercentage }}` of total connections are closed or opened per second in RabbitMQ cluster `{{ $labels.rabbitmq_cluster }}` in namespace `{{ $labels.namespace }}`.
+## NoMajorityOfNodesReady
+Only `{{ $value }}` replicas are ready in StatefulSet `{{ $labels.statefulset }}` of RabbitMQ cluster `{{ $labels.label_app_kubernetes_io_name }}` in namespace `{{ $labels.namespace }}`.
+## PersistentVolumeMissing
+PersistentVolumeClaim `{{ $labels.persistentvolumeclaim }}` of RabbitMQ cluster `{{ $labels.label_app_kubernetes_io_name }}` in namespace `{{ $labels.namespace }}` is not bound.
+## UnroutableMessages
+There were `{{ $value | printf "%.0f" }}` unroutable messages within the last 5 minutes in RabbitMQ cluster `{{ $labels.rabbitmq_cluster }}` in namespace `{{ $labels.namespace }}`.
+## FileDescriptorsNearLimit
+`{{ $value | humanizePercentage }}` file descriptors of file descriptor limit are used in RabbitMQ node `{{ $labels.rabbitmq_node }}`, pod `{{ $labels.pod }}`, RabbitMQ cluster `{{ $labels.rabbitmq_cluster }}`, namespace `{{ $labels.namespace }}`.
+## ContainerRestarts
+Over the last 10 minutes, container `{{ $labels.container }}` restarted `{{ $value | printf "%.0f" }}` times in pod `{{ $labels.pod }}` of RabbitMQ cluster `{{ $labels.rabbitmq_cluster }}` in namespace `{{ $labels.namespace }}`.
+## TCPSocketsNearLimit
+`{{ $value | humanizePercentage }}` TCP sockets of TCP socket limit are open in RabbitMQ node `{{ $labels.rabbitmq_node }}`, pod `{{ $labels.pod }}`, RabbitMQ cluster `{{ $labels.rabbitmq_cluster }}`, namespace `{{ $labels.namespace }}`.
+##

resources/rabbitmq/INSTALL.md

Lines changed: 8 additions & 0 deletions
@@ -0,0 +1,8 @@
+## Enable Prometheus Metrics
+RabbitMQ exposes Prometheus metrics and annotates the pod that serves the metrics API with Prometheus annotations.
+
+Make sure the Prometheus metrics plugin is enabled. If it is not, enable it with the following command:
+
+```sh
+rabbitmq-plugins enable rabbitmq_prometheus
+```
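
Once the plugin is enabled, the node serves metrics in the Prometheus text exposition format. A minimal Python sketch for checking that a scrape contains the metric families the alerts rely on; the helper name `metric_names` and the sample payload are illustrative assumptions, not part of this commit.

```python
def metric_names(exposition_text):
    """Return the set of metric names found in Prometheus text-format output."""
    names = set()
    for line in exposition_text.splitlines():
        line = line.strip()
        # Skip blank lines and "# HELP" / "# TYPE" comment lines.
        if not line or line.startswith("#"):
            continue
        # A sample line is either `name{labels} value` or `name value`.
        name = line.split("{", 1)[0].split(" ", 1)[0]
        names.add(name)
    return names


# Hypothetical scrape output, shortened for illustration.
sample = """\
# TYPE rabbitmq_identity_info gauge
rabbitmq_identity_info{rabbitmq_node="rabbit@host",rabbitmq_cluster="demo"} 1
# TYPE rabbitmq_connections gauge
rabbitmq_connections 3
"""

found = metric_names(sample)
missing = {"rabbitmq_identity_info", "rabbitmq_connections"} - found
print("missing metrics:", sorted(missing))
```

In practice the text would come from the node's metrics endpoint rather than a literal string.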

resources/rabbitmq/README.md

Lines changed: 11 additions & 0 deletions
@@ -0,0 +1,11 @@
+# RabbitMQ
+RabbitMQ is the most widely deployed open source message broker.
+
+With tens of thousands of users, RabbitMQ is one of the most popular open source message brokers. From T-Mobile to Runtastic, RabbitMQ is used worldwide at small startups and large enterprises.
+
+RabbitMQ is lightweight and easy to deploy on premises and in the cloud. It supports multiple messaging protocols. RabbitMQ can be deployed in distributed and federated configurations to meet high-scale, high-availability requirements.
+
+# Attributions
+Configuration files and dashboards are maintained by the [Sysdig team](https://sysdig.com/).
+
+All dashboards and alerts are adapted from the [RabbitMQ cluster-operator repository](https://github.com/rabbitmq/cluster-operator/tree/main/observability/).

resources/rabbitmq/alerts.yaml

Lines changed: 242 additions & 0 deletions
@@ -0,0 +1,242 @@
+apiVersion: v1
+kind: Alert
+app: "rabbitmq"
+version: 1.0.0
+appVersion:
+  - '3.8'
+descriptionFile: ALERTS.md
+configurations:
+  - kind: Prometheus
+    data: |
+      groups:
+      - name: rabbitmq-cluster-operator
+        rules:
+        - alert: RabbitMQClusterOperatorUnavailableReplicas
+          expr: |
+            kube_deployment_status_replicas_unavailable{deployment="rabbitmq-cluster-operator"}
+            >
+            0
+          for: 5m
+          annotations:
+            description: |
+              `{{ $value }}` replicas are unavailable in Deployment `rabbitmq-cluster-operator`
+              in namespace `{{ $labels.namespace }}`.
+            summary: |
+              There are pods that are either running but not yet available or pods that still have not been created.
+              Check the status of the deployment: `kubectl -n {{ $labels.namespace }} describe deployment rabbitmq-cluster-operator`
+              Check the status of the pod: `kubectl -n {{ $labels.namespace }} describe pod -l app.kubernetes.io/component=rabbitmq-cluster-operator`
+          labels:
+            rulesgroup: rabbitmq-operator
+            severity: warning
+      - name: rabbitmq
+        rules:
+        - alert: InsufficientEstablishedErlangDistributionLinks
+          # erlang_vm_dist_node_state: 1=pending, 2=up_pending, 3=up
+          expr: |
+            count by (namespace, rabbitmq_cluster) (erlang_vm_dist_node_state * on(instance) group_left(rabbitmq_cluster) rabbitmq_identity_info == 3)
+            <
+            count by (namespace, rabbitmq_cluster) (rabbitmq_build_info * on(instance) group_left(rabbitmq_cluster) rabbitmq_identity_info)
+            *
+            (count by (namespace, rabbitmq_cluster) (rabbitmq_build_info * on(instance) group_left(rabbitmq_cluster) rabbitmq_identity_info) -1 )
+          for: 10m
+          annotations:
+            description: |
+              There are only `{{ $value }}` established Erlang distribution links
+              in RabbitMQ cluster `{{ $labels.rabbitmq_cluster }}` in namespace `{{ $labels.namespace }}`.
+            summary: |
+              RabbitMQ clusters have a full mesh topology.
+              All RabbitMQ nodes connect to all other RabbitMQ nodes in both directions.
+              The expected number of established Erlang distribution links is therefore `n*(n-1)` where `n` is the number of RabbitMQ nodes in the cluster.
+              Therefore, the expected number of distribution links is `0` for a 1-node cluster, `6` for a 3-node cluster, and `20` for a 5-node cluster.
+              This alert reports that the number of established distribution links is less than the expected number.
+              Some reasons for this alert include failed network links, network partitions, failed clustering (i.e. nodes can't join the cluster).
+              Check the panels `All distribution links`, `Established distribution links`, `Connecting distributions links`, `Waiting distribution links`, and `distribution links`
+              of the Grafana dashboard `Erlang-Distribution`.
+              Check the logs of the RabbitMQ nodes: `kubectl -n {{ $labels.namespace }} logs -l app.kubernetes.io/component=rabbitmq,app.kubernetes.io/name={{ $labels.rabbitmq_cluster }}`
+          labels:
+            rulesgroup: rabbitmq
+            severity: warning
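
Review note: the `n*(n-1)` arithmetic in the summary of `InsufficientEstablishedErlangDistributionLinks` can be checked with a small Python sketch. The function names are illustrative assumptions; only the formula and the three example values come from the rule text.

```python
def expected_distribution_links(node_count):
    """Links in a full mesh where each node connects to every other node,
    counted once per direction: n * (n - 1)."""
    return node_count * (node_count - 1)


def links_insufficient(established_links, node_count):
    """Mirror the alert condition: fewer established links than expected."""
    return established_links < expected_distribution_links(node_count)


for n in (1, 3, 5):
    print(n, "nodes ->", expected_distribution_links(n), "links")
```

A 3-node cluster reporting only 4 established links would satisfy the alert condition, since 4 is less than the expected 6.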
+        - alert: LowDiskWatermarkPredicted
+          # The 2nd condition ensures that data points are available until 24 hours ago such that no false positive alerts are triggered for newly created RabbitMQ clusters.
+          expr: |
+            (
+              predict_linear(rabbitmq_disk_space_available_bytes[24h], 60*60*24) * on (instance) group_left(rabbitmq_cluster, rabbitmq_node, pod) rabbitmq_identity_info
+              <
+              rabbitmq_disk_space_available_limit_bytes * on (instance) group_left(rabbitmq_cluster, rabbitmq_node, pod) rabbitmq_identity_info
+            )
+            and
+            (
+              count_over_time(rabbitmq_disk_space_available_limit_bytes[2h] offset 22h) * on (instance) group_left(rabbitmq_cluster, rabbitmq_node, pod) rabbitmq_identity_info
+              >
+              0
+            )
+          for: 60m
+          annotations:
+            description: |
+              The predicted free disk space in 24 hours from now is `{{ $value | humanize1024 }}B`
+              in RabbitMQ node `{{ $labels.rabbitmq_node }}`, pod `{{ $labels.pod }}`,
+              RabbitMQ cluster `{{ $labels.rabbitmq_cluster }}`, namespace `{{ $labels.namespace }}`.
+            summary: |
+              Based on the trend of available disk space over the past 24 hours, it's predicted that, in 24 hours from now, a disk alarm will be triggered since the free disk space will drop below the free disk space limit.
+              This alert is reported for the partition where the RabbitMQ data directory is stored.
+              When the disk alarm is triggered, all publishing connections across all cluster nodes will be blocked.
+              See
+              https://www.rabbitmq.com/alarms.html,
+              https://www.rabbitmq.com/disk-alarms.html,
+              https://www.rabbitmq.com/production-checklist.html#resource-limits-disk-space,
+              https://www.rabbitmq.com/persistence-conf.html,
+              https://www.rabbitmq.com/connection-blocked.html.
+          labels:
+            rulesgroup: rabbitmq
+            severity: warning
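
Review note: `LowDiskWatermarkPredicted` relies on PromQL's `predict_linear`, which fits a straight line to the samples in the range by least squares and extrapolates it. The Python sketch below shows that idea under simplifying assumptions: it extrapolates from the newest sample rather than from the rule evaluation time, and the function names and sample data are hypothetical.

```python
def predict_linear_sketch(samples, seconds_ahead):
    """Fit a least-squares line to (timestamp_seconds, value) samples and
    return its value `seconds_ahead` seconds after the newest sample."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples to fit a line")
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    covariance = sum((t - mean_t) * (v - mean_v) for t, v in samples)
    variance = sum((t - mean_t) ** 2 for t, _ in samples)
    slope = covariance / variance
    intercept = mean_v - slope * mean_t
    newest_t = samples[-1][0]
    return intercept + slope * (newest_t + seconds_ahead)


def low_disk_predicted(samples, limit_bytes, horizon_seconds=24 * 60 * 60):
    """First half of the alert condition: predicted free space below the limit."""
    return predict_linear_sketch(samples, horizon_seconds) < limit_bytes


GIB = 1024 ** 3
# Hypothetical data: hourly samples over 24 hours, losing 1 GiB per hour from 100 GiB.
samples = [(hour * 3600, (100 - hour) * GIB) for hour in range(25)]
predicted = predict_linear_sketch(samples, 24 * 60 * 60)
print(round(predicted / GIB, 3), "GiB predicted free in 24 hours")
```

The sketch leaves out the rule's second condition, which requires data points from roughly 24 hours ago so that newly created clusters do not trigger false positives.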
+        - alert: HighConnectionChurn
+          expr: |
+            (
+              sum(rate(rabbitmq_connections_closed_total[5m]) * on(instance) group_left(rabbitmq_cluster, rabbitmq_node) rabbitmq_identity_info) by(namespace, rabbitmq_cluster)
+              +
+              sum(rate(rabbitmq_connections_opened_total[5m]) * on(instance) group_left(rabbitmq_cluster, rabbitmq_node) rabbitmq_identity_info) by(namespace, rabbitmq_cluster)
+            )
+            /
+            sum (rabbitmq_connections * on(instance) group_left(rabbitmq_cluster) rabbitmq_identity_info) by (namespace, rabbitmq_cluster)
+            > 0.1
+            unless
+            sum (rabbitmq_connections * on(instance) group_left(rabbitmq_cluster) rabbitmq_identity_info) by (namespace, rabbitmq_cluster)
+            < 100
+          for: 10m
+          annotations:
+            description: |
+              Over the last 5 minutes, `{{ $value | humanizePercentage }}`
+              of total connections are closed or opened per second in RabbitMQ cluster `{{ $labels.rabbitmq_cluster }}`
+              in namespace `{{ $labels.namespace }}`.
+            summary: |
+              More than 10% of total connections are churning.
+              This means that client application connections are short-lived instead of long-lived.
+              Read https://www.rabbitmq.com/connections.html#high-connection-churn to understand why this is an anti-pattern.
+          labels:
+            rulesgroup: rabbitmq
+            severity: warning
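
Review note: the `HighConnectionChurn` expression reads as "(closed rate + opened rate) / total connections > 0.1, unless the cluster has fewer than 100 connections". A minimal Python sketch of that per-cluster decision, with hypothetical function and parameter names:

```python
def high_connection_churn(closed_per_sec, opened_per_sec, total_connections,
                          ratio_threshold=0.1, min_connections=100):
    """Return True when the churn condition of the rule would hold for one cluster."""
    # The `unless ... < 100` clause suppresses the alert for small clusters.
    # Checking it first also avoids dividing by zero connections.
    if total_connections < min_connections:
        return False
    churn_ratio = (closed_per_sec + opened_per_sec) / total_connections
    return churn_ratio > ratio_threshold


print(high_connection_churn(6, 6, 100))  # ratio 0.12 with exactly 100 connections
print(high_connection_churn(6, 6, 99))   # suppressed: fewer than 100 connections
```

The `for: 10m` clause is not modelled here; Prometheus requires the condition to hold continuously for that long before the alert fires.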
+        - alert: NoMajorityOfNodesReady
+          expr: |
+            kube_statefulset_status_replicas_ready * on (namespace, statefulset) group_left(label_app_kubernetes_io_name) kube_statefulset_labels{label_app_kubernetes_io_component="rabbitmq"}
+            <=
+            kube_statefulset_replicas * on (namespace, statefulset) group_left(label_app_kubernetes_io_name) kube_statefulset_labels{label_app_kubernetes_io_component="rabbitmq"}
+            / 2
+            unless
+            kube_statefulset_replicas * on (namespace, statefulset) group_left(label_app_kubernetes_io_name) kube_statefulset_labels{label_app_kubernetes_io_component="rabbitmq"}
+            == 0
+          for: 5m
+          annotations:
+            description: |
+              Only `{{ $value }}` replicas are ready in StatefulSet `{{ $labels.statefulset }}`
+              of RabbitMQ cluster `{{ $labels.label_app_kubernetes_io_name }}` in namespace `{{ $labels.namespace }}`.
+            summary: |
+              No majority of nodes have been ready for the last 5 minutes.
+              Check the details of the pods:
+              `kubectl -n {{ $labels.namespace }} describe pods -l app.kubernetes.io/component=rabbitmq,app.kubernetes.io/name={{ $labels.label_app_kubernetes_io_name }}`
+          labels:
+            rabbitmq_cluster: '{{ $labels.label_app_kubernetes_io_name }}'
+            rulesgroup: rabbitmq
+            severity: warning
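
Review note: `NoMajorityOfNodesReady` compares ready replicas against half of the desired replicas and skips StatefulSets scaled to zero. A short Python sketch of that comparison; the function name is an illustrative assumption.

```python
def no_majority_ready(ready_replicas, desired_replicas):
    """Return True when at most half of the desired replicas are ready."""
    # The `unless ... == 0` clause: a StatefulSet scaled to zero never alerts.
    if desired_replicas == 0:
        return False
    return ready_replicas <= desired_replicas / 2


for ready, desired in [(1, 3), (2, 3), (2, 4), (0, 0)]:
    print(ready, "of", desired, "ready ->", no_majority_ready(ready, desired))
```

Because the comparison is `<=`, exactly half of the nodes being ready (2 of 4) also counts as no majority.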
+        - alert: PersistentVolumeMissing
+          expr: |
+            kube_persistentvolumeclaim_status_phase{phase="Bound"} * on (namespace, persistentvolumeclaim) group_left(label_app_kubernetes_io_name) kube_persistentvolumeclaim_labels{label_app_kubernetes_io_component="rabbitmq"}
+            ==
+            0
+          for: 10m
+          annotations:
+            description: |
+              PersistentVolumeClaim `{{ $labels.persistentvolumeclaim }}` of
+              RabbitMQ cluster `{{ $labels.label_app_kubernetes_io_name }}` in namespace
+              `{{ $labels.namespace }}` is not bound.
+            summary: |
+              RabbitMQ needs a PersistentVolume for its data.
+              However, there is no PersistentVolume bound to the PersistentVolumeClaim.
+              This means the requested storage could not be provisioned.
+              Check the status of the PersistentVolumeClaim: `kubectl -n {{ $labels.namespace }} describe pvc {{ $labels.persistentvolumeclaim }}`.
+          labels:
+            rabbitmq_cluster: '{{ $labels.label_app_kubernetes_io_name }}'
+            rulesgroup: rabbitmq
+            severity: critical
+        - alert: UnroutableMessages
+          expr: |
+            sum by(namespace, rabbitmq_cluster) (increase(rabbitmq_channel_messages_unroutable_dropped_total[5m]) * on(instance) group_left(rabbitmq_cluster) rabbitmq_identity_info)
+            >= 1
+            or
+            sum by(namespace, rabbitmq_cluster) (increase(rabbitmq_channel_messages_unroutable_returned_total[5m]) * on(instance) group_left(rabbitmq_cluster) rabbitmq_identity_info)
+            >= 1
+          annotations:
+            description: |
+              There were `{{ $value | printf "%.0f" }}` unroutable messages within the last
+              5 minutes in RabbitMQ cluster `{{ $labels.rabbitmq_cluster }}` in namespace
+              `{{ $labels.namespace }}`.
+            summary: |
+              There are messages published into an exchange which cannot be routed and are either dropped silently, or returned to publishers.
+              Is your routing topology set up correctly?
+              Check your application code and bindings between exchanges and queues.
+              See
+              https://www.rabbitmq.com/publishers.html#unroutable,
+              https://www.rabbitmq.com/confirms.html#when-publishes-are-confirmed.
+          labels:
+            rulesgroup: rabbitmq
+            severity: warning
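
Review note: `UnroutableMessages` applies `increase(...[5m])` to two counters and fires when either grew by at least 1. The Python sketch below approximates how a counter increase is derived from raw samples, including a counter reset; it ignores the extrapolation that Prometheus applies at the window edges, and all names are hypothetical.

```python
def counter_increase(samples):
    """Sum the growth of a monotonically increasing counter across samples.
    A drop between two samples is treated as a reset to zero."""
    if not samples:
        return 0
    total = 0
    previous = samples[0]
    for current in samples[1:]:
        if current >= previous:
            total += current - previous
        else:
            # Counter restarted: everything counted since the reset is new.
            total += current
        previous = current
    return total


def unroutable_messages(dropped_samples, returned_samples):
    """Mirror the rule: alert when either counter grew by at least 1."""
    return (counter_increase(dropped_samples) >= 1
            or counter_increase(returned_samples) >= 1)


print(counter_increase([10, 12, 15]))
print(unroutable_messages([7, 7, 7], [3, 3, 4]))
```

The rule has no `for:` clause, so a single evaluation that meets the condition is enough to fire the alert.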
+        - alert: FileDescriptorsNearLimit
+          expr: |
+            sum by(namespace, rabbitmq_cluster, pod, rabbitmq_node) (max_over_time(rabbitmq_process_open_fds[5m]) * on(instance) group_left(rabbitmq_cluster, rabbitmq_node, pod) rabbitmq_identity_info)
+            /
+            sum by(namespace, rabbitmq_cluster, pod, rabbitmq_node) (rabbitmq_process_max_fds * on(instance) group_left(rabbitmq_cluster, rabbitmq_node, pod) rabbitmq_identity_info)
+            > 0.8
+          for: 10m
+          annotations:
+            description: |
+              `{{ $value | humanizePercentage }}` file descriptors of file
+              descriptor limit are used in RabbitMQ node `{{ $labels.rabbitmq_node }}`,
+              pod `{{ $labels.pod }}`, RabbitMQ cluster `{{ $labels.rabbitmq_cluster }}`,
+              namespace `{{ $labels.namespace }}`.
+            summary: |
+              More than 80% of file descriptors are used on the RabbitMQ node.
+              When this value reaches 100%, new connections will not be accepted and disk write operations may fail.
+              Client libraries, peer nodes and CLI tools will not be able to connect when the node runs out of available file descriptors.
+              See https://www.rabbitmq.com/production-checklist.html#resource-limits-file-handle-limit.
+          labels:
+            rulesgroup: rabbitmq
+            severity: warning
+        - alert: ContainerRestarts
+          expr: |
+            increase(kube_pod_container_status_restarts_total[10m]) * on(namespace, pod, container) group_left(rabbitmq_cluster) rabbitmq_identity_info
+            >=
+            1
+          for: 5m
+          annotations:
+            description: |
+              Over the last 10 minutes, container `{{ $labels.container }}`
+              restarted `{{ $value | printf "%.0f" }}` times in pod `{{ $labels.pod }}` of RabbitMQ cluster
+              `{{ $labels.rabbitmq_cluster }}` in namespace `{{ $labels.namespace }}`.
+            summary: |
+              Investigate why the container got restarted.
+              Check the logs of the current container: `kubectl -n {{ $labels.namespace }} logs {{ $labels.pod }}`
+              Check the logs of the previous container: `kubectl -n {{ $labels.namespace }} logs {{ $labels.pod }} --previous`
+              Check the last state of the container: `kubectl -n {{ $labels.namespace }} get pod {{ $labels.pod }} -o jsonpath='{.status.containerStatuses[].lastState}'`
+          labels:
+            rabbitmq_cluster: '{{ $labels.rabbitmq_cluster }}'
+            rulesgroup: rabbitmq
+            severity: warning
+        - alert: TCPSocketsNearLimit
+          expr: |
+            sum by(namespace, rabbitmq_cluster, pod, rabbitmq_node) (max_over_time(rabbitmq_process_open_tcp_sockets[5m]) * on(instance) group_left(rabbitmq_cluster, rabbitmq_node, pod) rabbitmq_identity_info)
+            /
+            sum by(namespace, rabbitmq_cluster, pod, rabbitmq_node) (rabbitmq_process_max_tcp_sockets * on(instance) group_left(rabbitmq_cluster, rabbitmq_node, pod) rabbitmq_identity_info)
+            > 0.8
+          for: 10m
+          annotations:
+            description: |
+              `{{ $value | humanizePercentage }}` TCP sockets of TCP socket
+              limit are open in RabbitMQ node `{{ $labels.rabbitmq_node }}`, pod `{{ $labels.pod }}`,
+              RabbitMQ cluster `{{ $labels.rabbitmq_cluster }}`, namespace `{{ $labels.namespace }}`.
+            summary: |
+              More than 80% of TCP sockets are open on the RabbitMQ node.
+              When this value reaches 100%, new connections will not be accepted.
+              Client libraries, peer nodes and CLI tools will not be able to connect when the node runs out of available TCP sockets.
+              See https://www.rabbitmq.com/networking.html.
+          labels:
+            rulesgroup: rabbitmq
+            severity: warning
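
Review note: the two "near limit" rules (`FileDescriptorsNearLimit` and `TCPSocketsNearLimit`) share one shape: the peak usage over the last 5 minutes divided by the configured limit, compared against 0.8. A small Python sketch of that check, with hypothetical names and sample values:

```python
def near_limit(open_samples, limit, threshold=0.8):
    """Return True when peak usage in the window exceeds `threshold` of the limit."""
    if limit <= 0 or not open_samples:
        return False
    # max_over_time(...[5m]) in the rules corresponds to the peak of the window.
    peak = max(open_samples)
    return peak / limit > threshold


# Hypothetical node with a limit of 1000 TCP sockets.
print(near_limit([640, 790, 810], 1000))  # peak 810 is above 80% of the limit
print(near_limit([640, 790, 800], 1000))  # peak 800 is exactly 80%, not above
```

Using the window peak rather than the latest sample means a short spike keeps the condition true for up to 5 minutes after it ends.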

resources/rabbitmq/dashboards.yaml

Lines changed: 29 additions & 0 deletions
@@ -0,0 +1,29 @@
+apiVersion: v1
+kind: Dashboard
+app: "rabbitmq"
+version: 1.0.0
+appVersion:
+  - '3.8'
+configurations:
+  - name: Rabbitmq Overview
+    kind: Sysdig
+    image: rabbitmq/images/rabbitmq_overview_sysdig.png
+    description: |
+      This dashboard offers information on:
+      * Node identity, including RabbitMQ & Erlang/OTP version
+      * Node memory & disk available before publishers blocked (alarm triggers)
+      * Node file descriptors & TCP sockets available
+      * Ready & pending messages
+    file: include/dashboard-Sysdig-rabbitmq-overview.json
+  - name: Rabbitmq Usage
+    kind: Sysdig
+    image: rabbitmq/images/rabbitmq_usage_sysdig.png
+    description: |
+      This dashboard offers information on:
+      * Incoming message rates: published / routed to queues / confirmed / unconfirmed / returned / dropped
+      * Outgoing message rates: delivered with auto or manual acks / acknowledged / redelivered
+      * Polling operation with auto or manual acks, as well as empty ops
+      * Queues, including declaration & deletion rates
+      * Channels, including open & close rates
+      * Connections, including open & close rates
+    file: include/dashboard-Sysdig-rabbitmq-usage.json
