Commit b684b14

cassandra jmx exporter update
1 parent cf509b4 commit b684b14

10 files changed, +4457 -521 lines changed

resources/cassandra/ALERTS.md

Lines changed: 16 additions & 10 deletions
@@ -1,16 +1,22 @@
 # Alerts
-## HighCpuUtilizationRate
-Cassandra compaction task pending
+## [Cassandra] Compaction Task Pending
+There are many Cassandra compaction tasks pending.

-## CassandraWriteLatency
-Cassandra write latency
+## [Cassandra] Commitlog Pending Tasks
+There are many Cassandra Commitlog tasks pending.

-## CassandraReadLatency
-Cassandra read latency
+## [Cassandra] Compaction Executor Blocked Tasks
+There are many Cassandra compaction executor blocked tasks.

-## CassandraCommitlogPendingTasks
-Cassandra commitlog pending tasks
+## [Cassandra] Flush Writer Blocked Tasks
+There are many Cassandra flush writer blocked tasks.

-## CassandraConnectionTimeoutsTotal
-Cassandra connection timeouts total
+## [Cassandra] Storage Exceptions
+There are storage exceptions in Cassandra node.
+
+## [Cassandra] High Tombstones Scanned
+There is a high number of tombstones scanned.
+
+## [Cassandra] JVM Heap Memory
+High JVM Heap Memory.

resources/cassandra/INSTALL.md

Lines changed: 10 additions & 46 deletions
@@ -1,58 +1,22 @@
 # Installing the exporter
-Cassandra exposes the metrics with JMX (Java Management Extensions). The exporter gather this metrics and expose them in Prometheus format. Usually JMX is unsecured and it has no authentication methods. In this case, the best way to deploy JMX metrics is to add a sidecar with the exporter.
+Cassandra exposes the metrics via JMX (Java Management Extensions). The exporter filters these metrics using the jmx-config file and exposes them in Prometheus format. Usually JMX is unsecured and it has no authentication methods. In this case, the best way to deploy JMX metrics is to add a sidecar with the exporter.

-```yaml
-spec:
-  template:
-    metadata:
-      annotations:
-        prometheus.io/scrape: "true"
-        prometheus.io/port: "9500"
-    spec:
-      containers:
-        - name: cassandra-exporter
-          image: quay.io/sysdig/promcat-cassandra-exporter:v0.9.10
-          imagePullPolicy: Always
-          volumeMounts:
-            - mountPath: /var/lib/cassandra
-              name: data
-          ports:
-            - name: metrics
-              containerPort: 9500
-              protocol: TCP
-          livenessProbe:
-            tcpSocket:
-              port: 9500
-            initialDelaySeconds: 180
-          readinessProbe:
-            httpGet:
-              path: /metrics
-              port: 9500
-            initialDelaySeconds: 180
-            timeoutSeconds: 45
-```
+## Steps to install

-To do so, run the following command:
+1.- Create the following ConfigMap (you must specify the namespace where you have Cassandra deployed inside the jmx-config.yaml file):

 ```
-kubectl patch deployment NameOfYourDeployment --patch https://raw.githubusercontent.com/sysdiglabs/promcat-resources/master/resources/cassandra/include/patch.yaml
+kubectl apply -f jmx-config.yaml
 ```

-Alternatively, you can download the file and run:
+2.- Deploy the exporter as a sidecar by running the following command:

 ```
-kubectl patch deployment NameOfYourDeployment --patch "$(cat patch.yaml)"
+kubectl patch sts -n cassandra cassandra-sts --patch https://raw.githubusercontent.com/sysdiglabs/promcat-resources/master/resources/cassandra/include/patch.yaml
 ```

-# Sysdig Agent configuration
-To use the Sysdig agent, do the following:
-
-1. Create the recording rules to scrape only the metrics that will be used in the dashboards.
-
-2. Copy the agent configuration provided and save it as `sysdig-agent.yaml`.
-
-3. Apply the configuration:
+Alternatively, you can download the file and run:

-```
-kubectl apply -f sysdig-agent.yaml
-```
+```
+kubectl patch sts -n cassandra cassandra-sts --patch "$(cat patch.yaml)"
+```
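
The `jmx-config.yaml` file referenced in step 1 is not part of this diff. As a rough sketch only, a ConfigMap carrying a jmx_exporter configuration could look like the following. The ConfigMap name, the `cassandra` namespace, the `config.yml` data key, and the single rule are illustrative assumptions, not the file shipped in the repository. The rule maps Cassandra's pending-compactions MBean to the `cassandra_compaction_pending_tasks` metric name that the alert rules use.

```yaml
# Hypothetical jmx-config.yaml; every name below is an assumption for illustration.
apiVersion: v1
kind: ConfigMap
metadata:
  name: cassandra-jmx-config   # assumed name
  namespace: cassandra         # set to the namespace where Cassandra is deployed
data:
  config.yml: |
    # jmx_exporter options: normalize metric and label names to lowercase
    lowercaseOutputName: true
    lowercaseOutputLabelNames: true
    rules:
      # Attributes that match no rule are not collected, which keeps
      # the number of generated time series low.
      - pattern: 'org.apache.cassandra.metrics<type=Compaction, name=PendingTasks><>Value'
        name: cassandra_compaction_pending_tasks
        type: GAUGE
```

Listing explicit rules rather than exporting every MBean is one plausible way to reach a series count in the hundreds per instance instead of thousands.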

resources/cassandra/README.md

Lines changed: 9 additions & 10 deletions
@@ -5,22 +5,21 @@
 * Request_latency
 * Unavailable_exceptions_total
 * Timeouts_total
-* Jvm_memory_pool_used
-* Jvm_memory_pool_maximum
+* Jvm_memory_used
+* Jvm_memory_maximum
 * Pending_compactions
-* Pending_tasks
-* Disk_space
+* CommitLog_tasks
 * Storage_total
-* Storage_bytes
+* Storage_exceptions
 * Native_connections
-* Load_average
-* Free_memory
-* Memory_total
+* Dropped_messages
+* Keyspace_size
+* Table_size

 # Number of time series generated
-The metrics for each instance are around 7k.
+The metrics for each instance are around 850.

 # Attributions
-Using the [Cassandra exporter](https://github.com/instaclustr/cassandra-exporter) with license Apache 2.0
+Using the [jmx exporter](https://github.com/sysdiglabs/jmx_exporter) with license Apache 2.0

 The configuration files, dashboards, and alerts are maintained by [Sysdig team](https://sysdig.com/).

resources/cassandra/alerts.yaml

Lines changed: 56 additions & 44 deletions
@@ -11,47 +11,59 @@ configurations:
 groups:
 - name: Cassandra
   rules:
-  - alert: HighCpuUtilizationRate
-    expr: avg_over_time(cassandra_table_estimated_pending_compactions:cassandra)[30m] > 100
-    for: 5m
-    labels:
-      severity: warning
-    annotations:
-      summary: "Cassandra compaction task pending (instance {{ $labels.instance }})"
-      description: "Many Cassandra compaction tasks are pending. You might need to increase I/O capacity by adding nodes to the cluster.\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"
-
-  - alert: CassandraWriteLatency
-    expr: cassandra_client_request_latency_seconds:cassandra{quantile="0.95", operation="write"} > 100000
-    for: 5m
-    labels:
-      severity: warning
-    annotations:
-      summary: "Cassandra write latency (instance {{ $labels.instance }})"
-      description: "High write latency on {{ $labels.instance }} cassandra node\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"
-
-  - alert: CassandraReadLatency
-    expr: cassandra_client_request_latency_seconds:cassandra{quantile="0.95", operation="read"} > 100000
-    for: 5m
-    labels:
-      severity: warning
-    annotations:
-      summary: "Cassandra read latency (instance {{ $labels.instance }})"
-      description: "High read latency on {{ $labels.instance }} cassandra node\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"
-
-  - alert: CassandraCommitlogPendingTasks
-    expr: cassandra_commit_log_pending_tasks:cassandra > 15
-    for: 5m
-    labels:
-      severity: warning
-    annotations:
-      summary: "Cassandra commitlog pending tasks (instance {{ $labels.instance }})"
-      description: "Unexpected number of Cassandra commitlog pending tasks\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"
-
-  - alert: CassandraConnectionTimeoutsTotal
-    expr: rate(cassandra_client_request_timeouts_total:cassandra[1m]) > 5
-    for: 5m
-    labels:
-      severity: critical
-    annotations:
-      summary: "Cassandra connection timeouts total (instance {{ $labels.instance }})"
-      description: "Some connection between nodes are ending in timeout\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"
+  - alert: '[Cassandra] Compaction Task Pending'
+    expr: |
+      sum (cassandra_compaction_pending_tasks)> 20
+    for: 5m
+    labels:
+      severity: critical
+    annotations:
+      description: There are many Cassandra compaction tasks pending.
+  - alert: '[Cassandra] Commitlog Pending Tasks'
+    expr: |
+      sum (cassandra_commitlog_pending_tasks)> 20
+    for: 5m
+    labels:
+      severity: warning
+    annotations:
+      description: There are many Cassandra Commitlog tasks pending.
+  - alert: '[Cassandra] Compaction Executor Blocked Tasks'
+    expr: |
+      sum (rate(cassandra_threadpool_blocked_tasks_total{pool="CompactionExecutor"}[2m]))> 20
+    for: 2m
+    labels:
+      severity: warning
+    annotations:
+      description: There are many Cassandra compaction executor blocked tasks.
+  - alert: '[Cassandra] Flush Writer Blocked Tasks'
+    expr: |
+      sum (rate(cassandra_threadpool_blocked_tasks_total{pool="MemFlushWriter"}[5m]))> 20
+    for: 5m
+    labels:
+      severity: warning
+    annotations:
+      description: There are many Cassandra flush writer blocked tasks.
+  - alert: '[Cassandra] Storage Exceptions'
+    expr: |
+      sum (cassandra_storage_internal_exceptions_total)> 1
+    for: 2m
+    labels:
+      severity: critical
+    annotations:
+      description: There are storage exceptions in Cassandra node.
+  - alert: '[Cassandra] High Tombstones Scanned'
+    expr: |
+      sum (cassandra__tombstoned_scanned)> 1000
+    for: 5m
+    labels:
+      severity: critical
+    annotations:
+      description: There is a high number of tombstones scanned.
+  - alert: '[Cassandra] JVM Heap Memory'
+    expr: |
+      sum (cassandra_jvm_memory_usage_used_bytes{area="Heap"})/ sum (cassandra_jvm_memory_usage_max_bytes{area="Heap"})> 0.90
+    for: 5m
+    labels:
+      severity: critical
+    annotations:
+      description: High JVM Heap Memory.
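
The new rules aggregate with a bare `sum (...)`, so each expression yields a single series across every scraped Cassandra node and the alert cannot say which node is affected. A minimal sketch of a per-node variant of the heap rule follows, assuming the metrics carry the usual Prometheus `instance` label. The grouping and the extended description are illustrative and not part of this commit.

```yaml
# Sketch only: per-node variant of the heap rule; assumes an `instance` label exists.
groups:
- name: Cassandra
  rules:
  - alert: '[Cassandra] JVM Heap Memory'
    expr: |
      sum by (instance) (cassandra_jvm_memory_usage_used_bytes{area="Heap"})
        / sum by (instance) (cassandra_jvm_memory_usage_max_bytes{area="Heap"}) > 0.90
    for: 5m
    labels:
      severity: critical
    annotations:
      # The instance label survives the aggregation, so it can be templated here
      description: High JVM Heap Memory on {{ $labels.instance }}.
```

Grouping both sides of the division by the same label keeps the vector match one-to-one, so the ratio is computed per node rather than cluster-wide.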

resources/cassandra/dashboards.yaml

Lines changed: 7 additions & 29 deletions
@@ -10,40 +10,18 @@ configurations:
   image: 'cassandra/images/cassandra-sysdig.png'
   description: |
     This dashboard offers information on:
-    * Nodes up
+    * Dropped Messages
     * ClientRead avg duration
     * ClientWrite avg duration
     * JVM Heap usage
     * Pending compactions
-    * Pending tasks
+    * Commitlog tasks
+    * Storage exceptions
     * Unavailable exceptions
-    * Available disk
     * 95thPercentile read latency
     * 95thPercentile write latency
-    * Avaiable disk by instance
     * Users connected
-    * CPU load ratio
-    * JVM usage
-    * Cassandra
-  file: include/dashboard-Sysdig--1.0.0.json
-- name: 'Cassandra overview'
-  kind: Grafana
-  image: 'cassandra/images/cassandra-grafana.png'
-  description: |
-    This dashboard offers information on:
-    * Nodes up
-    * ClientRead avg duration
-    * ClientWrite avg duration
-    * JVM Heap usage
-    * Pending compactions
-    * Pending tasks
-    * Unavailable exceptions
-    * Available disk
-    * 95thPercentile read latency
-    * 95thPercentile write latency
-    * Avaiable disk by instance
-    * Users conected
-    * CPU load ratio
-    * JVM usage
-    * Cassandra
-  file: include/dashboard-Grafana--1.0.0.json
+    * Tombstones scanned
+    * CQL statements
+    * Keyspace and tables
+  file: include/dashboard-Sysdig--1.0.0.json
