This is an Icinga-compatible plugin for monitoring Ceph; it should work with Nagios-based products as well. Overall it works well. I've noticed a couple of times where a result came back unknown or warning when it shouldn't have, but it shortly corrected itself; I'm still working out which cluster/osd/pg states cause that.
It currently supports the following checks:
- OSD status. The warning and critical thresholds are the number of OSDs that are not up and in.
- General health status, i.e. the HEALTH_OK/HEALTH_WARN you see with ceph status.
- PG status. It doesn't take warning or critical thresholds yet, but it returns warning if any PGs are not active+clean.
- Performance metrics. No warning or critical on this either; it gathers perfdata for graphing and currently captures IOPS and read/write bytes/sec. Most other checks report perfdata as well; these are simply metrics that can't be gathered through the other commands.
- Disk space. It alerts on either a single pool or all pools: with --pool it checks only that pool, without --pool it checks all of them. Warning and critical are the GB/TB free in the pool, based on the MAX AVAIL field from ceph df, so this is not raw free space but free space adjusted for the pool's replication or erasure-coding settings. This check also captures disk space metrics for all pools and global free space (see the example after this list).
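For example, to alert on a single pool in gigabytes (the pool name and threshold values here are only placeholders), an invocation along these lines should go warning when that pool's MAX AVAIL drops below 500 GB and critical below 250 GB:

./check_ceph.py -C /etc/icinga2/ceph/ceph.conf --id icinga -k /etc/icinga2/ceph/ceph.client.icinga.keyring --df --pool rbd -b G -w 500 -c 250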
You need to create a Ceph user and keyring for the plugin. The user only needs the mon 'allow r' capability (caps: [mon] allow r); name it whatever you like.
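For example, something like the following should create a suitable user and keyring (here using the client name icinga and the keyring path from the examples below; adjust both as you like):

ceph auth get-or-create client.icinga mon 'allow r' -o /etc/icinga2/ceph/ceph.client.icinga.keyring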
I used this CheckCommand object in my Icinga configuration:
object CheckCommand "check_ceph" {
  import "plugin-check-command"
  command = [PluginDir + "/check_ceph.py"]
  timeout = 20
  arguments = {
    "--conf" = "$conf$"
    "--id" = "$id$"
    "--keyring" = "$keyring$"
    "--health" = {
      set_if = "$health$"
    }
    "--osd" = {
      set_if = "$osd$"
    }
    "--pg" = {
      set_if = "$pg$"
    }
    "--df" = {
      set_if = "$df$"
    }
    "--perf" = {
      set_if = "$perf$"
    }
    "--pool" = "$pool$"
    "--byte" = "$byte$"
    "--warning" = "$warning$"
    "--critical" = "$critical$"
  }
}

The services are applied like this:
apply Service "ceph_health" {
  import "generic-5m-service"
  display_name = "Ceph Health"
  check_command = "check_ceph"
  vars.conf = "/etc/icinga2/ceph/ceph.conf"
  vars.id = "icinga"
  vars.keyring = "/etc/icinga2/ceph/ceph.client.icinga.keyring"
  vars.health = true
  vars.notification.mute = true
  assign where host.vars.type.contains("ceph-mon")
}

The plugin's help output:

usage: check_ceph.py [-h] [-C CONF] -id ID [-k KEYRING] [--health] [-o]
                     [-m MON] [-p] [--perf] [--df] [-b BYTE] [--pool POOL]
                     [--objects OBJECTS] [-w WARNING] [-c CRITICAL]

Runs health checks against a ceph cluster. This is designed to run on the
monitoring server using the ceph client software. Supply a ceph.conf, keyring,
and user to access the cluster.
optional arguments:
-h, --help show this help message and exit
-C CONF, --conf CONF ceph.conf file, defaults to /etc/ceph/ceph.conf.
-id ID, --id ID Ceph cephx user
-k KEYRING, --keyring KEYRING
Path to ceph keyring if not in
/etc/ceph/client.$id.keyring
--health Get general health status. ex. HEALTH_OK, HEALTH_WARN
-o, --osd OSD status. Thresholds are in number of OSDs missing
-m MON, --mon MON MON status. Thresholds are in number of mons missing
-p, --pg PG status. No thresholds due to the large number of pg
states.
--perf collects additional ceph performance statistics
--df Disk/cluster usage. Reports global and all pools
unless --pool is used. Warning and critical are number
of -b free to the pools. This is not Raw Free, but Max
Avail to the pools based on rep or k,m settings. If
you do not define a pool the threshold is run against
all the pools in the cluster.
-b BYTE, --byte BYTE Format to use for displaying DF data. G=Gigabyte,
T=Terabyte. Use with the --df option. Defaults to TB
--pool POOL Pool. Use with df
--objects OBJECTS Object counts based on pool
-w WARNING, --warning WARNING
Warning threshold. See specific checks for value types
-c CRITICAL, --critical CRITICAL
Critical threshold. See specific checks for value
types.

Example runs:

./check_ceph.py -C /etc/icinga2/ceph/ceph.conf --id icinga -k /etc/icinga2/ceph/ceph.client.icinga.keyring --osd -w 2 -c 3
ALL OSDs are up and in. 264 OSDS. 264 up, 264 in|num_osds=264 num_up_osds=264 num_in_osds=264
./check_ceph.py -C /etc/icinga2/ceph/ceph.conf --id icinga -k /etc/icinga2/ceph/ceph.client.icinga.keyring --health
HEALTH_OK
./check_ceph.py -C /etc/icinga2/ceph/ceph.conf --id icinga -k /etc/icinga2/ceph/ceph.client.icinga.keyring --pg
All PGs are active+clean: 20480 PGs Total, active+clean=20480 |active+clean=20480
./check_ceph.py -C /etc/icinga2/ceph/ceph.conf --id icinga -k /etc/icinga2/ceph/ceph.client.icinga.keyring --df -w 100 -c 50
Healthy: All ceph pools are within free space thresholds|global_total_bytes=1699TB global_used_bytes=1179TB global_avail_bytes=520TB dev_bytes_used=756TB dev_max_avail=270TB dev_objects=6995252 ops_bytes_used=183TB ops_max_avail=270TB ops_objects=2817297
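If you want per-check thresholds driven from Icinga, a second service definition along these lines should work with the CheckCommand above (a sketch only; the service name and threshold values are placeholders):

apply Service "ceph_osd" {
  import "generic-5m-service"
  display_name = "Ceph OSDs"
  check_command = "check_ceph"
  vars.conf = "/etc/icinga2/ceph/ceph.conf"
  vars.id = "icinga"
  vars.keyring = "/etc/icinga2/ceph/ceph.client.icinga.keyring"
  vars.osd = true
  vars.warning = 2
  vars.critical = 3
  assign where host.vars.type.contains("ceph-mon")
}

With the CheckCommand above this renders --osd --warning 2 --critical 3, equivalent to the -w 2 -c 3 example earlier: warn when 2 OSDs are not up and in, go critical at 3.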