
Commit a4452f6

Merge pull request ceph#54742 from guits/node-proxy
orch: implement hardware monitoring

Reviewed-by: Juan Miguel Olmo Martínez <[email protected]>

2 parents 5952230 + b7c0a6a


43 files changed (+3863, -53 lines)

ceph.spec.in

Lines changed: 16 additions & 0 deletions
@@ -1305,6 +1305,15 @@ Group: System/Monitoring
 %description mib
 This package provides a Ceph MIB for SNMP traps.

+%package node-proxy
+Summary: hw monitoring agent for Ceph
+BuildArch: noarch
+%if 0%{?suse_version}
+Group: System/Monitoring
+%endif
+%description node-proxy
+This package provides a Ceph hardware monitoring agent.
+
 #################################################################################
 # common
 #################################################################################
@@ -2647,4 +2656,11 @@ exit 0
 %attr(0755,root,root) %dir %{_datadir}/snmp
 %{_datadir}/snmp/mibs

+%files node-proxy
+%{_sbindir}/ceph-node-proxy
+%dir %{python3_sitelib}/ceph_node_proxy
+%{python3_sitelib}/ceph_node_proxy/*
+%{python3_sitelib}/ceph_node_proxy-*
+#%{_mandir}/man8/ceph-node-proxy.8*
+
 %changelog
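
The `%package node-proxy` stanza above produces a `ceph-node-proxy` subpackage (standard RPM naming: the main package name plus the subpackage suffix). A quick way to check that the `%files` list ships what you expect, sketched under the assumption that the package has been built (the exact file name depends on version, release and architecture):

    # Inspect the contents of the built subpackage (illustrative file name):
    rpm -qlp ceph-node-proxy-*.noarch.rpm

    # Or, on a host where the subpackage is already installed:
    rpm -ql ceph-node-proxy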

doc/hardware-monitoring/index.rst

Lines changed: 183 additions & 0 deletions
@@ -0,0 +1,183 @@
+.. _hardware-monitoring:
+
+Hardware monitoring
+===================
+
+`node-proxy` is the internal name of the agent that inventories a machine's hardware, reports the various component statuses, and enables the operator to perform some actions.
+It gathers details from the Redfish API, then processes the data and pushes it to the agent endpoint in the Ceph manager daemon.
+
+.. graphviz::
+
+   digraph G {
+       node [shape=record];
+       mgr [label="{<mgr> ceph manager}"];
+       dashboard [label="<dashboard> ceph dashboard"];
+       agent [label="<agent> agent"];
+       redfish [label="<redfish> redfish"];
+
+       agent -> redfish [label=" 1." color=green];
+       agent -> mgr [label=" 2." color=orange];
+       dashboard:dashboard -> mgr [label=" 3." color=lightgreen];
+       node [shape=plaintext];
+       legend [label=<<table border="0" cellborder="1" cellspacing="0">
+           <tr><td bgcolor="lightgrey">Legend</td></tr>
+           <tr><td align="left">1. Collects data from the Redfish API</td></tr>
+           <tr><td align="left">2. Pushes data to the ceph mgr</td></tr>
+           <tr><td align="left">3. Queries the ceph mgr</td></tr>
+       </table>>];
+   }
+
+
+Limitations
+-----------
+
+For the time being, the `node-proxy` agent relies on the Redfish API.
+This implies that both the `node-proxy` agent and the `ceph-mgr` daemon must be able to reach the Out-Of-Band network in order to work.
+
+
+Deploying the agent
+-------------------
+
+| The first step is to provide the out-of-band management tool credentials.
+| This can be done when adding the host, using a service spec file:
+
+.. code-block:: bash
+
+   # cat host.yml
+   ---
+   service_type: host
+   hostname: node-10
+   addr: 10.10.10.10
+   oob:
+     addr: 20.20.20.10
+     username: admin
+     password: p@ssword
+
+Apply the spec:
+
+.. code-block:: bash
+
+   # ceph orch apply -i host.yml
+   Added host 'node-10' with addr '10.10.10.10'
+
+Deploy the agent:
+
+.. code-block:: bash
+
+   # ceph config set mgr mgr/cephadm/hw_monitoring true
+
+CLI
+---
+
+| **orch** **hardware** **status** [hostname] [--category CATEGORY] [--format plain | json]
+
+The supported categories are:
+
+* summary (default)
+* memory
+* storage
+* processors
+* network
+* power
+* fans
+* firmwares
+* criticals
+
+Examples
+********
+
+
+hardware health statuses summary
+++++++++++++++++++++++++++++++++
+
+.. code-block:: bash
+
+   # ceph orch hardware status
+   +------------+---------+-----+-----+--------+-------+------+
+   | HOST       | STORAGE | CPU | NET | MEMORY | POWER | FANS |
+   +------------+---------+-----+-----+--------+-------+------+
+   | node-10    | ok      | ok  | ok  | ok     | ok    | ok   |
+   +------------+---------+-----+-----+--------+-------+------+
+
+
+storage devices report
+++++++++++++++++++++++
+
+.. code-block:: bash
+
+   # ceph orch hardware status node-10 --category storage
+   +------------+--------------------------------------------------------+------------------+----------------+----------+----------------+--------+---------+
+   | HOST       | NAME                                                   | MODEL            | SIZE           | PROTOCOL | SN             | STATUS | STATE   |
+   +------------+--------------------------------------------------------+------------------+----------------+----------+----------------+--------+---------+
+   | node-10    | Disk 8 in Backplane 1 of Storage Controller in Slot 2  | ST20000NM008D-3D | 20000588955136 | SATA     | ZVT99QLL       | OK     | Enabled |
+   | node-10    | Disk 10 in Backplane 1 of Storage Controller in Slot 2 | ST20000NM008D-3D | 20000588955136 | SATA     | ZVT98ZYX       | OK     | Enabled |
+   | node-10    | Disk 11 in Backplane 1 of Storage Controller in Slot 2 | ST20000NM008D-3D | 20000588955136 | SATA     | ZVT98ZWB       | OK     | Enabled |
+   | node-10    | Disk 9 in Backplane 1 of Storage Controller in Slot 2  | ST20000NM008D-3D | 20000588955136 | SATA     | ZVT98ZC9       | OK     | Enabled |
+   | node-10    | Disk 3 in Backplane 1 of Storage Controller in Slot 2  | ST20000NM008D-3D | 20000588955136 | SATA     | ZVT9903Y       | OK     | Enabled |
+   | node-10    | Disk 1 in Backplane 1 of Storage Controller in Slot 2  | ST20000NM008D-3D | 20000588955136 | SATA     | ZVT9901E       | OK     | Enabled |
+   | node-10    | Disk 7 in Backplane 1 of Storage Controller in Slot 2  | ST20000NM008D-3D | 20000588955136 | SATA     | ZVT98ZQJ       | OK     | Enabled |
+   | node-10    | Disk 2 in Backplane 1 of Storage Controller in Slot 2  | ST20000NM008D-3D | 20000588955136 | SATA     | ZVT99PA2       | OK     | Enabled |
+   | node-10    | Disk 4 in Backplane 1 of Storage Controller in Slot 2  | ST20000NM008D-3D | 20000588955136 | SATA     | ZVT99PFG       | OK     | Enabled |
+   | node-10    | Disk 0 in Backplane 0 of Storage Controller in Slot 2  | MZ7L33T8HBNAAD3  | 3840755981824  | SATA     | S6M5NE0T800539 | OK     | Enabled |
+   | node-10    | Disk 1 in Backplane 0 of Storage Controller in Slot 2  | MZ7L33T8HBNAAD3  | 3840755981824  | SATA     | S6M5NE0T800554 | OK     | Enabled |
+   | node-10    | Disk 6 in Backplane 1 of Storage Controller in Slot 2  | ST20000NM008D-3D | 20000588955136 | SATA     | ZVT98ZER       | OK     | Enabled |
+   | node-10    | Disk 0 in Backplane 1 of Storage Controller in Slot 2  | ST20000NM008D-3D | 20000588955136 | SATA     | ZVT98ZEJ       | OK     | Enabled |
+   | node-10    | Disk 5 in Backplane 1 of Storage Controller in Slot 2  | ST20000NM008D-3D | 20000588955136 | SATA     | ZVT99QMH       | OK     | Enabled |
+   | node-10    | Disk 0 on AHCI Controller in SL 6                      | MTFDDAV240TDU    | 240057409536   | SATA     | 22373BB1E0F8   | OK     | Enabled |
+   | node-10    | Disk 1 on AHCI Controller in SL 6                      | MTFDDAV240TDU    | 240057409536   | SATA     | 22373BB1E0D5   | OK     | Enabled |
+   +------------+--------------------------------------------------------+------------------+----------------+----------+----------------+--------+---------+
+
+
+firmwares details
++++++++++++++++++
+
+.. code-block:: bash
+
+   # ceph orch hardware status node-10 --category firmwares
+   +------------+----------------------------------------------------------------------------+--------------------------------------------------------------+----------------------+-------------+--------+
+   | HOST       | COMPONENT                                                                  | NAME                                                         | DATE                 | VERSION     | STATUS |
+   +------------+----------------------------------------------------------------------------+--------------------------------------------------------------+----------------------+-------------+--------+
+   | node-10    | current-107649-7.03__raid.backplane.firmware.0                             | Backplane 0                                                  | 2022-12-05T00:00:00Z | 7.03        | OK     |
+
+   ... omitted output ...
+
+   | node-10    | previous-25227-6.10.30.20__idrac.embedded.1-1                              | Integrated Remote Access Controller                          | 00:00:00Z            | 6.10.30.20  | OK     |
+   +------------+----------------------------------------------------------------------------+--------------------------------------------------------------+----------------------+-------------+--------+
+
+
+hardware critical warnings report
++++++++++++++++++++++++++++++++++
+
+.. code-block:: bash
+
+   # ceph orch hardware status --category criticals
+   +------------+-----------+------------+----------+-----------------+
+   | HOST       | COMPONENT | NAME       | STATUS   | STATE           |
+   +------------+-----------+------------+----------+-----------------+
+   | node-10    | power     | PS2 Status | critical | unplugged       |
+   +------------+-----------+------------+----------+-----------------+
+
+
+Developers
+----------
+
+.. py:currentmodule:: cephadm.agent
+.. autoclass:: NodeProxyEndpoint
+.. automethod:: NodeProxyEndpoint.__init__
+.. automethod:: NodeProxyEndpoint.oob
+.. automethod:: NodeProxyEndpoint.data
+.. automethod:: NodeProxyEndpoint.fullreport
+.. automethod:: NodeProxyEndpoint.summary
+.. automethod:: NodeProxyEndpoint.criticals
+.. automethod:: NodeProxyEndpoint.memory
+.. automethod:: NodeProxyEndpoint.storage
+.. automethod:: NodeProxyEndpoint.network
+.. automethod:: NodeProxyEndpoint.power
+.. automethod:: NodeProxyEndpoint.processors
+.. automethod:: NodeProxyEndpoint.fans
+.. automethod:: NodeProxyEndpoint.firmwares
+.. automethod:: NodeProxyEndpoint.led
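
Step 1 in the diagram above is the agent collecting inventory from the Redfish API. As a quick way to see the kind of data the agent consumes, the standard Redfish service root can be queried directly over the Out-Of-Band network. A minimal sketch using curl, reusing the illustrative OOB address and credentials from host.yml (resource paths below the service root vary by vendor):

    # List the systems exposed by the Redfish service (DMTF standard endpoint).
    curl -sk -u admin:p@ssword https://20.20.20.10/redfish/v1/Systems

    # Fetch a single system resource for inventory details such as processors,
    # memory and storage; the member path shown here is vendor-specific.
    curl -sk -u admin:p@ssword https://20.20.20.10/redfish/v1/Systems/System.Embedded.1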

doc/index.rst

Lines changed: 1 addition & 0 deletions
@@ -121,5 +121,6 @@ about Ceph, see our `Architecture`_ section.
    releases/general
    releases/index
    security/index
+   hardware-monitoring/index
    Glossary <glossary>
    Tracing <jaegertracing/index>

doc/monitoring/index.rst

Lines changed: 3 additions & 0 deletions
@@ -470,5 +470,8 @@ Useful queries
    rate(ceph_rbd_read_latency_sum[30s]) / rate(ceph_rbd_read_latency_count[30s]) * on (instance) group_left (ceph_daemon) ceph_rgw_metadata


+Hardware monitoring
+===================

+See :ref:`hardware-monitoring`

monitoring/ceph-mixin/prometheus_alerts.libsonnet

Lines changed: 65 additions & 0 deletions
@@ -689,6 +689,71 @@
       },
     ],
   },
+  {
+    name: 'hardware',
+    rules: [
+      {
+        alert: 'HardwareStorageError',
+        'for': '30s',
+        expr: 'ceph_health_detail{name="HARDWARE_STORAGE"} > 0',
+        labels: { severity: 'critical', type: 'ceph_default', oid: '1.3.6.1.4.1.50495.1.2.1.13.1' },
+        annotations: {
+          summary: 'Storage devices error(s) detected%(cluster)s' % $.MultiClusterSummary(),
+          description: 'Some storage devices are in error. Check `ceph health detail`.',
+        },
+      },
+      {
+        alert: 'HardwareMemoryError',
+        'for': '30s',
+        expr: 'ceph_health_detail{name="HARDWARE_MEMORY"} > 0',
+        labels: { severity: 'critical', type: 'ceph_default', oid: '1.3.6.1.4.1.50495.1.2.1.13.2' },
+        annotations: {
+          summary: 'DIMM error(s) detected%(cluster)s' % $.MultiClusterSummary(),
+          description: 'DIMM error(s) detected. Check `ceph health detail`.',
+        },
+      },
+      {
+        alert: 'HardwareProcessorError',
+        'for': '30s',
+        expr: 'ceph_health_detail{name="HARDWARE_PROCESSOR"} > 0',
+        labels: { severity: 'critical', type: 'ceph_default', oid: '1.3.6.1.4.1.50495.1.2.1.13.3' },
+        annotations: {
+          summary: 'Processor error(s) detected%(cluster)s' % $.MultiClusterSummary(),
+          description: 'Processor error(s) detected. Check `ceph health detail`.',
+        },
+      },
+      {
+        alert: 'HardwareNetworkError',
+        'for': '30s',
+        expr: 'ceph_health_detail{name="HARDWARE_NETWORK"} > 0',
+        labels: { severity: 'critical', type: 'ceph_default', oid: '1.3.6.1.4.1.50495.1.2.1.13.4' },
+        annotations: {
+          summary: 'Network error(s) detected%(cluster)s' % $.MultiClusterSummary(),
+          description: 'Network error(s) detected. Check `ceph health detail`.',
+        },
+      },
+      {
+        alert: 'HardwarePowerError',
+        'for': '30s',
+        expr: 'ceph_health_detail{name="HARDWARE_POWER"} > 0',
+        labels: { severity: 'critical', type: 'ceph_default', oid: '1.3.6.1.4.1.50495.1.2.1.13.5' },
+        annotations: {
+          summary: 'Power supply error(s) detected%(cluster)s' % $.MultiClusterSummary(),
+          description: 'Power supply error(s) detected. Check `ceph health detail`.',
+        },
+      },
+      {
+        alert: 'HardwareFanError',
+        'for': '30s',
+        expr: 'ceph_health_detail{name="HARDWARE_FANS"} > 0',
+        labels: { severity: 'critical', type: 'ceph_default', oid: '1.3.6.1.4.1.50495.1.2.1.13.6' },
+        annotations: {
+          summary: 'Fan error(s) detected%(cluster)s' % $.MultiClusterSummary(),
+          description: 'Fan error(s) detected. Check `ceph health detail`.',
+        },
+      },
+    ],
+  },
   {
     name: 'PrometheusServer',
     rules: [
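
Each new alert carries a unique SNMP OID under the 1.3.6.1.4.1.50495.1.2.1.13 subtree. A small sanity check, sketched assuming a POSIX shell and the repository layout shown here, that no OID in the rendered rules file is duplicated:

    # Print any OID that appears more than once in the generated rules file.
    grep -o '1\.3\.6\.1\.4\.1\.50495[0-9.]*' monitoring/ceph-mixin/prometheus_alerts.yml | sort | uniq -d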

monitoring/ceph-mixin/prometheus_alerts.yml

Lines changed: 62 additions & 0 deletions
@@ -614,6 +614,68 @@ groups:
         labels:
           severity: "warning"
           type: "ceph_default"
+  - name: "hardware"
+    rules:
+      - alert: "HardwareStorageError"
+        annotations:
+          description: "Some storage devices are in error. Check `ceph health detail`."
+          summary: "Storage devices error(s) detected"
+        expr: "ceph_health_detail{name=\"HARDWARE_STORAGE\"} > 0"
+        for: "30s"
+        labels:
+          oid: "1.3.6.1.4.1.50495.1.2.1.13.1"
+          severity: "critical"
+          type: "ceph_default"
+      - alert: "HardwareMemoryError"
+        annotations:
+          description: "DIMM error(s) detected. Check `ceph health detail`."
+          summary: "DIMM error(s) detected"
+        expr: "ceph_health_detail{name=\"HARDWARE_MEMORY\"} > 0"
+        for: "30s"
+        labels:
+          oid: "1.3.6.1.4.1.50495.1.2.1.13.2"
+          severity: "critical"
+          type: "ceph_default"
+      - alert: "HardwareProcessorError"
+        annotations:
+          description: "Processor error(s) detected. Check `ceph health detail`."
+          summary: "Processor error(s) detected"
+        expr: "ceph_health_detail{name=\"HARDWARE_PROCESSOR\"} > 0"
+        for: "30s"
+        labels:
+          oid: "1.3.6.1.4.1.50495.1.2.1.13.3"
+          severity: "critical"
+          type: "ceph_default"
+      - alert: "HardwareNetworkError"
+        annotations:
+          description: "Network error(s) detected. Check `ceph health detail`."
+          summary: "Network error(s) detected"
+        expr: "ceph_health_detail{name=\"HARDWARE_NETWORK\"} > 0"
+        for: "30s"
+        labels:
+          oid: "1.3.6.1.4.1.50495.1.2.1.13.4"
+          severity: "critical"
+          type: "ceph_default"
+      - alert: "HardwarePowerError"
+        annotations:
+          description: "Power supply error(s) detected. Check `ceph health detail`."
+          summary: "Power supply error(s) detected"
+        expr: "ceph_health_detail{name=\"HARDWARE_POWER\"} > 0"
+        for: "30s"
+        labels:
+          oid: "1.3.6.1.4.1.50495.1.2.1.13.5"
+          severity: "critical"
+          type: "ceph_default"
+      - alert: "HardwareFanError"
+        annotations:
+          description: "Fan error(s) detected. Check `ceph health detail`."
+          summary: "Fan error(s) detected"
+        expr: "ceph_health_detail{name=\"HARDWARE_FANS\"} > 0"
+        for: "30s"
+        labels:
+          oid: "1.3.6.1.4.1.50495.1.2.1.13.6"
+          severity: "critical"
+          type: "ceph_default"
   - name: "PrometheusServer"
     rules:
       - alert: "PrometheusJobMissing"
