Commit 2367b2c

Liquid Cooling leakage detection in SONiC
Signed-off-by: Yuanzhe, Liu <[email protected]>
1 parent f35ecb9 commit 2367b2c

2 files changed

+155
-0
lines changed

doc/bmc/leakage_detection_hld.md

# Liquid Cooling leakage detection in SONiC

## 1. Overview

Modern equipment generates more heat than traditional air cooling can dissipate effectively, so liquid cooling has become necessary to keep it operating properly. Because a coolant leak can be fatal to the hardware, the system needs a mechanism that monitors for leaks and raises an alert as soon as one occurs.

## 2. Requirements
1. Monitor the liquid cooling leakage detection sensors and raise an alarm when a leak is detected.
2. On platforms that do not support liquid cooling, the feature must add no performance overhead.

## 3. Detection and alarm flow
The leak alarm flow is straightforward. The platform API reads the status of the leak detection sensors. A dedicated thread in thermalctld calls that API and notifies the system health monitor, which in turn sends out a gNMI event.

![LCflow chart](https://github.com/sonic-net/SONiC/blob/73f11eb7ad058b214d745e9ef728b8319574edbe/images/bmc/leakage_detection_flow.png)

## 4. Platform API
A new object `LiquidCoolingBase` will be added to `sonic-platform-common` to represent the new liquid cooling device.

```
# Assumption: SensorBase is importable from sonic_platform_base.sensor_base
from sonic_platform_base.sensor_base import SensorBase


class LiquidCoolingBase(object):
    """Base class for a liquid cooling device"""

    def __init__(self):
        self.leakage_sensors_num = 0
        # List of LeakageSensor objects
        self.leakage_sensors = []

    def get_leak_sensor_num(self):
        """
        Retrieves the number of leakage sensors

        Returns:
            int: The number of leakage sensors
        """
        return self.leakage_sensors_num

    def get_leak_sensor_list(self):
        """
        Retrieves the list of leakage sensors

        Returns:
            list: A list of leakage sensor names
        """
        return [sensor.get_name() for sensor in self.leakage_sensors]

    def get_leak_sensor_status(self):
        """
        Retrieves the leak status of the sensors

        Returns:
            list: A list of leakage sensor names that are leaking, empty list if no leakage
        """
        leaking_sensors = []
        for sensor in self.leakage_sensors:
            if sensor.is_leak():
                leaking_sensors.append(sensor.get_name())
        return leaking_sensors


class LeakageSensor(SensorBase):
    """
    There may be multiple leakage detection sensors; each one carries a name
    so that the user can locate the leak.
    """

    def __init__(self, name=""):
        self.name = name
        self.leaking = False

    def get_name(self):
        """String return to get the sensor's name"""
        return self.name

    def is_leak(self):
        """Boolean return to indicate whether a leak is detected"""
        return self.leaking
```

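
As a rough illustration of how the three getters behave, the sketch below re-declares stripped-down stand-ins for the two classes so that it runs without `sonic-platform-common`. The names `LeakageSensorSketch` and `LiquidCoolingSketch` are hypothetical and exist only for this example; a real platform would subclass the base classes instead.

```python
class LeakageSensorSketch:
    """Hypothetical stand-in for LeakageSensor, for illustration only."""

    def __init__(self, name, leaking=False):
        self.name = name
        self.leaking = leaking

    def get_name(self):
        return self.name

    def is_leak(self):
        return self.leaking


class LiquidCoolingSketch:
    """Hypothetical stand-in for LiquidCoolingBase with the same getters."""

    def __init__(self, sensors):
        self.leakage_sensors = list(sensors)
        self.leakage_sensors_num = len(self.leakage_sensors)

    def get_leak_sensor_num(self):
        return self.leakage_sensors_num

    def get_leak_sensor_list(self):
        return [s.get_name() for s in self.leakage_sensors]

    def get_leak_sensor_status(self):
        # Names of the sensors that currently report a leak
        return [s.get_name() for s in self.leakage_sensors if s.is_leak()]


device = LiquidCoolingSketch([
    LeakageSensorSketch("leak_sensors1"),
    LeakageSensorSketch("leak_sensors2", leaking=True),
])
print(device.get_leak_sensor_num())     # 2
print(device.get_leak_sensor_status())  # ['leak_sensors2']
```
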
## 5. Thermal control daemon
A new object `LiquidCoolingUpdater`, dedicated to monitoring the liquid cooling device status, will be added to the thermal control daemon.

During initialization, a separate thread is launched to call the `get_leak_sensor_status` API periodically. This thread is kept apart from the main thermal control loop because that loop runs with a period of 1 minute, which is reasonable for normal thermal device updates but too slow for leak events, which are critical and must be reported as soon as possible. Leak events are also abnormal and rare, so folding them into the main loop and shortening its period to 1 second would be overkill and would hurt performance.

New configuration will be added to `pmon_daemon_control.json` to indicate whether the system has a liquid cooling system. If it does not, neither the object nor the thread is created during thermalctld initialization, which avoids any performance overhead.
```
# enable the separate thread for liquid cooling monitoring
enable_liquid_cooling: true,
# interval in seconds between leakage status updates, default 0.5
liquid_cooling_update_interval: 0.5
```

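
A minimal sketch of how the daemon might read these two settings is shown below. The function name `load_liquid_cooling_config` is hypothetical, and treating a missing `enable_liquid_cooling` key as "disabled" is an assumption made here so that platforms without liquid cooling need no configuration change.

```python
import json

DEFAULT_UPDATE_INTERVAL = 0.5  # seconds, the default stated above


def load_liquid_cooling_config(text):
    """Return (enabled, interval) parsed from a JSON configuration string."""
    config = json.loads(text)
    # Assumption: a missing key means the platform has no liquid cooling
    enabled = bool(config.get("enable_liquid_cooling", False))
    interval = float(config.get("liquid_cooling_update_interval",
                                DEFAULT_UPDATE_INTERVAL))
    if interval <= 0:
        raise ValueError("liquid_cooling_update_interval must be positive")
    return enabled, interval


print(load_liquid_cooling_config('{"enable_liquid_cooling": true}'))
print(load_liquid_cooling_config('{}'))
```
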
Once a leakage event is detected, the thread writes it to the state DB to notify the system health monitor. At the same time, the following error message is written to syslog:
"Liquid cooling leakage has been detected on sensor {}"

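```
class LiquidCoolingUpdater(object):

    def __init__(self, chassis):
        self.chassis = chassis

    def update(self):
        self._refresh_leak_status_update()
        self._refresh_other_status_update()

    def _refresh_leak_status_update(self):
        liquid_cooling_object = self.chassis.get_liquid_cooling_device()
        leaking_sensors = liquid_cooling_object.get_leak_sensor_status()
        if leaking_sensors is not None:
            # Update the state DB accordingly
            pass

    def _refresh_other_status_update(self):
        # Reserved for other liquid cooling status updates
        pass
```
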
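
The separate polling thread described in this section could be sketched roughly as follows. `LeakPoller` and `start_liquid_cooling_monitor` are hypothetical names, not part of thermalctld; the sketch only shows one way to poll at a fixed interval, stop promptly, and create nothing at all when liquid cooling is disabled.

```python
import threading


class LeakPoller(threading.Thread):
    """Hypothetical helper: calls `poll` every `interval` seconds until stopped."""

    def __init__(self, poll, interval=0.5):
        super().__init__(daemon=True)
        self._poll = poll
        self._interval = interval
        self._stop_event = threading.Event()

    def run(self):
        # Event.wait() acts as an interruptible sleep: it returns True as
        # soon as stop() is called, which ends the loop.
        while not self._stop_event.wait(self._interval):
            self._poll()

    def stop(self):
        self._stop_event.set()
        self.join()


def start_liquid_cooling_monitor(enabled, poll, interval=0.5):
    """Create and start the poller only when liquid cooling is enabled."""
    if not enabled:
        return None  # no object and no thread, hence no overhead
    poller = LeakPoller(poll, interval)
    poller.start()
    return poller
```

In this sketch `poll` would be something like the updater's `update` method, and `interval` would come from `liquid_cooling_update_interval`.
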
### State DB data schema
The `LIQUID_COOLING_DEVICE` table stores all the data gathered by the thermal control daemon. Currently it contains only `leakage_sensors` entries.

```
; Defines a logical structure for liquid cooling devices, with keys for various sensors.

key      = LIQUID_COOLING_DEVICE|leakage_sensors{X}
; field  = value
name     = STR  ; sensor name
leaking  = STR  ; Yes or No to indicate leakage status
```

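
To make the schema concrete, the sketch below builds the key and field values for each sensor as plain Python dictionaries; it does not write to a real database. The helper name `build_leak_entries` is hypothetical, and reading `{X}` as a 1-based sensor index is an assumption.

```python
TABLE_NAME = "LIQUID_COOLING_DEVICE"


def build_leak_entries(sensor_names, leaking_names):
    """Return {key: fields} for every sensor, following the schema above."""
    leaking = set(leaking_names)
    entries = {}
    # Assumption: {X} in the key is a 1-based sensor index
    for index, name in enumerate(sensor_names, start=1):
        key = "{}|leakage_sensors{}".format(TABLE_NAME, index)
        entries[key] = {
            "name": name,
            "leaking": "Yes" if name in leaking else "No",
        }
    return entries


entries = build_leak_entries(["leak_sensors1", "leak_sensors2"], ["leak_sensors2"])
for key, fields in entries.items():
    print(key, fields)
```
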
## 6. System health monitor
A new function named `_check_liquid_cooling_status(self, config)` will be added to `hardware_checker.py` in the system health monitor. It monitors the leakage detection values in the state DB and sends out a gNMI event once a leak is detected.
Note that both transitions, from not leaking to leaking and from leaking to not leaking, trigger an event.

```
from swsscommon import swsscommon

def publish_events(self, leakage_sensor_list):
    params = swsscommon.FieldValueMap()
    # Publish one event per sensor in the list
    for leakage_sensor in leakage_sensor_list:
        swsscommon.event_publish(self.events_handle, EVENTS_PUBLISHER_TAG, params)
```

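
Since an event is triggered on both transitions, the checker needs to compare the current state with the previous one. A minimal sketch of that comparison, using a hypothetical helper `detect_leak_transitions` and assuming that a sensor seen for the first time starts out as "No", could look like this:

```python
def detect_leak_transitions(previous, current):
    """Return {sensor: new_state} for sensors whose "Yes"/"No" state changed."""
    changed = {}
    for name, state in current.items():
        # Assumption: a sensor not seen before is treated as not leaking
        if previous.get(name, "No") != state:
            changed[name] = state
    return changed


previous = {"leak_sensors1": "No", "leak_sensors2": "Yes", "leak_sensors3": "No"}
current = {"leak_sensors1": "Yes", "leak_sensors2": "No", "leak_sensors3": "No"}
print(detect_leak_transitions(previous, current))
```

Only the sensors returned by the helper would then be passed on for event publication, so an unchanged sensor does not generate repeated events.
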
### gNMI event model
`EVENTS_PUBLISHER_SOURCE = "sonic-events-host"`

`EVENTS_PUBLISHER_TAG = "liquid-cooling-leak"`

## 7. CLI management
A new command `show platform leakage status` will be added to show the current leak sensor status. The data comes from the state DB.
```
Name           Leak
-------------  ----
leak_sensors1  No
leak_sensors2  No
...
leak_sensorsX  Yes
```

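
The table above could be rendered from the state DB values along the following lines. This is only a plain-Python sketch with a hypothetical `format_leak_status` helper; the actual command may well use a table-formatting library instead.

```python
def format_leak_status(rows):
    """Render (name, leaking) pairs as a two-column text table."""
    name_width = max([len("Name")] + [len(name) for name, _ in rows])
    lines = [
        "{}  {}".format("Name".ljust(name_width), "Leak"),
        "{}  {}".format("-" * name_width, "-" * len("Leak")),
    ]
    for name, leaking in rows:
        lines.append("{}  {}".format(name.ljust(name_width), leaking))
    return "\n".join(lines)


print(format_leak_status([("leak_sensors1", "No"), ("leak_sensorsX", "Yes")]))
```
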
Because the new functionality is added to the system health monitor, the output of the related command `show system-health detail` will be updated as well.
```
System services and devices monitor list
Name                     Status    Type
-----------------------  --------  ----------
...
leak_sensors1            OK        LiquidCooling
leak_sensors2            OK        LiquidCooling
...
leak_sensors3            Not OK    LiquidCooling
```

## 8. Performance
A separate thread is launched in the thermal control daemon to keep monitoring the status of the entire liquid cooling device at a 0.5 s interval.

## 9. Testing
A mock test should be created to demonstrate the functionality of this implementation. After a leak event is simulated, the following need to be checked:
1. The correct sensor number is indicated in the syslog message.
2. The state DB is updated correctly.
3. The gNMI event has been sent out.
4. The `show platform leakage status` command output is correct.
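
A mock test for the first check might be sketched as below, using `unittest.mock` from the Python standard library to simulate a leaking sensor. `report_leaks` is a hypothetical stand-in for the updater's leak check, and the logger is passed in as a plain callable so that the sketch has no SONiC dependencies.

```python
from unittest import mock


def report_leaks(liquid_cooling_device, log_error):
    """Hypothetical stand-in for the updater's leak check."""
    leaking = liquid_cooling_device.get_leak_sensor_status()
    for name in leaking:
        log_error("Liquid cooling leakage has been detected on sensor {}".format(name))
    return leaking


def test_leak_is_logged():
    # Simulate a device on which one sensor reports a leak
    device = mock.Mock()
    device.get_leak_sensor_status.return_value = ["leak_sensors2"]
    log_error = mock.Mock()

    assert report_leaks(device, log_error) == ["leak_sensors2"]
    log_error.assert_called_once_with(
        "Liquid cooling leakage has been detected on sensor leak_sensors2")


def test_no_leak_is_silent():
    device = mock.Mock()
    device.get_leak_sensor_status.return_value = []
    log_error = mock.Mock()

    assert report_leaks(device, log_error) == []
    log_error.assert_not_called()


test_leak_is_logged()
test_no_leak_is_silent()
```
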