| 1 | +# Liquid Cooling leakage detection in SONiC |
| 2 | + |
| 3 | +## 1. Overview |
| 4 | + |
| 5 | +Modern equipment generates more heat than traditional air cooling can dissipate, so liquid cooling has become necessary to cool the equipment efficiently and keep it operating properly. Because a liquid cooling leak can be fatal to the equipment, a mechanism that monitors for leaks and alerts the system as soon as one occurs is crucial. |
| 6 | + |
| 7 | +## 2. Requirements |
| 8 | +1. Monitor the liquid cooling leakage detection sensors and raise an alarm when a leak is detected. |
| 9 | +2. For platforms that do not support liquid cooling at all, there should be no additional performance overhead. |
| 10 | + |
| 11 | +## 3. Detection and alarm flow |
| 12 | +The leak alarm process is straightforward. The platform API acquires the status of the leak detection sensors. A dedicated thread in thermalctld calls this API and notifies the system health monitor, which ultimately sends out a gNMI event. |
| 13 | + |
| 14 | + |
| 15 | + |
| 16 | +## 4. Platform API |
| 17 | +A new object, `LiquidCoolingBase`, will be added to `sonic-platform-common` to represent the new liquid cooling device. |
| 18 | + |
| 19 | +``` |
| 20 | +class LiquidCoolingBase(object): |
| 21 | +    leakage_sensors_num = 0 |
| 22 | +    leakage_sensors = []  # LeakageSensor objects |
| 23 | + |
| 24 | +    # Return the set of sensors that detect a leak; empty if no leak is happening. |
| 25 | +    def get_leak_sensor_status(self): |
| 26 | +        leaking_sensors = set() |
| 27 | +        for sensor in self.leakage_sensors: |
| 28 | +            if sensor.is_leak(): |
| 29 | +                leaking_sensors.add(sensor) |
| 30 | +        return leaking_sensors |
| 31 | + |
| 32 | +class LeakageSensor(sensor_base): |
| 33 | +    # There may be multiple leakage detection sensors; the name helps the user find the location. |
| 34 | +    name = "" |
| 35 | +    leaking = False |
| 36 | + |
| 37 | +    def get_name(self): |
| 38 | +        """Return the sensor's name as a string.""" |
| 39 | +        raise NotImplementedError |
| 40 | +    def is_leak(self): |
| 41 | +        """Return True if a leak is detected, False otherwise.""" |
| 42 | +        raise NotImplementedError |
| 43 | +``` |
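
To illustrate how a platform might build on these classes, here is a minimal, self-contained sketch. The `MockLeakageSensor` and `LiquidCooling` names and their constructor arguments are assumptions made for this example only, and the base class is a simplified stand-in so the snippet runs without `sonic-platform-common`. A real platform would read the leak state from hardware.

```python
class LeakageSensor:
    """Simplified stand-in for the proposed LeakageSensor base class."""

    def get_name(self):
        raise NotImplementedError

    def is_leak(self):
        raise NotImplementedError


class MockLeakageSensor(LeakageSensor):
    """Hypothetical sensor whose state is set directly, not read from hardware."""

    def __init__(self, name, leaking=False):
        self.name = name
        self.leaking = leaking

    def get_name(self):
        return self.name

    def is_leak(self):
        return self.leaking


class LiquidCooling:
    """Simplified stand-in for the proposed LiquidCoolingBase."""

    def __init__(self, sensors):
        self.leakage_sensors = list(sensors)
        self.leakage_sensors_num = len(self.leakage_sensors)

    def get_leak_sensor_status(self):
        # Collect every sensor that currently reports a leak.
        return {s for s in self.leakage_sensors if s.is_leak()}


device = LiquidCooling([
    MockLeakageSensor("leak_sensors1"),
    MockLeakageSensor("leak_sensors2", leaking=True),
])
leaking_names = sorted(s.get_name() for s in device.get_leak_sensor_status())
print(leaking_names)
```

An empty result means no sensor reports a leak, which matches the contract described for `get_leak_sensor_status`.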
| 44 | + |
| 45 | +## 5. Thermal control daemon |
| 46 | +A new object, `LiquidCoolingUpdater`, will be added to the thermal control daemon (thermalctld). It is dedicated to monitoring the liquid cooling device status. |
| 47 | + |
| 48 | +During initialization, a separate thread is launched to periodically call the `get_leak_sensor_status` API. This thread is kept apart from the main thermal control loop because the main loop runs with a period of 1 minute, which is reasonable for normal thermal device updates but too slow for leak events. Leak events are critical and must be reported as soon as possible. They are also abnormal and rare, so shortening the main loop period to 1 second just to catch them would be overkill and would hurt performance. |
| 49 | + |
| 50 | +A new configuration will be added to pmon_daemon_control.json to indicate whether the system has a liquid cooling system. If it does not, the object and the thread are not created during thermalctld initialization at all, which avoids any performance overhead. |
| 51 | +``` |
| 52 | +# enable the separate thread for the liquid cooling monitor |
| 53 | +"enable_liquid_cooling": true, |
| 54 | +# interval in seconds between leakage status updates, default 0.5 |
| 55 | +"liquid_cooling_update_interval": 0.5 |
| 56 | +``` |
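
The gating described above could look roughly like the following sketch. It assumes the two keys shown in the configuration fragment; the function name `start_liquid_cooling_monitor` and the `poll_once` callback are hypothetical and stand in for whatever thermalctld would actually call. It is not the daemon's real startup code.

```python
import json
import threading

# Example fragment of pmon_daemon_control.json (other keys omitted).
CONFIG_TEXT = '{"enable_liquid_cooling": true, "liquid_cooling_update_interval": 0.5}'


def start_liquid_cooling_monitor(config, poll_once, stop_event):
    """Start the polling thread only when liquid cooling is enabled.

    Returns the started thread, or None when the platform has no liquid cooling.
    """
    if not config.get("enable_liquid_cooling", False):
        return None
    interval = float(config.get("liquid_cooling_update_interval", 0.5))

    def run():
        # Event.wait() returns True once stop_event is set, which ends the loop;
        # otherwise it times out after `interval` seconds and we poll again.
        while not stop_event.wait(interval):
            poll_once()

    thread = threading.Thread(target=run, name="liquid-cooling-monitor", daemon=True)
    thread.start()
    return thread


config = json.loads(CONFIG_TEXT)
stop_event = threading.Event()
polled = threading.Event()
thread = start_liquid_cooling_monitor(config, polled.set, stop_event)
polled.wait(timeout=5)  # wait until at least one poll has happened
stop_event.set()
thread.join(timeout=5)

# With the feature disabled, nothing is created at all.
disabled = start_liquid_cooling_monitor({"enable_liquid_cooling": False}, polled.set, stop_event)
```

Waiting on an event instead of sleeping lets the thread stop promptly when the daemon shuts down, which is one possible design choice among several.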
| 57 | + |
| 58 | +Once a leak has been detected, the thread writes it to the state DB to notify the system health monitor. At the same time, the following error message is logged to syslog: |
| 59 | +"Liquid cooling leakage has been detected on sensor{}" |
| 60 | + |
| 61 | +``` |
| 62 | +class LiquidCoolingUpdater(object): |
| 63 | +    def update(self): |
| 64 | +        self._refresh_leak_status_update() |
| 65 | +        self._refresh_other_status_update() |
| 66 | + |
| 67 | +    def _refresh_leak_status_update(self): |
| 68 | +        liquid_cooling_object = self.chassis.get_liquid_cooling_device() |
| 69 | +        leaking_sensors = liquid_cooling_object.get_leak_sensor_status() |
| 70 | +        # Update the state DB according to leaking_sensors (an empty set means no leak). |
| 71 | +``` |
| 72 | + |
| 73 | +### state_db data schema |
| 74 | +The `LIQUID_COOLING_DEVICE` table stores all the data gathered by the thermal control daemon. Currently, it contains only `leakage_sensors` entries. |
| 75 | + |
| 76 | +``` |
| 77 | +Defines a logical structure for liquid cooling devices, with keys for various sensors. |
| 78 | +
|
| 79 | +key = LIQUID_COOLING_DEVICE|leakage_sensors{X} |
| 80 | + ; field = value |
| 81 | +name = STR ; sensor name |
| 82 | +leaking = STR ; Yes or No to indicate leakage status |
| 83 | +``` |
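
As a rough illustration of this schema, the sketch below turns a mapping of sensor names to leak states into the keys and fields described above. The helper name `build_state_db_entries` and the 1-based numbering of `{X}` are assumptions of this example, and a plain dict stands in for the state DB table.

```python
TABLE_NAME = "LIQUID_COOLING_DEVICE"


def build_state_db_entries(sensor_states):
    """Map {sensor name: leaking?} to {state DB key: {field: value}}.

    Sensors are numbered from 1 in the order given (an assumption of this sketch).
    """
    entries = {}
    for index, (name, leaking) in enumerate(sensor_states.items(), start=1):
        key = "{}|leakage_sensors{}".format(TABLE_NAME, index)
        entries[key] = {"name": name, "leaking": "Yes" if leaking else "No"}
    return entries


entries = build_state_db_entries({"leak_sensors1": False, "leak_sensors2": True})
for key, fields in entries.items():
    print(key, fields)
```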
| 84 | + |
| 85 | +## 6. System health monitor |
| 86 | +A new function named `_check_liquid_cooling_status(self, config)` will be added to hardware_checker.py in the system health monitor. It monitors the leakage detection values in the state DB and sends out a gNMI event once a leak is detected. |
| 87 | +Note that both transitions, from no leak to leak and from leak to no leak, trigger an event. |
| 88 | + |
| 89 | +``` |
| 90 | +def publish_events(self, leakage_sensor_list): |
| 91 | +    params = swsscommon.FieldValueMap() |
| 92 | +    for leakage_sensor in leakage_sensor_list: |
| 93 | +        swsscommon.event_publish(self.events_handle, EVENTS_PUBLISHER_TAG, params) |
| 94 | +``` |
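
Because an event is sent for a change in either direction, the checker needs to compare the current state DB values with the previous ones. A minimal sketch of that comparison follows. The function name `find_transitions` is hypothetical, and treating a sensor seen for the first time as previously "No" is an assumption of this example.

```python
def find_transitions(previous, current):
    """Return {sensor name: new state} for sensors whose state changed.

    `previous` and `current` map sensor names to "Yes"/"No" as stored in the
    state DB. A sensor seen for the first time is treated as previously "No"
    (an assumption of this sketch).
    """
    changed = {}
    for name, state in current.items():
        if previous.get(name, "No") != state:
            changed[name] = state
    return changed


previous = {"leak_sensors1": "No", "leak_sensors2": "Yes", "leak_sensors3": "No"}
current = {"leak_sensors1": "Yes", "leak_sensors2": "No", "leak_sensors3": "No"}
transitions = find_transitions(previous, current)
print(transitions)
```

Only the sensors returned here would be passed on for publishing; unchanged sensors produce no event.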
| 95 | + |
| 96 | +### The gNMI event model |
| 97 | +EVENTS_PUBLISHER_SOURCE = "sonic-events-host" |
| 98 | + |
| 99 | +EVENTS_PUBLISHER_TAG = "liquid-cooling-leak" |
| 100 | + |
| 101 | +## 7. CLI management |
| 102 | +A new command, `show platform leakage status`, will be added to show the current status of the leak sensors. The data comes from the state DB. |
| 103 | +``` |
| 104 | +Name             Leak |
| 105 | +--------------------- |
| 106 | +leak_sensors1    No |
| 107 | +leak_sensors2    No |
| 108 | +... |
| 109 | +leak_sensorsX    Yes |
| 110 | +``` |
| 111 | + |
| 112 | +As the new functionality will be added to system health monitor, the relevant command `show system-health detail` will be updated. |
| 113 | +``` |
| 114 | +System services and devices monitor list |
| 115 | +Name                     Status    Type |
| 116 | +-----------------------  --------  ---------- |
| 117 | +... |
| 118 | +leak_sensors1            OK        LiquidCooling |
| 119 | +leak_sensors2            OK        LiquidCooling |
| 120 | +... |
| 121 | +leak_sensors3            Not OK    LiquidCooling |
| 122 | +``` |
| 123 | + |
| 124 | +## 8. Performance |
| 125 | +A separate thread is launched in the thermal control daemon to keep monitoring the entire liquid cooling device status at a 0.5 s interval. |
| 126 | + |
| 127 | +## 9. Testing |
| 128 | +A mock test should be created to demonstrate the functionality of this implementation. After a leak event is simulated, the following need to be checked: |
| 129 | +1. the correct sensor number is indicated in the syslog message |
| 130 | +2. the state DB is updated correctly |
| 131 | +3. the gNMI event has been sent out |
| 132 | +4. `show platform leakage status` command output is correct |
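
A mock test along these lines might be sketched with `unittest.mock` as shown below. The `refresh_leak_status` function is a simplified, hypothetical stand-in for the updater's leak refresh step, a plain dict stands in for the state DB, and the log message is modeled on the one in section 5. Only checks 1 and 2 are covered here; the gNMI event and the CLI output would need their own tests.

```python
from unittest import mock


def refresh_leak_status(chassis, state_table, log_error):
    """Simplified stand-in for the updater's leak refresh step."""
    device = chassis.get_liquid_cooling_device()
    leaking = {s.get_name() for s in device.get_leak_sensor_status()}
    for sensor in device.leakage_sensors:
        name = sensor.get_name()
        is_leaking = name in leaking
        state_table[name] = "Yes" if is_leaking else "No"
        if is_leaking:
            log_error("Liquid cooling leakage has been detected on sensor {}".format(name))


def make_sensor(name, leaking):
    """Build a mocked sensor with a fixed name and leak state."""
    sensor = mock.Mock()
    sensor.get_name.return_value = name
    sensor.is_leak.return_value = leaking
    return sensor


# Simulate a leak on the second of two sensors.
sensors = [make_sensor("leak_sensors1", False), make_sensor("leak_sensors2", True)]
chassis = mock.Mock()
device = chassis.get_liquid_cooling_device.return_value
device.leakage_sensors = sensors
device.get_leak_sensor_status.return_value = {s for s in sensors if s.is_leak()}

state_table = {}
log_error = mock.Mock()
refresh_leak_status(chassis, state_table, log_error)
```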