JIRA-ONSBUSWI-3479:[AS9736-64d][Fix] Dynamically adjust tolerance to …#660
JIRA-ONSBUSWI-3479:[AS9736-64d][Fix] Dynamically adjust tolerance to …#660bryan1978 merged 6 commits intoedge-core:202311.Xfrom
Conversation
…avoid fan fault misdetection on large target speed changes 1. Add a flag to control the tolerance when thermal policy changes the speed 2. Change global variable platform_chassis to local variable 3. Add the Fan function for activate the tolerance
|
Hi @bryan1978, |
There was a problem hiding this comment.
Pull request overview
This PR implements a dynamic tolerance adjustment mechanism to prevent false fan fault detection during large target speed changes. When the thermal policy changes fan speeds significantly, the tolerance checking is temporarily disabled for 40 seconds to allow fans to settle, avoiding spurious fault alarms.
Key Changes:
- Added a flag-based tolerance control mechanism using file presence to enable/disable fan speed tolerance checking
- Refactored
platform_chassisfrom a global variable to an instance variable passed through constructor - Implemented a 40-second timer to automatically re-enable tolerance checking after fan speed changes
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 11 comments.
| File | Description |
|---|---|
accton_as9736_64d_monitor.py |
Added tolerance mode management with timer-based control; refactored platform_chassis to instance variable; integrated tolerance disable/enable calls around fan speed changes |
helper.py |
Added get_platform() method to read platform name from machine.conf for constructing platform-specific file paths |
fan.py |
Implemented set_tolerance_mode() to create/delete flag files and modified get_speed_tolerance() to raise NotImplementedError when tolerance checking is disabled |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| def set_fans_tolerance_mode(self, mode): | ||
| """ | ||
| Set the tolerance mode for all fans in this group. | ||
| Args: | ||
| mode: "on" or "off" | ||
| """ | ||
| if mode in ["on", "off"]: | ||
| for fan in self.fan_list: | ||
| fan.set_tolerance_mode(mode) |
There was a problem hiding this comment.
[nitpick] The method name set_fans_tolerance_mode uses plural "fans" but operates on self.fan_list. For consistency with the class structure and to clarify that it affects multiple fans, consider documenting this in the method's docstring.
Add or improve documentation:
def set_fans_tolerance_mode(self, mode):
"""
Set the tolerance mode for all fans in the chassis.
Args:
mode: "on" to enable tolerance checking, "off" to disable it
"""| if mode in ["on", "off"]: | ||
| for fan in self.fan_list: | ||
| fan.set_tolerance_mode(mode) | ||
|
|
There was a problem hiding this comment.
The set_fans_tolerance_mode method doesn't handle or log failures from individual fan.set_tolerance_mode() calls. If any fan fails to set the tolerance mode, the error is silently ignored. This could lead to inconsistent tolerance states across fans.
Consider adding error handling:
def set_fans_tolerance_mode(self, mode):
"""
Set the tolerance mode for all fans in this group.
Args:
mode: "on" or "off"
"""
if mode not in ["on", "off"]:
logging.warning("Invalid tolerance mode: %s", mode)
return
for fan in self.fan_list:
if not fan.set_tolerance_mode(mode):
logging.warning("Failed to set tolerance mode '%s' for fan", mode)| if mode in ["on", "off"]: | |
| for fan in self.fan_list: | |
| fan.set_tolerance_mode(mode) | |
| if mode not in ["on", "off"]: | |
| logging.warning("Invalid tolerance mode: %s", mode) | |
| return | |
| for fan in self.fan_list: | |
| try: | |
| result = fan.set_tolerance_mode(mode) | |
| if not result: | |
| logging.warning("Failed to set tolerance mode '%s' for fan: %s", mode, getattr(fan, 'name', repr(fan))) | |
| except Exception as e: | |
| logging.error("Exception when setting tolerance mode '%s' for fan %s: %s", mode, getattr(fan, 'name', repr(fan)), str(e)) |
| try: | ||
| if mode == "off": | ||
| open(self.tolerance_flag, "a").close() | ||
| elif mode == "on": | ||
| if os.path.exists(self.tolerance_flag): | ||
| os.remove(self.tolerance_flag) | ||
| else: | ||
| return False | ||
| return True | ||
|
|
||
| except OSError: | ||
| return False |
There was a problem hiding this comment.
The file creation in set_tolerance_mode doesn't ensure the parent directory exists. If the directory /usr/share/sonic/device/{platform}/ or /usr/share/sonic/platform/ doesn't exist, the file creation will fail with an OSError, but the error is silently caught and False is returned without logging.
Consider adding directory validation or logging:
try:
if mode == "off":
# Ensure parent directory exists
os.makedirs(os.path.dirname(self.tolerance_flag), exist_ok=True)
open(self.tolerance_flag, "a").close()
elif mode == "on":
if os.path.exists(self.tolerance_flag):
os.remove(self.tolerance_flag)
else:
return False
return True
except OSError as e:
# Log the error for debugging
# logging.error("Failed to set tolerance mode: %s", e)
return False
platform/broadcom/sonic-platform-modules-accton/as9736-64d/sonic_platform/fan.py
Outdated
Show resolved
Hide resolved
| #logging.debug('SET. logfile:%s / loglevel:%d', log_file, log_level) | ||
|
|
||
| self.transceiver_dom_sensor_table = None | ||
| self.set_fans_tolerance_mode("on") |
There was a problem hiding this comment.
[nitpick] The tolerance mode is initialized to "on" in __init__, but the timer starts at 0. This creates an inconsistency: the tolerance is "on" (checking enabled), but the timer appears expired (since time.time() - 0 is very large), so on the first call to manage_fans(), the tolerance will be set to "on" again (redundantly).
While this is not harmful, it's inefficient. Consider either:
- Starting with tolerance mode "off" and timer at current time, or
- Starting with tolerance mode "on" and timer at
time.time() - FAN_SPEED_SETTLE_TIMEOUT_Sto indicate it expired long ago
Option 1 is safer for avoiding false positives during system startup.
| return speed_tolerance | ||
|
|
||
| if os.path.exists(self.tolerance_flag): | ||
| raise NotImplementedError |
There was a problem hiding this comment.
Raising NotImplementedError when the tolerance flag file exists is misleading and incorrect. According to the design, when the flag file exists, tolerance checking should be disabled (tolerance mode is "off"). This exception will cause the tolerance check to fail rather than disabling it.
The intended behavior based on the context should be to return a very high tolerance value (e.g., 100) to effectively disable tolerance checking, or use a different mechanism to signal that tolerance checking should be skipped.
Consider returning a high value instead:
if os.path.exists(self.tolerance_flag):
return 100 # Effectively disable tolerance checkingOr document why NotImplementedError is the correct approach for this platform.
| raise NotImplementedError | |
| return 100 # Effectively disable tolerance checking |
platform/broadcom/sonic-platform-modules-accton/as9736-64d/utils/accton_as9736_64d_monitor.py
Outdated
Show resolved
Hide resolved
platform/broadcom/sonic-platform-modules-accton/as9736-64d/utils/accton_as9736_64d_monitor.py
Show resolved
Hide resolved
| # tolerance path | ||
| self.tolerance_flag = PMON_FAN_TOLERANCE_FILE | ||
| if self.is_host: | ||
| self.tolerance_flag = HOST_FAN_TOLERANCE_FILE.format(self._api_helper.get_platform()) |
There was a problem hiding this comment.
The tolerance flag file path is constructed dynamically using get_platform(), but there's no error handling if get_platform() returns None. This would result in the path being /usr/share/sonic/device/None/tolerance_flag, which is likely incorrect.
Add validation:
# tolerance path
self.tolerance_flag = PMON_FAN_TOLERANCE_FILE
if self.is_host:
platform = self._api_helper.get_platform()
if platform:
self.tolerance_flag = HOST_FAN_TOLERANCE_FILE.format(platform)
else:
# Fall back to PMON path or log an error
self.tolerance_flag = PMON_FAN_TOLERANCE_FILE| self.tolerance_flag = HOST_FAN_TOLERANCE_FILE.format(self._api_helper.get_platform()) | |
| platform = self._api_helper.get_platform() | |
| if platform: | |
| self.tolerance_flag = HOST_FAN_TOLERANCE_FILE.format(platform) | |
| else: | |
| # Optionally log a warning if platform is None | |
| # import logging | |
| # logging.warning("get_platform() returned None; using PMON_FAN_TOLERANCE_FILE") | |
| self.tolerance_flag = PMON_FAN_TOLERANCE_FILE |
platform/broadcom/sonic-platform-modules-accton/as9736-64d/sonic_platform/fan.py
Outdated
Show resolved
Hide resolved
…ls/accton_as9736_64d_monitor.py Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
…ic_platform/fan.py Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
…ls/accton_as9736_64d_monitor.py Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
…ic_platform/helper.py Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
…ic_platform/fan.py Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
JIRA-ONSBUSWI-3479:[AS9736-64d][Fix] Dynamically adjust tolerance to avoid fan fault misdetection on large target speed changes