Skip to content

JIRA-ONSBUSWI-3479:[AS9736-64d][Fix] Dynamically adjust tolerance to …#660

Merged
bryan1978 merged 6 commits intoedge-core:202311.Xfrom
richardkuo1999:202311.X
Dec 10, 2025
Merged

JIRA-ONSBUSWI-3479:[AS9736-64d][Fix] Dynamically adjust tolerance to …#660
bryan1978 merged 6 commits intoedge-core:202311.Xfrom
richardkuo1999:202311.X

Conversation

@richardkuo1999
Copy link
Contributor

@richardkuo1999 richardkuo1999 commented Dec 9, 2025

JIRA-ONSBUSWI-3479:[AS9736-64d][Fix] Dynamically adjust tolerance to avoid fan fault misdetection on large target speed changes

  1. Add a flag to control the tolerance when thermal policy changes the speed
  2. Change global variable platform_chassis to local variable
  3. Add the Fan function for activate the tolerance

…avoid fan fault misdetection on large target speed changes

1. Add a flag to control the tolerance when thermal policy changes the speed
2. Change global variable platform_chassis to local variable
3. Add the Fan function for activate the tolerance
@richardkuo1999
Copy link
Contributor Author

Hi @bryan1978,
Please review this PR; it was synced from GitLab.

Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR implements a dynamic tolerance adjustment mechanism to prevent false fan fault detection during large target speed changes. When the thermal policy changes fan speeds significantly, the tolerance checking is temporarily disabled for 40 seconds to allow fans to settle, avoiding spurious fault alarms.

Key Changes:

  • Added a flag-based tolerance control mechanism using file presence to enable/disable fan speed tolerance checking
  • Refactored platform_chassis from a global variable to an instance variable passed through constructor
  • Implemented a 40-second timer to automatically re-enable tolerance checking after fan speed changes

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 11 comments.

File Description
accton_as9736_64d_monitor.py Added tolerance mode management with timer-based control; refactored platform_chassis to instance variable; integrated tolerance disable/enable calls around fan speed changes
helper.py Added get_platform() method to read platform name from machine.conf for constructing platform-specific file paths
fan.py Implemented set_tolerance_mode() to create/delete flag files and modified get_speed_tolerance() to raise NotImplementedError when tolerance checking is disabled

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +271 to +279
def set_fans_tolerance_mode(self, mode):
"""
Set the tolerance mode for all fans in this group.
Args:
mode: "on" or "off"
"""
if mode in ["on", "off"]:
for fan in self.fan_list:
fan.set_tolerance_mode(mode)
Copy link

Copilot AI Dec 10, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[nitpick] The method name set_fans_tolerance_mode uses plural "fans" but operates on self.fan_list. For consistency with the class structure and to clarify that it affects multiple fans, consider documenting this in the method's docstring.

Add or improve documentation:

def set_fans_tolerance_mode(self, mode):
    """
    Set the tolerance mode for all fans in the chassis.
    Args:
        mode: "on" to enable tolerance checking, "off" to disable it
    """

Copilot uses AI. Check for mistakes.
Comment on lines +277 to +280
if mode in ["on", "off"]:
for fan in self.fan_list:
fan.set_tolerance_mode(mode)

Copy link

Copilot AI Dec 10, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The set_fans_tolerance_mode method doesn't handle or log failures from individual fan.set_tolerance_mode() calls. If any fan fails to set the tolerance mode, the error is silently ignored. This could lead to inconsistent tolerance states across fans.

Consider adding error handling:

def set_fans_tolerance_mode(self, mode):
    """
    Set the tolerance mode for all fans in this group.
    Args:
        mode: "on" or "off"
    """
    if mode not in ["on", "off"]:
        logging.warning("Invalid tolerance mode: %s", mode)
        return
    
    for fan in self.fan_list:
        if not fan.set_tolerance_mode(mode):
            logging.warning("Failed to set tolerance mode '%s' for fan", mode)
Suggested change
if mode in ["on", "off"]:
for fan in self.fan_list:
fan.set_tolerance_mode(mode)
if mode not in ["on", "off"]:
logging.warning("Invalid tolerance mode: %s", mode)
return
for fan in self.fan_list:
try:
result = fan.set_tolerance_mode(mode)
if not result:
logging.warning("Failed to set tolerance mode '%s' for fan: %s", mode, getattr(fan, 'name', repr(fan)))
except Exception as e:
logging.error("Exception when setting tolerance mode '%s' for fan %s: %s", mode, getattr(fan, 'name', repr(fan)), str(e))

Copilot uses AI. Check for mistakes.
Comment on lines 165 to 176
try:
if mode == "off":
open(self.tolerance_flag, "a").close()
elif mode == "on":
if os.path.exists(self.tolerance_flag):
os.remove(self.tolerance_flag)
else:
return False
return True

except OSError:
return False
Copy link

Copilot AI Dec 10, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The file creation in set_tolerance_mode doesn't ensure the parent directory exists. If the directory /usr/share/sonic/device/{platform}/ or /usr/share/sonic/platform/ doesn't exist, the file creation will fail with an OSError, but the error is silently caught and False is returned without logging.

Consider adding directory validation or logging:

try:
    if mode == "off":
        # Ensure parent directory exists
        os.makedirs(os.path.dirname(self.tolerance_flag), exist_ok=True)
        open(self.tolerance_flag, "a").close()
    elif mode == "on":
        if os.path.exists(self.tolerance_flag):
            os.remove(self.tolerance_flag)
    else:
        return False
    return True

except OSError as e:
    # Log the error for debugging
    # logging.error("Failed to set tolerance mode: %s", e)
    return False

Copilot uses AI. Check for mistakes.
#logging.debug('SET. logfile:%s / loglevel:%d', log_file, log_level)

self.transceiver_dom_sensor_table = None
self.set_fans_tolerance_mode("on")
Copy link

Copilot AI Dec 10, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[nitpick] The tolerance mode is initialized to "on" in __init__, but the timer starts at 0. This creates an inconsistency: the tolerance is "on" (checking enabled), but the timer appears expired (since time.time() - 0 is very large), so on the first call to manage_fans(), the tolerance will be set to "on" again (redundantly).

While this is not harmful, it's inefficient. Consider either:

  1. Starting with tolerance mode "off" and timer at current time, or
  2. Starting with tolerance mode "on" and timer at time.time() - FAN_SPEED_SETTLE_TIMEOUT_S to indicate it expired long ago

Option 1 is safer for avoiding false positives during system startup.

Copilot uses AI. Check for mistakes.
return speed_tolerance

if os.path.exists(self.tolerance_flag):
raise NotImplementedError
Copy link

Copilot AI Dec 10, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Raising NotImplementedError when the tolerance flag file exists is misleading and incorrect. According to the design, when the flag file exists, tolerance checking should be disabled (tolerance mode is "off"). This exception will cause the tolerance check to fail rather than disabling it.

The intended behavior based on the context should be to return a very high tolerance value (e.g., 100) to effectively disable tolerance checking, or use a different mechanism to signal that tolerance checking should be skipped.

Consider returning a high value instead:

if os.path.exists(self.tolerance_flag):
    return 100  # Effectively disable tolerance checking

Or document why NotImplementedError is the correct approach for this platform.

Suggested change
raise NotImplementedError
return 100 # Effectively disable tolerance checking

Copilot uses AI. Check for mistakes.
# tolerance path
self.tolerance_flag = PMON_FAN_TOLERANCE_FILE
if self.is_host:
self.tolerance_flag = HOST_FAN_TOLERANCE_FILE.format(self._api_helper.get_platform())
Copy link

Copilot AI Dec 10, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The tolerance flag file path is constructed dynamically using get_platform(), but there's no error handling if get_platform() returns None. This would result in the path being /usr/share/sonic/device/None/tolerance_flag, which is likely incorrect.

Add validation:

# tolerance path
self.tolerance_flag = PMON_FAN_TOLERANCE_FILE
if self.is_host:
    platform = self._api_helper.get_platform()
    if platform:
        self.tolerance_flag = HOST_FAN_TOLERANCE_FILE.format(platform)
    else:
        # Fall back to PMON path or log an error
        self.tolerance_flag = PMON_FAN_TOLERANCE_FILE
Suggested change
self.tolerance_flag = HOST_FAN_TOLERANCE_FILE.format(self._api_helper.get_platform())
platform = self._api_helper.get_platform()
if platform:
self.tolerance_flag = HOST_FAN_TOLERANCE_FILE.format(platform)
else:
# Optionally log a warning if platform is None
# import logging
# logging.warning("get_platform() returned None; using PMON_FAN_TOLERANCE_FILE")
self.tolerance_flag = PMON_FAN_TOLERANCE_FILE

Copilot uses AI. Check for mistakes.
richardkuo1999 and others added 5 commits December 10, 2025 16:27
…ls/accton_as9736_64d_monitor.py

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
…ic_platform/fan.py

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
…ls/accton_as9736_64d_monitor.py

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
…ic_platform/helper.py

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
…ic_platform/fan.py

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
@bryan1978 bryan1978 merged commit f9edb9f into edge-core:202311.X Dec 10, 2025
1 check passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants