Skip to content

Conversation

@vvolam
Copy link
Contributor

@vvolam vvolam commented Oct 31, 2025

Description

HLD: https://github.com/sonic-net/SONiC/blob/master/doc/smart-switch/graceful-shutdown/graceful-shutdown.md
These changes build upon enhancements in sonic-platform-common#567

This change introduces enhancements to the ModuleBase class to support graceful shutdown and startup operations for DPU and other module types.

It adds new methods and transition handling logic to ensure platform modules follow an ordered and coordinated shutdown/startup procedure, minimizing hardware inconsistencies and transient errors during reboot or DPU detachment.

Key changes include:
Added transition management APIs:

set_module_state_transition()
get_module_state_transition()
clear_module_state_transition()

Introduced graceful lifecycle handlers:

  • _graceful_shutdown_handler() to wait for external transition completion using gnoi_halt_in_progress field with timeout handling
  • Implemented database-backed transition tracking (CHASSIS_MODULE_TABLE)

Added helper functions for:

  • File-based operation locks to ensure concurrency safety during transitions
  • Included caching of transition timeout configuration from platform.json
  • Added robust error-handling and logging to prevent partial updates in Redis DB

Motivation and Context

This enhancement is part of the SmartSwitch / DPU graceful shutdown/reboot and state management effort.
Currently, ModuleBase lacks lifecycle orchestration methods for safe shutdown or startup of DPUs and peripheral modules.
By adding transition-aware handling, the system can:

Avoid race conditions between platform daemons during reboot/shutdown
Ensure state transitions are reflected in Redis (CHASSIS_MODULE_TABLE)
Support controlled detach/reattach of PCIe devices and sensor configuration reloads
Enable PMON daemons to coordinate module-level transitions consistently
This work aligns with SONiC’s graceful reboot framework and the upcoming DPU lifecycle enhancements tracked internally.

How Has This Been Tested?

Testing performed on both SmartSwitch (DPU-enabled) and non-DPU platforms:

  • ✅ Unit tests added under tests/test_module_base.py covering:
    • Transition management (set/get/clear)
    • Timeout behavior and concurrency lock handling
    • PCIe detach/reattach and sensor config updates
    • Graceful shutdown/startup flows (set_admin_state_gracefully)
  • ✅ Verified Redis DB updates for transition keys under CHASSIS_MODULE_TABLE
  • ✅ Simulated shutdown and startup sequences:
    • module_pre_shutdown() → safely detaches PCIe and updates state
    • module_post_startup() → rescans PCIe and restores sensor configuration
  • ✅ Regression-tested existing platform daemons to ensure backward compatibility

Additional Information (Optional)

@mssonicbld
Copy link
Collaborator

/azp run

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR refactors and expands the ModuleBase class to support graceful state transitions for SmartSwitch modules, along with comprehensive test coverage improvements.

  • Adds graceful admin state management with timeout handling and state transition tracking
  • Introduces state database initialization and centralized transition flag management
  • Refactors test suite for better organization, readability, and parametrization

Reviewed Changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.

File Description
sonic_platform_base/module_base.py Adds graceful shutdown support, state DB initialization, transition lock mechanism, and state transition management APIs
tests/module_base_test.py Completely refactored tests with better organization, parametrization, and comprehensive coverage of new graceful shutdown and state transition features
Comments suppressed due to low confidence (1)

sonic_platform_base/module_base.py:116

  • The _file_operation_lock method attempts to open the lock file without ensuring the parent directory exists. If /var/lock/ doesn't exist, this will raise FileNotFoundError. The tests mock os.makedirs (line 93 in test file), suggesting directory creation was intended but never implemented in the production code. Add os.makedirs(os.path.dirname(lock_file_path), exist_ok=True) before opening the file.
    def _file_operation_lock(self, lock_file_path):
        """Common file-based lock for operations using flock"""
        with open(lock_file_path, 'w') as f:

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@mssonicbld
Copy link
Collaborator

/azp run

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@vvolam vvolam changed the title Enhance ModuleBase with graceful shutdown and startup transition handling [SmartSwitch] Enhance ModuleBase with graceful shutdown and startup transition handling Nov 1, 2025
@mssonicbld
Copy link
Collaborator

/azp run

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@mssonicbld
Copy link
Collaborator

/azp run

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@mssonicbld
Copy link
Collaborator

/azp run

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@vvolam
Copy link
Contributor Author

vvolam commented Nov 4, 2025

@rameshraghupathy @gpunathilell could you please review this latest PR

@rameshraghupathy
Copy link
Contributor

@vvolam A mechanism to indicate the completion of "module_pre_shutdown" to "gnoi_shutdown_daemon" is needed in module_base.py so that the gnoi request can be sent to the DPU.

@rameshraghupathy
Copy link
Contributor

@vvolam The graceful_shutdown_handler in module_base should check for the completion of gnoi from the gnoi_shutdown_daemon before proceeding to trigger set_admin_state(False)

@vvolam
Copy link
Contributor Author

vvolam commented Nov 5, 2025

@vvolam The graceful_shutdown_handler in module_base should check for the completion of gnoi from the gnoi_shutdown_daemon before proceeding to trigger set_admin_state(False)

@rameshraghupathy I think this is not considered in the HLD as we are not protecting whole state transition using transition_in_progress. Let introduce a new field gnoi_halt_in_progress to start and complete gnoi HALT and update the PR. Thank you for pointing this out

@mssonicbld
Copy link
Collaborator

/azp run

Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated no new comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@mssonicbld
Copy link
Collaborator

/azp run

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@mssonicbld
Copy link
Collaborator

/azp run

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@rameshraghupathy
Copy link
Contributor

@vvolam Mostly LGTM. I'll approve, but take care of the following? I guess, @gpunathilell also has raised this. get_module_state_transition() does not check for timeout, while set_module_state_transition() does. This creates an inconsistency where: get_module_state_transition() returns True if the flag is set, even if the transition has timed out
Callers in sonic-utilities rely on this to make decisions about whether to proceed with operations
This could lead to blocking operations indefinitely or incorrect state assumptions
The function should validate timeout before returning the transition state, similar to how set_module_state_transition() handles it.

@vvolam
Copy link
Contributor Author

vvolam commented Nov 20, 2025

@vvolam Mostly LGTM. I'll approve, but take care of the following? I guess, @gpunathilell also has raised this. get_module_state_transition() does not check for timeout, while set_module_state_transition() does. This creates an inconsistency where: get_module_state_transition() returns True if the flag is set, even if the transition has timed out Callers in sonic-utilities rely on this to make decisions about whether to proceed with operations This could lead to blocking operations indefinitely or incorrect state assumptions The function should validate timeout before returning the transition state, similar to how set_module_state_transition() handles it.

@rameshraghupathy latest PR commit is handling timeout condition in get_module_state_transition(). Please check lines at 705 in module_base.py. Thank you

@vvolam vvolam requested a review from gpunathilell November 20, 2025 01:10
@yxieca yxieca merged commit cdf3008 into sonic-net:master Nov 21, 2025
5 checks passed
@vvolam vvolam deleted the graceful-shutdown branch November 21, 2025 19:01
@liushilongbuaa
Copy link
Contributor

@vvolam , can you help check submodule advance PR failure? Mellanox build keeps failing.
sonic-net/sonic-buildimage#24598

@vvolam
Copy link
Contributor Author

vvolam commented Dec 2, 2025

@vvolam , can you help check submodule advance PR failure? Mellanox build keeps failing. sonic-net/sonic-buildimage#24598

@liushilongbuaa Issue should be fixed with PR #612

@mssonicbld
Copy link
Collaborator

Cherry-pick PR to 202511: #613

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants