Module graceful shutdown support #255
base: master
Conversation
/azp run

Azure Pipelines successfully started running 1 pipeline(s).
Do you mind pasting the testing steps (commands) and their output in the PR description?
Pull Request Overview
Copilot reviewed 8 out of 8 changed files in this pull request and generated 15 comments.
Co-authored-by: Copilot <[email protected]>
@qiluo-msft could you please review this new service addition to sonic-host-services? |
scripts/check_platform.py (outdated diff excerpt):

    capture_output=True,
    text=True,
    timeout=5
)
Maybe use native Python for this purpose?

from swsscommon import swsscommon

config_db = swsscommon.ConfigDBConnector()
config_db.connect()
entry = config_db.get_entry('DEVICE_METADATA', 'localhost')
subtype = entry.get('subtype') if entry else None

If you need the exact same behavior as the subprocess (string output):

from swsscommon import swsscommon

config_db = swsscommon.ConfigDBConnector()
config_db.connect()
entry = config_db.get_entry('DEVICE_METADATA', 'localhost')
result = entry.get('subtype', '') if entry else ''

With error handling like the original subprocess:

from swsscommon import swsscommon

try:
    config_db = swsscommon.ConfigDBConnector()
    config_db.connect()
    entry = config_db.get_entry('DEVICE_METADATA', 'localhost')
    subtype = entry.get('subtype') if entry else None
except Exception:
    subtype = None
@hdwhdw I have simplified it even further; please take a look.
# gNOI helpers
# ############

def execute_gnoi_command(command_args, timeout_sec=REBOOT_RPC_TIMEOUT_SEC):
@rameshraghupathy in that case, maybe at least rename the function to execute_command instead of execute_gnoi_command?
'scripts/determine-reboot-cause',
'scripts/process-reboot-cause',
'scripts/check_platform.py',
'scripts/wait-for-sonic-core.sh',
@rameshraghupathy why are we not copying gnoi_shutdown_daemon.py? Is the PR tested end to end? If yes, could you share the gnmi.logs with reboot call log samples?
@vvolum Looks like it was missed, thanks! I tested it locally; please find the results below.
gNOI halt request failing case: using DPU1
2025 Nov 15 09:04:11.933165 sonic NOTICE gnoi-shutdown-daemon[3931659]: DPU1: Admin shutdown detected, initiating gNOI HALT
2025 Nov 15 09:04:11.933340 sonic INFO python3[3931659]: NOTICE:gnoi-shutdown-daemon:DPU1: Admin shutdown detected, initiating gNOI HALT
2025 Nov 15 09:04:11.933439 sonic NOTICE gnoi-shutdown-daemon[3931659]: DPU1: Starting gNOI shutdown sequence
2025 Nov 15 09:04:11.933557 sonic INFO python3[3931659]: NOTICE:gnoi-shutdown-daemon:DPU1: Starting gNOI shutdown sequence
2025 Nov 15 09:04:12.352564 sonic NOTICE gnoi-shutdown-daemon[3931659]: DPU1: PCI detach complete, proceeding for halting services via gNOI
2025 Nov 15 09:04:12.352701 sonic INFO python3[3931659]: NOTICE:gnoi-shutdown-daemon:DPU1: PCI detach complete, proceeding for halting services via gNOI
2025 Nov 15 09:05:29.396271 sonic INFO python3[3931659]: NOTICE:gnoi-shutdown-daemon:DPU1: gNOI sequence failed
gNOI halt request normal passing case: using DPU2
2025 Nov 15 10:02:52.643878 sonic NOTICE gnoi-shutdown-daemon[3931659]: DPU2: Admin shutdown detected, initiating gNOI HALT
2025 Nov 15 10:02:52.644048 sonic NOTICE gnoi-shutdown-daemon[3931659]: DPU2: Starting gNOI shutdown sequence
2025 Nov 15 10:02:52.644156 sonic INFO python3[3931659]: NOTICE:gnoi-shutdown-daemon:DPU2: Admin shutdown detected, initiating gNOI HALT
2025 Nov 15 10:02:52.644212 sonic INFO python3[3931659]: NOTICE:gnoi-shutdown-daemon:DPU2: Starting gNOI shutdown sequence
2025 Nov 15 10:02:53.030039 sonic NOTICE gnoi-shutdown-daemon[3931659]: DPU2: PCI detach complete, proceeding for halting services via gNOI
2025 Nov 15 10:02:53.030204 sonic INFO python3[3931659]: NOTICE:gnoi-shutdown-daemon:DPU2: PCI detach complete, proceeding for halting services via gNOI
2025 Nov 15 10:03:47.303558 sonic NOTICE gnoi-shutdown-daemon[3931659]: DPU2: boot halt success rc: 0 out_s:System RebootStatus#012{"reason":"Halt reboot completed","count":1,"method":3,"status":{"status":1}}
2025 Nov 15 10:03:47.303710 sonic INFO python3[3931659]: NOTICE:gnoi-shutdown-daemon:DPU2: boot halt success rc: 0 out_s:System RebootStatus
2025 Nov 15 10:03:47.304954 sonic NOTICE gnoi-shutdown-daemon[3931659]: DPU2: gNOI sequence completed
2025 Nov 15 10:03:47.305021 sonic INFO python3[3931659]: NOTICE:gnoi-shutdown-daemon:DPU2: gNOI sequence completed
Name Description Physical-Slot Oper-Status Admin-Status Serial
DPU0 N/A N/A Offline down N/A
DPU1 AMD Pensando DSC N/A Offline down FLM281704EZ-1
DPU2 AMD Pensando DSC N/A Offline down FLM281704EK-0
DPU3 N/A N/A Offline down N/A
DPU4 AMD Pensando DSC N/A Online up FLM281704EM-0
DPU5 AMD Pensando DSC N/A Online up FLM281704EM-1
DPU6 AMD Pensando DSC N/A Online up FLM281704EU-0
DPU7 N/A N/A Offline down N/A
Note:
- Make sure gnoi_shutdown_daemon.py is present in /usr/local/bin
- Make sure the daemon is running
- For MtFuji, the platform module.py should have the following until this feature is committed:

    def module_pre_shutdown(self):
        return True
Provide support for SmartSwitch DPU module graceful shutdown.
Description:
Single source of truth for transitions
All components now use the sonic_platform_base.module_base.ModuleBase helpers:
- set_module_state_transition(db, name, transition_type)
- clear_module_state_transition(db, name)
- get_module_state_transition(db, name) -> dict
- is_module_state_transition_timed_out(db, name, timeout_secs) -> bool

This eliminates duplicated logic and race-prone direct Redis writes.
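The helper names and field names above come from this PR; the bodies below are an illustrative sketch of the lifecycle they manage, using a plain dict in place of the real DB connector.

```python
import time
from datetime import datetime, timezone

TABLE = "CHASSIS_MODULE_TABLE"

def set_module_state_transition(db, name, transition_type):
    # Initiator marks an in-flight transition (startup/shutdown/reboot).
    db.setdefault((TABLE, name), {}).update({
        "state_transition_in_progress": "True",
        "transition_type": transition_type,
        "transition_start_time": datetime.now(timezone.utc).isoformat(),
    })

def get_module_state_transition(db, name):
    # Return the current transition fields for the module as a dict.
    return dict(db.get((TABLE, name), {}))

def clear_module_state_transition(db, name):
    # Platform-side clear: flip the flag and stamp the end time.
    db.setdefault((TABLE, name), {}).update({
        "state_transition_in_progress": "False",
        "transition_end_time": str(int(time.time())),
    })

def is_module_state_transition_timed_out(db, name, timeout_secs):
    # True only when a transition is in flight and its start stamp is stale.
    entry = db.get((TABLE, name), {})
    if entry.get("state_transition_in_progress") != "True":
        return False
    start = entry.get("transition_start_time")
    if not start:
        return True  # in-flight but unstamped: treat as stale
    age = (datetime.now(timezone.utc) - datetime.fromisoformat(start)).total_seconds()
    return age > timeout_secs
```

Keeping all four operations behind one module means every component observes the same field names and the same staleness rule.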
Correct table everywhere

CHASSIS_MODULE_TABLE (replaces CHASSIS_MODULE_INFO_TABLE).

Ownership & lifecycle

The initiator of an operation (startup/shutdown/reboot) sets:
- state_transition_in_progress=True
- transition_type=<op>
- transition_start_time=<utc-iso8601>

The platform (set_admin_state()) is responsible for clearing:
- state_transition_in_progress=False
- transition_end_time=<epoch> (or a similar end stamp)

The CLI pre-clears only when a prior transition has timed out.
Timeouts & policy

- Platform JSON path only: /usr/share/sonic/device/{plat}/platform.json; else constants.
- Typical production values used: startup: 180s, shutdown: 180s (≈ graceful_wait 60s + power 120s), reboot: 120s.
- Graceful wait (e.g., waiting for “Graceful shutdown complete”) is a platform policy, implemented inside the platform set_admin_state(), not in ModuleBase.
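A sketch of that lookup policy: only the platform.json path and the fallback values come from this PR; the per-operation key names inside the JSON are hypothetical.

```python
import json

# Fallback constants from the PR description.
DEFAULT_TIMEOUTS = {"startup": 180, "shutdown": 180, "reboot": 120}

def load_transition_timeouts(platform):
    """Read per-operation timeouts from the platform.json of `platform`,
    falling back to the constants when the file or keys are absent."""
    path = "/usr/share/sonic/device/{}/platform.json".format(platform)
    timeouts = dict(DEFAULT_TIMEOUTS)
    try:
        with open(path) as f:
            data = json.load(f)
    except (OSError, ValueError):
        return timeouts  # no file or bad JSON: use the constants
    for op in timeouts:
        key = "module_{}_timeout_secs".format(op)  # hypothetical key name
        if key in data:
            timeouts[op] = int(data[key])
    return timeouts
```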
Boot behavior

On start, chassisd calls set_initial_dpu_admin_state(), which marks transitions via ModuleBase before calling the platform set_admin_state().

gNOI shutdown daemon

- Listens on CHASSIS_MODULE_TABLE and triggers only when state_transition_in_progress=True and transition_type=shutdown.
- Never clears the flag (ownership stays with the platform).
- Bounded RPC timeouts and robust Redis access (swsssdk/swsscommon).
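The daemon's trigger condition reduces to a small predicate over the table entry; a sketch, with field values as strings as they appear in the DB:

```python
def should_trigger_gnoi_halt(entry):
    """Decide whether a CHASSIS_MODULE_TABLE entry should start the gNOI
    HALT sequence. The daemon only reacts to an in-flight shutdown and
    never clears the flag itself (the platform owns clearing)."""
    return (entry.get("state_transition_in_progress") == "True"
            and entry.get("transition_type") == "shutdown")
```

Any other transition type (startup, reboot) or a cleared flag is ignored, so the daemon cannot race the platform's ownership of the fields.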
CLI (config chassis modules …)

- is_module_state_transition_timed_out() → auto-clear, then proceed.
- Sets the transition fields on startup/shutdown; the platform clears them on completion.

Redis robustness

- hset(mapping=...) usage.

Race reduction & consistency

- A single writer sets transition_start_time; clears may add an end stamp.
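The multi-field write done with hset(mapping=...) could be shaped like the sketch below; the key format and the FakeRedis stand-in (in place of a real redis.Redis client) are illustrative.

```python
from datetime import datetime, timezone

class FakeRedis:
    """Minimal stand-in for redis.Redis, just enough to show the
    hset(mapping=...) call shape (one multi-field write)."""
    def __init__(self):
        self.store = {}

    def hset(self, key, mapping):
        self.store.setdefault(key, {}).update(mapping)

def mark_transition_in_progress(r, module, op):
    # One hset with mapping= writes all three fields in a single command,
    # avoiding the partial-state window of three separate HSET calls.
    key = "CHASSIS_MODULE_TABLE|{}".format(module)  # illustrative key format
    r.hset(key, mapping={
        "state_transition_in_progress": "True",
        "transition_type": op,
        "transition_start_time": datetime.now(timezone.utc).isoformat(),
    })
```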
Change scope

HLD: sonic-net/SONiC#1991
sonic-platform-common: #567 sonic-net/sonic-platform-common#567
sonic-utilities: sonic-net/sonic-utilities#4031
sonic-platform-daemons: sonic-net/sonic-platform-daemons#667
How to verify it
Issue the "config chassis modules shutdown DPUx" command
Verify the DPU module is gracefully shut down by checking the logs in /var/log/syslog on both the NPU and the DPU.