Module graceful shutdown support #255
/azp run
Azure Pipelines successfully started running 1 pipeline(s).
Do you mind pasting the steps and output for testing (commands) in the PR description?
vvolam
left a comment
LGTM, Thank you
@rameshraghupathy Submodule update is blocked due to loganalyzer failures resulting from this PR, I think. Please take a look and help fix: Blocked submodule update: sonic-net/sonic-buildimage#24404
@rameshraghupathy Here is more information on the error path
Cherry-pick PR to 202511: #324
…nic-net#255
… (#333)
Why I did it
The gNOI shutdown daemon service was causing loganalyzer test failures on non-SmartSwitch platforms (e.g., vlab-01). The service attempted to start via ExecStartPre=/usr/local/bin/check_platform.py, which exited with code 1 on incompatible platforms. This caused systemd to log ERROR messages like:
ERR systemd[1]: Failed to start gnoi-shutdown.service - gNOI based DPU Graceful Shutdown Daemon
These errors blocked CI/CD submodule updates due to loganalyzer failures.
How I did it
Changed the service file to use ExecCondition= instead of ExecStartPre= for platform checking:
- ExecCondition=/usr/bin/python3 /usr/local/bin/check_platform.py runs before service start
- When check_platform.py returns exit code 1 on non-SmartSwitch platforms, systemd treats this as a condition not met rather than a failure
- Service is gracefully skipped without error logs on incompatible platforms
Changed Restart=always to Restart=on-failure to avoid unnecessary restart attempts when conditions aren't met
How to verify it
On SmartSwitch NPU platform: Service starts normally and handles DPU graceful shutdown
sonic-net/sonic-buildimage#24609 is run with this change
Which release branch to backport
[x] 202511
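The difference between the two directives can be sketched in Python. This is only an illustration of how systemd treats a pre-start command's exit status, not code from this PR; the function name `classify_prestart_exit` is hypothetical, and the sketch assumes the command is not prefixed with `-` and exits normally.

```python
def classify_prestart_exit(directive: str, exit_code: int) -> str:
    """Model how systemd reacts to the exit code of a pre-start command.

    Hypothetical helper for illustration only. `directive` is either
    "ExecStartPre" or "ExecCondition".
    """
    if exit_code == 0:
        # Both directives let the unit continue to ExecStart= on success.
        return "start"
    if directive == "ExecCondition" and 1 <= exit_code <= 254:
        # Condition not met: the unit is skipped and not marked as failed.
        return "skipped"
    # ExecStartPre with any non-zero code, or ExecCondition with 255,
    # marks the unit as failed (which is what produced the ERR log line).
    return "failed"


if __name__ == "__main__":
    for directive in ("ExecStartPre", "ExecCondition"):
        print(directive, "exit 1 ->", classify_prestart_exit(directive, 1))
```

With exit code 1 from check_platform.py, the old ExecStartPre= form maps to "failed" while the new ExecCondition= form maps to "skipped", which matches the behavior described in the commit message.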
# Hard dep we expect to be up before we start: swss
if ! systemctl is-active --quiet swss.service; then
    log "Waiting for swss.service to become active…"
    systemctl --no-pager --full status swss.service || true
@vvolam, @rameshraghupathy, @qiluo-msft
Why is this needed? Why don’t we use dependencies in the data/debian/sonic-host-services-data.gnoi-shutdown.service file? Why do we need to create an entire script for something that can be handled directly in the service file?
[Unit]
Description=gNOI based DPU Graceful Shutdown Daemon
Requires=database.service
Wants=network-online.target
After=network-online.target database.service swss.service gnmi.service pmon.service
[Service]
Type=simple
ExecStartPre=/usr/bin/python3 /usr/local/bin/check_platform.py
ExecStartPre=/bin/bash /usr/local/bin/wait-for-sonic-core.sh
ExecStart=/usr/bin/python3 /usr/local/bin/gnoi_shutdown_daemon.py
Restart=always
RestartSec=5
[Install]
WantedBy=multi-user.target
Provide support for SmartSwitch DPU module graceful shutdown.
Description:
Single source of truth for transitions
All components now use sonic_platform_base.module_base.ModuleBase helpers:
- set_module_state_transition(db, name, transition_type)
- clear_module_state_transition(db, name)
- get_module_state_transition(db, name) -> dict
- is_module_state_transition_timed_out(db, name, timeout_secs) -> bool
Eliminates duplicated logic and race-prone direct Redis writes.
Correct table everywhere
CHASSIS_MODULE_TABLE (replaces CHASSIS_MODULE_INFO_TABLE).
Ownership & lifecycle
The initiator of an operation (startup/shutdown/reboot) sets:
- state_transition_in_progress=True
- transition_type=<op>
- transition_start_time=<utc-iso8601>
The platform (set_admin_state()) is responsible for clearing:
- state_transition_in_progress=False
- transition_end_time=<epoch> (or similar end stamp).
CLI pre-clears only when a prior transition is timed out.
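The ownership rules above can be sketched as follows. This is a minimal illustration that uses a plain dict in place of the Redis CHASSIS_MODULE_TABLE; the function names `mark_transition`, `clear_transition`, and `is_timed_out` are hypothetical stand-ins and are not the ModuleBase helpers themselves, whose implementation is not shown in this PR description.

```python
from datetime import datetime, timedelta, timezone


def mark_transition(table, name, transition_type, now=None):
    """Initiator side: flag an operation as in progress (hypothetical helper)."""
    now = now or datetime.now(timezone.utc)
    entry = table.setdefault(name, {})
    entry.update({
        "state_transition_in_progress": "True",
        "transition_type": transition_type,
        "transition_start_time": now.isoformat(),  # UTC ISO 8601
    })


def clear_transition(table, name, now=None):
    """Platform side: clear the flag and record an end stamp (hypothetical helper)."""
    now = now or datetime.now(timezone.utc)
    entry = table.setdefault(name, {})
    entry["state_transition_in_progress"] = "False"
    entry["transition_end_time"] = str(int(now.timestamp()))  # epoch seconds


def is_timed_out(table, name, timeout_secs, now=None):
    """Return True if an in-progress transition is older than timeout_secs."""
    now = now or datetime.now(timezone.utc)
    entry = table.get(name, {})
    if entry.get("state_transition_in_progress") != "True":
        return False
    start = entry.get("transition_start_time")
    if not start:
        # Assumption: with no start time recorded, this sketch does not guess.
        return False
    elapsed = (now - datetime.fromisoformat(start)).total_seconds()
    return elapsed > timeout_secs


if __name__ == "__main__":
    t0 = datetime(2025, 1, 1, tzinfo=timezone.utc)
    table = {}
    mark_transition(table, "DPU0", "shutdown", now=t0)
    print(is_timed_out(table, "DPU0", 180, now=t0 + timedelta(seconds=200)))
```

A CLI following the rule above would call the timed-out check first and pre-clear only when it returns True; in every other case the clear is left to the platform.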
Timeouts & policy
Platform JSON path only: /usr/share/sonic/device/{plat}/platform.json; else constants.
Typical production values used: startup: 180s, shutdown: 180s (≈ graceful_wait 60s + power 120s), reboot: 120s.
Graceful wait (e.g., waiting for “Graceful shutdown complete”) is a platform policy and is implemented inside platform set_admin_state(), not in ModuleBase.
Boot behavior
chassisd on start calls set_initial_dpu_admin_state(), which marks transitions via ModuleBase before calling platform set_admin_state().
gNOI shutdown daemon
Listens on CHASSIS_MODULE_TABLE and triggers only when state_transition_in_progress=True and transition_type=shutdown.
Never clears the flag (ownership stays with the platform).
Bounded RPC timeouts and robust Redis access (swsssdk/swsscommon).
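The daemon's trigger condition can be sketched as a small predicate. This is an illustration only: `should_trigger_gnoi_shutdown` is a hypothetical name, the entry is modeled as a plain dict of string fields, and accepting any letter case for the flag value is an assumption of this sketch rather than documented behavior.

```python
def should_trigger_gnoi_shutdown(entry: dict) -> bool:
    """Return True only for an in-progress shutdown transition.

    `entry` stands in for one CHASSIS_MODULE_TABLE row (hypothetical model).
    The predicate only reads the fields; it never clears the flag, since
    clearing is owned by the platform.
    """
    flag = str(entry.get("state_transition_in_progress", ""))
    in_progress = flag.lower() == "true"  # assumption: case-insensitive match
    return in_progress and entry.get("transition_type") == "shutdown"


if __name__ == "__main__":
    row = {"state_transition_in_progress": "True", "transition_type": "shutdown"}
    print(should_trigger_gnoi_shutdown(row))
```

Startup and reboot transitions, and rows whose flag has already been cleared, fall through to False, so the daemon stays idle for them.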
CLI (config chassis modules …)
- is_module_state_transition_timed_out() → auto-clear then proceed.
- startup/shutdown; platform clears on completion.
Redis robustness
- hset(mapping=...) usage.
Race reduction & consistency
- transition_start_time; clears may add an end stamp.
Change scope
HLD: sonic-net/SONiC#1991
sonic-platform-common: sonic-net/sonic-platform-common#567
sonic-utilities: sonic-net/sonic-utilities#4031
sonic-platform-daemons: sonic-net/sonic-platform-daemons#667
How to verify it
Issue the "config chassis modules shutdown DPUx" command.
Verify that the DPU module shuts down gracefully by checking the logs in /var/log/syslog on both the NPU and the DPU.