
Commit ed14cf2

Module graceful shutdown support (#324)
Provide support for SmartSwitch DPU module graceful shutdown.

# Description:

* **Single source of truth for transitions**
  * All components now use `sonic_platform_base.module_base.ModuleBase` helpers:
    * `set_module_state_transition(db, name, transition_type)`
    * `clear_module_state_transition(db, name)`
    * `get_module_state_transition(db, name) -> dict`
    * `is_module_state_transition_timed_out(db, name, timeout_secs) -> bool`
  * Eliminates duplicated logic and race-prone direct Redis writes.
* **Correct table everywhere**
  * Standardized on **`CHASSIS_MODULE_TABLE`** (replaces `CHASSIS_MODULE_INFO_TABLE`).
  * HLD mismatch addressed in code (HLD fix tracked separately).
* **Ownership & lifecycle**
  * The **initiator** of an operation (`startup`/`shutdown`/`reboot`) sets:
    * `state_transition_in_progress=True`
    * `transition_type=<op>`
    * `transition_start_time=<utc-iso8601>`
  * The **platform** (`set_admin_state()`) is responsible for clearing:
    * `state_transition_in_progress=False`
    * optionally `transition_end_time=<epoch>` (or similar end stamp).
  * CLI pre-clears only when a prior transition is **timed out**.
* **Timeouts & policy**
  * Timeouts are read only from the platform JSON at `/usr/share/sonic/device/{plat}/platform.json`; otherwise built-in **constants** are used.
  * Typical production values used:
    * `startup: 180s`, `shutdown: 180s` (≈ `graceful_wait 60s + power 120s`), `reboot: 120s`.
  * **Graceful wait** (e.g., waiting for "Graceful shutdown complete") is a **platform policy** and is implemented inside platform `set_admin_state()`, not in ModuleBase.
* **Boot behavior**
  * `chassisd` on start:
    1. **Clears stale flags once** (centralized sweep).
    2. Runs `set_initial_dpu_admin_state()`, which **marks transitions** via ModuleBase before calling platform `set_admin_state()`.
    3. Leaves clearing to the platform or to well-defined status transitions (ONLINE/OFFLINE) where appropriate.
* **gNOI shutdown daemon**
  * Listens on **`CHASSIS_MODULE_TABLE`** and triggers only when:
    * `state_transition_in_progress=True` **and** `transition_type=shutdown`.
  * Never clears the flag (ownership stays with the platform).
  * Bounded RPC timeouts and robust Redis access (swsssdk/swsscommon).
* **CLI (`config chassis modules …`)**
  * Uses ModuleBase APIs for all set/get/timeout checks.
  * If a previous transition is stuck, `is_module_state_transition_timed_out()` → auto-clear then proceed.
  * Sets transition at the start of `startup`/`shutdown`; platform clears on completion.
  * Fabric card flow retained; edits are surgical.
* **Redis robustness**
  * Helpers handle both stacks (swsssdk/swsscommon); no `hset(mapping=...)` usage.
  * Consistent HGETALL/HSET paths; resilient to connector differences.
* **Race reduction & consistency**
  * Centralized writes prevent multi-writer races.
  * All transition writes include `transition_start_time`; clears may add an end stamp.
  * Existing PCI/file-lock logic left intact; unrelated behavior unchanged.
* **Change scope**
  * Minimal, targeted diffs.
  * No background tasks added, no broad refactors beyond transition handling.
  * Behavior changes are limited to making transition semantics correct and uniform across repos.

HLD: sonic-net/SONiC#1991
sonic-platform-common: sonic-net/sonic-platform-common#567
sonic-utilities: sonic-net/sonic-utilities#4031
sonic-platform-daemons: sonic-net/sonic-platform-daemons#667

How to verify it

Issue the "config chassis modules shutdown DPUx" command.
Verify that the DPU module is gracefully shut down by checking the logs in /var/log/syslog on both the NPU and the DPU.
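The ownership and timeout rules in the description can be illustrated with a small, self-contained sketch. It does not use the real `ModuleBase` helpers or Redis: the table is a plain dict standing in for `CHASSIS_MODULE_TABLE`, every function name below is hypothetical, and the handling of a missing start stamp is an assumption rather than documented behavior. The timeout values are the "typical production values" quoted in the commit message.

```python
from datetime import datetime, timedelta, timezone

# Typical values quoted in the commit message (seconds).
DEFAULT_TIMEOUTS = {"startup": 180, "shutdown": 180, "reboot": 120}


def set_transition(table, name, transition_type, now=None):
    """Initiator side: mark a transition as in progress (hypothetical helper)."""
    now = now or datetime.now(timezone.utc)
    table[name] = {
        "state_transition_in_progress": "True",
        "transition_type": transition_type,
        "transition_start_time": now.isoformat(),
    }


def clear_transition(table, name):
    """Platform side: clear the in-progress flag once the operation completes."""
    table.setdefault(name, {})["state_transition_in_progress"] = "False"


def is_timed_out(table, name, timeout_secs, now=None):
    """True if an in-progress transition has run longer than timeout_secs."""
    entry = table.get(name, {})
    if entry.get("state_transition_in_progress") != "True":
        return False
    start = entry.get("transition_start_time")
    if not start:
        # Assumption: a flag without a start stamp is treated as stale.
        return True
    now = now or datetime.now(timezone.utc)
    return now - datetime.fromisoformat(start) > timedelta(seconds=timeout_secs)


def should_trigger_gnoi_shutdown(entry):
    """Daemon side: act only on an in-progress shutdown; never clear the flag."""
    return (
        entry.get("state_transition_in_progress") == "True"
        and entry.get("transition_type") == "shutdown"
    )
```

In this sketch a CLI-like caller would check `is_timed_out(table, "DPU0", DEFAULT_TIMEOUTS["shutdown"])` before starting a new operation, clear the stale entry only if that returns true, and then call `set_transition`. The daemon-like consumer only reads the entry, which mirrors the rule that clearing stays with the platform.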
1 parent 1633661 commit ed14cf2

8 files changed: +1047 -2 lines changed


data/debian/rules

Lines changed: 1 addition & 0 deletions
@@ -20,5 +20,6 @@ override_dh_installsystemd:
 	dh_installsystemd --no-start --name=procdockerstatsd
 	dh_installsystemd --no-start --name=determine-reboot-cause
 	dh_installsystemd --no-start --name=process-reboot-cause
+	dh_installsystemd --no-start --name=gnoi-shutdown
 	dh_installsystemd $(HOST_SERVICE_OPTS) --name=sonic-hostservice

Lines changed: 16 additions & 0 deletions
@@ -0,0 +1,16 @@
+[Unit]
+Description=gNOI based DPU Graceful Shutdown Daemon
+Requires=database.service
+Wants=network-online.target
+After=network-online.target database.service
+
+[Service]
+Type=simple
+ExecStartPre=/usr/bin/python3 /usr/local/bin/check_platform.py
+ExecStartPre=/bin/bash /usr/local/bin/wait-for-sonic-core.sh
+ExecStart=/usr/bin/python3 /usr/local/bin/gnoi_shutdown_daemon.py
+Restart=always
+RestartSec=5
+
+[Install]
+WantedBy=multi-user.target

scripts/check_platform.py

Lines changed: 23 additions & 0 deletions
@@ -0,0 +1,23 @@
+#!/usr/bin/env python3
+"""
+Check if the current platform is a SmartSwitch NPU (not DPU).
+Exit 0 if SmartSwitch NPU, exit 1 otherwise.
+"""
+import sys
+
+def main():
+    try:
+        from sonic_py_common import device_info
+        from utilities_common.chassis import is_dpu
+
+        # Check if SmartSwitch NPU (not DPU)
+        if device_info.is_smartswitch() and not is_dpu():
+            sys.exit(0)
+        else:
+            sys.exit(1)
+    except (ImportError, AttributeError, RuntimeError) as e:
+        sys.stderr.write("check_platform failed: {}\n".format(str(e)))
+        sys.exit(1)
+
+if __name__ == "__main__":
+    main()
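
The exit-code rule in `check_platform.py` can be exercised without SONiC libraries by isolating the decision from the platform queries. The sketch below is not part of the commit; `platform_exit_code` is a hypothetical helper that takes the two booleans the script derives from `device_info.is_smartswitch()` and `is_dpu()`.

```python
def platform_exit_code(is_smartswitch, is_dpu):
    """Return 0 only for a SmartSwitch NPU (SmartSwitch and not DPU), else 1."""
    return 0 if (is_smartswitch and not is_dpu) else 1


# Truth table for the four possible platform combinations.
EXIT_CODES = {
    (smartswitch, dpu): platform_exit_code(smartswitch, dpu)
    for smartswitch in (True, False)
    for dpu in (True, False)
}
```

Because the unit file runs the script as an `ExecStartPre` step, only the `(True, False)` case, which yields 0, lets the daemon's `ExecStart` command run.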
