Commit ed14cf2
Module graceful shutdown support (#324)
Provide support for SmartSwitch DPU module graceful shutdown.
# Description:
* **Single source of truth for transitions**
  * All components now use `sonic_platform_base.module_base.ModuleBase` helpers:
    * `set_module_state_transition(db, name, transition_type)`
    * `clear_module_state_transition(db, name)`
    * `get_module_state_transition(db, name) -> dict`
    * `is_module_state_transition_timed_out(db, name, timeout_secs) -> bool`
  * Eliminates duplicated logic and race-prone direct Redis writes.
* **Correct table everywhere**
  * Standardized on **`CHASSIS_MODULE_TABLE`** (replaces `CHASSIS_MODULE_INFO_TABLE`).
  * HLD mismatch addressed in code (HLD fix tracked separately).
* **Ownership & lifecycle**
  * The **initiator** of an operation (`startup`/`shutdown`/`reboot`) sets:
    * `state_transition_in_progress=True`
    * `transition_type=<op>`
    * `transition_start_time=<utc-iso8601>`
  * The **platform** (`set_admin_state()`) is responsible for clearing:
    * `state_transition_in_progress=False`
    * optionally `transition_end_time=<epoch>` (or similar end stamp).
  * CLI pre-clears only when a prior transition is **timed out**.
* **Timeouts & policy**
  * Timeouts are read from the platform JSON path only: `/usr/share/sonic/device/{plat}/platform.json`; otherwise built-in **constants** apply.
  * Typical production values used:
    * `startup: 180s`, `shutdown: 180s` (≈ `graceful_wait 60s + power 120s`), `reboot: 120s`.
  * **Graceful wait** (e.g., waiting for “Graceful shutdown complete”) is a **platform policy** and is implemented inside the platform `set_admin_state()`, not in ModuleBase.
* **Boot behavior**
  * `chassisd` on start:
    1. **Clears stale flags once** (centralized sweep).
    2. Runs `set_initial_dpu_admin_state()` which **marks transitions** via ModuleBase before calling platform `set_admin_state()`.
    3. Leaves clearing to the platform or to well-defined status transitions (ONLINE/OFFLINE) where appropriate.
* **gNOI shutdown daemon**
  * Listens on **`CHASSIS_MODULE_TABLE`** and triggers only when:
    * `state_transition_in_progress=True` **and** `transition_type=shutdown`.
  * Never clears the flag (ownership stays with the platform).
  * Bounded RPC timeouts and robust Redis access (swsssdk/swsscommon).
* **CLI (`config chassis modules …`)**
  * Uses ModuleBase APIs for all set/get/timeout checks.
  * If a previous transition is stuck, `is_module_state_transition_timed_out()` → auto-clear then proceed.
  * Sets transition at the start of `startup`/`shutdown`; platform clears on completion.
  * Fabric card flow retained; edits are surgical.
* **Redis robustness**
  * Helpers handle both stacks (swsssdk/swsscommon); no `hset(mapping=...)` usage.
  * Consistent HGETALL/HSET paths; resilient to connector differences.
* **Race reduction & consistency**
  * Centralized writes prevent multi-writer races.
  * All transition writes include `transition_start_time`; clears may add an end stamp.
  * Existing PCI/file-lock logic left intact; unrelated behavior unchanged.
* **Change scope**
  * Minimal, targeted diffs.
  * No background tasks added, no broad refactors beyond transition handling.
  * Behavior changes are limited to making transition semantics correct and uniform across repos.
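
The ownership rules above (initiator sets, platform clears, CLI pre-clears only on timeout, gNOI daemon only reads) can be illustrated with a small in-memory model. This is a sketch under stated assumptions, not the ModuleBase implementation: `FakeDb`, the shortened function names, the `CHASSIS_MODULE_TABLE|<name>` key format, and the choice to treat a flagged entry without a start time as stale are all hypothetical. Only the field names and the timeout values come from the description.

```python
from datetime import datetime, timedelta, timezone

# Fallback timeouts in seconds, from the "typical production values" listed above.
DEFAULT_TIMEOUTS = {"startup": 180, "shutdown": 180, "reboot": 120}


class FakeDb:
    """Hypothetical in-memory stand-in for a Redis connector (not a SONiC class)."""

    def __init__(self):
        self._hashes = {}

    def hset(self, key, field, value):
        # One field per call, mirroring the note about avoiding hset(mapping=...).
        self._hashes.setdefault(key, {})[field] = str(value)

    def hgetall(self, key):
        return dict(self._hashes.get(key, {}))


def _key(name):
    # Assumed key layout; the real separator depends on the database schema.
    return f"CHASSIS_MODULE_TABLE|{name}"


def set_transition(db, name, transition_type, now=None):
    """Initiator side: mark a transition as in progress with a UTC start time."""
    now = now or datetime.now(timezone.utc)
    db.hset(_key(name), "state_transition_in_progress", "True")
    db.hset(_key(name), "transition_type", transition_type)
    db.hset(_key(name), "transition_start_time", now.isoformat())


def clear_transition(db, name):
    """Platform side: clear the flag once the admin-state change has finished."""
    db.hset(_key(name), "state_transition_in_progress", "False")


def is_timed_out(db, name, timeout_secs, now=None):
    """True if a transition is still flagged and older than timeout_secs."""
    entry = db.hgetall(_key(name))
    if entry.get("state_transition_in_progress") != "True":
        return False
    start = entry.get("transition_start_time")
    if not start:
        return True  # assumption: flagged but unstamped entries count as stale
    now = now or datetime.now(timezone.utc)
    return (now - datetime.fromisoformat(start)).total_seconds() > timeout_secs


def should_trigger_gnoi_shutdown(entry):
    """gNOI daemon predicate: flag is set and the transition type is shutdown."""
    return (entry.get("state_transition_in_progress") == "True"
            and entry.get("transition_type") == "shutdown")


# Example: a shutdown that is still flagged after 181 s is pre-cleared by the CLI path.
db = FakeDb()
t0 = datetime(2024, 1, 1, tzinfo=timezone.utc)
set_transition(db, "DPU0", "shutdown", now=t0)
later = t0 + timedelta(seconds=181)
if is_timed_out(db, "DPU0", DEFAULT_TIMEOUTS["shutdown"], now=later):
    clear_transition(db, "DPU0")
```

Passing `now` explicitly keeps the timeout check deterministic in tests; the gNOI predicate only reads the entry and never writes, which matches the rule that the daemon never clears the flag.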
HLD: sonic-net/SONiC#1991
sonic-platform-common: sonic-net/sonic-platform-common#567
sonic-utilities: sonic-net/sonic-utilities#4031
sonic-platform-daemons: sonic-net/sonic-platform-daemons#667
How to verify it
Issue the `config chassis modules shutdown DPUx` command.
Verify that the DPU module shuts down gracefully by checking the logs in /var/log/syslog on both the NPU and the DPU.
1 parent 1633661 commit ed14cf2
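
The syslog check in the verification steps could be scripted roughly as follows. This is a minimal sketch: the helper names are hypothetical, and the phrase to search for is supplied by the caller because the exact log text is platform-specific and not given in this description.

```python
import re


def matching_lines(lines, module, phrase):
    """Return log lines that mention the module name and the phrase (case-insensitive)."""
    # Word boundaries keep "DPU1" from matching "DPU10".
    module_re = re.compile(rf"\b{re.escape(module)}\b", re.IGNORECASE)
    needle = phrase.lower()
    return [line.rstrip("\n") for line in lines
            if module_re.search(line) and needle in line.lower()]


def scan_syslog(module, phrase, path="/var/log/syslog"):
    """Scan a syslog file on the NPU or DPU for lines about one module."""
    with open(path, errors="replace") as handle:
        return matching_lines(handle, module, phrase)
```

For example, `scan_syslog("DPU0", "graceful shutdown")` would list candidate lines on the host where it runs; the phrase here is only an assumption about what the platform logs.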
File tree: 8 files changed (+1047, -2 lines changed) under data/debian, scripts, and tests.