NPU Driven hamgrd infrastructure #145

Open

BYGX-wcr wants to merge 58 commits into sonic-net:master from BYGX-wcr:npu-driven-hamgrd-infra

Conversation

@BYGX-wcr

What I did

  1. Added the basic infrastructure for NPU-driven hamgrd
  2. Implemented the state machine for NPU-driven HA

Why I did it

To enable NPU-driven HA.

How I verified it

Details if related

Signed-off-by: BYGX-wcr <wcr@live.cn>
@mssonicbld

/azp run

@azure-pipelines

Azure Pipelines will not run the associated pipelines, because the pull request was updated after the run command was issued. Review the pull request again and issue a new run command.

@mssonicbld

/azp run

@mssonicbld

/azp run

@BYGX-wcr BYGX-wcr force-pushed the npu-driven-hamgrd-infra branch from 12b68c9 to 2282938 Compare February 20, 2026 19:30
@BYGX-wcr
Author

/azpw run

@mssonicbld

/AzurePipelines run

@mssonicbld

/azp run

@mssonicbld

/azp run

@mssonicbld

/azp run

@BYGX-wcr BYGX-wcr marked this pull request as ready for review February 24, 2026 21:59
Copilot AI review requested due to automatic review settings February 24, 2026 21:59

Copilot AI left a comment


Pull request overview

This PR implements NPU-driven High Availability (HA) infrastructure for SONiC SmartSwitch DASH deployments, adding a complete state machine and supporting components alongside the existing DPU-driven HA implementation.

Changes:

  • Adds delayed message capability to swbus-actor for scheduled peer communication
  • Implements comprehensive NPU-driven HA state machine with peer voting, heartbeat, and bulk sync protocols
  • Refactors HaScopeActor into an enum supporting both DPU and NPU-driven modes with shared base functionality
  • Adds new database structures for flow sync sessions and HA set state tracking
  • Updates HaSetActor to handle NPU-driven mode and route updates based on HA owner

Reviewed changes

Copilot reviewed 11 out of 11 changed files in this pull request and generated 13 comments.

Summary per file:

  • crates/swbus-actor/src/state/outgoing.rs — Adds a send_with_delay method for scheduling messages; contains a critical bug in the delay logic
  • crates/hamgrd/src/ha_actor_messages.rs — Defines new message types for the NPU HA protocol (PeerHeartbeat, VoteRequest/Reply, BulkSyncUpdate, HAStateChanged, SelfNotification); adds vdpu_ids and owner fields to existing messages
  • crates/hamgrd/src/db_structs.rs — Adds DashFlowSyncSessionTable, DashFlowSyncSessionState, DpuDashHaSetState; adds ha_term and peer state tracking to NpuDashHaScopeState
  • crates/hamgrd/src/actors/test.rs — Extends test macros to support excluding fields from comparison for dynamic values like timestamps
  • crates/hamgrd/src/actors/ha_set.rs — Adds dp_channel_is_alive and ha_owner tracking; conditionally updates VNET routes based on owner; handles HaScopeActorState updates for NPU mode
  • crates/hamgrd/src/actors/ha_scope/npu.rs — Implements the complete NPU-driven HA state machine with peer communication, voting protocol, and state transitions
  • crates/hamgrd/src/actors/ha_scope/mod.rs — Refactors HaScopeActor into an enum with Dpu/Npu variants; adds comprehensive tests for both modes
  • crates/hamgrd/src/actors/ha_scope/dpu.rs — Extracts the DPU-driven implementation from the monolithic actor
  • crates/hamgrd/src/actors/ha_scope/base.rs — Shared base functionality for both DPU and NPU variants
  • crates/hamgrd/src/main.rs — Adds a producer bridge for DASH_FLOW_SYNC_SESSION_TABLE

@BYGX-wcr BYGX-wcr changed the title NPU Driven hamgrd NPU Driven hamgrd infrastructure Feb 24, 2026
@mssonicbld

/azp run

Copilot AI review requested due to automatic review settings February 25, 2026 18:09

Copilot AI left a comment


Pull request overview

Copilot reviewed 11 out of 11 changed files in this pull request and generated 7 comments.

Comments suppressed due to low confidence (1)

crates/swbus-actor/src/state/outgoing.rs:88

  • In send_queued_messages, delayed messages keep their original time_sent (scheduled time) even when they are actually sent. Since time_sent is later used by the resend/retention logic (get_elapsed_time(&msg.time_sent)), this can cause immediate resends (or premature dropping) if the actor didn't process the queue until long after the scheduled send time. Consider updating msg.time_sent = SystemTime::now() right before the first successful send_raw so resend timing is based on the real send time.
        let mut delayed_messages: Vec<UnackedMessage> = Vec::new();
        for msg in self.queued_messages.drain(..) {
            if SystemTime::now() < msg.time_sent {
                // The message hasn't reached its sending time yet
                delayed_messages.push(msg);
                continue;
            }

            debug!("Sending message: {msg:?}");
            self.swbus_client
                .send_raw(msg.swbus_message.clone())
                .await
                .expect("Sending swbus message failed");

            let id = msg.swbus_message.header.as_ref().unwrap().id;

ActorMessage::new(
    Self::msg_key(my_id),
    &Self {
        up: true,

Copilot AI Feb 25, 2026


HaSetActorState::new_actor_msg ignores its up parameter and always serializes up: true. Callers now pass self.dp_channel_is_alive as up, but the value will never be reflected in messages. Either use the up argument when building Self, or remove the parameter and keep the field constant (and adjust call sites accordingly).

Suggested change
-        up: true,
+        up,

Comment on lines +117 to +126
pub fn decode_hascope_actor_message<T>(&self, incoming: &Incoming, key: &str) -> Option<T>
where
    T: DeserializeOwned,
{
    let msg = incoming.get(key)?;
    match msg.deserialize_data() {
        Ok(data) => Some(data),
        Err(e) => {
            error!("Failed to deserialize VoteReply from message: {}", e);
            None

Copilot AI Feb 25, 2026


decode_hascope_actor_message is generic over message types, but the error log always says "Failed to deserialize VoteReply". This is misleading when decoding other messages (e.g., HaScopeActorState, VoteRequest, etc.). Consider making the log message generic or including the target type/key in the log.

Comment on lines +19 to +27
pub struct NpuHaScopeActor {
    pub(super) base: HaScopeBase,
    /// Target state that HAmgrd should transition to upon HA events
    pub(super) target_ha_scope_state: Option<TargetState>,
    /// Retry count used for voting
    pub(super) retry_count: u32,
    /// Is peer connected?
    pub(super) peer_connected: bool,
}

Copilot AI Feb 25, 2026


retry_count is documented as "used for voting", but it's also used by the peer-connection retry loop (CheckPeerConnection). Sharing a single counter across independent retry workflows can cause incorrect behavior (e.g., vote retries affecting peer connection retries, or vice versa). Consider splitting this into separate counters (e.g., vote_retry_count and peer_connect_retry_count).

Comment on lines +1097 to +1103
let bulk_sync_session = DashFlowSyncSessionTable {
    ha_set_id: ha_set_id.clone(),
    target_server_ip: haset.ha_set.peer_ip.clone(),
    target_server_port: haset
        .ha_set
        .cp_data_channel_port
        .expect("cp_data_channel_port must be configured"),

Copilot AI Feb 25, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add_bulk_sync_session uses expect("cp_data_channel_port must be configured") on an optional config field. If cp_data_channel_port is missing, hamgrd will panic and crash. Prefer returning an error (or skipping bulk sync) so the actor can continue running and surface a clear error/log message.

Suggested change
-    let bulk_sync_session = DashFlowSyncSessionTable {
-        ha_set_id: ha_set_id.clone(),
-        target_server_ip: haset.ha_set.peer_ip.clone(),
-        target_server_port: haset
-            .ha_set
-            .cp_data_channel_port
-            .expect("cp_data_channel_port must be configured"),
+    let Some(cp_data_channel_port) = haset.ha_set.cp_data_channel_port else {
+        warn!(
+            "HA-SET {} missing cp_data_channel_port. Cannot start bulk sync session!",
+            &ha_set_id
+        );
+        return Ok(None);
+    };
+    let bulk_sync_session = DashFlowSyncSessionTable {
+        ha_set_id: ha_set_id.clone(),
+        target_server_ip: haset.ha_set.peer_ip.clone(),
+        target_server_port: cp_data_channel_port,

Comment on lines +43 to +60
if key == HaScopeActor::table_name() {
    match self.handle_dash_ha_scope_config_table_message_npu_driven(state, key, context) {
        Ok(incoming_event) => {
            event = Some(incoming_event);
        }
        Err(_e) => {
            error!("Error when processing HA Scope Config Table Update!")
        }
    }
} else if key.starts_with(DpuDashHaScopeState::table_name()) {
    // Update NPU ha scope state based on dpu ha scope state update
    match self.handle_dpu_ha_scope_state_update_npu_driven(state) {
        Ok(incoming_event) => {
            event = Some(incoming_event);
        }
        Err(_e) => {
            error!("Error when processing DPU HA Scope State Update!")
        }

Copilot AI Feb 25, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Several Err(_e) branches in handle_message_inner discard the underlying error value, which makes debugging production failures much harder. Consider logging the actual error (and ideally the key) in these branches, or propagating the error upward when appropriate.

@BYGX-wcr
Author

/azpw run

@mssonicbld

/AzurePipelines run


4 participants