Description
Is it platform specific
generic
Importance or Severity
Critical
Description of the bug
First issue:
During chassisd initialization and startup, if DPUs need to be powered off, chassisd crashes. Multiple child threads try to power off DPUs at the same time, but they all write the transition flag through the same dbconnector. The crash is about 90% reproducible and produces the following error log:
2025 Aug 6 00:40:36.571914 sonic ERR pmon#chassisd: :- checkReplyType: Expected to get redis type 3 got type 2, err: NON-STRING-REPLY
2025 Aug 6 00:40:36.571914 sonic ERR pmon#chassisd: :- checkReplyType: Expected to get redis type 1756728512 got type 3, err: NON-STRING-REPLY
This is caused by the use of the same db connector in multiple child threads, as seen here:
https://github.com/sonic-net/sonic-platform-daemons/blob/7b347c67f2d1a97b983fd8cd522710fa064147be/sonic-chassisd/scripts/chassisd#L712
https://github.com/sonic-net/sonic-platform-daemons/blob/7b347c67f2d1a97b983fd8cd522710fa064147be/sonic-chassisd/scripts/chassisd#L714
https://github.com/sonic-net/sonic-platform-daemons/blob/7b347c67f2d1a97b983fd8cd522710fa064147be/sonic-chassisd/scripts/chassisd#L1434
A similar issue was also reported in xcvrd:
#10530
Impact:
chassisd crash
Second Issue:
During the merge of:
sonic-net/sonic-platform-daemons#607
One of the set_transition_flag function calls was removed:
Commit where change was present:

https://github.com/sonic-net/sonic-platform-daemons/pull/607/files/eaa33bcc5533ca97e7c0f204058a8f4477286cfd
Final commit:

https://github.com/sonic-net/sonic-platform-daemons/pull/607/files
Impact:
chassisd does not add the transition flag (consecutive admin state changes are possible).
Once this is added back, there is still the issue of multiple threads using the same db connector to read / write data, causing the same crash as above (70% reproducibility).
Third Issue:
One of the clear transition flag calls was removed as part of this PR:
sonic-net/sonic-platform-daemons#631

Impact:
On transition from offline to online, we do not clear the transition flag
Fourth Issue:
Due to the if conditions present in the chassisd module updater on DPU, transition flags are not added in some cases:
- Config reload
- System Reboot and reinitialization
Fifth Issue:
delete_entry does not exist on the DBConnector object being used:
https://github.com/sonic-net/sonic-utilities/blob/3e3daf369f9ba4a99bc183e403717bae18a19120/config/chassis_modules.py#L78
Sixth Issue:
The is_transition_timed_out function returns false when it fails to get the value. This leads to cases where config commands cannot be executed if there is any issue with the transition timeout flag.
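A minimal sketch of one way to avoid this lock-out: treat a missing or unreadable start time as already timed out, so a broken flag cannot block config commands forever. This is not the sonic-utilities implementation; the signature, the ISO 8601 input format, and the fail-open policy are all assumptions made for illustration.

```python
from datetime import datetime, timedelta, timezone


def is_transition_timed_out(start_time_str, timeout_seconds, now=None):
    """Return True if the transition started more than timeout_seconds ago.

    Assumption: a missing or unparseable start time counts as timed out.
    """
    now = now or datetime.now(timezone.utc)
    if not start_time_str:
        return True  # flag value missing: do not block the caller
    try:
        start = datetime.fromisoformat(start_time_str)
    except ValueError:
        return True  # flag value unreadable: do not block the caller
    if start.tzinfo is None:
        # Assumption: naive timestamps are in UTC.
        start = start.replace(tzinfo=timezone.utc)
    return now - start > timedelta(seconds=timeout_seconds)
```

Whether failing open is the right policy is a design decision; the opposite choice keeps the module locked until the flag is repaired by hand.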
Seventh Issue:
If config_db entries are updated to change the state of a DPU, this is not handled: we do not see transition flags being set for such changes
Eighth Issue:
The format of transition_start_time differs between chassisd and sonic-utilities:
https://github.com/sonic-net/sonic-platform-daemons/blob/7b347c67f2d1a97b983fd8cd522710fa064147be/sonic-chassisd/scripts/chassisd#L1422
and
https://github.com/sonic-net/sonic-utilities/blob/1418f218825484551b1f8893ff836d420f0a6135/config/chassis_modules.py#L75
This causes a failure to read the time during config chassis calls
These issues are only being seen now because of a previous issue that was causing chassisd to crash:
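Until both sides agree on a single format, a reader could tolerate more than one. The sketch below is only an illustration: the formats in `KNOWN_FORMATS` are hypothetical examples, not the formats actually written by chassisd or sonic-utilities, and `parse_transition_start_time` is a made-up helper name.

```python
from datetime import datetime

# Hypothetical candidate formats; replace with the ones really in use.
KNOWN_FORMATS = (
    "%Y-%m-%dT%H:%M:%S.%f",
    "%Y-%m-%dT%H:%M:%S",
    "%Y-%m-%d %H:%M:%S",
)


def parse_transition_start_time(value):
    """Try each known format in turn; return None if none of them match."""
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(value, fmt)
        except (ValueError, TypeError):
            continue
    return None
```

The cleaner long-term fix is probably to have the writer and the reader share one format definition, so that this kind of fallback is not needed.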
#22430
Steps to Reproduce
Two ways to reproduce:
- With no config_db entries to power on the DPUs and the DPUs powered on, reboot the switch; chassisd crashes.
- In light mode, after adding the changes relevant to setting the transition flag, power off all DPUs in parallel using a bash script; chassisd crashes again:
config chassis modules startup DPUx &
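The same parallel launch can be sketched in Python. The command line is the one quoted above; the helper names, the `DPU0..DPUn-1` naming, and the DPU count are assumptions, and `run_in_parallel` is meant to be run on the switch itself.

```python
import subprocess


def build_commands(num_dpus, action="startup"):
    """Build one 'config chassis modules <action> DPUx' command per DPU."""
    return [
        ["config", "chassis", "modules", action, f"DPU{i}"]
        for i in range(num_dpus)
    ]


def run_in_parallel(commands):
    """Start all commands at once, then wait and collect the exit codes."""
    procs = [subprocess.Popen(cmd) for cmd in commands]
    return [p.wait() for p in procs]
```

For example, `run_in_parallel(build_commands(4))` would start four commands at roughly the same time, which is the condition that triggers the crash.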
Actual Behavior and Expected Behavior
Actual behaviour:
chassisd crash, error logs
Expected behaviour:
No error logs, no chassisd crash
Relevant log output
Output of show version, show techsupport
202506 image, and latest master
Hash: 1fd32735e95ba5ba65027945a2f4ecd170794f97
With one additional PR: https://github.com/sonic-net/sonic-platform-daemons/pull/645
Attach files (if any)
No response