Bug: [Chassisd][Smartswitch] Issues with dbconnector in chassisd and merge conflict resolves #23602

@gpunathilell

Is it platform specific

generic

Importance or Severity

Critical

Description of the bug

First issue:
During chassisd initialization and startup, if DPUs need to be powered off, multiple child threads try to power them off at the same time. All of these threads write the transition flag through the same dbconnector, and chassisd crashes intermittently (about 90% reproducible) with the following error log:

2025 Aug  6 00:40:36.571914 sonic ERR pmon#chassisd: :- checkReplyType: Expected to get redis type 3 got type 2, err: NON-STRING-REPLY
2025 Aug  6 00:40:36.571914 sonic ERR pmon#chassisd: :- checkReplyType: Expected to get redis type 1756728512 got type 3, err: NON-STRING-REPLY

This happens because the same db connector is used by multiple child threads, as seen here:
https://github.com/sonic-net/sonic-platform-daemons/blob/7b347c67f2d1a97b983fd8cd522710fa064147be/sonic-chassisd/scripts/chassisd#L712
https://github.com/sonic-net/sonic-platform-daemons/blob/7b347c67f2d1a97b983fd8cd522710fa064147be/sonic-chassisd/scripts/chassisd#L714
https://github.com/sonic-net/sonic-platform-daemons/blob/7b347c67f2d1a97b983fd8cd522710fa064147be/sonic-chassisd/scripts/chassisd#L1434

A similar issue was also reported in xcvrd:
#10530
Impact:
chassisd crash
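
One possible way to avoid sharing a connection is to give each worker thread its own connector. The sketch below is only an illustration of that idea: `FakeConnector`, `get_connector`, `power_off_all`, and the key and field names are hypothetical stand-ins, not the swsscommon API or the actual chassisd code.

```python
import threading


class FakeConnector:
    """Hypothetical stand-in for a Redis DB connector (not the swsscommon API)."""

    def __init__(self):
        self.owner = threading.get_ident()
        self.writes = []

    def hset(self, key, field, value):
        # A real connection shared across threads can interleave replies;
        # here we simply assert that only the creating thread uses it.
        assert threading.get_ident() == self.owner, "connector shared across threads"
        self.writes.append((key, field, value))


_local = threading.local()


def get_connector(factory=FakeConnector):
    """Return a connector owned by the calling thread, creating it on first use."""
    conn = getattr(_local, "conn", None)
    if conn is None:
        conn = factory()
        _local.conn = conn
    return conn


def set_transition_flag(module_name):
    # Illustrative key and field names; each thread writes via its own connector.
    conn = get_connector()
    conn.hset("CHASSIS_MODULE_TABLE|" + module_name,
              "state_transition_in_progress", "True")
    return conn


def power_off_all(modules):
    """Run one worker per module and record the connector each worker used."""
    used = {}

    def worker(name):
        used[name] = set_transition_flag(name)

    threads = [threading.Thread(target=worker, args=(m,)) for m in modules]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return used
```

Calling `power_off_all(["DPU0", "DPU1", "DPU2"])` yields three distinct connector objects, one per thread. Guarding a single shared connector with a lock would be an alternative with the same intent.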

Second Issue:
During merge of:
sonic-net/sonic-platform-daemons#607
One of the set_transition_flag functions was removed:
Commit where change was present:
https://github.com/sonic-net/sonic-platform-daemons/pull/607/files/eaa33bcc5533ca97e7c0f204058a8f4477286cfd
Final commit:
https://github.com/sonic-net/sonic-platform-daemons/pull/607/files
Impact:
chassisd does not add the transition flag (consecutive admin state changes are possible).
Once this is added back, multiple threads still use the same db connector to read and write data, which causes the same crash as above (70% reproducibility).

Third Issue:
One of the calls that clears the transition flag was removed as part of this PR:
sonic-net/sonic-platform-daemons#631
Impact:
On a transition from offline to online, the transition flag is not cleared.

Fourth Issue:
Because of the if conditions in the chassisd module updater on the DPU, the transition flag is not set in the following cases:

  • Config reload
  • System Reboot and reinitialization

Fifth Issue:
delete_entry does not exist on the DBConnector object that is used here:
https://github.com/sonic-net/sonic-utilities/blob/3e3daf369f9ba4a99bc183e403717bae18a19120/config/chassis_modules.py#L78

Sixth Issue:
The is_transition_timed_out function returns False when it fails to get the value. If anything is wrong with the transition timeout flag, the config commands can no longer be executed.
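
One way to avoid getting stuck is to treat a missing or unreadable start time as already timed out. The following is a minimal sketch under that assumption; the signature, the ISO 8601 input format, and the `now` parameter are illustrative and not taken from sonic-utilities.

```python
from datetime import datetime, timedelta, timezone


def is_transition_timed_out(start_time_str, timeout_seconds, now=None):
    """Return True when the transition is stale OR its start time is unusable."""
    now = now or datetime.now(timezone.utc)
    if not start_time_str:
        # Missing value: do not block the operator indefinitely.
        return True
    try:
        # Assumption: the start time is stored as an ISO 8601 string.
        started = datetime.fromisoformat(start_time_str)
    except (ValueError, TypeError):
        # Unparseable value: treat it the same as a missing one.
        return True
    if started.tzinfo is None:
        # Assumption: naive timestamps are in UTC.
        started = started.replace(tzinfo=timezone.utc)
    return now - started > timedelta(seconds=timeout_seconds)
```

With this behaviour a corrupted flag delays nothing, at the cost of possibly allowing a command while a transition is really in progress; whether that trade-off is acceptable is a design decision for the maintainers.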

Seventh Issue:
When config_db entries are updated to change the state of a DPU, the change is not handled: no transition flag is set for it.

Eighth Issue:
The format of transition_start_time differs between chassisd and sonic-utilities:
https://github.com/sonic-net/sonic-platform-daemons/blob/7b347c67f2d1a97b983fd8cd522710fa064147be/sonic-chassisd/scripts/chassisd#L1422
and
https://github.com/sonic-net/sonic-utilities/blob/1418f218825484551b1f8893ff836d420f0a6135/config/chassis_modules.py#L75
This causes a failure when reading the time during config chassis calls.
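
Until both writers agree on one format, a reader could try each known format in turn. The sketch below illustrates this; the format strings in `_FORMATS` and the function name are assumptions for illustration, and the strings actually written by chassisd and sonic-utilities should be checked at the links above.

```python
from datetime import datetime, timezone

# Assumed candidate formats; the real ones may differ.
_FORMATS = (
    "%Y-%m-%dT%H:%M:%S.%f",
    "%Y-%m-%dT%H:%M:%S",
    "%a %b %d %I:%M:%S %p UTC %Y",
)


def parse_transition_start_time(value):
    """Return a UTC datetime for the first matching format, or None."""
    for fmt in _FORMATS:
        try:
            return datetime.strptime(value, fmt).replace(tzinfo=timezone.utc)
        except (ValueError, TypeError):
            # Wrong format or non-string input: try the next candidate.
            continue
    return None
```

A simpler long-term fix would be for both components to write and read a single shared format.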

These issues were only seen now because a previous issue, which was causing chassisd to crash, had been masking them:
#22430

Steps to Reproduce

There are two ways to reproduce:

  • With no config_db entries to power on the DPUs and with the DPUs powered on, reboot the switch. chassisd crashes.
  • In light mode, after adding the changes relevant to setting the transition flag, power off all DPUs in parallel using a bash script. chassisd crashes again:

config chassis modules startup DPUx &

Actual Behavior and Expected Behavior

Actual behaviour:
chassisd crash, error logs
Expected behaviour:
No error logs, no chassisd crash

Relevant log output

Output of show version, show techsupport

202506 image, and latest master
Hash: 1fd32735e95ba5ba65027945a2f4ecd170794f97
With one additional PR: https://github.com/sonic-net/sonic-platform-daemons/pull/645

Attach files (if any)

No response
