-
Notifications
You must be signed in to change notification settings - Fork 648
Connect to teamd before adding the lag to STATE_DB #3984
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Connect to teamd before adding the lag to STATE_DB #3984
Conversation
Signed-off-by: Saikrishna Arcot <[email protected]>
|
/azp run |
1 similar comment
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
1 similar comment
|
Azure Pipelines successfully started running 1 pipeline(s). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This PR addresses a race condition where teamd might create a port channel interface and then exit due to failure before fully initializing, causing dependent applications to think configuration is complete when it's not. The fix adds a verification step to connect to teamd and verify it's running before adding the LAG entry to STATE_DB.
Key changes:
- Added teamdctl connection verification before STATE_DB entry creation
- Included libteamdctl library dependency for daemon communication
- Reorganized error logging to occur before retry check
Reviewed Changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| teamsyncd/teamsync.h | Added teamdctl.h header for teamd daemon communication |
| teamsyncd/teamsync.cpp | Added teamdctl connection and verification logic to ensure teamd is responsive before STATE_DB updates |
| teamsyncd/Makefile.am | Added -lteamdctl library dependency to link against libteamdctl |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| #include <team.h> | ||
| #include <teamdctl.h> |
Copilot
AI
Nov 10, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maintainability: These includes are already present in teamsync.h (lines 12-13), which is included on line 13. The duplicate includes are redundant and should be removed.
| #include <team.h> | |
| #include <teamdctl.h> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a stylistic choice to explicitly include what I'm using in this file, so that if the header files change, this file doesn't require changing.
| } | ||
|
|
||
| char *response; | ||
| err = teamdctl_config_get_raw_direct(m_teamdctl, &response); |
Copilot
AI
Nov 10, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Bug: Memory leak - the response pointer returned by teamdctl_config_get_raw_direct is not freed. According to libteamdctl documentation, the response needs to be freed using teamdctl_config_get_raw_direct_free(response) or free(response) to prevent memory leaks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Signed-off-by: Saikrishna Arcot <[email protected]>
…5/sonic-swss into fix-teamdsyncd-race-condition
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
What I did
It's possible for teamd to start up, create the port channel interface in the kernel, and then later exit because some failure condition was hit, or initialization took too long (and so the parent process killed it), or something else. When this happens, teamsyncd will get a notification from the kernel saying that the port channel interface has been created, and will start to add it into STATE_DB. If this happens before teamd goes down (and the port channel interface gets removed from the kernel), then anything depending on that STATE_DB entry will begin its processing, not realizing that that interface will get removed. This will result in dependent applications thinking everything has been processed and configs have been applied, but they haven't really been applied. (In the case of LAGs, this will be intfmgrd adding the IP address to the interface.)
This is functionally a race condition between teamd creating and then deleting the interface (due to a failure condition), teamsyncd acting too fast, and dependent applications assuming all setup is complete. This race condition is more visible on weaker systems.
Therefore, to try to prevent it, before adding an entry in STATE_DB, make sure that teamsyncd can get information from the kernel about the port channel interface, and then directly connect to teamd and make sure that the teamd is processing requests. If both of these succeed, then it can be assumed that all setup is done, and that teamd won't be exiting soon.
Why I did it
How I verified it
Perform several config reloads on a weaker hardware to make sure there are no cases of the LAG entry being added to STATE_DB, and then having teamd exit afterwards.
Details if related