fix: instance rejoin against legacy join server #64599
Open
Fixes #64598

This commit fixes an issue with Instance identity "self-healing". When a node starts up with additional enabled services that have no corresponding role in the current Instance identity, it attempts a new cluster join request with the currently configured join token to get an Instance identity with all required roles.

There's currently a bug if:

- the node is new enough to attempt joining via the new join service and fall back to the legacy join service
- the Auth is old enough that it doesn't support the new join service
- the token used for the rejoin includes all of the original instance cert roles and the newly required roles

The bug exists because `state.IdentityID.HostUUID` is overloaded. When originally joining, it is expected to be populated with a plain UUID. When it is parsed from an existing certificate, it includes a suffix of `.<clustername>`. So when the rejoin attempt passes in the current `IdentityID` from the existing identity, it already includes the clustername suffix. Auth then adds the suffix again during the rejoin, and you end up with a double suffix.

This doesn't affect joining via the new join service because the node doesn't explicitly pass its desired HostUUID at all; Auth extracts it from the authenticated identity used for the rejoin request.

The fix is to call `state.IdentityID.HostID()`, which strips the clustername suffix. This will fix new 18.7.x+ nodes rejoining to older 18.2.10- auth servers.

Added test coverage for four cases:

1. rejoining with new join service with token including all required roles
2. rejoining with new join service with token including only the newly required role
3. rejoining with legacy join service with token including all required roles
4. rejoining with legacy join service with token including only the newly required role

The bug currently affects case 3; all others still work.

changelog: fixed a bug affecting nodes on v18.3.0+ rejoining with new system roles to clusters with Auth services on v18.2.10-
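For reviewers who want to see the double suffix concretely, here is a minimal, self-contained sketch of the failure mode described above. It is not the actual code from this PR: `identityID`, `hostID`, and `legacyJoin` are hypothetical stand-ins, and the sketch assumes the legacy join service unconditionally appends `.<clustername>` to whatever host UUID it receives.

```go
package main

import (
	"fmt"
	"strings"
)

// identityID is a hypothetical stand-in for the identity described above.
// HostUUID is "overloaded": a plain UUID on first join, or
// "<uuid>.<clustername>" when parsed from an existing certificate.
type identityID struct {
	HostUUID string
}

// hostID returns HostUUID with any ".<clustername>" suffix removed.
// Assumption: the UUID itself never contains a dot, so cutting at the
// first dot is enough even when the cluster name contains dots.
func (id identityID) hostID() string {
	host, _, _ := strings.Cut(id.HostUUID, ".")
	return host
}

// legacyJoin models (as an assumption) a legacy join service that always
// appends the cluster name to the host UUID supplied by the node.
func legacyJoin(requestedHostUUID, clusterName string) string {
	return requestedHostUUID + "." + clusterName
}

func main() {
	const cluster = "example.com"

	// Identity as parsed from an existing certificate: suffix already present.
	existing := identityID{HostUUID: "11111111-2222-3333-4444-555555555555." + cluster}

	// Before the fix: the suffixed value is passed through and suffixed again.
	fmt.Println(legacyJoin(existing.HostUUID, cluster))

	// After the fix: the suffix is stripped before the rejoin request.
	fmt.Println(legacyJoin(existing.hostID(), cluster))
}
```

The first line printed carries the cluster name twice, and the second carries it once, which is the difference between passing `HostUUID` and passing the stripped host ID in the legacy rejoin path.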
Manual Test Plan
Test Environment
Cluster with Auth server running v18.2.10 or older.
Node with this fix that has already joined with the Node (ssh) role only.
Test Cases