fix: instance rejoin against legacy join server #64599

Open
nklaassen wants to merge 1 commit into master from nklaassen/fix-legacyrejoin

@nklaassen nklaassen commented Mar 13, 2026

Fixes #64598

This commit fixes an issue with Instance identity "self-healing". When starting up with additional enabled services which have no corresponding role in the current Instance identity, the node attempts a new cluster join request with the currently configured join token to get an Instance identity with all required roles.

The bug is triggered when all of the following hold:

  • the node is new enough to attempt joining via the new join service and fall back to the legacy join service
  • the Auth server is old enough that it doesn't support the new join service
  • the token used for the rejoin includes all of the original instance cert roles and the newly required roles

The bug exists because state.IdentityID.HostUUID is overloaded. On the original join it is expected to hold a plain UUID, but when it is parsed from an existing certificate it carries a .<clustername> suffix. The rejoin attempt passes in the current IdentityID from the existing identity, suffix included, and Auth then adds the suffix again, so you end up with a double suffix.

This doesn't affect joining via the new join service because the node doesn't explicitly pass its desired HostUUID at all; Auth extracts it from the authenticated identity used for the rejoin request.

The fix is to call state.IdentityID.HostID(), which strips the clustername suffix. This will fix new 18.7.x+ nodes rejoining to older 18.2.10- auth servers.
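
For reviewers who want the double suffix spelled out, here is a minimal sketch. It is not the code in this PR: `identityID`, `hostID`, and `legacyAuthHostID` are hypothetical stand-ins, and the way the legacy Auth server appends the cluster name is an assumption based on the description above.

```go
package main

import "strings"

// identityID is a hypothetical stand-in for state.IdentityID; only the
// field relevant to this bug is modeled.
type identityID struct {
	// HostUUID is a plain UUID on the first join, but has the form
	// "<uuid>.<clustername>" when parsed from an existing certificate.
	HostUUID string
}

// hostID returns the part of HostUUID before the first dot, which is the
// UUID without any clustername suffix. A plain UUID is returned unchanged.
func (id identityID) hostID() string {
	uuid, _, _ := strings.Cut(id.HostUUID, ".")
	return uuid
}

// legacyAuthHostID mimics, by assumption, what a legacy Auth server does
// with the host UUID sent by the node: it appends the cluster name.
func legacyAuthHostID(requestedHostUUID, clusterName string) string {
	return requestedHostUUID + "." + clusterName
}
```

With `HostUUID` set to `1234.example.com`, passing the field straight through to `legacyAuthHostID` gives `1234.example.com.example.com`, while passing `hostID()` gives `1234.example.com`.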

Added test coverage for four cases:

  1. rejoining with new join service with token including all required roles
  2. rejoining with new join service with token including only the newly required role
  3. rejoining with legacy join service with token including all required roles
  4. rejoining with legacy join service with token including only the newly required role

The bug currently affects only case 3; all others still work.
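
The four cases form a two-by-two matrix, which can be written as a table. This is only an illustration of the matrix and of which cell is affected according to this description; `rejoinCase`, `rejoinCases`, and `affectedBeforeFix` are hypothetical names and are not the test code added in this PR.

```go
package main

// rejoinCase describes one cell of the rejoin test matrix.
type rejoinCase struct {
	name string
	// legacyJoin is true when the rejoin goes through the legacy join
	// service, false when it goes through the new join service.
	legacyJoin bool
	// tokenAllRoles is true when the token includes the original roles
	// plus the newly required role, false when it includes only the
	// newly required role.
	tokenAllRoles bool
}

// rejoinCases lists the four cases in the same order as the description.
func rejoinCases() []rejoinCase {
	return []rejoinCase{
		{name: "new join service, all required roles", legacyJoin: false, tokenAllRoles: true},
		{name: "new join service, only newly required role", legacyJoin: false, tokenAllRoles: false},
		{name: "legacy join service, all required roles", legacyJoin: true, tokenAllRoles: true},
		{name: "legacy join service, only newly required role", legacyJoin: true, tokenAllRoles: false},
	}
}

// affectedBeforeFix encodes the statement that only case 3 hits the bug.
func affectedBeforeFix(c rejoinCase) bool {
	return c.legacyJoin && c.tokenAllRoles
}
```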

changelog: fixed a bug affecting nodes on v18.3.0+ rejoining with new system roles to clusters with Auth services on v18.2.10-

Manual Test Plan

Test Environment

Cluster with Auth server running v18.2.10 or older.
Node with this fix that has already joined with the Node (ssh) role only.

Test Cases

  • Add the App role to the join token. Enable the app service in the node config. Restart the node. It should rejoin and work.
  • Upgrade the Auth server to this branch and retest the above.

@nklaassen nklaassen marked this pull request as ready for review March 13, 2026 05:44

Development

Successfully merging this pull request may close these issues.

Instance identity breaks after adding system role
