Skip to content

Secure discovery boot data with jwt signed tokens#750

Open
stefanhipfel wants to merge 5 commits intomainfrom
issue_749
Open

Secure discovery boot data with jwt signed tokens#750
stefanhipfel wants to merge 5 commits intomainfrom
issue_749

Conversation

@stefanhipfel
Copy link
Copy Markdown
Contributor

@stefanhipfel stefanhipfel commented Mar 23, 2026

Proposed Changes

Fixes #

Summary by CodeRabbit

  • New Features

    • Discovery token auth added for registry registration and bootstate flows
    • Controller now generates and injects signed discovery tokens; probe CLI accepts a discovery-token flag
  • Security / Validation

    • HMAC‑SHA256 (HS256) signed tokens with 1‑hour freshness; registry enforces validation and returns 401/403 for auth failures
  • Documentation

    • New comprehensive guide on token lifecycle, secret management, rotation, and troubleshooting
  • Tests

    • New token tests and updated ignition/probe tests to validate token behavior

@stefanhipfel stefanhipfel marked this pull request as ready for review March 24, 2026 12:00
@stefanhipfel stefanhipfel requested a review from a team as a code owner March 24, 2026 12:00
@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai bot commented Mar 24, 2026

Walkthrough

Adds HMAC-SHA256 signed discovery tokens: controller creates or ensures a cluster signing secret, generates signed JWT tokens per server and injects them into ignition; metalprobe includes the token when calling registry endpoints; registry loads the signing secret and validates tokens for /register and /bootstate.

Changes

Cohort / File(s) Summary
Token package & constants
internal/token/token.go, internal/token/constants.go
New token package: 32-byte signing-secret generation, HS256 JWT generation (GenerateSignedDiscoveryToken) and verification (VerifySignedDiscoveryToken) plus exported constants for secret name/key.
Token tests
internal/token/token_test.go
Comprehensive tests for signing-secret generation, token creation, and verification (positive and negative cases).
Registry API payloads
internal/api/registry/registry.go
Added DiscoveryToken field to RegistrationPayload and BootstatePayload.
Registry server & tests
internal/registry/server.go, internal/registry/registry_suite_test.go
Server API changed to accept signing-secret name/namespace; server loads/keeps signing secret, adds token validation logic (skips in unit-test mode), enforces system UUID matching and returns 401 on auth failures.
Controller: ignition & secret management
internal/controller/server_controller.go, internal/controller/server_controller_test.go, internal/controller/suite_test.go
Controller adds getOrCreateSigningSecret, generates signed tokens for server.SystemUUID, injects --discovery-token=<token> into probe flags; tests updated to generate/extract and assert token presence; test wiring updated to pass signing-secret name/namespace.
Probe agent & tests
internal/probe/probe.go, internal/probe/probe_suite_test.go
Agent gains DiscoveryToken field; NewAgent signature extended to accept token; registration payload now includes token; registerServer treats 401/403 as authentication failures (stop retries).
CLI / entrypoints
cmd/main.go, cmd/metalprobe/main.go
Registry server constructor calls extended to pass signing-secret name and namespace; metalprobe adds -discovery-token CLI flag and fails startup if token missing.
Docs
docs/discovery-token-security.md
New documentation detailing token format, lifecycle, secret management, validation semantics, ops guidance, and troubleshooting.
Test/wiring updates
internal/registry/..., internal/probe/..., various *_suite_test.go` files
Multiple test call-sites updated to pass additional parameters or test tokens and to assert token-related behavior.
Dependencies
go.mod
Added github.com/golang-jwt/jwt/v5 dependency.

Sequence Diagram

sequenceDiagram
    participant K8s as Kubernetes
    participant Ctrl as Controller
    participant Registry as Registry Server
    participant Probe as Metalprobe

    Ctrl->>K8s: getOrCreateSigningSecret()
    K8s-->>Ctrl: signing-key (32 bytes)
    Ctrl->>Ctrl: GenerateSignedDiscoveryToken(signing-key, systemUUID)
    Ctrl-->>K8s: Apply ignition with --discovery-token=token
    K8s-->>Probe: Boot with token flag

    Probe->>Registry: POST /register (DiscoveryToken)
    Registry->>Registry: VerifySignedDiscoveryToken(signing-key, token)
    alt Token Valid
        Registry-->>Probe: 200 OK
    else Token Invalid/Missing
        Registry-->>Probe: 401 Unauthorized
    end

    Probe->>Registry: POST /bootstate (DiscoveryToken)
    Registry->>Registry: VerifySignedDiscoveryToken(signing-key, token)
    alt Token Valid
        Registry-->>Probe: 200 OK
    else Token Invalid/Missing
        Registry-->>Probe: 401 Unauthorized
    end
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

  • Update kubebuilder to v4.11.1 #636 — modifies internal/registry/server.go constructor and related server wiring; strongly related to the registry server signature change and token-loading behavior.

Suggested labels

documentation, enhancement, api-change

Suggested reviewers

  • afritzler
  • xkonni
  • Nuckal777
🚥 Pre-merge checks | ✅ 1 | ❌ 2

❌ Failed checks (2 warnings)

Check name Status Explanation Resolution
Description check ⚠️ Warning The PR description is entirely incomplete, containing only placeholder lines and no actual details about the changes, implementation rationale, or which issue is being fixed. Complete the description by listing the key changes, explaining the implementation approach, and specifying which issue number is fixed. Reference the substantial token generation, registry authentication, and documentation additions.
Docstring Coverage ⚠️ Warning Docstring coverage is 54.55% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (1 passed)
Check name Status Explanation
Title check ✅ Passed The title 'Secure discovery boot data with jwt signed tokens' clearly and specifically summarizes the main change—implementing JWT signed token security for discovery boot data, which is the core feature added across all modified files.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch issue_749

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 6

🧹 Nitpick comments (7)
cmd/main.go (1)

627-633: Consider extracting the hardcoded secret name to a shared constant.

The secret name "discovery-token-signing-secret" is duplicated here and in server_controller.go (line 723). Extracting it to a shared constant would prevent accidental drift between the controller and registry server configurations.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cmd/main.go` around lines 627 - 633, The secret name
"discovery-token-signing-secret" is duplicated; extract it to a shared constant
and use that constant wherever the literal is used. Define a package-level
constant (e.g., DiscoveryTokenSigningSecret) in a common package/file imported
by both the registry setup and the controller, then replace the literal in the
registry.NewServer call and the usage in server_controller.go (and any other
occurrences) with that constant to ensure a single source of truth.
internal/token/token_test.go (2)

163-173: Test comment is misleading - this doesn't verify constant-time behavior.

The comment says "HMAC provides constant-time comparison internally" but the test only verifies that token verification works, not timing properties. Timing-safe comparison cannot be tested this way.

Consider renaming or updating the test description to reflect what it actually tests.

📝 Suggested description improvement
-	It("should use constant-time comparison (HMAC)", func() {
-		// HMAC provides constant-time comparison internally
-		// This test verifies the signature mechanism works
+	It("should verify tokens using HMAC signature mechanism", func() {
+		// Note: HMAC uses hmac.Equal internally for constant-time comparison
+		// This test verifies basic signature verification works
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/token/token_test.go` around lines 163 - 173, The test description is
misleading about timing; update the It description under Context("Token
Security") to reflect that it verifies signature correctness rather than
constant-time behavior: change the It("should use constant-time comparison
(HMAC)") text to something like It("verifies signed discovery token using HMAC
signature") and/or update the comment above the test; keep the test body using
GenerateSignedDiscoveryToken and VerifySignedDiscoveryToken unchanged so it
still asserts successful generation and valid verification.

67-76: Consider using explicit time manipulation instead of relying on timing.

The Eventually pattern here relies on wall-clock time passing to generate a different timestamp. This could be flaky in fast CI environments or if the timestamp resolution changes. Consider either:

  1. Mocking time in the token package
  2. Simply accepting that same-second tokens may be identical (expected behavior)

The current test is technically correct but may cause intermittent failures.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/token/token_test.go` around lines 67 - 76, Test relies on wall-clock
progress and can flake; update the code or test to control time
deterministically: either add a time provider to GenerateSignedDiscoveryToken
(e.g., inject a nowFunc or accept a timestamp parameter) and in the test call
GenerateSignedDiscoveryToken with two different timestamps to assert tokens
differ, or change the test to assert tokens may be equal within the same second
and avoid Eventually. Locate GenerateSignedDiscoveryToken and the token_test.go
case and implement the time-injection approach (introduce nowFunc or timestamp
arg) or alter the test to compare tokens produced with explicit different times
instead of using Eventually.
internal/controller/server_controller_test.go (2)

437-446: Fragile string parsing for token extraction.

This manual string parsing with IndexAny and hardcoded delimiters is brittle. If the ignition format changes slightly (e.g., different quoting or whitespace), this could break silently.

Consider using a regex pattern for more robust extraction, or better yet, parse the YAML structure to get the systemd unit contents and then extract the flag value.

♻️ Suggested regex-based extraction
-		// Extract token using regex or string parsing
-		tokenStart := strings.Index(contents, "--discovery-token=")
-		Expect(tokenStart).To(BeNumerically(">", 0))
-		tokenValueStart := tokenStart + len("--discovery-token=")
-		tokenEnd := strings.IndexAny(contents[tokenValueStart:], " \n\t\"'")
-		var actualToken string
-		if tokenEnd == -1 {
-			actualToken = contents[tokenValueStart:]
-		} else {
-			actualToken = contents[tokenValueStart : tokenValueStart+tokenEnd]
-		}
-		Expect(actualToken).ToNot(BeEmpty())
+		// Extract token using regex for more robust parsing
+		tokenRegex := regexp.MustCompile(`--discovery-token=([A-Za-z0-9_-]+)`)
+		matches := tokenRegex.FindStringSubmatch(contents)
+		Expect(matches).To(HaveLen(2), "should find discovery token in ignition")
+		actualToken := matches[1]
+		Expect(actualToken).ToNot(BeEmpty())
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/controller/server_controller_test.go` around lines 437 - 446, The
current brittle manual parsing of contents (variables tokenStart,
tokenValueStart, tokenEnd, actualToken) using strings.Index and IndexAny should
be replaced with a more robust extraction: use a regex to locate
--discovery-token=<value> (or parse the YAML/ignition to get the systemd unit
content first) and capture the token value, or parse the YAML structure to
retrieve the unit and then extract the flag; update the test to use that
regex/capture group (or YAML unmarshalling) instead of the Index/IndexAny logic
so variations in quoting/whitespace won’t break the extraction.

271-289: Consider extracting repeated signing secret retrieval into a helper function.

The pattern of fetching the signing secret and generating a token is duplicated across multiple tests (here, lines 506-522, and lines 853-867). This could be extracted into a test helper to reduce duplication and improve maintainability.

♻️ Suggested helper function
// Add this helper near the top of the test file or in a test utilities file
func getSignedDiscoveryToken(ctx context.Context, k8sClient client.Client, ns, systemUUID string) string {
	signingSecret := &v1.Secret{}
	Eventually(func() error {
		return k8sClient.Get(ctx, client.ObjectKey{
			Name:      "discovery-token-signing-secret",
			Namespace: ns,
		}, signingSecret)
	}).Should(Succeed())

	signingKey := signingSecret.Data["signing-key"]
	Expect(signingKey).To(HaveLen(32))

	token, err := metaltoken.GenerateSignedDiscoveryToken(signingKey, systemUUID)
	Expect(err).NotTo(HaveOccurred())
	return token
}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/controller/server_controller_test.go` around lines 271 - 289, The
test duplicates code to fetch the "discovery-token-signing-secret" and generate
a signed discovery token; extract that logic into a helper function (e.g.,
getSignedDiscoveryToken(ctx context.Context, k8sClient client.Client, ns string,
systemUUID string) string) placed near the top of the test file or a test
utilities file, have it perform the Eventually k8sClient.Get into a v1.Secret,
assert signingKey length and GenerateSignedDiscoveryToken
(metaltoken.GenerateSignedDiscoveryToken), return the token, and then replace
the inline blocks in the tests (the creation of signingSecret, signingKey check,
metaltoken.GenerateSignedDiscoveryToken call, and probe.NewAgent usage sites)
with calls to getSignedDiscoveryToken to reduce duplication and keep assertions
intact.
internal/token/token.go (1)

86-97: Manual delimiter parsing could use strings.Index or strings.Split.

The manual loop to find || can be simplified using strings.Index:

♻️ Simplified delimiter parsing
-	// Parse by splitting on "||"
-	firstDelim := -1
-	for i := 0; i < len(payload)-1; i++ {
-		if payload[i] == '|' && payload[i+1] == '|' {
-			firstDelim = i
-			break
-		}
-	}
-
-	if firstDelim == -1 {
+	// Parse by finding "||" delimiter
+	firstDelim := strings.Index(payload, "||")
+	if firstDelim == -1 {
		return "", 0, false, nil // Invalid format
	}

Note: You would need to add "strings" to imports.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/token/token.go` around lines 86 - 97, Replace the manual loop that
searches for the "||" delimiter on the payload with strings.Index(payload, "||")
(or strings.SplitN(payload, "||", 2)) to simplify and clarify parsing; update
the code paths that use firstDelim (the existing early-return when delimiter not
found and subsequent substring logic) to use the Index/Split result, and add
"strings" to the imports. This keeps the same behavior (returning "", 0, false,
nil when not found) but removes the manual byte-by-byte scan and improves
readability around the payload delimiter handling.
docs/discovery-token-security.md (1)

13-15: Add language specifiers to fenced code blocks.

Static analysis flagged these code blocks as missing language specifiers. Adding text or plaintext for the token format and ASCII diagram would satisfy the linter and improve rendering in some viewers.

📝 Suggested fix
-```
+```text
 token = base64url(systemUUID||timestamp||HMAC-SHA256(systemUUID||timestamp, secret))

And for the sequence diagram:
```diff
-```
+```text
 ┌──────────────┐                  ┌──────────┐                  ┌──────────┐

Also applies to: 52-77

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/discovery-token-security.md` around lines 13 - 15, Update the fenced
code blocks that show the token format (the block containing "token =
base64url(systemUUID||timestamp||HMAC-SHA256(systemUUID||timestamp, secret))")
and the ASCII sequence diagram (the block starting with the box characters
"┌──────────────┐ ⟨...⟩") to include a language specifier such as "text" or
"plaintext" after the opening backticks; apply the same change to the other code
fences noted in the comment range (the blocks around lines 52–77) so the linter
accepts them and viewers render them as plain text.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@docs/discovery-token-security.md`:
- Around line 317-321: The manual command uses "openssl rand -base64 32" which
base64-encodes the 32 bytes and causes the stored signing-key to be ~44 bytes
after Kubernetes base64-ing it; instead generate 32 raw bytes, write them to a
temporary file, create the Kubernetes secret discovery-token-signing-secret
using --from-file=signing-key=<tempfile> so the exact 32-byte value is stored,
and then remove the temporary file; references: discovery-token-signing-secret,
signing-key, and avoid using openssl rand -base64 32.

In `@internal/controller/server_controller.go`:
- Around line 720-767: getOrCreateSigningSecret has two issues: concurrent
reconciles can race on creating the secret and invalid secrets are never
recovered. Modify getOrCreateSigningSecret to: when an existing secret is found
validate the "signing-key" and if invalid attempt to replace it (generate a new
key and Update/Patch the secret, handling Conflict by re-getting and
re-validating); when Create returns AlreadyExists after generating a new secret,
treat it as a benign race by fetching the newly-created secret and
validating/returning its signing-key instead of failing; in all fetch/update
paths handle transient IsNotFound/Conflict/AlreadyExists errors by re-fetching
and retrying validation once before returning an error. Reference:
getOrCreateSigningSecret, secret.Data["signing-key"], r.Create, r.Get, and use
metaltoken.GenerateSigningSecret for regeneration.

In `@internal/probe/probe.go`:
- Around line 214-219: The auth-error branch currently returns (false, nil)
which signals "not done, no error" so ExponentialBackoffWithContext keeps
retrying; change it to return a non-nil error to stop retries. In the block that
logs the auth failure (the a.log.Error call that checks resp.StatusCode ==
http.StatusUnauthorized || resp.StatusCode == http.StatusForbidden), return a
terminal error (e.g., errors.New or fmt.Errorf with the status code and context)
instead of nil so ExponentialBackoffWithContext will cease retries.

In `@internal/registry/server.go`:
- Around line 89-96: The current validateDiscoveryToken function blindly allows
requests when s.k8sClient == nil which is risky; change the logic to require an
explicit test-mode flag instead of inferring from s.k8sClient by adding a Server
field like testMode (bool) and only bypass validation when testMode is true, and
also add a startup-time log warning if testMode is enabled; update
validateDiscoveryToken to check s.testMode (and still log that validation is
skipped) and ensure any creation/initialization paths set testMode explicitly
rather than relying on a nil k8sClient.
- Around line 104-133: validateDiscoveryToken performs lazy loading of
s.signingSecret without synchronization, causing a data race; add a synchronized
one-time loader and protect writes: introduce fields on Server (e.g.,
signingSecretOnce sync.Once and signingSecretMu sync.Mutex) and implement a
helper method loadSigningSecret() that uses signingSecretOnce.Do to fetch the
Secret via s.k8sClient, validate secret.Data["signing-key"], then assign
s.signingSecret while holding signingSecretMu; call loadSigningSecret() from
validateDiscoveryToken and only read s.signingSecret under the mutex or after
the Once to ensure safe concurrent access.

In `@internal/token/token.go`:
- Around line 40-42: Validate the systemUUID before constructing the payload to
prevent delimiter injection: in the function that builds the payload (where
systemUUID, timestamp, and payload are used) add an RFC‑4122 UUID regex check
against systemUUID and return an error if it doesn't match, so payload :=
fmt.Sprintf("%s||%d", systemUUID, timestamp) only runs for a valid UUID;
alternatively you may replace the delimiter approach with length‑prefixing or a
null byte separator and update the corresponding parsing logic (the token
parsing code referenced at lines 86-99) to use the new format.

---

Nitpick comments:
In `@cmd/main.go`:
- Around line 627-633: The secret name "discovery-token-signing-secret" is
duplicated; extract it to a shared constant and use that constant wherever the
literal is used. Define a package-level constant (e.g.,
DiscoveryTokenSigningSecret) in a common package/file imported by both the
registry setup and the controller, then replace the literal in the
registry.NewServer call and the usage in server_controller.go (and any other
occurrences) with that constant to ensure a single source of truth.

In `@docs/discovery-token-security.md`:
- Around line 13-15: Update the fenced code blocks that show the token format
(the block containing "token =
base64url(systemUUID||timestamp||HMAC-SHA256(systemUUID||timestamp, secret))")
and the ASCII sequence diagram (the block starting with the box characters
"┌──────────────┐ ⟨...⟩") to include a language specifier such as "text" or
"plaintext" after the opening backticks; apply the same change to the other code
fences noted in the comment range (the blocks around lines 52–77) so the linter
accepts them and viewers render them as plain text.

In `@internal/controller/server_controller_test.go`:
- Around line 437-446: The current brittle manual parsing of contents (variables
tokenStart, tokenValueStart, tokenEnd, actualToken) using strings.Index and
IndexAny should be replaced with a more robust extraction: use a regex to locate
--discovery-token=<value> (or parse the YAML/ignition to get the systemd unit
content first) and capture the token value, or parse the YAML structure to
retrieve the unit and then extract the flag; update the test to use that
regex/capture group (or YAML unmarshalling) instead of the Index/IndexAny logic
so variations in quoting/whitespace won’t break the extraction.
- Around line 271-289: The test duplicates code to fetch the
"discovery-token-signing-secret" and generate a signed discovery token; extract
that logic into a helper function (e.g., getSignedDiscoveryToken(ctx
context.Context, k8sClient client.Client, ns string, systemUUID string) string)
placed near the top of the test file or a test utilities file, have it perform
the Eventually k8sClient.Get into a v1.Secret, assert signingKey length and
GenerateSignedDiscoveryToken (metaltoken.GenerateSignedDiscoveryToken), return
the token, and then replace the inline blocks in the tests (the creation of
signingSecret, signingKey check, metaltoken.GenerateSignedDiscoveryToken call,
and probe.NewAgent usage sites) with calls to getSignedDiscoveryToken to reduce
duplication and keep assertions intact.

In `@internal/token/token_test.go`:
- Around line 163-173: The test description is misleading about timing; update
the It description under Context("Token Security") to reflect that it verifies
signature correctness rather than constant-time behavior: change the It("should
use constant-time comparison (HMAC)") text to something like It("verifies signed
discovery token using HMAC signature") and/or update the comment above the test;
keep the test body using GenerateSignedDiscoveryToken and
VerifySignedDiscoveryToken unchanged so it still asserts successful generation
and valid verification.
- Around line 67-76: Test relies on wall-clock progress and can flake; update
the code or test to control time deterministically: either add a time provider
to GenerateSignedDiscoveryToken (e.g., inject a nowFunc or accept a timestamp
parameter) and in the test call GenerateSignedDiscoveryToken with two different
timestamps to assert tokens differ, or change the test to assert tokens may be
equal within the same second and avoid Eventually. Locate
GenerateSignedDiscoveryToken and the token_test.go case and implement the
time-injection approach (introduce nowFunc or timestamp arg) or alter the test
to compare tokens produced with explicit different times instead of using
Eventually.

In `@internal/token/token.go`:
- Around line 86-97: Replace the manual loop that searches for the "||"
delimiter on the payload with strings.Index(payload, "||") (or
strings.SplitN(payload, "||", 2)) to simplify and clarify parsing; update the
code paths that use firstDelim (the existing early-return when delimiter not
found and subsequent substring logic) to use the Index/Split result, and add
"strings" to the imports. This keeps the same behavior (returning "", 0, false,
nil when not found) but removes the manual byte-by-byte scan and improves
readability around the payload delimiter handling.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 6424543f-9838-4804-8743-f03cb9c7032c

📥 Commits

Reviewing files that changed from the base of the PR and between 8937145 and f697e38.

📒 Files selected for processing (13)
  • cmd/main.go
  • cmd/metalprobe/main.go
  • docs/discovery-token-security.md
  • internal/api/registry/registry.go
  • internal/controller/server_controller.go
  • internal/controller/server_controller_test.go
  • internal/controller/suite_test.go
  • internal/probe/probe.go
  • internal/probe/probe_suite_test.go
  • internal/registry/registry_suite_test.go
  • internal/registry/server.go
  • internal/token/token.go
  • internal/token/token_test.go

Comment on lines +89 to +96
func (s *Server) validateDiscoveryToken(receivedToken string) (string, bool) {
// Skip validation in unit test mode (no k8s client)
if s.k8sClient == nil {
s.log.V(1).Info("Skipping token validation (no K8s client - unit test mode)")
// In test mode, extract systemUUID from token if possible, otherwise return empty
// For now, just allow all requests in test mode
return "", true
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Bypassing authentication in test mode is risky.

When k8sClient == nil, all token validation is skipped and requests are allowed. While this is convenient for unit tests, it creates a risk if someone accidentally deploys with a nil client or misconfigures the server.

Consider logging a warning at startup if running in test mode, or requiring an explicit flag to enable test mode rather than inferring it from client presence.

📝 Suggested improvement
 func NewServer(logger logr.Logger, addr string, k8sClient client.Client, signingSecretName, signingSecretNamespace string) *Server {
+	if k8sClient == nil {
+		logger.Info("WARNING: Running without K8s client - token validation DISABLED (test mode only)")
+	}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/registry/server.go` around lines 89 - 96, The current
validateDiscoveryToken function blindly allows requests when s.k8sClient == nil
which is risky; change the logic to require an explicit test-mode flag instead
of inferring from s.k8sClient by adding a Server field like testMode (bool) and
only bypass validation when testMode is true, and also add a startup-time log
warning if testMode is enabled; update validateDiscoveryToken to check
s.testMode (and still log that validation is skipped) and ensure any
creation/initialization paths set testMode explicitly rather than relying on a
nil k8sClient.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

♻️ Duplicate comments (1)
docs/discovery-token-security.md (1)

339-342: ⚠️ Potential issue | 🟠 Major

Manual secret creation command still produces incorrect key length.

As previously flagged, openssl rand -base64 32 generates 32 random bytes and base64-encodes them to ~44 characters. When stored via --from-literal, kubectl base64-encodes this again, so the decoded signing-key will be 44 bytes, not the required 32 bytes.

🔧 Corrected command
-# Or manually create the secret
-kubectl create secret generic discovery-token-signing-secret \
-  --from-literal=signing-key=$(openssl rand -base64 32) \
-  -n metal-operator-system
+# Or manually create the secret (generate 32 raw bytes)
+kubectl create secret generic discovery-token-signing-secret \
+  --from-file=signing-key=<(openssl rand 32) \
+  -n metal-operator-system
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/discovery-token-security.md` around lines 339 - 342, The current command
uses openssl rand -base64 32 with --from-literal which yields a 44-byte base64
string that kubectl base64-encodes again, producing an incorrect 44-byte decoded
signing-key; instead generate raw 32 bytes to a file and create the secret from
that file so Kubernetes stores exactly 32 bytes: run openssl rand 32 >
signing-key.bin and then kubectl create secret generic
discovery-token-signing-secret --from-file=signing-key=signing-key.bin -n
metal-operator-system (then securely remove signing-key.bin); update the
documented command that references openssl rand -base64 32, --from-literal, and
signing-key accordingly.
🧹 Nitpick comments (2)
internal/token/token_test.go (1)

69-78: Consider using time.Sleep or handling the error in Eventually.

The test swallows the error from GenerateSignedDiscoveryToken on line 75. If token generation fails, the test would hang until timeout rather than fail fast with a clear error.

♻️ Proposed fix to handle error explicitly
 		It("should generate different tokens at different times", func() {
 			token1, err := GenerateSignedDiscoveryToken(signingSecret, "test-uuid")
 			Expect(err).NotTo(HaveOccurred())

-			// Small delay to ensure different timestamp
-			Eventually(func() string {
-				token2, _ := GenerateSignedDiscoveryToken(signingSecret, "test-uuid")
-				return token2
-			}).ShouldNot(Equal(token1))
+			// Wait to ensure different timestamp (JWT uses second precision)
+			time.Sleep(1100 * time.Millisecond)
+			
+			token2, err := GenerateSignedDiscoveryToken(signingSecret, "test-uuid")
+			Expect(err).NotTo(HaveOccurred())
+			Expect(token2).NotTo(Equal(token1))
 		})
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/token/token_test.go` around lines 69 - 78, The test swallows the
error from GenerateSignedDiscoveryToken inside Eventually causing silent
failures; update the "should generate different tokens at different times" test
to handle errors explicitly — either insert a short time.Sleep between calls and
call GenerateSignedDiscoveryToken twice with Expect(err).NotTo(HaveOccurred())
before comparing token1 vs token2, or change the Eventually invocation to call a
helper that returns (string, error) and assert on the error immediately;
reference GenerateSignedDiscoveryToken, token1/token2 and signingSecret when
making the change so the test fails fast on token generation errors instead of
hanging.
docs/discovery-token-security.md (1)

13-15: Add language specifiers to fenced code blocks for better rendering.

The markdown linter flags code blocks without language specifiers. For non-code text blocks (token format examples and ASCII diagrams), use the text specifier.

📐 Proposed fix
-```
+```text
 token = header.payload.signature

JWT Structure:

- +text
..

And for the ASCII diagram around line 68:

-```
+```text
 ┌──────────────┐                  ┌──────────┐                  ┌──────────┐
 │  Controller  │                  │ Registry │                  │  Server  │
 └──────┬───────┘                  └────┬─────┘                  └────┬─────┘

Also applies to: 19-21, 68-94

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/discovery-token-security.md` around lines 13 - 15, The fenced code
blocks in docs/discovery-token-security.md (examples like the line "token =
header.payload.signature", the "JWT Structure" block showing
"<base64-header>.<base64-payload>.<base64-signature>" and the ASCII diagram
starting with "┌──────────────┐") lack language specifiers and fail the markdown
linter; update each such fenced block (including the blocks around lines 19-21
and 68-94) to use the text language specifier (i.e., replace ``` with ```text
for the token example, the JWT Structure example, and the ASCII diagram blocks)
so they render correctly and satisfy the linter.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@docs/discovery-token-security.md`:
- Around line 441-447: The "24-hour token lifetime (increased from 1 hour)"
statement under the "Not Needed (Now Implemented)" section is inconsistent with
the rest of the document and code examples that use a 1-hour expiry; update that
statement to reflect a 1-hour token lifetime (or alternatively update the
code/examples and the "Why 1 hour?" section if you intend to change to 24
hours). Specifically, edit the text under the "Not Needed (Now Implemented)"
heading to say "1-hour token lifetime" (and remove the "(increased from 1 hour)"
clause) and verify the JWT example and the "Why 1 hour?" section wording match
this value so the document is consistent.
- Around line 28-36: In the "Claims (Payload):" section remove the "-
**secret**: 32-byte (256-bit) shared secret..." bullet since the signing secret
is not a JWT claim; keep only "- **sub**", "- **iat**", and "- **exp**" as the
transmitted claims, and instead add a short note near the "Signature:
HMAC-SHA256(header + payload, secret)" line clarifying that the secret is used
only to compute/verify the HMAC and is not included in the token payload.

---

Duplicate comments:
In `@docs/discovery-token-security.md`:
- Around line 339-342: The current command uses openssl rand -base64 32 with
--from-literal which yields a 44-byte base64 string that kubectl base64-encodes
again, producing an incorrect 44-byte decoded signing-key; instead generate raw
32 bytes to a file and create the secret from that file so Kubernetes stores
exactly 32 bytes: run openssl rand 32 > signing-key.bin and then kubectl create
secret generic discovery-token-signing-secret
--from-file=signing-key=signing-key.bin -n metal-operator-system (then securely
remove signing-key.bin); update the documented command that references openssl
rand -base64 32, --from-literal, and signing-key accordingly.

---

Nitpick comments:
In `@docs/discovery-token-security.md`:
- Around line 13-15: The fenced code blocks in docs/discovery-token-security.md
(examples like the line "token = header.payload.signature", the "JWT Structure"
block showing "<base64-header>.<base64-payload>.<base64-signature>" and the
ASCII diagram starting with "┌──────────────┐") lack language specifiers and
fail the markdown linter; update each such fenced block (including the blocks
around lines 19-21 and 68-94) to use the text language specifier (i.e., replace
``` with ```text for the token example, the JWT Structure example, and the ASCII
diagram blocks) so they render correctly and satisfy the linter.

In `@internal/token/token_test.go`:
- Around line 69-78: The test swallows the error from
GenerateSignedDiscoveryToken inside Eventually causing silent failures; update
the "should generate different tokens at different times" test to handle errors
explicitly — either insert a short time.Sleep between calls and call
GenerateSignedDiscoveryToken twice with Expect(err).NotTo(HaveOccurred()) before
comparing token1 vs token2, or change the Eventually invocation to call a helper
that returns (string, error) and assert on the error immediately; reference
GenerateSignedDiscoveryToken, token1/token2 and signingSecret when making the
change so the test fails fast on token generation errors instead of hanging.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 34fd7cba-42bb-49b5-a86a-77dfe6d83e10

📥 Commits

Reviewing files that changed from the base of the PR and between f697e38 and 6388ba3.

⛔ Files ignored due to path filters (1)
  • go.sum is excluded by !**/*.sum
📒 Files selected for processing (4)
  • docs/discovery-token-security.md
  • go.mod
  • internal/token/token.go
  • internal/token/token_test.go
🚧 Files skipped from review as they are similar to previous changes (1)
  • internal/token/token.go

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
internal/registry/server.go (1)

310-319: ⚠️ Potential issue | 🔴 Critical

Nil pointer dereference when k8sClient is nil and payload.Booted is true.

In test mode (k8sClient == nil), token validation passes at Lines 294-299. If payload.Booted is true, the code reaches Line 315 which calls s.k8sClient.List(), causing a panic.

🐛 Proposed fix to guard against nil client
 	s.log.Info("Received boot state for system", "SystemUUID", payload.SystemUUID, "BootState", payload.Booted)
 	if !payload.Booted {
 		w.WriteHeader(http.StatusOK)
 		return
 	}
+	// In test mode without k8s client, we cannot update ServerBootConfiguration
+	if s.k8sClient == nil {
+		s.log.Info("Running in test mode, skipping ServerBootConfiguration update")
+		w.WriteHeader(http.StatusOK)
+		return
+	}
 	var servers metalv1alpha1.ServerList
 	if err := s.k8sClient.List(r.Context(), &servers, client.MatchingFields{"spec.systemUUID": payload.SystemUUID}); err != nil {
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/registry/server.go` around lines 310 - 319, The handler panics when
payload.Booted is true because s.k8sClient can be nil in test mode; before
calling s.k8sClient.List(...) check for s.k8sClient == nil and short-circuit
(e.g., write a success response and return or return a controlled error) to
avoid dereferencing nil. Update the code path that uses s.k8sClient.List,
referencing s.k8sClient and the payload.Booted check, to handle the nil client
case (log a clear message and respond without calling List).
♻️ Duplicate comments (1)
internal/controller/server_controller.go (1)

730-736: ⚠️ Potential issue | 🟠 Major

Invalid secret still causes permanent failure without auto-recovery.

When the signing secret exists but has an invalid key (wrong length or missing), this path returns an error without attempting to regenerate. This will cause permanent reconciliation failures until manual intervention, while the AlreadyExists path (Lines 778-790) does attempt recovery.

Consider applying the same regeneration logic here for consistency:

🐛 Proposed fix to add auto-recovery
 	if err == nil {
 		// Secret exists, return the signing key
 		if signingKey, ok := secret.Data[metaltoken.DiscoveryTokenSigningSecretKey]; ok && len(signingKey) == 32 {
 			return signingKey, nil
 		}
-		// Secret exists but is invalid, regenerate
-		return nil, fmt.Errorf("existing signing secret is invalid")
+		// Secret exists but is invalid, update it with a valid key
+		ctrl.LoggerFrom(ctx).Info("Existing signing secret is invalid, regenerating", "name", secretName)
+		signingKey, err := metaltoken.GenerateSigningSecret()
+		if err != nil {
+			return nil, fmt.Errorf("failed to generate signing secret: %w", err)
+		}
+		secret.Data = map[string][]byte{metaltoken.DiscoveryTokenSigningSecretKey: signingKey}
+		if err := r.Update(ctx, secret); err != nil {
+			return nil, fmt.Errorf("failed to update signing secret: %w", err)
+		}
+		return signingKey, nil
 	}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/controller/server_controller.go` around lines 730 - 736, The current
branch inside the secret-fetch logic returns an error when
secret.Data[metaltoken.DiscoveryTokenSigningSecretKey] is missing or not 32
bytes, causing permanent reconciliation failure; update the block in the
function handling secret retrieval so that if signingKey is missing/invalid you
treat it like a corrupted secret and trigger the same regeneration/recreate
logic used in the AlreadyExists path (the code that handles creating a new
signing secret), i.e., delete or overwrite the invalid secret and call the
existing regeneration routine to create a valid 32-byte signing key rather than
returning fmt.Errorf("existing signing secret is invalid").
🧹 Nitpick comments (1)
internal/registry/server.go (1)

144-151: Test mode bypass logs at V(1) which may not be visible.

When running without a K8s client, authentication is completely bypassed. The warning is logged at verbosity level 1, which may not be visible in default logging configurations. Consider logging at Info level (V(0)) to ensure operators notice this security-sensitive configuration.

📝 Suggested improvement
 	// Skip validation in test mode (no k8s client)
 	if s.k8sClient == nil {
-		s.log.V(1).Info("Running in TEST MODE without K8s client - skipping secret loading")
+		s.log.Info("WARNING: Running in TEST MODE without K8s client - token validation DISABLED")
 		// In test mode, extract systemUUID from token if possible, otherwise return empty
 		// For now, just allow all requests in test mode
 		return "", true
 	}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/registry/server.go` around lines 144 - 151, The test-mode bypass in
validateDiscoveryToken currently logs the security-sensitive message at
verbosity V(1) which may be hidden; update the logging call on the
s.k8sClient==nil branch (the s.log.V(1).Info(...) invocation inside
validateDiscoveryToken) to use the default/info level (e.g. s.log.Info(...) or
s.log.V(0).Info(...)) and make the message explicit (mention "TEST MODE -
authentication bypassed, no K8s client") so operators will see it in normal log
configurations.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@internal/probe/probe.go`:
- Around line 223-226: The log call is passing a nil `err` when the HTTP request
returned a non-success status; replace that by creating a real error describing
the response (e.g. using fmt.Errorf to include resp.StatusCode and optionally
the response body) and pass that error to `a.log.Error`, or alternatively call
`a.log.Error` without an error parameter and include the status and
`a.RegistryURL` in the message; target the block that checks `resp.StatusCode`
(the `if resp.StatusCode != http.StatusOK && resp.StatusCode !=
http.StatusCreated { ... }`), update the logging to use the newly constructed
error (or message) instead of the nil `err` so the log is informative and
non-nil.

---

Outside diff comments:
In `@internal/registry/server.go`:
- Around line 310-319: The handler panics when payload.Booted is true because
s.k8sClient can be nil in test mode; before calling s.k8sClient.List(...) check
for s.k8sClient == nil and short-circuit (e.g., write a success response and
return or return a controlled error) to avoid dereferencing nil. Update the code
path that uses s.k8sClient.List, referencing s.k8sClient and the payload.Booted
check, to handle the nil client case (log a clear message and respond without
calling List).

---

Duplicate comments:
In `@internal/controller/server_controller.go`:
- Around line 730-736: The current branch inside the secret-fetch logic returns
an error when secret.Data[metaltoken.DiscoveryTokenSigningSecretKey] is missing
or not 32 bytes, causing permanent reconciliation failure; update the block in
the function handling secret retrieval so that if signingKey is missing/invalid
you treat it like a corrupted secret and trigger the same regeneration/recreate
logic used in the AlreadyExists path (the code that handles creating a new
signing secret), i.e., delete or overwrite the invalid secret and call the
existing regeneration routine to create a valid 32-byte signing key rather than
returning fmt.Errorf("existing signing secret is invalid").

---

Nitpick comments:
In `@internal/registry/server.go`:
- Around line 144-151: The test-mode bypass in validateDiscoveryToken currently
logs the security-sensitive message at verbosity V(1) which may be hidden;
update the logging call on the s.k8sClient==nil branch (the s.log.V(1).Info(...)
invocation inside validateDiscoveryToken) to use the default/info level (e.g.
s.log.Info(...) or s.log.V(0).Info(...)) and make the message explicit (mention
"TEST MODE - authentication bypassed, no K8s client") so operators will see it
in normal log configurations.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 2c4ad7fa-b741-4419-82bd-b7a396b8689d

📥 Commits

Reviewing files that changed from the base of the PR and between 6388ba3 and f7cff68.

📒 Files selected for processing (8)
  • cmd/main.go
  • docs/discovery-token-security.md
  • internal/controller/server_controller.go
  • internal/controller/server_controller_test.go
  • internal/probe/probe.go
  • internal/registry/server.go
  • internal/token/constants.go
  • internal/token/token_test.go
✅ Files skipped from review due to trivial changes (2)
  • docs/discovery-token-security.md
  • internal/token/constants.go

Comment on lines 223 to 226
if resp.StatusCode != http.StatusOK && resp.StatusCode != http.StatusCreated {
a.log.Error(err, "failed to register server", "url", a.RegistryURL)
a.log.Error(err, "failed to register server", "url", a.RegistryURL, "statusCode", resp.StatusCode)
return false, nil
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Logging nil error on non-success HTTP response.

At Line 224, err is nil at this point (the HTTP request succeeded but returned a non-success status code), yet a.log.Error(err, ...) is called. This logs a nil error which is misleading.

🐛 Proposed fix
 			if resp.StatusCode != http.StatusOK && resp.StatusCode != http.StatusCreated {
-				a.log.Error(err, "failed to register server", "url", a.RegistryURL, "statusCode", resp.StatusCode)
+				a.log.Info("Failed to register server", "url", a.RegistryURL, "statusCode", resp.StatusCode)
 				return false, nil
 			}
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
if resp.StatusCode != http.StatusOK && resp.StatusCode != http.StatusCreated {
a.log.Error(err, "failed to register server", "url", a.RegistryURL)
a.log.Error(err, "failed to register server", "url", a.RegistryURL, "statusCode", resp.StatusCode)
return false, nil
}
if resp.StatusCode != http.StatusOK && resp.StatusCode != http.StatusCreated {
a.log.Info("Failed to register server", "url", a.RegistryURL, "statusCode", resp.StatusCode)
return false, nil
}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/probe/probe.go` around lines 223 - 226, The log call is passing a
nil `err` when the HTTP request returned a non-success status; replace that by
creating a real error describing the response (e.g. using fmt.Errorf to include
resp.StatusCode and optionally the response body) and pass that error to
`a.log.Error`, or alternatively call `a.log.Error` without an error parameter
and include the status and `a.RegistryURL` in the message; target the block that
checks `resp.StatusCode` (the `if resp.StatusCode != http.StatusOK &&
resp.StatusCode != http.StatusCreated { ... }`), update the logging to use the
newly constructed error (or message) instead of the nil `err` so the log is
informative and non-nil.

@stefanhipfel stefanhipfel changed the title Secure discovery boot data with HMAC-SHA256 signed tokens (Issue #749) Secure discovery boot data with jwt signed tokens (Issue #749) Mar 24, 2026
@afritzler afritzler changed the title Secure discovery boot data with jwt signed tokens (Issue #749) Secure discovery boot data with jwt signed tokens Mar 30, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

Status: Backlog
Status: No status

Development

Successfully merging this pull request may close these issues.

Secure discovery boot data with per-server tokens

2 participants