
Conversation

@jparsai jparsai commented Feb 3, 2026

This PR implements the "Self Cluster Registration" feature, which allows agents to register automatically: the principal creates a cluster Secret when an agent connects. Authentication uses a JWT bearer token signed by the principal's key; existing, manually registered clusters continue to work with TLS certificate CN authentication.

Supported authentication methods in principal:

  • Self-registered clusters: authenticated with a JWT bearer token + shared client certificate
  • Manually registered clusters: authenticated with the TLS client certificate CN (existing approach)

Fixes #392

How to test changes / Special notes to the reviewer:

  • Create a shared client certificate secret on the control plane.
  • Set the following CLI flags when starting the principal:
    --enable-self-cluster-registration
    --self-reg-client-cert-secret
    --enable-resource-proxy
  • Do not register the agent with argocd-agentctl agent create <Agent name>.
  • Start the agent; the principal will create the cluster secret automatically.

Assisted by: Cursor

Summary by CodeRabbit

  • New Features

    • Self-cluster registration: agents can auto-register, create/refresh bearer tokens, and be configured via an enable flag and client-cert secret; configurable resource-proxy address.
    • Resource-proxy tokens: issuance and validation added; bearer-token authentication supported for agent identification.
  • Tests

    • Broad unit and E2E coverage for self-registration flows, token lifecycle, TLS/secret handling, and auth precedence.
  • Chores

    • Dev script exposes env flags and derives resource-proxy address for E2E.

coderabbitai bot commented Feb 3, 2026

📝 Walkthrough

Walkthrough

Adds optional self-cluster registration and resource-proxy support: new server options and CLI flags to enable/configure registration and resource-proxy address; ClusterRegistrationManager to create/refresh cluster secrets and bearer tokens; Issuer extensions for resource-proxy tokens; auth and resource-path changes to use bearer tokens (with TLS fallback); unit and E2E tests and dev-env/startup wiring.

Changes

Cohort / File(s) Summary
CLI / Dev env
cmd/argocd-agent/principal.go, hack/dev-env/start-principal.sh
Add --resource-proxy-address, --enable-self-cluster-registration, --self-reg-client-secret flags; export/derive resource-proxy env and wire flags into principal startup.
Server options & wiring
principal/options.go, principal/server.go, principal/server_test.go, principal/listen.go
New ServerOptions fields and setters: selfClusterRegistrationEnabled, resourceProxyAddress, clientCertSecretName; initialize ClusterRegistrationManager in NewServer and pass it into auth server.
Auth integration
principal/apis/auth/options.go, principal/apis/auth/auth.go, principal/apis/auth/auth_test.go
Add clusterRegistrationManager option/field and call RegisterCluster during Authenticate when enabled; tests for enabled/disabled/failure cases.
Cluster registration manager
principal/clusterregistration/...
principal/clusterregistration/clusterregistration.go, principal/clusterregistration/clusterregistration_test.go
New ClusterRegistrationManager with RegisterCluster, token validation/refresh, create-with-token flow, IsSelfClusterRegistrationEnabled, and unit tests/mocks.
Cluster secret management
internal/argocd/cluster/...
internal/argocd/cluster/cluster.go, internal/argocd/cluster/cluster_test.go
Add TokenIssuer interface and functions to create/update/read cluster secrets with bearer tokens; read TLS client cert helper; tests for bearer-token flows.
Issuer (tokens)
internal/issuer/...
internal/issuer/issuer.go, internal/issuer/jwt.go, internal/issuer/jwt_test.go, internal/issuer/mocks/Issuer.go
Extend Issuer interface with IssueResourceProxyToken/ValidateResourceProxyToken; JwtIssuer adds rpAudience and issues/validates resource-proxy JWTs; tests and mocks updated (note duplicated mock block).
Principal resource/auth flow
principal/resource.go, principal/resource_test.go
Prefer Bearer resource-proxy tokens, fallback to TLS client-cert CN; add extractAgentFromAuth helper, agent-name validation, updated logging and tests for token/TLS precedence and failure modes.
E2E tests & fixtures
test/e2e/...
test/e2e/self_cluster_registration_test.go, test/e2e/fixture/selfregistration.go, test/e2e/rp_test.go, test/e2e/terminal_test.go
Add E2E fixture helpers to enable/disable self-registration and manage secrets; add comprehensive self-registration E2E tests for secret lifecycle, resource-proxy, and terminal streaming scenarios.
Dev/test mocks & tests
internal/issuer/mocks/Issuer.go, various *_test.go files
Add mocks and extensive unit tests across issuer, cluster, clusterregistration, auth, resource paths, and principal server construction.

Sequence Diagram

sequenceDiagram
    participant Agent as Agent
    participant Principal as Principal Server
    participant Issuer as Token Issuer
    participant K8s as Kubernetes API

    Agent->>Principal: Connect (TLS client cert) or Bearer token
    Principal->>Principal: extractAgentFromAuth() -> agentName
    Principal->>Principal: Is self-cluster registration enabled?
    alt enabled
        Principal->>K8s: ClusterSecretExists(agentName)?
        alt secret exists
            K8s-->>Principal: Secret (may contain bearer token)
            Principal->>Issuer: ValidateResourceProxyToken(token)
            alt token invalid or subject mismatch
                Principal->>Issuer: IssueResourceProxyToken(agentName)
                Principal->>K8s: Update cluster Secret with new token
            end
        else secret missing
            Principal->>Issuer: IssueResourceProxyToken(agentName)
            Principal->>K8s: Read client cert secret (tls.crt,tls.key,ca.crt)
            Principal->>K8s: Create cluster Secret (labelled self-registered) with token and TLS material
        end
    end
    Principal->>Principal: Issue access tokens for agent auth
    Principal-->>Agent: Return auth tokens / connection accepted
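The registration decision in the sequence diagram above (validate an existing token, reissue on mismatch, or create the secret with a fresh token) can be sketched with hypothetical interfaces. The names below (tokenIssuer, secretStore, registerCluster) are illustrative stand-ins, not the PR's actual API:

```go
package main

import (
	"errors"
	"fmt"
)

// Hypothetical stand-ins for the PR's issuer and cluster-secret store.
type tokenIssuer interface {
	Issue(agent string) (string, error)
	Validate(token string) (subject string, err error)
}

type secretStore interface {
	Get(agent string) (token string, exists bool)
	Put(agent string, token string)
}

// registerCluster mirrors the flow in the sequence diagram:
// existing secret -> validate token, reissue if invalid or subject mismatch;
// missing secret  -> issue a token and create the secret.
func registerCluster(agent string, iss tokenIssuer, store secretStore) error {
	if token, ok := store.Get(agent); ok {
		if subj, err := iss.Validate(token); err == nil && subj == agent {
			return nil // token is valid for this agent, nothing to do
		}
		fresh, err := iss.Issue(agent)
		if err != nil {
			return fmt.Errorf("refreshing token: %w", err)
		}
		store.Put(agent, fresh) // update existing secret with the new token
		return nil
	}
	fresh, err := iss.Issue(agent)
	if err != nil {
		return fmt.Errorf("issuing token: %w", err)
	}
	store.Put(agent, fresh) // create the secret, labelled self-registered
	return nil
}

// Trivial in-memory implementations for demonstration only.
type memIssuer struct{}

func (memIssuer) Issue(agent string) (string, error) { return "tok-" + agent, nil }
func (memIssuer) Validate(token string) (string, error) {
	if len(token) > 4 && token[:4] == "tok-" {
		return token[4:], nil
	}
	return "", errors.New("invalid token")
}

type memStore map[string]string

func (m memStore) Get(a string) (string, bool) { t, ok := m[a]; return t, ok }
func (m memStore) Put(a, t string)             { m[a] = t }

func main() {
	store := memStore{"agent-old": "garbled"}
	iss := memIssuer{}
	_ = registerCluster("agent-new", iss, store) // missing: created with fresh token
	_ = registerCluster("agent-old", iss, store) // existing but invalid: refreshed
	fmt.Println(store["agent-new"], store["agent-old"])
}
```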

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

Suggested reviewers

  • jannfis
  • jgwest
  • chetan-rns
  • mikeshng

Poem

🐇 I hopped with a cert and a brave little key,
Whispered my name to the principal tree,
A token was minted, a secret took root,
Labels were planted — no manual suit,
A rabbit's small hop made the cluster agree.

🚥 Pre-merge checks | ✅ 4 | ❌ 1
❌ Failed checks (1 warning)
Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 32.08% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title 'feat: enable self registration of agents' directly and concisely describes the main change: implementing self-registration functionality for agents to automatically register with the principal.
Linked Issues check ✅ Passed The PR implements all primary coding requirements from issue #392: self-registration feature toggle via CLI flags, principal-side cluster secret creation on agent connection, validation of agent identity via bearer token and client certificate, resource proxy address wiring, and comprehensive test coverage.
Out of Scope Changes check ✅ Passed All changes are directly related to self-registration implementation including authentication flow updates, cluster secret management, issuer token methods, server options, and comprehensive e2e testing. No unrelated or extraneous changes detected.


jparsai commented Feb 3, 2026

@coderabbitai review

coderabbitai bot commented Feb 3, 2026

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

🤖 Fix all issues with AI agents
In `@internal/argocd/cluster/cluster.go`:
- Around line 31-32: Remove the duplicate unaliased import of
"github.com/argoproj/argo-cd/v3/pkg/apis/application/v1alpha1" and keep only the
aliased import appv1; then update all code references that use the plain
v1alpha1 identifier (e.g., types like v1alpha1.Application,
v1alpha1.ApplicationSpec, any functions or variables named with v1alpha1) to use
appv1 instead (appv1.Application, appv1.ApplicationSpec, etc.) so the package
builds without the duplicate-import error.
- Around line 195-238: CreateCluster currently fails when Secrets.Create returns
an AlreadyExists error during concurrent registration; change the post-create
error handling in CreateCluster so that if
kubeclient.CoreV1().Secrets(namespace).Create(...) returns an error that
satisfies apierrors.IsAlreadyExists(err) (from
k8s.io/apimachinery/pkg/api/errors) you treat it as success and return nil
instead of an error, otherwise return the wrapped error as before; update the
CreateCluster function's error handling for the Create call to check
apierrors.IsAlreadyExists and return nil for that case.
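The suggested change is the standard idempotent-create pattern. A stdlib-only sketch, using a sentinel error and errors.Is in place of the real apierrors.IsAlreadyExists check (fakeCreate and createCluster are illustrative names, not the PR's code):

```go
package main

import (
	"errors"
	"fmt"
)

// errAlreadyExists stands in for the Kubernetes AlreadyExists error that
// apierrors.IsAlreadyExists detects in the real fix.
var errAlreadyExists = errors.New("already exists")

// fakeCreate simulates Secrets.Create: a second caller loses the race.
func fakeCreate(existing map[string]bool, name string) error {
	if existing[name] {
		return fmt.Errorf("secrets %q %w", name, errAlreadyExists)
	}
	existing[name] = true
	return nil
}

// createCluster treats a lost create race as success: another concurrent
// registration already created the secret, which is the desired end state.
func createCluster(existing map[string]bool, name string) error {
	if err := fakeCreate(existing, name); err != nil {
		if errors.Is(err, errAlreadyExists) {
			return nil // created concurrently; not an error
		}
		return fmt.Errorf("creating cluster secret: %w", err)
	}
	return nil
}

func main() {
	secrets := map[string]bool{}
	fmt.Println(createCluster(secrets, "cluster-agent-1")) // first create succeeds
	fmt.Println(createCluster(secrets, "cluster-agent-1")) // lost race, still nil
}
```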
🧹 Nitpick comments (3)
hack/dev-env/start-principal.sh (1)

50-51: Address shellcheck warning and add validation for derived address.

The variable declaration and assignment should be separated per SC2155 to avoid masking return values. Additionally, if no default route exists, ARGOCD_AGENT_RESOURCE_PROXY will be empty, causing an invalid --resource-proxy-address value of :9090.

🔧 Proposed fix
-export ARGOCD_AGENT_RESOURCE_PROXY=$(ip r show default | sed -e 's,.*\ src\ ,,' | sed -e 's,\ metric.*$,,')
+ARGOCD_AGENT_RESOURCE_PROXY=$(ip r show default | sed -e 's,.*\ src\ ,,' | sed -e 's,\ metric.*$,,')
+if test -z "$ARGOCD_AGENT_RESOURCE_PROXY"; then
+    echo "Could not determine resource proxy address from default route." >&2
+    echo "Please set ARGOCD_AGENT_RESOURCE_PROXY manually" >&2
+    exit 1
+fi
+export ARGOCD_AGENT_RESOURCE_PROXY
principal/apis/auth/auth.go (1)

127-134: Consider whether registration failure should block authentication.

Currently, if RegisterCluster fails, the entire authentication is rejected even though the agent's credentials were validated successfully. This creates a tight coupling where transient cluster registration issues (e.g., Kubernetes API errors, CA secret temporarily unavailable) will prevent authenticated agents from connecting.

Should this be a non-blocking error that logs a warning but allows authentication to proceed? The cluster secret could be created on a subsequent connection attempt.

💡 Alternative: Non-blocking registration
 	// If self cluster registration is enabled, register the agent's cluster and create cluster secret if it doesn't exist
 	if s.clusterRegistrationManager != nil && s.clusterRegistrationManager.IsSelfClusterRegistrationEnabled() {
 		if err := s.clusterRegistrationManager.RegisterCluster(ctx, clientID); err != nil {
-			logCtx.WithError(err).WithField("client", clientID).Error("Failed to self register agent's cluster")
-			return nil, errAuthenticationFailed
+			// Log error but allow authentication to proceed - registration can be retried on next connection
+			logCtx.WithError(err).WithField("client", clientID).Warn("Failed to self register agent's cluster, will retry on next connection")
 		}
 	}
test/e2e/self_cluster_registration_test.go (1)

444-451: Minor: disableSelfClusterRegistration silently ignores errors.

The os.Remove call may fail for reasons other than "file not found" (e.g., permissions). Consider returning the error only if it's not os.ErrNotExist:

🔧 Proposed improvement
 func disableSelfClusterRegistration() error {
-	return os.Remove(fixture.EnvVariablesFromE2EFile)
+	err := os.Remove(fixture.EnvVariablesFromE2EFile)
+	if err != nil && !os.IsNotExist(err) {
+		return err
+	}
+	return nil
 }


codecov-commenter commented Feb 3, 2026

Codecov Report

❌ Patch coverage is 76.57993% with 63 lines in your changes missing coverage. Please review.
✅ Project coverage is 46.93%. Comparing base (c638a2b) to head (7166169).
⚠️ Report is 1 commit behind head on main.

Files with missing lines Patch % Lines
internal/argocd/cluster/cluster.go 79.09% 18 Missing and 5 partials ⚠️
cmd/argocd-agent/principal.go 0.00% 21 Missing ⚠️
...incipal/clusterregistration/clusterregistration.go 79.66% 6 Missing and 6 partials ⚠️
principal/options.go 69.23% 4 Missing ⚠️
principal/resource.go 90.90% 2 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #736      +/-   ##
==========================================
+ Coverage   46.30%   46.93%   +0.62%     
==========================================
  Files         101      103       +2     
  Lines       12480    12739     +259     
==========================================
+ Hits         5779     5979     +200     
- Misses       6146     6194      +48     
- Partials      555      566      +11     

☔ View full report in Codecov by Sentry.


jparsai commented Feb 3, 2026

@coderabbitai review

coderabbitai bot commented Feb 3, 2026

✅ Actions performed

Review triggered.


@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@hack/dev-env/start-principal.sh`:
- Line 50: Split the combined declaration+assignment for
ARGOCD_AGENT_RESOURCE_PROXY so the subshell return value isn't masked: first
create the variable (or export an empty name) and then perform the command
substitution on a separate line, e.g. declare or export
ARGOCD_AGENT_RESOURCE_PROXY, then ARGOCD_AGENT_RESOURCE_PROXY=$(ip r show
default | sed -e 's,.*\ src\ ,,' | sed -e 's,\ metric.*$,,') (and export after
if needed); update the lines that reference ARGOCD_AGENT_RESOURCE_PROXY
accordingly.
🧹 Nitpick comments (4)
hack/dev-env/start-principal.sh (1)

52-52: SCRIPTPATH variable is unused.

The SCRIPTPATH variable is defined but never referenced in the script. Consider removing it or using it if needed.

🧹 Proposed fix
-SCRIPTPATH="$( cd -- "$(dirname "$0")" >/dev/null 2>&1 ; pwd -P )"
principal/clusterregistration/clusterregistration_test.go (1)

38-40: Consider using the function from cluster package instead of duplicating.

The getClusterSecretName helper duplicates logic from internal/argocd/cluster/cluster.go. If the original function were exported (GetClusterSecretName), this duplication could be avoided, reducing the risk of divergence.

test/e2e/self_cluster_registration_test.go (1)

449-451: Consider handling the case where the env file doesn't exist.

os.Remove returns an error if the file doesn't exist. While this error is ignored, it may be cleaner to explicitly handle os.IsNotExist.

🔧 Proposed fix
 func disableSelfClusterRegistration() error {
-	return os.Remove(fixture.EnvVariablesFromE2EFile)
+	err := os.Remove(fixture.EnvVariablesFromE2EFile)
+	if err != nil && !os.IsNotExist(err) {
+		return err
+	}
+	return nil
 }
internal/argocd/cluster/cluster.go (1)

288-290: Validate agentName before constructing secret name
Ensure agentName conforms to DNS subdomain rules (e.g. via validation.NameIsDNS1123Subdomain(agentName)) and return an error if invalid.


drewbailey commented Feb 3, 2026

@jparsai Thanks for this - self-registration will be really nice. I ran into friction testing this in an environment that uses cert-manager with an external ClusterIssuer (where the CA private key isn't accessible to workloads).

Problem

Self-registration currently requires the CA private key (argocd-agent-ca with tls.key) to generate unique client certificates per agent. This works when you manage your own CA, but breaks in environments where:

  • cert-manager provisions certs via external issuers (e.g., Vault, ACME, internal PKI)
  • The CA private key is managed by the issuer, not stored in-cluster
  • CN is constrained (e.g., must equal namespace name)

Root Cause

The resource proxy identifies agents by extracting CommonName from the client certificate (principal/resource.go:66):
agentName := cert.Subject.CommonName

This requires a unique cert per agent, hence the need to generate/sign certs at registration time.

Proposed Simplification

The cluster secret URL already contains the agent name as a query parameter:
Server: fmt.Sprintf("https://%s?agentName=%s", resourceProxyAddress, agentName)

We could:

  1. Have the resource proxy read agentName from the query param instead of cert CN
  2. Add a --client-cert-secret flag to specify a pre-existing shared TLS secret
  3. Self-registration copies creds from shared secret into cluster secrets (no cert generation needed)

This allows:

  • Single cert-manager Certificate for all agents
  • No CA private key required
  • Works with any external issuer
  • mTLS still enforced (client cert validated, just not used for identity)

I've prototyped this locally - happy to open a follow-up PR if there's interest. The changes are fairly minimal (~50 lines across resource.go, cluster.go, and clusterregistration.go).

gist of the patch https://gist.github.com/drewbailey/9591238eb95be9cfd8dba1e0a383289b
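Step 1 of the proposal (read agentName from the query parameter instead of the cert CN) reduces to standard URL parsing. A minimal stdlib sketch, with agentNameFromServer as an illustrative name:

```go
package main

import (
	"fmt"
	"net/url"
)

// agentNameFromServer reads the agent identity from the agentName query
// parameter of a cluster secret's server URL, as proposed, instead of
// extracting it from the client certificate's CN.
func agentNameFromServer(server string) (string, error) {
	u, err := url.Parse(server)
	if err != nil {
		return "", fmt.Errorf("parsing server URL: %w", err)
	}
	name := u.Query().Get("agentName")
	if name == "" {
		return "", fmt.Errorf("server URL %q has no agentName parameter", server)
	}
	return name, nil
}

func main() {
	name, err := agentNameFromServer("https://10.42.0.1:9090?agentName=agent-autonomous")
	fmt.Println(name, err)
}
```

Under this scheme mTLS still validates the shared client certificate; it simply stops carrying identity.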


jparsai commented Feb 4, 2026

Thanks @drewbailey, I will address this issue and make the changes you suggested.


jannfis commented Feb 4, 2026

Yes, I agree that taking the identity from the client cert in case of the resource-proxy might be a bit too much. As you pointed out, it would require another unique client cert per agent, in addition to the one the agent already uses for identification.

I'm thinking about your proposal to take the identity from the cluster's server field. I fear that it might not be future-proof. I was wondering if instead, we could generate a bearer token in the principal (signed by the principal's JWT key) and use that for identification to the resource-proxy.
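The mint/verify shape of this suggestion can be sketched with the standard library. The PR ultimately issues JWTs via the principal's issuer; this stdlib HMAC variant only illustrates the idea (mint and verify are hypothetical names, and the key is a placeholder):

```go
package main

import (
	"crypto/hmac"
	"crypto/sha256"
	"encoding/base64"
	"errors"
	"fmt"
	"strings"
)

// mint signs the agent name with the principal's key, producing a bearer
// token the resource-proxy can later verify.
func mint(key []byte, agent string) string {
	mac := hmac.New(sha256.New, key)
	mac.Write([]byte(agent))
	sig := base64.RawURLEncoding.EncodeToString(mac.Sum(nil))
	return agent + "." + sig
}

// verify checks the signature and recovers the agent identity from the token.
func verify(key []byte, token string) (string, error) {
	agent, sig, ok := strings.Cut(token, ".")
	if !ok {
		return "", errors.New("malformed token")
	}
	mac := hmac.New(sha256.New, key)
	mac.Write([]byte(agent))
	want := base64.RawURLEncoding.EncodeToString(mac.Sum(nil))
	if !hmac.Equal([]byte(sig), []byte(want)) {
		return "", errors.New("bad signature")
	}
	return agent, nil
}

func main() {
	key := []byte("principal-signing-key") // placeholder for the principal's JWT key
	tok := mint(key, "agent-managed")
	agent, err := verify(key, tok)
	fmt.Println(agent, err)
}
```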

@jparsai jparsai force-pushed the self-registration branch 3 times, most recently from 7ab50dd to 6bdfab6 Compare February 4, 2026 15:54
@drewbailey

current diff / suggestion for using jwt https://gist.github.com/drewbailey/6a9f823a5a222c37b1271c257a8f9cbc

Key differences:

| Aspect | PR 736 | My patch |
| --- | --- | --- |
| Auth mechanism | TLS client cert CN → query param agentName | JWT bearer token in Authorization header |
| CA private key | Required for legacy mode; shared-cert mode needs cert+key+ca secret | Not needed - only CA public cert |
| Cluster secret | TLSClientConfig with CertData/KeyData/CAData | BearerToken + TLSClientConfig.CAData only |
| Config flag | --self-cluster-registration-shared-cert (secret name) | Reuses --resource-proxy-ca-path |
| Token refresh | None - if cert invalid, stuck | Auto-refresh on reconnect if token invalid |
| New code | Uses existing TLS infrastructure | Added IssueResourceProxyToken/ValidateResourceProxyToken to issuer |

@jparsai jparsai force-pushed the self-registration branch 2 times, most recently from 03e0d3f to fd2b44f Compare February 5, 2026 12:23

jparsai commented Feb 5, 2026

Thanks @drewbailey and @jannfis for your suggestions, I have modified the implementation to address them. PTAL.

Thanks @drewbailey for providing the patch, it was very helpful. I mostly followed your patch, but modified a few things per my understanding. Please let me know if you feel the implementation is missing something.

@jparsai jparsai marked this pull request as ready for review February 5, 2026 13:21
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 4

🤖 Fix all issues with AI agents
In `@cmd/argocd-agent/principal.go`:
- Around line 102-103: When enableSelfClusterRegistration is true, perform a
preflight validation at startup (where flags/options are parsed) to check
required inputs for self-registration: ensure CA data is present (e.g., caData
or caFile option used by cluster secret creation) and that resource-proxy
settings/credentials (the address/auth config the server uses to talk to the
resource proxy) are provided; if any required input is missing, log a clear
error with details and exit non‑zero immediately so the server fails fast.
Locate and add this validation around the enableSelfClusterRegistration usage
(the startup/initialization block using enableSelfClusterRegistration and the
code paths that perform cluster secret creation/resource proxy registration
referenced at the locations around lines ~102, ~331-335, and ~493-495) so
self-registration is gated before the server starts.

In `@hack/dev-env/start-principal.sh`:
- Line 55: The variable SCRIPTPATH is declared but never used; remove the unused
assignment line that sets SCRIPTPATH in the start-principal.sh script (the line
beginning with SCRIPTPATH="$( cd -- "$(dirname "$0")" ... )") to eliminate dead
code and avoid shellcheck warnings, or if the script should use the script
directory, replace usages to reference SCRIPTPATH in relevant commands such as
sourcing or path calculations instead of hardcoding paths.

In `@principal/clusterregistration/clusterregistration.go`:
- Around line 60-125: When an existing cluster secret is "manual" (uses TLS CN)
we must not inject or refresh a bearer token; modify the flow in the block that
handles exists == true to first detect manual auth and bail out without calling
UpdateClusterBearerToken or CreateClusterWithBearerToken. Concretely: after
ClusterSecretExists(...) returns true, call a helper in the cluster package
(e.g., cluster.IsManualClusterSecret or cluster.GetClusterAuthMode) to determine
if the secret is manual; if it is, log and return nil; otherwise continue with
validateClusterToken(...) and only call cluster.UpdateClusterBearerToken(...)
when the secret is not manual. Also ensure
validateClusterToken/GetClusterBearerToken tolerate manual secrets (return
empty/no-op) or short-circuit earlier to avoid changing auth mode.

In `@principal/resource.go`:
- Around line 390-430: The listener address for the resource-proxy is currently
hardcoded and must be configurable and default to localhost; add a CLI flag
(e.g., --resource-proxy-listen-addr) and a ServerOption to set the listener
address on Server, introduce/replace the constant
defaultResourceProxyListenerAddr to "127.0.0.1:9090", wire the new ServerOption
into Server creation/initialization (where the server binds the resource proxy
listener) so the configured value is used instead of the hardcoded one, and
ensure any code that starts the listener reads Server.resourceProxyListenAddr
(or the new field) so operators can bind to 127.0.0.1 or override via the flag.
🧹 Nitpick comments (7)
hack/dev-env/start-principal.sh (1)

62-62: Consider validating the derived resource proxy address.

If the command substitution on line 53 fails or produces an empty result, --resource-proxy-address will receive :9090 which may cause unexpected behavior. Consider adding validation similar to the Redis address handling (lines 23-35).

💡 Example validation pattern
ARGOCD_AGENT_RESOURCE_PROXY=$(ip r show default | sed -e 's,.*\ src\ ,,' | sed -e 's,\ metric.*$,,')
if test -z "$ARGOCD_AGENT_RESOURCE_PROXY"; then
    echo "Could not determine resource proxy address from default route." >&2
    echo "Please set ARGOCD_AGENT_RESOURCE_PROXY manually" >&2
    exit 1
fi
export ARGOCD_AGENT_RESOURCE_PROXY
internal/issuer/jwt.go (1)

203-224: Consider a TTL/rotation policy for resource-proxy tokens.
Tokens are intentionally non-expiring; a leaked token remains valid until key rotation. If feasible, consider adding a configurable expiry (or periodic rotation) to reduce blast radius.

internal/argocd/cluster/cluster.go (1)

199-233: URL-encode agentName when building the resource-proxy URL.
This avoids malformed URLs if agentName contains characters that need escaping and keeps the query construction robust.

Proposed change
 import (
 	"context"
 	"errors"
 	"fmt"
 	"os"
+	"net/url"
 	"time"
@@
-	cluster := &appv1.Cluster{
-		Server: fmt.Sprintf("https://%s?agentName=%s", resourceProxyAddress, agentName),
+	rpURL := url.URL{Scheme: "https", Host: resourceProxyAddress}
+	q := rpURL.Query()
+	q.Set("agentName", agentName)
+	rpURL.RawQuery = q.Encode()
+
+	cluster := &appv1.Cluster{
+		Server: rpURL.String(),
 		Name:   agentName,
internal/argocd/cluster/cluster_test.go (1)

333-401: Consider asserting token persistence in the created secret.

The test currently checks labels only; adding a token assertion would cover the primary behavior more directly.

💡 Suggested test assertion
@@
 		require.NoError(t, err)
 		require.NotNil(t, secret)
 		require.Equal(t, "test-agent", secret.Labels[LabelKeyClusterAgentMapping])
 		require.Equal(t, "true", secret.Labels[LabelKeySelfRegisteredCluster])
+
+		token, err := GetClusterBearerToken(context.Background(), kubeclient, testNamespace, "test-agent")
+		require.NoError(t, err)
+		require.Equal(t, "test-bearer-token", token)
principal/apis/auth/auth_test.go (1)

150-170: Consider stubbing the auth issuer even though registration is expected to fail.

This avoids a nil deref if registration ever succeeds unexpectedly and keeps failures focused on the intended assertion.

🧪 Suggested test hardening
@@
 		mockIss := issuermock.NewIssuer(t)
 		mockIss.On("IssueResourceProxyToken", "user1").Return("test-token", nil)
+		mockIss.On("IssueAccessToken", encodedSubject, mock.Anything).Return("access", nil).Maybe()
+		mockIss.On("IssueRefreshToken", encodedSubject, mock.Anything).Return("refresh", nil).Maybe()
 		mgr := clusterregistration.NewClusterRegistrationManager(true, "argocd", "resource-proxy:8443", "", kubeclient, mockIss)
 
-		auths, err := NewServer(queues, ams, nil, WithClusterRegistrationManager(mgr))
+		auths, err := NewServer(queues, ams, mockIss, WithClusterRegistrationManager(mgr))
principal/apis/auth/auth.go (1)

128-134: Consider distinguishing registration failures from credential failures.

Registration errors currently return errAuthenticationFailed (gRPC Unauthenticated). If these can be transient, clients may not retry. Consider a more specific status or confirm agents retry on this code.

test/e2e/self_cluster_registration_test.go (1)

17-25: Preserve existing E2E env file contents when enabling self‑registration.

enableSelfClusterRegistration() overwrites fixture.EnvVariablesFromE2EFile, which can drop other test-configured env vars. Consider merging/adding just the two keys.

♻️ Suggested merge behavior
@@
-import (
+import (
 	"context"
 	"encoding/base64"
 	"fmt"
 	"os"
 	"os/exec"
+	"strings"
 	"testing"
 	"time"
@@
 func enableSelfClusterRegistration() error {
@@
-	// Set both the enable flag and the CA path environment variables
-	envVars := fmt.Sprintf("ARGOCD_PRINCIPAL_ENABLE_SELF_CLUSTER_REGISTRATION=true\nARGOCD_PRINCIPAL_RESOURCE_PROXY_TLS_CA_PATH=%s\n", caCertTempFile)
-
-	return os.WriteFile(fixture.EnvVariablesFromE2EFile, []byte(envVars), 0644)
+	// Set both the enable flag and the CA path environment variables (preserve existing entries)
+	existing, _ := os.ReadFile(fixture.EnvVariablesFromE2EFile)
+	envVars := string(existing)
+	if !strings.Contains(envVars, "ARGOCD_PRINCIPAL_ENABLE_SELF_CLUSTER_REGISTRATION=") {
+		envVars += "ARGOCD_PRINCIPAL_ENABLE_SELF_CLUSTER_REGISTRATION=true\n"
+	}
+	if !strings.Contains(envVars, "ARGOCD_PRINCIPAL_RESOURCE_PROXY_TLS_CA_PATH=") {
+		envVars += fmt.Sprintf("ARGOCD_PRINCIPAL_RESOURCE_PROXY_TLS_CA_PATH=%s\n", caCertTempFile)
+	}
+	return os.WriteFile(fixture.EnvVariablesFromE2EFile, []byte(envVars), 0644)
 }

Also applies to: 452-461

Comment on lines 390 to 430
// extractAgentFromAuth extracts the agent name from the request using multiple methods.
// Authentication methods in order of preference:
// 1. JWT bearer token in Authorization header (for self-registered clusters)
// 2. TLS client certificate Common Name (for manually created cluster secrets)
func (s *Server) extractAgentFromAuth(r *http.Request) (string, error) {
	logCtx := log().WithField("function", "extractAgentFromAuth")

	// Method 1: Try Authorization header first (Bearer token)
	// Used by self-registered clusters that store JWT tokens in cluster secrets
	authHeader := r.Header.Get("Authorization")
	if strings.HasPrefix(authHeader, "Bearer ") {
		logCtx.Debug("Attempting JWT bearer token authentication")
		token := strings.TrimPrefix(authHeader, "Bearer ")
		claims, err := s.issuer.ValidateResourceProxyToken(token)
		if err != nil {
			logCtx.WithError(err).Debug("Bearer token validation failed")
			return "", fmt.Errorf("invalid bearer token: %v", err)
		}
		subject, err := claims.GetSubject()
		if err != nil {
			logCtx.WithError(err).Debug("Could not get subject from token")
			return "", fmt.Errorf("could not get subject from token: %v", err)
		}
		logCtx.WithField("agent", subject).Debug("Successfully authenticated via bearer token")
		return subject, nil
	}

	// Method 2: Extract agent name from TLS client certificate CN
	// Used by manually created cluster secrets (via argocd-agentctl agent create)
	if r.TLS != nil && len(r.TLS.PeerCertificates) > 0 {
		cert := r.TLS.PeerCertificates[0]
		agentName := cert.Subject.CommonName
		if agentName != "" {
			logCtx.WithField("agent", agentName).Debug("Successfully authenticated via TLS client certificate CN")
			return agentName, nil
		}
		logCtx.Debug("TLS client certificate has empty Common Name")
	}

	logCtx.Debug("No authorization method found")
	return "", fmt.Errorf("no authorization found")
@coderabbitai coderabbitai bot commented Feb 5, 2026

⚠️ Potential issue | 🔴 Critical


The resource-proxy listener must be bound to localhost only and made configurable.

The extractAgentFromAuth method now accepts bearer tokens from Authorization headers, but defaultResourceProxyListenerAddr is hardcoded to "0.0.0.0:9090" (line 189 in principal/server.go). Without listener address configuration, the resource-proxy is exposed to all interfaces by default, allowing attackers to directly reach it and forge authentication headers to impersonate agents.

Add a CLI flag (e.g., --resource-proxy-listen-addr) and corresponding ServerOption to allow binding to localhost, and change the default to "127.0.0.1:9090". Network policies alone are insufficient if the operator can accidentally expose it.

🤖 Prompt for AI Agents
In `@principal/resource.go` around lines 390 - 430, The listener address for the
resource-proxy is currently hardcoded and must be configurable and default to
localhost; add a CLI flag (e.g., --resource-proxy-listen-addr) and a
ServerOption to set the listener address on Server, introduce/replace the
constant defaultResourceProxyListenerAddr to "127.0.0.1:9090", wire the new
ServerOption into Server creation/initialization (where the server binds the
resource proxy listener) so the configured value is used instead of the
hardcoded one, and ensure any code that starts the listener reads
Server.resourceProxyListenAddr (or the new field) so operators can bind to
127.0.0.1 or override via the flag.

Collaborator Author

I don't think it is a security risk. This is an existing implementation, and even if it is set to 0.0.0.0:9090, any attacker would still need a valid client certificate as well as a valid JWT bearer token.

Contributor


@jparsai You're absolutely right! I apologize for the confusion.

Looking at the TLS configuration code, the resource proxy does require client certificates at the TLS transport layer:

// cmd/argocd-agent/principal.go lines 546-552, 570-576
proxyTLS := &tls.Config{
    Certificates: []tls.Certificate{proxyCert},
    ClientAuth: tls.RequireAndVerifyClientCert,
    ClientCAs:  clientCA,
}

This means an attacker would need both:

  1. A valid client certificate signed by the CA to complete the TLS handshake
  2. A valid JWT bearer token OR a client certificate with the correct CN for application-level authentication

The extractAgentFromAuth method only runs after the TLS handshake succeeds. Without a valid client certificate, connections are rejected at the TLS layer before reaching the application logic.

However, I notice from the PR discussion that there's consideration for making the client certificate requirement configurable (via --resource-proxy-require-client-cert). If that feature is added in a future PR to support scenarios like web terminals for self-registered clusters, then binding to localhost would become a critical security requirement, as the JWT-only authentication path would be exposed.

For the current implementation in this PR, binding to 0.0.0.0:9090 is acceptable given the mandatory TLS client certificate requirement.




@jparsai
Collaborator Author

jparsai commented Feb 5, 2026

@jannfis @drewbailey I found one issue while trying the web-based terminal with a self-registered cluster. With the bearer token implementation, agents connect successfully and apps are synced, but when opening a terminal in the UI, the principal logs this error:

level=error msg="http: TLS handshake error from 172.19.0.3:58797: tls: client didn't provide a certificate" module=proxy

After debugging this issue, I found the root cause: a self-registered cluster secret holds a JWT bearer token, but it does not have the client certificate that Argo CD uses to fetch resources.
I am still trying to find a fix for it. Do you have any suggestions?

@drewbailey
Contributor

hey @jparsai,

Yeah, the resource proxy TLS configuration defaults to tls.RequireAndVerifyClientCert, but self-registered clusters use JWT bearer tokens instead of client certificates.

on my branch I addressed this by making the client certificate requirement configurable via --resource-proxy-require-client-cert flag (default: true). When using JWT bearer token authentication for self-registered clusters, set this to false.

Principal configuration:
argocd-agent principal
--resource-proxy-require-client-cert=false
...

Or via environment variable:
ARGOCD_PRINCIPAL_RESOURCE_PROXY_REQUIRE_CLIENT_CERT=false

The resource proxy authentication would now support two modes:

  1. JWT Bearer Token (new) - Token in Authorization: Bearer header, validated by principal's signing key
  2. TLS Client Certificate (legacy) - Agent name extracted from certificate CN via query param ?agentName=

The changes on my branch:

  1. New flag in cmd/argocd-agent/principal.go:
  command.Flags().BoolVar(&resourceProxyRequireClientCert, "resource-proxy-require-client-cert",
      env.BoolWithDefault("ARGOCD_PRINCIPAL_RESOURCE_PROXY_REQUIRE_CLIENT_CERT", true),
      "Require TLS client certificates for resource proxy (disable for JWT bearer token auth)")

 2. TLS config change (in both getResourceProxyTLSConfigFromKube and getResourceProxyTLSConfigFromFiles):
  clientAuth := tls.NoClientCert
  if requireClientCert {
      clientAuth = tls.RequireAndVerifyClientCert
  }

  proxyTLS := &tls.Config{
      Certificates: []tls.Certificate{proxyCert},
      ClientAuth:   clientAuth,  // was: tls.RequireAndVerifyClientCert
      ClientCAs:    clientCA,
  }

  3. Authentication in principal/resource.go - JWT bearer token support:
  func (s *Server) extractAgentFromAuth(r *http.Request) (string, error) {
      // Try Authorization header first (Bearer token)
      authHeader := r.Header.Get("Authorization")
      if strings.HasPrefix(authHeader, "Bearer ") {
          token := strings.TrimPrefix(authHeader, "Bearer ")
          claims, err := s.issuer.ValidateResourceProxyToken(token)
          if err != nil {
              return "", fmt.Errorf("invalid bearer token: %w", err)
          }
          return claims.GetSubject()
      }

      // Fall back to query parameter for backward compatibility with TLS client cert mode
      agentName := r.URL.Query().Get("agentName")
      if agentName != "" {
          if r.TLS == nil || len(r.TLS.PeerCertificates) < 1 {
              return "", fmt.Errorf("query param mode requires TLS client certificate")
          }
          return agentName, nil
      }

      return "", fmt.Errorf("no authorization found")
  }

  ---
  Usage for self-registered clusters:
  argocd-agent principal \
    --enable-self-cluster-registration=true \
    --resource-proxy-require-client-cert=false \
    ...

@jannfis
Collaborator

jannfis commented Feb 5, 2026

Yeah, the resource proxy TLS configuration defaults to tls.RequireAndVerifyClientCert, but self-registered clusters use JWT bearer tokens instead of client certificates.

Shouldn't they use both? A generic client certificate plus a unique JWT? Or is this a limitation of the k8s client, that it can use only either of them?

@drewbailey
Contributor

@jannfis yeah that makes sense,

I can stop including these if they are annoying, but just in case, patch with tls + jwt for identity https://gist.github.com/drewbailey/f1e39a11d68cb1bb4a26bb43d0df40af

Assisted by: Cursor

Signed-off-by: Jayendra Parsai <[email protected]>
Assisted by: Cursor

Signed-off-by: Jayendra Parsai <[email protected]>
Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 4

🤖 Fix all issues with AI agents
In `@principal/apis/auth/auth.go`:
- Around line 127-134: The current authentication flow in auth.go treats any
error from s.clusterRegistrationManager.RegisterCluster(ctx, clientID) as fatal,
which blocks reconnections on transient K8s failures; change this so
RegisterCluster failures are non-fatal when the cluster secret already exists:
call a check method on the clusterRegistrationManager (e.g., ClusterSecretExists
or IsClusterRegistered) before returning error, and if the secret exists treat
the RegisterCluster error as a warning (log with warning level and continue
authentication) but if the secret does not exist keep returning
errAuthenticationFailed; if such an existence-check method is missing, add it to
the ClusterRegistrationManager interface and implement it, and replace the
current unconditional log+return in the RegisterCluster error branch
accordingly.

In `@principal/clusterregistration/clusterregistration_test.go`:
- Around line 90-95: The mock issuer created by createMockIssuer sets
ValidateResourceProxyToken to return (nil, nil) which violates the production
contract (claims must be non-nil when err==nil); update createMockIssuer so
ValidateResourceProxyToken returns either a valid claims object (implementing
GetSubject with a sensible test subject) and nil error on success, or return
(nil, someError) on failure, and remove or tighten the .Maybe() so tests
exercise the successful path; ensure IssueResourceProxyToken and
ValidateResourceProxyToken expectations match actual test cases that call
claims.GetSubject().

In `@test/e2e/self_cluster_registration_test.go`:
- Around line 52-64: The SetupSuite currently swallows errors from
getClusterSecret leaving suite.originalSecret nil and causing later no-ops;
change SetupSuite (in SelfClusterRegistrationTestSuite) to fail the suite
immediately when getClusterSecret returns an error by calling suite.T().Fatalf
(or Fatal) with the error so tests do not run without a restoreable
originalSecret, otherwise keep the existing DeepCopy/log behavior when no error
occurs.
- Around line 384-417: The test Test_NoSecretAndSelfRegistrationDisabled
currently asserts clusterSecretExists immediately after fixture.RestartAgent,
which can pass before the agent attempts connection; update the test to wait for
the agent to attempt/connect before asserting absence of the secret by using the
existing connection check helpers: after calling fixture.RestartAgent and
fixture.CheckReadiness, use requires.Eventually (or requires.Never/Consistently
if preferred) with HasConnectionStatus (or another connection-status helper used
elsewhere) to observe that the agent has attempted a connection (or reached a
failed connection state), and only then call
requires.False(clusterSecretExists(...)) to ensure the assertion runs after a
real connection attempt. Ensure you reference
Test_NoSecretAndSelfRegistrationDisabled, fixture.RestartAgent/CheckReadiness,
HasConnectionStatus (or the equivalent), and clusterSecretExists when making the
change.
🧹 Nitpick comments (15)
principal/clusterregistration/clusterregistration_test.go (1)

44-46: Test helper getClusterSecretName duplicates production naming logic.

If the production getClusterSecretName (in the clusterregistration package or cluster package) changes its naming convention, this test helper won't break, silently diverging. Consider importing the production function or using a package-level constant for the prefix.

internal/issuer/jwt.go (1)

203-217: Non-expiring resource-proxy JWTs — consider adding a long TTL.

The token is intentionally issued without ExpiresAt, meaning it remains valid for the lifetime of the signing key. While the PR discussion reflects this as a deliberate choice (with refresh-on-reconnect as mitigation), a leaked token cannot be revoked or time-bounded.

Consider adding a generous but finite TTL (e.g., 30–90 days) and relying on the existing refresh logic to reissue tokens before expiry. This limits the blast radius of a compromised token without impacting normal operation.

principal/resource_test.go (1)

294-398: Good test coverage for the dual authentication flow.

The Test_extractAgentFromAuth suite covers the key scenarios well: valid/invalid bearer tokens, missing auth, TLS CN fallback, empty CN, and precedence rules.

One edge case not covered: a malformed Authorization header that has Bearer prefix but an empty token string (i.e., "Bearer " with trailing space only). Currently extractAgentFromAuth would pass an empty string to ValidateResourceProxyToken — worth a quick test to confirm the issuer rejects it gracefully.

principal/resource.go (1)

394-431: Excessive Info-level logging on every proxied request.

extractAgentFromAuth runs on every resource proxy request. Lines 401, 413, and 423 all log at Info level for the normal success path, which will be extremely noisy under load (every K8s API call through the proxy triggers these). Reserve Info for notable events (auth failures, first-time auth); use Debug or Trace for per-request success logging.

♻️ Suggested log-level adjustment
-	logCtx.Info("Attempting JWT bearer token authentication")
+	logCtx.Debug("Attempting JWT bearer token authentication")
 	token := strings.TrimPrefix(authHeader, "Bearer ")
 	claims, err := s.issuer.ValidateResourceProxyToken(token)
 ...
-	logCtx.WithField("agent", subject).Info("Successfully authenticated via bearer token")
+	logCtx.WithField("agent", subject).Debug("Successfully authenticated via bearer token")
 	return subject, nil
 ...
-	logCtx.WithField("agent", agentName).Info("Successfully authenticated via TLS client certificate CN")
+	logCtx.WithField("agent", agentName).Debug("Successfully authenticated via TLS client certificate CN")
 	return agentName, nil
 ...
-	logCtx.Info("TLS client certificate has empty Common Name")
+	logCtx.Debug("TLS client certificate has empty Common Name")
 ...
-	logCtx.Info("No authorization method found")
+	logCtx.Debug("No authorization method found")
principal/server_test.go (1)

141-149: Potential test panic: assert.NotNil won't stop the test if it fails.

Line 147 uses assert.NotNil for s, but line 148 dereferences s.clusterRegistrationManager without a prior require.NotNil. If the server creation ever fails to initialize the manager, this test will panic instead of producing a clear failure. Use require.NotNil for guards before dereference.

♻️ Safer assertions
 		assert.NoError(t, err)
-		assert.NotNil(t, s)
-		assert.False(t, s.clusterRegistrationManager.IsSelfClusterRegistrationEnabled())
+		require.NotNil(t, s)
+		require.NotNil(t, s.clusterRegistrationManager)
+		assert.False(t, s.clusterRegistrationManager.IsSelfClusterRegistrationEnabled())

Same pattern applies to the test at lines 151–160.

internal/argocd/cluster/cluster.go (3)

271-313: UpdateClusterBearerToken is vulnerable to update conflicts without retry.

The read-modify-write on lines 279–308 doesn't use the secret's ResourceVersion for optimistic concurrency (it does implicitly through the fetched secret object), but there's no retry on Conflict errors. If the cluster secret is updated concurrently (e.g., by ArgoCD's own reconciliation or a second auth attempt), the update will fail and propagate as an auth failure.

Consider adding a retry loop for IsConflict errors, similar to the IsAlreadyExists handling in CreateClusterWithBearerToken.


342-361: Noisy Info-level logging on the read path.

GetClusterBearerToken (line 345) logs at Info on every call. Since this is invoked from validateClusterToken during each authentication, it generates excessive log volume. Use Debug instead.

♻️ Lower log level
 func GetClusterBearerToken(ctx context.Context, kubeclient kubernetes.Interface, namespace, agentName string) (string, error) {
 	logCtx := log().WithField("agent", agentName).WithField("process", "self-registration")
-	logCtx.Info("Retrieving bearer token from cluster secret")
+	logCtx.Debug("Retrieving bearer token from cluster secret")

198-260: Verbose Info logging on the create path — same concern as elsewhere.

Lines 203, 211, 219, 258 all log at Info for normal operation during cluster creation. Since CreateClusterWithBearerToken is called from the auth flow, this adds up quickly. Consider Debug for the intermediate steps and keep Info only for the final "Successfully created" message.

principal/clusterregistration/clusterregistration.go (3)

60-103: Three redundant GETs to the same secret on every reconnect.

The happy path for an existing self-registered cluster calls:

  1. ClusterSecretExists → GET secret (Line 61)
  2. IsClusterSelfRegistered → GET same secret (Line 70)
  3. GetClusterBearerToken (inside validateClusterToken) → GET same secret again (Line 119)

This triples the K8s API load per authentication. Consider fetching the secret once and passing the data through, or creating a single helper that returns (exists bool, selfRegistered bool, bearerToken string, err error).


117-141: Subject mismatch in validateClusterToken should be logged as a warning.

Line 140 silently returns false when the token's subject doesn't match the agent name. This could indicate a misconfigured or tampered cluster secret (token for agent A stored in agent B's secret). A warning log would aid troubleshooting and security auditing.

♻️ Add warning for subject mismatch
+	if subject != agentName {
+		log().WithField("agent", agentName).WithField("tokenSubject", subject).
+			Warn("Bearer token subject does not match agent name")
+	}
 	return subject == agentName, nil

51-115: Verbose Info logging on every registration attempt.

The RegisterCluster method logs ~5–7 Info messages during the normal "exists + valid token" path (lines 58, 67, 81, 91). Since this runs on every agent authentication, the volume will be significant. Consider Debug for intermediate steps and Info only for state changes (creation, token refresh).

test/e2e/self_cluster_registration_test.go (4)

106-130: ensureOriginalSecretExists and restoreOriginalSecret are near-identical.

Both methods follow the same get-secret → if-not-found-create → if-self-registered-delete-and-create flow. Consider extracting a single shared method and calling it from both SetupTest and TearDownTest.

Also applies to: 132-157


427-437: clusterSecretExists creates a new kubernetes.Interface on every call, including inside Eventually loops.

This is called repeatedly (up to 30 times per Eventually block). Consider constructing the client once outside the loop, or using the fixture.KubeClient.Get approach (like getClusterSecret) and checking for IsNotFound instead.


119-122: Non-NotFound errors are silently swallowed in setup/teardown.

If getClusterSecret fails with, e.g., a network error, both ensureOriginalSecretExists (line 120) and restoreOriginalSecret (line 147) just log and return. This could leave tests running against a stale or missing secret without any signal. Consider failing the test (suite.Require().NoError(err)) or at minimum using suite.T().Fatalf so issues are visible early.

Also applies to: 146-149


367-381: One-directional key check — only verifies manual ⊆ self-registered.

The loop on line 378 checks that every key in manualSecret.Data exists in selfRegSecret.Data, but not the reverse. If the self-registered secret has unexpected extra data keys, the test won't catch it. This may be intentional (e.g., bearer token field), but if so, a comment clarifying that would help future readers.

Assisted by: Cursor

Signed-off-by: Jayendra Parsai <[email protected]>
@jparsai
Collaborator Author

jparsai commented Feb 6, 2026

@drewbailey, @jannfis I changed the implementation; it now uses both a JWT bearer token and a shared client certificate to authenticate self-registered clusters. PTAL.
I also updated the E2E tests for the web-based terminal and resource proxy to ensure any other issue caused by the self-registered secret is not missed in the future.

Thanks again @drewbailey for providing the patch; it is very helpful and gives me an idea of the approach you have in mind for the implementation.



Development

Successfully merging this pull request may close these issues.

Self-registration of agents

4 participants