feat: enable self registration of agents #736
Conversation
📝 Walkthrough

Adds optional self-cluster registration and resource-proxy support: new server options and CLI flags to enable/configure registration and the resource-proxy address; a ClusterRegistrationManager to create/refresh cluster secrets and bearer tokens; Issuer extensions for resource-proxy tokens; auth and resource-path changes to use bearer tokens (with TLS fallback); unit and E2E tests and dev-env/startup wiring.
Sequence Diagram

```mermaid
sequenceDiagram
    participant Agent as Agent
    participant Principal as Principal Server
    participant Issuer as Token Issuer
    participant K8s as Kubernetes API
    Agent->>Principal: Connect (TLS client cert) or Bearer token
    Principal->>Principal: extractAgentFromAuth() -> agentName
    Principal->>Principal: Is self-cluster registration enabled?
    alt enabled
        Principal->>K8s: ClusterSecretExists(agentName)?
        alt secret exists
            K8s-->>Principal: Secret (may contain bearer token)
            Principal->>Issuer: ValidateResourceProxyToken(token)
            alt token invalid or subject mismatch
                Principal->>Issuer: IssueResourceProxyToken(agentName)
                Principal->>K8s: Update cluster Secret with new token
            end
        else secret missing
            Principal->>Issuer: IssueResourceProxyToken(agentName)
            Principal->>K8s: Read client cert secret (tls.crt, tls.key, ca.crt)
            Principal->>K8s: Create cluster Secret (labelled self-registered) with token and TLS material
        end
    end
    Principal->>Principal: Issue access tokens for agent auth
    Principal-->>Agent: Return auth tokens / connection accepted
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
@coderabbitai review
✅ Actions performed: Review triggered.
Actionable comments posted: 2
🤖 Fix all issues with AI agents
In `@internal/argocd/cluster/cluster.go`:
- Around line 31-32: Remove the duplicate unaliased import of
"github.com/argoproj/argo-cd/v3/pkg/apis/application/v1alpha1" and keep only the
aliased import appv1; then update all code references that use the plain
v1alpha1 identifier (e.g., types like v1alpha1.Application,
v1alpha1.ApplicationSpec, any functions or variables named with v1alpha1) to use
appv1 instead (appv1.Application, appv1.ApplicationSpec, etc.) so the package
builds without the duplicate-import error.
- Around line 195-238: CreateCluster currently fails when Secrets.Create returns
an AlreadyExists error during concurrent registration; change the post-create
error handling in CreateCluster so that if
kubeclient.CoreV1().Secrets(namespace).Create(...) returns an error that
satisfies apierrors.IsAlreadyExists(err) (from
k8s.io/apimachinery/pkg/api/errors) you treat it as success and return nil
instead of an error, otherwise return the wrapped error as before; update the
CreateCluster function's error handling for the Create call to check
apierrors.IsAlreadyExists and return nil for that case.
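A minimal sketch of that `AlreadyExists` handling, using the names from the prompt above (the surrounding function and wiring are illustrative, not the repository's actual code):

```go
import (
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// createClusterSecret tolerates concurrent registration: an AlreadyExists
// error from the API server is treated as success, since another replica or
// a concurrent auth attempt created the secret first.
func createClusterSecret(ctx context.Context, kubeclient kubernetes.Interface, namespace string, secret *corev1.Secret) error {
	_, err := kubeclient.CoreV1().Secrets(namespace).Create(ctx, secret, metav1.CreateOptions{})
	if err != nil {
		if apierrors.IsAlreadyExists(err) {
			return nil
		}
		return fmt.Errorf("failed to create cluster secret: %w", err)
	}
	return nil
}
```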
🧹 Nitpick comments (3)
hack/dev-env/start-principal.sh (1)
50-51: Address shellcheck warning and add validation for derived address.

The variable declaration and assignment should be separated per SC2155 to avoid masking return values. Additionally, if no default route exists, `ARGOCD_AGENT_RESOURCE_PROXY` will be empty, causing an invalid `--resource-proxy-address` value of `:9090`.

🔧 Proposed fix

```diff
-export ARGOCD_AGENT_RESOURCE_PROXY=$(ip r show default | sed -e 's,.*\ src\ ,,' | sed -e 's,\ metric.*$,,')
+ARGOCD_AGENT_RESOURCE_PROXY=$(ip r show default | sed -e 's,.*\ src\ ,,' | sed -e 's,\ metric.*$,,')
+if test -z "$ARGOCD_AGENT_RESOURCE_PROXY"; then
+  echo "Could not determine resource proxy address from default route." >&2
+  echo "Please set ARGOCD_AGENT_RESOURCE_PROXY manually" >&2
+  exit 1
+fi
+export ARGOCD_AGENT_RESOURCE_PROXY
```

principal/apis/auth/auth.go (1)
127-134: Consider whether registration failure should block authentication.

Currently, if `RegisterCluster` fails, the entire authentication is rejected even though the agent's credentials were validated successfully. This creates a tight coupling where transient cluster registration issues (e.g., Kubernetes API errors, CA secret temporarily unavailable) will prevent authenticated agents from connecting. Should this be a non-blocking error that logs a warning but allows authentication to proceed? The cluster secret could be created on a subsequent connection attempt.

💡 Alternative: Non-blocking registration

```diff
 // If self cluster registration is enabled, register the agent's cluster and create cluster secret if it doesn't exist
 if s.clusterRegistrationManager != nil && s.clusterRegistrationManager.IsSelfClusterRegistrationEnabled() {
 	if err := s.clusterRegistrationManager.RegisterCluster(ctx, clientID); err != nil {
-		logCtx.WithError(err).WithField("client", clientID).Error("Failed to self register agent's cluster")
-		return nil, errAuthenticationFailed
+		// Log error but allow authentication to proceed - registration can be retried on next connection
+		logCtx.WithError(err).WithField("client", clientID).Warn("Failed to self register agent's cluster, will retry on next connection")
 	}
 }
```

test/e2e/self_cluster_registration_test.go (1)
444-451: Minor: `disableSelfClusterRegistration` silently ignores errors.

The `os.Remove` call may fail for reasons other than "file not found" (e.g., permissions). Consider returning the error only if it's not `os.ErrNotExist`:

🔧 Proposed improvement

```diff
 func disableSelfClusterRegistration() error {
-	return os.Remove(fixture.EnvVariablesFromE2EFile)
+	err := os.Remove(fixture.EnvVariablesFromE2EFile)
+	if err != nil && !os.IsNotExist(err) {
+		return err
+	}
+	return nil
 }
```
Codecov Report

❌ Patch coverage

Additional details and impacted files:

```
@@            Coverage Diff             @@
##             main     #736      +/-   ##
==========================================
+ Coverage   46.30%   46.93%   +0.62%
==========================================
  Files         101      103       +2
  Lines       12480    12739     +259
==========================================
+ Hits         5779     5979     +200
- Misses       6146     6194      +48
- Partials      555      566      +11
```

☔ View full report in Codecov by Sentry.
Force-pushed: cdbe69e to a54ebbd
@coderabbitai review

✅ Actions performed: Review triggered.
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In `@hack/dev-env/start-principal.sh`:
- Line 50: Split the combined declaration+assignment for
ARGOCD_AGENT_RESOURCE_PROXY so the subshell return value isn't masked: first
create the variable (or export an empty name) and then perform the command
substitution on a separate line, e.g. declare or export
ARGOCD_AGENT_RESOURCE_PROXY, then ARGOCD_AGENT_RESOURCE_PROXY=$(ip r show
default | sed -e 's,.*\ src\ ,,' | sed -e 's,\ metric.*$,,') (and export after
if needed); update the lines that reference ARGOCD_AGENT_RESOURCE_PROXY
accordingly.
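A quick sketch of the SC2155-compliant pattern described in the prompt (exporting first keeps the variable exported after the later assignment):

```sh
# Declare/export first, then assign, so the command substitution's
# exit status is not masked by the declaration.
export ARGOCD_AGENT_RESOURCE_PROXY
ARGOCD_AGENT_RESOURCE_PROXY=$(ip r show default | sed -e 's,.*\ src\ ,,' -e 's,\ metric.*$,,')
```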
🧹 Nitpick comments (4)
hack/dev-env/start-principal.sh (1)
52-52: SCRIPTPATH variable is unused.

The `SCRIPTPATH` variable is defined but never referenced in the script. Consider removing it or using it if needed.

🧹 Proposed fix

```diff
-SCRIPTPATH="$( cd -- "$(dirname "$0")" >/dev/null 2>&1 ; pwd -P )"
```

principal/clusterregistration/clusterregistration_test.go (1)
38-40: Consider using the function from the cluster package instead of duplicating.

The `getClusterSecretName` helper duplicates logic from `internal/argocd/cluster/cluster.go`. If the original function were exported (`GetClusterSecretName`), this duplication could be avoided, reducing the risk of divergence.

test/e2e/self_cluster_registration_test.go (1)
449-451: Consider handling the case where the env file doesn't exist.

`os.Remove` returns an error if the file doesn't exist. While this error is ignored, it may be cleaner to explicitly handle `os.IsNotExist`.

🔧 Proposed fix

```diff
 func disableSelfClusterRegistration() error {
-	return os.Remove(fixture.EnvVariablesFromE2EFile)
+	err := os.Remove(fixture.EnvVariablesFromE2EFile)
+	if err != nil && !os.IsNotExist(err) {
+		return err
+	}
+	return nil
 }
```

internal/argocd/cluster/cluster.go (1)
288-290: Validate agentName before constructing secret name.

Ensure `agentName` conforms to DNS subdomain rules (e.g. via `validation.NameIsDNS1123Subdomain(agentName)`) and return an error if invalid.
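A sketch of such a check using apimachinery's `IsDNS1123Subdomain` helper (the wrapping function is illustrative):

```go
import (
	"fmt"

	utilvalidation "k8s.io/apimachinery/pkg/util/validation"
)

// validateAgentName rejects names that are not valid DNS-1123 subdomains,
// since the agent name is embedded in the cluster secret's name.
func validateAgentName(agentName string) error {
	if errs := utilvalidation.IsDNS1123Subdomain(agentName); len(errs) > 0 {
		return fmt.Errorf("invalid agent name %q: %v", agentName, errs)
	}
	return nil
}
```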
@jparsai Thanks for this - self-registration will be really nice. I ran into friction testing this in an environment that uses cert-manager with an external ClusterIssuer (where the CA private key isn't accessible to workloads).

Problem

Self-registration currently requires the CA private key (argocd-agent-ca with tls.key) to generate unique client certificates per agent. This works when you manage your own CA, but breaks in environments where the CA private key isn't accessible to workloads (e.g., cert-manager with an external ClusterIssuer).

Root Cause

The resource proxy identifies agents by extracting CommonName from the client certificate (principal/resource.go:66). This requires a unique cert per agent, hence the need to generate/sign certs at registration time.

Proposed Simplification

The cluster secret URL already contains the agent name as a query parameter. We could take the agent identity from that query parameter instead, so a single shared client certificate would suffice. This allows self-registration to work without access to the CA private key.

I've prototyped this locally - happy to open a follow-up PR if there's interest. The changes are fairly minimal (~50 lines across resource.go, cluster.go, and clusterregistration.go). Gist of the patch: https://gist.github.com/drewbailey/9591238eb95be9cfd8dba1e0a383289b
Thanks @drewbailey, I will address this issue and make the changes you suggested.
Yes, I agree that taking the identity from the client cert in the case of the resource-proxy might be a bit too much. As you pointed out, it would require another unique client cert per agent, in addition to the one the agent already uses for identification. I'm thinking about your proposal to take the identity from the cluster's secret URL.
Force-pushed: 7ab50dd to 6bdfab6
Current diff / suggestion for using JWT: https://gist.github.com/drewbailey/6a9f823a5a222c37b1271c257a8f9cbc
Force-pushed: 03e0d3f to fd2b44f
Thanks @drewbailey and @jannfis for your suggestions, I have modified the implementation to address them. PTAL. Thanks @drewbailey for providing the patch, it was very helpful. I mostly followed your patch, but modified a few things as per my understanding. Please let me know if you feel the implementation is missing something.
Actionable comments posted: 4
🤖 Fix all issues with AI agents
In `@cmd/argocd-agent/principal.go`:
- Around line 102-103: When enableSelfClusterRegistration is true, perform a
preflight validation at startup (where flags/options are parsed) to check
required inputs for self-registration: ensure CA data is present (e.g., caData
or caFile option used by cluster secret creation) and that resource-proxy
settings/credentials (the address/auth config the server uses to talk to the
resource proxy) are provided; if any required input is missing, log a clear
error with details and exit non‑zero immediately so the server fails fast.
Locate and add this validation around the enableSelfClusterRegistration usage
(the startup/initialization block using enableSelfClusterRegistration and the
code paths that perform cluster secret creation/resource proxy registration
referenced at the locations around lines ~102, ~331-335, and ~493-495) so
self-registration is gated before the server starts (a sketch follows after this list).
In `@hack/dev-env/start-principal.sh`:
- Line 55: The variable SCRIPTPATH is declared but never used; remove the unused
assignment line that sets SCRIPTPATH in the start-principal.sh script (the line
beginning with SCRIPTPATH="$( cd -- "$(dirname "$0")" ... )") to eliminate dead
code and avoid shellcheck warnings, or if the script should use the script
directory, replace usages to reference SCRIPTPATH in relevant commands such as
sourcing or path calculations instead of hardcoding paths.
In `@principal/clusterregistration/clusterregistration.go`:
- Around line 60-125: When an existing cluster secret is "manual" (uses TLS CN)
we must not inject or refresh a bearer token; modify the flow in the block that
handles exists == true to first detect manual auth and bail out without calling
UpdateClusterBearerToken or CreateClusterWithBearerToken. Concretely: after
ClusterSecretExists(...) returns true, call a helper in the cluster package
(e.g., cluster.IsManualClusterSecret or cluster.GetClusterAuthMode) to determine
if the secret is manual; if it is, log and return nil; otherwise continue with
validateClusterToken(...) and only call cluster.UpdateClusterBearerToken(...)
when the secret is not manual. Also ensure
validateClusterToken/GetClusterBearerToken tolerate manual secrets (return
empty/no-op) or short-circuit earlier to avoid changing auth mode.
In `@principal/resource.go`:
- Around line 390-430: The listener address for the resource-proxy is currently
hardcoded and must be configurable and default to localhost; add a CLI flag
(e.g., --resource-proxy-listen-addr) and a ServerOption to set the listener
address on Server, introduce/replace the constant
defaultResourceProxyListenerAddr to "127.0.0.1:9090", wire the new ServerOption
into Server creation/initialization (where the server binds the resource proxy
listener) so the configured value is used instead of the hardcoded one, and
ensure any code that starts the listener reads Server.resourceProxyListenAddr
(or the new field) so operators can bind to 127.0.0.1 or override via the flag.
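A minimal sketch of the fail-fast preflight suggested in the first item above (the flag and variable names are assumptions drawn from the prompt, not the command's actual option names):

```go
import "fmt"

// validateSelfRegistrationOptions is a hypothetical preflight helper for
// cmd/argocd-agent/principal.go; option names are illustrative.
func validateSelfRegistrationOptions(enableSelfClusterRegistration bool, caFile, resourceProxyAddress string) error {
	if !enableSelfClusterRegistration {
		return nil
	}
	if caFile == "" {
		return fmt.Errorf("self cluster registration requires CA material for cluster secret creation")
	}
	if resourceProxyAddress == "" {
		return fmt.Errorf("self cluster registration requires --resource-proxy-address")
	}
	return nil
}
```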
🧹 Nitpick comments (7)
hack/dev-env/start-principal.sh (1)
62-62: Consider validating the derived resource proxy address.

If the command substitution on line 53 fails or produces an empty result, `--resource-proxy-address` will receive `:9090`, which may cause unexpected behavior. Consider adding validation similar to the Redis address handling (lines 23-35).

💡 Example validation pattern

```sh
ARGOCD_AGENT_RESOURCE_PROXY=$(ip r show default | sed -e 's,.*\ src\ ,,' | sed -e 's,\ metric.*$,,')
if test -z "$ARGOCD_AGENT_RESOURCE_PROXY"; then
	echo "Could not determine resource proxy address from default route." >&2
	echo "Please set ARGOCD_AGENT_RESOURCE_PROXY manually" >&2
	exit 1
fi
export ARGOCD_AGENT_RESOURCE_PROXY
```

internal/issuer/jwt.go (1)
203-224: Consider a TTL/rotation policy for resource-proxy tokens.

Tokens are intentionally non-expiring; a leaked token remains valid until key rotation. If feasible, consider adding a configurable expiry (or periodic rotation) to reduce blast radius.
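If a TTL were adopted, issuance could look roughly like this. This is a sketch assuming a golang-jwt/v5-style issuer; the actual signing method and key handling live in internal/issuer/jwt.go:

```go
import (
	"time"

	"github.com/golang-jwt/jwt/v5"
)

// issueResourceProxyTokenWithTTL keeps the same subject claim as today but
// adds an expiry so a leaked token ages out; the refresh-on-reconnect path
// would reissue well before ttl elapses. Key type depends on the signing
// method (e.g. *rsa.PrivateKey for RS256).
func issueResourceProxyTokenWithTTL(key any, agentName string, ttl time.Duration) (string, error) {
	now := time.Now()
	claims := jwt.RegisteredClaims{
		Subject:   agentName,
		IssuedAt:  jwt.NewNumericDate(now),
		ExpiresAt: jwt.NewNumericDate(now.Add(ttl)),
	}
	return jwt.NewWithClaims(jwt.SigningMethodRS256, claims).SignedString(key)
}
```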
199-233: URL-encode `agentName` when building the resource-proxy URL.

This avoids malformed URLs if agentName contains characters that need escaping and keeps the query construction robust.

Proposed change

```diff
 import (
 	"context"
 	"errors"
 	"fmt"
 	"os"
+	"net/url"
 	"time"
@@
-	cluster := &appv1.Cluster{
-		Server: fmt.Sprintf("https://%s?agentName=%s", resourceProxyAddress, agentName),
+	rpURL := url.URL{Scheme: "https", Host: resourceProxyAddress}
+	q := rpURL.Query()
+	q.Set("agentName", agentName)
+	rpURL.RawQuery = q.Encode()
+
+	cluster := &appv1.Cluster{
+		Server: rpURL.String(),
 		Name:   agentName,
```

internal/argocd/cluster/cluster_test.go (1)
333-401: Consider asserting token persistence in the created secret.

The test currently checks labels only; adding a token assertion would cover the primary behavior more directly.

💡 Suggested test assertion

```diff
@@
 	require.NoError(t, err)
 	require.NotNil(t, secret)
 	require.Equal(t, "test-agent", secret.Labels[LabelKeyClusterAgentMapping])
 	require.Equal(t, "true", secret.Labels[LabelKeySelfRegisteredCluster])
+
+	token, err := GetClusterBearerToken(context.Background(), kubeclient, testNamespace, "test-agent")
+	require.NoError(t, err)
+	require.Equal(t, "test-bearer-token", token)
```

principal/apis/auth/auth_test.go (1)
150-170: Consider stubbing the auth issuer even though registration is expected to fail.

This avoids a nil deref if registration ever succeeds unexpectedly and keeps failures focused on the intended assertion.

🧪 Suggested test hardening

```diff
@@
 	mockIss := issuermock.NewIssuer(t)
 	mockIss.On("IssueResourceProxyToken", "user1").Return("test-token", nil)
+	mockIss.On("IssueAccessToken", encodedSubject, mock.Anything).Return("access", nil).Maybe()
+	mockIss.On("IssueRefreshToken", encodedSubject, mock.Anything).Return("refresh", nil).Maybe()
 	mgr := clusterregistration.NewClusterRegistrationManager(true, "argocd", "resource-proxy:8443", "", kubeclient, mockIss)
-	auths, err := NewServer(queues, ams, nil, WithClusterRegistrationManager(mgr))
+	auths, err := NewServer(queues, ams, mockIss, WithClusterRegistrationManager(mgr))
```

principal/apis/auth/auth.go (1)
128-134: Consider distinguishing registration failures from credential failures.

Registration errors currently return `errAuthenticationFailed` (gRPC Unauthenticated). If these can be transient, clients may not retry. Consider a more specific status or confirm agents retry on this code.
17-25: Preserve existing E2E env file contents when enabling self-registration.

`enableSelfClusterRegistration()` overwrites `fixture.EnvVariablesFromE2EFile`, which can drop other test-configured env vars. Consider merging/adding just the two keys.

♻️ Suggested merge behavior

```diff
@@
 import (
 	"context"
 	"encoding/base64"
 	"fmt"
 	"os"
 	"os/exec"
+	"strings"
 	"testing"
 	"time"
@@
 func enableSelfClusterRegistration() error {
@@
-	// Set both the enable flag and the CA path environment variables
-	envVars := fmt.Sprintf("ARGOCD_PRINCIPAL_ENABLE_SELF_CLUSTER_REGISTRATION=true\nARGOCD_PRINCIPAL_RESOURCE_PROXY_TLS_CA_PATH=%s\n", caCertTempFile)
-
-	return os.WriteFile(fixture.EnvVariablesFromE2EFile, []byte(envVars), 0644)
+	// Set both the enable flag and the CA path environment variables (preserve existing entries)
+	existing, _ := os.ReadFile(fixture.EnvVariablesFromE2EFile)
+	envVars := string(existing)
+	if !strings.Contains(envVars, "ARGOCD_PRINCIPAL_ENABLE_SELF_CLUSTER_REGISTRATION=") {
+		envVars += "ARGOCD_PRINCIPAL_ENABLE_SELF_CLUSTER_REGISTRATION=true\n"
+	}
+	if !strings.Contains(envVars, "ARGOCD_PRINCIPAL_RESOURCE_PROXY_TLS_CA_PATH=") {
+		envVars += fmt.Sprintf("ARGOCD_PRINCIPAL_RESOURCE_PROXY_TLS_CA_PATH=%s\n", caCertTempFile)
+	}
+	return os.WriteFile(fixture.EnvVariablesFromE2EFile, []byte(envVars), 0644)
 }
```

Also applies to: 452-461
principal/resource.go (Outdated)

```go
// extractAgentFromAuth extracts the agent name from the request using multiple methods.
// Authentication methods in order of preference:
// 1. JWT bearer token in Authorization header (for self-registered clusters)
// 2. TLS client certificate Common Name (for manually created cluster secrets)
func (s *Server) extractAgentFromAuth(r *http.Request) (string, error) {
	logCtx := log().WithField("function", "extractAgentFromAuth")

	// Method 1: Try Authorization header first (Bearer token)
	// Used by self-registered clusters that store JWT tokens in cluster secrets
	authHeader := r.Header.Get("Authorization")
	if strings.HasPrefix(authHeader, "Bearer ") {
		logCtx.Debug("Attempting JWT bearer token authentication")
		token := strings.TrimPrefix(authHeader, "Bearer ")
		claims, err := s.issuer.ValidateResourceProxyToken(token)
		if err != nil {
			logCtx.WithError(err).Debug("Bearer token validation failed")
			return "", fmt.Errorf("invalid bearer token: %v", err)
		}
		subject, err := claims.GetSubject()
		if err != nil {
			logCtx.WithError(err).Debug("Could not get subject from token")
			return "", fmt.Errorf("could not get subject from token: %v", err)
		}
		logCtx.WithField("agent", subject).Debug("Successfully authenticated via bearer token")
		return subject, nil
	}

	// Method 2: Extract agent name from TLS client certificate CN
	// Used by manually created cluster secrets (via argocd-agentctl agent create)
	if r.TLS != nil && len(r.TLS.PeerCertificates) > 0 {
		cert := r.TLS.PeerCertificates[0]
		agentName := cert.Subject.CommonName
		if agentName != "" {
			logCtx.WithField("agent", agentName).Debug("Successfully authenticated via TLS client certificate CN")
			return agentName, nil
		}
		logCtx.Debug("TLS client certificate has empty Common Name")
	}

	logCtx.Debug("No authorization method found")
	return "", fmt.Errorf("no authorization found")
}
```
🧩 Analysis chain
🏁 Scripts executed:

```sh
rg -n "resourceProxyListenAddr|resource-proxy.*listen|listen.*resource-proxy|--listen-host" --type go --type yaml --type md
rg -n "ServerConfig|listen|addr.*principal" principal/ --type go -A 2 -B 2
fd "principal" --type d
git ls-files principal/
rg -n "resourceProxyListenAddr|resource-proxy.*port|resource-proxy.*addr" principal/server.go -A 2 -B 2
rg -n "resource-proxy\|resourceProxy" principal/options.go -A 3 -B 1
rg -n "defaultResourceProxyListenerAddr" -r "const defaultResourceProxyListenerAddr"
rg -n "WithResourceProxyListenerAddr|resource-proxy.*flag|resourceProxy" principal/options.go -A 3 -B 1
rg -n "flag\..*resource|resource.*flag" --type go | head -30
fd "main.go|cmd" --type f | grep -E "cmd|main"
rg -n "resourceProxyAddress|resourceProxyListenAddr" principal/server.go -A 1 -B 1
fd "main.go" principal/ --type f
fd "cmd" --type d | head -5
fd "principal" cmd/ --type f
git ls-files cmd/ | grep -i principal
rg -n "resourceProxyListenAddr\|--resource-proxy" --type go --type yaml
cat -n cmd/argocd-agent/principal.go | head -150
cat -n cmd/argocd-agent/principal.go | grep -A 5 -B 5 "resourceProxy"
cat -n cmd/argocd-agent/principal.go | tail -200 | head -100
rg -n "WithResourceProxyAddress" principal/ --type go -A 2 -B 2
cat -n principal/server.go | sed -n '370,390p'
rg -n "resourceProxyListenAddr" principal/ --type go
rg -n "0\.0\.0\.0:9090|defaultResourceProxyListenerAddr" principal/ --type go -B 2 -A 2
cat -n principal/options.go | grep -A 50 "type ServerOptions"
rg -n "resourceProxyListenAddr.*flag|SetResourceProxyListenAddr" --type go
cat -n principal/options.go | sed -n '430,470p'
cat -n principal/server.go | sed -n '185,195p'
rg -n "extractAgentFromAuth" principal/resource.go -A 3 -B 3
```
The resource-proxy listener must be bound to localhost only and made configurable.
The extractAgentFromAuth method now accepts bearer tokens from Authorization headers, but defaultResourceProxyListenerAddr is hardcoded to "0.0.0.0:9090" (line 189 in principal/server.go). Without listener address configuration, the resource-proxy is exposed to all interfaces by default, allowing attackers to directly reach it and forge authentication headers to impersonate agents.
Add a CLI flag (e.g., --resource-proxy-listen-addr) and corresponding ServerOption to allow binding to localhost, and change the default to "127.0.0.1:9090". Network policies alone are insufficient if the operator can accidentally expose it.
🤖 Prompt for AI Agents
In `@principal/resource.go` around lines 390 - 430, The listener address for the
resource-proxy is currently hardcoded and must be configurable and default to
localhost; add a CLI flag (e.g., --resource-proxy-listen-addr) and a
ServerOption to set the listener address on Server, introduce/replace the
constant defaultResourceProxyListenerAddr to "127.0.0.1:9090", wire the new
ServerOption into Server creation/initialization (where the server binds the
resource proxy listener) so the configured value is used instead of the
hardcoded one, and ensure any code that starts the listener reads
Server.resourceProxyListenAddr (or the new field) so operators can bind to
127.0.0.1 or override via the flag.
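A sketch of the suggested wiring; the `ServerOption` shape is assumed from the repo's functional-options pattern and may differ from the actual signature in principal/options.go:

```go
// Assumed option shape; illustrative only.
const defaultResourceProxyListenerAddr = "127.0.0.1:9090"

// WithResourceProxyListenAddr lets operators bind the resource proxy to a
// specific address (e.g. loopback only) instead of the hardcoded default.
func WithResourceProxyListenAddr(addr string) ServerOption {
	return func(s *Server) error {
		s.resourceProxyListenAddr = addr
		return nil
	}
}
```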
I don't think it is a security risk. This is an existing implementation, and even if it is set to 0.0.0.0:9090, any attacker would still need a valid client certificate as well as a valid JWT bearer token.
🧩 Analysis chain
🏁 Scripts executed:

```sh
# Search for TLS configuration and client certificate requirements in the resource proxy setup
rg -n "tls\.Config|ClientAuth|RequireAndVerifyClientCert|NoClientCert" principal/ --type go -B 3 -A 3
# Search for resource-proxy TLS server configuration and listener setup
rg -n "resource-proxy.*listen|resourceProxyListen|StartResourceProxy" principal/ --type go -B 3 -A 3
# Check if there's a flag for requiring client certificates
rg -n "resource-proxy-require-client-cert|requireClientCert" --type go --type yaml
# Search for where resource proxy TLS config is created and configured
rg -n "resourceProxyTLSConfig|WithResourceProxyTLS" --type go -B 5 -A 10
# Search for resource proxy TLS server setup in cmd/principal.go
rg -n "resource-proxy.*tls|resourceProxyCA|resource-proxy-ca" cmd/argocd-agent/principal.go -B 3 -A 5
# Check how the resource proxy listener is started
ast-grep --pattern 'func ($$$) Start($$$) { $$$ TLS $$$ }'
# Find the TLS config creation functions for resource proxy
rg -n "func getResourceProxyTLS" cmd/argocd-agent/principal.go -A 30
# Search for ClientAuth configuration in resource proxy setup
rg -n "ClientAuth|RequireAndVerify|NoClientCert" cmd/argocd-agent/ --type go -B 2 -A 2
# Check the resource proxy server Start method to see TLS setup
rg -n "func.*Start" principal/resourceproxy/ --type go -A 20 | head -60
```
@jparsai You're absolutely right! I apologize for the confusion.
Looking at the TLS configuration code, the resource proxy does require client certificates at the TLS transport layer:
```go
// cmd/argocd-agent/principal.go lines 546-552, 570-576
proxyTLS := &tls.Config{
	Certificates: []tls.Certificate{proxyCert},
	ClientAuth:   tls.RequireAndVerifyClientCert,
	ClientCAs:    clientCA,
}
```

This means an attacker would need both:
- A valid client certificate signed by the CA to complete the TLS handshake
- A valid JWT bearer token OR a client certificate with the correct CN for application-level authentication
The extractAgentFromAuth method only runs after the TLS handshake succeeds. Without a valid client certificate, connections are rejected at the TLS layer before reaching the application logic.
However, I notice from the PR discussion that there's consideration for making the client certificate requirement configurable (via --resource-proxy-require-client-cert). If that feature is added in a future PR to support scenarios like web terminals for self-registered clusters, then binding to localhost would become critical security requirement, as the JWT-only authentication path would be exposed.
For the current implementation in this PR, binding to 0.0.0.0:9090 is acceptable given the mandatory TLS client certificate requirement.
🧬 Code Graph Analysis Results

[principal/resource.go] Summary of potentially relevant authentication logic

- Description: Extracts the agent name from the HTTP request as part of authenticating a client for resource proxying.
- Key behavior:
  - Priority 1: JWT bearer token in the Authorization header (`Bearer <token>`)
    - Validates the token via s.issuer.ValidateResourceProxyToken.
    - On success, extracts the subject from the token and returns it as the agent name.
    - On failure, returns an error (invalid bearer token).
  - Priority 2: TLS client certificate Common Name (CN)
    - If a TLS client certificate is present, uses cert.Subject.CommonName as the agent name if non-empty.
  - Fallback: If neither method yields an agent name, returns an error indicating no authorization found.
- Parameters: r *http.Request
- Returns: (string, error)
  - string: agent name if authentication succeeds
  - error: non-nil if authentication fails
- Exceptions / error handling:
  - If the Authorization header starts with "Bearer ", failure to validate the token yields an error (invalid bearer token).
  - If a TLS cert is present but the CN is empty, falls through to indicate no authorization found.
- Important implementation details:
  - Logs progression at various points (attempting bearer token auth, success/failure, TLS CN success, CN empty, no authorization found).
  - Uses the subject from the JWT as the agent identity.
  - TLS-based authentication uses the first peer certificate's CommonName.
- Relation to user comment: This function enforces authentication for resource proxy requests, requiring either a valid JWT bearer token or a valid TLS client certificate CN before proceeding with proxying.
🧠 Learnings used
Learnt from: jannfis
Repo: argoproj-labs/argocd-agent PR: 699
File: docs/configuration/principal/configuration.md:260-268
Timestamp: 2026-01-09T18:54:23.709Z
Learning: For header-based authentication in argocd-agent, documentation must prominently warn users that the principal service must not be directly reachable from outside the service mesh, and should recommend binding to localhost with --listen-host=127.0.0.1.
Learnt from: jannfis
Repo: argoproj-labs/argocd-agent PR: 699
File: docs/configuration/principal/configuration.md:260-268
Timestamp: 2026-01-09T18:54:23.709Z
Learning: When using header-based authentication in argocd-agent principal, the service MUST be configured to listen on localhost (127.0.0.1) only and protected by network policies to prevent direct external access. Without these protections, attackers can bypass the service mesh sidecar and forge authentication headers to impersonate any agent.
@jannfis @drewbailey I found one issue while trying the web-based terminal with a self-registered cluster. With the bearer token implementation, agents connect successfully and apps are synced, but opening a terminal in the UI throws an error in the principal logs.

After debugging this issue, I found the root cause: the self-registered cluster secret has a JWT bearer token, but it does not have the client certificate that Argo CD uses to get resources.
Hey @jparsai, yeah, the resource proxy TLS configuration defaults to tls.RequireAndVerifyClientCert, but self-registered clusters use JWT bearer tokens instead of client certificates. On my branch I addressed this by making the client certificate requirement configurable via the --resource-proxy-require-client-cert flag (default: true). When using JWT bearer token authentication for self-registered clusters, set this to false.

Principal configuration: `--resource-proxy-require-client-cert=false`

Or via environment variable: `ARGOCD_PRINCIPAL_RESOURCE_PROXY_REQUIRE_CLIENT_CERT=false`

The resource proxy authentication would now support two modes: JWT bearer token (self-registered clusters) and TLS client certificate plus the agentName query parameter (manually created secrets). The changes:
1. New flag in cmd/argocd-agent/principal.go:

```go
command.Flags().BoolVar(&resourceProxyRequireClientCert, "resource-proxy-require-client-cert",
	env.BoolWithDefault("ARGOCD_PRINCIPAL_RESOURCE_PROXY_REQUIRE_CLIENT_CERT", true),
	"Require TLS client certificates for resource proxy (disable for JWT bearer token auth)")
```
2. TLS config change (in both getResourceProxyTLSConfigFromKube and getResourceProxyTLSConfigFromFiles):

```go
clientAuth := tls.NoClientCert
if requireClientCert {
	clientAuth = tls.RequireAndVerifyClientCert
}
proxyTLS := &tls.Config{
	Certificates: []tls.Certificate{proxyCert},
	ClientAuth:   clientAuth, // was: tls.RequireAndVerifyClientCert
	ClientCAs:    clientCA,
}
```
3. Authentication in principal/resource.go - JWT bearer token support:

```go
func (s *Server) extractAgentFromAuth(r *http.Request) (string, error) {
	// Try Authorization header first (Bearer token)
	authHeader := r.Header.Get("Authorization")
	if strings.HasPrefix(authHeader, "Bearer ") {
		token := strings.TrimPrefix(authHeader, "Bearer ")
		claims, err := s.issuer.ValidateResourceProxyToken(token)
		if err != nil {
			return "", fmt.Errorf("invalid bearer token: %w", err)
		}
		return claims.GetSubject()
	}
	// Fall back to query parameter for backward compatibility with TLS client cert mode
	agentName := r.URL.Query().Get("agentName")
	if agentName != "" {
		if r.TLS == nil || len(r.TLS.PeerCertificates) < 1 {
			return "", fmt.Errorf("query param mode requires TLS client certificate")
		}
		return agentName, nil
	}
	return "", fmt.Errorf("no authorization found")
}
```
---

Usage for self-registered clusters:

```sh
argocd-agent principal \
  --enable-self-cluster-registration=true \
  --resource-proxy-require-client-cert=false \
  ...
```
Shouldn't they use both? A generic client certificate plus a unique JWT? Or is this a limitation of the k8s client, that it can use only either of them? |
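For reference, the Kubernetes client can present both at once: the client certificate authenticates the TLS handshake (mTLS), while the bearer token travels as an Authorization header, so they are not mutually exclusive. A sketch using client-go's rest.Config; the host and PEM inputs are illustrative:

```go
import "k8s.io/client-go/rest"

// proxyRestConfig sketches a config carrying both credentials: a shared
// client certificate for the TLS handshake plus a unique per-agent JWT.
func proxyRestConfig(host, bearerToken string, certPEM, keyPEM, caPEM []byte) *rest.Config {
	return &rest.Config{
		Host:        host, // e.g. the principal's resource-proxy address
		BearerToken: bearerToken,
		TLSClientConfig: rest.TLSClientConfig{
			CertData: certPEM, // shared client certificate
			KeyData:  keyPEM,
			CAData:   caPEM,
		},
	}
}
```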
@jannfis yeah, that makes sense. I can stop including these if they are annoying, but just in case, here's a patch with TLS + JWT for identity: https://gist.github.com/drewbailey/f1e39a11d68cb1bb4a26bb43d0df40af
Assisted by: Cursor
Signed-off-by: Jayendra Parsai <[email protected]>
Force-pushed: fd2b44f to 2927003
Actionable comments posted: 4
🤖 Fix all issues with AI agents
In `@principal/apis/auth/auth.go`:
- Around line 127-134: The current authentication flow in auth.go treats any
error from s.clusterRegistrationManager.RegisterCluster(ctx, clientID) as fatal,
which blocks reconnections on transient K8s failures; change this so
RegisterCluster failures are non-fatal when the cluster secret already exists:
call a check method on the clusterRegistrationManager (e.g., ClusterSecretExists
or IsClusterRegistered) before returning error, and if the secret exists treat
the RegisterCluster error as a warning (log with warning level and continue
authentication) but if the secret does not exist keep returning
errAuthenticationFailed; if such an existence-check method is missing, add it to
the ClusterRegistrationManager interface and implement it, and replace the
current unconditional log+return in the RegisterCluster error branch
accordingly.
In `@principal/clusterregistration/clusterregistration_test.go`:
- Around line 90-95: The mock issuer created by createMockIssuer sets
ValidateResourceProxyToken to return (nil, nil) which violates the production
contract (claims must be non-nil when err==nil); update createMockIssuer so
ValidateResourceProxyToken returns either a valid claims object (implementing
GetSubject with a sensible test subject) and nil error on success, or return
(nil, someError) on failure, and remove or tighten the .Maybe() so tests
exercise the successful path; ensure IssueResourceProxyToken and
ValidateResourceProxyToken expectations match actual test cases that call
claims.GetSubject() (a sketch follows after this list).
In `@test/e2e/self_cluster_registration_test.go`:
- Around line 52-64: The SetupSuite currently swallows errors from
getClusterSecret leaving suite.originalSecret nil and causing later no-ops;
change SetupSuite (in SelfClusterRegistrationTestSuite) to fail the suite
immediately when getClusterSecret returns an error by calling suite.T().Fatalf
(or Fatal) with the error so tests do not run without a restoreable
originalSecret, otherwise keep the existing DeepCopy/log behavior when no error
occurs.
- Around line 384-417: The test Test_NoSecretAndSelfRegistrationDisabled
currently asserts clusterSecretExists immediately after fixture.RestartAgent,
which can pass before the agent attempts connection; update the test to wait for
the agent to attempt/connect before asserting absence of the secret by using the
existing connection check helpers: after calling fixture.RestartAgent and
fixture.CheckReadiness, use requires.Eventually (or requires.Never/Consistently
if preferred) with HasConnectionStatus (or another connection-status helper used
elsewhere) to observe that the agent has attempted a connection (or reached a
failed connection state), and only then call
requires.False(clusterSecretExists(...)) to ensure the assertion runs after a
real connection attempt. Ensure you reference
Test_NoSecretAndSelfRegistrationDisabled, fixture.RestartAgent/CheckReadiness,
HasConnectionStatus (or the equivalent), and clusterSecretExists when making the
change.
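A sketch of a contract-honoring mock for the `createMockIssuer` item above, assuming the generated issuermock package seen elsewhere in this PR and golang-jwt/v5 claims; the exact mock API may differ:

```go
import (
	"errors"
	"testing"

	"github.com/golang-jwt/jwt/v5"
)

func Test_ValidateResourceProxyTokenMock(t *testing.T) {
	mockIss := issuermock.NewIssuer(t)
	// Success path: non-nil claims together with a nil error, matching the
	// production contract (claims.GetSubject() must be callable).
	mockIss.On("ValidateResourceProxyToken", "valid-token").
		Return(&jwt.RegisteredClaims{Subject: "test-agent"}, nil)
	// Failure path: nil claims with a non-nil error.
	mockIss.On("ValidateResourceProxyToken", "bad-token").
		Return(nil, errors.New("invalid token"))
}
```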
🧹 Nitpick comments (15)
principal/clusterregistration/clusterregistration_test.go (1)
44-46: Test helper `getClusterSecretName` duplicates production naming logic.

If the production `getClusterSecretName` (in the `clusterregistration` package or `cluster` package) changes its naming convention, this test helper won't break, silently diverging. Consider importing the production function or using a package-level constant for the prefix.

internal/issuer/jwt.go (1)
203-217: Non-expiring resource-proxy JWTs — consider adding a long TTL.

The token is intentionally issued without `ExpiresAt`, meaning it remains valid for the lifetime of the signing key. While the PR discussion reflects this as a deliberate choice (with refresh-on-reconnect as mitigation), a leaked token cannot be revoked or time-bounded. Consider adding a generous but finite TTL (e.g., 30–90 days) and relying on the existing refresh logic to reissue tokens before expiry. This limits the blast radius of a compromised token without impacting normal operation.
principal/resource_test.go (1)
principal/resource_test.go (1)

294-398: Good test coverage for the dual authentication flow.

The `Test_extractAgentFromAuth` suite covers the key scenarios well: valid/invalid bearer tokens, missing auth, TLS CN fallback, empty CN, and precedence rules. One edge case not covered: a malformed `Authorization` header that has the `Bearer` prefix but an empty token string (i.e., "Bearer " with trailing space only). Currently `extractAgentFromAuth` would pass an empty string to `ValidateResourceProxyToken`; worth a quick test to confirm the issuer rejects it gracefully.

principal/resource.go (1)
394-431: Excessive `Info`-level logging on every proxied request.

`extractAgentFromAuth` runs on every resource proxy request. Lines 401, 413, and 423 all log at `Info` level for the normal success path, which will be extremely noisy under load (every K8s API call through the proxy triggers these). Reserve `Info` for notable events (auth failures, first-time auth); use `Debug` or `Trace` for per-request success logging.

♻️ Suggested log-level adjustment

```diff
-	logCtx.Info("Attempting JWT bearer token authentication")
+	logCtx.Debug("Attempting JWT bearer token authentication")
 	token := strings.TrimPrefix(authHeader, "Bearer ")
 	claims, err := s.issuer.ValidateResourceProxyToken(token)
 	...
-	logCtx.WithField("agent", subject).Info("Successfully authenticated via bearer token")
+	logCtx.WithField("agent", subject).Debug("Successfully authenticated via bearer token")
 	return subject, nil
 	...
-	logCtx.WithField("agent", agentName).Info("Successfully authenticated via TLS client certificate CN")
+	logCtx.WithField("agent", agentName).Debug("Successfully authenticated via TLS client certificate CN")
 	return agentName, nil
 	...
-	logCtx.Info("TLS client certificate has empty Common Name")
+	logCtx.Debug("TLS client certificate has empty Common Name")
 	...
-	logCtx.Info("No authorization method found")
+	logCtx.Debug("No authorization method found")
```

principal/server_test.go (1)
141-149: Potential test panic: `assert.NotNil` won't stop the test if it fails.

Line 147 uses `assert.NotNil` for `s`, but line 148 dereferences `s.clusterRegistrationManager` without a prior `require.NotNil`. If the server creation ever fails to initialize the manager, this test will panic instead of producing a clear failure. Use `require.NotNil` for guards before dereference.

♻️ Safer assertions

```diff
 	assert.NoError(t, err)
-	assert.NotNil(t, s)
-	assert.False(t, s.clusterRegistrationManager.IsSelfClusterRegistrationEnabled())
+	require.NotNil(t, s)
+	require.NotNil(t, s.clusterRegistrationManager)
+	assert.False(t, s.clusterRegistrationManager.IsSelfClusterRegistrationEnabled())
```

Same pattern applies to the test at lines 151–160.
internal/argocd/cluster/cluster.go (3)
271-313: `UpdateClusterBearerToken` is vulnerable to update conflicts without retry.

The read-modify-write on lines 279–308 gets optimistic concurrency implicitly through the fetched secret's `ResourceVersion`, but there's no retry on `Conflict` errors. If the cluster secret is updated concurrently (e.g., by ArgoCD's own reconciliation or a second auth attempt), the update will fail and propagate as an auth failure. Consider adding a retry loop for `IsConflict` errors, similar to the `IsAlreadyExists` handling in `CreateClusterWithBearerToken`.
342-361: Noisy `Info`-level logging on the read path.

`GetClusterBearerToken` (line 345) logs at `Info` on every call. Since this is invoked from `validateClusterToken` during each authentication, it generates excessive log volume. Use `Debug` instead.

♻️ Lower log level

```diff
 func GetClusterBearerToken(ctx context.Context, kubeclient kubernetes.Interface, namespace, agentName string) (string, error) {
 	logCtx := log().WithField("agent", agentName).WithField("process", "self-registration")
-	logCtx.Info("Retrieving bearer token from cluster secret")
+	logCtx.Debug("Retrieving bearer token from cluster secret")
```
198-260: Verbose `Info` logging on the create path — same concern as elsewhere.

Lines 203, 211, 219, 258 all log at `Info` for normal operation during cluster creation. Since `CreateClusterWithBearerToken` is called from the auth flow, this adds up quickly. Consider `Debug` for the intermediate steps and keep `Info` only for the final "Successfully created" message.

principal/clusterregistration/clusterregistration.go (3)
60-103: Three redundant GETs to the same secret on every reconnect.

The happy path for an existing self-registered cluster calls:

1. `ClusterSecretExists` → GET secret (line 61)
2. `IsClusterSelfRegistered` → GET same secret (line 70)
3. `GetClusterBearerToken` (inside `validateClusterToken`) → GET same secret again (line 119)

This triples the K8s API load per authentication. Consider fetching the secret once and passing the data through, or creating a single helper that returns `(exists bool, selfRegistered bool, bearerToken string, err error)`.
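A sketch of such a single-GET helper; the label key and data key are illustrative, the real ones live in internal/argocd/cluster:

```go
import (
	"context"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// getClusterState fetches the cluster secret once and derives all three
// answers that the registration flow needs.
func getClusterState(ctx context.Context, kc kubernetes.Interface, ns, secretName string) (exists, selfRegistered bool, token string, err error) {
	secret, err := kc.CoreV1().Secrets(ns).Get(ctx, secretName, metav1.GetOptions{})
	if apierrors.IsNotFound(err) {
		return false, false, "", nil
	}
	if err != nil {
		return false, false, "", err
	}
	selfRegistered = secret.Labels["self-registered"] == "true" // illustrative label key
	return true, selfRegistered, string(secret.Data["bearerToken"]), nil // illustrative data key
}
```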
117-141: Subject mismatch in `validateClusterToken` should be logged as a warning.

Line 140 silently returns `false` when the token's subject doesn't match the agent name. This could indicate a misconfigured or tampered cluster secret (token for agent A stored in agent B's secret). A warning log would aid troubleshooting and security auditing.

♻️ Add warning for subject mismatch

```diff
+	if subject != agentName {
+		log().WithField("agent", agentName).WithField("tokenSubject", subject).
+			Warn("Bearer token subject does not match agent name")
+	}
 	return subject == agentName, nil
```
51-115: Verbose `Info` logging on every registration attempt.

The `RegisterCluster` method logs ~5–7 `Info` messages during the normal "exists + valid token" path (lines 58, 67, 81, 91). Since this runs on every agent authentication, the volume will be significant. Consider `Debug` for intermediate steps and `Info` only for state changes (creation, token refresh).

test/e2e/self_cluster_registration_test.go (4)
106-130: `ensureOriginalSecretExists` and `restoreOriginalSecret` are near-identical.

Both methods follow the same get-secret → if-not-found-create → if-self-registered-delete-and-create flow. Consider extracting a single shared method and calling it from both `SetupTest` and `TearDownTest`.

Also applies to: 132-157
427-437: `clusterSecretExists` creates a new `kubernetes.Interface` on every call, including inside `Eventually` loops.

This is called repeatedly (up to 30 times per `Eventually` block). Consider constructing the client once outside the loop, or using the `fixture.KubeClient.Get` approach (like `getClusterSecret`) and checking for `IsNotFound` instead.
119-122: Non-NotFound errors are silently swallowed in setup/teardown.

If `getClusterSecret` fails with, e.g., a network error, both `ensureOriginalSecretExists` (line 120) and `restoreOriginalSecret` (line 147) just log and return. This could leave tests running against a stale or missing secret without any signal. Consider failing the test (`suite.Require().NoError(err)`) or at minimum using `suite.T().Fatalf` so issues are visible early.

Also applies to: 146-149
367-381: One-directional key check — only verifies manual ⊆ self-registered.

The loop on line 378 checks that every key in `manualSecret.Data` exists in `selfRegSecret.Data`, but not the reverse. If the self-registered secret has unexpected extra data keys, the test won't catch it. This may be intentional (e.g., bearer token field), but if so, a comment clarifying that would help future readers.
Force-pushed: 2927003 to 3884171
Assisted by: Cursor Signed-off-by: Jayendra Parsai <[email protected]>
Force-pushed: 3884171 to 7166169
@drewbailey, @jannfis I changed the implementation; it now uses both a JWT bearer token and a shared client certificate for authenticating self-registered clusters. PTAL. Thanks again @drewbailey for providing the patch, it was very helpful and gave me an idea of the approach you had in mind for the implementation.
This PR implements the "Self Cluster Registration" feature, allowing agents to automatically register by creating Cluster Secrets when they connect to the principal. A JWT bearer token signed by the principal's key is used for authentication, but existing manually registered clusters will continue to work using TLS cert CN authentication.

Supported authentication methods in principal:

- JWT bearer token (self-registered clusters)
- TLS client certificate CN (manually registered clusters)
Fixes #392
How to test changes / Special notes to the reviewer:
- `--enable-self-cluster-registration`
- `--self-reg-client-cert-secret`
- `--enable-resource-proxy`
- `argocd-agentctl agent create <Agent name>`

Assisted by: Cursor
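An illustrative invocation combining the flags listed above; the secret name and agent name are placeholders, not defaults:

```sh
# Enable self-registration on the principal (flag names from this PR).
argocd-agent principal \
  --enable-self-cluster-registration \
  --self-reg-client-cert-secret argocd-agent-client-tls \
  --enable-resource-proxy

# Manually registered agents keep working via the existing workflow:
argocd-agentctl agent create my-agent
```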
Summary by CodeRabbit
New Features
Tests
Chores