feat(operator): add per-tenant namespace RBAC for multi-tenant job execution#656
Conversation
|
Important Review skippedAuto reviews are disabled on base/target branches other than the default branch. Please check the settings in the CodeRabbit UI or the You can disable this status message by setting the Use the checkbox below for a quick retry:
✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
🧹 Nitpick comments (4)
controllers/evalhub/service_accounts.go (2)
894-915: Tenant SA labels are not updated if they drift.The function returns early if the ServiceAccount exists without checking whether its labels match the desired state. If labels are manually modified or become inconsistent, subsequent cleanup operations could fail to identify managed resources.
♻️ Proposed refactor to ensure label consistency
func (r *EvalHubReconciler) ensureTenantServiceAccount(ctx context.Context, name, namespace string, labels map[string]string) error { sa := &corev1.ServiceAccount{} err := r.Get(ctx, types.NamespacedName{Name: name, Namespace: namespace}, sa) if err == nil { - return nil // already exists + // Ensure labels are up-to-date + needsUpdate := false + if sa.Labels == nil { + sa.Labels = make(map[string]string) + } + for k, v := range labels { + if sa.Labels[k] != v { + sa.Labels[k] = v + needsUpdate = true + } + } + if needsUpdate { + log.FromContext(ctx).Info("Updating tenant SA labels", "namespace", namespace, "name", name) + return r.Update(ctx, sa) + } + return nil }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@controllers/evalhub/service_accounts.go` around lines 894 - 915, ensureTenantServiceAccount currently returns early if the ServiceAccount exists and does not reconcile labels; change it to fetch the existing SA (using r.Get in ensureTenantServiceAccount), compare its ObjectMeta.Labels with the desired labels passed in, and if they differ replace/merge as appropriate and call r.Update(ctx, sa) (or a patch) to persist the label changes; keep the not-found path creating the SA as-is and ensure errors from Update are returned.
782-788: Consider continuing reconciliation for remaining namespaces on partial failure.When
reconcileTenantNamespacefails for one namespace, the function returns immediately without attempting to reconcile other tenant namespaces. This means a single problematic namespace blocks RBAC provisioning for all other tenants. Consider accumulating errors and continuing, or at minimum documenting this behavior.♻️ Proposed refactor to continue on partial failure
+ var reconcileErr error // Reconcile each tenant namespace for ns := range tenantNS { if err := r.reconcileTenantNamespace(ctx, instance, ns); err != nil { log.Error(err, "Failed to reconcile tenant namespace", "namespace", ns) - return fmt.Errorf("reconciling tenant namespace %s: %w", ns, err) + reconcileErr = fmt.Errorf("reconciling tenant namespace %s: %w", ns, err) + // Continue with other namespaces } } + + // Return first error after attempting all namespaces + if reconcileErr != nil { + return reconcileErr + }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@controllers/evalhub/service_accounts.go` around lines 782 - 788, The loop over tenantNS currently returns on the first error from reconcileTenantNamespace, blocking reconciliation for remaining namespaces; change this to continue reconciling all namespaces while collecting errors (e.g., append errors to a slice or use a multierror) and log each failure with log.Error including the namespace, then after the loop return a composed error if any failures occurred (or nil if none). Specifically update the loop that calls r.reconcileTenantNamespace(ctx, instance, ns) to not return immediately on error, accumulate per-namespace errors, and return an aggregated error (or wrapped error list) at the end so callers can see partial-failure details.controllers/evalhub/tenant_rbac_test.go (1)
252-265: Consider usingassert.Errorfor clearer test assertions.Using
assert.True(t, err != nil, ...)is less idiomatic thanassert.Error(t, err, ...)for verifying that an error occurred.♻️ Proposed refactor
// Verify SA was cleaned up err = fakeClient.Get(ctx, types.NamespacedName{ Name: evalHubName + "-jobs", Namespace: tenantNS, }, sa) - assert.True(t, err != nil, "expected SA to be deleted after annotation removal") + assert.Error(t, err, "expected SA to be deleted after annotation removal") // Verify RoleBindings were cleaned up rb := &rbacv1.RoleBinding{} err = fakeClient.Get(ctx, types.NamespacedName{ Name: evalHubName + "-tenant-jobs-writer", Namespace: tenantNS, }, rb) - assert.True(t, err != nil, "expected RoleBinding to be deleted after annotation removal") + assert.Error(t, err, "expected RoleBinding to be deleted after annotation removal")🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@controllers/evalhub/tenant_rbac_test.go` around lines 252 - 265, Replace the non-idiomatic nil-check assertions using assert.True(t, err != nil, ...) with assert.Error to make the tests clearer: for the ServiceAccount check (after fakeClient.Get into sa for Name evalHubName + "-jobs" in namespace tenantNS) use assert.Error(t, err, "expected SA to be deleted after annotation removal"), and for the RoleBinding check (fakeClient.Get into rb for Name evalHubName + "-tenant-jobs-writer" in namespace tenantNS) use assert.Error(t, err, "expected RoleBinding to be deleted after annotation removal").controllers/evalhub/evalhub_controller.go (1)
212-214: Consider adding a predicate to filter namespace events.The current implementation triggers reconciliation of all EvalHub instances on every namespace event (create, update, delete). In clusters with many namespaces and frequent namespace activity, this could cause unnecessary reconciliation churn.
♻️ Proposed refactor to filter on annotation changes
+import "sigs.k8s.io/controller-runtime/pkg/predicate" + +// tenantAnnotationPredicate filters namespace events to only those +// where the tenant annotation changes (added, removed, or modified). +func tenantAnnotationPredicate() predicate.Predicate { + return predicate.Funcs{ + CreateFunc: func(e event.CreateEvent) bool { + _, ok := e.Object.GetAnnotations()[tenantAnnotation] + return ok + }, + UpdateFunc: func(e event.UpdateEvent) bool { + _, oldHas := e.ObjectOld.GetAnnotations()[tenantAnnotation] + _, newHas := e.ObjectNew.GetAnnotations()[tenantAnnotation] + return oldHas != newHas + }, + DeleteFunc: func(e event.DeleteEvent) bool { + _, ok := e.Object.GetAnnotations()[tenantAnnotation] + return ok + }, + } +}Then use it in the watch:
Watches(&corev1.Namespace{}, handler.EnqueueRequestsFromMapFunc(r.namespacesToEvalHub), + builder.WithPredicates(tenantAnnotationPredicate()), ).🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@controllers/evalhub/evalhub_controller.go` around lines 212 - 214, The Watch on corev1.Namespace currently enqueues all namespace events via r.namespacesToEvalHub; add a predicate to only enqueue when namespace annotations relevant to EvalHub change (or when a namespace gains/loses the annotation) to avoid reconciliation churn. Implement a predicate (e.g., namespaceAnnotationPredicate) using predicate.Funcs that returns false for Create/Delete/Update unless the annotation set/value you care about differs between old and new (for Update compare Old.GetAnnotations() vs New.GetAnnotations(), for Create/Delete check presence of the annotation), then attach it to the watch call (replace the current Watches(...). with Watches(&corev1.Namespace{}, handler.EnqueueRequestsFromMapFunc(r.namespacesToEvalHub), builder.WithPredicates(namespaceAnnotationPredicate)). Ensure you reference the same r.namespacesToEvalHub mapping function and use the new predicate variable in the builder.WithPredicates call.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@controllers/evalhub/evalhub_controller.go`:
- Around line 212-214: The Watch on corev1.Namespace currently enqueues all
namespace events via r.namespacesToEvalHub; add a predicate to only enqueue when
namespace annotations relevant to EvalHub change (or when a namespace
gains/loses the annotation) to avoid reconciliation churn. Implement a predicate
(e.g., namespaceAnnotationPredicate) using predicate.Funcs that returns false
for Create/Delete/Update unless the annotation set/value you care about differs
between old and new (for Update compare Old.GetAnnotations() vs
New.GetAnnotations(), for Create/Delete check presence of the annotation), then
attach it to the watch call (replace the current Watches(...). with
Watches(&corev1.Namespace{},
handler.EnqueueRequestsFromMapFunc(r.namespacesToEvalHub),
builder.WithPredicates(namespaceAnnotationPredicate)). Ensure you reference the
same r.namespacesToEvalHub mapping function and use the new predicate variable
in the builder.WithPredicates call.
In `@controllers/evalhub/service_accounts.go`:
- Around line 894-915: ensureTenantServiceAccount currently returns early if the
ServiceAccount exists and does not reconcile labels; change it to fetch the
existing SA (using r.Get in ensureTenantServiceAccount), compare its
ObjectMeta.Labels with the desired labels passed in, and if they differ
replace/merge as appropriate and call r.Update(ctx, sa) (or a patch) to persist
the label changes; keep the not-found path creating the SA as-is and ensure
errors from Update are returned.
- Around line 782-788: The loop over tenantNS currently returns on the first
error from reconcileTenantNamespace, blocking reconciliation for remaining
namespaces; change this to continue reconciling all namespaces while collecting
errors (e.g., append errors to a slice or use a multierror) and log each failure
with log.Error including the namespace, then after the loop return a composed
error if any failures occurred (or nil if none). Specifically update the loop
that calls r.reconcileTenantNamespace(ctx, instance, ns) to not return
immediately on error, accumulate per-namespace errors, and return an aggregated
error (or wrapped error list) at the end so callers can see partial-failure
details.
In `@controllers/evalhub/tenant_rbac_test.go`:
- Around line 252-265: Replace the non-idiomatic nil-check assertions using
assert.True(t, err != nil, ...) with assert.Error to make the tests clearer: for
the ServiceAccount check (after fakeClient.Get into sa for Name evalHubName +
"-jobs" in namespace tenantNS) use assert.Error(t, err, "expected SA to be
deleted after annotation removal"), and for the RoleBinding check
(fakeClient.Get into rb for Name evalHubName + "-tenant-jobs-writer" in
namespace tenantNS) use assert.Error(t, err, "expected RoleBinding to be deleted
after annotation removal").
ℹ️ Review info
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (5)
config/rbac/role.yamlcontrollers/evalhub/constants.gocontrollers/evalhub/evalhub_controller.gocontrollers/evalhub/service_accounts.gocontrollers/evalhub/tenant_rbac_test.go
| } | ||
|
|
||
| // 2. RoleBinding: API SA → jobs-writer ClusterRole (create/delete jobs in tenant ns) | ||
| if err := r.ensureTenantRoleBinding(ctx, instance.Name+"-tenant-jobs-writer", namespace, managedLabels, |
There was a problem hiding this comment.
So we are going with the option with Role and RoleBinding for EH service accounts instead of ClusterRole and ClusterRoleBinding ? I am OK with this, just wanted to confirm.
There was a problem hiding this comment.
@mariusdanciu Yes. The tenant-namespace resources use RoleBindings that reference pre-existing ClusterRoles (e.g. evalhub-jobs-writer, evalhub-job-config, evalhub-mlflow-access).
My idea was mostly to avoid creating additional ClusterRoleBindings per tenant since with RoleBindings the API SA can only create jobs in namespaces the operator has explicitly provisioned.
| log.Info("Reconciling tenant namespace RBAC", "namespace", namespace) | ||
|
|
||
| apiSAName := generateServiceAccountName(instance) | ||
| jobsSAName := generateJobsServiceAccountName(instance) |
There was a problem hiding this comment.
generateJobsServiceAccountName adds the -api suffix. For jobs SA we should probably have -job suffix.
There was a problem hiding this comment.
@mariusdanciu thanks, I'll check again. I thought the two separate functions created
generateServiceAccountName():{name}-api(the API/proxy SA)generateJobsServiceAccountName():{name}-jobs(the jobs SA)
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (4)
controllers/evalhub/tenant_rbac_test.go (1)
255-267: Use explicit NotFound assertions for cleanup checks.Using
err != nilcan mask unexpected failures; assertingIsNotFoundmakes cleanup expectations precise.🔍 Suggested test assertion refinement
+import apierrors "k8s.io/apimachinery/pkg/api/errors" @@ -assert.True(t, err != nil, "expected SA to be deleted after annotation removal") +assert.True(t, apierrors.IsNotFound(err), "expected SA to be deleted after annotation removal") @@ -assert.True(t, err != nil, "expected RoleBinding to be deleted after annotation removal") +assert.True(t, apierrors.IsNotFound(err), "expected RoleBinding to be deleted after annotation removal")Also applies to: 329-342, 404-409
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@controllers/evalhub/tenant_rbac_test.go` around lines 255 - 267, Replace the loose "err != nil" assertions used after fakeClient.Get for the service account (sa) and RoleBinding (rb) cleanup checks with explicit NotFound checks using k8s API errors (apierrors.IsNotFound(err)); update the assertions around the Get calls (the ones referencing evalHubName+"-jobs" and evalHubName+"-tenant-jobs-writer") to assert that apierrors.IsNotFound(err) is true and include a clear message, and import k8s.io/apimachinery/pkg/api/errors as apierrors if not already present; apply the same change to the other similar assertions mentioned (around lines 329-342 and 404-409).controllers/evalhub/evalhub_controller.go (1)
212-214: Consider filtering namespace-triggered reconciles to tenant-relevant changes only.Current watch fan-outs all EvalHub reconciles for any namespace update, which can become noisy at scale. Limiting triggers to tenant annotation add/remove/change (plus delete) will reduce unnecessary churn.
Also applies to: 221-240
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@controllers/evalhub/evalhub_controller.go` around lines 212 - 214, The namespace Watch currently enqueues all EvalHub reconciles for any Namespace change; restrict it by adding a predicate (predicate.Funcs) to the Watches on corev1.Namespace that only returns true for Create/Delete or Update events where the specific tenant annotation key is added/removed/changed (compare old/new annotations for the tenant key), and keep using handler.EnqueueRequestsFromMapFunc(r.namespacesToEvalHub); apply the same predicate-based filtering to the other Watches in the same block (the ones referenced around lines 221-240) so only tenant-relevant namespace changes trigger reconciles.controllers/evalhub/service_accounts.go (2)
900-905: Reconcile labels/annotations on existing tenant resources to prevent orphaned RBAC.For pre-existing objects, metadata drift is not corrected. Since cleanup relies on
tenantLabel, missing labels can leave stale RBAC after annotation removal or instance deletion.Also applies to: 947-963
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@controllers/evalhub/service_accounts.go` around lines 900 - 905, When an existing tenant resource is found (e.g., in ensureTenantServiceAccount) its metadata should be reconciled instead of returning early; merge/patch the provided labels and annotations into the existing object's ObjectMeta (preserving other keys) and call Update (or Patch) to persist changes so tenantLabel and annotations are never missing; apply the same pattern to the other tenant reconciliation functions referenced (the code block around lines 947-963) so pre-existing Role/RoleBinding/ServiceAccount objects get their labels and annotations reconciled rather than skipped.
773-779: Filter out non-tenant control-plane namespaces by annotation value.Current logic treats any namespace with
tenantAnnotationas tenant. If control-plane namespaces also carry that key (e.g.,"Control Plane"), tenant RBAC can be provisioned there unintentionally.✅ Suggested tenant namespace filter
- if _, ok := ns.Annotations[tenantAnnotation]; ok { - tenantNS[ns.Name] = true - } + if desc, ok := ns.Annotations[tenantAnnotation]; ok { + if strings.EqualFold(strings.TrimSpace(desc), "Control Plane") { + continue + } + tenantNS[ns.Name] = true + }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@controllers/evalhub/service_accounts.go` around lines 773 - 779, The loop currently treats any namespace with the key tenantAnnotation as a tenant; change it to also validate the annotation value before marking tenantNS[ns.Name] = true (e.g., read value := ns.Annotations[tenantAnnotation] and only set tenantNS when value equals the canonical tenant marker like "tenant" or a configurable expectedTenantAnnotationValue), and skip namespaces where the annotation value indicates control-plane (e.g., "Control Plane") or is empty; update the code around nsList, ns.Annotations, tenantAnnotation, tenantNS and instance.Namespace accordingly.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@controllers/evalhub/service_accounts.go`:
- Around line 791-792: The tenant ownership selector uses only instance.Name
(managedLabel := client.MatchingLabels{tenantLabel: instance.Name}) which can
collide across namespaces; change the label value to include the namespace
(e.g., combine instance.Namespace and instance.Name or use
instance.GetNamespace()) so the selector is unique per EvalHub CR, and update
the other occurrences of tenantLabel usage (the other managedLabel/selector
constructions around the references noted) to use the same namespaced label
format so RBAC cleanup/update won't cross namespaces.
---
Nitpick comments:
In `@controllers/evalhub/evalhub_controller.go`:
- Around line 212-214: The namespace Watch currently enqueues all EvalHub
reconciles for any Namespace change; restrict it by adding a predicate
(predicate.Funcs) to the Watches on corev1.Namespace that only returns true for
Create/Delete or Update events where the specific tenant annotation key is
added/removed/changed (compare old/new annotations for the tenant key), and keep
using handler.EnqueueRequestsFromMapFunc(r.namespacesToEvalHub); apply the same
predicate-based filtering to the other Watches in the same block (the ones
referenced around lines 221-240) so only tenant-relevant namespace changes
trigger reconciles.
In `@controllers/evalhub/service_accounts.go`:
- Around line 900-905: When an existing tenant resource is found (e.g., in
ensureTenantServiceAccount) its metadata should be reconciled instead of
returning early; merge/patch the provided labels and annotations into the
existing object's ObjectMeta (preserving other keys) and call Update (or Patch)
to persist changes so tenantLabel and annotations are never missing; apply the
same pattern to the other tenant reconciliation functions referenced (the code
block around lines 947-963) so pre-existing Role/RoleBinding/ServiceAccount
objects get their labels and annotations reconciled rather than skipped.
- Around line 773-779: The loop currently treats any namespace with the key
tenantAnnotation as a tenant; change it to also validate the annotation value
before marking tenantNS[ns.Name] = true (e.g., read value :=
ns.Annotations[tenantAnnotation] and only set tenantNS when value equals the
canonical tenant marker like "tenant" or a configurable
expectedTenantAnnotationValue), and skip namespaces where the annotation value
indicates control-plane (e.g., "Control Plane") or is empty; update the code
around nsList, ns.Annotations, tenantAnnotation, tenantNS and instance.Namespace
accordingly.
In `@controllers/evalhub/tenant_rbac_test.go`:
- Around line 255-267: Replace the loose "err != nil" assertions used after
fakeClient.Get for the service account (sa) and RoleBinding (rb) cleanup checks
with explicit NotFound checks using k8s API errors (apierrors.IsNotFound(err));
update the assertions around the Get calls (the ones referencing
evalHubName+"-jobs" and evalHubName+"-tenant-jobs-writer") to assert that
apierrors.IsNotFound(err) is true and include a clear message, and import
k8s.io/apimachinery/pkg/api/errors as apierrors if not already present; apply
the same change to the other similar assertions mentioned (around lines 329-342
and 404-409).
ℹ️ Review info
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (5)
config/rbac/role.yamlcontrollers/evalhub/constants.gocontrollers/evalhub/evalhub_controller.gocontrollers/evalhub/service_accounts.gocontrollers/evalhub/tenant_rbac_test.go
| managedLabel := client.MatchingLabels{tenantLabel: instance.Name} | ||
|
|
There was a problem hiding this comment.
Tenant ownership selector is not unique enough across namespaces.
Using only instance.Name for tenantLabel allows collisions when two EvalHub CRs share a name in different namespaces, which can cause cross-instance update/cleanup of tenant RBAC.
🛠️ Suggested ownership key hardening
@@
- managedLabel := client.MatchingLabels{tenantLabel: instance.Name}
+ ownerKey := normalizeDNS1123LabelValue(instance.Namespace + "-" + instance.Name)
+ managedLabel := client.MatchingLabels{tenantLabel: ownerKey}
@@
- managedLabels := map[string]string{
- tenantLabel: instance.Name,
+ ownerKey := normalizeDNS1123LabelValue(instance.Namespace + "-" + instance.Name)
+ managedLabels := map[string]string{
+ tenantLabel: ownerKey,
@@
- managedLabel := client.MatchingLabels{tenantLabel: instance.Name}
+ ownerKey := normalizeDNS1123LabelValue(instance.Namespace + "-" + instance.Name)
+ managedLabel := client.MatchingLabels{tenantLabel: ownerKey}Also applies to: 836-838, 974-975
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@controllers/evalhub/service_accounts.go` around lines 791 - 792, The tenant
ownership selector uses only instance.Name (managedLabel :=
client.MatchingLabels{tenantLabel: instance.Name}) which can collide across
namespaces; change the label value to include the namespace (e.g., combine
instance.Namespace and instance.Name or use instance.GetNamespace()) so the
selector is unique per EvalHub CR, and update the other occurrences of
tenantLabel usage (the other managedLabel/selector constructions around the
references noted) to use the same namespaced label format so RBAC cleanup/update
won't cross namespaces.
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: mariusdanciu The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
2006c90
into
trustyai-explainability:dev/evalhub-multi-tenancy
Summary
Changes
Unchanged:
Summary by CodeRabbit
Release Notes
New Features
Tests