feat: Add MLFlow integration #646
Conversation
Reviewer's Guide

Adds MLFlow integration by granting EvalHub service accounts the necessary Kubernetes RBAC via RoleBindings to the built-in "edit" ClusterRole, and wiring MLFlow-related environment variables and projected service account token/CA volumes into the EvalHub deployment.

Sequence diagram for MLFlow kubernetes-auth SubjectAccessReview with the new RoleBinding:

```mermaid
sequenceDiagram
    actor User
    participant EvalHubPod
    participant MlflowServer
    participant KubeApi
    participant RbacAuthz
    User->>EvalHubPod: Invoke MLFlow-tracked operation
    EvalHubPod->>EvalHubPod: Read MLFLOW_TOKEN_PATH
    EvalHubPod->>MlflowServer: HTTP request with SA token
    MlflowServer->>KubeApi: SubjectAccessReview for token in workspace namespace
    KubeApi->>RbacAuthz: Evaluate permissions using RoleBindings and ClusterRoles
    RbacAuthz->>RbacAuthz: Find RoleBinding evalhub-mlflow-proxy/jobs
    RbacAuthz->>RbacAuthz: Resolve ClusterRole edit
    RbacAuthz-->>KubeApi: Allow if edit grants requested verbs
    KubeApi-->>MlflowServer: SubjectAccessReview allowed
    MlflowServer-->>EvalHubPod: Request succeeds
    EvalHubPod-->>User: Operation completed
```
Flow diagram for createServiceAccount with MLFlow RoleBindings:

```mermaid
graph TD
    Start["Start createServiceAccount"] --> CreateSA["Create main ServiceAccount"]
    CreateSA --> CreateJobsSA["Create jobs ServiceAccount"]
    CreateJobsSA --> RBProxy["Create evalhub proxy RoleBinding (existing)"]
    RBProxy --> RBJobsProxy["Create jobs proxy RoleBinding (existing)"]
    RBJobsProxy --> RBMlflowProxy["createMLFlowAccessRoleBinding for main SA (suffix proxy)"]
    RBMlflowProxy --> RBMlflowJobs["createMLFlowAccessRoleBinding for jobs SA (suffix jobs)"]
    RBMlflowJobs --> End["Done"]
```
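The RoleBinding construction that the flow above describes can be sketched as follows. This is a minimal illustration using simplified stand-in types, not the operator's actual code, which builds `rbacv1.RoleBinding` objects through the controller-runtime client; the `"<instance>-mlflow-<suffix>"` name format is an assumption inferred from the diagram.

```go
package main

import "fmt"

// Simplified stand-ins for the rbacv1 types the operator would use.
type RoleRef struct{ APIGroup, Kind, Name string }
type Subject struct{ Kind, Name, Namespace string }
type RoleBinding struct {
	Name, Namespace string
	Subjects        []Subject
	RoleRef         RoleRef
}

// buildMLFlowAccessRoleBinding mirrors the naming visible in the diagrams
// ("evalhub-mlflow-proxy" / "evalhub-mlflow-jobs"): the suffix distinguishes
// the main and jobs service accounts, and the RoleRef points at a ClusterRole.
func buildMLFlowAccessRoleBinding(instance, namespace, saName, suffix, clusterRole string) RoleBinding {
	return RoleBinding{
		Name:      fmt.Sprintf("%s-mlflow-%s", instance, suffix),
		Namespace: namespace,
		Subjects:  []Subject{{Kind: "ServiceAccount", Name: saName, Namespace: namespace}},
		RoleRef:   RoleRef{APIGroup: "rbac.authorization.k8s.io", Kind: "ClusterRole", Name: clusterRole},
	}
}

func main() {
	rb := buildMLFlowAccessRoleBinding("evalhub", "workspace", "evalhub-sa", "proxy", "edit")
	fmt.Println(rb.Name, "->", rb.RoleRef.Name) // evalhub-mlflow-proxy -> edit
}
```

Because the RoleBinding lives in the workspace namespace while referencing a ClusterRole, the grant stays namespace-scoped even though the role definition is cluster-wide.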
📝 Walkthrough

This PR refactors EvalHub RBAC from proxy-centric to API and MLFlow-centric authorization with per-instance scoping. It introduces Service CA and MLFlow token handling in deployments, replaces monolithic resource-manager roles with granular least-privilege roles, and updates tests and manifests accordingly.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Operator as Operator Reconciler
    participant K8sAPI as Kubernetes API
    participant RBAC as RBAC System
    participant Pod as EvalHub Pod
    Operator->>K8sAPI: Create per-instance API Access Role
    K8sAPI-->>Operator: Role created
    Operator->>K8sAPI: Create API Access RoleBinding<br/>(binds Role to API SA)
    K8sAPI-->>Operator: RoleBinding created
    Operator->>K8sAPI: Create MLFlow Access RoleBinding<br/>(binds ClusterRole to API SA)
    K8sAPI-->>Operator: RoleBinding created
    Operator->>K8sAPI: Create Deployment with volumes<br/>(service-ca, mlflow-token)
    K8sAPI-->>Operator: Deployment created
    K8sAPI->>Pod: Inject service-ca volume<br/>and mlflow-token volume
    K8sAPI->>Pod: Mount volumes as read-only
    K8sAPI->>Pod: Set env vars (MLFLOW_CA_CERT_PATH, MLFLOW_TOKEN_PATH)
    Pod->>RBAC: Check permissions for operations
    RBAC-->>Pod: Authorized via per-instance Role<br/>and MLFlow ClusterRole
    Pod->>Pod: Access service-ca from mount
    Pod->>Pod: Access mlflow-token from mount
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes
🚥 Pre-merge checks: 3 passed, 1 failed (warning).
Hey - I've found 1 issue and left some high-level feedback:
- Using the built-in `edit` ClusterRole for MLFlow access is quite broad; if MLFlow only needs a subset of verbs/resources for its SubjectAccessReview checks, consider defining a narrower Role/ClusterRole to reduce the permission surface.
- The `ServiceAccountTokenProjection` for the MLFlow token volume does not set an `Audience`; consider specifying a dedicated audience expected by MLFlow to avoid overly generic tokens that could be reused by other consumers.
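The audience suggestion above could look roughly like this. It is a sketch with a simplified stand-in for `corev1.ServiceAccountTokenProjection` (the real type uses pointer fields such as `*int64` for the expiry); the audience string `"mlflow"` and the 3600-second expiry are assumptions, not values from the PR.

```go
package main

import "fmt"

// Simplified stand-in for corev1.ServiceAccountTokenProjection.
type ServiceAccountTokenProjection struct {
	Audience          string
	ExpirationSeconds int64
	Path              string
}

// mlflowTokenProjection sets a dedicated audience so the projected token is
// rejected by any consumer that does not present that audience, limiting
// replay against other services.
func mlflowTokenProjection() ServiceAccountTokenProjection {
	return ServiceAccountTokenProjection{
		Audience:          "mlflow", // hypothetical dedicated audience
		ExpirationSeconds: 3600,     // assumed short-lived token
		Path:              "token",
	}
}

func main() {
	p := mlflowTokenProjection()
	fmt.Printf("audience=%s expiry=%ds path=%s\n", p.Audience, p.ExpirationSeconds, p.Path)
}
```

Verifiers on the MLFlow side would then need to be configured to accept (and require) the same audience.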
## Individual Comments
### Comment 1
<location> `controllers/evalhub/service_accounts.go:261` </location>
<code_context>
+// MLFlow access uses the built-in "edit" ClusterRole which provides the permissions
+// that MLFlow's kubernetes-auth plugin checks via SubjectAccessReview.
+const mlflowAccessClusterRoleName = "edit"
+
+// createMLFlowAccessRoleBinding creates a RoleBinding for a ServiceAccount to the "edit"
</code_context>
<issue_to_address>
**🚨 suggestion (security):** Binding to the built-in "edit" ClusterRole may grant broader permissions than strictly required for MLFlow access.
Relying on the broad "edit" role is convenient but likely over-privileged for MLFlow’s SubjectAccessReview needs. Consider defining a dedicated ClusterRole with only the specific verbs/resources required by the kubernetes-auth plugin and binding to that instead to limit blast radius if MLFlow credentials are compromised.
Suggested implementation:
```golang
// MLFlow access uses a dedicated ClusterRole which provides the minimal permissions
// that MLFlow's kubernetes-auth plugin checks via SubjectAccessReview.
const mlflowAccessClusterRoleName = "mlflow-access"
```
```golang
// createMLFlowAccessRoleBinding creates a RoleBinding for a ServiceAccount to the
// mlflow-access ClusterRole in the instance namespace. This allows the ServiceAccount
// to pass MLFlow's kubernetes-auth SubjectAccessReview checks in the workspace namespace
// while avoiding the broader permissions of the built-in "edit" role.
```
To fully implement the least-privilege approach:
1. Define a `ClusterRole` named `mlflow-access` in your RBAC manifests (YAML) with only the specific verbs/resources required by the MLFlow kubernetes-auth plugin’s SubjectAccessReview checks.
2. Ensure that any existing references in manifests or code that assumed binding to the built-in `edit` ClusterRole are updated to use the new `mlflow-access` role instead.
</issue_to_address>
Actionable comments posted: 2
🤖 Fix all issues with AI agents
In `@controllers/evalhub/service_accounts.go`:
- Around line 85-95: The RoleBindings for MLFlow are created unconditionally —
wrap the two calls to createMLFlowAccessRoleBinding (for serviceAccountName and
jobsServiceAccountName) in the same MLFlow-enabled check used elsewhere (e.g.,
the deployment MLFlow gate in deployment.go); only call
r.createMLFlowAccessRoleBinding(ctx, instance, serviceAccountName, "proxy") and
r.createMLFlowAccessRoleBinding(ctx, instance, jobsServiceAccountName, "jobs")
when the MLFlow feature flag/config check (the function or boolean you use to
decide MLFlow deployment) returns true so RBAC is only applied when MLFlow is
configured.
- Around line 259-261: The constant mlflowAccessClusterRoleName currently points
to the built-in "edit" ClusterRole which is overly permissive and does not grant
MLFlow the specific SubjectAccessReview permission it needs; create a minimal
custom ClusterRole (e.g., name it
"trustyai-service-operator-evalhub-mlflow-access") that grants
authorization.k8s.io/subjectaccessreviews:create (and optionally
authentication.k8s.io/tokenreviews:create) and replace the value of
mlflowAccessClusterRoleName with that custom role name; ensure any RBAC
manifests/creation logic uses the new ClusterRole name so the service account
binds to the minimal custom role rather than "edit".
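The minimal ClusterRole described above can be expressed as the following rule set. This is a sketch using a simplified stand-in for `rbacv1.PolicyRule`; the role name and the inclusion of `tokenreviews:create` follow the review comment, which marks the latter as optional.

```go
package main

import "fmt"

// Simplified stand-in for rbacv1.PolicyRule.
type PolicyRule struct {
	APIGroups []string
	Resources []string
	Verbs     []string
}

// mlflowAccessRules captures the least-privilege grants the review asks for:
// subjectaccessreviews:create, plus the optional tokenreviews:create.
func mlflowAccessRules() []PolicyRule {
	return []PolicyRule{
		{APIGroups: []string{"authorization.k8s.io"}, Resources: []string{"subjectaccessreviews"}, Verbs: []string{"create"}},
		{APIGroups: []string{"authentication.k8s.io"}, Resources: []string{"tokenreviews"}, Verbs: []string{"create"}},
	}
}

func main() {
	for _, r := range mlflowAccessRules() {
		fmt.Printf("%s/%s: %v\n", r.APIGroups[0], r.Resources[0], r.Verbs)
	}
}
```

Binding to a role like this instead of `edit` means a leaked MLFlow token can only perform access reviews, not mutate workload resources.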
🧹 Nitpick comments (3)
controllers/evalhub/deployment.go (2)
122-133: MLFlow environment variables are injected unconditionally.
`MLFLOW_CA_CERT_PATH`, `MLFLOW_WORKSPACE`, and `MLFLOW_TOKEN_PATH` are always set regardless of whether MLFlow integration is actually needed. This is consistent with the unconditional volume mounts (flagged above). If you gate the volumes behind an MLFlow-enabled check, these env vars should be gated similarly.
146-155: MLFlow volume mounts are also unconditionally added — keep them in sync with any gating applied to the volumes.
Same concern as the volumes and env vars above. If the volumes are gated or made optional, these mounts should follow the same pattern.
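The gating pattern suggested in these two nitpicks could be sketched like this. The mount paths and the shape of the feature check are assumptions for illustration; the PR defines the real mount points and any enablement flag.

```go
package main

import "fmt"

type EnvVar struct{ Name, Value string }

// buildMLFlowEnv appends the MLFlow env vars only when MLFlow is enabled,
// mirroring the gating the review asks for. The paths are assumed values.
func buildMLFlowEnv(mlflowEnabled bool) []EnvVar {
	var env []EnvVar
	if mlflowEnabled {
		env = append(env,
			EnvVar{Name: "MLFLOW_CA_CERT_PATH", Value: "/etc/mlflow/ca/service-ca.crt"},
			EnvVar{Name: "MLFLOW_TOKEN_PATH", Value: "/etc/mlflow/token/token"},
		)
	}
	return env
}

func main() {
	fmt.Println(len(buildMLFlowEnv(false)), len(buildMLFlowEnv(true))) // 0 2
}
```

Applying the same boolean to the volumes, volume mounts, env vars, and RoleBindings keeps all four in lockstep, which is the consistency point the reviewer is making.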
controllers/evalhub/service_accounts.go (1)
266-331: Consider extracting common RoleBinding reconciliation into a shared helper.
The create-or-update-RoleBinding pattern (check existence → create if missing → update subjects or delete+recreate if RoleRef changed) is repeated across `createResourceManagementRoleBinding`, `createJobsResourceManagementRoleBinding`, `createJobsProxyRoleBinding`, and now `createMLFlowAccessRoleBinding`. A shared helper like `reconcileRoleBinding(ctx, instance, name, subjects, roleRef)` would reduce ~50 lines of duplication per call site and make future additions less error-prone.
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In `@controllers/evalhub/service_accounts.go`:
- Around line 85-95: The comment above the MLFlow RoleBinding creation is stale
(it mentions the "edit" ClusterRole) — update the comment to reflect that
createMLFlowAccessRoleBinding now binds to the custom ClusterRole
"evalhub-mlflow-access" (used in createMLFlowAccessRoleBinding). Edit the
comment to describe that MLFlow's kubernetes-auth plugin validates tokens via
SubjectAccessReview against the workspace namespace and that the custom
"evalhub-mlflow-access" ClusterRole provides the required permissions for both
service accounts (serviceAccountName and jobsServiceAccountName).
🧹 Nitpick comments (1)
controllers/evalhub/service_accounts.go (1)
259-334: Consider extracting a shared `reconcileRoleBinding` helper to reduce duplication.
`createMLFlowAccessRoleBinding`, `createResourceManagementRoleBinding`, `createJobsResourceManagementRoleBinding`, and `createJobsProxyRoleBinding` all follow the identical get-or-create/update/delete-recreate pattern, differing only in the RoleBinding name, labels, subjects, and roleRef. A single parameterized helper would eliminate ~80 lines of near-duplicate code and make it easier to consistently fix any reconciliation bugs across all bindings.
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In `@controllers/evalhub/deployment_test.go`:
- Around line 374-375: The test assertions for evalhub container VolumeMounts
are outdated: update the Expect(...VolumeMounts).To(HaveLen(...)) checks so they
match the current mounts (evalhub-config, service-ca, mlflow-token and optional
DB secret). Specifically, change the assertion that currently expects 2 mounts
(the Expect(container.VolumeMounts).To(HaveLen(2)) in the DB-configured test) to
expect 4 mounts, and change the assertion that expects 1 mount (the
Expect(container.VolumeMounts).To(HaveLen(1)) in the non-DB test) to expect 3
mounts; leave the existing deployment.Spec.Template.Spec.Volumes expectation
as-is. Use the container variable/identifier used in the tests to locate the two
assertions to update.
🧹 Nitpick comments (1)
controllers/evalhub/service_accounts.go (1)
260-335: Consider extracting a generic RoleBinding reconciliation helper.
`createMLFlowAccessRoleBinding`, `createResourceManagementRoleBinding`, `createJobsResourceManagementRoleBinding`, and `createJobsProxyRoleBinding` share nearly identical get→create / compare→update / delete→recreate logic. A single `reconcileRoleBinding(ctx, instance, name, namespace, labels, subjects, roleRef)` helper would eliminate ~100 lines of duplication and make future additions (like this one) trivial.
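The get→create / compare→update / delete→recreate pattern this nitpick describes can be sketched as below. A map stands in for cluster state, and simplified types replace the rbacv1 ones; on a real cluster the delete+recreate branch exists because a RoleBinding's `roleRef` is immutable.

```go
package main

import "fmt"

type RoleRef struct{ APIGroup, Kind, Name string }
type Subject struct{ Kind, Name, Namespace string }
type RoleBinding struct {
	Name     string
	Subjects []Subject
	RoleRef  RoleRef
}

// reconcileRoleBinding sketches the shared reconciliation pattern against an
// in-memory store. It returns the action taken so callers can log it.
func reconcileRoleBinding(store map[string]*RoleBinding, desired RoleBinding) string {
	existing, ok := store[desired.Name]
	if !ok {
		store[desired.Name] = &desired
		return "created"
	}
	if existing.RoleRef != desired.RoleRef {
		// roleRef is immutable on real RoleBindings: delete and recreate.
		delete(store, desired.Name)
		store[desired.Name] = &desired
		return "recreated"
	}
	existing.Subjects = desired.Subjects // refresh mutable fields in place
	return "updated"
}

func main() {
	store := map[string]*RoleBinding{}
	rb := RoleBinding{Name: "evalhub-mlflow-proxy", RoleRef: RoleRef{Kind: "ClusterRole", Name: "edit"}}
	fmt.Println(reconcileRoleBinding(store, rb)) // created
	fmt.Println(reconcileRoleBinding(store, rb)) // updated
	rb.RoleRef.Name = "mlflow-access"
	fmt.Println(reconcileRoleBinding(store, rb)) // recreated
}
```

Each existing `create...RoleBinding` function would then shrink to computing its name, labels, subjects, and roleRef, and delegating to the helper.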
tarilabs
left a comment
adds the necessary Operator and EH Job grants
/lgtm
thanks @ruivieira
@coderabbitai full review
✅ Actions performed: full review triggered.
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
controllers/evalhub/service_accounts.go (1)
331-394: ⚠️ Potential issue | 🔴 Critical
Auth-reviewer CRB correctly omits owner references — but the label issue remains.
Good: the comment at line 367 correctly explains why owner references cannot be set on cluster-scoped resources. The subjects/roleRef comparison and update logic is sound.
However, this is where the `app.kubernetes.io/name` label is set to the full CRB name (line 343), which is the root cause of the pipeline failures flagged above.
🤖 Fix all issues with AI agents
In `@controllers/evalhub/proxy_rbac_test.go`:
- Around line 17-23: The test failures are caused by cluster role binding names
(produced by generateAuthReviewerClusterRoleBindingName) exceeding Kubernetes'
63-char label limit and being placed verbatim into the app.kubernetes.io/name
label in service_accounts.go; fix service_accounts.go so the code that assigns
the app.kubernetes.io/name label for service accounts (the label assignment
logic near where service accounts are created) enforces Kubernetes label
length/format constraints by truncating or hashing the generated name to a
DNS-1123 compatible string no longer than 63 characters (preserve uniqueness,
e.g., keep a human-readable prefix and append a short hash), and use that
normalized value instead of the raw generateAuthReviewerClusterRoleBindingName
output.
In `@controllers/evalhub/service_accounts.go`:
- Around line 33-35: generateAuthReviewerClusterRoleBindingName builds a long
string (<name>-<namespace>-auth-reviewer-crb) that is later used as the
app.kubernetes.io/name label in createAuthReviewerClusterRoleBinding, which can
exceed Kubernetes' 63-char label limit; modify the code so the full CRB
metadata.name still uses generateAuthReviewerClusterRoleBindingName but the
label uses a safe, truncation-or-hash-derived value (e.g., take the first 63
characters or use a stable short hash of the full name) before setting
app.kubernetes.io/name in createAuthReviewerClusterRoleBinding; ensure the
label-producing logic is deterministic and referenced by both
createAuthReviewerClusterRoleBinding and any other places reading that label.
🧹 Nitpick comments (4)
controllers/evalhub/rbac_manifests_test.go (2)
38-49: Minor: only the first rule containing `configmaps` is checked.
If the manifest ever has multiple rules referencing `configmaps` (unlikely but possible), only the first one's verbs are validated. Current behavior is fine given the manifest structure, but worth noting.
54-65: Consider using `reflect.DeepEqual` or `slices.Equal` for the sorted comparison.
The manual element-by-element comparison works correctly but could be simplified.
♻️ Optional simplification

```diff
+	"reflect"
 ...
 	sort.Strings(gotVerbs)
 	sort.Strings(wantVerbs)
-	if len(gotVerbs) != len(wantVerbs) {
-		t.Fatalf("unexpected verbs for configmaps: got=%v want=%v", gotVerbs, wantVerbs)
-	}
-	for i := range wantVerbs {
-		if gotVerbs[i] != wantVerbs[i] {
-			t.Fatalf("unexpected verbs for configmaps: got=%v want=%v", gotVerbs, wantVerbs)
-		}
+	if !reflect.DeepEqual(gotVerbs, wantVerbs) {
+		t.Fatalf("unexpected verbs for configmaps: got=%v want=%v", gotVerbs, wantVerbs)
 	}
```

controllers/evalhub/service_accounts.go (2)
38-128: `createServiceAccount` has grown into a large orchestration function creating ~8 RBAC resources.
This function now creates a ServiceAccount, a ClusterRoleBinding, 2 Roles, and 5 RoleBindings. If any step fails mid-way, partial RBAC state is left behind (though reconciliation will retry). Consider extracting an `ensureRBACForInstance` helper to improve readability and make the orchestration explicit.
Not a blocker given the reconciler's idempotent retry design, but worth considering for maintainability.
504-531: `equalPolicyRules` uses positional comparison — fragile if rule order changes.
The function compares rules by index position. If the desired rules are reordered (e.g., during a refactor), existing roles will be unnecessarily updated. For the current usage where both sides are constructed in the same code this is acceptable, but consider sorting before comparison for robustness.
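An order-insensitive comparison along the lines suggested could look like this. It is a sketch using a simplified stand-in for `rbacv1.PolicyRule`; the `fmt.Sprint`-based sort key is a convenience choice, not how the operator necessarily implements it.

```go
package main

import (
	"fmt"
	"reflect"
	"sort"
)

// Simplified stand-in for rbacv1.PolicyRule.
type PolicyRule struct {
	APIGroups []string
	Resources []string
	Verbs     []string
}

// canonicalize deep-copies the rules, sorts each rule's string slices, then
// sorts the rules themselves, so equality no longer depends on declaration order.
func canonicalize(rules []PolicyRule) []PolicyRule {
	out := make([]PolicyRule, len(rules))
	for i, r := range rules {
		out[i].APIGroups = append([]string(nil), r.APIGroups...)
		out[i].Resources = append([]string(nil), r.Resources...)
		out[i].Verbs = append([]string(nil), r.Verbs...)
		sort.Strings(out[i].APIGroups)
		sort.Strings(out[i].Resources)
		sort.Strings(out[i].Verbs)
	}
	sort.Slice(out, func(i, j int) bool { return fmt.Sprint(out[i]) < fmt.Sprint(out[j]) })
	return out
}

func equalPolicyRulesUnordered(a, b []PolicyRule) bool {
	return reflect.DeepEqual(canonicalize(a), canonicalize(b))
}

func main() {
	a := []PolicyRule{{Resources: []string{"pods"}, Verbs: []string{"get", "list"}}, {Resources: []string{"configmaps"}, Verbs: []string{"get"}}}
	b := []PolicyRule{{Resources: []string{"configmaps"}, Verbs: []string{"get"}}, {Resources: []string{"pods"}, Verbs: []string{"list", "get"}}}
	fmt.Println(equalPolicyRulesUnordered(a, b)) // true
}
```

The deep copy matters: sorting the input's inner slices in place would silently mutate the caller's desired-state objects.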
```diff
 var _ = Describe("EvalHub API RBAC", func() {
 	const (
-		testNamespacePrefix     = "evalhub-proxy-rbac-test"
+		testNamespacePrefix     = "evalhub-api-rbac-test"
 		operatorNamespacePrefix = "operator-system"
-		evalHubName             = "proxy-rbac-evalhub"
+		evalHubName             = "api-rbac-evalhub"
 		configMapName           = "trustyai-service-operator-config"
 	)
```
All pipeline failures in this file trace to app.kubernetes.io/name label exceeding 63 chars.
The test namespace names include nanosecond timestamps (line 38), making them ~40 chars long. Combined with evalHubName in generateAuthReviewerClusterRoleBindingName, the CRB name exceeds 63 characters and is then set verbatim as the app.kubernetes.io/name label in service_accounts.go (line 343). Kubernetes rejects label values >63 chars.
This is addressed in the service_accounts.go review comment — the fix needs to happen there, not in the tests. The test is correctly exposing a real production-time risk.
🤖 Prompt for AI Agents
In `@controllers/evalhub/proxy_rbac_test.go` around lines 17 - 23, The test
failures are caused by cluster role binding names (produced by
generateAuthReviewerClusterRoleBindingName) exceeding Kubernetes' 63-char label
limit and being placed verbatim into the app.kubernetes.io/name label in
service_accounts.go; fix service_accounts.go so the code that assigns the
app.kubernetes.io/name label for service accounts (the label assignment logic
near where service accounts are created) enforces Kubernetes label length/format
constraints by truncating or hashing the generated name to a DNS-1123 compatible
string no longer than 63 characters (preserve uniqueness, e.g., keep a
human-readable prefix and append a short hash), and use that normalized value
instead of the raw generateAuthReviewerClusterRoleBindingName output.
```go
func generateAuthReviewerClusterRoleBindingName(instance *evalhubv1alpha1.EvalHub) string {
	return instance.Name + "-" + instance.Namespace + "-auth-reviewer-crb"
}
```
CRB name used as label value can exceed Kubernetes' 63-character limit — root cause of all pipeline failures.
generateAuthReviewerClusterRoleBindingName produces <name>-<namespace>-auth-reviewer-crb. This value is then set as the app.kubernetes.io/name label (line 343), but Kubernetes label values are capped at 63 characters. With even moderately sized instance names and namespace names this limit is easily breached, as confirmed by the pipeline failures.
The .metadata.name field allows up to 253 characters, so the CRB name itself is fine — only the label needs to be truncated or computed differently.
Proposed fix

```diff
 func generateAuthReviewerClusterRoleBindingName(instance *evalhubv1alpha1.EvalHub) string {
 	return instance.Name + "-" + instance.Namespace + "-auth-reviewer-crb"
 }
+
+// truncateLabel truncates a string to the Kubernetes label value limit (63 chars).
+func truncateLabel(s string, maxLen int) string {
+	if len(s) <= maxLen {
+		return s
+	}
+	return s[:maxLen]
+}
```

Then in `createAuthReviewerClusterRoleBinding` (line 343):
```diff
 Labels: map[string]string{
 	"app":                        "eval-hub",
-	"app.kubernetes.io/name":     clusterRoleBindingName,
+	"app.kubernetes.io/name":     truncateLabel(clusterRoleBindingName, 63),
 	"app.kubernetes.io/instance": instance.Name,
 	"app.kubernetes.io/part-of":  "eval-hub",
 	"app.kubernetes.io/version":  constants.Version,
 },
```

🤖 Prompt for AI Agents
In `@controllers/evalhub/service_accounts.go` around lines 33 - 35,
generateAuthReviewerClusterRoleBindingName builds a long string
(<name>-<namespace>-auth-reviewer-crb) that is later used as the
app.kubernetes.io/name label in createAuthReviewerClusterRoleBinding, which can
exceed Kubernetes' 63-char label limit; modify the code so the full CRB
metadata.name still uses generateAuthReviewerClusterRoleBindingName but the
label uses a safe, truncation-or-hash-derived value (e.g., take the first 63
characters or use a stable short hash of the full name) before setting
app.kubernetes.io/name in createAuthReviewerClusterRoleBinding; ensure the
label-producing logic is deterministic and referenced by both
createAuthReviewerClusterRoleBinding and any other places reading that label.
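The hash-based variant the prompt above prefers over plain truncation could look like this. It is a sketch: the 8-hex-character suffix length is an assumption, and the result should additionally satisfy the DNS-1123 character rules if instance names can contain unusual characters.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"strings"
)

// labelSafe keeps a value within the 63-character Kubernetes label limit.
// Unlike plain truncation, it appends a short hash of the full name so two
// long names sharing a 63-char prefix still map to distinct label values.
func labelSafe(s string) string {
	const maxLen = 63
	if len(s) <= maxLen {
		return s
	}
	sum := sha256.Sum256([]byte(s))
	suffix := "-" + hex.EncodeToString(sum[:])[:8] // 9 chars total
	return s[:maxLen-len(suffix)] + suffix
}

func main() {
	long := strings.Repeat("a", 80) + "-auth-reviewer-crb"
	fmt.Println(len(labelSafe(long)), labelSafe("short"))
}
```

Because the hash is derived deterministically from the full name, every code path that reads or writes the label computes the same value, which is the determinism requirement the prompt calls out.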
@ruivieira: The following test failed.
tarilabs
left a comment
thanks @ruivieira
minor comments below for your 👀
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull request has been approved by: ppadashe-psp, tarilabs. It still needs approval from an approver for each of the affected files.
Co-authored-by: Matteo Mortari <matteo.mortari@gmail.com>
New changes are detected. LGTM label has been removed. |
Merged 353b718 into trustyai-explainability:main
Summary by Sourcery
Integrate EvalHub with MLFlow by configuring RBAC, environment variables, and pod volumes required for Kubernetes-authenticated access.