RHOAIENG-47877: Add TLS and mTLS reconciler logic#158
RHOAIENG-47877: Add TLS and mTLS reconciler logic#158kryanbeane wants to merge 1 commit intoopendatahub-io:enhanced-security-featuresfrom
Conversation
|
Important Review skippedAuto reviews are disabled on base/target branches other than the default branch. Please check the settings in the CodeRabbit UI or the You can disable this status message by setting the Use the checkbox below for a quick retry:
✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Actionable comments posted: 6
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@ray-operator/controllers/ray/common/pod_test.go`:
- Around line 2295-2298: The test currently iterates over
podTemplate.Spec.InitContainers and asserts RAY_USE_TLS per init container but
can vacuously pass when InitContainers is empty; update the test to first assert
that podTemplate.Spec.InitContainers is non-empty (e.g., using assert.NotEmpty
or assert.Greater/Len) before the loop so the test fails when no init containers
are present, then keep the existing loop that calls getEnvVar(initContainer,
utils.RAY_USE_TLS) and asserts the env var is not nil.
In `@ray-operator/controllers/ray/common/pod.go`:
- Around line 293-337: The TLS volume and environment variables are being
appended unconditionally in configureTLS and SetContainerTLSConfig which can
create duplicate volumes/envs/mounts; before appending, check
podTemplate.Spec.Volumes for an existing volume with Name ==
utils.RayTLSVolumeName and only add it if missing, and in SetContainerTLSConfig
check the container.Env for existing names (utils.RAY_USE_TLS,
utils.RAY_TLS_SERVER_CERT, utils.RAY_TLS_SERVER_KEY, utils.RAY_TLS_CA_CERT) and
the container.VolumeMounts for a mount with Name == utils.RayTLSVolumeName (or
MountPath == utils.RayTLSCertMountPath) and only append the missing env vars and
the volume mount; apply these guards when modifying
podTemplate.Spec.Containers[utils.RayContainerIndex] and each
podTemplate.Spec.InitContainers[i] to avoid duplicate entries.
In `@ray-operator/controllers/ray/raycluster_controller_tls.go`:
- Around line 87-90: The helpers currently bail out when a resource already
exists (e.g., the Issuer variable check using r.Get with issuerName, the
Certificate creation block, and Secret/ClusterIssuer blocks), preventing spec
drift reconciliation; change each helper to: r.Get the resource, if NotFound
then Create it, otherwise compare the existing object's
spec/labels/annotations/ownerReferences to the desired values and call r.Update
or client.Patch to apply diffs (preserving cluster-generated fields), ensuring
you set owner references (ownerRef) and preserve immutable fields where needed;
replace the early "return nil" on existence with this fetch-compare-update flow
for the issuer, certificate, and secret reconcile blocks (the code regions
around issuerName, certificateName, and secretName).
- Around line 39-47: The composite literal for steps uses the struct type with
fields ordered fn then name, but the entries are written as {"self-signed
issuer", r.reconcileSelfSignedIssuer} etc.; fix each entry in the steps slice so
the function value comes first and the description string second (e.g.,
r.reconcileSelfSignedIssuer, "self-signed issuer") for all entries including
r.reconcileCACertificate, r.reconcileCAIssuer, r.reconcileHeadCertificate and
r.reconcileWorkerCertificate so the values match the declared fields.
In `@ray-operator/controllers/ray/utils/util.go`:
- Around line 735-751: IsTLSEnabled and GetTLSSecretName currently dereference
spec without nil checks which can panic; update
IsTLSEnabled(rayv1.RayClusterSpec) to return false if spec == nil or
spec.TLSOptions == nil, and update GetTLSSecretName(clusterName, spec, nodeType)
to safely handle spec == nil (and spec.TLSOptions/SecretName == nil or empty) by
falling back to the auto-generated secret names (use RayHeadSecretPrefix or
RayWorkerSecretPrefix with clusterName) instead of dereferencing a nil pointer;
ensure all existing branches still return the user-provided SecretName when
present.
ℹ️ Review info
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Cache: Disabled due to data retention organization setting
Knowledge base: Disabled due to data retention organization setting
⛔ Files ignored due to path filters (1)
ray-operator/go.sumis excluded by!**/*.sum
📒 Files selected for processing (10)
ray-operator/config/rbac/role.yamlray-operator/controllers/ray/common/pod.goray-operator/controllers/ray/common/pod_test.goray-operator/controllers/ray/raycluster_controller.goray-operator/controllers/ray/raycluster_controller_tls.goray-operator/controllers/ray/raycluster_controller_tls_test.goray-operator/controllers/ray/utils/constant.goray-operator/controllers/ray/utils/util.goray-operator/go.modray-operator/main.go
| for _, initContainer := range podTemplate.Spec.InitContainers { | ||
| env := getEnvVar(initContainer, utils.RAY_USE_TLS) | ||
| assert.NotNil(t, env, "init container should have RAY_USE_TLS") | ||
| } |
There was a problem hiding this comment.
Avoid vacuous pass in worker init-container TLS assertion
This assertion loop can pass even when no init containers exist. Add an explicit non-empty check so the test actually validates TLS propagation.
💡 Proposed fix
// Verify init container also gets TLS config for GCS health checks.
+ require.NotEmpty(t, podTemplate.Spec.InitContainers, "expected at least one init container to validate TLS propagation")
for _, initContainer := range podTemplate.Spec.InitContainers {
env := getEnvVar(initContainer, utils.RAY_USE_TLS)
assert.NotNil(t, env, "init container should have RAY_USE_TLS")
}📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| for _, initContainer := range podTemplate.Spec.InitContainers { | |
| env := getEnvVar(initContainer, utils.RAY_USE_TLS) | |
| assert.NotNil(t, env, "init container should have RAY_USE_TLS") | |
| } | |
| // Verify init container also gets TLS config for GCS health checks. | |
| require.NotEmpty(t, podTemplate.Spec.InitContainers, "expected at least one init container to validate TLS propagation") | |
| for _, initContainer := range podTemplate.Spec.InitContainers { | |
| env := getEnvVar(initContainer, utils.RAY_USE_TLS) | |
| assert.NotNil(t, env, "init container should have RAY_USE_TLS") | |
| } |
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@ray-operator/controllers/ray/common/pod_test.go` around lines 2295 - 2298,
The test currently iterates over podTemplate.Spec.InitContainers and asserts
RAY_USE_TLS per init container but can vacuously pass when InitContainers is
empty; update the test to first assert that podTemplate.Spec.InitContainers is
non-empty (e.g., using assert.NotEmpty or assert.Greater/Len) before the loop so
the test fails when no init containers are present, then keep the existing loop
that calls getEnvVar(initContainer, utils.RAY_USE_TLS) and asserts the env var
is not nil.
| // configureTLS injects TLS/mTLS configuration into the pod template. | ||
| // This adds the TLS secret volume, volume mounts, and Ray TLS environment variables | ||
| // to the Ray container and any init containers that need TLS for GCS health checks. | ||
| func configureTLS(podTemplate *corev1.PodTemplateSpec, instance rayv1.RayCluster, rayNodeType rayv1.RayNodeType) { | ||
| if !utils.IsTLSEnabled(&instance.Spec) { | ||
| return | ||
| } | ||
|
|
||
| secretName := utils.GetTLSSecretName(instance.Name, &instance.Spec, rayNodeType) | ||
|
|
||
| // Add the TLS secret as a projected volume to the pod. | ||
| podTemplate.Spec.Volumes = append(podTemplate.Spec.Volumes, corev1.Volume{ | ||
| Name: utils.RayTLSVolumeName, | ||
| VolumeSource: corev1.VolumeSource{ | ||
| Secret: &corev1.SecretVolumeSource{ | ||
| SecretName: secretName, | ||
| }, | ||
| }, | ||
| }) | ||
|
|
||
| // Inject env vars and volume mount into the Ray container. | ||
| SetContainerTLSConfig(&podTemplate.Spec.Containers[utils.RayContainerIndex]) | ||
|
|
||
| // Inject into init containers that need TLS (e.g. wait-gcs-ready). | ||
| for i := range podTemplate.Spec.InitContainers { | ||
| SetContainerTLSConfig(&podTemplate.Spec.InitContainers[i]) | ||
| } | ||
| } | ||
|
|
||
| // SetContainerTLSConfig adds TLS environment variables and volume mount to a container. | ||
| // This is exported so it can be reused by RayJob submitter containers. | ||
| func SetContainerTLSConfig(container *corev1.Container) { | ||
| container.Env = append(container.Env, | ||
| corev1.EnvVar{Name: utils.RAY_USE_TLS, Value: "1"}, | ||
| corev1.EnvVar{Name: utils.RAY_TLS_SERVER_CERT, Value: utils.RayTLSCertMountPath + "/tls.crt"}, | ||
| corev1.EnvVar{Name: utils.RAY_TLS_SERVER_KEY, Value: utils.RayTLSCertMountPath + "/tls.key"}, | ||
| corev1.EnvVar{Name: utils.RAY_TLS_CA_CERT, Value: utils.RayTLSCertMountPath + "/ca.crt"}, | ||
| ) | ||
|
|
||
| container.VolumeMounts = append(container.VolumeMounts, corev1.VolumeMount{ | ||
| Name: utils.RayTLSVolumeName, | ||
| MountPath: utils.RayTLSCertMountPath, | ||
| ReadOnly: true, | ||
| }) | ||
| } |
There was a problem hiding this comment.
Prevent duplicate TLS volume/env injection
At Line [304], Line [325], and Line [332], TLS artifacts are appended unconditionally. If a template already defines ray-tls or TLS env vars/mounts, this can produce duplicate entries (including invalid pod specs for duplicate volume names).
💡 Proposed fix
func configureTLS(podTemplate *corev1.PodTemplateSpec, instance rayv1.RayCluster, rayNodeType rayv1.RayNodeType) {
if !utils.IsTLSEnabled(&instance.Spec) {
return
}
@@
- podTemplate.Spec.Volumes = append(podTemplate.Spec.Volumes, corev1.Volume{
- Name: utils.RayTLSVolumeName,
- VolumeSource: corev1.VolumeSource{
- Secret: &corev1.SecretVolumeSource{
- SecretName: secretName,
- },
- },
- })
+ if !checkIfVolumeExists(&corev1.Pod{Spec: podTemplate.Spec}, utils.RayTLSVolumeName) {
+ podTemplate.Spec.Volumes = append(podTemplate.Spec.Volumes, corev1.Volume{
+ Name: utils.RayTLSVolumeName,
+ VolumeSource: corev1.VolumeSource{
+ Secret: &corev1.SecretVolumeSource{
+ SecretName: secretName,
+ },
+ },
+ })
+ }
@@
func SetContainerTLSConfig(container *corev1.Container) {
- container.Env = append(container.Env,
- corev1.EnvVar{Name: utils.RAY_USE_TLS, Value: "1"},
- corev1.EnvVar{Name: utils.RAY_TLS_SERVER_CERT, Value: utils.RayTLSCertMountPath + "/tls.crt"},
- corev1.EnvVar{Name: utils.RAY_TLS_SERVER_KEY, Value: utils.RayTLSCertMountPath + "/tls.key"},
- corev1.EnvVar{Name: utils.RAY_TLS_CA_CERT, Value: utils.RayTLSCertMountPath + "/ca.crt"},
- )
+ upsert := func(name, value string) {
+ for i := range container.Env {
+ if container.Env[i].Name == name {
+ container.Env[i].Value = value
+ container.Env[i].ValueFrom = nil
+ return
+ }
+ }
+ container.Env = append(container.Env, corev1.EnvVar{Name: name, Value: value})
+ }
+ upsert(utils.RAY_USE_TLS, "1")
+ upsert(utils.RAY_TLS_SERVER_CERT, utils.RayTLSCertMountPath+"/tls.crt")
+ upsert(utils.RAY_TLS_SERVER_KEY, utils.RayTLSCertMountPath+"/tls.key")
+ upsert(utils.RAY_TLS_CA_CERT, utils.RayTLSCertMountPath+"/ca.crt")
- container.VolumeMounts = append(container.VolumeMounts, corev1.VolumeMount{
- Name: utils.RayTLSVolumeName,
- MountPath: utils.RayTLSCertMountPath,
- ReadOnly: true,
- })
+ for _, vm := range container.VolumeMounts {
+ if vm.Name == utils.RayTLSVolumeName || vm.MountPath == utils.RayTLSCertMountPath {
+ return
+ }
+ }
+ container.VolumeMounts = append(container.VolumeMounts, corev1.VolumeMount{
+ Name: utils.RayTLSVolumeName,
+ MountPath: utils.RayTLSCertMountPath,
+ ReadOnly: true,
+ })
}🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@ray-operator/controllers/ray/common/pod.go` around lines 293 - 337, The TLS
volume and environment variables are being appended unconditionally in
configureTLS and SetContainerTLSConfig which can create duplicate
volumes/envs/mounts; before appending, check podTemplate.Spec.Volumes for an
existing volume with Name == utils.RayTLSVolumeName and only add it if missing,
and in SetContainerTLSConfig check the container.Env for existing names
(utils.RAY_USE_TLS, utils.RAY_TLS_SERVER_CERT, utils.RAY_TLS_SERVER_KEY,
utils.RAY_TLS_CA_CERT) and the container.VolumeMounts for a mount with Name ==
utils.RayTLSVolumeName (or MountPath == utils.RayTLSCertMountPath) and only
append the missing env vars and the volume mount; apply these guards when
modifying podTemplate.Spec.Containers[utils.RayContainerIndex] and each
podTemplate.Spec.InitContainers[i] to avoid duplicate entries.
| steps := []struct { | ||
| fn func(context.Context, *rayv1.RayCluster) error | ||
| name string | ||
| }{ | ||
| {"self-signed issuer", r.reconcileSelfSignedIssuer}, | ||
| {"CA certificate", r.reconcileCACertificate}, | ||
| {"CA issuer", r.reconcileCAIssuer}, | ||
| {"head certificate", r.reconcileHeadCertificate}, | ||
| {"worker certificate", r.reconcileWorkerCertificate}, |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Read the file and examine the struct declaration and initializers around lines 39-47
cat -n ray-operator/controllers/ray/raycluster_controller_tls.go | sed -n '35,55p'Repository: opendatahub-io/kuberay
Length of output: 863
🏁 Script executed:
#!/bin/bash
# Check the function signatures for reconcile methods
rg "func \(r \*.*RayClusterTLSReconciler\) reconcile(SelfSignedIssuer|CACertificate|CAIssuer|HeadCertificate|WorkerCertificate)" ray-operator/controllers/ray/raycluster_controller_tls.go -A 1Repository: opendatahub-io/kuberay
Length of output: 48
🏁 Script executed:
#!/bin/bash
# Search for the reconcile function definitions
rg "func \(r" ray-operator/controllers/ray/raycluster_controller_tls.go | grep -i reconcileRepository: opendatahub-io/kuberay
Length of output: 832
Fix struct literal field ordering — code does not compile.
The steps struct is declared with fields in order fn (function type), then name (string type), but the literals at lines 43–47 provide values in reverse order: {string, function}. This causes a type mismatch where strings are assigned to the fn field and functions to the name field. Reorder each literal to {function_reference, "string"} to match the declared field order.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@ray-operator/controllers/ray/raycluster_controller_tls.go` around lines 39 -
47, The composite literal for steps uses the struct type with fields ordered fn
then name, but the entries are written as {"self-signed issuer",
r.reconcileSelfSignedIssuer} etc.; fix each entry in the steps slice so the
function value comes first and the description string second (e.g.,
r.reconcileSelfSignedIssuer, "self-signed issuer") for all entries including
r.reconcileCACertificate, r.reconcileCAIssuer, r.reconcileHeadCertificate and
r.reconcileWorkerCertificate so the values match the declared fields.
| if err := r.Get(ctx, client.ObjectKey{Name: issuerName, Namespace: instance.Namespace}, issuer); err == nil { | ||
| return nil // already exists | ||
| } else if !errors.IsNotFound(err) { | ||
| return err |
There was a problem hiding this comment.
TLS resources are create-only; drift and spec changes won’t reconcile.
Each reconcile helper exits immediately when the resource exists, so updates to desired state (e.g., TLS mode/usages/SAN set changes) are never applied. This breaks idempotent reconciliation and can leave stale cert specs in place.
Also applies to: 117-120, 167-170, 200-203, 253-256
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@ray-operator/controllers/ray/raycluster_controller_tls.go` around lines 87 -
90, The helpers currently bail out when a resource already exists (e.g., the
Issuer variable check using r.Get with issuerName, the Certificate creation
block, and Secret/ClusterIssuer blocks), preventing spec drift reconciliation;
change each helper to: r.Get the resource, if NotFound then Create it, otherwise
compare the existing object's spec/labels/annotations/ownerReferences to the
desired values and call r.Update or client.Patch to apply diffs (preserving
cluster-generated fields), ensuring you set owner references (ownerRef) and
preserve immutable fields where needed; replace the early "return nil" on
existence with this fetch-compare-update flow for the issuer, certificate, and
secret reconcile blocks (the code regions around issuerName, certificateName,
and secretName).
| dnsNames = append(dnsNames, | ||
| fmt.Sprintf("*.%s.svc", instance.Namespace), | ||
| fmt.Sprintf("*.%s.svc.cluster.local", instance.Namespace), | ||
| ) |
There was a problem hiding this comment.
Worker certificate SAN scope is overly broad for namespace security boundaries.
Adding *.${namespace}.svc and *.${namespace}.svc.cluster.local grants the worker cert identity coverage across essentially all service names in the namespace. That is broader than Ray-only identities and weakens least-privilege TLS posture.
| // IsTLSEnabled returns whether TLS/mTLS is enabled for the RayCluster. | ||
| func IsTLSEnabled(spec *rayv1.RayClusterSpec) bool { | ||
| return spec.TLSOptions != nil | ||
| } | ||
|
|
||
| // GetTLSSecretName returns the secret name to mount for the given node type. | ||
| // For BYOC mode, it returns the user-specified SecretName for all node types. | ||
| // For auto-generate mode, it returns the per-node-type secret name. | ||
| func GetTLSSecretName(clusterName string, spec *rayv1.RayClusterSpec, nodeType rayv1.RayNodeType) string { | ||
| if spec.TLSOptions != nil && spec.TLSOptions.SecretName != nil && *spec.TLSOptions.SecretName != "" { | ||
| return *spec.TLSOptions.SecretName | ||
| } | ||
| if nodeType == rayv1.HeadNode { | ||
| return fmt.Sprintf("%s-%s", RayHeadSecretPrefix, clusterName) | ||
| } | ||
| return fmt.Sprintf("%s-%s", RayWorkerSecretPrefix, clusterName) | ||
| } |
There was a problem hiding this comment.
Guard TLS helpers against nil spec pointers
At Line [737] and Line [744], dereferencing spec without a nil check can panic if callers pass nil.
💡 Proposed fix
func IsTLSEnabled(spec *rayv1.RayClusterSpec) bool {
- return spec.TLSOptions != nil
+ return spec != nil && spec.TLSOptions != nil
}
@@
func GetTLSSecretName(clusterName string, spec *rayv1.RayClusterSpec, nodeType rayv1.RayNodeType) string {
- if spec.TLSOptions != nil && spec.TLSOptions.SecretName != nil && *spec.TLSOptions.SecretName != "" {
+ if spec != nil && spec.TLSOptions != nil && spec.TLSOptions.SecretName != nil && *spec.TLSOptions.SecretName != "" {
return *spec.TLSOptions.SecretName
}📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| // IsTLSEnabled returns whether TLS/mTLS is enabled for the RayCluster. | |
| func IsTLSEnabled(spec *rayv1.RayClusterSpec) bool { | |
| return spec.TLSOptions != nil | |
| } | |
| // GetTLSSecretName returns the secret name to mount for the given node type. | |
| // For BYOC mode, it returns the user-specified SecretName for all node types. | |
| // For auto-generate mode, it returns the per-node-type secret name. | |
| func GetTLSSecretName(clusterName string, spec *rayv1.RayClusterSpec, nodeType rayv1.RayNodeType) string { | |
| if spec.TLSOptions != nil && spec.TLSOptions.SecretName != nil && *spec.TLSOptions.SecretName != "" { | |
| return *spec.TLSOptions.SecretName | |
| } | |
| if nodeType == rayv1.HeadNode { | |
| return fmt.Sprintf("%s-%s", RayHeadSecretPrefix, clusterName) | |
| } | |
| return fmt.Sprintf("%s-%s", RayWorkerSecretPrefix, clusterName) | |
| } | |
| // IsTLSEnabled returns whether TLS/mTLS is enabled for the RayCluster. | |
| func IsTLSEnabled(spec *rayv1.RayClusterSpec) bool { | |
| return spec != nil && spec.TLSOptions != nil | |
| } | |
| // GetTLSSecretName returns the secret name to mount for the given node type. | |
| // For BYOC mode, it returns the user-specified SecretName for all node types. | |
| // For auto-generate mode, it returns the per-node-type secret name. | |
| func GetTLSSecretName(clusterName string, spec *rayv1.RayClusterSpec, nodeType rayv1.RayNodeType) string { | |
| if spec != nil && spec.TLSOptions != nil && spec.TLSOptions.SecretName != nil && *spec.TLSOptions.SecretName != "" { | |
| return *spec.TLSOptions.SecretName | |
| } | |
| if nodeType == rayv1.HeadNode { | |
| return fmt.Sprintf("%s-%s", RayHeadSecretPrefix, clusterName) | |
| } | |
| return fmt.Sprintf("%s-%s", RayWorkerSecretPrefix, clusterName) | |
| } |
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@ray-operator/controllers/ray/utils/util.go` around lines 735 - 751,
IsTLSEnabled and GetTLSSecretName currently dereference spec without nil checks
which can panic; update IsTLSEnabled(rayv1.RayClusterSpec) to return false if
spec == nil or spec.TLSOptions == nil, and update GetTLSSecretName(clusterName,
spec, nodeType) to safely handle spec == nil (and spec.TLSOptions/SecretName ==
nil or empty) by falling back to the auto-generated secret names (use
RayHeadSecretPrefix or RayWorkerSecretPrefix with clusterName) instead of
dereferencing a nil pointer; ensure all existing branches still return the
user-provided SecretName when present.
033c67b to
466af71
Compare
Why are these changes needed?
Related issue number
Checks
Summary by CodeRabbit
New Features
Tests