Skip to content

feat(operator): add custom probes in configmaps with override option#500

Open
mvasilenko wants to merge 5 commits intodragonflydb:mainfrom
mvasilenko:feat/custom-probe-scripts
Open

feat(operator): add custom probes in configmaps with override option#500
mvasilenko wants to merge 5 commits intodragonflydb:mainfrom
mvasilenko:feat/custom-probe-scripts

Conversation

@mvasilenko
Copy link
Copy Markdown
Contributor

@mvasilenko mvasilenko commented Mar 29, 2026

Problem:

Currently CRD doesn't provide a way to customise pod readiness, liveness checks and startup probes - they are hardcoded in the operator binary.

Following suggestions #401 (comment) and dragonflydb/dragonfly#5921 (comment) opening PR to introduce custom pod readiness, liveness checks and startup probes.

Proposed solution:

  • 3 ConfigMaps per Dragonfly instance with the default probe scripts, mounted into pods at /etc/dragonfly/probes/
  • 3 new optional CR fields — customLivenessProbeConfigMap, customReadinessProbeConfigMap,
    customStartupProbeConfigMap
  • default scripts are embedded into the operator binary //go:embed scripts/*.sh and written into the generated
    ConfigMaps at reconcile time
  • RBAC ClusterRole updated to include configmaps CRUD verbs

Example:

custom liveness probe with connect timeout

apiVersion: v1
kind: ConfigMap
metadata:
  name: dragonfly-sample-probes
  namespace: default
data:
  liveness-check.sh: |
    #!/bin/sh
    RESPONSE=$(timeout 4 redis-cli -h localhost -p ${HEALTHCHECK_PORT:-9999} PING 2>/dev/null)
    case "$RESPONSE" in
      PONG|*LOADING*) exit 0 ;;
      *)              exit 1 ;;
    esac
---
apiVersion: dragonflydb.io/v1alpha1
kind: Dragonfly
metadata:
  name: dragonfly-sample
spec:
  replicas: 1
  customLivenessProbeConfigMap:
    name: dragonfly-sample-probes

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds support for customizable liveness/readiness/startup probe scripts for Dragonfly pods by generating default script ConfigMaps per instance and allowing users to override each probe via new CR fields.

Changes:

  • Introduces three new optional CRD fields to reference custom probe ConfigMaps (liveness/readiness/startup).
  • Embeds default probe scripts in the operator and generates per-Dragonfly ConfigMaps, mounting them into pods and wiring probes to execute those scripts (adds a default StartupProbe).
  • Expands RBAC to allow ConfigMap CRUD and updates unit/e2e tests accordingly.

Reviewed changes

Copilot reviewed 16 out of 17 changed files in this pull request and generated 6 comments.

Show a summary per file
File Description
api/v1alpha1/dragonfly_types.go Adds new custom*ProbeConfigMap fields to the Dragonfly spec.
api/v1alpha1/zz_generated.deepcopy.go Updates deepcopy generation for the new spec fields.
internal/resources/const.go Adds constants for probe ConfigMap names, script keys, mount paths, and volume names.
internal/resources/scripts.go Embeds the default probe scripts into the operator binary.
internal/resources/scripts/*.sh Adds default liveness/readiness/startup shell scripts.
internal/resources/resources.go Generates probe ConfigMaps, mounts them, and updates probes to execute mounted scripts; adds StartupProbe and resolves probe port from args.
internal/resources/resources_test.go Adds/extends unit tests covering port resolution, generated ConfigMaps, and probe volume/mount wiring.
internal/controller/dragonfly_controller.go Updates kubebuilder RBAC markers to include ConfigMaps.
config/rbac/role.yaml Adds ConfigMaps permissions to manager role.
manifests/dragonfly-operator.yaml Updates rendered CRD schema and RBAC to include new fields and ConfigMap permissions.
manifests/crd.yaml / config/crd/bases/dragonflydb.io_dragonflies.yaml Updates rendered CRD schemas with the new optional fields.
e2e/util.go Adds a helper to wait for master pod existence.
e2e/dragonfly_controller_test.go Adjusts lifecycle waits to use PhaseReady and adds explicit master-election waiting in one scenario.
internal/controller/util_test.go Minor formatting alignment in a test struct.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread internal/resources/resources.go Outdated
Comment on lines +445 to +453
if df.Spec.CustomLivenessProbeConfigMap != nil {
livenessConfigMapName = df.Spec.CustomLivenessProbeConfigMap.Name
}
readinessConfigMapName := fmt.Sprintf("%s-%s", df.Name, ReadinessProbeConfigMapSuffix)
if df.Spec.CustomReadinessProbeConfigMap != nil {
readinessConfigMapName = df.Spec.CustomReadinessProbeConfigMap.Name
}
startupConfigMapName := fmt.Sprintf("%s-%s", df.Name, StartupProbeConfigMapSuffix)
if df.Spec.CustomStartupProbeConfigMap != nil {
Copy link

Copilot AI Mar 31, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When a custom probe ConfigMap reference is provided but name is empty (allowed by the CRD default/compatibility), this code will set the mounted ConfigMap name to an empty string, producing an invalid Pod spec (and the pod will stay Pending with a ConfigMap name "" not found). Guard the override with ref != nil && ref.Name != "" (or validate and surface a clear reconcile error/event) and fall back to the generated default name otherwise. Apply the same check for liveness/readiness/startup refs.

Suggested change
if df.Spec.CustomLivenessProbeConfigMap != nil {
livenessConfigMapName = df.Spec.CustomLivenessProbeConfigMap.Name
}
readinessConfigMapName := fmt.Sprintf("%s-%s", df.Name, ReadinessProbeConfigMapSuffix)
if df.Spec.CustomReadinessProbeConfigMap != nil {
readinessConfigMapName = df.Spec.CustomReadinessProbeConfigMap.Name
}
startupConfigMapName := fmt.Sprintf("%s-%s", df.Name, StartupProbeConfigMapSuffix)
if df.Spec.CustomStartupProbeConfigMap != nil {
if df.Spec.CustomLivenessProbeConfigMap != nil && df.Spec.CustomLivenessProbeConfigMap.Name != "" {
livenessConfigMapName = df.Spec.CustomLivenessProbeConfigMap.Name
}
readinessConfigMapName := fmt.Sprintf("%s-%s", df.Name, ReadinessProbeConfigMapSuffix)
if df.Spec.CustomReadinessProbeConfigMap != nil && df.Spec.CustomReadinessProbeConfigMap.Name != "" {
readinessConfigMapName = df.Spec.CustomReadinessProbeConfigMap.Name
}
startupConfigMapName := fmt.Sprintf("%s-%s", df.Name, StartupProbeConfigMapSuffix)
if df.Spec.CustomStartupProbeConfigMap != nil && df.Spec.CustomStartupProbeConfigMap.Name != "" {

Copilot uses AI. Check for mistakes.
Comment thread internal/resources/resources.go Outdated
Comment on lines +486 to +490
corev1.VolumeMount{
Name: LivenessProbeVolumeName,
MountPath: ProbeMountPath + "/" + LivenessScriptKey,
SubPath: LivenessScriptKey,
},
Copy link

Copilot AI Mar 31, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These probe scripts are mounted using SubPath. Kubernetes does not propagate ConfigMap updates into subPath mounts, so changing the generated/custom ConfigMap will not update the probe script inside already-running pods (until they restart). If you want ConfigMap updates to take effect, consider using a single projected volume (or mount each ConfigMap as a directory) and reference the scripts from that directory without SubPath.

Copilot uses AI. Check for mistakes.
Comment thread internal/resources/resources.go Outdated
Comment on lines +175 to +179
FailureThreshold: 60,
InitialDelaySeconds: 0,
PeriodSeconds: 5,
SuccessThreshold: 1,
TimeoutSeconds: 5,
Copy link

Copilot AI Mar 31, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The default StartupProbe here runs a script that currently fails while the server reports LOADING (see startup-check.sh), and with FailureThreshold=60 + PeriodSeconds=5 the pod gets at most ~5 minutes before kubelet restarts it. If snapshot/dataset loading can exceed this, pods may enter a restart loop. Consider increasing the allowed startup window substantially and/or changing the default startup script semantics so LOADING doesn't fail startup (leaving readiness to gate traffic).

Copilot uses AI. Check for mistakes.
Comment thread internal/resources/resources_test.go Outdated
Comment on lines +248 to +263
func TestresolveDragonflyPort_Default(t *testing.T) {
assert.Equal(t, int32(DragonflyPort), resolveDragonflyPort(nil))
assert.Equal(t, int32(DragonflyPort), resolveDragonflyPort([]string{}))
assert.Equal(t, int32(DragonflyPort), resolveDragonflyPort([]string{"--alsologtostderr"}))
}

func TestresolveDragonflyPort_CustomPort(t *testing.T) {
assert.Equal(t, int32(6380), resolveDragonflyPort([]string{"--port=6380"}))
}

func TestresolveDragonflyPort_InvalidPort(t *testing.T) {
// invalid value falls back to default
assert.Equal(t, int32(DragonflyPort), resolveDragonflyPort([]string{"--port=notanumber"}))
}

func TestresolveDragonflyPort_OutOfRange(t *testing.T) {
Copy link

Copilot AI Mar 31, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These functions won’t run as Go tests because the name after Test must start with an uppercase letter (^Test[A-Z]). Rename to TestResolveDragonflyPort_* (capital R) so the resolveDragonflyPort behavior is actually exercised in CI.

Suggested change
func TestresolveDragonflyPort_Default(t *testing.T) {
assert.Equal(t, int32(DragonflyPort), resolveDragonflyPort(nil))
assert.Equal(t, int32(DragonflyPort), resolveDragonflyPort([]string{}))
assert.Equal(t, int32(DragonflyPort), resolveDragonflyPort([]string{"--alsologtostderr"}))
}
func TestresolveDragonflyPort_CustomPort(t *testing.T) {
assert.Equal(t, int32(6380), resolveDragonflyPort([]string{"--port=6380"}))
}
func TestresolveDragonflyPort_InvalidPort(t *testing.T) {
// invalid value falls back to default
assert.Equal(t, int32(DragonflyPort), resolveDragonflyPort([]string{"--port=notanumber"}))
}
func TestresolveDragonflyPort_OutOfRange(t *testing.T) {
func TestResolveDragonflyPort_Default(t *testing.T) {
assert.Equal(t, int32(DragonflyPort), resolveDragonflyPort(nil))
assert.Equal(t, int32(DragonflyPort), resolveDragonflyPort([]string{}))
assert.Equal(t, int32(DragonflyPort), resolveDragonflyPort([]string{"--alsologtostderr"}))
}
func TestResolveDragonflyPort_CustomPort(t *testing.T) {
assert.Equal(t, int32(6380), resolveDragonflyPort([]string{"--port=6380"}))
}
func TestResolveDragonflyPort_InvalidPort(t *testing.T) {
// invalid value falls back to default
assert.Equal(t, int32(DragonflyPort), resolveDragonflyPort([]string{"--port=notanumber"}))
}
func TestResolveDragonflyPort_OutOfRange(t *testing.T) {

Copilot uses AI. Check for mistakes.
Comment thread e2e/util.go
Comment on lines +72 to +88
func waitForMasterPod(ctx context.Context, c client.Client, name, namespace string, maxDuration time.Duration) error {
ctx, cancel := context.WithTimeout(ctx, maxDuration)
defer cancel()
for {
select {
case <-ctx.Done():
return fmt.Errorf("timed out waiting for master pod for %s", name)
default:
var pods corev1.PodList
if err := c.List(ctx, &pods, client.InNamespace(namespace), client.MatchingLabels{
resources.DragonflyNameLabelKey: name,
resources.RoleLabelKey: resources.Master,
}); err == nil && len(pods.Items) > 0 {
return nil
}
}
}
Copy link

Copilot AI Mar 31, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

waitForMasterPod busy-loops with no sleep/backoff and ignores list errors, which can peg CPU and hammer the API server (especially when the list call is failing). Add a small polling interval (e.g. time.Sleep or wait.PollUntilContextTimeout) and handle/return errors appropriately (or at least log + backoff) while waiting.

Copilot uses AI. Check for mistakes.
Comment thread e2e/dragonfly_controller_test.go Outdated
Comment on lines +140 to +141
waitForDragonflyPhase(ctx, k8sClient, name, namespace, controller.PhaseReady, 3*time.Minute)
waitForStatefulSetReady(ctx, k8sClient, name, namespace, 3*time.Minute)
Copy link

Copilot AI Mar 31, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The return values from waitForDragonflyPhase / waitForStatefulSetReady are ignored here, so timeouts/errors won’t fail the test at the point they occur (leading to harder-to-debug downstream failures). Capture the returned error and Expect(err).To(BeNil()) / Expect(...).To(Succeed()), consistent with other call sites in this suite.

Suggested change
waitForDragonflyPhase(ctx, k8sClient, name, namespace, controller.PhaseReady, 3*time.Minute)
waitForStatefulSetReady(ctx, k8sClient, name, namespace, 3*time.Minute)
err = waitForDragonflyPhase(ctx, k8sClient, name, namespace, controller.PhaseReady, 3*time.Minute)
Expect(err).To(BeNil())
err = waitForStatefulSetReady(ctx, k8sClient, name, namespace, 3*time.Minute)
Expect(err).To(BeNil())

Copilot uses AI. Check for mistakes.
Copy link
Copy Markdown
Contributor

@miledxz miledxz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall looks good, left few comments @mvasilenko please take a look when you have time,

sorry for later reply, was over busy this week :)

Comment thread internal/resources/resources.go Outdated
Comment on lines +506 to +512
// ConfigMaps are appended before the StatefulSet so they exist when pods start
// and can mount the probe script volumes without getting stuck in Pending.
resources = append(resources,
generateProbeConfigMap(df, LivenessProbeConfigMapSuffix, LivenessScriptKey, defaultLivenessScript),
generateProbeConfigMap(df, ReadinessProbeConfigMapSuffix, ReadinessScriptKey, defaultReadinessScript),
generateProbeConfigMap(df, StartupProbeConfigMapSuffix, StartupScriptKey, defaultStartupScript),
)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These ConfigMaps are the first resources without a .Spec field to enter the reconcile loop. The existing resourceSpecsEqual() in dragonfly_instance.go compares objects by FieldByName("Spec") https://github.com/dragonflydb/dragonfly-operator/blob/main/internal/controller/dragonfly_instance.go#L742 — when that field doesn't exist (as with ConfigMaps, which store content in .Data), it returns true ("equal") and skips the update. The patch path has the same issue.

Impact on that is next,
if the operator ships updated probe scripts in a future release, existing ConfigMaps in the cluster will never be patched. Users would have to manually delete them.

This is a pre-existing gap in the reconciler, but this PR is the first to hit it.

I suggest fixing resourceSpecsEqual and the patch block in reconcileResources to handle ConfigMap .Data before merging — otherwise probe script upgrades will silently fail on existing clusters.

Suggestion:

In resourceSpecsEqual in dragonfly_instance.go, add before the FieldByName("Spec") block:

if cmDesired, ok := desired.(*corev1.ConfigMap); ok {
   if cmExisting, ok := existing.(*corev1.ConfigMap); ok {
         return reflect.DeepEqual(cmDesired.Data, cmExisting.Data)
     }
 }

or something similar,

WDYT @mvasilenko ?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also,

these three ConfigMaps are always generated even when the user has set customLivenessProbeConfigMap / customReadinessProbeConfigMap / customStartupProbeConfigMap.

The custom override only changes the volume reference (lines 436-447), but the default ConfigMaps are still created, sit unused in the namespace, and get reconciled every loop.

My suggestion — skip generating the default when a custom override is provided:

 if df.Spec.CustomLivenessProbeConfigMap == nil {
     resources = append(resources, generateProbeConfigMap(df, LivenessProbeConfigMapSuffix, LivenessScriptKey, defaultLivenessScript))
 }
 if df.Spec.CustomReadinessProbeConfigMap == nil {
     resources = append(resources, generateProbeConfigMap(df, ReadinessProbeConfigMapSuffix, ReadinessScriptKey, defaultReadinessScript))
 }
 if df.Spec.CustomStartupProbeConfigMap == nil {
     resources = append(resources, generateProbeConfigMap(df, StartupProbeConfigMapSuffix, StartupScriptKey, defaultStartupScript))
 }

Comment on lines -105 to -107
Env: append(df.Spec.Env, corev1.EnvVar{
Name: "HEALTHCHECK_PORT",
Value: fmt.Sprintf("%d", DragonflyAdminPort),
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

HEALTHCHECK_PORT was removed from the container env here, but the PR description's example for custom probes still references ${HEALTHCHECK_PORT:-6379}. Anyone following the example will have their custom probes silently hit port 6379 instead of the admin port 9999 — which defeats the purpose on TLS clusters.

we can re-add HEALTHCHECK_PORT=9999 as an env var so custom scripts can reference it portably,
it is more user-friendly — it gives custom script authors a stable abstraction rather than requiring them to know internal port numbers

HOST="localhost"
PORT=9999 # Dragonfly admin port — always plain-text, not user-configurable

RESPONSE=$(redis-cli -h "$HOST" -p "$PORT" --no-auth-warning INFO persistence 2>/dev/null)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Liveness and startup scripts pass ${DFLY_requirepass:+-a "$DFLY_requirepass"} to redis-cli, but the readiness script does not:

readiness (no auth)

RESPONSE=$(redis-cli -h "$HOST" -p "$PORT" --no-auth-warning INFO persistence 2>/dev/null)

liveness (has auth)

RESPONSE=$(redis-cli -h "$HOST" -p "$PORT" --no-auth-warning
${DFLY_requirepass:+-a "$DFLY_requirepass"} PING 2>/dev/null)

Even if the admin port currently doesn't require auth, the scripts should be consistent. If admin port auth behavior ever changes, only the readiness probe would break. I suggest adding the same ${DFLY_requirepass:+-a "$DFLY_requirepass"} to the readiness script as well.

Comment thread internal/resources/resources.go Outdated
Comment on lines +499 to +501
// Note: AdditionalVolumes takes precedence — a volume named "liveness-probe",
// "readiness-probe", or "startup-probe" in spec.additionalVolumes will override
// the operator-generated probe volume for that slot.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The comment here correctly notes that additionalVolumes can override the probe volumes by matching the volume name. Combined with the new customXxxProbeConfigMap API fields, users now have two different ways to override probe scripts, and nothing prevents them from using both at once (which could conflict).

maybe worth considering of adding a validation webhook that rejects the CR if both mechanisms target the same probe, or just documenting clearly in the CRD field descriptions which mechanism takes precedence.

Comment thread internal/resources/resources.go Outdated
Comment on lines +129 to +137
// Probe semantics during dataset LOADING (large snapshot restore):
// StartupProbe — succeeds on any PING response (PONG or LOADING); prevents
// liveness from firing before the process is up.
// LivenessProbe — succeeds on any PING response (PONG or LOADING); must NOT
// restart a pod that is mid-restore, as that aborts the load
// and creates a crash loop (see issues #426, #508).
// ReadinessProbe — fails on LOADING; gates traffic until the dataset is fully
// loaded and Dragonfly can serve requests.
ReadinessProbe: &corev1.Probe{
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mvasilenko worth running go fmt for this file

@mvasilenko mvasilenko force-pushed the feat/custom-probe-scripts branch 2 times, most recently from 9839739 to ab7f485 Compare April 3, 2026 09:48
@mvasilenko
Copy link
Copy Markdown
Contributor Author

@miledxz addressed comments, appreciate re-review.

Thanks for your patience — this is my first contribution

@mvasilenko mvasilenko requested a review from miledxz April 3, 2026 11:26
@Abhra303
Copy link
Copy Markdown
Contributor

Abhra303 commented Apr 8, 2026

Hey, can you fix the conflicts please?

@mvasilenko
Copy link
Copy Markdown
Contributor Author

updated branch with new main, could you check please?

Copy link
Copy Markdown
Contributor

@Abhra303 Abhra303 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey, see some comments. Looks OK overall. But my main concern is when the operator is upgraded, all existing resources will be updated immediately as the operator reconciler will see the configmaps are missing and statefulset has different probe than desired. This is somewhat breaking the compatibility.

Comment thread e2e/dragonfly_controller_test.go Outdated
waitForDragonflyPhase(ctx, k8sClient, name, namespace, controller.PhaseResourcesCreated, 2*time.Minute)
waitForStatefulSetReady(ctx, k8sClient, name, namespace, 2*time.Minute)

// Wait for master election, then for all replicas to be Ready
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

these e2e fix is already merged. Please update the PR with the latest main branch.

Comment thread e2e/util.go
// waitForMasterPod polls until at least one pod with role=master exists. Use this
// after waitForStatefulSetReady to guarantee the lifecycle controller has finished
// master election before the test tries to connect.
func waitForMasterPod(ctx context.Context, c client.Client, name, namespace string, maxDuration time.Duration) error {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

these are merged. Please update the PR.

Comment thread internal/resources/resources.go Outdated
dflyUserGroup int64 = 999
)

func generateProbeConfigMap(df *resourcesv1.Dragonfly, suffix, key, script string) *corev1.ConfigMap {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pass configmap name instead of suffix and assign it to Name under ObjectMeta.

Comment thread internal/resources/resources.go Outdated
"/bin/sh",
"/usr/local/bin/healthcheck.sh",
},
Command: []string{"/bin/sh", ProbeMountPath + "/" + ReadinessScriptKey},
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of concatenating strings every time, maybe use a helper function that accepts probe file name.

Comment thread internal/resources/resources.go Outdated
"/bin/sh",
"/usr/local/bin/healthcheck.sh",
},
Command: []string{"/bin/sh", ProbeMountPath + "/" + LivenessScriptKey},
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of concatenating strings every time, maybe use a helper function that accepts probe file name.

Comment thread internal/resources/resources.go Outdated
Comment on lines +431 to +490
// Probe script volumes — the corresponding ConfigMaps are generated and appended to resources below.
livenessConfigMapName := fmt.Sprintf("%s-%s", df.Name, LivenessProbeConfigMapSuffix)
if df.Spec.CustomLivenessProbeConfigMap != nil && df.Spec.CustomLivenessProbeConfigMap.Name != "" {
livenessConfigMapName = df.Spec.CustomLivenessProbeConfigMap.Name
}
readinessConfigMapName := fmt.Sprintf("%s-%s", df.Name, ReadinessProbeConfigMapSuffix)
if df.Spec.CustomReadinessProbeConfigMap != nil && df.Spec.CustomReadinessProbeConfigMap.Name != "" {
readinessConfigMapName = df.Spec.CustomReadinessProbeConfigMap.Name
}
startupConfigMapName := fmt.Sprintf("%s-%s", df.Name, StartupProbeConfigMapSuffix)
if df.Spec.CustomStartupProbeConfigMap != nil && df.Spec.CustomStartupProbeConfigMap.Name != "" {
startupConfigMapName = df.Spec.CustomStartupProbeConfigMap.Name
}

statefulset.Spec.Template.Spec.Volumes = append(statefulset.Spec.Template.Spec.Volumes,
corev1.Volume{
Name: LivenessProbeVolumeName,
VolumeSource: corev1.VolumeSource{
ConfigMap: &corev1.ConfigMapVolumeSource{
LocalObjectReference: corev1.LocalObjectReference{Name: livenessConfigMapName},
},
},
},
corev1.Volume{
Name: ReadinessProbeVolumeName,
VolumeSource: corev1.VolumeSource{
ConfigMap: &corev1.ConfigMapVolumeSource{
LocalObjectReference: corev1.LocalObjectReference{Name: readinessConfigMapName},
},
},
},
corev1.Volume{
Name: StartupProbeVolumeName,
VolumeSource: corev1.VolumeSource{
ConfigMap: &corev1.ConfigMapVolumeSource{
LocalObjectReference: corev1.LocalObjectReference{Name: startupConfigMapName},
},
},
},
)

statefulset.Spec.Template.Spec.Containers[0].VolumeMounts = append(
statefulset.Spec.Template.Spec.Containers[0].VolumeMounts,
corev1.VolumeMount{
Name: LivenessProbeVolumeName,
MountPath: ProbeMountPath + "/" + LivenessScriptKey,
SubPath: LivenessScriptKey,
},
corev1.VolumeMount{
Name: ReadinessProbeVolumeName,
MountPath: ProbeMountPath + "/" + ReadinessScriptKey,
SubPath: ReadinessScriptKey,
},
corev1.VolumeMount{
Name: StartupProbeVolumeName,
MountPath: ProbeMountPath + "/" + StartupScriptKey,
SubPath: StartupScriptKey,
},
)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could you please move all this to a helper function. The function is getting too big.

Comment thread internal/resources/resources.go Outdated
// and can mount the probe script volumes without getting stuck in Pending.
// Skip generating a default when the user has pointed to their own ConfigMap.
if df.Spec.CustomLivenessProbeConfigMap == nil || df.Spec.CustomLivenessProbeConfigMap.Name == "" {
resources = append(resources, generateProbeConfigMap(df, LivenessProbeConfigMapSuffix, LivenessScriptKey, defaultLivenessScript))
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pass the exact name i.e. fmt.Sprintf("%s-%s,....) here.

Comment thread internal/resources/resources.go Outdated
resources = append(resources, generateProbeConfigMap(df, LivenessProbeConfigMapSuffix, LivenessScriptKey, defaultLivenessScript))
}
if df.Spec.CustomReadinessProbeConfigMap == nil || df.Spec.CustomReadinessProbeConfigMap.Name == "" {
resources = append(resources, generateProbeConfigMap(df, ReadinessProbeConfigMapSuffix, ReadinessScriptKey, defaultReadinessScript))
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pass the exact name here instead of the suffix.

Comment thread internal/resources/resources.go Outdated
resources = append(resources, generateProbeConfigMap(df, ReadinessProbeConfigMapSuffix, ReadinessScriptKey, defaultReadinessScript))
}
if df.Spec.CustomStartupProbeConfigMap == nil || df.Spec.CustomStartupProbeConfigMap.Name == "" {
resources = append(resources, generateProbeConfigMap(df, StartupProbeConfigMapSuffix, StartupScriptKey, defaultStartupScript))
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pass the exact name instead of the suffix.

//+kubebuilder:rbac:groups="",resources=services,verbs=get;list;watch;create;update;patch;delete
//+kubebuilder:rbac:groups="",resources=pods,verbs=get;list;watch;create;update;patch;delete
//+kubebuilder:rbac:groups="",resources=events,verbs=create;patch
//+kubebuilder:rbac:groups="",resources=configmaps,verbs=get;list;watch;create;update;patch;delete
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You also need to add .Owns(&corev1.ConfigMap{},...) in SetupWithManager function in the dragonfly_controller.go. Else, reconciler won't be triggered when configmaps are deleted/updated.

@mvasilenko
Copy link
Copy Markdown
Contributor Author

mvasilenko commented Apr 16, 2026

Thanks for the review, addressed comments.

As for the main concern, I've updated and tested the code - see gist, it should be no-op for existing users and only on CRD patch with opt-in for custom probes, pods would be recreated.

Another gap I found with current CM-based approach - on accidental probe's ConfigMap deletion, referenced in Dragonfly CR spec - the probes break and pods start failing.

So we could add finalizers like dragonflydb.io/probe-configmap-protection to user-provided probe CMs or webhooks for CMs deletion prevention, wdyt?

@mvasilenko mvasilenko requested a review from Abhra303 April 16, 2026 16:28
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Custom configmap-based health check probe scripts with override support

4 participants