Description
On GKE with Filestore CSI driver, we’re seeing intermittent provisioning/mount issues when the Dragonfly Operator manages the PVC via spec.snapshot.persistentVolumeClaimSpec. In many runs the Operator creates the PVC, but the PV is not created or the Pod mount phase fails with events claiming the PVC “already exists” — which is the very claim the Operator created.
To unblock production, we’d like the Operator to support referencing an existing PVC (pre-created and validated) via something like:
```yaml
spec:
  snapshot:
    existingPersistentVolumeClaimName: <name>
```
Right now the CRD only supports:
```yaml
spec:
  snapshot:
    persistentVolumeClaimSpec: { ... }
```
Environment
- Kubernetes (GKE) version: 1.33.2-gke.1240000
- Dragonfly Operator version / chart: dragonfly-operator-v1.1.11 / v1.1.11
- Dragonfly image version: docker.dragonflydb.io/dragonflydb/operator:v1.1.11
- Filestore CSI driver: enabled (using GKE-provided storage classes like `standard-rwx`/`enterprise-rwx` as applicable)
- Network / node pool: same cluster & node pool used for control tests
- StorageClass reclaim/binding mode tried: `WaitForFirstConsumer` (default) and `Immediate` — both hit the same behavior
What happens
- Apply a `Dragonfly` CR that defines `spec.snapshot.persistentVolumeClaimSpec` pointing to a Filestore-backed RWX StorageClass.
- The Operator creates the PVC (this succeeds only intermittently).
- Either:
  - No PV gets bound (the PVC stays Pending), or
  - The Pod mount fails with an event like "PVC already exists" (even though it is the same, just-created PVC).
- Even when the Operator does create the claim and the PV binds, the Pod may still fail to mount with the same "already exists" style error.
What we validated
- Pre-created PVC works: If we create a Filestore-backed PVC ourselves and mount it on a manual Pod, it binds & mounts fine (no firewall/networking issues). Same cluster, same node pool (see the illustrative manifests after this list).
- Binding mode: Using a StorageClass with `volumeBindingMode: Immediate` produces the same symptoms.
- Cluster/Filestore sanity: Filestore CSI works for other Pods in the cluster.
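For reference, this is roughly what the manual validation looked like: a pre-created Filestore-backed PVC plus a throwaway Pod that mounts it. The names and size are illustrative placeholders; the StorageClass matches the MRE below.

```yaml
# Illustrative manifests for the manual validation described above.
# Names and size are placeholders, not values from the Operator.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: df-snapshots-pvc
spec:
  accessModes: ["ReadWriteMany"]
  storageClassName: filestore-rwx
  resources:
    requests:
      storage: 50Gi
---
apiVersion: v1
kind: Pod
metadata:
  name: pvc-mount-check
spec:
  containers:
    - name: check
      image: busybox
      command: ["sh", "-c", "touch /data/ok && sleep 3600"]
      volumeMounts:
        - name: snapshots
          mountPath: /data
  volumes:
    - name: snapshots
      persistentVolumeClaim:
        claimName: df-snapshots-pvc
```

This kind of Pod binds and mounts the Filestore-backed claim without issues in the same cluster and node pool.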
Why this feature is needed
Because the Operator currently only accepts an embedded persistentVolumeClaimSpec, users cannot point to a known-good existing PVC as a workaround. Supporting an existingPersistentVolumeClaimName (mutually exclusive with persistentVolumeClaimSpec) would:
- Let users pre-provision, validate, and manage the lifecycle of storage independently.
- Unblock production when CSI provisioning is flaky or constrained by platform policies.
- Align with patterns used by other operators that allow both “spec to create” and “reference existing” modes.
Minimal reproducible example (MRE)
StorageClass (example):
```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: filestore-rwx
provisioner: filestore.csi.storage.gke.io
allowVolumeExpansion: true
parameters:
  tier: ENTERPRISE
  network: default
volumeBindingMode: WaitForFirstConsumer
mountOptions:
  - nconnect=8
```
Dragonfly CR:
```yaml
apiVersion: dragonflydb.io/v1alpha1
kind: Dragonfly
metadata:
  name: df-filestore
spec:
  replicas: 1
  snapshot:
    # Current API: works intermittently; cannot reference an existing PVC
    persistentVolumeClaimSpec:
      accessModes: ["ReadWriteMany"]
      storageClassName: filestore-rwx
      resources:
        requests:
          storage: 50Gi
```
Proposed alternative (feature request):
```yaml
apiVersion: dragonflydb.io/v1alpha1
kind: Dragonfly
metadata:
  name: df-filestore
spec:
  replicas: 1
  snapshot:
    # New field (mutually exclusive with persistentVolumeClaimSpec)
    existingPersistentVolumeClaimName: df-snapshots-pvc
```
Expected behavior
- If `existingPersistentVolumeClaimName` is set, the Operator:
  - Validates that the PVC exists and that its access mode/storage class are compatible.
  - Skips provisioning a new PVC/PV.
  - Mounts the referenced claim into the StatefulSet/Pod used for Dragonfly snapshots.
Actual behavior
- Operator intermittently creates the PVC but PV doesn’t bind, or Pod mount fails claiming the PVC already exists (even though it’s the same newly created claim).
Additional context
- Dragonfly Operator docs show only `snapshot.persistentVolumeClaimSpec` today; there's no documented "use existing PVC" field.
- We confirmed Filestore CSI basics per GKE docs; manual Pods can mount the same StorageClass successfully.
Proposed API
CRD schema addition (names/validation illustrative):
```yaml
snapshot:
  oneOf:
    - required: ["persistentVolumeClaimSpec"]
    - required: ["existingPersistentVolumeClaimName"]
  properties:
    existingPersistentVolumeClaimName:
      type: string
      description: "Name of an existing PVC to use for Dragonfly snapshots"
```
Controller logic:
- If `existingPersistentVolumeClaimName` is set → fetch the PVC, validate it, inject it into the StatefulSet volumes/volumeMounts, and skip the reconciler path that creates the PVC.
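To make the intended wiring concrete, here is a minimal sketch of the pod-template fragment the Operator could render when the new field is set. The volume name and mount path are assumptions chosen for illustration, not the Operator's actual values.

```yaml
# Illustrative StatefulSet pod-template fragment when
# existingPersistentVolumeClaimName is set. Volume name and mountPath
# are assumptions for illustration only.
spec:
  containers:
    - name: dragonfly
      volumeMounts:
        - name: snapshots
          mountPath: /dragonfly/snapshots   # assumed snapshot directory
  volumes:
    - name: snapshots
      persistentVolumeClaim:
        claimName: df-snapshots-pvc          # the referenced existing PVC
```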