Allow referencing an existing PVC for snapshots #342

@rdemoraes

Description

On GKE with Filestore CSI driver, we’re seeing intermittent provisioning/mount issues when the Dragonfly Operator manages the PVC via spec.snapshot.persistentVolumeClaimSpec. In many runs the Operator creates the PVC, but the PV is not created or the Pod mount phase fails with events claiming the PVC “already exists” — which is the very claim the Operator created.
To unblock production, we’d like the Operator to support referencing an existing PVC (pre-created and validated) via something like:

spec:
  snapshot:
    existingPersistentVolumeClaimName: <name>

Right now the CRD only supports:

spec:
  snapshot:
    persistentVolumeClaimSpec: { ... }

Environment

  • Kubernetes (GKE) version: 1.33.2-gke.1240000
  • Dragonfly Operator version / chart: dragonfly-operator-v1.1.11 / v1.1.11
  • Dragonfly image version: docker.dragonflydb.io/dragonflydb/operator:v1.1.11
  • Filestore CSI driver: enabled (using GKE-provided storage classes like standard-rwx / enterprise-rwx as applicable)
  • Network / node pool: same cluster & node pool used for control tests
  • StorageClass reclaim/binding mode tried: WaitForFirstConsumer (default) and Immediate — both hit the same behavior

What happens

  1. Apply a Dragonfly CR that defines spec.snapshot.persistentVolumeClaimSpec pointing to a Filestore-backed RWX StorageClass.
  2. The Operator creates the PVC (this succeeds only intermittently).
  3. Either:
    • No PV gets bound (the PVC stays Pending), or
    • The Pod mount fails with an event like “PVC already exists” (even though it is the same, just-created claim).
  4. Even when the Operator does create the claim and the PV binds, the Pod may still fail to mount with the same “already exists” style error.

What we validated

  • Pre-created PVC works: If we create a Filestore-backed PVC ourselves and mount it on a manual Pod, it binds & mounts fine (no firewall/networking issues). Same cluster, same node pool.
  • Binding mode: Using a StorageClass with VolumeBindingMode: Immediate produces the same symptoms.
  • Cluster/Filestore sanity: Filestore CSI works for other Pods in the cluster.

Why this feature is needed

Because the Operator currently only accepts an embedded persistentVolumeClaimSpec, users cannot point to a known-good existing PVC as a workaround. Supporting an existingPersistentVolumeClaimName (mutually exclusive with persistentVolumeClaimSpec) would:

  • Let users pre-provision, validate, and manage the lifecycle of storage independently.
  • Unblock production when CSI provisioning is flaky or constrained by platform policies.
  • Align with patterns used by other operators that allow both “spec to create” and “reference existing” modes.

Minimal reproducible example (MRE)

StorageClass (example):

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: filestore-rwx
provisioner: filestore.csi.storage.gke.io
allowVolumeExpansion: true
parameters:
  tier: ENTERPRISE
  network: default
volumeBindingMode: WaitForFirstConsumer
mountOptions:
  - nconnect=8

Dragonfly CR:

apiVersion: dragonflydb.io/v1alpha1
kind: Dragonfly
metadata:
  name: df-filestore
spec:
  replicas: 1
  snapshot:
    # Current API: works intermittently; cannot reference an existing PVC
    persistentVolumeClaimSpec:
      accessModes: ["ReadWriteMany"]
      storageClassName: filestore-rwx
      resources:
        requests:
          storage: 50Gi

Proposed alternative (feature request):

apiVersion: dragonflydb.io/v1alpha1
kind: Dragonfly
metadata:
  name: df-filestore
spec:
  replicas: 1
  snapshot:
    # New field (mutually exclusive with persistentVolumeClaimSpec)
    existingPersistentVolumeClaimName: df-snapshots-pvc

Expected behavior

  • If existingPersistentVolumeClaimName is set, the Operator:
    • Validates that the PVC exists and that its access modes and storage class are compatible.
    • Skips provisioning a new PVC/PV.
    • Mounts the referenced claim into the StatefulSet/Pod used for Dragonfly snapshots.

Actual behavior

  • The Operator intermittently creates the PVC but no PV binds, or the Pod mount fails claiming the PVC already exists (even though it is the same newly created claim).

Additional context

  • Dragonfly Operator docs show only snapshot.persistentVolumeClaimSpec today; there’s no documented “use existing PVC” field.
  • We confirmed Filestore CSI basics per GKE docs; manual Pods can mount the same StorageClass successfully.

Proposed API

CRD schema addition (names/validation illustrative):

snapshot:
  oneOf:
    - required: ["persistentVolumeClaimSpec"]
    - required: ["existingPersistentVolumeClaimName"]
  properties:
    existingPersistentVolumeClaimName:
      type: string
      description: "Name of an existing PVC to use for Dragonfly snapshots"

Controller logic:

  • If existingPersistentVolumeClaimName is set → fetch the PVC, validate it, inject it into the StatefulSet volumes/volumeMounts, and skip the reconciler path that creates a PVC.
