Skip to content

Conversation

@xing-yang
Copy link
Contributor

@xing-yang xing-yang commented Nov 25, 2025

What this PR does / why we need it:
This PR changes the return code to a non-final error if supervisor PVC is not in bound state within the timeout limit. This allows external-provisioner to continue with retries and eventually deletes the volume when it is finally created.

Which issue this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close that issue when PR gets merged): fixes #

Testing done:
WCP pre-check pipeline: https://jenkins-vcf-csifvt.devops.broadcom.net/job/wcp-instapp-e2e-pre-checkin/661/ (failed but unrelated to my change as I only modified pvCSI)
VKS pre-check pipeline: https://jenkins-vcf-csifvt.devops.broadcom.net/view/instapp/job/vks-instapp-e2e-pre-checkin/615/ (passed)

Manual tests with my fix:

Changed timeout in pvCSI to 1 min:
env:
- name: PROVISION_TIMEOUT_MINUTES
 value: "1"  ⇐ decreased from 4 minutes to 1 minute

Create a 50Gi PVC.
root@422c5ad9cdf63ebf0f433aecddad9ff7 [ ~ ]# cat pvc.yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: test-pvc
  namespace: test-ns
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 50Gi
  storageClassName: wcpglobal-storage-profile

Wrote data (>29G) to the volume.
root@422c5ad9cdf63ebf0f433aecddad9ff7 [ ~ ]# kubectl exec data-writer -n test-ns -- sh -c "df -h /data && du -sh /data"
Filesystem                Size      Used Available Use% Mounted on
/dev/sdb                 48.9G     29.3G     17.1G  63% /data
29.3G	/data

Create a VolumeSnapshot on Guest:
root@422c5ad9cdf63ebf0f433aecddad9ff7 [ ~ ]# kubectl create -f snapshot.yaml
volumesnapshot.snapshot.storage.k8s.io/test-snapshot created
root@422c5ad9cdf63ebf0f433aecddad9ff7 [ ~ ]# kubectl get vs -n test-ns
NAME            READYTOUSE   SOURCEPVC   SOURCESNAPSHOTCONTENT   RESTORESIZE   SNAPSHOTCLASS                SNAPSHOTCONTENT                                    CREATIONTIME   AGE
test-snapshot   true         test-pvc                            50Gi          volumesnapshotclass-delete   snapcontent-fc744924-87b9-4ecc-82df-1e42616bf3a8   <invalid>      6s

Supervisor:
root@422c5ad9cdf63ebf0f433aecddad9ff7 [ ~ ]# kubectl get vs -n svc-tkg-827eq
NAME                                                                        READYTOUSE   SOURCEPVC                                                                   SOURCESNAPSHOTCONTENT   RESTORESIZE   SNAPSHOTCLASS                SNAPSHOTCONTENT                                    CREATIONTIME   AGE
9ba071c6-f073-4302-93be-8c7bf8169354-fc744924-87b9-4ecc-82df-1e42616bf3a8   true         9ba071c6-f073-4302-93be-8c7bf8169354-58880e37-c08c-47b4-9821-620e282f600a                           50Gi          volumesnapshotclass-delete   snapcontent-4775f562-a865-4ff8-a3b2-5d26bb21eb31   8s             17s

Guest:
Create a PVC from VolumeSnapshot.

root@422c5ad9cdf63ebf0f433aecddad9ff7 [ ~ ]# cat restore.yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: test-restore
  namespace: test-ns
spec:
  storageClassName: wcpglobal-storage-profile
  dataSource:
    name: test-snapshot
    kind: VolumeSnapshot
    apiGroup: snapshot.storage.k8s.io
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 50Gi

Guest:
root@422c5ad9cdf63ebf0f433aecddad9ff7 [ ~ ]# kubectl get pvc -n test-ns
NAME           STATUS    VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS                VOLUMEATTRIBUTESCLASS   AGE
test-pvc       Bound     pvc-58880e37-c08c-47b4-9821-620e282f600a   50Gi       RWO            wcpglobal-storage-profile   <unset>                 11h
test-restore   Pending                                                                        wcpglobal-storage-profile   <unset>                 3m6s

vsphere csi controller logs with timeout error:

{"level":"info","time":"2025-11-25T03:06:56.637592064Z","caller":"wcpguest/controller.go:304","msg":"CreateVolume: called with args name:\"pvc-192bff8d-2437-4aa2-a300-91ba3f6ffff5\" capacity_range:{required_bytes:53687091200} volume_capabilities:{mount:{fs_type:\"ext4\"} access_mode:{mode:SINGLE_NODE_WRITER}} parameters:{key:\"svStorageClass\" value:\"wcpglobal-storage-profile\"} volume_content_source:{snapshot:{snapshot_id:\"9ba071c6-f073-4302-93be-8c7bf8169354-9ee67a09-5fc7-4413-b370-c2aa43335275\"}} accessibility_requirements:{requisite:{segments:{key:\"topology.kubernetes.io/zone\" value:\"domain-c52\"}} preferred:{segments:{key:\"topology.kubernetes.io/zone\" value:\"domain-c52\"}}}","TraceId":"0836c3bd-af58-41fc-b414-9bdc7b7c2201"}

{"level":"info","time":"2025-11-25T03:06:56.938820871Z","caller":"wcpguest/controller_helper.go:370","msg":"provisionTimeout is set to 1 minutes","TraceId":"0836c3bd-af58-41fc-b414-9bdc7b7c2201"}

{"level":"info","time":"2025-11-25T03:06:56.939861911Z","caller":"wcpguest/controller_helper.go:324","msg":"Waiting up to 60 seconds for PersistentVolumeClaim 9ba071c6-f073-4302-93be-8c7bf8169354-192bff8d-2437-4aa2-a300-91ba3f6ffff5 in namespace svc-tkg-827eq to have phase Bound","TraceId":"0836c3bd-af58-41fc-b414-9bdc7b7c2201"}

{"level":"error","time":"2025-11-25T03:07:56.969531765Z","caller":"wcpguest/controller.go:445","msg":"failed to create volume on namespace: svc-tkg-827eq in supervisor cluster. Error: persistentVolumeClaim 9ba071c6-f073-4302-93be-8c7bf8169354-192bff8d-2437-4aa2-a300-91ba3f6ffff5 in namespace svc-tkg-827eq not in phase Bound within 60 seconds","TraceId":"0836c3bd-af58-41fc-b414-9bdc7b7c2201","stacktrace":"sigs.k8s.io/vsphere-csi-driver/v3/pkg/csi/service/wcpguest.(*controller).CreateVolume.func1\n\t/build/pkg/csi/service/wcpguest/controller.go:445\nsigs.k8s.io/vsphere-csi-driver/v3/pkg/csi/service/wcpguest.(*controller).CreateVolume\n\t/build/pkg/csi/service/wcpguest/controller.go:553\ngithub.com/container-storage-interface/spec/lib/go/csi._Controller_CreateVolume_Handler\n\t/go/pkg/mod/github.com/container-storage-interface/[email protected]/lib/go/csi/csi_grpc.pb.go:444\ngoogle.golang.org/grpc.(*Server).processUnaryRPC\n\t/go/pkg/mod/google.golang.org/[email protected]/server.go:1405\ngoogle.golang.org/grpc.(*Server).handleStream\n\t/go/pkg/mod/google.golang.org/[email protected]/server.go:1815\ngoogle.golang.org/grpc.(*Server).serveStreams.func2.1\n\t/go/pkg/mod/google.golang.org/[email protected]/server.go:1035"}

Delete PVC being created from VolumeSnapshot without deleting the VolumeSnapshot.
root@422c5ad9cdf63ebf0f433aecddad9ff7 [ ~ ]# kubectl delete pvc test-restore -n test-ns
persistentvolumeclaim "test-restore" deleted

Supervisor:
PVC in Supervisor was pending. Eventually it got deleted after it was created.

root@422c5ad9cdf63ebf0f433aecddad9ff7 [ ~ ]# kubectl get pvc -n svc-tkg-827eq
NAME                                                                        STATUS    VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS                VOLUMEATTRIBUTESCLASS   AGE
9ba071c6-f073-4302-93be-8c7bf8169354-192bff8d-2437-4aa2-a300-91ba3f6ffff5   Pending                                                                        wcpglobal-storage-profile   <unset>                 3m42s
9ba071c6-f073-4302-93be-8c7bf8169354-58880e37-c08c-47b4-9821-620e282f600a   Bound     pvc-4bb82950-fc50-42ae-827f-cb37e5db0589   50Gi       RWO            wcpglobal-storage-profile   <unset>                 11h
root@422c5ad9cdf63ebf0f433aecddad9ff7 [ ~ ]# kubectl get pvc -n svc-tkg-827eq
NAME                                                                        STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS                VOLUMEATTRIBUTESCLASS   AGE
9ba071c6-f073-4302-93be-8c7bf8169354-58880e37-c08c-47b4-9821-620e282f600a   Bound    pvc-4bb82950-fc50-42ae-827f-cb37e5db0589   50Gi       RWO            wcpglobal-storage-profile   <unset>                 11h

Special notes for your reviewer:
I also submitted a fix in external-provisioner: kubernetes-csi/external-provisioner#1448
With the provisioner fix, the PVC in supervisor will also be deleted after it is finally created from VolumeSnapshot if the VolumeSnapshot itself is also being deleted. VolumeSnapshot will be deleted as well.

Note: The fix in pvCSI is independent of the fix in external-provisioner. Without the provisioner fix, the PVC in supervisor will be deleted after it is finally created from VolumeSnapshot if the VolumeSnapshot itself is not being deleted.

Release note:

Set error code to indicate timeout in pvCSI to avoid leaking volumes.

@k8s-ci-robot k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Nov 25, 2025
@k8s-ci-robot k8s-ci-robot added approved Indicates a PR has been approved by an approver from all required OWNERS files. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Nov 25, 2025
@xing-yang xing-yang changed the title WIP: Test fix error code WIP: Set error code to indicate timeout in pvCSI Nov 25, 2025
@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Nov 25, 2025
@xing-yang xing-yang changed the title WIP: Set error code to indicate timeout in pvCSI Set error code to indicate timeout in pvCSI Nov 25, 2025
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Nov 25, 2025
@deepakkinni
Copy link
Collaborator

FAILED --- Jenkins Build #661

@divyenpatel
Copy link
Member

/ok-to-test

@k8s-ci-robot k8s-ci-robot added the ok-to-test Indicates a non-member PR verified by an org member that is safe to test. label Nov 26, 2025
@divyenpatel
Copy link
Member

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Nov 26, 2025
@xing-yang
Copy link
Contributor Author

/test pull-vsphere-csi-driver-verify-golangci-lint

@nikhilbarge
Copy link
Contributor

/approve

@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: nikhilbarge, xing-yang

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. lgtm "Looks good to me", indicates that a PR is ready to be merged. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants