@kvaps kvaps commented Jan 7, 2026

Summary

Add two-tier validation to prevent misuse of allow-two-primaries for RWX block volumes:

  • Controller-side validation (ControllerPublishVolume): prevents multiple nodes from attaching the same volume to different VMs
  • Node-side validation (NodePublishVolume): prevents the same node from mounting the volume for multiple pods
  • Support for KubeVirt hotplug disks (via ownerReferences resolution)

Note: This validation applies ONLY to RWX block volumes (volumeMode: Block). Filesystem RWX volumes (including NFS) are not affected.
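Conceptually, the scope gate is a single capability check. Below is a minimal Go sketch with local stand-ins for the CSI spec types (the real driver uses `github.com/container-storage-interface/spec/lib/go/csi`; `needsValidation` and the field names are hypothetical, not the PR's actual code):

```go
package main

import "fmt"

// Local stand-ins for the CSI spec types.
type accessMode int

const (
	singleNodeWriter     accessMode = iota
	multiNodeMultiWriter            // corresponds to RWX
)

type volumeCapability struct {
	block bool       // volumeMode: Block
	mode  accessMode // PVC access mode
}

// needsValidation reports whether the two-tier validation applies:
// only multi-node multi-writer (RWX) block volumes are checked.
func needsValidation(c volumeCapability) bool {
	return c.block && c.mode == multiNodeMultiWriter
}

func main() {
	fmt.Println(needsValidation(volumeCapability{block: true, mode: multiNodeMultiWriter}))  // true: RWX block
	fmt.Println(needsValidation(volumeCapability{block: false, mode: multiNodeMultiWriter})) // false: filesystem RWX (incl. NFS)
	fmt.Println(needsValidation(volumeCapability{block: true, mode: singleNodeWriter}))      // false: non-RWX block
}
```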

Implementation Details

Controller-Side Validation (ControllerPublishVolume)

When attaching a RWX block volume:

  1. Query Kubernetes API for all pods using the target PVC
  2. Filter out non-active pods (Succeeded/Failed status)
  3. Extract VM names from pods:
    • Direct: read vm.kubevirt.io/name label from virt-launcher pods
    • Hotplug: follow ownerReferences from hotplug-disk pods to virt-launcher pod
  4. Validate that all pods belong to the same VM
  5. Reject if pods from different VMs are found
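The filtering and VM-name resolution in steps 2-5 can be sketched as follows. This is a simplified model, not the PR's actual code: `pod` stands in for `corev1.Pod` (the real driver reads these fields from API objects fetched via the dynamic Kubernetes client), and `validateSingleVM` is a hypothetical helper name:

```go
package main

import "fmt"

const vmNameLabel = "vm.kubevirt.io/name"

// pod is a simplified stand-in for corev1.Pod.
type pod struct {
	name   string
	phase  string // Pending, Running, Succeeded, Failed
	labels map[string]string
	owner  string // owning virt-launcher pod name, set for hotplug-disk pods
}

// validateSingleVM skips terminated pods, resolves each remaining pod
// to a VM name (directly via the label, or via ownerReferences for
// hotplug pods), and rejects the attach if more than one VM is involved.
func validateSingleVM(pods []pod, byName map[string]pod) error {
	vms := map[string]bool{}
	for _, p := range pods {
		if p.phase == "Succeeded" || p.phase == "Failed" {
			continue // terminated pods no longer hold the volume
		}
		vm := p.labels[vmNameLabel]
		if vm == "" && p.owner != "" {
			vm = byName[p.owner].labels[vmNameLabel] // hotplug: follow owner
		}
		if vm != "" {
			vms[vm] = true
		}
	}
	if len(vms) > 1 {
		return fmt.Errorf("RWX block volume is used by pods from %d different VMs", len(vms))
	}
	return nil
}

func main() {
	launcher := pod{name: "virt-launcher-vm1", phase: "Running",
		labels: map[string]string{vmNameLabel: "vm1"}}
	hotplug := pod{name: "hp-volume-abc", phase: "Running", owner: "virt-launcher-vm1"}
	other := pod{name: "virt-launcher-vm2", phase: "Running",
		labels: map[string]string{vmNameLabel: "vm2"}}
	byName := map[string]pod{launcher.name: launcher}

	fmt.Println(validateSingleVM([]pod{launcher, hotplug}, byName))      // <nil>: same VM
	fmt.Println(validateSingleVM([]pod{launcher, other}, byName) != nil) // true: rejected
}
```

Note that a hotplug-disk pod carries no VM label of its own, which is why the ownerReferences hop back to the virt-launcher pod is needed.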

Error example (different VMs on different nodes):

Warning  FailedAttachVolume  6s (x5 over 15s)  attachdetach-controller  
AttachVolume.Attach failed for volume "pvc-xxx" : rpc error: code = FailedPrecondition 
desc = ControllerPublishVolume failed for pvc-xxx: RWX block volume tenant-ns/vm-pvc 
is being used by pods from different VMs (vm1 and vm2); this is not supported - 
RWX block volumes with allow-two-primaries are only for live migration of a single VM

Node-Side Validation (NodePublishVolume)

When mounting a RWX block volume:

  1. Check publish directory: /var/lib/kubelet/plugins/kubernetes.io/csi/volumeDevices/publish/<volumeID>/
  2. If directory contains existing entries (other pod UIDs), block the mount
  3. This prevents a second pod from mounting on the same node

Error example (same node):

Warning  FailedMapVolume  6s (x5 over 14s)  kubelet  
MapVolume.MapPodDevice failed for volume "pvc-xxx" : rpc error: code = FailedPrecondition 
desc = NodePublishVolume failed for pvc-xxx: RWX block volume is already mounted 
for another pod on this node - multiple pods on the same node sharing a block device 
is not supported (only for live migration across nodes)

Why node-side validation is needed:

CSI protocol calls ControllerPublishVolume once per (volumeID, nodeID) pair, NOT per pod. This means:

  • VM1 starts on node1 → ControllerPublishVolume called → volume attached
  • VM2 starts on node1 → ControllerPublishVolume NOT called (already attached)
  • VM2 → NodePublishVolume called directly

Without node-side validation, a pod belonging to a different VM (VM2) could mount the same volume on node1.
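The sequence above can be illustrated with a toy model of the attach bookkeeping (the names here are illustrative, not real CSI or external-attacher code): attach is keyed on the (volumeID, nodeID) pair, so a second pod on an already-attached node skips the controller call and only NodePublishVolume runs for it.

```go
package main

import "fmt"

type attachKey struct{ volumeID, nodeID string }

// model counts which CSI calls a pod start triggers.
type model struct {
	attached        map[attachKey]bool
	controllerCalls int
	nodeCalls       int
}

func (m *model) startPod(volumeID, nodeID string) {
	k := attachKey{volumeID, nodeID}
	if !m.attached[k] {
		m.controllerCalls++ // ControllerPublishVolume: once per (volumeID, nodeID)
		m.attached[k] = true
	}
	m.nodeCalls++ // NodePublishVolume: once per pod, always
}

func main() {
	m := &model{attached: map[attachKey]bool{}}
	m.startPod("pvc-xxx", "node1") // VM1: controller + node publish
	m.startPod("pvc-xxx", "node1") // VM2 on node1: node publish only
	fmt.Println(m.controllerCalls, m.nodeCalls) // 1 2
}
```

This is why the node-side check is the only place the second pod can be rejected.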

Design Decisions

Why filesystem check instead of alternatives?

Alternative 1: Kubernetes API on node

  • CSI node pods have no ServiceAccount with K8s API access (by design)
  • Adding pod read permissions would be a security risk
  • Node pods should operate with minimal privileges
  • Would require RBAC changes that maintainers likely won't accept

Alternative 2: LINSTOR API with Aux properties

  • Requires LINSTOR API calls from every NodePublishVolume
  • Need to store VM name in properties, but can't reliably get VM name on node without K8s API
  • Adds complexity and external dependency
  • Slower than local filesystem check

Chosen solution: Local filesystem check

  • No external API calls - simpler, faster, no network dependency
  • No RBAC changes - works with existing permissions
  • Works for all cases - doesn't need VM name identification on node
  • Maintainer-friendly - no security concerns, minimal code
  • Simple and reliable - direct filesystem check

The node validation blocks ALL same-node multi-pod scenarios for block volumes, which is correct behavior:

  • Live migration happens across nodes, not on the same node
  • Multiple pods from same VM on same node sharing a block device is not a valid use case

Why Two-Tier Validation?

Controller validation (K8s API):

  • Blocks different VMs across multiple nodes
  • Blocks different VMs on same node (when both start simultaneously)
  • Fast fail before attach
  • Better error messages with VM names

Node validation (filesystem check):

  • Blocks same-node edge case (second pod starts after first is attached)
  • Protection when ControllerPublishVolume is skipped
  • Last line of defense before mount
  • No external API dependencies

Known Limitations

Migration Blocked by Failed Pods

If VM2 attempts to start on node2 but is blocked by controller validation (because VM1 is running on node1), VM2 will remain in Pending state with FailedAttachVolume. If VM1 then attempts to migrate to node2, the migration will also be blocked because:

  1. VM2 pod is in Pending with nodeName: node2 (already scheduled)
  2. Controller validation sees both VM1 (old + new target) and VM2 (pending)
  3. Migration is rejected with the same error

Workaround: Delete the blocked VM2 pod before migrating VM1 to the same node.

Why this happens: Kubernetes keeps pods in Pending state even after repeated attach failures. The CSI controller has no way to distinguish between a "legitimately pending" pod and a "permanently failed to attach" pod.

Test plan

  • Run unit tests: go test ./pkg/driver/... -run TestValidateRWX -v
  • Test different VMs on different nodes - blocked correctly
  • Test same node scenario - blocked correctly
  • Test live migration scenario with KubeVirt VM
  • Test hotplug disk attachment/detachment

Add validation to prevent misuse of allow-two-primaries for RWX block
volumes. The driver now checks that multiple pods using the same RWX
block volume belong to the same KubeVirt VM (identified by the
vm.kubevirt.io/name label).

This allows live migration (where source and target pods have the same
VM label) while preventing incorrect usage where different VMs try to
share the same block device.

The validation is performed in ControllerPublishVolume before calling
Attach, using the existing dynamic Kubernetes client.

Co-Authored-By: Claude <[email protected]>
Signed-off-by: Andrei Kvapil <[email protected]>
@kvaps kvaps force-pushed the feat/rwx-block-validation branch from 3c201f2 to 52ab015 on January 7, 2026 at 15:58
@kvaps kvaps marked this pull request as ready for review January 7, 2026 20:03
kvaps added a commit to cozystack/cozystack that referenced this pull request Jan 7, 2026
Add custom linstor-csi image build to packages/system/linstor:

- Add Dockerfile based on upstream linstor-csi
- Import patch from upstream PR #403 for RWX block volume validation
  (prevents misuse of allow-two-primaries in KubeVirt live migration)
- Update Makefile to build both piraeus-server and linstor-csi images
- Configure LinstorCluster CR to use custom linstor-csi image in
  CSI controller and node pods

The RWX validation patch ensures that RWX block volumes with
allow-two-primaries are only used by pods belonging to the same
KubeVirt VM during live migration.

Upstream PR: piraeusdatastore/linstor-csi#403

Co-Authored-By: Claude <[email protected]>
Signed-off-by: Andrei Kvapil <[email protected]>
@kvaps kvaps requested a review from WanzenBug January 7, 2026 20:12
@kvaps kvaps force-pushed the feat/rwx-block-validation branch 4 times, most recently from c61d278 to d3d1393 on January 7, 2026 at 20:33
Add two-tier validation to prevent misuse of allow-two-primaries for
RWX block volumes. This ensures volumes are only shared during live
migration of a single VM, not between different VMs.

Controller-side validation (ControllerPublishVolume):
- Query Kubernetes API for all pods using the PVC
- Extract VM names from pods (supports both virt-launcher and hotplug disks)
- Validate all pods belong to the same VM
- Reject if different VMs are detected

Node-side validation (NodePublishVolume):
- Check local publish directory for existing mounts
- Block second pod from mounting on the same node
- Protects edge case where ControllerPublishVolume is skipped
  (CSI calls ControllerPublishVolume once per node, not per pod)

Design decisions:
- Controller uses K8s API for VM identification
- Node uses filesystem check (no K8s API access on node pods)
- Simple, maintainer-friendly, no external dependencies
- Only affects RWX block volumes (volumeMode: Block)

Co-Authored-By: Claude <[email protected]>
Signed-off-by: Andrei Kvapil <[email protected]>
@kvaps kvaps force-pushed the feat/rwx-block-validation branch from d3d1393 to 991a9da on January 7, 2026 at 20:37