Commit 6e0342b

Add volume reconstruction changes
1 parent c43f6d6 commit 6e0342b

1 file changed: +36 -2 lines changed

keps/sig-storage/1710-selinux-relabeling/README.md

Lines changed: 36 additions & 2 deletions
@@ -23,6 +23,7 @@
 - [Risks and Mitigations](#risks-and-mitigations)
 - [Design Details](#design-details)
 - [Required kubelet changes](#required-kubelet-changes)
+- [Volume Reconstruction](#volume-reconstruction)
 - [Implementation phases](#implementation-phases)
 - [Phase 1](#phase-1)
 - [Phase 2](#phase-2)
@@ -332,8 +333,7 @@ Apart from the obvious API change and behavior described above, kubelet + volume
 * Kubelet's VolumeManager needs to track which SELinux label should get a volume in global mount (to call `MountDevice()` with the right mount options).
 * It must call `UnmountDevice()` even when another pod wants to re-use a mounted volume, but it has a different SELinux context.
 * After kubelet restart, kubelet must reconstruct the original SELinux label it used to SetUp and MountDevice of each volume.
-* Volume reconstruction must be updated to get the SELinux label from mount (in-tree volume plugins) or stored json file (CSI).
-This label must be updated in VolumeManager's ActualStateOfWorld after reconstruction.
+See Volume Reconstruction below.
 * Reconciler must check also SELinux context used to mount a volume (both mounted devices and volumes) before considering what operation to take on a volume (`MountVolume` or `UnmountVolume`/`UnmountDevice` or nothing).
 It must throw proper error message telling that a Pod can't start because its volume is used by another Pod with a different SELinux context.
 * This is a good point to capture any metrics proposed below.
@@ -347,6 +347,40 @@ Apart from the obvious API change and behavior described above, kubelet + volume
 This error is already part of generic `storage_operation_duration_seconds` metric (with a label for failures).
 * Note that kubelet can't check mount options after `NodeStage`, because a CSI driver does not need to mount during NodeStage or it may choose to mount to another directory than the staging one.
 
+#### Volume Reconstruction
+
+Today, volume reconstruction works in this way:
+
+1. When kubelet starts, it starts populating the volume manager's Desired State of World (DSW) immediately (e.g. with static pods),
+and it starts running Pods and mounting volumes for them. Kubelet depends on volume plugin / CSI driver idempotency if a volume
+is already mounted. At this point, the Actual State of World (ASW) is empty and is being populated with volumes
+mounted for Pods that are being started.
+2. When kubelet establishes a connection to the API server and the DSW is fully populated, it reconstructs volumes from disk only for volumes not
+present in the DSW. This should cover only volumes that don't have a Pod in the API server and need to be unmounted. Kubelet adds the
+volumes to the ASW and lets the regular reconciler unmount them.
+
+This approach does not work for SELinux, because in step 1 above, the volume manager needs to know *if* a volume is mounted and with
+*what SELinux context mount option*. If the required and existing SELinux contexts of a volume match, the volume manager can continue
+mounting the volume. If they don't, the volume manager needs to unmount the volume with the wrong SELinux context first and mount it again
+with the right one.
+
+We need to populate the ASW as soon as possible after kubelet starts. Suggested changes:
+
+1. When kubelet starts, the volume manager will reconstruct all volumes, including their SELinux contexts, and add them to the ASW as *uncertain*.
+At this point, kubelet may not have a connection to the API server yet, hence this phase of volume reconstruction must work without it.
+Kubelet will store all reconstructed volumes in a separate array, to finish the reconstruction when the API server is available.
+* This implies that volume plugins can't expect that the API server is available in `ConstructVolumeSpec`, `ConstructBlockVolumeSpec`,
+`NewMounter`, `NewBlockVolumeMapper`, and `NewDeviceMounter` calls. In particular, all `CSIDriver` checks in the CSI volume plugin must
+be moved to `SetUpAt` or `TearDownAt`, and their block volume counterparts.
+2. Only after the initial ASW is populated, kubelet starts running pods and mounting volumes for them. Since the existing volumes are marked
+as *uncertain*, the volume manager will re-mount them (depending on volume plugin / CSI driver idempotency). Note that only mounting
+is allowed at this point; the volume manager can't unmount anything, because the DSW is not yet populated.
+3. When kubelet establishes a connection to the API server, it populates the DSW as usual.
+4. When the DSW is fully populated, the volume manager will finish reconstruction of volumes, i.e. fill in devicePaths from the
+`node.status.volumesInUse` field.
+5. Only after the second phase of volume reconstruction is done, i.e. the DSW is fully populated and volumes are fully reconstructed,
+the volume manager starts unmounting volumes that are not in the DSW.
+
 ### Implementation phases

Due to change of Kubernetes behavior, we will implement the feature only for cases where it can't break anything first.
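
The ordering proposed in the Volume Reconstruction section added by this commit can be illustrated with a small model. The following is a minimal Python sketch, not kubelet code (kubelet is written in Go). Every class, method, and field name in it is hypothetical and chosen only for illustration. It models three things from the text: reconstructed volumes enter the ASW as *uncertain* before any Pod starts, the mount decision compares the existing and required SELinux contexts, and unmounting of volumes missing from the DSW stays disabled until both the DSW is populated and reconstruction is finished. It only models the decisions; it does not attempt to say when an unmount for a mismatched context is actually allowed to run.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, List


class Action(Enum):
    MOUNT = "mount"
    REMOUNT = "unmount, then mount with the required context"
    NOTHING = "nothing"


@dataclass
class MountedVolume:
    selinux_context: str
    uncertain: bool = True  # reconstructed from disk, not yet confirmed by a mount call


@dataclass
class VolumeManagerModel:
    """Hypothetical, simplified stand-in for the volume manager's state."""
    asw: Dict[str, MountedVolume] = field(default_factory=dict)  # Actual State of World
    dsw: Dict[str, str] = field(default_factory=dict)            # volume name -> required context
    dsw_populated: bool = False
    reconstruction_finished: bool = False

    def reconstruct_from_disk(self, found: Dict[str, str]) -> None:
        # First phase: runs at startup and must not need the API server.
        for name, context in found.items():
            self.asw[name] = MountedVolume(context, uncertain=True)

    def plan_mount(self, name: str, required_context: str) -> Action:
        existing = self.asw.get(name)
        if existing is None:
            return Action.MOUNT
        if existing.selinux_context != required_context:
            return Action.REMOUNT
        # Same context: uncertain volumes are mounted again, relying on
        # volume plugin / CSI driver idempotency.
        return Action.MOUNT if existing.uncertain else Action.NOTHING

    def finish_reconstruction(self) -> None:
        # Second phase: only possible once the DSW is fully populated.
        if not self.dsw_populated:
            raise RuntimeError("DSW is not fully populated yet")
        self.reconstruction_finished = True

    def volumes_to_unmount(self) -> List[str]:
        # Unmounting volumes that no Pod wants is gated on both conditions.
        if not (self.dsw_populated and self.reconstruction_finished):
            return []
        return sorted(name for name in self.asw if name not in self.dsw)


if __name__ == "__main__":
    vm = VolumeManagerModel()
    vm.reconstruct_from_disk({"vol-a": "system_u:object_r:container_file_t:s0:c1,c2"})
    print(vm.plan_mount("vol-a", "system_u:object_r:container_file_t:s0:c3,c4"))
```

Keeping the unmount gate in one place (`volumes_to_unmount`) mirrors the constraint in the text that nothing may be unmounted for lack of a Pod while the DSW is still incomplete.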
