@@ -332,8 +333,7 @@ Apart from the obvious API change and behavior described above, kubelet + volume
 * Kubelet's VolumeManager needs to track which SELinux label should get a volume in global mount (to call `MountDevice()` with the right mount options).
 * It must call `UnmountDevice()` even when another pod wants to re-use a mounted volume, but it has a different SELinux context.
 * After kubelet restart, kubelet must reconstruct the original SELinux label it used to SetUp and MountDevice of each volume.
-* Volume reconstruction must be updated to get the SELinux label from mount (in-tree volume plugins) or stored json file (CSI).
-  This label must be updated in VolumeManager's ActualStateOfWorld after reconstruction.
+  See Volume Reconstruction below.
 * Reconciler must check also SELinux context used to mount a volume (both mounted devices and volumes) before considering what operation to take on a volume (`MountVolume` or `UnmountVolume`/`UnmountDevice` or nothing).
   It must throw proper error message telling that a Pod can't start because its volume is used by another Pod with a different SELinux context.
 * This is a good point to capture any metrics proposed below.
@@ -347,6 +347,40 @@ Apart from the obvious API change and behavior described above, kubelet + volume
   This error is already part of generic `storage_operation_duration_seconds` metric (with a label for failures).
 * Note that kubelet can't check mount options after `NodeStage`, because a CSI driver does not need to mount during NodeStage or it may choose to mount to another directory than the staging one.
 
+#### Volume Reconstruction
+
+Today, volume reconstruction works in this way:
+
+1. When kubelet starts, it starts populating the volume manager's Desired State of World (DSW) immediately (e.g. with static pods),
+   and it starts running Pods and mounting volumes for them. Kubelet depends on volume plugin / CSI driver idempotency if a volume
+   is already mounted. At this point, the Actual State of World (ASW) is empty and it is getting populated with volumes
+   mounted for Pods that are getting started.
+2. When kubelet establishes a connection to the API server and the DSW is fully populated, it reconstructs volumes from disk only for volumes not
+   present in the DSW. This should cover only volumes that don't have a Pod in the API server and need to be unmounted. Kubelet adds the
+   volumes to the ASW and lets the regular reconciler unmount them.
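
As a rough illustration of step 2 of the current behavior, the sketch below filters the volumes found on disk down to those missing from the DSW. It is not kubelet code: `volumeName` and `volumesToReconstruct` are hypothetical names, and the DSW is reduced to a simple set.

```go
package main

import "fmt"

// volumeName is a hypothetical stand-in for kubelet's unique volume name.
type volumeName string

// volumesToReconstruct returns the on-disk volumes that are absent from the
// DSW. In the current scheme only these are reconstructed and added to the
// ASW, so that the regular reconciler unmounts them.
func volumesToReconstruct(onDisk []volumeName, dsw map[volumeName]bool) []volumeName {
	var orphans []volumeName
	for _, v := range onDisk {
		if !dsw[v] {
			orphans = append(orphans, v)
		}
	}
	return orphans
}

func main() {
	// Assumed example data: vol-a still has a Pod, vol-b does not.
	dsw := map[volumeName]bool{"vol-a": true}
	fmt.Println(volumesToReconstruct([]volumeName{"vol-a", "vol-b"}, dsw))
}
```

Because only orphaned volumes reach the ASW this way, nothing in this flow records how volumes that are still in use were mounted.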
+
+This approach does not work for SELinux, because at step 1 above, the volume manager needs to know *if* a volume is mounted and with
+*what SELinux context mount option*. If the required and existing SELinux contexts of a volume match, the volume manager can continue
+mounting the volume. If they don't, the volume manager needs to unmount the volume with the wrong SELinux context first and mount it again
+with the right one.
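
The decision described in the paragraph above can be sketched as follows. This is a minimal illustration under assumed names: `mountedVolume`, `action` and `decide` are hypothetical and do not correspond to real VolumeManager types.

```go
package main

import "fmt"

// action is a hypothetical label for what the volume manager does next.
type action string

const (
	actionMount   action = "mount"              // continue with the usual mount
	actionRemount action = "unmount-then-mount" // wrong context: unmount first
)

// mountedVolume is a hypothetical record of what reconstruction found on disk.
type mountedVolume struct {
	Mounted        bool
	SELinuxContext string // value of the SELinux context mount option, if mounted
}

// decide compares the existing and the required SELinux context.
// A volume that is not mounted, or is mounted with the required context, can
// simply be mounted (relying on plugin / CSI driver idempotency). Otherwise it
// has to be unmounted and mounted again with the right context.
func decide(existing mountedVolume, required string) action {
	if !existing.Mounted || existing.SELinuxContext == required {
		return actionMount
	}
	return actionRemount
}

func main() {
	// The context strings are example values only.
	existing := mountedVolume{Mounted: true, SELinuxContext: "system_u:object_r:container_file_t:s0:c1,c2"}
	fmt.Println(decide(existing, "system_u:object_r:container_file_t:s0:c1,c2"))
	fmt.Println(decide(existing, "system_u:object_r:container_file_t:s0:c3,c4"))
}
```

The sketch makes the dependency explicit: `decide` cannot be evaluated unless the existing context is already known when the first Pods start.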
+
+We need to populate the ASW as soon as possible after kubelet starts. Suggested changes:
+
+1. When kubelet starts, the volume manager will reconstruct all volumes incl. their SELinux contexts and add them to the ASW as *uncertain*.
+   At this point, kubelet may not have a connection to the API server yet, hence this phase of volume reconstruction must work without it.
+   Kubelet will store all reconstructed volumes in a separate array, to finish the reconstruction when the API server is available.
+   * This implies that volume plugins can't expect that the API server is available in `ConstructVolumeSpec`, `ConstructBlockVolumeSpec`,
+     `NewMounter`, `NewBlockVolumeMapper`, and `NewDeviceMounter` calls. Especially all `CSIDriver` checks in the CSI volume plugin must
+     be moved to `SetUpAt` or `TearDownAt`, and their block volume counterparts.
+2. Only after the initial ASW is populated, kubelet starts running pods and mounting volumes for them. Since the existing volumes are marked
+   as *uncertain*, the volume manager will re-mount them (depending on volume plugin / CSI driver idempotency). Note that only mounting
+   is allowed at this point; the volume manager can't unmount anything, because the DSW is not yet populated.
+3. When kubelet establishes a connection to the API server, it populates the DSW as usual.
+4. When the DSW is fully populated, the volume manager will finish reconstruction of volumes, i.e. fill in devicePaths from the
+   `node.status.volumesInUse` field.
+5. Only after the second phase of volume reconstruction is done, i.e. the DSW is fully populated and volumes are fully reconstructed,
+   the volume manager starts unmounting volumes that are not in the DSW.
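
The ordering constraints in the steps above can be summarized as a small phase model. This is only a sketch: the `phase` type, its constants and the `canMount` / `canUnmount` helpers are hypothetical names chosen for illustration, not part of kubelet.

```go
package main

import "fmt"

// phase is a hypothetical model of the proposed startup sequence.
type phase int

const (
	phaseReconstructing phase = iota // step 1: ASW is filled from disk, volumes marked uncertain
	phaseMountOnly                   // steps 2-3: pods start, DSW is still being populated
	phaseFinishing                   // step 4: DSW is full, devicePaths are being completed
	phaseSteady                      // step 5: reconstruction is done
)

var phaseNames = [...]string{"reconstructing", "mount-only", "finishing", "steady"}

// canMount reports whether mounting is allowed: only after the initial ASW is populated.
func canMount(p phase) bool { return p >= phaseMountOnly }

// canUnmount reports whether unmounting is allowed: only once the DSW is fully
// populated and the second phase of reconstruction has finished.
func canUnmount(p phase) bool { return p == phaseSteady }

func main() {
	for p := phaseReconstructing; p <= phaseSteady; p++ {
		fmt.Printf("%-14s mount=%t unmount=%t\n", phaseNames[p], canMount(p), canUnmount(p))
	}
}
```

Keeping unmount disabled until the last phase reflects the reasoning in step 2: without a complete DSW, the volume manager cannot tell an orphaned volume from one whose Pod has simply not been seen yet.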
+
 ### Implementation phases
 
 Due to change of Kubernetes behavior, we will implement the feature only for cases where it can't break anything first.