Commit 0f070f9

Merge pull request #4751 from saschagrunert/oci-volume-cri
KEP-4639: Update CRI API and workflow
2 parents 28c08f1 + 23105e9 commit 0f070f9

2 files changed

+61
-118
lines changed

keps/sig-node/4639-oci-volume-source/README.md

Lines changed: 60 additions & 117 deletions
@@ -299,7 +299,7 @@ the OS or version of the scanning software.
 ### Risks and Mitigations
 
 - **Security Risks:**:
-  - Allowing direct mounting of OCI images introduces potential attack
+  - Allowing direct mounting of OCI objects introduces potential attack
     vectors. Mitigation includes thorough security reviews and limiting access
     to trusted registries. Limiting to OCI artifacts (non-runnable content)
     and read-only mode will lessen the security risk.
@@ -336,8 +336,8 @@ metadata:
 spec:
   volumes:
     - name: oci-volume
-      oci:
-        image: "example.com/my-image:latest"
+      image:
+        reference: "example.com/my-image:latest"
         pullPolicy: IfNotPresent
   containers:
     - name: my-container
@@ -357,24 +357,22 @@ by:
 type VolumeSource struct {
 	// …
 
-	// oci represents a OCI object pulled and mounted on kubelet's host machine
-	// +featureGate=OCIVolume
-	// +optional
-	OCI *OCIVolumeSource `json:"oci,omitempty" protobuf:"bytes,30,opt,name=oci"
+	// image …
+	Image *ImageVolumeSource `json:"image,omitempty" protobuf:"bytes,30,opt,name=image"
 }
 ```
 
-And add the corresponding `OCIVolumeSource` type:
+And add the corresponding `ImageVolumeSource` type:
 
 ```go
-// OCIVolumeSource represents a OCI volume resource.
-type OCIVolumeSource struct {
-	// Required: Image or artifact reference to be used
-	Reference string `json:"reference,omitempty" protobuf:"bytes,1,opt,name=reference"`
-
-	// Policy for pulling OCI objects
-	// Defaults to IfNotPresent
-	// +optional
+// ImageVolumeSource represents a image volume resource.
+type ImageVolumeSource struct {
+	// Required: Image or artifact reference to be used.
+	//
+	Reference string `json:"reference" protobuf:"bytes,1,opt,name=reference"`
+
+	// Policy for pulling OCI objects.
+	//
 	PullPolicy PullPolicy `json:"pullPolicy,omitempty" protobuf:"bytes,2,opt,name=pullPolicy,casttype=PullPolicy"`
 }
 ```
@@ -392,15 +390,15 @@ if source.OCI != nil {
 		allErrs = append(allErrs, field.Forbidden(fldPath.Child("oci"), "may not specify more than 1 volume type"))
 	} else {
 		numVolumes++
-		allErrs = append(allErrs, validateOCIVolumeSource(source.OCI, fldPath.Child("oci"))...)
+		allErrs = append(allErrs, validateImageVolumeSource(source.OCI, fldPath.Child("oci"))...)
 	}
 }
 
 //
 ```
 
 ```go
-func validateOCIVolumeSource(oci *core.OCIVolumeSource, fldPath *field.Path) field.ErrorList {
+func validateImageVolumeSource(oci *core.ImageVolumeSource, fldPath *field.Path) field.ErrorList {
 	allErrs := field.ErrorList{}
 	if len(oci.Reference) == 0 {
 		allErrs = append(allErrs, field.Required(fldPath.Child("reference"), ""))
@@ -413,13 +411,13 @@ func validateOCIVolumeSource(oci *core.OCIVolumeSource, fldPath *field.Path) fie
 ```go
 //
 
-// Disallow subPath/subPathExpr for OCI volumes
+// Disallow subPath/subPathExpr for image volumes
 if v, ok := volumes[mnt.Name]; ok && v.OCI != nil {
 	if mnt.SubPath != "" {
-		allErrs = append(allErrs, field.Invalid(idxPath.Child("subPath"), mnt.SubPath, "not allowed in OCI volume sources"))
+		allErrs = append(allErrs, field.Invalid(idxPath.Child("subPath"), mnt.SubPath, "not allowed in image volume sources"))
 	}
 	if mnt.SubPathExpr != "" {
-		allErrs = append(allErrs, field.Invalid(idxPath.Child("subPathExpr"), mnt.SubPathExpr, "not allowed in OCI volume sources"))
+		allErrs = append(allErrs, field.Invalid(idxPath.Child("subPathExpr"), mnt.SubPathExpr, "not allowed in image volume sources"))
 	}
 }
 
@@ -482,8 +480,8 @@ While the `imagePullPolicy` is working on container level, the introduced
 values `IfNotPresent`, `Always` and `Never`, but will only pull once per pod.
 
 Technically it means that we need to pull in [`SyncPod`](https://github.com/kubernetes/kubernetes/blob/b498eb9/pkg/kubelet/kuberuntime/kuberuntime_manager.go#L1049)
-for OCI objects on a pod level and not during [`EnsureImageExists`](https://github.com/kubernetes/kubernetes/blob/b498eb9/pkg/kubelet/images/image_manager.go#L102)
-before the container gets started.
+for OCI objects on a pod level and not for each container during [`EnsureImageExists`](https://github.com/kubernetes/kubernetes/blob/b498eb9/pkg/kubelet/images/image_manager.go#L102)
+before they get started.
 
 If users want to re-pull artifacts when referencing moving tags like `latest`,
 then they need to restart / evict the pod.
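
The pull-policy semantics in this hunk can be sketched as a small decision function. This is a minimal sketch under stated assumptions, not kubelet code: `shouldPull`, `errNotPresent`, and the `present` flag are hypothetical names, and treating `Never` with a missing object as an error is an assumption carried over from how the container-level `imagePullPolicy` behaves. The decision would be evaluated once per pod during sync, not once per container.

```go
package main

import (
	"errors"
	"fmt"
)

// PullPolicy mirrors the three values named in the KEP text.
type PullPolicy string

const (
	PullAlways       PullPolicy = "Always"
	PullIfNotPresent PullPolicy = "IfNotPresent"
	PullNever        PullPolicy = "Never"
)

// errNotPresent is a hypothetical error for the "Never, but missing" case.
var errNotPresent = errors.New("OCI object not present and pull policy is Never")

// shouldPull is a hypothetical helper. It reports whether the OCI object
// should be pulled, given the volume's policy and whether the object is
// already present on the node.
func shouldPull(policy PullPolicy, present bool) (bool, error) {
	switch policy {
	case PullAlways:
		return true, nil
	case PullIfNotPresent:
		return !present, nil
	case PullNever:
		if !present {
			return false, errNotPresent
		}
		return false, nil
	default:
		return false, fmt.Errorf("unknown pull policy %q", policy)
	}
}

func main() {
	// Evaluate each policy for an object that is already on the node.
	for _, p := range []PullPolicy{PullAlways, PullIfNotPresent, PullNever} {
		pull, err := shouldPull(p, true)
		fmt.Println(p, pull, err)
	}
}
```

Because the decision is made per pod, a moving tag such as `latest` is only re-resolved when the pod itself is recreated, which matches the restart / evict note above.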
@@ -500,50 +498,44 @@ container image.
 #### CRI
 
 The CRI API is already capable of managing container images [via the `ImageService`](https://github.com/kubernetes/cri-api/blob/3a66d9d/pkg/apis/runtime/v1/api.proto#L146-L161).
-Those RPCs will be re-used for managing OCI artifacts, while the [`ImageSpec`](https://github.com/kubernetes/cri-api/blob/3a66d9d/pkg/apis/runtime/v1/api.proto#L798-L813)
-as well as [`PullImageResponse`](https://github.com/kubernetes/cri-api/blob/3a66d9d/pkg/apis/runtime/v1/api.proto#L1530-L1534)
-will be extended to mount the OCI object to a local path:
+Those RPCs will be re-used for managing OCI artifacts, while the [`Mount`](https://github.com/kubernetes/cri-api/blob/3a66d9d/pkg/apis/runtime/v1/api.proto#L220-L247)
+message will be extended to mount an OCI object using the existing [`ImageSpec`](https://github.com/kubernetes/cri-api/blob/3a66d9d/pkg/apis/runtime/v1/api.proto#L798-L813)
+on container creation:
 
 ```protobuf
-
-// ImageSpec is an internal representation of an image.
-message ImageSpec {
-    // …
-
-    // Indicate that the OCI object should be mounted.
-    bool mount = 20;
-
-    // SELinux label to be used.
-    string mount_label = 21;
-}
-
-message PullImageResponse {
+// Mount specifies a host volume to mount into a container.
+message Mount {
     // …
 
-    // Absolute local path where the OCI object got mounted.
-    string mountpoint = 2;
+    // Mount an image reference (image ID, with or without digest), which is a
+    // special use case for image volume mounts. If this field is set, then
+    // host_path should be unset. All OCI mounts are per feature definition
+    // readonly. The kubelet does an PullImage RPC and evaluates the returned
+    // PullImageResponse.image_ref value, which is then set to the
+    // ImageSpec.image field. Runtimes are expected to mount the image as
+    // required.
+    // Introduced in the OCI Volume Source KEP: https://kep.k8s.io/4639
+    ImageSpec image = 9;
 }
 ```
 
 This allows to re-use the existing kubelet logic for managing the OCI objects,
 with the caveat that the new `VolumeSource` won't be isolated in a dedicated
 plugin as part of the existing [volume manager](https://github.com/kubernetes/kubernetes/tree/6d0aab2/pkg/kubelet/volumemanager).
 
-The added `mount_label` allow the kubelet to support SELinux contexts.
+Runtimes are already aware of the correct SELinux parameters during container
+creation and will re-use them for the OCI object mounts.
 
-The kubelet will use the `mountpoint` on container creation
-(by calling the `CreateContainer` RPC) to indicate the additional required volume mount ([`ContainerConfig.Mount`](https://github.com/kubernetes/cri-api/blob/3a66d9d/pkg/apis/runtime/v1/api.proto#L1102))
-from the runtime. The runtime needs to ensure that mount and also manages its
-lifecycle, for example to remove the bind mount on container removal.
+The kubelet will use the returned `PullImageResponse.image_ref` on pull and sets
+it to `Mount.image.image` together with the other fields for `Mount.image`. The
+runtime will then mount the OCI object directly on container creation assuming
+it's already present on disk. The runtime also manages the lifecycle of the
+mount, for example to remove the OCI bind mount on container removal as well as
+the object mount on the `RemoveImage` RPC.
 
 The kubelet tracks the information about which OCI object is used by which
-sandbox and therefore manages the lifecycle of them.
-
-The proposal also considers smaller CRI changes, for example to add a list of
-mounted volume paths to the `ImageStatusResponse.Image` message returned by the
-`ImageStatus` RPC. This allows providing the right amount of information between
-the kubelet and the runtime to ensure that no context gets lost in restart
-scenarios.
+sandbox and therefore manages the lifecycle of them for garbage collection
+purposes.
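
The kubelet-side flow described in the added lines (pull, take `image_ref` from the response, place it in `Mount.image.image`, leave `host_path` unset) could look roughly like the sketch below. It is illustrative only: the structs are trimmed-down, hypothetical stand-ins for the CRI messages rather than the generated `cri-api` types, and `puller` and `imageVolumeMount` are made-up names.

```go
package main

import "fmt"

// Hypothetical, trimmed-down stand-ins for the CRI messages named in the
// diff. The real generated types carry more fields.
type ImageSpec struct {
	Image string
}

type PullImageResponse struct {
	ImageRef string
}

type Mount struct {
	ContainerPath string
	HostPath      string // stays empty for image volume mounts
	Readonly      bool
	Image         *ImageSpec
}

// puller stands in for the PullImage RPC.
type puller func(reference string) (PullImageResponse, error)

// imageVolumeMount pulls the referenced OCI object and builds the mount for
// the container creation request from the returned image_ref.
func imageVolumeMount(pull puller, reference, containerPath string) (*Mount, error) {
	resp, err := pull(reference)
	if err != nil {
		return nil, fmt.Errorf("pull %q: %w", reference, err)
	}
	if resp.ImageRef == "" {
		return nil, fmt.Errorf("pull %q: runtime returned an empty image_ref", reference)
	}
	return &Mount{
		ContainerPath: containerPath,
		Readonly:      true, // OCI mounts are read-only per feature definition
		Image:         &ImageSpec{Image: resp.ImageRef},
	}, nil
}

func main() {
	// fakePull stands in for a runtime; the returned reference is made up.
	fakePull := func(ref string) (PullImageResponse, error) {
		return PullImageResponse{ImageRef: "sha256:example"}, nil
	}
	m, err := imageVolumeMount(fakePull, "example.com/my-image:latest", "/volume")
	if err != nil {
		panic(err)
	}
	fmt.Printf("container_path=%s image=%s readonly=%t\n", m.ContainerPath, m.Image.Image, m.Readonly)
}
```

Passing the pull step in as a function keeps the sketch independent of any real CRI client; in the proposal itself the runtime, not the kubelet, performs the actual mount during `CreateContainer`.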
 
 The overall flow for container creation will look like this:
 
@@ -554,32 +546,30 @@ sequenceDiagram
     Note left of K: During pod sync
     Note over K,C: CRI
     K->>+C: RPC: PullImage
-    Note right of C: Pull and mount<br/>OCI object
-    C-->>-K: PullImageResponse.Mountpoint
+    Note right of C: Pull OCI object
+    C-->>-K: PullImageResponse.image_ref
     Note left of K: Add mount points<br/> to container<br/>creation request
     K->>+C: RPC: CreateContainer
-    Note right of C: Add bind mounts<br/>from object mount<br/>point to container
+    Note right of C: Mount OCI object
+    Note right of C: Add OCI bind mounts<br/>from OCI object<br/>to container
     C-->>-K: CreateContainerResponse
 ```
 
 1. **Kubelet Initiates Image Pull**:
    - During pod setup, the kubelet initiates the pull for the OCI object based on the volume source.
-   - The kubelet passes the necessary indicator to mount the object to the container runtime.
 
 2. **Runtime Handles Mounting**:
-   - The container runtime mounts the OCI object as a filesystem using the metadata provided by the kubelet.
-   - The runtime returns the mount point information to the kubelet.
+   - The runtime returns the image reference information to the kubelet.
 
 3. **Redirecting of the Mountpoint**:
-   - The kubelet uses the returned mount point to build the container creation request for each container using that mount.
-   - The kubelet initiates the container creation and the runtime creates the required bind mounts to the target location.
+   - The kubelet uses the returned image reference to build the container creation request for each container using that mount.
+   - The kubelet initiates the container creation and the runtime creates the required OCI object mount as well as bind mounts to the target location.
      This is the current implemented behavior for all other mounts and should require no actual container runtime code change.
 
 4. **Lifecycle Management**:
    - The container runtime manages the lifecycle of the mounts, ensuring they are created during pod setup and cleaned up upon sandbox removal.
 
 5. **Tracking and Coordination**:
-   - The kubelet and runtime coordinate to track pods requesting mounts to avoid removing containers with volumes in use.
    - During image garbage collection, the runtime provides the kubelet with the necessary mount information to ensure proper cleanup.
 
 6. **SELinux Context Handling**:
@@ -597,19 +587,17 @@ sequenceDiagram
 
 #### Container Runtimes
 
-Container runtimes need to support the new `mount` field, otherwise the
-feature cannot be used. The kubelet will verify if the returned `mountpoint`
-actually exists on disk to check the feature availability, because Protobuf will
-strip the field in a backwards compatible way for older runtimes. Pods using the
-new `VolumeSource` combined with a not supported container runtime version will
-fail to run on the node.
+Container runtimes need to support the new `Mount.image` field, otherwise the
+feature cannot be used. Pods using the new `VolumeSource` combined with a not
+supported container runtime version will fail to run on the node, because the
+`Mount.host_path` field is not set for those mounts.
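
To illustrate why an unsupported runtime fails here, the following sketch checks that a mount names exactly one source. It is a hypothetical illustration, not code from any container runtime: `validateMountSource` and both error values are invented names, and the structs are trimmed stand-ins for the CRI messages. The rule that `host_path` should be unset when `image` is set comes from the `Mount` comment in this diff; the assumption that an older runtime simply never sees the `image` field is modeled by leaving it nil.

```go
package main

import (
	"errors"
	"fmt"
)

// Trimmed, hypothetical stand-ins for the CRI messages.
type ImageSpec struct {
	Image string
}

type Mount struct {
	ContainerPath string
	HostPath      string
	Image         *ImageSpec
}

// Hypothetical error values for the two invalid combinations.
var (
	errBothSet  = errors.New("host_path must be unset when image is set")
	errNoSource = errors.New("mount has neither host_path nor image")
)

// validateMountSource accepts a mount only if exactly one of host_path and
// image identifies its source.
func validateMountSource(m Mount) error {
	hasImage := m.Image != nil && m.Image.Image != ""
	switch {
	case hasImage && m.HostPath != "":
		return errBothSet
	case !hasImage && m.HostPath == "":
		return errNoSource
	}
	return nil
}

func main() {
	// What a runtime that understands Mount.image receives.
	supported := Mount{ContainerPath: "/volume", Image: &ImageSpec{Image: "sha256:example"}}
	// What an older runtime effectively sees: the unknown field is gone and
	// host_path was never set.
	legacyView := Mount{ContainerPath: "/volume"}

	fmt.Println("supported:", validateMountSource(supported))
	fmt.Println("legacy view:", validateMountSource(legacyView))
}
```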
 
 For security reasons, volume mounts should set the [`noexec`] and `ro`
 (read-only) options by default.
 
 ##### Filesystem representation
 
-Container Runtimes are expected to return a `mountpoint`, which is a single
+Container Runtimes are expected to manage a `mountpoint`, which is a single
 directory containing the unpacked (in case of tarballs) and merged layer files
 from the image or artifact. If an OCI artifact has multiple layers (in the same
 way as for container images), then the runtime is expected to merge them
@@ -716,41 +704,6 @@ oras manifest fetch localhost:5000/image:v1 | jq .
 }
 ```
 
-The container runtime can now pull the artifact with the `mount = true` CRI
-field set, for example using an experimental [`crictl pull --mount` flag](https://github.com/kubernetes-sigs/cri-tools/compare/master...saschagrunert:oci-volumesource-poc):
-
-```bash
-sudo crictl pull --mount localhost:5000/image:v1
-```
-
-```console
-Image is up to date for localhost:5000/image@sha256:7728cb2fa5dc31ad8a1d05d4e4259d37c3fc72e1fbdc0e1555901687e34324e9
-Image mounted to: /var/lib/containers/storage/overlay/7ee9a1dcea9f152b10590871e55e485b249cd42ea912111ff9f99ab663c1001a/merged
-```
-
-And the returned `mountpoint` contains the unpacked layers as directory tree:
-
-```bash
-sudo tree /var/lib/containers/storage/overlay/7ee9a1dcea9f152b10590871e55e485b249cd42ea912111ff9f99ab663c1001a/merged
-```
-
-```console
-/var/lib/containers/storage/overlay/7ee9a1dcea9f152b10590871e55e485b249cd42ea912111ff9f99ab663c1001a/merged
-├── dir
-│   └── file
-└── file
-
-2 directories, 2 files
-```
-
-```console
-$ sudo cat /var/lib/containers/storage/overlay/7ee9a1dcea9f152b10590871e55e485b249cd42ea912111ff9f99ab663c1001a/merged/dir/file
-layer0
-
-$ sudo cat /var/lib/containers/storage/overlay/7ee9a1dcea9f152b10590871e55e485b249cd42ea912111ff9f99ab663c1001a/merged/file
-layer1
-```
-
 ORAS (and other tools) are also able to push multiple files or directories
 within a single layer. This should be supported by container runtimes in the
 same way.
@@ -759,17 +712,7 @@ same way.
 
 Traditionally, the container runtime is responsible of applying SELinux labels
 to volume mounts, which are inherited from the `securityContext` of the pod or
-container. Relabeling volume mounts can be time-consuming, especially when there
-are many files on the volume.
-
-If the following criteria are met, then the kubelet will use the `mount_label`
-field in the CRI to apply the right SELinux label to the mount.
-
-- The operating system must support SELinux
-- The Pod must have at least `seLinuxOptions.level` assigned in the
-  `PodSecurityContext` or all volume using containers must have it set in their
-  `SecurityContexts`. Kubernetes will read the default user, role and type from
-  the operating system defaults (typically `system_u`, `system_r` and `container_t`).
+container on container creation. The same will apply to OCI volume mounts.
 
 ### Test Plan
 
@@ -987,7 +930,7 @@ well as the [existing list] of feature gates.
 -->
 
 - [x] Feature gate (also fill in values in `kep.yaml`)
-  - Feature gate name: OCIVolume
+  - Feature gate name: ImageVolume
   - Components depending on the feature gate:
     - kube-apiserver (API validation)
     - kubelet (volume mount)
@@ -1329,7 +1272,7 @@ Currently, a shared volume approach can be used. This involves packaging file to
 An init container can be used to copy files from an image to a shared volume using shell commands. This volume can be made accessible to all
 containers in the pod.
 
-An OCI VolumeSource eliminates the need for a shell and an init container by allowing the direct mounting of OCI images as volumes,
+An OCI VolumeSource eliminates the need for a shell and an init container by allowing the direct mounting of OCI objects as volumes,
 making it easier to modularize. For example, in the case of LLMs and model-servers, it is useful to package them in separate images,
 so various models can plug into the same model-server image. An OCI VolumeSource not only simplifies file copying but also allows
 container native distribution, authentication, and version control for files.

keps/sig-node/4639-oci-volume-source/kep.yaml

Lines changed: 1 addition & 1 deletion
@@ -57,7 +57,7 @@ milestone:
 # The following PRR answers are required at alpha release
 # List the feature gate name and the components for which it must be enabled
 feature-gates:
-  - name: OCIVolume
+  - name: ImageVolume
     components:
       - kube-apiserver
       - kubelet
