| 1 | +// Module included in the following assemblies: |
| 2 | +// |
| 3 | +// * networking/hardware_networks/using-dpdk-and-rdma.adoc |
| 4 | + |
| 5 | +:_content-type: PROCEDURE |
| 6 | +[id="nw-running-dpdk-rootless-tap_{context}"] |
| 7 | += Using the TAP CNI to run a rootless DPDK workload with kernel access |
| 8 | + |
| 9 | +DPDK applications can use `virtio-user` as an exception path to inject certain types of packets, such as log messages, into the kernel for processing. For more information about this feature, see link:https://doc.dpdk.org/guides/howto/virtio_user_as_exception_path.html[Virtio_user as Exception Path]. |
| 10 | + |
| 11 | +In OpenShift Container Platform version 4.14 and later, you can use non-privileged pods to run DPDK applications alongside the tap CNI plugin. To enable this functionality, you need to mount the `vhost-net` device by setting the `needVhostNet` parameter to `true` within the `SriovNetworkNodePolicy` object. |
| 12 | + |
| 13 | +.DPDK and TAP example configuration |
| 14 | +image::348_OpenShift_rootless_DPDK_0923.png[DPDK and TAP plugin] |
| 15 | + |
| 16 | +.Prerequisites |
| 17 | + |
| 18 | +* You have installed the OpenShift CLI (`oc`). |
| 19 | +* You have installed the SR-IOV Network Operator. |
| 20 | +* You are logged in as a user with `cluster-admin` privileges. |
| 21 | +* You have set the `container_use_devices` SELinux boolean to `on` on all nodes, for example by running `setsebool container_use_devices=on` as root. |
| 22 | ++ |
| 23 | +[NOTE] |
| 24 | +==== |
| 25 | +Use the Machine Config Operator to set this SELinux boolean. |
| 26 | +==== |
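| | + |
| | +The following `MachineConfig` object is a minimal sketch of one way to set this boolean with the Machine Config Operator. It assumes that the target nodes belong to the `worker` machine config pool. The object name `99-worker-setsebool` and the unit name `setsebool.service` are illustrative placeholders, not required values. |
| | + |
| | +[source,yaml] |
| | +---- |
| | +apiVersion: machineconfiguration.openshift.io/v1 |
| | +kind: MachineConfig |
| | +metadata: |
| | +  labels: |
| | +    machineconfiguration.openshift.io/role: worker # assumption: apply to the worker pool |
| | +  name: 99-worker-setsebool # assumption: illustrative name |
| | +spec: |
| | +  config: |
| | +    ignition: |
| | +      version: 3.2.0 |
| | +    systemd: |
| | +      units: |
| | +      - enabled: true |
| | +        name: setsebool.service # assumption: illustrative unit name |
| | +        contents: \| |
| | +          [Unit] |
| | +          Description=Set the container_use_devices SELinux boolean |
| | +          Before=kubelet.service |
| | + |
| | +          [Service] |
| | +          Type=oneshot |
| | +          ExecStart=/usr/sbin/setsebool container_use_devices=on |
| | +          RemainAfterExit=true |
| | + |
| | +          [Install] |
| | +          WantedBy=multi-user.target |
| | +---- |
| | + |
| | +Applying a machine config typically triggers a rolling reboot of the nodes in the affected pool, so plan for that before you apply it. |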
| 27 | + |
| 28 | +.Procedure |
| 29 | + |
| 30 | +. Create a file, such as `test-namespace.yaml`, with content like the following example: |
| 31 | ++ |
| 32 | +[source,yaml] |
| 33 | +---- |
| 34 | +apiVersion: v1 |
| 35 | +kind: Namespace |
| 36 | +metadata: |
| 37 | +  name: test-namespace |
| 38 | +  labels: |
| 39 | +    pod-security.kubernetes.io/enforce: privileged |
| 40 | +    pod-security.kubernetes.io/audit: privileged |
| 41 | +    pod-security.kubernetes.io/warn: privileged |
| 42 | +    security.openshift.io/scc.podSecurityLabelSync: "false" |
| 43 | +---- |
| 44 | + |
| 45 | +. Create the new `Namespace` object by running the following command: |
| 46 | ++ |
| 47 | +[source,terminal] |
| 48 | +---- |
| 49 | +$ oc apply -f test-namespace.yaml |
| 50 | +---- |
| 51 | + |
| 52 | +. Create a file, such as `sriov-node-network-policy.yaml`, with content like the following example: |
| 53 | ++ |
| 54 | +[source,yaml] |
| 55 | +---- |
| 56 | +apiVersion: sriovnetwork.openshift.io/v1 |
| 57 | +kind: SriovNetworkNodePolicy |
| 58 | +metadata: |
| 59 | +  name: sriovnic |
| 60 | +  namespace: openshift-sriov-network-operator |
| 61 | +spec: |
| 62 | +  deviceType: netdevice <1> |
| 63 | +  isRdma: true <2> |
| 64 | +  needVhostNet: true <3> |
| 65 | +  nicSelector: |
| 66 | +    vendor: "15b3" <4> |
| 67 | +    deviceID: "101b" <5> |
| 68 | +    rootDevices: ["00:05.0"] |
| 69 | +  numVfs: 10 |
| 70 | +  priority: 99 |
| 71 | +  resourceName: sriovnic |
| 72 | +  nodeSelector: |
| 73 | +    feature.node.kubernetes.io/network-sriov.capable: "true" |
| 74 | +---- |
| 75 | +<1> This indicates that the profile is tailored specifically for Mellanox Network Interface Controllers (NICs). |
| 76 | +<2> Setting `isRdma` to `true` is only required for a Mellanox NIC. |
| 77 | +<3> This mounts the `/dev/net/tun` and `/dev/vhost-net` devices into the container so the application can create a tap device and connect the tap device to the DPDK workload. |
| 78 | +<4> The vendor hexadecimal code of the SR-IOV network device. The value 15b3 is associated with a Mellanox NIC. |
| 79 | +<5> The device hexadecimal code of the SR-IOV network device. |
| 80 | + |
| 81 | +. Create the `SriovNetworkNodePolicy` object by running the following command: |
| 82 | ++ |
| 83 | +[source,terminal] |
| 84 | +---- |
| 85 | +$ oc create -f sriov-node-network-policy.yaml |
| 86 | +---- |
| 87 | + |
| 88 | +. Create the following `SriovNetwork` object, and then save the YAML in the `sriov-network-attachment.yaml` file: |
| 89 | ++ |
| 90 | +[source,yaml] |
| 91 | +---- |
| 92 | +apiVersion: sriovnetwork.openshift.io/v1 |
| 93 | +kind: SriovNetwork |
| 94 | +metadata: |
| 95 | +  name: sriov-network |
| 96 | +  namespace: openshift-sriov-network-operator |
| 97 | +spec: |
| 98 | +  networkNamespace: test-namespace |
| 99 | +  resourceName: sriovnic |
| 100 | +  spoofChk: "off" |
| 101 | +  trust: "on" |
| 102 | +---- |
| 103 | ++ |
| 104 | +[NOTE] |
| 105 | +===== |
| 106 | +See the "Configuring SR-IOV additional network" section for a detailed explanation of each option in `SriovNetwork`. |
| 107 | +===== |
| 108 | ++ |
| 109 | +An optional library, `app-netutil`, provides several API methods for gathering network information about a container's parent pod. |
| 110 | + |
| 111 | +. Create the `SriovNetwork` object by running the following command: |
| 112 | ++ |
| 113 | +[source,terminal] |
| 114 | +---- |
| 115 | +$ oc create -f sriov-network-attachment.yaml |
| 116 | +---- |
| 117 | + |
| 118 | +. Create a file, such as `tap-example.yaml`, that defines a network attachment definition, with content like the following example: |
| 119 | ++ |
| 120 | +[source,yaml] |
| 121 | +---- |
| 122 | +apiVersion: "k8s.cni.cncf.io/v1" |
| 123 | +kind: NetworkAttachmentDefinition |
| 124 | +metadata: |
| 125 | +  name: tap-one |
| 126 | +  namespace: test-namespace <1> |
| 127 | +spec: |
| 128 | +  config: '{ |
| 129 | +      "cniVersion": "0.4.0", |
| 130 | +      "name": "tap", |
| 131 | +      "plugins": [ |
| 132 | +        { |
| 133 | +          "type": "tap", |
| 134 | +          "multiQueue": true, |
| 135 | +          "selinuxcontext": "system_u:system_r:container_t:s0" |
| 136 | +        }, |
| 137 | +        { |
| 138 | +          "type": "tuning", |
| 139 | +          "capabilities": { |
| 140 | +            "mac": true |
| 141 | +          } |
| 142 | +        } |
| 143 | +      ] |
| 144 | +    }' |
| 145 | +---- |
| 146 | +<1> Specify the namespace that is set in the `networkNamespace` field of the `SriovNetwork` object, which is `test-namespace` in this example. |
| 147 | + |
| 148 | +. Create the `NetworkAttachmentDefinition` object by running the following command: |
| 149 | ++ |
| 150 | +[source,terminal] |
| 151 | +---- |
| 152 | +$ oc apply -f tap-example.yaml |
| 153 | +---- |
| 154 | + |
| 155 | +. Create a file, such as `dpdk-pod-rootless.yaml`, with content like the following example: |
| 156 | ++ |
| 157 | +[source,yaml] |
| 158 | +---- |
| 159 | +apiVersion: v1 |
| 160 | +kind: Pod |
| 161 | +metadata: |
| 162 | +  name: dpdk-app |
| 163 | +  namespace: test-namespace <1> |
| 164 | +  annotations: |
| 165 | +    k8s.v1.cni.cncf.io/networks: '[ |
| 166 | +      {"name": "sriov-network", "namespace": "test-namespace"}, |
| 167 | +      {"name": "tap-one", "interface": "ext0", "namespace": "test-namespace"}]' |
| 168 | +spec: |
| 169 | +  nodeSelector: |
| 170 | +    kubernetes.io/hostname: "worker-0" |
| 171 | +  securityContext: |
| 172 | +    fsGroup: 1001 <2> |
| 173 | +    runAsGroup: 1001 <3> |
| 174 | +    seccompProfile: |
| 175 | +      type: RuntimeDefault |
| 176 | +  containers: |
| 177 | +  - name: testpmd |
| 178 | +    image: <DPDK_image> <4> |
| 179 | +    securityContext: |
| 180 | +      capabilities: |
| 181 | +        drop: ["ALL"] <5> |
| 182 | +        add: <6> |
| 183 | +          - IPC_LOCK |
| 184 | +          - NET_RAW #for mlx only <7> |
| 185 | +      runAsUser: 1001 <8> |
| 186 | +      privileged: false <9> |
| 187 | +      allowPrivilegeEscalation: true <10> |
| 188 | +      runAsNonRoot: true <11> |
| 189 | +    volumeMounts: |
| 190 | +    - mountPath: /mnt/huge <12> |
| 191 | +      name: hugepages |
| 192 | +    resources: |
| 193 | +      limits: |
| 194 | +        openshift.io/sriovnic: "1" <13> |
| 195 | +        memory: "1Gi" |
| 196 | +        cpu: "4" <14> |
| 197 | +        hugepages-1Gi: "4Gi" <15> |
| 198 | +      requests: |
| 199 | +        openshift.io/sriovnic: "1" |
| 200 | +        memory: "1Gi" |
| 201 | +        cpu: "4" |
| 202 | +        hugepages-1Gi: "4Gi" |
| 203 | +    command: ["sleep", "infinity"] |
| 204 | +  runtimeClassName: performance-cnf-performanceprofile <16> |
| 205 | +  volumes: |
| 206 | +  - name: hugepages |
| 207 | +    emptyDir: |
| 208 | +      medium: HugePages |
| 209 | +---- |
| 210 | ++ |
| 211 | +-- |
| 212 | +<1> Specify the namespace that is set in the `networkNamespace` field of the `SriovNetwork` object, which is `test-namespace` in this example. To create the pod in a different namespace, change the namespace in both the `Pod` spec and the `networkNamespace` field of the `SriovNetwork` object. |
| 213 | +<2> Sets the group ownership of volume-mounted directories and files created in those volumes. |
| 214 | +<3> Specify the primary group ID used for running the container. |
| 215 | +<4> Specify the DPDK image that contains your application and the DPDK library used by the application. |
| 216 | +<5> Dropping all capabilities (`ALL`) in the container's `securityContext` removes every Linux capability from the container, so that it holds only the capabilities listed under `add`. |
| 217 | +<6> Specify additional capabilities required by the application inside the container for hugepage allocation, system resource allocation, and network interface access. These capabilities must also be set in the binary file by using the `setcap` command. |
| 218 | +<7> A Mellanox network interface controller (NIC) requires the `NET_RAW` capability. |
| 219 | +<8> Specify the user ID used for running the container. |
| 220 | +<9> This setting indicates that the container or containers within the pod should not be granted privileged access to the host system. |
| 221 | +<10> This setting allows a container to escalate its privileges beyond the initial non-root privileges it might have been assigned. |
| 222 | +<11> This setting ensures that the container runs with a non-root user. This helps enforce the principle of least privilege, limiting the potential impact of compromising the container and reducing the attack surface. |
| 223 | +<12> Mount a hugepage volume to the DPDK pod under `/mnt/huge`. The hugepage volume is backed by the `emptyDir` volume type with the medium set to `HugePages`. |
| 224 | +<13> Optional: Specify the number of DPDK devices allocated for the DPDK pod. If not explicitly specified, this resource request and limit is automatically added by the SR-IOV network resource injector. The SR-IOV network resource injector is an admission controller component managed by the SR-IOV Network Operator. It is enabled by default and can be disabled by setting the `enableInjector` option to `false` in the default `SriovOperatorConfig` CR. |
| 225 | +<14> Specify the number of CPUs. The DPDK pod usually requires exclusive CPUs to be allocated from the kubelet. This is achieved by setting CPU Manager policy to `static` and creating a pod with `Guaranteed` QoS. |
| 226 | +<15> Specify the hugepage size, `hugepages-1Gi` or `hugepages-2Mi`, and the quantity of hugepages to allocate to the DPDK pod. Configure `2Mi` and `1Gi` hugepages separately. Configuring `1Gi` hugepages requires adding kernel arguments to nodes. For example, adding the kernel arguments `default_hugepagesz=1GB`, `hugepagesz=1G`, and `hugepages=16` results in `16*1Gi` hugepages being allocated during system boot. |
| 227 | +<16> If your performance profile is not named `cnf-performanceprofile`, replace that string with the correct performance profile name. |
| 228 | +-- |
| 229 | ++ |
| 230 | +. Create the DPDK pod by running the following command: |
| 231 | ++ |
| 232 | +[source,terminal] |
| 233 | +---- |
| 234 | +$ oc create -f dpdk-pod-rootless.yaml |
| 235 | +---- |
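| | + |
| | +The pod example relies on a performance profile that provides exclusive CPUs, `1Gi` hugepages, and the `performance-cnf-performanceprofile` runtime class. The following `PerformanceProfile` object is a minimal sketch of such a profile, not a tuned configuration. The CPU ranges and the node selector are placeholder assumptions that you must adapt to your hardware and machine config pools. |
| | + |
| | +[source,yaml] |
| | +---- |
| | +apiVersion: performance.openshift.io/v2 |
| | +kind: PerformanceProfile |
| | +metadata: |
| | +  name: cnf-performanceprofile # the runtime class is named performance-<profile_name> |
| | +spec: |
| | +  cpu: |
| | +    isolated: "4-15" # assumption: CPUs available to workloads such as the DPDK pod |
| | +    reserved: "0-3" # assumption: CPUs kept for system and cluster housekeeping |
| | +  hugepages: |
| | +    defaultHugepagesSize: 1G |
| | +    pages: |
| | +    - size: 1G |
| | +      count: 16 # matches the 16 x 1Gi example in the hugepages callout |
| | +  nodeSelector: |
| | +    node-role.kubernetes.io/worker: "" # assumption: target the worker nodes |
| | +---- |
| | + |
| | +With a profile like this, the pod's equal CPU and memory requests and limits give it the `Guaranteed` QoS class, which is what allows the kubelet to assign it exclusive CPUs. |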