# Guide to Using Network QoS

## Contents

1. [Overview](#1-overview)
2. [Create a Secondary Network (NAD)](#2-create-a-secondary-network)
3. [Define a NetworkQoS Policy](#3-define-a-networkqos-policy)
4. [Create Sample Pods and Verify the Configuration](#4-create-sample-pods-and-verify-the-configuration)
5. [Explain the NetworkQoS Object](#5-explain-the-networkqos-object)
## **1 Overview**

Differentiated Services Code Point (DSCP) marking and egress bandwidth metering let you prioritize or police specific traffic flows. The new **NetworkQoS** Custom Resource Definition (CRD) in [ovn-kubernetes](https://github.com/ovn-kubernetes/ovn-kubernetes/blob/master/dist/templates/k8s.ovn.org_networkqoses.yaml.j2) makes both features available to Kubernetes users on **all** pod interfaces—primary or secondary—without touching pod manifests.

This guide provides a step-by-step example of how to use this feature. Before you begin, ensure that you have a Kubernetes cluster configured with the ovn-kubernetes CNI. Since the examples use network attachments, you must run the cluster with multiple-network support enabled. In a kind cluster, you would use the following flags:

```bash
cd contrib
./kind-helm.sh -nqe -mne ; # --enable-network-qos --enable-multi-network
```

## **2 Create a Secondary Network**

File: nad.yaml

```yaml
apiVersion: k8s.cni.cncf.io/v1
kind: NetworkAttachmentDefinition
metadata:
  name: ovn-stream
  namespace: default
  labels: # label needed for NetworkQoS selector
    nad-type: ovn-kubernetes-nqos
spec:
  config: |2
    {
      "cniVersion": "1.0.0",
      "name": "ovn-stream",
      "type": "ovn-k8s-cni-overlay",
      "topology": "layer3",
      "subnets": "10.245.0.0/16/24",
      "mtu": 1300,
      "master": "eth1",
      "netAttachDefName": "default/ovn-stream"
    }
```

*Why the label?* `NetworkQoS` uses a label selector to find matching NADs. Without at least one label, the selector cannot match.

## **3 Define a NetworkQoS Policy**

File: nqos.yaml

```yaml
apiVersion: k8s.ovn.org/v1alpha1
kind: NetworkQoS
metadata:
  name: qos-external
  namespace: default
spec:
  networkSelectors:
  - networkSelectionType: NetworkAttachmentDefinitions
    networkAttachmentDefinitionSelector:
      namespaceSelector: {} # any namespace
      networkSelector:
        matchLabels:
          nad-type: ovn-kubernetes-nqos
  podSelector:
    matchLabels:
      nqos-app: bw-limited
  priority: 10 # higher value wins in a tie-break
  egress:
  - dscp: 20
    bandwidth:
      burst: 100 # kilobits
      rate: 20000 # kbps
    classifier:
      to:
      - ipBlock:
          cidr: 0.0.0.0/0
          except:
          - 10.11.12.13/32
          - 172.16.0.0/12
          - 192.168.0.0/16
```

A full CRD template lives [here](https://github.com/ovn-kubernetes/ovn-kubernetes/blob/master/dist/templates/k8s.ovn.org_networkqoses.yaml.j2).

The `egress` field is a list, allowing you to define multiple markings and bandwidth limits, each with its own classifier.

Note that this policy applies only to traffic on the NADs matched by the network selector, and only for pods that carry the label `nqos-app: bw-limited`.
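
For instance, a single policy could mark latency-sensitive traffic to one destination range while policing everything else. The fragment below is only an illustrative sketch: the DSCP values and CIDRs are hypothetical, not part of the example above.

```yaml
  egress:
  - dscp: 46              # e.g. a high-priority class for one subnet
    classifier:
      to:
      - ipBlock:
          cidr: 10.50.0.0/16
  - dscp: 8               # bulk class, policed
    bandwidth:
      rate: 5000          # kbps
      burst: 50           # kilobits
    classifier:
      to:
      - ipBlock:
          cidr: 0.0.0.0/0
```

Rules are evaluated per classifier, so a packet that matches the first `to` block gets DSCP 46 unpoliced, while the catch-all second rule both marks and rate-limits the rest.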

```bash
$ kubectl create -f nad.yaml && \
  kubectl create -f nqos.yaml

networkattachmentdefinition.k8s.cni.cncf.io/ovn-stream created
networkqos.k8s.ovn.org/qos-external created
```
At this point, you can verify that the CRD is registered and that the policy status reports success:

```bash
$ kubectl api-resources -owide | head -1 ; \
  kubectl api-resources -owide | grep NetworkQoS
NAME           SHORTNAMES   APIVERSION             NAMESPACED   KIND         VERBS                                                        CATEGORIES
networkqoses                k8s.ovn.org/v1alpha1   true         NetworkQoS   delete,deletecollection,get,list,patch,create,update,watch

$ kubectl get networkqoses qos-external -n default -owide
NAME           STATUS
qos-external   NetworkQoS Destinations applied
```

## **4 Create Sample Pods and Verify the Configuration**

### **4.1 Launch Test Pods**

To test this, let's create pods using a helper function that lets us add labels to them.

File: create_pod.source

```bash
create_pod() {
  local pod_name=${1:-pod0}
  local node_name=${2:-ovn-worker}
  local extra_labels=${3:-}

  NAMESPACE=$(kubectl config view --minify --output 'jsonpath={..namespace}')
  NAMESPACE=${NAMESPACE:-default}

  if ! kubectl get pod "$pod_name" -n "$NAMESPACE" &>/dev/null; then
    echo "Creating pod $pod_name in namespace $NAMESPACE..."

    # Prepare labels block
    labels_block="    name: $pod_name"
    if [[ -n "$extra_labels" ]]; then
      # Convert JSON string to YAML-compatible lines
      while IFS="=" read -r k v; do
        labels_block+="
    $k: $v"
      done < <(echo "$extra_labels" | jq -r 'to_entries|map("\(.key)=\(.value)")|.[]')
    fi

    # Generate the manifest
    cat <<EOF | kubectl apply -n "$NAMESPACE" -f -
apiVersion: v1
kind: Pod
metadata:
  name: $pod_name
  labels:
$labels_block
  annotations:
    k8s.v1.cni.cncf.io/networks: ovn-stream@eth1
spec:
  nodeSelector:
    kubernetes.io/hostname: $node_name
  containers:
  - name: $pod_name
    image: ghcr.io/nicolaka/netshoot:v0.13
    command: ["/bin/ash", "-c", "trap : TERM INT; sleep infinity & wait"]
EOF
  else
    echo "Pod $pod_name already exists."
  fi
}
```

```bash
$ create_pod pod0 && \
  create_pod pod1 ovn-worker '{"nqos-app":"bw-limited"}' && \
  create_pod pod2 ovn-worker2 '{"foo":"bar","nqos-app":"bw-limited"}' && \
  echo pods created

extract_pod_ip_from_annotation() {
  local pod_name="$1"
  local namespace="${2:-default}"
  local interface="${3:-eth1}"

  kubectl get pod "$pod_name" -n "$namespace" -o json |
    jq -r '.metadata.annotations["k8s.v1.cni.cncf.io/network-status"]' |
    jq -r --arg iface "$interface" '.[] | select(.interface == $iface) | .ips[0]'
}
```

```bash
NAMESPACE=$(kubectl config view --minify --output 'jsonpath={..namespace}') ; NAMESPACE=${NAMESPACE:-default}
DST_IP_POD0=$(extract_pod_ip_from_annotation pod0 $NAMESPACE eth1)
DST_IP_POD1=$(extract_pod_ip_from_annotation pod1 $NAMESPACE eth1)
DST_IP_POD2=$(extract_pod_ip_from_annotation pod2 $NAMESPACE eth1)

# Let's see the NAD IP addresses of the pods created
$ echo pod0 has ip $DST_IP_POD0 ; \
  echo pod1 has ip $DST_IP_POD1 ; \
  echo pod2 has ip $DST_IP_POD2

pod0 has ip 10.245.4.4
pod1 has ip 10.245.4.3
pod2 has ip 10.245.2.3
```

### **4.2 Checking Bandwidth**

`qos-external` limits **only** traffic on pods that carry `nqos-app=bw-limited`. That means:

* **pod1 → pod0**: *unlimited* (no matching label)
* **pod1 → pod2**: *rate-limited* to ≈ 20 Mbit/s
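
As a sanity check on the units in the policy above: `rate: 20000` is expressed in kbps, so the expected ceiling is 20 Mbit/s, and the 100-kilobit `burst` drains in a few milliseconds at that rate. Plain shell arithmetic confirms both:

```shell
# rate is specified in kbps: 20000 kbps = 20 Mbit/s
echo "$((20000 / 1000)) Mbit/s"      # prints: 20 Mbit/s

# a 100-kilobit burst at 20000 kbit/s lasts 100/20000 s = 5 ms
echo "$((100 * 1000 / 20000)) ms"    # prints: 5 ms
```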

Follow these steps to verify it with `iperf3`.

```bash
# 1) Start an iperf server inside pod0 and pod2 (runs forever in the background)
kubectl -n default exec pod0 -- iperf3 -s -p 5201 &
kubectl -n default exec pod2 -- iperf3 -s -p 5201 &

# 2) From pod1 → pod0 (EXPECTED ≈ line rate)
kubectl -n default exec pod1 -- iperf3 -c "$DST_IP_POD0" -p 5201 -R -t 10

# 3) From pod1 → pod2 (EXPECTED ≈ 20 Mbit/s)
kubectl -n default exec pod1 -- iperf3 -c "$DST_IP_POD2" -p 5201 -R -t 10
```

Sample output:

```
# to pod0 (unlimited)
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.00  sec  37.2 GBytes  31.9 Gbits/sec  607   sender
[  5]   0.00-10.00  sec  37.2 GBytes  31.9 Gbits/sec        receiver

# to pod2 (rate-limited)
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.00  sec  20.8 MBytes  17.4 Mbits/sec  4056  sender
[  5]   0.00-10.00  sec  20.8 MBytes  17.4 Mbits/sec        receiver
```

The sharp drop confirms that `NetworkQoS` is enforcing the **20 Mbit/s** rate limit only for pods matching the selector.

### **4.3 Packet Capture**

Generate ICMP traffic and observe DSCP markings in the Geneve outer headers using `tcpdump -envvi eth0 geneve` inside the worker node's network namespace. Only flows involving label-matched pods (those with `nqos-app=bw-limited`) will show `tos 0x50` (DSCP 20).
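
The `tos 0x50` value follows directly from the policy's `dscp: 20`: the DSCP occupies the upper six bits of the IP TOS/Traffic Class byte, so the TOS value tcpdump prints is the DSCP shifted left by two bits:

```shell
# DSCP sits in bits 7..2 of the TOS byte: tos = dscp << 2
printf 'dscp 20 -> tos 0x%x\n' $((20 << 2))   # prints: dscp 20 -> tos 0x50
```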

```bash
# Run ping commands in the background, so we can look at packets they generate

# pod0 to pod2
nohup kubectl exec -i pod0 -- ping -c 3600 -q $DST_IP_POD2 >/dev/null 2>&1 &
# pod1 to pod2
nohup kubectl exec -i pod1 -- ping -c 3600 -q $DST_IP_POD2 >/dev/null 2>&1 &

sudo dnf install -y --quiet tcpdump ; # Install tcpdump, if needed

IPNS=$(docker inspect --format '{{ .State.Pid }}' ovn-worker)
sudo nsenter -t ${IPNS} -n tcpdump -envvi eth0 geneve
```

```
tcpdump: listening on eth0, link-type EN10MB (Ethernet), capture size 262144 bytes
```

**Pod0 to Pod2**: Notice that since pod0 does not have the label to match against NetworkQoS, its TOS is 0. However, pod2's response is DSCP-marked (tos 0x50), since pod2 matches the NetworkQoS criteria with the label `nqos-app: bw-limited`.

```
12:46:30.755551 02:42:ac:12:00:06 > 02:42:ac:12:00:05, ethertype IPv4 (0x0800), length 156: (tos 0x0, ttl 64, id 26896, offset 0, flags [DF], proto UDP (17), length 142)
    172.18.0.6.38210 > 172.18.0.5.geneve: [bad udp cksum 0x58bb -> 0xc87d!] Geneve, Flags [C], vni 0x12, proto TEB (0x6558), options [class Open Virtual Networking (OVN) (0x102) type 0x80(C) len 8 data 00090006]
    0a:58:0a:f5:02:01 > 0a:58:0a:f5:02:03, ethertype IPv4 (0x0800), length 98: (tos 0x0, ttl 63, id 61037, offset 0, flags [DF], proto ICMP (1), length 84)
    10.245.4.4 > 10.245.2.3: ICMP echo request, id 14, seq 44, length 64

12:46:30.755694 02:42:ac:12:00:05 > 02:42:ac:12:00:06, ethertype IPv4 (0x0800), length 156: (tos 0x50, ttl 64, id 46220, offset 0, flags [DF], proto UDP (17), length 142)
    172.18.0.5.38210 > 172.18.0.6.geneve: [bad udp cksum 0x58bb -> 0xc47d!] Geneve, Flags [C], vni 0x12, proto TEB (0x6558), options [class Open Virtual Networking (OVN) (0x102) type 0x80(C) len 8 data 0004000a]
    0a:58:0a:f5:04:01 > 0a:58:0a:f5:04:04, ethertype IPv4 (0x0800), length 98: (tos 0x50, ttl 63, id 45002, offset 0, flags [none], proto ICMP (1), length 84)
    10.245.2.3 > 10.245.4.4: ICMP echo reply, id 14, seq 44, length 64
```

**Pod1 to Pod2**: Traffic is marked both ways (both pods have the matching label).

```
12:46:30.497289 02:42:ac:12:00:06 > 02:42:ac:12:00:05, ethertype IPv4 (0x0800), length 156: (tos 0x50, ttl 64, id 26752, offset 0, flags [DF], proto UDP (17), length 142)
    172.18.0.6.7856 > 172.18.0.5.geneve: [bad udp cksum 0x58bb -> 0x3f10!] Geneve, Flags [C], vni 0x12, proto TEB (0x6558), options [class Open Virtual Networking (OVN) (0x102) type 0x80(C) len 8 data 00090006]
    0a:58:0a:f5:02:01 > 0a:58:0a:f5:02:03, ethertype IPv4 (0x0800), length 98: (tos 0x50, ttl 63, id 21760, offset 0, flags [DF], proto ICMP (1), length 84)
    10.245.4.3 > 10.245.2.3: ICMP echo request, id 14, seq 56, length 64

12:46:30.497381 02:42:ac:12:00:05 > 02:42:ac:12:00:06, ethertype IPv4 (0x0800), length 156: (tos 0x50, ttl 64, id 46019, offset 0, flags [DF], proto UDP (17), length 142)
    172.18.0.5.7856 > 172.18.0.6.geneve: [bad udp cksum 0x58bb -> 0x3b11!] Geneve, Flags [C], vni 0x12, proto TEB (0x6558), options [class Open Virtual Networking (OVN) (0x102) type 0x80(C) len 8 data 0004000a]
    0a:58:0a:f5:04:01 > 0a:58:0a:f5:04:03, ethertype IPv4 (0x0800), length 98: (tos 0x50, ttl 63, id 3850, offset 0, flags [none], proto ICMP (1), length 84)
    10.245.2.3 > 10.245.4.3: ICMP echo reply, id 14, seq 56, length 64
```

## **5 Explain the NetworkQoS Object**

Below is an *abbreviated* map of the CRD schema returned by `kubectl explain networkqos --recursive` (v1alpha1). Use this as a quick reference. For the definitive specification, always consult the `kubectl explain` output or the CRD YAML in the ovn-kubernetes repository.

### **5.1 Top-level `spec` keys**

| Field | Type | Required | Purpose |
| ----- | ----- | ----- | ----- |
| **podSelector** | `LabelSelector` | No | Selects pods whose traffic will be evaluated by the QoS rules. If empty, all pods in the namespace are selected. |
| **networkSelectors[]** | list `NetworkSelector` | No | Restricts the rule to traffic on specific networks. If absent, the rule matches any interface. *(See §5.2)* |
| **priority** | `int` | **Yes** | Higher number → chosen first when multiple `NetworkQoS` objects match the same packet. |
| **egress[]** | list `EgressRule` | **Yes** | One or more marking / policing rules. Evaluated in the order listed. *(See §5.3)* |

Note the square-bracket notation (`[]`) for **both** `egress` and `networkSelectors`—each is an array in the CRD.

---

### **5.2 Inside a `networkSelectors[]` entry**

Each list element tells the controller **where** the pods' egress traffic must flow in order to apply the rule. Exactly **one** selector type must be set.

| Key | Required | Description |
| :---- | :---- | :---- |
| `networkSelectionType` | **Yes** | Enum that declares which selector below is populated. Common values: `NetworkAttachmentDefinitions`, `DefaultNetwork`, `SecondaryUserDefinedNetworks`, … |
| `networkAttachmentDefinitionSelector` | conditional | Used when `networkSelectionType=NetworkAttachmentDefinitions`. Selects NADs by **namespaceSelector** (required) *and* **networkSelector** (required). Both are ordinary `LabelSelectors`. |
| `secondaryUserDefinedNetworkSelector` | conditional | Used when `networkSelectionType=SecondaryUserDefinedNetworks`. Similar structure: required **namespaceSelector** & **networkSelector**. |
| `clusterUserDefinedNetworkSelector`, `primaryUserDefinedNetworkSelector` | conditional | Additional selector styles, each with required sub-selectors as per the CRD. |

**Typical usage** – `networkSelectionType: NetworkAttachmentDefinitions` + `networkAttachmentDefinitionSelector`.
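
As a sketch of an alternative, selecting the cluster default network instead of a NAD should require only the enum value, since `DefaultNetwork` carries no sub-selector. This is an assumption based on the table above; check the `kubectl explain` output for the authoritative schema.

```yaml
  networkSelectors:
  - networkSelectionType: DefaultNetwork
```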

---

### **5.3 Inside an `egress[]` rule**

| Field | Type | Required | Description |
| :---- | :---- | :---- | :---- |
| `dscp` | `int` (0 – 63) | **Yes** | DSCP value to stamp on the **inner** IP header. This value determines the traffic priority. |
| `bandwidth.rate` | `int` (kbps) | No | Sustained rate for the token-bucket policer (in kilobits per second). |
| `bandwidth.burst` | `int` (kilobits) | No | Maximum burst size that can accrue (in kilobits). |
| `classifier.to` / `classifier.from` | list `TrafficSelector` | No | CIDRs the packet destination (or source) must match. Each entry is an `ipBlock` supporting an `except` list. |
| `classifier.ports[]` | list | No | List of `{protocol, port}` tuples the packet must match; protocol is `TCP`, `UDP`, or `SCTP`. |

If **all** specified classifier conditions match, the packet gets the DSCP mark and/or bandwidth policer defined above. This allows for fine-grained control over which traffic flows receive QoS treatment.
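
For instance, combining `to` with `ports`, a rule like the following sketch would mark only TCP traffic to port 443 within the given range. The DSCP value and CIDR are illustrative, not part of the earlier example.

```yaml
  egress:
  - dscp: 46
    classifier:
      to:
      - ipBlock:
          cidr: 10.60.0.0/16
      ports:
      - protocol: TCP
        port: 443
```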