| 1 | +--- |
| 2 | +name: terway-troubleshooting |
| 3 | +description: Troubleshoot Terway CNI issues in Kubernetes using Kubernetes events and Terway logs. Use when diagnosing "cni plugin not initialized", Pod create/delete failures, or ENI/IPAM problems in Terway (centralized or non-centralized IPAM). |
| 4 | +--- |
| 5 | + |
| 6 | +# Terway Troubleshooting SOP |
| 7 | + |
| 8 | +## When to use this Skill |
| 9 | + |
| 10 | +Use this Skill whenever the user: |
| 11 | + |
| 12 | +- Reports **"cni plugin not initialized"** or similar CNI errors on nodes |
| 13 | +- Reports **Pod creation or deletion failures** in a cluster using Terway as the CNI |
| 14 | +- Suspects **ENI/IPAM/resource issues** related to Terway (centralized or non-centralized) |
| 15 | + |
| 16 | +Always assume the cluster is running Kubernetes and Terway is the CNI plugin. |
| 17 | + |
| 18 | +## High-level troubleshooting flow |
| 19 | + |
| 20 | +Follow this order unless the user has already done some steps: |
| 21 | + |
| 22 | +1. **Gather cluster-level configuration first** |
| 23 | + - Run the cluster configuration inspection script to understand the environment: |
| 24 | + |
| 25 | + ```bash |
| 26 | + ./scripts/inspect-terway-cluster.sh |
| 27 | + ``` |
| 28 | + |
| 29 | + - This provides Terway version, IPAM type (centralized vs non-centralized), service CIDR, kube-proxy mode, and key Terway feature flags. |
| 30 | + - Use this information to guide the rest of the troubleshooting flow. |
| 31 | + |
| 32 | +2. **Check Terway components health** |
| 33 | + - Verify that the Terway DaemonSet Pod is created and running on the affected node. |
| 34 | + - If using centralized IPAM (identified in step 1), also verify the Terway controlplane Pod. |
| 35 | + |
| 36 | +3. **Inspect the problematic Pod and Node configuration** |
| 37 | + - Once you've identified the problematic Pod and its Node: |
| 38 | + - Run `./scripts/inspect-terway-pod.sh <namespace> <pod-name>` to check Pod-level config (hostNetwork, pod-eni, annotation-based config source). |
| 39 | + - Run `./scripts/inspect-terway-node.sh <node-name>` to check Node-level config (ENI mode, dynamic config, LingJun status, ignore-by-terway, no-kube-proxy). |
| 40 | + |
| 41 | +4. **Use Kubernetes Events as the primary signal** |
| 42 | + - For any problematic Pod, inspect its Events first. |
| 43 | + - Map Terway-specific event reasons to likely causes and next checks. |
| 44 | + |
| 45 | +5. **Inspect Terway IPAM / ENI controllers** |
| 46 | + - Depending on centralized vs non-centralized IPAM (from step 1), check relevant CRDs and their Events. |
| 47 | + |
| 48 | +6. **Only then, inspect logs** |
| 49 | + - Use Terway daemon and controlplane logs to deepen analysis when Events are missing or unclear. |
| 50 | + |
| 51 | +Keep answers structured: first restate what has been checked, then propose next verification steps. |
| 52 | + |
| 53 | +## Step 1 – Terway and CNI initialization |
| 54 | + |
| 55 | +1. **If the user reports "cni plugin not initialized" or similar:** |
| 56 | + - Do **not** immediately blame Terway IPAM logic. |
| 57 | + - First ensure Terway Pods (daemon and, if present, controlplane) are **created, scheduled, and running** on the node. |
| 58 | + - If the Terway Pod is missing: |
| 59 | + - Ask the user to check mutating/validating webhooks, the container runtime, and the CNI configuration (kubelet CNI directories, etc.). |
| 60 | + - If the Terway Pod is in CrashLoopBackOff: |
| 61 | + - Ask for the Pod description and logs, and help debug that before moving on to Pod-level network issues. |
| 62 | + |
| 63 | +2. **Only after Terway is confirmed running on the node**, proceed to Pod create/delete failures and Events. |
| 64 | + |
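The health check in Step 1 can be sketched as a small helper. This is an illustration only: the `kube-system` namespace and the `app=terway-eniip` label are assumptions that the SOP does not state, and `terway_pods_on_node` is a hypothetical name. Adjust them to match the cluster.

```bash
# Hypothetical helper: list Terway daemon Pods scheduled on one node.
# Assumptions (not stated in the SOP): Terway runs in kube-system and its
# Pods carry the label app=terway-eniip.
KUBECTL="${KUBECTL:-kubectl}"

terway_pods_on_node() {
  local node="$1"
  # spec.nodeName is a supported field selector for Pods.
  "$KUBECTL" get pods -n kube-system -l app=terway-eniip \
    --field-selector "spec.nodeName=${node}" -o wide
}
```

Calling `terway_pods_on_node <node-name>` should show one Pod in `Running` state. An empty result corresponds to the "Terway Pod is missing" branch above, and a high restart count to the CrashLoop branch.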
| 65 | +## Step 2 – Always start from Kubernetes Events |
| 66 | + |
| 67 | +For any Pod with network-related failures: |
| 68 | + |
| 69 | +1. **Inspect Pod Events** |
| 70 | + - Instruct the user to run `kubectl describe pod <pod> -n <ns>` and paste relevant Events. |
| 71 | + - Focus on Terway-related reasons (case-sensitive): |
| 72 | + - `AllocIPFailed` (Warning, Pod) |
| 73 | + - `AllocIPSucceed` (Normal, Pod) |
| 74 | + - `VirtualModeChanged` (Warning, Pod) |
| 75 | + - `CniPodCreateError` (Warning, Pod) |
| 76 | + - `CniPodDeleteError` (Warning, Pod) |
| 77 | + - `CniCreateENIError` (Warning, Pod) |
| 78 | + - `CniPodENIDeleteErr` (Warning, Pod) |
| 79 | + |
| 80 | +2. **Interpret common Pod event reasons** |
| 81 | + - **`AllocIPFailed` (Warning, Pod)** |
| 82 | + - CNI ADD reached the Terway backend, but IP allocation failed. |
| 83 | + - Likely causes: |
| 84 | + - ENI quota exhausted (`ErrEniPerInstanceLimitExceeded`). |
| 85 | + - VSwitch IP exhaustion (`InvalidVSwitchID.IPNotEnough`, `QuotaExceeded.PrivateIPAddress`). |
| 86 | + - OpenAPI permission or configuration errors. |
| 87 | + - Next checks: |
| 88 | + - Events on the Kubernetes Node and, with centralized IPAM, on the Node CR. |
| 89 | + - Terway daemon logs around the same time. |
| 90 | + - **`AllocIPSucceed` (Normal, Pod)** |
| 91 | + - IP allocation succeeded; if the Pod still fails, the issue is likely **after** IP allocation (datapath setup, routes, iptables, etc.). |
| 92 | + - **`VirtualModeChanged` (Warning, Pod)** |
| 93 | + - The IPvlan datapath is unavailable, so Terway falls back to veth. |
| 94 | + - Usually not fatal but indicates kernel or capability problems on the node. |
| 95 | + - **`CniPodCreateError` (Warning, Pod)** |
| 96 | + - Emitted by the controlplane Pod controller when the Pod create path fails (annotation parsing, PodENI/PodNetworking, vswitch selection, etc.). |
| 97 | + - Ask for the full event message; it usually contains the specific error string. |
| 98 | + - **`CniPodDeleteError` (Warning, Pod)** |
| 99 | + - Failure in Pod delete cleanup (PodENI/ENI status or detach). Investigate PodENI and Node CR status. |
| 100 | + - **`CniCreateENIError` / `CniPodENIDeleteErr` (Warning, Pod)** |
| 101 | + - Emitted by the PodENI controller when ENI creation/deletion for the Pod fails. Use PodENI CR Events for more details. |
| 102 | + |
| 103 | +3. **If no Terway-specific Events are present** |
| 104 | + - Confirm that the Pod is scheduled to a node where Terway is running. |
| 105 | + - Then move to node-level and CRD-level Events. |
| 106 | + |
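The reason-to-next-check mapping in Step 2 can be expressed as a lookup. The sketch below uses only the reasons and follow-ups listed above; `terway_reason_hint` is a hypothetical helper name, not part of Terway.

```bash
# Map a Terway Pod event reason to the next check suggested in Step 2.
# Reasons are case-sensitive, as noted in the SOP.
terway_reason_hint() {
  case "$1" in
    AllocIPFailed)
      echo "check Node and Node CR Events, then Terway daemon logs" ;;
    AllocIPSucceed)
      echo "IP allocation succeeded; check datapath setup, routes, iptables" ;;
    VirtualModeChanged)
      echo "IPvlan unavailable, fell back to veth; check node kernel and capabilities" ;;
    CniPodCreateError)
      echo "read the full event message for the specific error string" ;;
    CniPodDeleteError)
      echo "inspect PodENI and Node CR status" ;;
    CniCreateENIError|CniPodENIDeleteErr)
      echo "inspect PodENI CR Events" ;;
    *)
      echo "no Terway Pod reason; confirm Terway runs on the node, then check node-level and CRD-level Events" ;;
  esac
}

# Usage (placeholders in angle brackets):
#   kubectl get events -n <ns> \
#     --field-selector involvedObject.kind=Pod,involvedObject.name=<pod> \
#     -o jsonpath='{range .items[*]}{.reason}{"\n"}{end}' | sort -u |
#     while read -r reason; do
#       printf '%s: %s\n' "$reason" "$(terway_reason_hint "$reason")"
#     done
```

A `case` table keeps the mapping in one place, so new reasons can be added without touching the calling loop.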
| 107 | +## Step 3 – Node and Node CR Events |
| 108 | + |
| 109 | +Distinguish between: |
| 110 | + |
| 111 | +- **Kubernetes Node object** (`corev1.Node`). |
| 112 | +- **Terway Node CRD** (`network.alibabacloud.com/v1beta1 Node`) used in centralized IPAM. |
| 113 | + |
| 114 | +1. **On the Kubernetes Node (`corev1.Node`)** |
| 115 | + - Important Terway-related event reasons: |
| 116 | + - `AllocIPFailed` (Warning, Node) |
| 117 | + - From local IPAM; indicates ENI/IP issues at node level. |
| 118 | + - `ConfigError` (Warning, Node) |
| 119 | + - From Terway node controllers when `eni-config` or node capabilities are invalid. |
| 120 | + - Use these to distinguish misconfiguration from resource exhaustion. |
| 121 | + |
| 122 | +2. **On the Terway Node CRD (centralized IPAM)** |
| 123 | + - When centralized IPAM is enabled, a `Node` CR under `network.alibabacloud.com` exists. |
| 124 | + - Terway emits events on this CR for ENI lifecycle and pool operations, using reasons defined in `types/k8s.go`, such as: |
| 125 | + - `CreateENISucceed` / `CreateENIFailed` |
| 126 | + - `AttachENISucceed` / `AttachENIFailed` |
| 127 | + - `DetachENISucceed` / `DetachENIFailed` |
| 128 | + - `DeleteENISucceed` / `DeleteENIFailed` |
| 129 | + - Use these events to answer questions like: |
| 130 | + - Is the IP pool being warmed correctly? |
| 131 | + - Are new ENIs failing to create because of OpenAPI errors or configuration? |
| 132 | + |
| 133 | +3. **Link Node events to Pod failures** |
| 134 | + - If Pods report `AllocIPFailed` or `CniPodCreateError`, check whether the corresponding Node / Node CR shows ENI/IPAM failures. |
| 135 | + - Use that correlation to explain whether the problem is a capacity limit, a configuration error, or a bug. |
| 136 | + |
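Because both objects in Step 3 are named `Node`, it helps to address them explicitly. A minimal sketch with hypothetical function names follows. It assumes that plain `node` resolves to the core Kubernetes resource on the cluster, and it uses the fully qualified name `nodes.network.alibabacloud.com` for the Terway CR.

```bash
KUBECTL="${KUBECTL:-kubectl}"

# Describe the core Kubernetes Node; Events are listed at the end of the output.
# Assumption: "node" resolves to the core resource, not the Terway CRD.
describe_k8s_node() {
  "$KUBECTL" describe node "$1"
}

# Describe the Terway Node CR (centralized IPAM only), using the fully
# qualified resource name so it cannot be confused with the core Node.
describe_terway_node_cr() {
  "$KUBECTL" describe nodes.network.alibabacloud.com "$1"
}
```

Running both for the same node name and comparing the Event sections side by side makes it easier to correlate `AllocIPFailed` / `ConfigError` on the Kubernetes Node with ENI lifecycle reasons on the Node CR.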
| 137 | +## Step 4 – Centralized vs non-centralized IPAM behavior |
| 138 | + |
| 139 | +When reasoning about Terway behavior, always clarify which IPAM mode is in use. |
| 140 | + |
| 141 | +1. **Detect mode from context** |
| 142 | + - Centralized IPAM indicators: |
| 143 | + - Presence of Terway controlplane deployment. |
| 144 | + - CRDs like `podenis.network.alibabacloud.com`, `nodes.network.alibabacloud.com`, `podnetworkings.network.alibabacloud.com`. |
| 145 | + - Helm/config flag `centralizedIPAM: true` or controlplane config with `CentralizedIPAM` set. |
| 146 | + - Non-centralized/local IPAM indicators: |
| 147 | + - IPAM type in `eni-config` is `default`. |
| 148 | + - Node-local IPAM logic in the daemon is responsible for ENI/IP management. |
| 149 | + |
| 150 | +2. **If centralized IPAM** |
| 151 | + - In addition to Pod and Node events, always consider: |
| 152 | + - **PodENI CR** (per-pod ENI and IP state): events like `CreateENIFailed`, `AttachENIFailed`, `UpdatePodENIFailed`. |
| 153 | + - **Node CR**: ENI pool and warmup behavior. |
| 154 | + - **PodNetworking CR**: Events `SyncPodNetworkingSucceed/Failed` when syncing vswitch lists. |
| 155 | + - For Pod failures: |
| 156 | + - Check Pod Events (Cni* reasons) → PodENI Events → Node CR Events → controlplane logs. |
| 157 | + |
| 158 | +3. **If non-centralized IPAM** |
| 159 | + - Focus on: |
| 160 | + - Node Events (`AllocIPFailed`, `ConfigError`). |
| 161 | + - `eni-config` ConfigMap correctness (vswitches, security groups, ip_stack, trunk/erdma flags, etc.). |
| 162 | + - Terway daemon logs on the affected node. |
| 163 | + |
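Mode detection from `eni-config` can be sketched as a filter over the config JSON. This is an assumption-laden illustration: it presumes the JSON carries an `ipam_type` field on a single line, and that the ConfigMap lives in `kube-system` under the data key `eni_conf`. `terway_ipam_mode` is a hypothetical name.

```bash
# Classify the IPAM mode from Terway's JSON config read on stdin.
# Assumption: the config has an "ipam_type" field; per the SOP, "crd" means
# centralized and "default" means non-centralized. Anything else is unknown.
terway_ipam_mode() {
  local ipam_type
  ipam_type="$(grep -o '"ipam_type"[[:space:]]*:[[:space:]]*"[^"]*"' \
    | head -n 1 | sed 's/.*"\([^"]*\)"$/\1/')"
  case "$ipam_type" in
    crd)     echo "centralized" ;;
    default) echo "non-centralized" ;;
    *)       echo "unknown" ;;
  esac
}

# Usage (namespace and data key are assumptions):
#   kubectl get configmap eni-config -n kube-system \
#     -o jsonpath='{.data.eni_conf}' | terway_ipam_mode
```

An `unknown` result is a cue to fall back to the other indicators above, such as the presence of the controlplane deployment or the listed CRDs.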
| 164 | +## Step 5 – Using logs only when Events are insufficient |
| 165 | + |
| 166 | +1. **When to move to logs** |
| 167 | + - Events point to a failure but not the exact cause (e.g., only `AllocIPFailed` without OpenAPI error details). |
| 168 | + - There are **no** Terway Events on the relevant Pod/Node/CR, but the behavior clearly involves Terway. |
| 169 | + |
| 170 | +2. **Which logs to inspect** |
| 171 | + - **Terway daemon logs** on the affected node: |
| 172 | + - Look for: |
| 173 | + - The Pod name / namespace. |
| 174 | + - OpenAPI errors (quota, IP shortage, permission issues). |
| 175 | + - Internal errors in ENI/route/datapath setup. |
| 176 | + - **Terway controlplane logs** (centralized IPAM): |
| 177 | + - Look for: |
| 178 | + - Errors in Pod controller, PodENI controller, Node controller. |
| 179 | + - PodNetworking sync failures. |
| 180 | + |
| 181 | +3. **How to combine logs with Events** |
| 182 | + - Use Event timestamps and reasons as an index into the logs. |
| 183 | + - Explain to the user: |
| 184 | + - Which event indicates the failure. |
| 185 | + - Which log line confirms the root cause. |
| 186 | + |
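Using an Event timestamp as an index into the logs, as Step 5 describes, might look like the sketch below. The `kube-system` namespace is an assumption and `terway_logs_for_pod` is a hypothetical helper. `--all-containers` is used so that no container name has to be guessed.

```bash
KUBECTL="${KUBECTL:-kubectl}"

# Print logs of one Terway Pod starting at an RFC 3339 timestamp taken from
# the Event, keeping only lines that mention the affected Pod's name.
# Arguments: <terway-pod> <since-timestamp> <affected-pod-name>
terway_logs_for_pod() {
  local terway_pod="$1" since="$2" target_pod="$3"
  "$KUBECTL" logs -n kube-system "$terway_pod" --all-containers \
    --since-time="$since" | grep -F -- "$target_pod"
}
```

Picking a timestamp slightly before the Event's first occurrence helps capture the OpenAPI call or datapath error that preceded it.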
| 187 | +## Utility scripts |
| 188 | + |
| 189 | +### Cluster-level configuration |
| 190 | + |
| 191 | +Before starting troubleshooting, gather cluster-wide Terway configuration: |
| 192 | + |
| 193 | +```bash |
| 194 | +./scripts/inspect-terway-cluster.sh |
| 195 | +``` |
| 196 | + |
| 197 | +This script inspects: |
| 198 | + |
| 199 | +- **Terway version** from the `terway-eniip` DaemonSet image tag |
| 200 | +- **Service CIDR** and **IP stack** from `ack-cluster-profile` ConfigMap |
| 201 | +- **Kube-proxy mode** (iptables/ipvs) and **cluster CIDR** from `kube-proxy-worker` ConfigMap |
| 202 | +- **IPAM type** (`crd` for centralized, `default` for non-centralized) from `eni-config` ConfigMap |
| 203 | +- **Terway feature flags**: `enable_eni_trunking`, `enable_erdma`, `vswitch_selection_policy`, `max_pool_size`, `min_pool_size`, etc. |
| 204 | + |
| 205 | +Use this information to determine whether centralized IPAM is enabled and which Terway features are active. This guides the rest of the troubleshooting flow. |
| 206 | + |
| 207 | +### Node-level configuration |
| 208 | + |
| 209 | +To inspect Terway-related node configuration for a problematic Pod, first identify the Pod's node (for example via `kubectl get pod <pod-name> -n <namespace> -o wide`). Then, from the repository root, run: |
| 210 | + |
| 211 | +```bash |
| 212 | +./scripts/inspect-terway-node.sh <node-name> |
| 213 | +``` |
| 214 | + |
| 215 | +This prints ENI mode (shared vs exclusive), node-level dynamic config (`terway-config`), LingJun node flags, `k8s.aliyun.com/ignore-by-terway` and `k8s.aliyun.com/no-kube-proxy` labels, and the ENO API type from the `nodes.network.alibabacloud.com` CR. Use this information as input to the troubleshooting steps above when you have located the Pod's node. |
| 216 | + |
| 217 | +### Pod-level configuration |
| 218 | + |
| 219 | +To inspect Terway-related Pod configuration, run: |
| 220 | + |
| 221 | +```bash |
| 222 | +./scripts/inspect-terway-pod.sh <namespace> <pod-name> |
| 223 | +``` |
| 224 | + |
| 225 | +This checks: |
| 226 | + |
| 227 | +- Whether the Pod uses `hostNetwork` (if true, Terway CNI does not process it). |
| 228 | +- Whether the Pod has `k8s.aliyun.com/pod-eni: "true"` annotation (indicating trunk/exclusive ENI mode). |
| 229 | +- Which annotation-based config source is used, following the webhook priority order: |
| 230 | + 1. `k8s.aliyun.com/pod-networks` (explicit pod-networks config) |
| 231 | + 2. `k8s.aliyun.com/pod-networks-request` (pod-networks-request config) |
| 232 | + 3. `k8s.aliyun.com/pod-networking` (matched PodNetworking resource) |
| 233 | + 4. Fallback to `eni-config` default on eth0 if none of the above are set. |
| 234 | + |
| 235 | +Use this to determine if the Pod should be managed by Terway, whether it uses PodENI, and which configuration source drives its ENI/IP allocation. |
| 236 | + |
| 237 | +## Response style guidelines |
| 238 | + |
| 239 | +When this Skill is active: |
| 240 | + |
| 241 | +- **Always start from Events** when diagnosing Pod or node-level issues; do not jump straight into logs unless Events are missing. |
| 242 | +- **Reference concrete Terway event reasons** (e.g., `AllocIPFailed`, `CniPodCreateError`, `CreateENIFailed`) and explain what they mean. |
| 243 | +- **Ask for specific artifacts** when needed: |
| 244 | + - `kubectl describe pod` output for the problematic Pod. |
| 245 | + - Node and Node CR describe output when centralized IPAM is used. |
| 246 | + - Excerpts from Terway daemon/controlplane logs around the relevant time. |
| 247 | +- Keep answers structured and concise, but be explicit about next steps (what to inspect next and why). |
| 248 | +- Clearly distinguish between **configuration issues**, **resource exhaustion/quota**, and **potential Terway bugs** based on Events and logs. |