Commit 6cbdbde

Merge pull request #965 from l1b0k/feat/skill
feat: add inspection scripts for Terway cluster, node, and pod config…
2 parents e3d10ff + 87c54ba commit 6cbdbde

4 files changed

+708
-0
lines changed
Lines changed: 248 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,248 @@
1+
---
2+
name: terway-troubleshooting
3+
description: Troubleshoot Terway CNI issues in Kubernetes using Kubernetes events and Terway logs. Use when diagnosing "cni plugin not initialized", Pod create/delete failures, or ENI/IPAM problems in Terway (centralized or non-centralized IPAM).
4+
---
5+
6+
# Terway Troubleshooting SOP
7+
8+
## When to use this Skill
9+
10+
Use this Skill whenever the user:
11+
12+
- Reports **"cni plugin not initialized"** or similar CNI errors on nodes
13+
- Reports **Pod creation or deletion failures** in a cluster using Terway as the CNI
14+
- Suspects **ENI/IPAM/resource issues** related to Terway (centralized or non-centralized)
15+
16+
Always assume the cluster is running Kubernetes and Terway is the CNI plugin.
17+
18+
## High-level troubleshooting flow
19+
20+
Follow this order unless the user has already done some steps:
21+
22+
1. **Gather cluster-level configuration first**
23+
- Run the cluster configuration inspection script to understand the environment:
24+
25+
```bash
26+
./scripts/inspect-terway-cluster.sh
27+
```
28+
29+
- This provides Terway version, IPAM type (centralized vs non-centralized), service CIDR, kube-proxy mode, and key Terway feature flags.
30+
- Use this information to guide the rest of the troubleshooting flow.
31+
32+
2. **Check Terway components health**
33+
- Verify that the Terway DaemonSet Pod is created and running on the affected node.
34+
- If using centralized IPAM (identified in step 1), also verify the Terway controlplane Pod.
35+
36+
3. **Inspect the problematic Pod and Node configuration**
37+
- Once you've identified the problematic Pod and its Node:
38+
- Run `./scripts/inspect-terway-pod.sh <namespace> <pod-name>` to check Pod-level config (hostNetwork, pod-eni, annotation-based config source).
39+
- Run `./scripts/inspect-terway-node.sh <node-name>` to check Node-level config (ENI mode, dynamic config, LingJun status, ignore-by-terway, no-kube-proxy).
40+
41+
4. **Use Kubernetes Events as the primary signal**
42+
- For any problematic Pod, inspect its Events first.
43+
- Map Terway-specific event reasons to likely causes and next checks.
44+
45+
5. **Inspect Terway IPAM / ENI controllers**
46+
- Depending on centralized vs non-centralized IPAM (from step 1), check relevant CRDs and their Events.
47+
48+
6. **Only then, inspect logs**
49+
- Use Terway daemon and controlplane logs to deepen analysis when Events are missing or unclear.
50+
51+
Keep answers structured: first restate what has been checked, then propose next verification steps.
52+
53+
## Step 1 – Terway and CNI initialization
54+
55+
1. **If the user reports "cni plugin not initialized" or similar:**
56+
- Do **not** immediately blame Terway IPAM logic.
57+
- First ensure Terway Pods (daemon and, if present, controlplane) are **created, scheduled, and running** on the node.
58+
- If Terway Pod is missing:
59+
- Ask the user to check mutating/validating webhooks, the container runtime, and the CNI configuration (for example, the kubelet CNI directories).
60+
- If Terway Pod is CrashLooping:
61+
- Ask for the Pod description/logs and help debug that before going to Pod-level network issues.
62+
63+
2. **Only after Terway is confirmed running on the node**, proceed to Pod create/delete failures and Events.
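
The checks in this step can be sketched as a small shell helper. This is a minimal sketch, not one of the repository's scripts: the `kube-system` namespace and the `app=terway-eniip` label selector are assumptions (this document only names the `terway-eniip` DaemonSet), and both function names are hypothetical.

```bash
# Assumptions: Terway runs in kube-system and its daemon Pods carry the
# label app=terway-eniip. Override both variables to match the cluster.
TERWAY_NS="${TERWAY_NS:-kube-system}"
TERWAY_SELECTOR="${TERWAY_SELECTOR:-app=terway-eniip}"

# List the Terway daemon Pod(s) scheduled on one node (requires kubectl).
terway_pods_on_node() {
  kubectl -n "$TERWAY_NS" get pods -l "$TERWAY_SELECTOR" \
    --field-selector "spec.nodeName=$1" -o wide
}

# Map the STATUS column of that Pod (empty if no Pod was listed)
# to the next action described in this step.
next_action_for_status() {
  case "$1" in
    Running) echo "proceed to Pod events" ;;
    "") echo "Terway Pod missing: check webhooks, runtime, CNI config" ;;
    CrashLoopBackOff) echo "debug the Terway Pod first: describe and logs" ;;
    *) echo "describe the Terway Pod before continuing" ;;
  esac
}
```

For example, run `terway_pods_on_node <node-name>`, read the STATUS column, and pass it to `next_action_for_status`.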
64+
65+
## Step 2 – Always start from Kubernetes Events
66+
67+
For any Pod with network-related failures:
68+
69+
1. **Inspect Pod Events**
70+
- Instruct the user to run `kubectl describe pod <pod> -n <ns>` and paste relevant Events.
71+
- Focus on Terway-related reasons (case-sensitive):
72+
- `AllocIPFailed` (Warning, Pod)
73+
- `AllocIPSucceed` (Normal, Pod)
74+
- `VirtualModeChanged` (Warning, Pod)
75+
- `CniPodCreateError` (Warning, Pod)
76+
- `CniPodDeleteError` (Warning, Pod)
77+
- `CniCreateENIError` (Warning, Pod)
78+
- `CniPodENIDeleteErr` (Warning, Pod)
79+
80+
2. **Interpret common Pod event reasons**
81+
- **`AllocIPFailed` (Warning, Pod)**
82+
- The CNI ADD request reached the Terway backend, but IP allocation failed.
83+
- Likely causes:
84+
- ENI quota exhausted (`ErrEniPerInstanceLimitExceeded`).
85+
- VSwitch IP exhaustion (`InvalidVSwitchID.IPNotEnough`, `QuotaExceeded.PrivateIPAddress`).
86+
- OpenAPI permission or configuration errors.
87+
- Next checks:
88+
- Events on the Node and, if centralized IPAM is used, on the Node CR.
89+
- Terway daemon logs around the same time.
90+
- **`AllocIPSucceed` (Normal, Pod)**
91+
- IP allocation succeeded; if the Pod still fails, the issue is likely **after** IP allocation (datapath setup, routes, iptables, etc.).
92+
- **`VirtualModeChanged` (Warning, Pod)**
93+
- The IPvlan datapath is unavailable, so Terway falls back to veth.
94+
- Usually not fatal but indicates kernel or capability problems on the node.
95+
- **`CniPodCreateError` (Warning, Pod)**
96+
- Emitted by the controlplane Pod controller when the Pod create path fails (annotation parsing, PodENI/PodNetworking, vswitch selection, etc.).
97+
- Ask for the full event message; it usually contains the specific error string.
98+
- **`CniPodDeleteError` (Warning, Pod)**
99+
- Failure in Pod delete cleanup (PodENI/ENI status or detach). Investigate PodENI and Node CR status.
100+
- **`CniCreateENIError` / `CniPodENIDeleteErr` (Warning, Pod)**
101+
- Emitted by the PodENI controller when ENI creation/deletion for the Pod fails. Use PodENI CR Events for more details.
102+
103+
3. **If no Terway-specific Events are present**
104+
- Confirm that the Pod is scheduled to a node where Terway is running.
105+
- Then move to node-level and CRD-level Events.
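
The event triage above can be sketched in shell as follows. This is an illustrative sketch only: the function names `pod_events` and `explain_reason` are hypothetical, and the hints are shortened paraphrases of the interpretations listed in this step, not output produced by Terway.

```bash
# Print the events recorded for one Pod, oldest first (requires kubectl).
# $1 = namespace, $2 = Pod name
pod_events() {
  kubectl -n "$1" get events \
    --field-selector "involvedObject.kind=Pod,involvedObject.name=$2" \
    --sort-by=.lastTimestamp
}

# Map a Terway Pod event reason (case-sensitive) to a short next-step hint.
explain_reason() {
  case "$1" in
    AllocIPFailed) echo "IP allocation failed: check Node/Node CR events and daemon logs" ;;
    AllocIPSucceed) echo "IP allocated: look at datapath setup after allocation" ;;
    VirtualModeChanged) echo "IPvlan unavailable: fell back to veth, check kernel support" ;;
    CniPodCreateError) echo "controlplane create path failed: read the full event message" ;;
    CniPodDeleteError) echo "delete cleanup failed: check PodENI and Node CR status" ;;
    CniCreateENIError|CniPodENIDeleteErr) echo "PodENI controller failure: check PodENI CR events" ;;
    *) echo "not a Terway Pod event reason listed in this step" ;;
  esac
}
```

A typical use is to run `pod_events <ns> <pod>`, pick the reason from the REASON column, and pass it to `explain_reason`.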
106+
107+
## Step 3 – Node and Node CR Events
108+
109+
Distinguish between:
110+
111+
- **Kubernetes Node object** (`corev1.Node`).
112+
- **Terway Node CRD** (`network.alibabacloud.com/v1beta1 Node`) used in centralized IPAM.
113+
114+
1. **On the Kubernetes Node (`corev1.Node`)**
115+
- Important Terway-related event reasons:
116+
- `AllocIPFailed` (Warning, Node)
117+
- From local IPAM; indicates ENI/IP issues at node level.
118+
- `ConfigError` (Warning, Node)
119+
- From Terway node controllers when `eni-config` or node capabilities are invalid.
120+
- Use these to distinguish between misconfiguration vs. resource exhaustion.
121+
122+
2. **On the Terway Node CRD (centralized IPAM)**
123+
- When centralized IPAM is enabled, a `Node` CR under `network.alibabacloud.com` exists.
124+
- Terway emits events on this CR for ENI lifecycle and pool operations, using reasons defined in `types/k8s.go`, such as:
125+
- `CreateENISucceed` / `CreateENIFailed`
126+
- `AttachENISucceed` / `AttachENIFailed`
127+
- `DetachENISucceed` / `DetachENIFailed`
128+
- `DeleteENISucceed` / `DeleteENIFailed`
129+
- Use these events to answer questions like:
130+
- Is the IP pool being warmed correctly?
131+
- Are new ENIs failing to create because of OpenAPI errors or configuration?
132+
133+
3. **Link Node events to Pod failures**
134+
- If Pods report `AllocIPFailed` or `CniPodCreateError`, check whether the corresponding Node / Node CR shows ENI/IPAM failures.
135+
- Use that correlation to explain whether the problem is a capacity limit, a configuration error, or a potential bug.
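
A hedged sketch of how the Node and Node CR checks might be run from the command line. The function names are hypothetical. It assumes that the Terway Node CR has the same name as the Kubernetes Node and can be described without a namespace; verify both in the cluster. Because both objects have kind `Node`, the event selector may return events for either object.

```bash
# Build the field selector for events that involve a given node name.
node_event_selector() {
  echo "involvedObject.kind=Node,involvedObject.name=$1"
}

# Events recorded against objects of kind Node with this name (requires kubectl).
node_events() {
  kubectl get events --all-namespaces \
    --field-selector "$(node_event_selector "$1")"
}

# Describe the Terway Node CR (centralized IPAM only).
# Assumption: the CR is named after the Kubernetes Node.
node_cr_describe() {
  kubectl describe nodes.network.alibabacloud.com "$1"
}

# Classify an ENI lifecycle reason such as CreateENIFailed by its suffix.
eni_event_outcome() {
  case "$1" in
    *Succeed) echo ok ;;
    *Failed) echo failed ;;
    *) echo unknown ;;
  esac
}
```

Correlating the timestamps of `failed` ENI reasons with a Pod's `AllocIPFailed` event is one way to tie a Pod failure back to the node.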
136+
137+
## Step 4 – Centralized vs non-centralized IPAM behavior
138+
139+
When reasoning about Terway behavior, always clarify which IPAM mode is in use.
140+
141+
1. **Detect mode from context**
142+
- Centralized IPAM indicators:
143+
- Presence of Terway controlplane deployment.
144+
- CRDs like `podenis.network.alibabacloud.com`, `nodes.network.alibabacloud.com`, `podnetworkings.network.alibabacloud.com`.
145+
- Helm/config flag `centralizedIPAM: true` or controlplane config with `CentralizedIPAM` set.
146+
- Non-centralized/local IPAM indicators:
147+
- IPAM type in `eni-config` is `default`.
148+
- Node-local IPAM logic in the daemon is responsible for ENI/IP management.
149+
150+
2. **If centralized IPAM**
151+
- In addition to Pod and Node events, always consider:
152+
- **PodENI CR** (per-pod ENI and IP state): events like `CreateENIFailed`, `AttachENIFailed`, `UpdatePodENIFailed`.
153+
- **Node CR**: ENI pool and warmup behavior.
154+
- **PodNetworking CR**: Events `SyncPodNetworkingSucceed/Failed` when syncing vswitch lists.
155+
- For Pod failures:
156+
- Check Pod Events (Cni* reasons) → PodENI Events → Node CR Events → controlplane logs.
157+
158+
3. **If non-centralized IPAM**
159+
- Focus on:
160+
- Node Events (`AllocIPFailed`, `ConfigError`).
161+
- `eni-config` ConfigMap correctness (vswitches, security groups, ip_stack, trunk/erdma flags, etc.).
162+
- Terway daemon logs on the affected node.
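
Mode detection can be sketched as below. The `crd` and `default` values for the IPAM type come from this document. The `kube-system` namespace, the `eni_conf` data key, and the `ipam_type` JSON field name are assumptions about the ConfigMap layout, and the function names are hypothetical.

```bash
# Extract ipam_type from eni-config JSON text and map it to an IPAM mode.
# Uses sed rather than a JSON parser so the sketch has no extra dependencies.
ipam_mode_from_conf() {
  ipam_type=$(printf '%s\n' "$1" |
    sed -n 's/.*"ipam_type"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p' |
    head -n 1)
  case "$ipam_type" in
    crd) echo centralized ;;
    default) echo non-centralized ;;
    *) echo unknown ;;
  esac
}

# Read the ConfigMap and report the mode (requires kubectl).
# Assumptions: namespace kube-system, data key eni_conf.
detect_ipam_mode() {
  conf=$(kubectl -n kube-system get configmap eni-config \
    -o jsonpath='{.data.eni_conf}')
  ipam_mode_from_conf "$conf"
}
```

If the result is `unknown`, fall back to the other indicators listed above, such as the presence of the `podenis.network.alibabacloud.com` CRD.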
163+
164+
## Step 5 – Using logs only when Events are insufficient
165+
166+
1. **When to move to logs**
167+
- Events point to a failure but not the exact cause (e.g., only `AllocIPFailed` without OpenAPI error details).
168+
- There are **no** Terway Events on the relevant Pod/Node/CR, but the behavior clearly involves Terway.
169+
170+
2. **Which logs to inspect**
171+
- **Terway daemon logs** on the affected node:
172+
- Look for:
173+
- The Pod name / namespace.
174+
- OpenAPI errors (quota, IP shortage, permission issues).
175+
- Internal errors in ENI/route/datapath setup.
176+
- **Terway controlplane logs** (centralized IPAM):
177+
- Look for:
178+
- Errors in Pod controller, PodENI controller, Node controller.
179+
- PodNetworking sync failures.
180+
181+
3. **How to combine logs with Events**
182+
- Use Event timestamps and reasons as an index into the logs.
183+
- Explain to the user:
184+
- Which event indicates the failure.
185+
- Which log line confirms the root cause.
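
Using an Event timestamp as an index into the logs might look like the following sketch. The function names are hypothetical, the `kube-system` default namespace is an assumption, and the Terway Pod name must be looked up first.

```bash
# Fetch Terway Pod logs starting at an RFC 3339 timestamp taken from an Event.
# $1 = Terway Pod name, $2 = timestamp, e.g. 2024-01-01T10:00:00Z
terway_logs_since() {
  kubectl -n "${TERWAY_NS:-kube-system}" logs "$1" --all-containers \
    --timestamps --since-time="$2"
}

# Keep only log lines that mention a fixed string, such as the Pod name.
filter_log_lines() {
  grep -F -e "$1"
}
```

For example, `terway_logs_since <terway-pod> <event-timestamp> | filter_log_lines <pod-name>` narrows the daemon log to lines about the failing Pod from the time of the Event onward.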
186+
187+
## Utility scripts
188+
189+
### Cluster-level configuration
190+
191+
Before starting troubleshooting, gather cluster-wide Terway configuration:
192+
193+
```bash
194+
./scripts/inspect-terway-cluster.sh
195+
```
196+
197+
This script inspects:
198+
199+
- **Terway version** from the `terway-eniip` DaemonSet image tag
200+
- **Service CIDR** and **IP stack** from `ack-cluster-profile` ConfigMap
201+
- **Kube-proxy mode** (iptables/ipvs) and **cluster CIDR** from `kube-proxy-worker` ConfigMap
202+
- **IPAM type** (`crd` for centralized, `default` for non-centralized) from `eni-config` ConfigMap
203+
- **Terway feature flags**: `enable_eni_trunking`, `enable_erdma`, `vswitch_selection_policy`, `max_pool_size`, `min_pool_size`, etc.
204+
205+
Use this information to determine whether centralized IPAM is enabled and which Terway features are active. This guides the rest of the troubleshooting flow.
206+
207+
### Node-level configuration
208+
209+
To inspect Terway-related node configuration for a problematic Pod, first identify the Pod's node (for example via `kubectl get pod -o wide`). Then, from the repository root, run:
210+
211+
```bash
212+
./scripts/inspect-terway-node.sh <node-name>
213+
```
214+
215+
This prints ENI mode (shared vs exclusive), node-level dynamic config (`terway-config`), LingJun node flags, `k8s.aliyun.com/ignore-by-terway` and `k8s.aliyun.com/no-kube-proxy` labels, and the ENO API type from the `nodes.network.alibabacloud.com` CR. Use this information as input to the troubleshooting steps above when you have located the Pod's node.
216+
217+
### Pod-level configuration
218+
219+
To inspect Terway-related Pod configuration, run:
220+
221+
```bash
222+
./scripts/inspect-terway-pod.sh <namespace> <pod-name>
223+
```
224+
225+
This checks:
226+
227+
- Whether the Pod uses `hostNetwork` (if true, Terway CNI does not process it).
228+
- Whether the Pod has `k8s.aliyun.com/pod-eni: "true"` annotation (indicating trunk/exclusive ENI mode).
229+
- Which annotation-based config source is used, following the webhook priority order:
230+
1. `k8s.aliyun.com/pod-networks` (explicit pod-networks config)
231+
2. `k8s.aliyun.com/pod-networks-request` (pod-networks-request config)
232+
3. `k8s.aliyun.com/pod-networking` (matched PodNetworking resource)
233+
4. Fallback to `eni-config` default on eth0 if none of the above are set.
234+
235+
Use this to determine if the Pod should be managed by Terway, whether it uses PodENI, and which configuration source drives its ENI/IP allocation.
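
The priority order above can be illustrated with a short sketch. It is not the implementation of `inspect-terway-pod.sh`; the function names are hypothetical, and only the annotation keys and their order are taken from this document.

```bash
# Read one field of a Pod via JSONPath (requires kubectl).
# $1 = namespace, $2 = Pod name, $3 = JSONPath expression, for example:
#   '{.spec.hostNetwork}'
#   '{.metadata.annotations.k8s\.aliyun\.com/pod-networks}'
pod_field() {
  kubectl -n "$1" get pod "$2" -o jsonpath="$3"
}

# Pick the config source by priority. Arguments are the values of the
# pod-networks, pod-networks-request, and pod-networking annotations
# (pass an empty string for an annotation that is not set).
pick_config_source() {
  if [ -n "$1" ]; then
    echo "k8s.aliyun.com/pod-networks"
  elif [ -n "$2" ]; then
    echo "k8s.aliyun.com/pod-networks-request"
  elif [ -n "$3" ]; then
    echo "k8s.aliyun.com/pod-networking"
  else
    echo "eni-config default"
  fi
}
```

Dots inside an annotation key are escaped with a backslash in the JSONPath expression, as shown in the comment.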
236+
237+
## Response style guidelines
238+
239+
When this Skill is active:
240+
241+
- **Always start from Events** when diagnosing Pod or node-level issues; do not jump straight into logs unless Events are missing.
242+
- **Reference concrete Terway event reasons** (e.g., `AllocIPFailed`, `CniPodCreateError`, `CreateENIFailed`) and explain what they mean.
243+
- **Ask for specific artifacts** when needed:
244+
- `kubectl describe pod` output for the problematic Pod.
245+
- Node and Node CR describe output when centralized IPAM is used.
246+
- Excerpts from Terway daemon/controlplane logs around the relevant time.
247+
- Keep answers structured and concise, but be explicit about next steps (what to inspect next and why).
248+
- Clearly distinguish between **configuration issues**, **resource exhaustion/quota**, and **potential Terway bugs** based on Events and logs.
