Commit 337a46e

docs: more bootstrap troubleshooting docs
1 parent 850c8f3 commit 337a46e

1 file changed

+130
-27
lines changed

docs/troubleshooting.md

Lines changed: 130 additions & 27 deletions
@@ -58,41 +58,144 @@ kubectl logs -n karpenter deployment/karpenter | grep "Starting Controller"
5858

5959
### Node Registration Issues
6060

61-
!!! error "Node failed to join cluster"
62-
**Most Common Issue - Wrong API Server Endpoint:**
63-
```bash
64-
# Symptoms: kubelet timeouts, nodes never register
65-
# Error: "dial tcp 10.243.65.4:6443: i/o timeout"
61+
!!! error "Nodes not joining cluster after provisioning"
62+
This is often caused by a chain of issues. Work through this systematic checklist:
6663

67-
# 1. Check what endpoint kubelet is trying to reach
68-
ssh ubuntu@INSTANCE_IP "cat /var/lib/kubelet/bootstrap-kubeconfig | grep server"
64+
#### 1. Verify Instance Creation
6965

70-
# 2. Find correct internal API endpoint
71-
kubectl get endpointslice -n default -l kubernetes.io/service-name=kubernetes
66+
```bash
67+
# Check if instances are being created
68+
ibmcloud is instances --output json | jq '.[] | select(.name | contains("nodepool"))'
7269

73-
# 3. Update NodeClass with correct INTERNAL endpoint
74-
kubectl patch ibmnodeclass YOUR-NODECLASS --type='merge' \
75-
-p='{"spec":{"apiServerEndpoint":"https://INTERNAL-IP:6443"}}'
76-
```
70+
# Check NodeClaim status
71+
kubectl get nodeclaims -o wide
72+
kubectl describe nodeclaim NODECLAIM_NAME
73+
```
7774

78-
**Other Common Causes:**
75+
**Expected:** Instance status `running`, NodeClaim shows `Launched: True`
7976

80-
- VNI (Virtual Network Interface) not configured properly (v0.3.53+ required)
81-
- Bootstrap token expiration
82-
- Network connectivity problems
77+
#### 2. Check Network Connectivity (Most Common Issue)
8378

84-
**Debug steps:**
85-
```bash
86-
# Check bootstrap logs on instance
87-
ssh ubuntu@INSTANCE_IP "sudo journalctl -u cloud-final"
79+
**Step 2a: Verify Subnet Placement**
80+
```bash
81+
# Find which subnet your cluster nodes are in
82+
kubectl get nodes -o wide # Note the INTERNAL-IP range
8883

89-
# Check kubelet status and errors
90-
ssh ubuntu@INSTANCE_IP "sudo systemctl status kubelet"
91-
ssh ubuntu@INSTANCE_IP "sudo journalctl -u kubelet --no-pager -n 50"
84+
# Check if Karpenter nodes are in the same subnet
85+
ibmcloud is instance INSTANCE_ID --output json | jq '.primary_network_interface.subnet'
9286

93-
# Test API server connectivity from node
94-
ssh ubuntu@INSTANCE_IP "curl -k -m 10 https://API-SERVER-IP:6443/healthz"
95-
```
87+
# If the subnets differ, the new nodes may be network-isolated from the cluster
88+
```
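The subnet comparison in Step 2a can also be done numerically. The following is a minimal sketch, not part of the project's tooling: the helper names `ip_to_int` and `ip_in_cidr` are hypothetical, and it assumes IPv4 addresses without leading zeros in any octet.

```bash
#!/usr/bin/env bash
# Hypothetical helpers (not part of Karpenter or the ibmcloud CLI).

# Convert a dotted-quad IPv4 address to a 32-bit integer.
ip_to_int() {
  local a b c d
  IFS=. read -r a b c d <<< "$1"
  echo $(( (a << 24) | (b << 16) | (c << 8) | d ))
}

# Return 0 if IPv4 address $1 lies inside CIDR block $2 (e.g. 10.243.64.0/18).
ip_in_cidr() {
  local ip=$1 net=${2%/*} bits=${2#*/}
  # Build the netmask for the prefix length, truncated to 32 bits.
  local mask=$(( (0xFFFFFFFF << (32 - bits)) & 0xFFFFFFFF ))
  (( ( $(ip_to_int "$ip") & mask ) == ( $(ip_to_int "$net") & mask ) ))
}
```

For example, `ip_in_cidr NODE_INTERNAL_IP SUBNET_CIDR && echo "same subnet"` compares a node's INTERNAL-IP against the CIDR block of the subnet the existing cluster nodes use.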
89+
90+
**Step 2b: Verify API Server Endpoint Configuration**
91+
```bash
92+
# Find the INTERNAL API endpoint (not external!)
93+
kubectl get endpoints kubernetes -o yaml
94+
# OR
95+
kubectl get endpointslice -n default -l kubernetes.io/service-name=kubernetes
96+
97+
# Check what's configured in IBMNodeClass
98+
kubectl get ibmnodeclass YOUR-NODECLASS -o yaml | grep apiServerEndpoint
99+
100+
# Update if using external IP instead of internal
101+
kubectl patch ibmnodeclass YOUR-NODECLASS --type='merge' \
102+
-p='{"spec":{"apiServerEndpoint":"https://INTERNAL-IP:6443"}}'
103+
```
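A quick heuristic for spotting an external endpoint is to check whether the configured host is a private IPv4 address. This sketch assumes that "internal" means an RFC 1918 address given as an IP literal, which may not hold for every cluster; the function name `is_private_endpoint` is hypothetical.

```bash
#!/usr/bin/env bash
# Hypothetical helper: return 0 if the host part of URL $1 starts with an
# RFC 1918 private IPv4 prefix (10/8, 172.16/12, 192.168/16).
# Assumption: the host is an IP literal; hostnames are not resolved.
is_private_endpoint() {
  local host=${1#*://}   # strip the scheme, e.g. "https://"
  host=${host%%[:/]*}    # strip the port and any path
  case $host in
    10.*|192.168.*) return 0 ;;
    172.1[6-9].*|172.2[0-9].*|172.3[01].*) return 0 ;;
    *) return 1 ;;
  esac
}
```

Running `is_private_endpoint "https://INTERNAL-IP:6443" || echo "endpoint looks external"` against the value of `apiServerEndpoint` gives a first indication before patching the IBMNodeClass.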
104+
105+
**Step 2c: Test Connectivity from Node**
106+
```bash
107+
# First attach a floating IP to the instance for debugging (IBM Cloud console or CLI)
108+
109+
# Then SSH and test
110+
ssh -i ~/.ssh/YOUR_KEY root@FLOATING_IP
111+
112+
# Test network layers
113+
ping INTERNAL_API_IP # Test ICMP
114+
telnet INTERNAL_API_IP 6443 # Test TCP
115+
curl -k https://INTERNAL_API_IP:6443/healthz # Test HTTPS
116+
```
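`telnet` may not be installed on a minimal node image. As a sketch of an alternative TCP test, the function below uses bash's built-in `/dev/tcp` redirection. It assumes bash was built with network redirections enabled and that `timeout` from GNU coreutils is available; the name `tcp_check` is hypothetical.

```bash
#!/usr/bin/env bash
# Hypothetical helper: return 0 if a TCP connection to host $1, port $2 can
# be opened within $3 seconds (default 5). Requires bash and coreutils.
tcp_check() {
  local host=$1 port=$2 secs=${3:-5}
  # The child shell receives host as $0 and port as $1; opening the
  # /dev/tcp path fails if the connection is refused or times out.
  timeout "$secs" bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$host" "$port" 2>/dev/null
}
```

On the node, `tcp_check INTERNAL_API_IP 6443 && echo "port reachable"` covers the same layer as the `telnet` line above.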
117+
118+
#### 3. Verify Security Groups
119+
120+
!!! danger "Security Group Requirements"
121+
Both worker and control plane security groups need proper rules for bidirectional communication.
122+
123+
**Required Security Group Rules:**
124+
125+
```bash
126+
# Check current security groups on instance
127+
ibmcloud is instance INSTANCE_ID --output json | \
128+
jq '.network_interfaces[0].security_groups'
129+
130+
# Worker Node Security Group needs:
131+
# Outbound rules:
132+
#   - TCP 6443 to control plane subnet (Kubernetes API)
133+
#   - TCP 10250 to all nodes (Kubelet)
134+
#   - TCP/UDP 53 to 0.0.0.0/0 (DNS)
135+
#   - TCP 80,443 to 0.0.0.0/0 (Package downloads)
136+
137+
# Inbound rules:
138+
#   - TCP 6443 from control plane (API server callbacks)
139+
#   - TCP 10250 from all nodes (Kubelet peer communication)
140+
141+
# Add missing rules example:
142+
ibmcloud is security-group-rule-add WORKER_SG_ID \
143+
outbound tcp --port-min 6443 --port-max 6443 \
144+
--remote CONTROL_PLANE_SUBNET_CIDR
145+
146+
ibmcloud is security-group-rule-add WORKER_SG_ID \
147+
inbound tcp --port-min 6443 --port-max 6443 \
148+
--remote CONTROL_PLANE_SUBNET_CIDR
149+
```
150+
151+
#### 4. Debug Bootstrap Process
152+
153+
**Check Cloud-Init Status:**
154+
```bash
155+
# SSH to node (after attaching floating IP)
156+
ssh -i ~/.ssh/YOUR_KEY root@FLOATING_IP
157+
158+
# Check cloud-init progress
159+
sudo cloud-init status --long
160+
161+
# View bootstrap logs
162+
sudo tail -100 /var/log/cloud-init.log
163+
sudo tail -100 /var/log/cloud-init-output.log
164+
sudo cat /var/log/karpenter-bootstrap.log
165+
166+
# Check whether kubelet was installed and started
167+
sudo systemctl status kubelet
168+
sudo journalctl -u kubelet --no-pager -n 50
169+
```
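When checking several nodes, it can help to reduce the `cloud-init status --long` output to its state value. The sketch below assumes the output contains a line of the form `status: <state>`; the function names are hypothetical.

```bash
#!/usr/bin/env bash
# Hypothetical helpers that read `cloud-init status --long` output on stdin.

# Print the value of the first "status:" line (e.g. running, done, error).
cloud_init_state() {
  awk -F': *' '/^status:/ { print $2; exit }'
}

# Return 0 only when the reported state is "done".
cloud_init_done() {
  [ "$(cloud_init_state)" = "done" ]
}
```

For example, `ssh -i ~/.ssh/YOUR_KEY root@FLOATING_IP cloud-init status --long | cloud_init_state` prints the current state; anything other than `done` suggests looking at the bootstrap logs listed above.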
170+
171+
**Common Bootstrap Issues:**
172+
- Package repository access blocked (check security groups for HTTP/HTTPS)
173+
- CNI conflicts (check for pre-existing CNI configurations)
174+
175+
#### 5. Verify IBMNodeClass Configuration
176+
177+
```bash
178+
# Check for common configuration issues
179+
kubectl get ibmnodeclass YOUR-NODECLASS -o yaml
180+
181+
# Key fields to verify:
182+
# - apiServerEndpoint: Must be INTERNAL cluster endpoint
183+
# - bootstrapMode: Should be "cloud-init" for VPC
184+
# - securityGroups: Must include proper security group IDs
185+
# - sshKeys: Must use SSH key IDs (r010-xxx format), not names
186+
```
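The `sshKeys` requirement can be checked mechanically. This sketch assumes that key IDs consist of a region prefix such as `r010-` followed by a UUID; that pattern is an assumption based on the `r010-xxx` format noted above, and the function name `looks_like_key_id` is hypothetical.

```bash
#!/usr/bin/env bash
# Hypothetical helper: return 0 if $1 looks like an ID (rNNN-<uuid>)
# rather than a human-readable key name.
# Assumption: IDs are "r" + three digits + "-" + a lowercase hex UUID.
looks_like_key_id() {
  [[ $1 =~ ^r[0-9]{3}-[0-9a-f]{8}(-[0-9a-f]{4}){3}-[0-9a-f]{12}$ ]]
}
```

A value such as `my-ssh-key` fails this check, which points to a key name having been used where an ID is required.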
187+
188+
#### 6. Check Resource Group Configuration
189+
190+
```bash
191+
# Verify instances are created in correct resource group
192+
ibmcloud is instances --output json | \
193+
jq '.[] | select(.name | contains("nodepool")) |
194+
{name: .name, resource_group: .resource_group.id}'
195+
196+
# Should match the resource group in IBMNodeClass
197+
kubectl get ibmnodeclass YOUR-NODECLASS -o yaml | grep resourceGroupID
198+
```
96199

97200
### Security Group Configuration
98201
