@@ -58,41 +58,144 @@ kubectl logs -n karpenter deployment/karpenter | grep "Starting Controller"
5858
5959### Node Registration Issues
6060
61- !!! error "Node failed to join cluster"
62- **Most Common Issue - Wrong API Server Endpoint:**
63- ```bash
64- # Symptoms: kubelet timeouts, nodes never register
65- # Error: "dial tcp 10.243.65.4:6443: i/o timeout"
61+ !!! error "Nodes not joining cluster after provisioning"
62+ This is often caused by a chain of issues. Work through this systematic checklist:
6663
67- # 1. Check what endpoint kubelet is trying to reach
68- ssh ubuntu@INSTANCE_IP "cat /var/lib/kubelet/bootstrap-kubeconfig | grep server"
64+ #### 1. Verify Instance Creation
6965
70- # 2. Find correct internal API endpoint
71- kubectl get endpointslice -n default -l kubernetes.io/service-name=kubernetes
66+ ```bash
67+ # Check if instances are being created
68+ ibmcloud is instances --output json | jq '.[] | select(.name | contains("nodepool"))'
7269
73- # 3. Update NodeClass with correct INTERNAL endpoint
74- kubectl patch ibmnodeclass YOUR-NODECLASS --type='merge' \
75- -p='{"spec":{"apiServerEndpoint":"https://INTERNAL-IP:6443"}}'
76- ```
70+ # Check NodeClaim status
71+ kubectl get nodeclaims -o wide
72+ kubectl describe nodeclaim NODECLAIM_NAME
73+ ```
7774
78- **Other Common Causes:**
75+ **Expected:** Instance status `running`, NodeClaim shows `Launched: True`
7976
80- - VNI (Virtual Network Interface) not configured properly (v0.3.53+ required)
81- - Bootstrap token expiration
82- - Network connectivity problems
77+ #### 2. Check Network Connectivity (Most Common Issue)
8378
84- **Debug steps:**
85- ```bash
86- # Check bootstrap logs on instance
87- ssh ubuntu@INSTANCE_IP "sudo journalctl -u cloud-final"
79+ **Step 2a: Verify Subnet Placement**
80+ ```bash
81+ # Find which subnet your cluster nodes are in
82+ kubectl get nodes -o wide # Note the INTERNAL-IP range
8883
89- # Check kubelet status and errors
90- ssh ubuntu@INSTANCE_IP "sudo systemctl status kubelet"
91- ssh ubuntu@INSTANCE_IP "sudo journalctl -u kubelet --no-pager -n 50"
84+ # Check if Karpenter nodes are in the same subnet
85+ ibmcloud is instance INSTANCE_ID --output json | jq '.primary_network_interface.subnet'
9286
93- # Test API server connectivity from node
94- ssh ubuntu@INSTANCE_IP "curl -k -m 10 https://API-SERVER-IP:6443/healthz"
95- ```
87+ # If the subnets differ, the new nodes may be network-isolated from the cluster!
88+ ```
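
Comparing a node's address against a subnet range by eye is error-prone, so the check in Step 2a can be sketched as a small bash helper. This is a minimal sketch, not part of the Karpenter provider: `ip_to_int` and `ip_in_cidr` are hypothetical names, and the CIDR in the usage example is an assumed value that you would read from your own subnet.

```bash
#!/usr/bin/env bash
# Hypothetical helpers (not part of any IBM or Karpenter tooling).

# Convert a dotted-quad IPv4 address to a 32-bit integer.
# Assumes octets are written without leading zeros.
ip_to_int() {
  local IFS=. a b c d
  read -r a b c d <<< "$1"
  echo $(( (a << 24) | (b << 16) | (c << 8) | d ))
}

# Return 0 if IPv4 address $1 lies inside CIDR block $2 (for example 10.243.65.0/24).
ip_in_cidr() {
  local ip=$1 net=${2%/*} bits=${2#*/}
  local mask=$(( bits == 0 ? 0 : (0xFFFFFFFF << (32 - bits)) & 0xFFFFFFFF ))
  (( ($(ip_to_int "$ip") & mask) == ($(ip_to_int "$net") & mask) ))
}
```

For example, `ip_in_cidr 10.243.65.4 10.243.65.0/24 && echo "same subnet"` succeeds, while an address in `10.243.64.0/24` would not match that range.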
89+
90+ **Step 2b: Verify API Server Endpoint Configuration**
91+ ```bash
92+ # Find the INTERNAL API endpoint (not external!)
93+ kubectl get endpoints kubernetes -o yaml
94+ # OR
95+ kubectl get endpointslice -n default -l kubernetes.io/service-name=kubernetes
96+
97+ # Check what's configured in IBMNodeClass
98+ kubectl get ibmnodeclass YOUR-NODECLASS -o yaml | grep apiServerEndpoint
99+
100+ # Update if using external IP instead of internal
101+ kubectl patch ibmnodeclass YOUR-NODECLASS --type='merge' \
102+ -p='{"spec":{"apiServerEndpoint":"https://INTERNAL-IP:6443"}}'
103+ ```
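
A quick way to spot an external endpoint is to test whether the configured address falls in a private IPv4 range. The helper below is a rough heuristic under that assumption: `endpoint_is_private` is a hypothetical name, it only understands IPv4 literals (not hostnames), and it treats the RFC 1918 ranges as "internal", which may not hold for every network layout.

```bash
#!/usr/bin/env bash
# Hypothetical helper: return 0 if the host part of an endpoint URL is an
# RFC 1918 private IPv4 address (10/8, 172.16/12, 192.168/16).
endpoint_is_private() {
  local host=${1#*://}      # strip the scheme, if any
  host=${host%%[:/]*}       # strip the port and path, if any
  # Only IPv4 literals are handled; hostnames are rejected.
  [[ $host =~ ^[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+$ ]] || return 1
  case $host in
    10.*|192.168.*) return 0 ;;
    172.1[6-9].*|172.2[0-9].*|172.3[01].*) return 0 ;;
    *) return 1 ;;
  esac
}
```

It could be combined with the `apiServerEndpoint` field shown above, for example: `endpoint_is_private "$(kubectl get ibmnodeclass YOUR-NODECLASS -o jsonpath='{.spec.apiServerEndpoint}')" || echo "endpoint does not look internal"`.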
104+
105+ **Step 2c: Test Connectivity from Node**
106+ ```bash
107+ # Attach floating IP for debugging
108+
109+ # Then SSH and test
110+ ssh -i ~/.ssh/eb root@FLOATING_IP
111+
112+ # Test network layers
113+ ping INTERNAL_API_IP # Test ICMP
114+ telnet INTERNAL_API_IP 6443 # Test TCP
115+ curl -k https://INTERNAL_API_IP:6443/healthz # Test HTTPS
116+ ```
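
Minimal images often ship without `telnet`, so the TCP test above can also be done with bash's built-in `/dev/tcp` redirection. This is a sketch under the assumption that `bash` and coreutils `timeout` are present on the node; `check_tcp` is a hypothetical name.

```bash
#!/usr/bin/env bash
# Hypothetical helper: return 0 if a TCP connection to HOST:PORT succeeds
# within SECS seconds (default 5), non-zero otherwise.
check_tcp() {
  local host=$1 port=$2 secs=${3:-5}
  # The child shell opens the socket on fd 3; it closes when the shell exits.
  timeout "$secs" bash -c 'exec 3<>"/dev/tcp/$1/$2"' _ "$host" "$port" 2>/dev/null
}
```

Usage: `check_tcp INTERNAL_API_IP 6443 && echo "API port reachable" || echo "API port blocked or timed out"`. A timeout usually points at security groups or routing, while an immediate refusal suggests nothing is listening on that address.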
117+
118+ #### 3. Verify Security Groups
119+
120+ !!! danger "Security Group Requirements"
121+ Both worker and control plane security groups need proper rules for bidirectional communication.
122+
123+ **Required Security Group Rules:**
124+
125+ ```bash
126+ # Check current security groups on instance
127+ ibmcloud is instance INSTANCE_ID --output json | \
128+ jq '.network_interfaces[0].security_groups'
129+
130+ # Worker node security group needs:
131+ # Outbound rules:
132+ #   - TCP 6443 to control plane subnet (Kubernetes API)
133+ #   - TCP 10250 to all nodes (Kubelet)
134+ #   - TCP/UDP 53 to 0.0.0.0/0 (DNS)
135+ #   - TCP 80,443 to 0.0.0.0/0 (Package downloads)
136+
137+ # Inbound rules:
138+ #   - TCP 6443 from control plane (API server callbacks)
139+ #   - TCP 10250 from all nodes (Kubelet peer communication)
140+
141+ # Add missing rules example:
142+ ibmcloud is security-group-rule-add WORKER_SG_ID \
143+ outbound tcp --port-min 6443 --port-max 6443 \
144+ --remote CONTROL_PLANE_SUBNET_CIDR
145+
146+ ibmcloud is security-group-rule-add WORKER_SG_ID \
147+ inbound tcp --port-min 6443 --port-max 6443 \
148+ --remote CONTROL_PLANE_SUBNET_CIDR
149+ ```
150+
151+ #### 4. Debug Bootstrap Process
152+
153+ **Check Cloud-Init Status:**
154+ ```bash
155+ # SSH to node (after attaching floating IP)
156+ ssh -i ~/.ssh/eb root@FLOATING_IP
157+
158+ # Check cloud-init progress
159+ sudo cloud-init status --long
160+
161+ # View bootstrap logs
162+ sudo tail -100 /var/log/cloud-init.log
163+ sudo tail -100 /var/log/cloud-init-output.log
164+ sudo cat /var/log/karpenter-bootstrap.log
165+
166+ # Check if kubelet was installed
167+ sudo systemctl status kubelet
168+ sudo journalctl -u kubelet --no-pager -n 50
169+ ```
170+
171+ **Common Bootstrap Issues:**
172+ - Package repository access blocked (check security groups for HTTP/HTTPS)
173+ - CNI conflicts (check for pre-existing CNI configurations)
174+
175+ #### 5. Verify IBMNodeClass Configuration
176+
177+ ```bash
178+ # Check for common configuration issues
179+ kubectl get ibmnodeclass YOUR-NODECLASS -o yaml
180+
181+ # Key fields to verify:
182+ # - apiServerEndpoint: Must be INTERNAL cluster endpoint
183+ # - bootstrapMode: Should be "cloud-init" for VPC
184+ # - securityGroups: Must include proper security group IDs
185+ # - sshKeys: Must use SSH key IDs (r010-xxx format), not names
186+ ```
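
Because the `sshKeys` field expects IDs rather than names, a rough format check can catch a key name before a NodeClaim fails. This sketch assumes that key IDs take the form of a region prefix such as `r010` followed by a UUID; that pattern is an assumption drawn from the `r010-xxx` hint above, and `looks_like_key_id` is a hypothetical name.

```bash
#!/usr/bin/env bash
# Hypothetical helper: return 0 if $1 looks like a VPC resource ID of the
# assumed form rNNN-<uuid>, non-zero if it looks like a plain name.
looks_like_key_id() {
  local re='^r[0-9]{3}-[0-9a-f]{8}(-[0-9a-f]{4}){3}-[0-9a-f]{12}$'
  [[ $1 =~ $re ]]
}
```

For example, `looks_like_key_id my-ssh-key || echo "this looks like a key name, not an ID"`.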
187+
188+ #### 6. Check Resource Group Configuration
189+
190+ ```bash
191+ # Verify instances are created in correct resource group
192+ ibmcloud is instances --output json | \
193+ jq '.[] | select(.name | contains("nodepool")) |
194+ {name: .name, resource_group: .resource_group.id}'
195+
196+ # Should match the resource group in IBMNodeClass
197+ kubectl get ibmnodeclass YOUR-NODECLASS -o yaml | grep resourceGroupID
198+ ```
96199
97200### Security Group Configuration
98201