Add AWS VPC CNI IP exhaustion (#102)

elskow · web-flow · commit 5ebcf6c78dd6 · 2025-07-29T15:47:00.000-05:00
* feat(rules): add nginx ingress SSL certificate crisis detection

Add new rule CRE-2025-0120 to detect critical SSL certificate failures in NGINX Ingress Controllers

* feat(rules): add AWS VPC CNI IP exhaustion crisis rule and tags

Add new rule for detecting and mitigating AWS VPC CNI IP address exhaustion scenarios.
Includes related tags for IP exhaustion, ENI allocation, pod scheduling, and cluster scaling issues.
diff --git a/rules/cre-2025-0122/aws-vpc-cni-ip-exhaustion-crisis.yaml b/rules/cre-2025-0122/aws-vpc-cni-ip-exhaustion-crisis.yaml
@@ -0,0 +1,93 @@
+rules:
+  - cre:
+      id: CRE-2025-0122
+      severity: 0
+      title: AWS VPC CNI IP Address Exhaustion Crisis
+      category: networking-problem
+      author: Prequel
+      description: |
+        Critical AWS VPC CNI IP address exhaustion detected. This pattern indicates cascading failures
+        where subnet IP exhaustion leads to ENI allocation failures, pod scheduling failures, and
+        complete service unavailability. The failure sequence shows IP allocation errors, ENI attachment
+        failures, and resulting pod startup failures that affect cluster scalability and workload deployment.
+      cause: |
+        - Subnet IP address pool exhaustion in VPC
+        - Maximum ENI limit reached per EC2 instance
+        - Secondary IP allocation failures on existing ENIs
+        - VPC CNI plugin configuration errors
+        - Insufficient subnet CIDR block size for cluster scale
+        - ENI warm pool depletion during traffic spikes
+        - AWS API rate limiting on EC2 ENI operations
+        - Security group or NACL rules blocking node access to the EC2 API
+        - IAM permissions missing for ENI management
+        - Cross-AZ networking constraints affecting IP allocation
+      impact: |
+        - CRITICAL: Complete inability to schedule new pods
+        - Existing pods fail to restart or scale
+        - Service degradation due to reduced pod capacity
+        - Cluster autoscaling failures and node provisioning issues
+        - Application deployment failures and rollback complications
+        - Load balancer health check failures due to unreachable pods
+        - Cascading failures across microservices architecture
+        - Data plane connectivity loss between pods
+        - Revenue loss from service unavailability
+        - Compliance violations for high-availability requirements
+      impactScore: 10
+      tags:
+        - aws
+        - vpc-cni
+        - kubernetes
+        - networking
+        - ip-exhaustion
+        - eni-allocation
+        - pod-scheduling
+        - cluster-scaling
+        - high-availability
+        - service-unavailability
+      mitigation: |
+        IMMEDIATE ACTIONS:
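+        - Check available IPs in subnets: `aws ec2 describe-subnets --subnet-ids subnet-xxx --query 'Subnets[].AvailableIpAddressCount'`
+        - List ENIs attached to an instance and compare against its instance-type limit: `aws ec2 describe-network-interfaces --filters Name=attachment.instance-id,Values=i-xxx`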
+        - Monitor VPC CNI logs: `kubectl logs -n kube-system -l k8s-app=aws-node`
+        - Check for stuck pods: `kubectl get pods --all-namespaces | grep -E 'Pending|ContainerCreating'`
+        - Verify CNI configuration (environment variables on the daemonset): `kubectl get daemonset aws-node -n kube-system -o yaml`
+
+        RECOVERY STEPS:
+        1. Add additional subnets with larger CIDR blocks
+        2. Shrink the warm pool so nodes hold fewer idle IPs (example values): `kubectl set env daemonset aws-node -n kube-system WARM_IP_TARGET=2 MINIMUM_IP_TARGET=10`
+        3. Enable prefix delegation: `kubectl set env daemonset aws-node -n kube-system ENABLE_PREFIX_DELEGATION=true`
+        4. Scale down non-critical workloads to free IPs
+        5. Restart VPC CNI daemonset: `kubectl rollout restart daemonset/aws-node -n kube-system`
+        6. Monitor IP allocation recovery: `kubectl get pods -n kube-system -l k8s-app=aws-node`
+
+        PREVENTION:
+        - Implement IP address monitoring and alerting
+        - Provision subnets with larger CIDR blocks, or attach a secondary VPC CIDR via custom networking
+        - Set up VPC CNI metrics monitoring in CloudWatch
+        - Implement pod density limits per node
+        - Use prefix delegation for improved IP efficiency
+        - Regular capacity planning for cluster growth
+        - Implement network policy optimization
+        - Set up automated subnet provisioning
+      references:
+        - https://docs.aws.amazon.com/eks/latest/userguide/cni-increase-ip-addresses.html
+        - https://github.com/aws/amazon-vpc-cni-k8s/blob/master/docs/troubleshooting.md
+        - https://aws.amazon.com/blogs/containers/amazon-vpc-cni-increases-pods-per-node-limits/
+        - https://docs.aws.amazon.com/eks/latest/userguide/cni-custom-network.html
+        - https://kubernetes.io/docs/concepts/extend-kubernetes/compute-storage-net/network-plugins/
+      applications:
+        - name: amazon-vpc-cni-k8s
+          version: ">= 1.7.0"
+        - name: kubernetes
+          version: ">= 1.18.0"
+      mitigationScore: 6
+    metadata:
+      gen: 1
+      id: 6E7meYDEvC5c6yub5dVgkW
+      kind: prequel
+    rule:
+      set:
+        event:
+          source: cre.log.aws-vpc-cni
+        match:
+          - regex: "failed to allocate a private IP address.*no available IP addresses|ENI allocation failed.*insufficient IP addresses|failed to assign private IP.*AddressLimitExceeded|pod.*failed.*no available IP|insufficient IP addresses in subnet|failed to create ENI.*AddressLimitExceeded|unable to provision ENI.*IP address limit|failed to allocate IP.*subnet has no available addresses|pod scheduling failed.*insufficient IP addresses|CNI failed to allocate IP.*no free addresses"
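
Reviewer note on the ENI-limit and prefix-delegation items in this rule: the per-node ceiling can be estimated with the commonly cited VPC CNI formula. The sketch below is illustrative only. The function names are hypothetical, and the example figures (3 ENIs with 10 IPv4 addresses each, as AWS lists for an m5.large) are assumptions that should be checked with `aws ec2 describe-instance-types` for the instance type in use.

```python
def max_pods(enis: int, ips_per_eni: int, prefix_delegation: bool = False) -> int:
    """Estimate the upper bound on pods for one node (hypothetical helper)."""
    # One address on every ENI is the ENI's own primary IP and is not given to pods.
    slots = enis * (ips_per_eni - 1)
    # With prefix delegation, each slot holds a /28 prefix, i.e. 16 addresses.
    if prefix_delegation:
        slots *= 16
    # +2 covers the host-network pods (aws-node, kube-proxy) that need no pod IP.
    return slots + 2


def usable_subnet_ips(prefix_len: int) -> int:
    """Usable IPv4 addresses in a subnet; AWS reserves 5 addresses per subnet."""
    return 2 ** (32 - prefix_len) - 5


if __name__ == "__main__":
    # Assumed example values: 3 ENIs, 10 IPv4 addresses per ENI.
    print(max_pods(3, 10))
    print(max_pods(3, 10, prefix_delegation=True))
    print(usable_subnet_ips(24))
```

For the assumed instance this gives 29 pods in secondary-IP mode and 434 with prefix delegation, although the kubelet max-pods setting normally caps the latter well below that figure. A /24 subnet offers 251 usable addresses, which a handful of such nodes can consume quickly.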
diff --git a/rules/cre-2025-0122/test.log b/rules/cre-2025-0122/test.log
@@ -0,0 +1,21 @@
+2025/07/02 08:29:03 [ERROR] aws-node-daemonset-xyz: ipamd.go:1234 failed to allocate ENI: AddressLimitExceeded: The maximum number of addresses has been reached.
+2025/07/02 08:29:03 [ERROR] aws-node-daemonset-xyz: ipamd.go:1235 no available IP addresses in subnet 
+2025/07/02 08:29:03 [WARN] aws-node-daemonset-xyz: ipamd.go:1236 insufficient IP addresses available for new pods
+2025/07/02 08:29:03 [ERROR] kubelet: event.go:294 FailedScheduling: 0/3 nodes are available: 3 Insufficient IP addresses in subnet 
+2025/07/02 08:29:03 [ERROR] kubelet: event.go:295 FailedScheduling: pod "test-app-deployment-abc123-xyz" failed to fit in any node
+2025/07/02 08:29:03 [ERROR] scheduler: scheduler.go:456 Failed to schedule pod test-app/test-pod-789: Insufficient IP
+2025/07/02 08:29:03 [ERROR] aws-node: cni.go:123 failed to assign an IP address to container: no available IP addresses in subnet 
+2025/07/02 08:29:03 [ERROR] aws-node: eni.go:234 failed to allocate ENI for pod test-pod-456: NetworkInterfaceLimitExceeded
+2025/07/02 08:29:03 [ERROR] aws-node: ipam.go:345 IPAM: failed to get IP address from datastore: no available IP addresses
+2025/07/02 08:29:03 [ERROR] aws-node: ec2.go:567 EC2 API error: AddressLimitExceeded - The maximum number of addresses has been reached
+2025/07/02 08:29:03 [ERROR] aws-node: ec2.go:568 EC2 API error: NetworkInterfaceLimitExceeded - The maximum number of network interfaces has been reached
+2025/07/02 08:29:03 [ERROR] aws-node: vpc.go:789 VPC CNI error: insufficient IP addresses in subnet  for pod allocation
+2025/07/02 08:29:03 [ERROR] cluster-autoscaler: scale_up.go:123 failed to scale up: nodes cannot accommodate new pods due to IP exhaustion in VPC 
+2025/07/02 08:29:03 [ERROR] karpenter: provisioner.go:234 failed to provision new node: insufficient IP addresses in subnet 
+2025/07/02 08:29:03 [ERROR] aws-load-balancer-controller: controller.go:345 failed to create target group: no available IP addresses
+2025/07/02 08:29:03 [ERROR] deployment-controller: deployment.go:456 Deployment "critical-app" failed: pods cannot be scheduled due to IP exhaustion
+2025/07/02 08:29:03 [ERROR] replicaset-controller: replicaset.go:567 ReplicaSet "web-app-rs" failed to create pods: Insufficient IP addresses
+2025/07/02 08:29:03 [ERROR] statefulset-controller: statefulset.go:678 StatefulSet "database" stuck: cannot allocate IP addresses for new pods
+2025/07/02 08:29:03 [ERROR] service-controller: service.go:789 Service "api-service" endpoints unavailable: pods failed to start due to IP exhaustion
+2025/07/02 08:29:03 [ERROR] ingress-controller: ingress.go:890 Ingress "web-ingress" backend unavailable: target pods cannot be scheduled
+2025/07/02 08:29:03 [ERROR] dns-controller: dns.go:901 DNS resolution failing: CoreDNS pods cannot be scheduled due to IP exhaustion
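
Reviewer note on test coverage: it may be worth checking how many of these log lines the rule's `match` regex actually catches. The sketch below assumes the rule engine performs an unanchored, case-sensitive search, as Python's `re.search` does; the engine's real regex semantics are not shown in this change, so treat the result as indicative. The pattern is copied from the rule, the sample lines are taken from `rules/cre-2025-0122/test.log` with timestamps and trailing spaces dropped, and `matching_lines` is a hypothetical helper.

```python
import re

# Pattern copied from the rule's `match` entry, split across lines for readability.
PATTERN = re.compile(
    r"failed to allocate a private IP address.*no available IP addresses"
    r"|ENI allocation failed.*insufficient IP addresses"
    r"|failed to assign private IP.*AddressLimitExceeded"
    r"|pod.*failed.*no available IP"
    r"|insufficient IP addresses in subnet"
    r"|failed to create ENI.*AddressLimitExceeded"
    r"|unable to provision ENI.*IP address limit"
    r"|failed to allocate IP.*subnet has no available addresses"
    r"|pod scheduling failed.*insufficient IP addresses"
    r"|CNI failed to allocate IP.*no free addresses"
)

# A few lines from the test log (timestamps removed).
SAMPLE_LINES = [
    "[ERROR] aws-node-daemonset-xyz: ipamd.go:1234 failed to allocate ENI: "
    "AddressLimitExceeded: The maximum number of addresses has been reached.",
    "[ERROR] kubelet: event.go:294 FailedScheduling: 0/3 nodes are available: "
    "3 Insufficient IP addresses in subnet",
    "[ERROR] aws-node: cni.go:123 failed to assign an IP address to container: "
    "no available IP addresses in subnet",
    "[ERROR] aws-node: vpc.go:789 VPC CNI error: insufficient IP addresses in subnet  "
    "for pod allocation",
    "[ERROR] karpenter: provisioner.go:234 failed to provision new node: "
    "insufficient IP addresses in subnet",
]


def matching_lines(lines):
    """Return the lines in which the pattern occurs anywhere (unanchored search)."""
    return [line for line in lines if PATTERN.search(line)]


if __name__ == "__main__":
    for line in matching_lines(SAMPLE_LINES):
        print(line)
```

Under that assumption, only the two sample lines containing the lowercase phrase `insufficient IP addresses in subnet` match. The capitalized kubelet message and the `failed to allocate ENI` and `failed to assign an IP address` lines do not, so either a case-insensitive flag or broader alternatives may be needed if those lines are meant to trigger the rule.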
diff --git a/rules/tags/tags.yaml b/rules/tags/tags.yaml
@@ -833,3 +833,9 @@ tags:
   - name: certificate-verification
     displayName: Certificate Verification
     description: Issues with SSL/TLS certificate verification including trust chain validation, certificate authority verification, and hostname matching
+  - name: pod-scheduling
+    displayName: Pod Scheduling
+    description: Issues with Kubernetes pod scheduling due to resource constraints or networking problems
+  - name: cluster-scaling
+    displayName: Cluster Scaling
+    description: Problems related to Kubernetes cluster scaling operations and capacity management