
CAPA Cluster Priority Expander Fails to Scale to Shared Tenancy When Dedicated Hosts Have Insufficient Capacity #5853

@raghs-aws

Description

/kind bug

What steps did you take and what happened:
Issue Summary: A CAPA cluster with a dual nodegroup setup (dedicated hosts + shared tenancy) fails to scale to shared tenancy instances when the dedicated hosts reach their capacity limit. The cluster autoscaler gets stuck and does not create new nodes on the lower-priority shared tenancy nodegroup as expected.

Environment:

Platform: CAPA-based Kubernetes clusters with MachinePool templates for dedicated hosts and shared tenancy
ClusterAutoscaler configuration: expander=priority,least-waste (deployment flags sketched below the ConfigMap)
Priority setup: Dedicated Hosts (priority 100) → Shared Tenancy (priority 10)
Node-group auto-discovery enabled on the Cluster Autoscaler

apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-autoscaler-priority-expander
  namespace: kube-system
data:
  priorities: |-
    100:
      - .HostTenancy
    10:
      - .Shared
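
For reference, the Cluster Autoscaler deployment is started with flags roughly like the following. This is a minimal sketch assuming the clusterapi cloud provider; the namespace value is a placeholder, not taken from the actual cluster:

      containers:
      - name: cluster-autoscaler
        image: registry.k8s.io/autoscaling/cluster-autoscaler:v1.31.0
        command:
        - ./cluster-autoscaler
        - --cloud-provider=clusterapi                              # CAPA-managed node groups
        - --expander=priority,least-waste                          # priority expander reads the ConfigMap above
        - --node-group-auto-discovery=clusterapi:namespace=default # namespace value is an assumption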

Steps to Reproduce

Set up a CAPA cluster with a dual MachinePool configuration (a trimmed MachinePool sketch follows these steps):

MachinePool 1: dedicated host tenancy (nodegroup name contains "HostTenancy")
MachinePool 2: shared tenancy (nodegroup name contains "Shared")

Configure the Cluster Autoscaler with the priority expander and the ConfigMap shown above

Trigger a scaling event while the dedicated hosts are at or near their capacity limit

Observe the behavior: scaling attempts fail to create nodes on the shared tenancy MachinePool
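
As a reference for step 1: each MachinePool opts into autoscaler management via the standard Cluster API min/max annotations, and its name must be matched by one of the regexes in the priority ConfigMap. A trimmed sketch follows; the names and size values are placeholders, not the exact manifests from the cluster:

apiVersion: cluster.x-k8s.io/v1beta1
kind: MachinePool
metadata:
  name: pool-hosttenancy   # reported setup: nodegroup name contains "HostTenancy" (priority 100 regex)
  annotations:
    cluster.x-k8s.io/cluster-api-autoscaler-node-group-min-size: "1"
    cluster.x-k8s.io/cluster-api-autoscaler-node-group-max-size: "5"
---
apiVersion: cluster.x-k8s.io/v1beta1
kind: MachinePool
metadata:
  name: pool-shared        # reported setup: nodegroup name contains "Shared" (priority 10 regex)
  annotations:
    cluster.x-k8s.io/cluster-api-autoscaler-node-group-min-size: "0"
    cluster.x-k8s.io/cluster-api-autoscaler-node-group-max-size: "5"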

What did you expect to happen:

Expected Behavior:

ClusterAutoscaler attempts to scale dedicated hosts nodegroup (priority 100)
When dedicated hosts have insufficient capacity → automatic fallback to shared tenancy nodegroup (priority 10)
New nodes successfully created on shared tenancy instances
Workload scheduling continues seamlessly

Actual Behavior:

ClusterAutoscaler attempts to scale dedicated hosts nodegroup ✅
Dedicated hosts capacity insufficient ❌
FAILURE: Scaling gets stuck; no fallback to the shared tenancy nodegroup
RESULT: No new nodes are created, workload scheduling fails

Anything else you would like to add:

This setup works when used with Auto Scaling groups (non-CAPA setup):
Same priority expander configuration
Same ConfigMap setup
Successfully falls back to the shared tenancy ASG when the dedicated hosts are at capacity
Key difference: uses native ASGs instead of CAPA MachinePools
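
For comparison, the working ASG-based setup uses the same expander configuration but the aws cloud provider. A sketch of the relevant flags; the discovery tags are the conventional ones and are an assumption here:

        - --cloud-provider=aws
        - --expander=priority,least-waste
        - --node-group-auto-discovery=asg:tag=k8s.io/cluster-autoscaler/enabled,k8s.io/cluster-autoscaler/my-cluster  # tag values assumed
        # the priority expander then matches the ASG names directly against .HostTenancy / .Shared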

Environment:

  • Cluster-api-provider-aws version: 1.24.16
  • Kubernetes version (kubectl version): 1.24.16
  • Cluster Autoscaler version: v1.31.0
