Commit a076dbf

feat: initial release of pulumi-aws-ec2-capacity-fallback
Pulumi dynamic provider that launches EC2 instances with automatic fallback across instance types and availability zones when AWS returns capacity errors (InsufficientInstanceCapacity, Unsupported, etc.).

Provides two APIs:
- ResilientInstance: typed dataclass args for standalone usage
- ResilientInstanceRaw: dict props accepting Pulumi Outputs

Includes pre-flight offerings check, AZ filtering, least-used subnet selection, and in-place tag/security group updates. Once launched, instance type is locked to prevent accidental replacement.
0 parents  commit a076dbf

10 files changed

+1191
-0
lines changed

.gitignore

Lines changed: 12 additions & 0 deletions
@@ -0,0 +1,12 @@
+__pycache__/
+*.py[cod]
+*$py.class
+*.egg-info/
+dist/
+build/
+venv/
+.venv/
+*.egg
+.pytest_cache/
+.coverage
+htmlcov/

README.md

Lines changed: 143 additions & 0 deletions
@@ -0,0 +1,143 @@
+# pulumi-aws-ec2-capacity-fallback
+
+A Pulumi component that launches EC2 instances with automatic fallback across instance types and availability zones when AWS returns capacity errors.
+
+## The problem
+
+When launching GPU instances (g6, g5, p5, etc.), AWS frequently returns `InsufficientInstanceCapacity` because GPU capacity is limited and unevenly distributed across AZs. This causes `pulumi up` to fail, requiring manual intervention to try a different instance type or AZ.
+
+## The solution
+
+This component wraps EC2 instance creation with retry logic. You provide an ordered list of instance types, and the component:
+
+1. Checks `describe_instance_type_offerings` to skip types not offered in the target AZs
+2. Attempts to launch each remaining type/AZ combination via `run_instances`
+3. On `InsufficientInstanceCapacity`, `InstanceLimitExceeded`, `Unsupported`, or `InsufficientFreeAddressesInSubnet`, automatically tries the next combination
+4. Once launched, the instance type is locked -- subsequent `pulumi up` runs will not replace the instance even if a different type is now preferred
+
+## Features
+
+- Automatic fallback across multiple instance types in priority order
+- Automatic fallback across multiple availability zones
+- Pre-flight offerings check to skip types not available in target AZs
+- AZ filtering (e.g. restrict to AZ A and B only)
+- Least-used subnet selection for balanced distribution
+- In-place tag and security group updates without instance replacement
+- Instance type locked after creation (use `pulumi up --replace <urn>` to force change)
+
+## Installation
+
+```bash
+uv add pulumi-aws-ec2-capacity-fallback
+```
+
+Or with pip:
+
+```bash
+pip install pulumi-aws-ec2-capacity-fallback
+```
+
+## Usage
+
+### Typed API (recommended)
+
+```python
+from pulumi_ec2_capacity_fallback import (
+    ResilientInstance,
+    ResilientInstanceArgs,
+    BlockDeviceConfig,
+)
+
+instance = ResilientInstance(
+    "my-gpu-node",
+    args=ResilientInstanceArgs(
+        instance_types=["g6.xlarge", "g6e.xlarge", "g5.xlarge"],
+        subnet_ids=["subnet-abc123", "subnet-def456"],
+        ami_id="ami-0123456789abcdef0",
+        security_group_ids=["sg-abc123"],
+        instance_profile_name="my-profile",
+        key_name="my-key",
+        root_block_device=BlockDeviceConfig(volume_size=100),
+        tags={"Name": "my-gpu-node"},
+    ),
+    region="us-east-1",
+)
+```
+
+### Raw API (for Pulumi Output inputs)
+
+When your inputs are Pulumi Outputs (e.g. subnet IDs from another resource), use `ResilientInstanceRaw`:
+
+```python
+from pulumi_ec2_capacity_fallback import ResilientInstanceRaw
+
+instance = ResilientInstanceRaw(
+    "my-node",
+    props={
+        "instance_types": ["g6.xlarge", "g5.xlarge"],
+        "subnet_ids": vpc_component.private_subnet_ids,  # Output[List[str]]
+        "ami_id": ami.id,  # Output[str]
+        "security_group_ids": [sg.id],
+        "instance_profile_name": profile.name,
+        "key_name": "my-key",
+        "root_block_device": {
+            "volume_size": 100,
+            "volume_type": "gp3",
+            "encrypted": True,
+            "delete_on_termination": True,
+        },
+        "tags": {"Name": "my-node"},
+        "region": "us-east-1",
+    },
+)
+```
+
+## Inputs
+
+| Name | Type | Default | Description |
+|------|------|---------|-------------|
+| `instance_types` | `List[str]` | required | Ordered list of instance types to try (primary first) |
+| `subnet_ids` | `List[str]` | required | Subnet IDs to try across AZs |
+| `ami_id` | `str` | required | AMI ID to launch |
+| `security_group_ids` | `List[str]` | required | Security group IDs to attach |
+| `instance_profile_name` | `str` | `""` | IAM instance profile name |
+| `key_name` | `str` | `""` | SSH key pair name |
+| `root_block_device` | `BlockDeviceConfig` | 8 GiB gp3 | Root volume configuration |
+| `user_data` | `str` | `""` | User data script |
+| `tags` | `Dict[str, str]` | `{}` | Tags for the instance |
+| `az_suffixes` | `Optional[List[str]]` | `["a", "b"]` | Restrict to AZs ending with these suffixes. Set to `None` for all AZs |
+| `prefer_least_used_subnet` | `bool` | `False` | Prefer the subnet with the most available IP addresses |
+
+## Outputs
+
+| Name | Type | Description |
+|------|------|-------------|
+| `instance_id` | `str` | AWS EC2 instance ID |
+| `private_ip` | `str` | Private IP address |
+| `public_ip` | `str` | Public IP address (if applicable) |
+| `availability_zone` | `str` | AZ where the instance was launched |
+| `launched_instance_type` | `str` | The actual instance type that was launched |
+| `launched_subnet_id` | `str` | The actual subnet used |
+
+## How it handles re-runs
+
+The component is designed to be safe on subsequent `pulumi up` runs:
+
+- **Instance type changes in config**: ignored. The existing instance keeps its launched type.
+- **Subnet/AZ changes in config**: ignored. The existing instance stays in its launched subnet.
+- **Tag changes**: applied in-place (no instance replacement).
+- **Security group changes**: applied in-place.
+- **AMI changes**: triggers instance replacement (delete + create with fallback).
+- **Volume size changes**: triggers instance replacement.
+- **Intentional type change**: use `pulumi up --replace <urn>` to force recreation.
+
+## Retryable errors
+
+The following AWS errors trigger fallback to the next type/AZ combination:
+
+- `InsufficientInstanceCapacity` -- no capacity for this type in this AZ
+- `InstanceLimitExceeded` -- account limit reached for this type
+- `Unsupported` -- instance type not supported in this AZ
+- `InsufficientFreeAddressesInSubnet` -- subnet has no available IPs
+
+Any other error (e.g. `UnauthorizedOperation`) is raised immediately.
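
The README's "The solution" and "Retryable errors" sections describe a pre-flight offerings check followed by a fallback loop. The provider module that implements them is not part of this view, so the following is only a minimal sketch of how that flow might look. `LaunchError`, `offered_pairs`, and `launch_with_fallback` are hypothetical names, the iteration order (types outer, subnets inner) is an assumption, and `ec2_client` is assumed to be a boto3 EC2 client (with boto3, the error code would come from `ClientError.response["Error"]["Code"]`).

```python
from typing import Callable, List, Set, Tuple

# Error codes the README lists as retryable.
RETRYABLE_CODES = {
    "InsufficientInstanceCapacity",
    "InstanceLimitExceeded",
    "Unsupported",
    "InsufficientFreeAddressesInSubnet",
}


class LaunchError(Exception):
    """Hypothetical stand-in for botocore's ClientError; carries an AWS error code."""

    def __init__(self, code: str):
        super().__init__(code)
        self.code = code


def offered_pairs(ec2_client, instance_types: List[str], zones: List[str]) -> Set[Tuple[str, str]]:
    """Return the (instance_type, zone) pairs that EC2 reports as offered.

    Pagination via NextToken is omitted for brevity.
    """
    response = ec2_client.describe_instance_type_offerings(
        LocationType="availability-zone",
        Filters=[
            {"Name": "instance-type", "Values": instance_types},
            {"Name": "location", "Values": zones},
        ],
    )
    return {(o["InstanceType"], o["Location"]) for o in response["InstanceTypeOfferings"]}


def launch_with_fallback(
    instance_types: List[str],
    subnets: List[Tuple[str, str]],  # (subnet_id, zone) pairs
    offered: Set[Tuple[str, str]],
    launch: Callable[[str, str], str],  # (instance_type, subnet_id) -> instance_id
) -> Tuple[str, str, str]:
    """Try each offered type/subnet pair in priority order.

    Returns (instance_id, instance_type, subnet_id) for the first success.
    """
    attempts = []
    for instance_type in instance_types:
        for subnet_id, zone in subnets:
            if (instance_type, zone) not in offered:
                continue  # pre-flight check: type is not offered in this AZ
            try:
                return launch(instance_type, subnet_id), instance_type, subnet_id
            except LaunchError as exc:
                if exc.code not in RETRYABLE_CODES:
                    raise  # e.g. UnauthorizedOperation: fail immediately
                attempts.append(f"{instance_type}/{subnet_id}: {exc.code}")
    raise RuntimeError("all combinations failed: " + "; ".join(attempts))
```

Passing the launcher in as a callable keeps the loop independent of boto3, so the retry policy can be exercised with a fake launcher before it is wired to `run_instances`.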

examples/basic/Pulumi.yaml

Lines changed: 8 additions & 0 deletions
@@ -0,0 +1,8 @@
+name: ec2-capacity-fallback-example
+runtime:
+  name: python
+  options:
+    virtualenv: venv
+description: Example of launching EC2 instances with capacity fallback
+config:
+  aws:region: us-east-1

examples/basic/__main__.py

Lines changed: 49 additions & 0 deletions
@@ -0,0 +1,49 @@
+"""Basic example: launch a GPU instance with capacity fallback."""
+
+import pulumi
+import pulumi_aws as aws
+
+from pulumi_ec2_capacity_fallback import (
+    ResilientInstance,
+    ResilientInstanceArgs,
+    BlockDeviceConfig,
+)
+
+# Look up the latest Ubuntu 24.04 AMI
+ami = aws.ec2.get_ami(
+    most_recent=True,
+    owners=["099720109477"],
+    filters=[{"name": "name", "values": ["ubuntu/images/hvm-ssd-gp3/ubuntu-noble-24.04-amd64-server-*"]}],
+)
+
+# Create a GPU instance with fallback types
+gpu_instance = ResilientInstance(
+    "my-gpu-node",
+    args=ResilientInstanceArgs(
+        # Types are tried in order. If g6.xlarge has no capacity,
+        # g6e.xlarge is tried, then g5.xlarge, etc.
+        instance_types=["g6.xlarge", "g6e.xlarge", "g5.xlarge"],
+        subnet_ids=["subnet-abc123", "subnet-def456"],
+        ami_id=ami.id,
+        security_group_ids=["sg-abc123"],
+        instance_profile_name="my-instance-profile",
+        key_name="my-ssh-key",
+        root_block_device=BlockDeviceConfig(
+            volume_size=100,
+            volume_type="gp3",
+            encrypted=True,
+        ),
+        tags={
+            "Name": "my-gpu-node",
+            "Environment": "dev",
+        },
+        # Only try AZs ending in 'a' or 'b' (default)
+        az_suffixes=["a", "b"],
+    ),
+    region="us-east-1",
+)
+
+pulumi.export("instance_id", gpu_instance.instance_id)
+pulumi.export("private_ip", gpu_instance.private_ip)
+pulumi.export("launched_type", gpu_instance.launched_instance_type)
+pulumi.export("availability_zone", gpu_instance.availability_zone)
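
The example above restricts candidates with `az_suffixes`, and the README also documents `prefer_least_used_subnet`. How the provider applies these is not shown in this view, so here is a hedged sketch of one plausible ordering step. `order_subnets` is a hypothetical helper, and the dict keys (`SubnetId`, `AvailabilityZone`, `AvailableIpAddressCount`) assume records shaped like the EC2 `describe_subnets` response.

```python
from typing import Dict, List, Optional


def order_subnets(
    subnets: List[Dict],
    az_suffixes: Optional[List[str]] = None,
    prefer_least_used_subnet: bool = False,
) -> List[Dict]:
    """Filter subnets by AZ suffix, then optionally put the emptiest first."""
    if az_suffixes is not None:
        # Keep only subnets whose AZ name ends with one of the suffixes,
        # e.g. "us-east-1a" matches the suffix "a".
        subnets = [
            s for s in subnets
            if s["AvailabilityZone"].endswith(tuple(az_suffixes))
        ]
    if prefer_least_used_subnet:
        # "Least used" is read here as "most free IP addresses".
        subnets = sorted(
            subnets, key=lambda s: s["AvailableIpAddressCount"], reverse=True
        )
    return subnets
```

With `az_suffixes=None` no filtering happens, which matches the README's note that `None` means all AZs. When `prefer_least_used_subnet` is off, the caller's original subnet order is preserved.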
Lines changed: 9 additions & 0 deletions
@@ -0,0 +1,9 @@
+from .component import ResilientInstance, ResilientInstanceRaw
+from .types import BlockDeviceConfig, ResilientInstanceArgs
+
+__all__ = [
+    "ResilientInstance",
+    "ResilientInstanceRaw",
+    "ResilientInstanceArgs",
+    "BlockDeviceConfig",
+]
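
The `.types` module imported above is not shown in this view. Going only by the README's Inputs table and the fields read elsewhere in this commit, the two dataclasses might look roughly like the sketch below. The defaults for `encrypted`, `delete_on_termination`, `iops`, and `throughput` are assumptions, since the README does not state them.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class BlockDeviceConfig:
    """Root volume settings (hypothetical reconstruction)."""

    volume_size: int = 8                # GiB; README default is "8GB gp3"
    volume_type: str = "gp3"
    encrypted: bool = True              # assumed default
    delete_on_termination: bool = True  # assumed default
    iops: Optional[int] = None          # assumed default
    throughput: Optional[int] = None    # assumed default


@dataclass
class ResilientInstanceArgs:
    """Typed inputs mirroring the README's Inputs table."""

    instance_types: List[str]
    subnet_ids: List[str]
    ami_id: str
    security_group_ids: List[str]
    instance_profile_name: str = ""
    key_name: str = ""
    root_block_device: BlockDeviceConfig = field(default_factory=BlockDeviceConfig)
    user_data: str = ""
    tags: Dict[str, str] = field(default_factory=dict)
    # None means "do not filter by AZ suffix".
    az_suffixes: Optional[List[str]] = field(default_factory=lambda: ["a", "b"])
    prefer_least_used_subnet: bool = False
```

Using `default_factory` for the mutable defaults means each instance gets its own `tags` dict and `az_suffixes` list rather than sharing one object across instances.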
Lines changed: 157 additions & 0 deletions
@@ -0,0 +1,157 @@
+"""Resilient EC2 instance component resource.
+
+Wraps the dynamic provider in ``pulumi.dynamic.Resource`` subclasses for a clean
+user-facing API with typed outputs.
+"""
+
+import pulumi
+from pulumi.dynamic import Resource
+from typing import Any, Dict, Optional
+
+from .provider import ResilientInstanceProvider
+from .types import ResilientInstanceArgs
+
+
+class ResilientInstance(Resource):
+    """EC2 instance with automatic fallback across instance types and AZs.
+
+    On create, tries each instance type in priority order across available
+    subnets/AZs. If a launch fails with InsufficientInstanceCapacity or
+    similar capacity errors, automatically tries the next type/AZ combination.
+
+    Once created, the instance type is locked. Re-running ``pulumi up`` with
+    different type preferences will not replace the instance. Use
+    ``pulumi up --replace <urn>`` to force recreation with current preferences.
+
+    Example::
+
+        instance = ResilientInstance(
+            "my-gpu-node",
+            args=ResilientInstanceArgs(
+                instance_types=["g6.xlarge", "g6e.xlarge", "g5.xlarge"],
+                subnet_ids=["subnet-abc123", "subnet-def456"],
+                ami_id="ami-0123456789abcdef0",
+                security_group_ids=["sg-abc123"],
+                instance_profile_name="my-profile",
+                tags={"Name": "my-gpu-node", "Environment": "dev"},
+            ),
+            region="us-east-1",
+        )
+
+        pulumi.export("instance_id", instance.instance_id)
+        pulumi.export("launched_type", instance.launched_instance_type)
+    """
+
+    instance_id: pulumi.Output[str]
+    private_ip: pulumi.Output[str]
+    public_ip: pulumi.Output[str]
+    availability_zone: pulumi.Output[str]
+    launched_instance_type: pulumi.Output[str]
+    launched_subnet_id: pulumi.Output[str]
+
+    def __init__(
+        self,
+        name: str,
+        args: ResilientInstanceArgs,
+        region: str,
+        opts: Optional[pulumi.ResourceOptions] = None,
+    ):
+        """Create a resilient EC2 instance.
+
+        Args:
+            name: Pulumi resource name.
+            args: Instance configuration.
+            region: AWS region to launch in.
+            opts: Pulumi resource options.
+        """
+        props = self._build_props(args, region)
+
+        super().__init__(
+            ResilientInstanceProvider(),
+            name,
+            {
+                "instance_id": None,
+                "private_ip": None,
+                "public_ip": None,
+                "availability_zone": None,
+                "launched_instance_type": None,
+                "launched_subnet_id": None,
+                **props,
+            },
+            opts,
+        )
+
+    @staticmethod
+    def _build_props(args: ResilientInstanceArgs, region: str) -> Dict[str, Any]:
+        """Convert typed args to the dict expected by the provider."""
+        return {
+            "instance_types": args.instance_types,
+            "subnet_ids": args.subnet_ids,
+            "ami_id": args.ami_id,
+            "security_group_ids": args.security_group_ids,
+            "instance_profile_name": args.instance_profile_name,
+            "key_name": args.key_name,
+            "root_block_device": {
+                "volume_size": args.root_block_device.volume_size,
+                "volume_type": args.root_block_device.volume_type,
+                "encrypted": args.root_block_device.encrypted,
+                "delete_on_termination": args.root_block_device.delete_on_termination,
+                "iops": args.root_block_device.iops,
+                "throughput": args.root_block_device.throughput,
+            },
+            "user_data": args.user_data,
+            "tags": args.tags,
+            "az_suffixes": args.az_suffixes,
+            "prefer_least_used_subnet": args.prefer_least_used_subnet,
+            "region": region,
+        }
+
+
+class ResilientInstanceRaw(Resource):
+    """Lower-level API accepting a plain dict instead of ResilientInstanceArgs.
+
+    Useful when inputs contain Pulumi Outputs that need to be resolved
+    by the framework (e.g. subnet IDs from another resource's output).
+
+    Example::
+
+        instance = ResilientInstanceRaw(
+            "my-node",
+            props={
+                "instance_types": ["m7i.xlarge", "m6i.xlarge"],
+                "subnet_ids": vpc.subnet_ids,  # Output[List[str]]
+                "ami_id": ami.id,  # Output[str]
+                "security_group_ids": [sg.id],
+                "region": "us-east-1",
+                "tags": {"Name": "my-node"},
+            },
+        )
+    """
+
+    instance_id: pulumi.Output[str]
+    private_ip: pulumi.Output[str]
+    public_ip: pulumi.Output[str]
+    availability_zone: pulumi.Output[str]
+    launched_instance_type: pulumi.Output[str]
+    launched_subnet_id: pulumi.Output[str]
+
+    def __init__(
+        self,
+        name: str,
+        props: Dict[str, Any],
+        opts: Optional[pulumi.ResourceOptions] = None,
+    ):
+        super().__init__(
+            ResilientInstanceProvider(),
+            name,
+            {
+                "instance_id": None,
+                "private_ip": None,
+                "public_ip": None,
+                "availability_zone": None,
+                "launched_instance_type": None,
+                "launched_subnet_id": None,
+                **props,
+            },
+            opts,
+        )
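
The docstrings above say the instance type is locked after creation, and the README lists which input changes are ignored, applied in place, or trigger replacement. The provider module that enforces this is not shown here, so the following is only a sketch of how those rules might be expressed. `plan_update` and its return shape are hypothetical. In a real Pulumi dynamic provider this decision would normally live in the provider's `diff` method and be returned as a `pulumi.dynamic.DiffResult`.

```python
from typing import Any, Dict, List


def plan_update(olds: Dict[str, Any], news: Dict[str, Any]) -> Dict[str, Any]:
    """Classify input changes following the README's re-run rules.

    instance_types and subnet_ids are deliberately never compared, so
    changing them in config does not touch the running instance.
    """
    replaces: List[str] = []
    if olds.get("ami_id") != news.get("ami_id"):
        replaces.append("ami_id")  # AMI change: replace the instance

    old_size = (olds.get("root_block_device") or {}).get("volume_size")
    new_size = (news.get("root_block_device") or {}).get("volume_size")
    if old_size != new_size:
        replaces.append("root_block_device.volume_size")  # volume size change: replace

    # Tags and security groups can be changed on a running instance.
    in_place = [
        key for key in ("tags", "security_group_ids")
        if olds.get(key) != news.get(key)
    ]

    return {
        "changes": bool(replaces or in_place),
        "replaces": replaces,
        "update_in_place": in_place,
    }
```

Leaving `instance_types` and `subnet_ids` out of the comparison is what makes the launched type and subnet sticky. Forcing a new type then needs an explicit `pulumi up --replace <urn>`, as the README notes.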
