GremlinLTD
diff --git a/.gitignore b/.gitignore
Lines changed: 12 additions & 0 deletions
diff --git a/README.md b/README.md
Lines changed: 143 additions & 0 deletions
diff --git a/examples/basic/Pulumi.yaml b/examples/basic/Pulumi.yaml
Lines changed: 8 additions & 0 deletions
diff --git a/examples/basic/__main__.py b/examples/basic/__main__.py
Lines changed: 49 additions & 0 deletions
diff --git a/pulumi_ec2_capacity_fallback/__init__.py b/pulumi_ec2_capacity_fallback/__init__.py
Lines changed: 9 additions & 0 deletions
diff --git a/pulumi_ec2_capacity_fallback/component.py b/pulumi_ec2_capacity_fallback/component.py
Lines changed: 157 additions & 0 deletions
@@ -0,0 +1,12 @@
+__pycache__/
+*.py[cod]
+*$py.class
+*.egg-info/
+dist/
+build/
+venv/
+.venv/
+*.egg
+.pytest_cache/
+.coverage
+htmlcov/
@@ -0,0 +1,143 @@
+# pulumi-aws-ec2-capacity-fallback
+
+A Pulumi component that launches EC2 instances with automatic fallback across instance types and availability zones when AWS returns capacity errors.
+
+## The problem
+
+When launching GPU instances (g6, g5, p5, etc.), AWS frequently returns `InsufficientInstanceCapacity` because GPU capacity is limited and unevenly distributed across AZs. This causes `pulumi up` to fail, requiring manual intervention to try a different instance type or AZ.
+
+## The solution
+
+This component wraps EC2 instance creation with retry logic. You provide an ordered list of instance types, and the component:
+
+1. Checks `describe_instance_type_offerings` to skip types not offered in the target AZs
+2. Attempts to launch each remaining type/AZ combination via `run_instances`
+3. On a retryable error (`InsufficientInstanceCapacity`, `InstanceLimitExceeded`, `Unsupported`, or `InsufficientFreeAddressesInSubnet`), automatically tries the next combination
+4. Once launched, the instance type is locked -- subsequent `pulumi up` runs will not replace the instance even if a different type is now preferred
+
+## Features
+
+- Automatic fallback across multiple instance types in priority order
+- Automatic fallback across multiple availability zones
+- Pre-flight offerings check to skip types not available in target AZs
+- AZ filtering by suffix (e.g. restrict to AZs ending in `a` or `b`)
+- Least-used subnet selection for balanced distribution
+- In-place tag and security group updates without instance replacement
+- Instance type is locked after creation (use `pulumi up --replace <urn>` to force a change)
+
+## Installation
+
+```bash
+uv add pulumi-aws-ec2-capacity-fallback
+```
+
+Or with pip:
+
+```bash
+pip install pulumi-aws-ec2-capacity-fallback
+```
+
+## Usage
+
+### Typed API (recommended)
+
+```python
+from pulumi_ec2_capacity_fallback import (
+    ResilientInstance,
+    ResilientInstanceArgs,
+    BlockDeviceConfig,
+)
+
+instance = ResilientInstance(
+    "my-gpu-node",
+    args=ResilientInstanceArgs(
+        instance_types=["g6.xlarge", "g6e.xlarge", "g5.xlarge"],
+        subnet_ids=["subnet-abc123", "subnet-def456"],
+        ami_id="ami-0123456789abcdef0",
+        security_group_ids=["sg-abc123"],
+        instance_profile_name="my-profile",
+        key_name="my-key",
+        root_block_device=BlockDeviceConfig(volume_size=100),
+        tags={"Name": "my-gpu-node"},
+    ),
+    region="us-east-1",
+)
+```
+
+### Raw API (for Pulumi Output inputs)
+
+When your inputs are Pulumi Outputs (e.g. subnet IDs from another resource), use `ResilientInstanceRaw`:
+
+```python
+from pulumi_ec2_capacity_fallback import ResilientInstanceRaw
+
+instance = ResilientInstanceRaw(
+    "my-node",
+    props={
+        "instance_types": ["g6.xlarge", "g5.xlarge"],
+        "subnet_ids": vpc_component.private_subnet_ids,  # Output[List[str]]
+        "ami_id": ami.id,                                 # Output[str]
+        "security_group_ids": [sg.id],
+        "instance_profile_name": profile.name,
+        "key_name": "my-key",
+        "root_block_device": {
+            "volume_size": 100,
+            "volume_type": "gp3",
+            "encrypted": True,
+            "delete_on_termination": True,
+        },
+        "tags": {"Name": "my-node"},
+        "region": "us-east-1",
+    },
+)
+```
+
+## Inputs
+
+| Name | Type | Default | Description |
+|------|------|---------|-------------|
+| `instance_types` | `List[str]` | required | Ordered list of instance types to try (primary first) |
+| `subnet_ids` | `List[str]` | required | Subnet IDs to try across AZs |
+| `ami_id` | `str` | required | AMI ID to launch |
+| `security_group_ids` | `List[str]` | required | Security group IDs to attach |
+| `instance_profile_name` | `str` | `""` | IAM instance profile name |
+| `key_name` | `str` | `""` | SSH key pair name |
+| `root_block_device` | `BlockDeviceConfig` | 8 GiB gp3 | Root volume configuration |
+| `user_data` | `str` | `""` | User data script |
+| `tags` | `Dict[str, str]` | `{}` | Tags for the instance |
+| `az_suffixes` | `List[str]` | `["a", "b"]` | Restrict to AZs ending with these suffixes. Set to `None` for all AZs |
+| `prefer_least_used_subnet` | `bool` | `False` | Pick the subnet with the most available IPs |
+
+## Outputs
+
+| Name | Type | Description |
+|------|------|-------------|
+| `instance_id` | `str` | AWS EC2 instance ID |
+| `private_ip` | `str` | Private IP address |
+| `public_ip` | `str` | Public IP address (if applicable) |
+| `availability_zone` | `str` | AZ where the instance was launched |
+| `launched_instance_type` | `str` | The actual instance type that was launched |
+| `launched_subnet_id` | `str` | The actual subnet used |
+
+## How it handles re-runs
+
+The component is designed to be safe on subsequent `pulumi up` runs:
+
+- **Instance type changes in config**: ignored. The existing instance keeps its launched type.
+- **Subnet/AZ changes in config**: ignored. The existing instance stays in its launched subnet.
+- **Tag changes**: applied in-place (no instance replacement).
+- **Security group changes**: applied in-place.
+- **AMI changes**: trigger instance replacement (delete + create with fallback).
+- **Volume size changes**: trigger instance replacement.
+- **Intentional type change**: use `pulumi up --replace <urn>` to force recreation.
+
+## Retryable errors
+
+The following AWS errors trigger fallback to the next type/AZ combination:
+
+- `InsufficientInstanceCapacity` -- no capacity for this type in this AZ
+- `InstanceLimitExceeded` -- account limit reached for this type
+- `Unsupported` -- instance type not supported in this AZ
+- `InsufficientFreeAddressesInSubnet` -- subnet has no available IPs
+
+Any other error (e.g. `UnauthorizedOperation`) is raised immediately.
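Reviewer sketch (not part of the diff): the fallback behaviour described in the README could look roughly like the loop below. This is a minimal illustration, not the provider's actual code. `LaunchError` and the `launch` callable are hypothetical stand-ins for botocore's `ClientError` and the `run_instances` call, and the choice of instance types as the outer loop is an assumption.

```python
from typing import Callable, List, Tuple

# Error codes the README lists as retryable.
RETRYABLE_CODES = frozenset({
    "InsufficientInstanceCapacity",
    "InstanceLimitExceeded",
    "Unsupported",
    "InsufficientFreeAddressesInSubnet",
})


class LaunchError(Exception):
    """Hypothetical stand-in for an AWS client error; carries the error code."""

    def __init__(self, code: str, message: str = "") -> None:
        super().__init__(f"{code}: {message}")
        self.code = code


def launch_with_fallback(
    instance_types: List[str],
    subnet_ids: List[str],
    launch: Callable[[str, str], str],
) -> Tuple[str, str, str]:
    """Try each (type, subnet) pair in priority order.

    Returns (instance_id, instance_type, subnet_id) for the first success.
    """
    attempts: List[str] = []
    for instance_type in instance_types:      # assumed: types are the outer loop
        for subnet_id in subnet_ids:
            try:
                instance_id = launch(instance_type, subnet_id)
            except LaunchError as err:
                if err.code not in RETRYABLE_CODES:
                    raise  # e.g. UnauthorizedOperation fails immediately
                attempts.append(f"{instance_type}/{subnet_id}: {err.code}")
                continue
            return instance_id, instance_type, subnet_id
    raise RuntimeError("All type/subnet combinations failed: " + "; ".join(attempts))
```

Passing the launch call in as a function keeps the loop testable without AWS credentials; a real provider would presumably wrap the EC2 client call and map its error code onto the same check.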
@@ -0,0 +1,8 @@
+name: ec2-capacity-fallback-example
+runtime:
+  name: python
+  options:
+    virtualenv: venv
+description: Example of launching EC2 instances with capacity fallback
+config:
+  aws:region: us-east-1
@@ -0,0 +1,49 @@
+"""Basic example: launch a GPU instance with capacity fallback."""
+
+import pulumi
+import pulumi_aws as aws
+
+from pulumi_ec2_capacity_fallback import (
+    ResilientInstance,
+    ResilientInstanceArgs,
+    BlockDeviceConfig,
+)
+
+# Look up the latest Ubuntu 24.04 AMI
+ami = aws.ec2.get_ami(
+    most_recent=True,
+    owners=["099720109477"],
+    filters=[{"name": "name", "values": ["ubuntu/images/hvm-ssd-gp3/ubuntu-noble-24.04-amd64-server-*"]}],
+)
+
+# Create a GPU instance with fallback types
+gpu_instance = ResilientInstance(
+    "my-gpu-node",
+    args=ResilientInstanceArgs(
+        # Types are tried in order. If g6.xlarge has no capacity,
+        # g6e.xlarge is tried, then g5.xlarge, etc.
+        instance_types=["g6.xlarge", "g6e.xlarge", "g5.xlarge"],
+        subnet_ids=["subnet-abc123", "subnet-def456"],
+        ami_id=ami.id,
+        security_group_ids=["sg-abc123"],
+        instance_profile_name="my-instance-profile",
+        key_name="my-ssh-key",
+        root_block_device=BlockDeviceConfig(
+            volume_size=100,
+            volume_type="gp3",
+            encrypted=True,
+        ),
+        tags={
+            "Name": "my-gpu-node",
+            "Environment": "dev",
+        },
+        # Only try AZs ending in 'a' or 'b' (default)
+        az_suffixes=["a", "b"],
+    ),
+    region="us-east-1",
+)
+
+pulumi.export("instance_id", gpu_instance.instance_id)
+pulumi.export("private_ip", gpu_instance.private_ip)
+pulumi.export("launched_type", gpu_instance.launched_instance_type)
+pulumi.export("availability_zone", gpu_instance.availability_zone)
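Reviewer sketch (not part of the diff): the `az_suffixes` and `prefer_least_used_subnet` options used in this example could be implemented along the lines below. Both helper names are hypothetical, and the subnet-to-AZ and free-IP mappings are assumed to have been fetched beforehand (for instance from a `describe_subnets` call); the provider may do this differently.

```python
from typing import Dict, List, Optional


def filter_subnets_by_az_suffix(
    subnet_azs: Dict[str, str],
    az_suffixes: Optional[List[str]],
) -> List[str]:
    """Keep subnets whose AZ name ends with one of the given suffixes.

    ``subnet_azs`` maps subnet ID -> AZ name. ``None`` means "all AZs".
    """
    if az_suffixes is None:
        return list(subnet_azs)
    suffixes = tuple(az_suffixes)  # str.endswith accepts a tuple of suffixes
    return [sid for sid, az in subnet_azs.items() if az.endswith(suffixes)]


def order_by_free_ips(subnet_ids: List[str], free_ips: Dict[str, int]) -> List[str]:
    """Most available IPs first; ties keep the caller's original order."""
    return sorted(subnet_ids, key=lambda sid: -free_ips.get(sid, 0))
```

With the default `["a", "b"]`, a subnet in `us-east-1c` would be dropped before any launch attempt is made.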
@@ -0,0 +1,9 @@
+from .component import ResilientInstance, ResilientInstanceRaw
+from .types import BlockDeviceConfig, ResilientInstanceArgs
+
+__all__ = [
+    "ResilientInstance",
+    "ResilientInstanceRaw",
+    "ResilientInstanceArgs",
+    "BlockDeviceConfig",
+]
@@ -0,0 +1,157 @@
+"""Resilient EC2 instance resources.
+
+Wraps the dynamic provider in ``pulumi.dynamic.Resource`` subclasses for a
+clean user-facing API with typed outputs.
+"""
+
+import pulumi
+from pulumi.dynamic import Resource
+from typing import Any, Dict, Optional
+
+from .provider import ResilientInstanceProvider
+from .types import ResilientInstanceArgs
+
+
+class ResilientInstance(Resource):
+    """EC2 instance with automatic fallback across instance types and AZs.
+
+    On create, tries each instance type in priority order across available
+    subnets/AZs. If a launch fails with InsufficientInstanceCapacity or
+    similar capacity errors, automatically tries the next type/AZ combination.
+
+    Once created, the instance type is locked. Re-running ``pulumi up`` with
+    different type preferences will not replace the instance. Use
+    ``pulumi up --replace <urn>`` to force recreation with current preferences.
+
+    Example::
+
+        instance = ResilientInstance(
+            "my-gpu-node",
+            args=ResilientInstanceArgs(
+                instance_types=["g6.xlarge", "g6e.xlarge", "g5.xlarge"],
+                subnet_ids=["subnet-abc123", "subnet-def456"],
+                ami_id="ami-0123456789abcdef0",
+                security_group_ids=["sg-abc123"],
+                instance_profile_name="my-profile",
+                tags={"Name": "my-gpu-node", "Environment": "dev"},
+            ),
+            region="us-east-1",
+        )
+
+        pulumi.export("instance_id", instance.instance_id)
+        pulumi.export("launched_type", instance.launched_instance_type)
+    """
+
+    instance_id: pulumi.Output[str]
+    private_ip: pulumi.Output[str]
+    public_ip: pulumi.Output[str]
+    availability_zone: pulumi.Output[str]
+    launched_instance_type: pulumi.Output[str]
+    launched_subnet_id: pulumi.Output[str]
+
+    def __init__(
+        self,
+        name: str,
+        args: ResilientInstanceArgs,
+        region: str,
+        opts: Optional[pulumi.ResourceOptions] = None,
+    ):
+        """Create a resilient EC2 instance.
+
+        Args:
+            name: Pulumi resource name.
+            args: Instance configuration.
+            region: AWS region to launch in.
+            opts: Pulumi resource options.
+        """
+        props = self._build_props(args, region)
+
+        super().__init__(
+            ResilientInstanceProvider(),
+            name,
+            {
+                "instance_id": None,
+                "private_ip": None,
+                "public_ip": None,
+                "availability_zone": None,
+                "launched_instance_type": None,
+                "launched_subnet_id": None,
+                **props,
+            },
+            opts,
+        )
+
+    @staticmethod
+    def _build_props(args: ResilientInstanceArgs, region: str) -> Dict[str, Any]:
+        """Convert typed args to the dict expected by the provider."""
+        return {
+            "instance_types": args.instance_types,
+            "subnet_ids": args.subnet_ids,
+            "ami_id": args.ami_id,
+            "security_group_ids": args.security_group_ids,
+            "instance_profile_name": args.instance_profile_name,
+            "key_name": args.key_name,
+            "root_block_device": {
+                "volume_size": args.root_block_device.volume_size,
+                "volume_type": args.root_block_device.volume_type,
+                "encrypted": args.root_block_device.encrypted,
+                "delete_on_termination": args.root_block_device.delete_on_termination,
+                "iops": args.root_block_device.iops,
+                "throughput": args.root_block_device.throughput,
+            },
+            "user_data": args.user_data,
+            "tags": args.tags,
+            "az_suffixes": args.az_suffixes,
+            "prefer_least_used_subnet": args.prefer_least_used_subnet,
+            "region": region,
+        }
+
+
+class ResilientInstanceRaw(Resource):
+    """Lower-level API accepting a plain dict instead of ResilientInstanceArgs.
+
+    Useful when inputs contain Pulumi Outputs that need to be resolved
+    by the framework (e.g. subnet IDs from another resource's output).
+
+    Example::
+
+        instance = ResilientInstanceRaw(
+            "my-node",
+            props={
+                "instance_types": ["m7i.xlarge", "m6i.xlarge"],
+                "subnet_ids": vpc.subnet_ids,  # Output[List[str]]
+                "ami_id": ami.id,               # Output[str]
+                "security_group_ids": [sg.id],
+                "region": "us-east-1",
+                "tags": {"Name": "my-node"},
+            },
+        )
+    """
+
+    instance_id: pulumi.Output[str]
+    private_ip: pulumi.Output[str]
+    public_ip: pulumi.Output[str]
+    availability_zone: pulumi.Output[str]
+    launched_instance_type: pulumi.Output[str]
+    launched_subnet_id: pulumi.Output[str]
+
+    def __init__(
+        self,
+        name: str,
+        props: Dict[str, Any],
+        opts: Optional[pulumi.ResourceOptions] = None,
+    ):
+        super().__init__(
+            ResilientInstanceProvider(),
+            name,
+            {
+                "instance_id": None,
+                "private_ip": None,
+                "public_ip": None,
+                "availability_zone": None,
+                "launched_instance_type": None,
+                "launched_subnet_id": None,
+                **props,
+            },
+            opts,
+        )
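Reviewer sketch (not part of the diff): the provider module is not shown here, so the following is only a guess at how the re-run rules from the README ("type locked", "tags in place", "AMI replaces") might be classified. `RerunPlan` and `plan_rerun` are hypothetical names. In a real dynamic provider the result would presumably feed a `pulumi.dynamic.DiffResult`, with `replaces` listing the keys that force recreation.

```python
from typing import Any, Dict, List, NamedTuple, Optional


class RerunPlan(NamedTuple):
    replaces: List[str]  # changed keys that force delete + create
    updates: List[str]   # changed keys applied in place
    ignored: List[str]   # changed keys deliberately left alone


def _volume_size(props: Dict[str, Any]) -> Optional[int]:
    return (props.get("root_block_device") or {}).get("volume_size")


def plan_rerun(old: Dict[str, Any], new: Dict[str, Any]) -> RerunPlan:
    """Classify config changes according to the README's re-run rules."""
    replaces: List[str] = []
    updates: List[str] = []
    ignored: List[str] = []

    if old.get("ami_id") != new.get("ami_id"):
        replaces.append("ami_id")
    if _volume_size(old) != _volume_size(new):
        replaces.append("root_block_device.volume_size")

    for key in ("tags", "security_group_ids"):
        if old.get(key) != new.get(key):
            updates.append(key)

    # The launched type and subnet are locked once the instance exists.
    for key in ("instance_types", "subnet_ids"):
        if old.get(key) != new.get(key):
            ignored.append(key)

    return RerunPlan(replaces, updates, ignored)
```

Reordering `instance_types` alone therefore yields no replacement and no update, which matches the documented need for `pulumi up --replace <urn>` when a type change is intended.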