Commit 6279dfb

another ipv6 attempt, failed.
1 parent 0e20d7b commit 6279dfb

4 files changed: +224 -23 lines changed

IPv6_MIGRATION_NOTES.md

Lines changed: 200 additions & 0 deletions
@@ -0,0 +1,200 @@
1+
# IPv6 Migration Attempt and Module Upgrade Notes
2+
3+
## Executive Summary
4+
5+
**Attempted:** IPv6-only EC2 instances to save ~$7-14/month on public IPv4 charges
6+
**Result:** **REVERTED** - Not viable without NAT64, which on AWS requires a NAT gateway that costs more than the savings
7+
**Successful:** Module version upgrades (ASG, Security Group, AWS Provider)
8+
9+
## What We Tried (October 2025)
10+
11+
### 1. IPv6-Only Configuration Attempt
12+
13+
**Modified:** `terraform/asg.tf`
14+
- Disabled public IPv4: `associate_public_ip_address = false`
15+
- Enabled IPv6: `ipv6_address_count = 1`
16+
- Configured ECS agent and Docker for IPv6
17+
18+
**Infrastructure verified working:**
19+
- ✅ VPC has IPv6 CIDR: `2600:1f16:78e:d400::/56`
20+
- ✅ Subnets have IPv6 CIDRs with auto-assign enabled
21+
- ✅ Route table: `::/0` → Internet Gateway
22+
- ✅ DNS64 enabled on all subnets
23+
- ✅ AWS dual-stack endpoints available:
24+
`ecs.us-east-2.api.aws` → `2600:1f70:6000:c0:...`
25+
`ecr.us-east-2.api.aws` → `2600:1f70:6000:80:...`
26+
`logs.us-east-2.api.aws` → `2600:1f70:6000:200:...`
27+
28+
### Why It Failed
29+
30+
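**Root cause:** DNS64 was enabled, but nothing in the VPC performed **NAT64**. On AWS, NAT64 is a function of the NAT gateway, and no NAT gateway was in the path.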
31+
32+
**What this means:**
33+
- **DNS64** (✅ enabled): For names that have only A records, the resolver synthesizes AAAA records by embedding the IPv4 address in the `64:ff9b::/96` prefix
34+
- **NAT64** (❌ not in place): Translates the actual IPv6 packets to IPv4 for IPv4-only services; on AWS this is done by a NAT gateway
35+
- Result: Instances resolve IPv4-only services to synthesized IPv6 addresses, but the packets time out because no NAT64 gateway is in the path
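
For reference, AWS performs NAT64 on a NAT gateway rather than as a separate service, so the missing piece here was a route sending `64:ff9b::/96` to one. A minimal sketch of that route using the AWS CLI follows. The function name `add_nat64_route` and both IDs are hypothetical, and it assumes a NAT gateway already exists, which this setup does not have.

```bash
# Sketch only: send DNS64-synthesized traffic to a NAT gateway, which performs NAT64.
# Both arguments are placeholders; this repo does not define a NAT gateway.
add_nat64_route() {
  local route_table_id="$1" nat_gateway_id="$2"
  aws ec2 create-route \
    --route-table-id "$route_table_id" \
    --destination-ipv6-cidr-block "64:ff9b::/96" \
    --nat-gateway-id "$nat_gateway_id"
}

# Example (hypothetical IDs):
# add_nat64_route rtb-0123456789abcdef0 nat-0123456789abcdef0
```

A NAT gateway bills a fixed hourly charge plus per-GB data processing, so at 2-4 instances this path likely costs more than the ~$7-14/month it would save.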
36+
37+
**Services that broke:**
38+
- ❌ AWS SSM Agent (IPv4-only): `dial tcp [64:ff9b::392:b12]:443: i/o timeout`
39+
- ❌ ECS container health checks failed
40+
- ❌ Any IPv4-only external dependencies
41+
42+
**Services that worked:**
43+
- ✅ ECS control plane (has dual-stack endpoint)
44+
- ✅ ECR (has dual-stack endpoint)
45+
- ✅ CloudWatch Logs (has dual-stack endpoint)
46+
47+
### 2. Terraform Module Version Upgrades (SUCCESSFUL)
48+
49+
**Successfully Updated Modules:**
50+
51+
| Module | Old Version | New Version | Status |
52+
|--------|-------------|-------------|--------|
53+
| `terraform-aws-modules/autoscaling/aws` | ~> 6.5 | ~> 8.0 | ✅ Applied |
54+
| `terraform-aws-modules/security-group/aws` | ~> 4.0 | ~> 5.3 | ✅ Applied |
55+
| AWS Provider | >= 4.6 | >= 5.0 | ✅ Applied |
56+
| `terraform-aws-modules/ecs/aws` | ~> 4.0 | ~> 4.1 | ✅ Applied (kept at v4 to avoid cluster recreation) |
57+
58+
**Why we didn't go further:**
59+
- ECS v6.x: Breaking API changes (cluster recreation required)
60+
- ASG v9.x: Breaking changes in `mixed_instances_policy` structure
61+
62+
**Installed Versions:**
63+
- AWS Provider: v5.100.0
64+
- ECS Module: v4.1.3
65+
- Autoscaling Module: v8.3.1
66+
- Security Group Module: v5.3.1
67+
68+
## Current Configuration (Post-Revert)
69+
70+
**Final State:**
71+
- ✅ Instances have public IPv4 (reverted from IPv6-only)
72+
- ✅ Instances have IPv6 addresses
73+
- ✅ Dual-stack networking
74+
- ✅ Module upgrades applied
75+
- ❌ No cost savings (still paying for public IPv4)
76+
77+
**Configuration:**
78+
```hcl
79+
# terraform/asg.tf
80+
network_interfaces = [
81+
{
82+
associate_public_ip_address = true # Reverted to true
83+
ipv6_address_count = 1 # Still have IPv6
84+
# ...
85+
}
86+
]
87+
88+
# terraform/ecs.tf - user_data
89+
# Standard ECS config, no IPv6-specific settings
90+
```
91+
92+
## What Would Need to Change for IPv6-Only to Work
93+
94+
**Waiting for AWS to provide:**
95+
96+
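1. **Affordable managed NAT64**
97+
- AWS currently performs NAT64 only on a NAT gateway, which bills a fixed hourly charge plus data processing
98+
- Would allow IPv6-only instances to reach IPv4-only services
99+
- **This is the blocker - at our scale a NAT gateway costs more than the IPv4 addresses it would replace**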
100+
101+
2. **Alternative: All services support dual-stack**
102+
- Every AWS service with IPv6 endpoints
103+
- Particularly: SSM, EC2 Messages, SSM Messages
104+
- Of the services this stack uses, only ECS, ECR, CloudWatch Logs, and S3 had dual-stack endpoints when we tested
105+
106+
**Self-managed workarounds we rejected:**
107+
108+
1. **Deploy NAT64 on EC2** (Jool/Tayga software)
109+
- Cost: ~$3-5/month + maintenance burden
110+
- Complexity: High (setup, monitoring, SPOF)
111+
- Not worth $7-14/month savings
112+
113+
2. **VPC Endpoints for IPv4-only services**
114+
- Cost: ~$7-10/month
115+
- Would eliminate savings
116+
- Previous testing showed higher cost than benefit
117+
118+
3. **Disable SSM entirely**
119+
- Lose remote management capability
120+
- Not acceptable for production
121+
122+
## Lessons Learned
123+
124+
### What We Discovered
125+
126+
1. **DNS64 ≠ NAT64**
127+
- DNS64 only synthesizes DNS answers; it does not translate actual traffic
127+
- Need both DNS64 + NAT64 for IPv6-only to work
128+
- Enabling DNS64 on a subnet does not provide NAT64; on AWS that requires a NAT gateway
130+
131+
2. **Docker IPv6 Configuration Issues**
132+
- Enabling Docker IPv6 (`"ipv6": true`) broke dual-stack networking
133+
- Caused container health check failures
134+
- Required instance refresh to fix
135+
136+
3. **AWS Service IPv6 Support is Inconsistent**
137+
- Some services have dual-stack: ECS, ECR, CloudWatch, S3
138+
- Some services are IPv4-only: SSM, EC2 Messages
139+
- Use `.api.aws` suffix for dual-stack endpoints when available
140+
141+
4. **Cost-Benefit Analysis**
142+
- Potential savings: ~$7-14/month (public IPv4 charges)
143+
- VPC endpoint costs: ~$7-10/month (negates savings)
144+
- Self-managed NAT64: High complexity for minimal savings
145+
- **Conclusion:** Not worth the effort at this scale
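
The savings figure above can be reproduced with a little integer arithmetic. This is a rough sketch, not a billing calculation: the $0.005 per address per hour rate and the 730-hour month are assumptions that should be checked against current AWS pricing, and `ipv4_monthly_cost` is a hypothetical helper name.

```bash
# Estimate the monthly public IPv4 charge for N instances with one address each.
# Assumptions: $0.005 per address per hour, 730 hours per month.
ipv4_monthly_cost() {
  local instances="$1"
  # Work in thousandths of a dollar to stay in integer arithmetic (5 = $0.005).
  local millidollars=$((instances * 5 * 730))
  printf '%d.%02d\n' $((millidollars / 1000)) $(((millidollars % 1000) / 10))
}

ipv4_monthly_cost 2   # prints 7.30
ipv4_monthly_cost 4   # prints 14.60
```

Under those assumptions, 2-4 instances land on the ~$7-14/month range quoted in these notes.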
146+
147+
### Technical Details Documented
148+
149+
**VPC IPv6 Configuration:**
150+
- VPC CIDR: `2600:1f16:78e:d400::/56`
151+
- Subnets: `2600:1f16:78e:d400::/64`, `2600:1f16:78e:d401::/64`, `2600:1f16:78e:d402::/64`
152+
- DNS64 prefix: `64:ff9b::/96`
153+
- Route: `::/0` → `igw-e39ab08a`
154+
155+
**Error signatures to watch for:**
156+
```
157+
dial tcp [64:ff9b::xxx:xxx]:443: i/o timeout
158+
```
159+
This indicates that DNS64 synthesized the address but no NAT64 gateway translated the traffic.
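
To see which IPv4 endpoint a timed-out address stands for, the low 32 bits of a `64:ff9b::/96` address can be decoded back to dotted-quad form. This is a minimal sketch, assuming the compressed `64:ff9b::HHHH:LLLL` form that appears in the logs (a fully expanded address is not handled). `dns64_to_ipv4` is a hypothetical helper name.

```bash
# Recover the IPv4 address embedded in a DNS64-synthesized address.
# Assumes the well-known prefix 64:ff9b::/96 in its compressed form.
dns64_to_ipv4() {
  local addr="${1#\[}"   # tolerate a leading "[" copied from a log line
  addr="${addr%\]}"      # and a trailing "]"
  local suffix="${addr#64:ff9b::}"
  if [ "$suffix" = "$addr" ]; then
    echo "not a 64:ff9b::/96 address: $1" >&2
    return 1
  fi
  local hi="${suffix%%:*}" lo="${suffix##*:}"
  # A single group (e.g. 64:ff9b::b12) means the upper 16 bits are zero.
  [ "$suffix" = "$hi" ] && hi=0
  local h=$((16#$hi)) l=$((16#$lo))
  echo "$((h >> 8)).$((h & 255)).$((l >> 8)).$((l & 255))"
}

dns64_to_ipv4 "64:ff9b::392:b12"   # prints 3.146.11.18
```

The decoded address is the IPv4-only endpoint the instance was trying to reach, which helps identify the service that still lacks a dual-stack endpoint.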
160+
161+
## Future Retry Conditions
162+
163+
**Only attempt IPv6-only again when ONE of these is true:**
164+
165+
1. **Managed NAT64 becomes cheaper than the public IPv4 charges**
166+
- NAT64 currently requires a NAT gateway; monitor AWS announcements for a lower-cost option
167+
- It would need to cost less than the IPv4 addresses it replaces
168+
169+
2. **All required AWS services support dual-stack**
170+
- Specifically need: SSM, EC2 Messages, SSM Messages with IPv6
171+
- Check: https://docs.aws.amazon.com/general/latest/gr/aws-ipv6-support.html
172+
173+
3. **Public IPv4 costs exceed $20-30/month**
174+
- At current scale (2-4 instances), savings too small
175+
- If scale increases significantly, complexity might be worth it
176+
177+
4. **VPC Endpoint costs drop significantly**
178+
- If AWS reduces endpoint pricing below ~$3/month per endpoint
179+
- Would make endpoint solution viable
180+
181+
**How to check service IPv6 support:**
182+
```bash
183+
dig service-name.region.api.aws AAAA +short
184+
# If this returns an IPv6 address, the service has a dual-stack endpoint
185+
```
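
The same check can be run over several services at once. This is a small sketch, assuming `dig` is installed and that the service prefixes passed in match the `<service>.<region>.api.aws` naming used above. `check_dual_stack` is a hypothetical helper name.

```bash
# Report, per service, whether its dual-stack hostname resolves to an AAAA record.
# Usage: check_dual_stack <region> <service>...
check_dual_stack() {
  local region="$1"
  shift
  local svc
  for svc in "$@"; do
    if [ -n "$(dig "${svc}.${region}.api.aws" AAAA +short)" ]; then
      echo "${svc}: dual-stack"
    else
      echo "${svc}: no AAAA record"
    fi
  done
}

# Example:
# check_dual_stack us-east-2 ecs ecr logs ssm
```

An AAAA record only shows that the hostname resolves over IPv6; it is still worth confirming that the agent or SDK in use is configured to call the dual-stack endpoint.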
186+
187+
## Rollback Summary
188+
189+
**What we reverted:**
190+
1. Changed `associate_public_ip_address` back to `true`
191+
2. Removed IPv6-specific ECS agent configuration
192+
3. Removed Docker IPv6 configuration
193+
4. Triggered instance refresh to replace broken instances
194+
195+
**What we kept:**
196+
- IPv6 addressing (instances have both IPv4 and IPv6)
197+
- Module version upgrades
198+
- Updated security group module
199+
200+
**Recovery time:** ~5 minutes for instance refresh to complete

terraform/.terraform.lock.hcl

Lines changed: 17 additions & 17 deletions

terraform/asg.tf

Lines changed: 6 additions & 5 deletions
@@ -8,7 +8,7 @@ data "aws_ssm_parameter" "ecs_optimized_ami" {
 # https://registry.terraform.io/modules/terraform-aws-modules/autoscaling/aws/latest
 module "autoscaling" {
   source = "terraform-aws-modules/autoscaling/aws"
-  version = "~> 6.5"
+  version = "~> 8.0" # v9+ has breaking API changes in mixed_instances_policy

   name = "${local.name}-spot"
   min_size = 2
@@ -77,7 +77,8 @@ module "autoscaling" {
     {
       delete_on_termination = true
       device_index = 0
-      associate_public_ip_address = true
+      associate_public_ip_address = true # set to false to use IPv6 only - still doesn't fully work with SSM and ECS as of Oct 2025
+      ipv6_address_count = 1 # Assign one IPv6 address
       security_groups = [module.autoscaling_sg.security_group_id]
     }
   ]
@@ -128,7 +129,7 @@ module "autoscaling" {
 # https://registry.terraform.io/modules/terraform-aws-modules/security-group/aws/latest
 module "autoscaling_sg" {
   source = "terraform-aws-modules/security-group/aws"
-  version = "~> 4.0"
+  version = "~> 5.3" # Latest version

   name = local.name
   description = "Autoscaling group security group"
@@ -142,7 +143,7 @@ module "autoscaling_sg" {
     }
   ]

-  # Inbound all high ports from the alb
+  # Inbound all high ports from the alb (IPv4 and IPv6)
   ingress_with_source_security_group_id = [
     {
       source_security_group_id = aws_security_group.lb_security_group.id
@@ -152,7 +153,7 @@ module "autoscaling_sg" {
     }
   ]

-  egress_rules = ["all-all"]
+  egress_rules = ["all-all"] # Already includes IPv4 and IPv6 egress

   tags = local.tags
 }

terraform/main.tf

Lines changed: 1 addition & 1 deletion
@@ -4,7 +4,7 @@ terraform {
   required_providers {
     aws = {
       source = "hashicorp/aws"
-      version = ">= 4.6"
+      version = ">= 5.0" # Updated from 4.6 to 5.0
     }
   }
 }
