
Commit 89e6172

irvingpop and Irving Popovetsky authored
Backend API and Pybot maintenance (#184)
* bringing down the API
  Signed-off-by: Irving Popovetsky <irving@honeycomb.io>
* Try to reduce spot availability issues
  Signed-off-by: Irving Popovetsky <irving@honeycomb.io>
* Switch to arm due to amd64 spot unreliability
  Signed-off-by: Irving Popovetsky <irving@honeycomb.io>
* Update back-end image to the latest one in ECR if needed
  Signed-off-by: Irving Popovetsky <irving@honeycomb.io>
* API still up for now
  Signed-off-by: Irving Popovetsky <irving@honeycomb.io>
* Fix an issue where the pg driver was confused about the current timezone
  Signed-off-by: Irving Popovetsky <irving@honeycomb.io>
* switch to GHA because circleci is clearly wonky
  Signed-off-by: Irving Popovetsky <irving@honeycomb.io>
* another ipv6 attempt, failed.
* don't double-run GHA actions
* fix comments failing fmt check
  Signed-off-by: Irving Popovetsky <irving@honeycomb.io>
* make the action more efficient
* update access IP
  Signed-off-by: Irving Popovetsky <irving@honeycomb.io>
* re-enable staging backend and update prod backend docker image tag
  Signed-off-by: Irving Popovetsky <irving@honeycomb.io>
* adjust node group to always have 1 node at market price, health check to use curl now
  Signed-off-by: Irving Popovetsky <irving@popovetsky.com>

---------

Signed-off-by: Irving Popovetsky <irving@honeycomb.io>
Signed-off-by: Irving Popovetsky <irving@popovetsky.com>
Co-authored-by: Irving Popovetsky <irving@honeycomb.io>
1 parent 7219c84 commit 89e6172

20 files changed: +361 -217 lines changed


.circleci/config.yml

Lines changed: 0 additions & 16 deletions
This file was deleted.

.github/workflows/terraform.yml

Lines changed: 30 additions & 0 deletions
@@ -0,0 +1,30 @@
name: Terraform validation

on:
  push:
    branches:
      - main
  pull_request:

jobs:
  terraform:
    name: Terraform format and validate
    runs-on: ubuntu-latest
    defaults:
      run:
        working-directory: terraform
    steps:
      - name: Checkout
        uses: actions/checkout@v4

      - name: Setup Terraform
        uses: hashicorp/setup-terraform@v3

      - name: Terraform Format
        run: terraform fmt -check -recursive

      - name: Terraform Init
        run: terraform init -backend=false

      - name: Terraform Validate
        run: terraform validate

IPv6_MIGRATION_NOTES.md

Lines changed: 212 additions & 0 deletions
@@ -0,0 +1,212 @@
# IPv6 Migration Attempt and Module Upgrade Notes

## Executive Summary

**Attempted:** IPv6-only EC2 instances to save ~$7-14/month on public IPv4 charges
**Result:** **REVERTED** - Not viable: NAT64 is only available through a NAT Gateway, which costs more than the savings, and SSM still requires IPv4
**Successful:** Module version upgrades (ASG, Security Group, AWS Provider)

## What We Tried (October 2025)

### 1. IPv6-Only Configuration Attempt

**Modified:** `terraform/asg.tf`
- Disabled public IPv4: `associate_public_ip_address = false`
- Enabled IPv6: `ipv6_address_count = 1`
- Configured ECS agent and Docker for IPv6

**Infrastructure verified working:**
- ✅ VPC has IPv6 CIDR: `2600:1f16:78e:d400::/56`
- ✅ Subnets have IPv6 CIDRs with auto-assign enabled
- ✅ Route table: `::/0` → Internet Gateway
- ✅ DNS64 enabled on all subnets
- ✅ AWS dual-stack endpoints available:
  - `ecs.us-east-2.api.aws` → `2600:1f70:6000:c0:...`
  - `ecr.us-east-2.api.aws` → `2600:1f70:6000:80:...`
  - `logs.us-east-2.api.aws` → `2600:1f70:6000:200:...`

### Why It Failed

**Root cause:** NAT64 requires NAT Gateway, which negates cost savings

**What this means:**
- **DNS64** (✅ provided): Synthesizes AAAA records for IPv4-only hostnames by embedding the IPv4 address from the A record in the `64:ff9b::/96` prefix
- **NAT64** (✅ available via NAT Gateway): AWS NAT Gateway supports NAT64 translation when routing `64:ff9b::/96` traffic through it
- **The problem**: NAT Gateway costs ~$32+/month base, which exceeds the ~$7-14/month we'd save on public IPv4 addresses
- Additionally, SSM still requires IPv4 connectivity regardless of NAT64
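
To make the DNS64 step concrete, here is a minimal Python sketch of the address arithmetic only. It assumes the well-known `64:ff9b::/96` prefix quoted above; the function name `synthesize_aaaa` is hypothetical, and this models neither a real resolver nor AWS's implementation.

```python
import ipaddress

# Well-known DNS64/NAT64 prefix, as quoted in these notes.
WELL_KNOWN_PREFIX = ipaddress.IPv6Network("64:ff9b::/96")


def synthesize_aaaa(ipv4, prefix=WELL_KNOWN_PREFIX):
    """Embed an IPv4 address in the low 32 bits of a /96 prefix.

    Hypothetical helper for illustration: it mirrors what a DNS64
    resolver returns for a hostname that only has an A record.
    """
    if prefix.prefixlen != 96:
        raise ValueError("this sketch only handles /96 prefixes")
    v4 = ipaddress.IPv4Address(ipv4)
    return ipaddress.IPv6Address(int(prefix.network_address) | int(v4))


if __name__ == "__main__":
    # 192.0.2.33 is a documentation-range address, not a real endpoint.
    print(synthesize_aaaa("192.0.2.33"))  # prints 64:ff9b::c000:221
```

The synthesized address is only reachable if something routes `64:ff9b::/96` to a NAT64 translator; without one, connections to it time out.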

**Services that broke:**
- ❌ AWS SSM Agent (IPv4-only): `dial tcp [64:ff9b::392:b12]:443: i/o timeout`
- ❌ ECS container health checks failed
- ❌ Any IPv4-only external dependencies

**Services that worked:**
- ✅ ECS control plane (has dual-stack endpoint)
- ✅ ECR (has dual-stack endpoint)
- ✅ CloudWatch Logs (has dual-stack endpoint)

### 2. Terraform Module Version Upgrades (SUCCESSFUL)

**Successfully Updated Modules:**

| Module | Old Version | New Version | Status |
|--------|-------------|-------------|--------|
| `terraform-aws-modules/autoscaling/aws` | ~> 6.5 | ~> 8.3 | ✅ Applied |
| `terraform-aws-modules/security-group/aws` | ~> 4.0 | ~> 5.3 | ✅ Applied |
| AWS Provider | >= 4.6 | >= 5.0 | ✅ Applied |
| `terraform-aws-modules/ecs/aws` | ~> 4.0 | ~> 4.1 | ✅ Applied (kept at v4 to avoid cluster recreation) |

**Why we didn't go further:**
- ECS v6.x: Breaking API changes (cluster recreation required)
- ASG v9.x: Breaking changes in `mixed_instances_policy` structure

**Installed Versions:**
- AWS Provider: v5.100.0
- ECS Module: v4.1.3
- Autoscaling Module: v8.3.1
- Security Group Module: v5.3.1

## Current Configuration (Post-Revert)

**Final State:**
- ✅ Instances have public IPv4 (reverted from IPv6-only)
- ✅ Instances have IPv6 addresses
- ✅ Dual-stack networking
- ✅ Module upgrades applied
- ❌ No cost savings (still paying for public IPv4)

**Configuration:**
```hcl
# terraform/asg.tf
network_interfaces = [
  {
    associate_public_ip_address = true # Reverted to true
    ipv6_address_count          = 1    # Still have IPv6
    # ...
  }
]

# terraform/ecs.tf - user_data
# Standard ECS config, no IPv6-specific settings
```

## What Would Need to Change for IPv6-Only to Work

**Waiting for AWS to provide:**

1. **SSM dual-stack endpoints** (main blocker)
   - SSM, EC2 Messages, and SSM Messages currently require IPv4
   - Without this, managed EC2 instances cannot go IPv6-only
   - NAT64 via NAT Gateway exists but costs ~$32+/month (negates savings)

2. **Alternative: All management services support dual-stack**
   - Particularly: SSM, EC2 Messages, SSM Messages
   - Currently ECS, ECR, CloudWatch Logs, S3, IAM support dual-stack

**Self-managed workarounds we rejected:**

1. **Deploy NAT64 on EC2** (Jool/Tayga software)
   - Cost: ~$3-5/month + maintenance burden
   - Complexity: High (setup, monitoring, SPOF)
   - Not worth $7-14/month savings

2. **VPC Endpoints for IPv4-only services**
   - Cost: ~$7-10/month
   - Would eliminate savings
   - Previous testing showed higher cost than benefit

3. **Disable SSM entirely**
   - Lose remote management capability
   - Not acceptable for production

## Lessons Learned

### What We Discovered

1. **DNS64 ≠ NAT64**
   - DNS64 only translates DNS queries, not actual traffic
   - Need both DNS64 + NAT64 for IPv6-only to work
   - AWS provides DNS64 on subnets, but NAT64 only through a (paid) NAT Gateway

2. **Docker IPv6 Configuration Issues**
   - Enabling Docker IPv6 (`"ipv6": true`) broke dual-stack networking
   - Caused container health check failures
   - Required instance refresh to fix

3. **AWS Service IPv6 Support is Inconsistent**
   - Some services have dual-stack: ECS, ECR, CloudWatch, S3
   - Some services are IPv4-only: SSM, EC2 Messages
   - Use `.api.aws` suffix for dual-stack endpoints when available

4. **Cost-Benefit Analysis**
   - Potential savings: ~$7-14/month (public IPv4 charges)
   - VPC endpoint costs: ~$7-10/month (negates savings)
   - Self-managed NAT64: High complexity for minimal savings
   - **Conclusion:** Not worth the effort at this scale
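
The cost comparison above can be sketched as simple arithmetic. The hourly rates below are assumptions (public IPv4 at $0.005/hour and a NAT Gateway base charge of $0.045/hour, with a 730-hour month) and should be checked against current AWS pricing; the helper names are hypothetical. With these assumed rates, 2-4 instances work out to roughly the ~$7-14/month figure used in these notes, and one NAT Gateway to roughly the ~$32/month figure.

```python
HOURS_PER_MONTH = 730  # assumption: average month length for estimates

# Assumed list prices in USD per hour; verify before relying on them.
PUBLIC_IPV4_HOURLY = 0.005
NAT_GATEWAY_HOURLY = 0.045  # base charge only, excludes data processing


def monthly(hourly_rate, count=1):
    """Monthly cost in USD for `count` resources at an hourly rate."""
    return round(hourly_rate * HOURS_PER_MONTH * count, 2)


def net_savings(instances):
    """Public IPv4 charges avoided minus the base cost of one NAT Gateway."""
    return round(monthly(PUBLIC_IPV4_HOURLY, instances) - monthly(NAT_GATEWAY_HOURLY), 2)


if __name__ == "__main__":
    for n in (2, 4):
        print(f"{n} instances: IPv4 ${monthly(PUBLIC_IPV4_HOURLY, n)}/month, "
              f"net after NAT Gateway ${net_savings(n)}/month")
```

A negative `net_savings` value means the NAT Gateway costs more than the IPv4 addresses it would replace, which is the case at the 2-4 instance scale described here.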

### Technical Details Documented

**VPC IPv6 Configuration:**
- VPC CIDR: `2600:1f16:78e:d400::/56`
- Subnets: `2600:1f16:78e:d400::/64`, `d401::/64`, `d402::/64`
- DNS64 prefix: `64:ff9b::/96`
- Route: `::/0` → `igw-e39ab08a`
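
As a small illustration of how the subnet CIDRs relate to the VPC CIDR, the sketch below carves the first three `/64` networks out of the `/56`. It assumes the abbreviated `d401::/64` and `d402::/64` entries stand for consecutive `/64`s in the same VPC block; `first_subnets` is a hypothetical helper name.

```python
import ipaddress
from itertools import islice

VPC_CIDR = ipaddress.IPv6Network("2600:1f16:78e:d400::/56")


def first_subnets(network, count, new_prefix=64):
    """Return the first `count` subnets of the given prefix length."""
    # subnets() is a generator, so islice avoids listing all 256 /64s.
    return list(islice(network.subnets(new_prefix=new_prefix), count))


if __name__ == "__main__":
    for subnet in first_subnets(VPC_CIDR, 3):
        print(subnet)
    # prints:
    # 2600:1f16:78e:d400::/64
    # 2600:1f16:78e:d401::/64
    # 2600:1f16:78e:d402::/64
```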

**Error signatures to watch for:**
```
dial tcp [64:ff9b::xxx:xxx]:443: i/o timeout
```
This indicates that DNS64 synthesized the address but no NAT64 gateway is routing `64:ff9b::/96` traffic.
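
When triaging such a timeout, it can help to recover which IPv4 endpoint the client was actually trying to reach. The following is a minimal Python sketch under the assumption that the address was synthesized with the `64:ff9b::/96` prefix; `embedded_ipv4` is a hypothetical helper name.

```python
import ipaddress

NAT64_PREFIX = ipaddress.IPv6Network("64:ff9b::/96")


def embedded_ipv4(address):
    """Return the IPv4 address embedded in a DNS64-synthesized address.

    Returns None when the address is outside the 64:ff9b::/96 prefix,
    i.e. when it is a native IPv6 address.
    """
    v6 = ipaddress.IPv6Address(address)
    if v6 not in NAT64_PREFIX:
        return None
    # The low 32 bits of the synthesized address hold the IPv4 address.
    return ipaddress.IPv4Address(int(v6) & 0xFFFFFFFF)


if __name__ == "__main__":
    # Address taken from the SSM Agent error quoted in these notes.
    print(embedded_ipv4("64:ff9b::392:b12"))  # prints 3.146.11.18
```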

## Future Retry Conditions

**Only attempt IPv6-only again when ONE of these is true:**

1. **SSM gets dual-stack endpoints**
   - Specifically need: SSM, EC2 Messages, SSM Messages with IPv6
   - This is the primary blocker for managed EC2 instances
   - Check: https://docs.aws.amazon.com/vpc/latest/userguide/aws-ipv6-support.html

2. **NAT Gateway pricing drops significantly**
   - Currently ~$32+/month base cost negates IPv4 savings
   - Would need to be <$10/month to make economic sense

3. **Public IPv4 costs exceed $20-30/month**
   - At current scale (2-4 instances), savings too small
   - If scale increases significantly, complexity might be worth it

4. **VPC Endpoint costs drop significantly**
   - If AWS reduces endpoint pricing below ~$3/month per endpoint
   - Would make endpoint solution viable

**How to check service IPv6 support:**
```bash
dig service-name.region.api.aws AAAA +short
# If this returns an IPv6 address, the service supports dual-stack
```
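
The same check can be scripted. One caveat worth hedging: on a subnet with DNS64 enabled, a lookup for an IPv4-only service may return a synthesized `64:ff9b::/96` address, which would look like IPv6 support but is not. The Python sketch below filters those out. `native_ipv6_addresses` is a hypothetical helper, and the `resolver` parameter exists only so the function can be exercised without network access.

```python
import ipaddress
import socket

DNS64_PREFIX = ipaddress.IPv6Network("64:ff9b::/96")


def native_ipv6_addresses(hostname, resolver=socket.getaddrinfo):
    """Return the hostname's IPv6 addresses, excluding DNS64-synthesized ones."""
    try:
        results = resolver(hostname, 443, family=socket.AF_INET6,
                           type=socket.SOCK_STREAM)
    except socket.gaierror:
        return []  # no AAAA record, or the lookup failed
    # Each result is (family, type, proto, canonname, sockaddr);
    # sockaddr[0] is the address string.
    addresses = {ipaddress.IPv6Address(sockaddr[0]) for *_, sockaddr in results}
    return sorted(str(a) for a in addresses if a not in DNS64_PREFIX)
```

For example, `native_ipv6_addresses("ecs.us-east-2.api.aws")` should return a non-empty list if the endpoint is dual-stack, while an empty list suggests the service is IPv4-only or the lookup failed.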

**Track AWS IPv6 progress:**
- Official tracker: https://docs.aws.amazon.com/vpc/latest/userguide/aws-ipv6-support.html
- AWS What's New (filter for IPv6): https://aws.amazon.com/new/

**Key services to watch for IPv6-only viability:**
- SSM (Systems Manager) - currently IPv4-only, this is the main blocker
- EC2 Messages - currently IPv4-only
- SSM Messages - currently IPv4-only

**Last checked:** January 2026 - SSM still requires IPv4 connectivity

## Rollback Summary

**What we reverted:**
1. Changed `associate_public_ip_address` back to `true`
2. Removed IPv6-specific ECS agent configuration
3. Removed Docker IPv6 configuration
4. Triggered instance refresh to replace broken instances

**What we kept:**
- IPv6 addressing (instances have both IPv4 and IPv6)
- Module version upgrades
- Updated security group module

**Recovery time:** ~5 minutes for instance refresh to complete

dns/README.md

Lines changed: 0 additions & 11 deletions
This file was deleted.

dns/operationcode.net/provider.tf

Lines changed: 0 additions & 8 deletions
This file was deleted.

dns/operationcode.net/records.tf

Lines changed: 0 additions & 13 deletions
This file was deleted.

dns/operationcode.net/terraform.tfvars

Lines changed: 0 additions & 7 deletions
This file was deleted.

dns/operationcode.net/variables.tf

Lines changed: 0 additions & 4 deletions
This file was deleted.

dns/operationcode.org/provider.tf

Lines changed: 0 additions & 8 deletions
This file was deleted.

dns/operationcode.org/records.tf

Lines changed: 0 additions & 78 deletions
This file was deleted.
