| 1 | +# Dynamic Cloud Configuration |
| 2 | + |
| 3 | +kdevops supports dynamic configuration generation for cloud providers, automatically |
| 4 | +querying cloud APIs to provide up-to-date instance types, regions, and pricing |
| 5 | +information. |
| 6 | + |
| 7 | +## Overview |
| 8 | + |
| 9 | +Dynamic cloud configuration ensures your kdevops setup always has access to the |
| 10 | +latest cloud provider offerings without manual updates. This system: |
| 11 | + |
| 12 | +- Queries cloud provider APIs for current instance types and regions |
| 13 | +- Generates Kconfig files with accurate specifications |
| 14 | +- Caches data for performance (24-hour TTL) |
| 15 | +- Supports parallel processing for fast generation |
| 16 | +- Integrates with standard kdevops workflows |
| 17 | + |
| 18 | +## Supported Cloud Providers |
| 19 | + |
| 20 | +### AWS (Amazon Web Services) |
| 21 | + |
| 22 | +AWS dynamic configuration provides: |
| 23 | +- 146+ instance families (vs 6 in static configs) |
| 24 | +- 900+ instance types with current specs |
| 25 | +- 30+ regions with availability zones |
| 26 | +- GPU instance support (P5, G5, etc.) |
| 27 | +- Cost tracking integration |
| 28 | + |
| 29 | +### Lambda Labs |
| 30 | + |
| 31 | +Lambda Labs dynamic configuration provides: |
| 32 | +- GPU-focused instance types |
| 33 | +- Real-time availability checking |
| 34 | +- Automatic region discovery |
| 35 | +- Pricing information |
| 36 | + |
| 37 | +## Quick Start |
| 38 | + |
| 39 | +### Generate Cloud Configurations |
| 40 | + |
| 41 | +```bash |
| 42 | +# Generate all cloud provider configurations |
| 43 | +make cloud-config |
| 44 | + |
| 45 | +# Generate specific provider configurations |
| 46 | +make cloud-config-aws |
| 47 | +make cloud-config-lambdalabs |
| 48 | +``` |
| 49 | + |
| 50 | +### Update Cloud Data |
| 51 | + |
| 52 | +To refresh cached data and get the latest information: |
| 53 | + |
| 54 | +```bash |
| 55 | +# Update all providers |
| 56 | +make cloud-update |
| 57 | + |
| 58 | +# Update specific provider |
| 59 | +make cloud-update-aws |
| 60 | +``` |
| 61 | + |
| 62 | +### Check Cloud Costs |
| 63 | + |
| 64 | +Monitor your cloud spending: |
| 65 | + |
| 66 | +```bash |
| 67 | +# Show current month's costs |
| 68 | +make cloud-bill |
| 69 | + |
| 70 | +# AWS-specific billing |
| 71 | +make cloud-bill-aws |
| 72 | +``` |
| 73 | + |
| 74 | +## AWS Dynamic Configuration |
| 75 | + |
| 76 | +### How It Works |
| 77 | + |
| 78 | +1. **Data Collection**: Uses Chuck's AWS scripts to query EC2 APIs |
| 79 | + - `terraform/aws/scripts/ec2_instance_info.py`: Instance specifications |
| 80 | + - `terraform/aws/scripts/aws_regions_info.py`: Region information |
| 81 | + - `terraform/aws/scripts/aws_ami_info.py`: AMI details |
| 82 | + |
| 83 | +2. **Caching**: JSON data cached in `~/.cache/kdevops/aws/` |
| 84 | + - 24-hour TTL for cached data |
| 85 | + - Automatic refresh on cache expiry |
| 86 | + - Manual refresh with `make cloud-update-aws` |
| 87 | + |
| 88 | +3. **Generation**: Parallel processing creates Kconfig files |
| 89 | + - Main configs in `terraform/aws/kconfigs/*.generated` |
| 90 | + - Instance types in `terraform/aws/kconfigs/instance-types/*.generated` |
| 91 | + - ~21 seconds for fresh generation (vs 6 minutes unoptimized) |
| 92 | + - ~0.04 seconds when using cache |
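
The caching step above can be sketched as a simple freshness check. This is a minimal illustration that assumes freshness is judged by file modification time; the source does not say how the real scripts do it, and `cache_is_fresh` is a hypothetical helper name.

```python
import time
from pathlib import Path

CACHE_TTL_SECONDS = 24 * 60 * 60  # 24-hour TTL, as described above


def cache_is_fresh(path, now=None, ttl=CACHE_TTL_SECONDS):
    """Return True if `path` exists and was modified less than `ttl` seconds ago.

    Hypothetical helper: assumes the file's mtime marks when it was cached.
    """
    path = Path(path)
    if not path.is_file():
        return False
    now = time.time() if now is None else now
    return (now - path.stat().st_mtime) < ttl


if __name__ == "__main__":
    cache_file = Path.home() / ".cache" / "kdevops" / "aws" / "aws_regions.json"
    print("use cache" if cache_is_fresh(cache_file) else "query the API again")
```

A missing file counts as stale, so a first run and a run after `rm -rf` of the cache directory both fall through to a fresh query.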
| 93 | + |
| 94 | +### Configuration Structure |
| 95 | + |
| 96 | +``` |
| 97 | +terraform/aws/kconfigs/ |
| 98 | +├── Kconfig.compute.generated # Instance family selection |
| 99 | +├── Kconfig.location.generated # AWS regions |
| 100 | +├── Kconfig.gpu-amis.generated # GPU AMI configurations |
| 101 | +└── instance-types/ |
| 102 | + ├── Kconfig.m5.generated # M5 family sizes |
| 103 | + ├── Kconfig.p5.generated # P5 GPU instances |
| 104 | + └── ... (146+ families) |
| 105 | +``` |
| 106 | + |
| 107 | +### Using AWS GPU Instances |
| 108 | + |
| 109 | +kdevops includes pre-configured defconfigs for GPU workloads: |
| 110 | + |
| 111 | +```bash |
| 112 | +# High-end: 8x NVIDIA H100 80GB GPUs |
| 113 | +make defconfig-aws-gpu-p5-48xlarge |
| 114 | + |
| 115 | +# Cost-effective: 1x NVIDIA A10G 24GB GPU |
| 116 | +make defconfig-aws-gpu-g5-xlarge |
| 117 | + |
| 118 | +# Then provision |
| 119 | +make bringup |
| 120 | +``` |
| 121 | + |
| 122 | +### Cost Management |
| 123 | + |
| 124 | +Track AWS costs with integrated billing support: |
| 125 | + |
| 126 | +```bash |
| 127 | +# Check current month's spending |
| 128 | +make cloud-bill-aws |
| 129 | +``` |
| 130 | + |
| 131 | +Output shows: |
| 132 | +- Total monthly cost to date |
| 133 | +- Breakdown by AWS service |
| 134 | +- Daily average spending |
| 135 | +- Projected monthly cost (when mid-month) |
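
The daily average and projected monthly cost can be understood as a linear extrapolation. The sketch below assumes that is how the projection is derived, which the source does not confirm; `project_monthly_cost` is a hypothetical name and not part of kdevops.

```python
import calendar
from datetime import date


def project_monthly_cost(cost_to_date, today):
    """Return (daily_average, projected_total) for the month containing `today`.

    Assumes `cost_to_date` covers day 1 of the month through `today` inclusive
    and that spending continues at the same average rate.
    """
    days_elapsed = today.day
    days_in_month = calendar.monthrange(today.year, today.month)[1]
    daily_average = cost_to_date / days_elapsed
    return daily_average, daily_average * days_in_month


if __name__ == "__main__":
    # $45.00 spent by 15 June 2025; June has 30 days.
    avg, projected = project_monthly_cost(45.0, date(2025, 6, 15))
    print(f"daily average: ${avg:.2f}, projected: ${projected:.2f}")
    # prints "daily average: $3.00, projected: $90.00"
```

Early in the month the projection rests on very few days of data, so it is best treated as a rough guide.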
| 136 | + |
| 137 | +## Lambda Labs Dynamic Configuration |
| 138 | + |
| 139 | +Lambda Labs configuration focuses on GPU instances for ML/AI workloads: |
| 140 | + |
| 141 | +```bash |
| 142 | +# Generate Lambda Labs configs |
| 143 | +make cloud-config-lambdalabs |
| 144 | + |
| 145 | +# Use a Lambda Labs defconfig |
| 146 | +make defconfig-lambdalabs-gpu-8x-h100 |
| 147 | +``` |
| 148 | + |
| 149 | +## Technical Details |
| 150 | + |
| 151 | +### Performance Optimizations |
| 152 | + |
| 153 | +The dynamic configuration system uses several optimizations: |
| 154 | + |
| 155 | +1. **Parallel API Queries**: 10 concurrent workers fetch instance data |
| 156 | +2. **Parallel File Writing**: 20 concurrent workers write Kconfig files |
| 157 | +3. **JSON Caching**: 24-hour cache reduces API calls |
| 158 | +4. **Batch Processing**: Fetches all data in a single API call where possible |
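
The parallel file-writing step can be illustrated with a thread pool. This sketch only assumes a pool of 20 workers as stated above; `render_family` and `write_family_files` are hypothetical names, and the rendered text is a placeholder rather than real Kconfig syntax.

```python
import tempfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def render_family(family, sizes):
    """Render placeholder text for one family (not real Kconfig syntax)."""
    lines = [f"# family: {family}"]
    lines += [f"# size: {family}.{size}" for size in sizes]
    return "\n".join(lines) + "\n"


def write_family_files(families, out_dir, max_workers=20):
    """Write one Kconfig.<family>.generated file per family using a thread pool."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    def write_one(item):
        family, sizes = item
        path = out_dir / f"Kconfig.{family}.generated"
        path.write_text(render_family(family, sizes))
        return path.name

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order; sort for a stable result anyway.
        return sorted(pool.map(write_one, families.items()))


if __name__ == "__main__":
    sample = {"m5": ["large", "xlarge"], "p5": ["48xlarge"]}
    with tempfile.TemporaryDirectory() as tmp:
        print(write_family_files(sample, tmp))
```

Threads suit this step because writing many small files is I/O-bound, and each family maps to its own output file, so the workers never contend for the same path.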
| 159 | + |
| 160 | +### Cache Management |
| 161 | + |
| 162 | +Cache location: `~/.cache/kdevops/<provider>/` |
| 163 | + |
| 164 | +Cache files: |
| 165 | +- `aws_families.json`: Instance family list |
| 166 | +- `aws_family_<name>.json`: Per-family instance data |
| 167 | +- `aws_regions.json`: Region information |
| 168 | +- `aws_all_instances.json`: Complete dataset |
| 169 | + |
| 170 | +Clear cache manually: |
| 171 | +```bash |
| 172 | +rm -rf ~/.cache/kdevops/aws/ |
| 173 | +make cloud-update-aws |
| 174 | +``` |
| 175 | + |
| 176 | +### Adding New Cloud Providers |
| 177 | + |
| 178 | +To add support for a new cloud provider: |
| 179 | + |
| 180 | +1. Create provider-specific scripts in `terraform/<provider>/scripts/` |
| 181 | +2. Add Kconfig directory structure in `terraform/<provider>/kconfigs/` |
| 182 | +3. Update `scripts/dynamic-cloud-kconfig.Makefile` with new targets |
| 183 | +4. Implement generation in `scripts/generate_cloud_configs.py` |
| 184 | + |
| 185 | +## Troubleshooting |
| 186 | + |
| 187 | +### AWS Credentials Not Configured |
| 188 | + |
| 189 | +If you see "AWS: Credentials not configured": |
| 190 | + |
| 191 | +```bash |
| 192 | +# Configure AWS CLI |
| 193 | +aws configure |
| 194 | + |
| 195 | +# Or set environment variables |
| 196 | +export AWS_ACCESS_KEY_ID=your_key |
| 197 | +export AWS_SECRET_ACCESS_KEY=your_secret |
| 198 | +export AWS_DEFAULT_REGION=us-east-1 |
| 199 | +``` |
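
Before running the generation targets, a quick check of the environment variables can help. The sketch below only inspects the two variables named above; `missing_aws_env` is a hypothetical helper. It does not see credentials stored by `aws configure` (which live under `~/.aws/`), so an empty environment does not necessarily mean credentials are missing.

```python
import os

# Variables from the export example above; the region is left out because
# this check only looks at the key pair.
REQUIRED_VARS = ("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY")


def missing_aws_env(env=None):
    """Return the names of required AWS variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]


if __name__ == "__main__":
    missing = missing_aws_env()
    if missing:
        print("Not set in environment:", ", ".join(missing))
    else:
        print("AWS key pair found in environment")
```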
| 200 | + |
| 201 | +### Kconfig Errors |
| 202 | + |
| 203 | +If menuconfig shows errors after generation: |
| 204 | + |
| 205 | +1. Refresh the cached data and regenerate: |
| 206 | + ```bash |
| 207 | + make cloud-update-aws |
| 208 | + ``` |
| 209 | + |
| 210 | +2. Check for syntax issues: |
| 211 | + ```bash |
| 212 | + grep -n "error:" terraform/aws/kconfigs/*.generated |
| 213 | + ``` |
| 214 | + |
| 215 | +### Slow Generation |
| 216 | + |
| 217 | +If generation takes longer than 30 seconds: |
| 218 | + |
| 219 | +1. Check network connectivity to AWS |
| 220 | +2. Verify credentials are valid |
| 221 | +3. Try a different AWS region: |
| 222 | + ```bash |
| 223 | + export AWS_DEFAULT_REGION=eu-west-1 |
| 224 | + make cloud-update-aws |
| 225 | + ``` |
| 226 | + |
| 227 | +## Development |
| 228 | + |
| 229 | +### Running Scripts Directly |
| 230 | + |
| 231 | +```bash |
| 232 | +# Generate AWS configs with Chuck's scripts |
| 233 | +python3 terraform/aws/scripts/generate_aws_kconfig.py |
| 234 | + |
| 235 | +# Clear cache and regenerate |
| 236 | +python3 terraform/aws/scripts/generate_aws_kconfig.py clear-cache |
| 237 | + |
| 238 | +# Query specific instance family |
| 239 | +python3 terraform/aws/scripts/ec2_instance_info.py m5 --format json |
| 240 | + |
| 241 | +# List all families |
| 242 | +python3 terraform/aws/scripts/ec2_instance_info.py --families --format json |
| 243 | +``` |
| 244 | + |
| 245 | +### Debugging |
| 246 | + |
| 247 | +Enable debug output: |
| 248 | +```bash |
| 249 | +# Debug AWS script |
| 250 | +python3 terraform/aws/scripts/ec2_instance_info.py --debug m5 |
| 251 | + |
| 252 | +# Verbose Makefile execution |
| 253 | +make V=1 cloud-config-aws |
| 254 | +``` |
| 255 | + |
| 256 | +## Best Practices |
| 257 | + |
| 258 | +1. **Regular Updates**: Run `make cloud-update` weekly for the latest offerings |
| 259 | +2. **Cost Monitoring**: Check `make cloud-bill` before major deployments |
| 260 | +3. **Cache Management**: Let cache expire naturally unless testing changes |
| 261 | +4. **Region Selection**: Choose regions close to you for lower latency |
| 262 | +5. **Instance Right-Sizing**: Use dynamic configs to find optimal instance sizes |
| 263 | + |
| 264 | +## Future Enhancements |
| 265 | + |
| 266 | +Planned improvements: |
| 267 | +- Azure dynamic configuration support |
| 268 | +- GCE (Google Compute Engine) dynamic configuration |
| 269 | +- Real-time pricing integration |
| 270 | +- Spot instance availability checking |
| 271 | +- Instance recommendation based on workload |
| 272 | +- Cost optimization suggestions |