Commit 3d7c3a9

docs: add documentation for dynamic cloud configuration
Add detailed documentation covering the dynamic cloud configuration system, including:

- Overview of dynamic configuration benefits
- AWS and Lambda Labs provider details
- Quick start commands for all cloud operations
- Technical implementation details
- Performance optimizations (21s vs 6 minutes)
- Cache management (24-hour TTL)
- Cost tracking with make cloud-bill
- GPU instance configuration examples
- Troubleshooting guide
- Development and debugging instructions

The documentation explains how the system works, from Chuck's AWS scripts through the caching layer to Kconfig generation, providing users with a complete understanding of the dynamic configuration workflow.

Generated-by: Claude AI
Signed-off-by: Luis Chamberlain <[email protected]>
1 parent 55c9d1c commit 3d7c3a9

1 file changed

+272
-0
lines changed

docs/cloud-dynamic-config.md

Lines changed: 272 additions & 0 deletions
@@ -0,0 +1,272 @@
# Dynamic Cloud Configuration

kdevops supports dynamic configuration generation for cloud providers, automatically
querying cloud APIs to provide up-to-date instance types, regions, and pricing
information.

## Overview

Dynamic cloud configuration ensures your kdevops setup always has access to the
latest cloud provider offerings without manual updates. This system:

- Queries cloud provider APIs for current instance types and regions
- Generates Kconfig files with accurate specifications
- Caches data for performance (24-hour TTL)
- Supports parallel processing for fast generation
- Integrates with standard kdevops workflows

## Supported Cloud Providers

### AWS (Amazon Web Services)

AWS dynamic configuration provides:
- 146+ instance families (vs 6 in static configs)
- 900+ instance types with current specs
- 30+ regions with availability zones
- GPU instance support (P5, G5, etc.)
- Cost tracking integration

### Lambda Labs

Lambda Labs dynamic configuration provides:
- GPU-focused instance types
- Real-time availability checking
- Automatic region discovery
- Pricing information

## Quick Start

### Generate Cloud Configurations

```bash
# Generate all cloud provider configurations
make cloud-config

# Generate specific provider configurations
make cloud-config-aws
make cloud-config-lambdalabs
```

### Update Cloud Data

To refresh cached data and get the latest information:

```bash
# Update all providers
make cloud-update

# Update specific provider
make cloud-update-aws
```

### Check Cloud Costs

Monitor your cloud spending:

```bash
# Show current month's costs
make cloud-bill

# AWS-specific billing
make cloud-bill-aws
```

## AWS Dynamic Configuration

### How It Works

1. **Data Collection**: Uses Chuck's AWS scripts to query EC2 APIs
   - `terraform/aws/scripts/ec2_instance_info.py`: Instance specifications
   - `terraform/aws/scripts/aws_regions_info.py`: Region information
   - `terraform/aws/scripts/aws_ami_info.py`: AMI details

2. **Caching**: JSON data cached in `~/.cache/kdevops/aws/`
   - 24-hour TTL for cached data
   - Automatic refresh on cache expiry
   - Manual refresh with `make cloud-update-aws`

3. **Generation**: Parallel processing creates Kconfig files
   - Main configs in `terraform/aws/kconfigs/*.generated`
   - Instance types in `terraform/aws/kconfigs/instance-types/*.generated`
   - ~21 seconds for fresh generation (vs 6 minutes unoptimized)
   - ~0.04 seconds when using cache
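
The caching step can be pictured with a minimal sketch. This is not the kdevops implementation: the helper name `load_cached`, its parameters, and the use of file modification time to decide expiry are all assumptions made for illustration. It returns the cached JSON while the file is younger than the TTL and otherwise calls a caller-supplied `fetch` function and rewrites the cache.

```python
import json
import time
from pathlib import Path

CACHE_TTL_SECONDS = 24 * 60 * 60  # 24-hour TTL, matching the documented cache lifetime


def load_cached(path, fetch, ttl=CACHE_TTL_SECONDS, now=None):
    """Return cached JSON if the file is younger than ttl, else refetch and re-cache.

    Hypothetical helper: expiry is judged by file mtime, which is an assumption.
    """
    path = Path(path)
    now = time.time() if now is None else now
    if path.exists() and now - path.stat().st_mtime < ttl:
        # Cache hit: no API call is made.
        return json.loads(path.read_text())
    # Cache miss or expired entry: query the provider and store the result.
    data = fetch()
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(data))
    return data
```

Passing `now` explicitly makes expiry easy to exercise in tests without waiting 24 hours.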

### Configuration Structure

```
terraform/aws/kconfigs/
├── Kconfig.compute.generated      # Instance family selection
├── Kconfig.location.generated     # AWS regions
├── Kconfig.gpu-amis.generated     # GPU AMI configurations
└── instance-types/
    ├── Kconfig.m5.generated       # M5 family sizes
    ├── Kconfig.p5.generated       # P5 GPU instances
    └── ... (146+ families)
```
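
To give a feel for what a per-family file might contain, here is a rough sketch that renders a Kconfig `choice` block from a list of instance records. The function name, the input keys (`name`, `vcpus`, `memory_gib`), and the `TERRAFORM_AWS_INSTANCE_` symbol prefix are hypothetical; the real generated files may use different symbols and layout.

```python
def render_family_kconfig(family, instances):
    """Render a Kconfig 'choice' block for one instance family.

    `instances` is a list of dicts with hypothetical keys: name, vcpus, memory_gib.
    """
    prefix = f"TERRAFORM_AWS_INSTANCE_{family.upper()}"  # hypothetical symbol prefix
    lines = ["choice", f'\tprompt "{family.upper()} instance size"', ""]
    for inst in instances:
        # "m5.large" -> "large"; Kconfig symbols cannot contain dots or dashes.
        size = inst["name"].split(".", 1)[1]
        symbol = f"{prefix}_{size.upper().replace('-', '_')}"
        lines += [
            f"config {symbol}",
            f'\tbool "{inst["name"]} ({inst["vcpus"]} vCPU, {inst["memory_gib"]} GiB)"',
            "",
        ]
    lines.append("endchoice")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(render_family_kconfig("m5", [{"name": "m5.large", "vcpus": 2, "memory_gib": 8}]))
```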

### Using AWS GPU Instances

kdevops includes pre-configured defconfigs for GPU workloads:

```bash
# High-end: 8x NVIDIA H100 80GB GPUs
make defconfig-aws-gpu-p5-48xlarge

# Cost-effective: 1x NVIDIA A10G 24GB GPU
make defconfig-aws-gpu-g5-xlarge

# Then provision
make bringup
```

### Cost Management

Track AWS costs with integrated billing support:

```bash
# Check current month's spending
make cloud-bill-aws
```

Output shows:
- Total monthly cost to date
- Breakdown by AWS service
- Daily average spending
- Projected monthly cost (when the month is still in progress)
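
The daily average and projection can be illustrated with simple arithmetic. How `make cloud-bill-aws` actually computes its projection is not stated here, so the sketch below assumes a plain linear extrapolation of month-to-date spend; the function name is hypothetical.

```python
import calendar
from datetime import date


def project_monthly_cost(cost_to_date, today):
    """Return (daily_average, projected_month_total) from month-to-date spend.

    Assumes spending continues at the same average daily rate (linear extrapolation).
    """
    days_elapsed = today.day
    days_in_month = calendar.monthrange(today.year, today.month)[1]
    daily_average = cost_to_date / days_elapsed
    return daily_average, daily_average * days_in_month


if __name__ == "__main__":
    # $150 spent by April 15 averages $10/day, projecting to $300 for a 30-day month.
    print(project_monthly_cost(150.0, date(2024, 4, 15)))
```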

## Lambda Labs Dynamic Configuration

Lambda Labs configuration focuses on GPU instances for ML/AI workloads:

```bash
# Generate Lambda Labs configs
make cloud-config-lambdalabs

# Use a Lambda Labs defconfig
make defconfig-lambdalabs-gpu-8x-h100
```

## Technical Details

### Performance Optimizations

The dynamic configuration system uses several optimizations:

1. **Parallel API Queries**: 10 concurrent workers fetch instance data
2. **Parallel File Writing**: 20 concurrent workers write Kconfig files
3. **JSON Caching**: 24-hour cache reduces API calls
4. **Batch Processing**: Fetches all data in a single API call where possible
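
A worker pool of the kind listed above can be sketched with the standard library. Whether kdevops uses threads, processes, or something else is not stated, so `ThreadPoolExecutor` is an assumption here, and `fetch_all_families` and its `fetch_family` callback are hypothetical names.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_all_families(families, fetch_family, max_workers=10):
    """Fetch data for every family concurrently and return {family: data}.

    `fetch_family` is a caller-supplied function taking one family name.
    """
    families = list(families)  # iterated twice below, so materialize it first
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in input order, so zip pairs them correctly.
        results = pool.map(fetch_family, families)
        return dict(zip(families, results))


if __name__ == "__main__":
    demo = fetch_all_families(["m5", "p5", "g5"], lambda name: {"family": name})
    print(demo)
```

Threads suit this workload because the time is spent waiting on network responses rather than on computation.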

### Cache Management

Cache location: `~/.cache/kdevops/<provider>/`

Cache files:
- `aws_families.json`: Instance family list
- `aws_family_<name>.json`: Per-family instance data
- `aws_regions.json`: Region information
- `aws_all_instances.json`: Complete dataset

Clear cache manually:
```bash
rm -rf ~/.cache/kdevops/aws/
make cloud-update-aws
```

### Adding New Cloud Providers

To add support for a new cloud provider:

1. Create provider-specific scripts in `terraform/<provider>/scripts/`
2. Add Kconfig directory structure in `terraform/<provider>/kconfigs/`
3. Update `scripts/dynamic-cloud-kconfig.Makefile` with new targets
4. Implement generation in `scripts/generate_cloud_configs.py`

## Troubleshooting

### AWS Credentials Not Configured

If you see "AWS: Credentials not configured":

```bash
# Configure AWS CLI
aws configure

# Or set environment variables
export AWS_ACCESS_KEY_ID=your_key
export AWS_SECRET_ACCESS_KEY=your_secret
export AWS_DEFAULT_REGION=us-east-1
```
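
For a quick sanity check of the environment-variable route, a small sketch like the following reports which variables are unset. The helper name is hypothetical, and it deliberately looks only at the environment: credentials stored by `aws configure` live in `~/.aws/credentials` and are not inspected here.

```python
import os

REQUIRED_VARS = ("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY")


def missing_aws_env(environ=None):
    """Return the names of required AWS variables that are unset or empty."""
    environ = os.environ if environ is None else environ
    return [name for name in REQUIRED_VARS if not environ.get(name)]


if __name__ == "__main__":
    missing = missing_aws_env()
    if missing:
        print("Missing:", ", ".join(missing))
    else:
        print("AWS credential variables are set")
```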

### Kconfig Errors

If menuconfig shows errors after generation:

1. Refresh the cache and regenerate:
   ```bash
   make cloud-update-aws
   ```

2. Check for syntax issues:
   ```bash
   grep -n "error:" terraform/aws/kconfigs/*.generated
   ```

### Slow Generation

If generation takes longer than 30 seconds:

1. Check network connectivity to AWS
2. Verify credentials are valid
3. Try a different AWS region:
   ```bash
   export AWS_DEFAULT_REGION=eu-west-1
   make cloud-update-aws
   ```

## Development

### Running Scripts Directly

```bash
# Generate AWS configs with Chuck's scripts
python3 terraform/aws/scripts/generate_aws_kconfig.py

# Clear cache and regenerate
python3 terraform/aws/scripts/generate_aws_kconfig.py clear-cache

# Query specific instance family
python3 terraform/aws/scripts/ec2_instance_info.py m5 --format json

# List all families
python3 terraform/aws/scripts/ec2_instance_info.py --families --format json
```

### Debugging

Enable debug output:
```bash
# Debug AWS script
python3 terraform/aws/scripts/ec2_instance_info.py --debug m5

# Verbose Makefile execution
make V=1 cloud-config-aws
```

## Best Practices

1. **Regular Updates**: Run `make cloud-update` weekly for the latest offerings
2. **Cost Monitoring**: Check `make cloud-bill` before major deployments
3. **Cache Management**: Let the cache expire naturally unless testing changes
4. **Region Selection**: Choose regions close to you for lower latency
5. **Instance Right-Sizing**: Use dynamic configs to find optimal instance sizes

## Future Enhancements

Planned improvements:
- Azure dynamic configuration support
- GCE (Google Cloud) dynamic configuration
- Real-time pricing integration
- Spot instance availability checking
- Instance recommendation based on workload
- Cost optimization suggestions
