|
| 1 | +# Kepler Power Attribution Guide |
| 2 | + |
| 3 | +This guide explains how Kepler measures and attributes power consumption to processes, containers, VMs, and pods. |
| 4 | + |
| 5 | +## How Power Measurement Works |
| 6 | + |
| 7 | +Kepler's power attribution follows a simple but effective approach: measure total system energy consumption from hardware, then distribute it fairly to individual workloads based on their resource usage. |
| 8 | + |
| 9 | +### The Big Picture |
| 10 | + |
| 11 | +Think of your computer like an apartment building with a single electricity meter. The meter shows total power consumption (e.g., 40W), but you need to know how much each apartment (process) is using. Kepler solves this by: |
| 12 | + |
| 13 | +1. **Reading the main meter** - Hardware sensors (Intel RAPL) provide total energy consumption |
| 14 | +2. **Understanding system activity** - Monitor CPU usage to determine how "busy" the system is |
| 15 | +3. **Splitting costs fairly** - Divide energy between "active work" and "idle baseline" |
| 16 | +4. **Allocating to tenants** - Give each process power proportional to its CPU usage |
| 17 | + |
| 18 | +### Core Insight: Active vs Idle Power |
| 19 | + |
| 20 | +The key insight is that system power has two components: |
| 21 | + |
| 22 | +- **Active Power**: Energy consumed doing actual work (running processes) |
| 23 | +- **Idle Power**: Baseline energy for keeping the system running (even when idle) |
| 24 | + |
| 25 | +If your system uses 25% of CPU capacity, then 25% of total power goes to "active" and 75% stays as "idle." |
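
As a minimal sketch of this split (not Kepler's actual code), the function below applies a CPU usage ratio to a total power reading. The `Power` type, the `splitPower` name, and the clamping of the ratio to the range 0 to 1 are assumptions made for illustration.

```go
package main

// Power is a hypothetical type for this sketch, expressed in microwatts.
type Power uint64

// splitPower divides a total power reading into active and idle parts
// using the node CPU usage ratio (expected range 0.0 to 1.0).
func splitPower(total Power, cpuUsageRatio float64) (active, idle Power) {
	// Assumption: clamp the ratio so that a noisy value can never
	// produce more active power than the hardware measured.
	if cpuUsageRatio < 0 {
		cpuUsageRatio = 0
	} else if cpuUsageRatio > 1 {
		cpuUsageRatio = 1
	}
	active = Power(float64(total) * cpuUsageRatio)
	idle = total - active
	return active, idle
}
```

With a 40 W reading (40,000,000 μW) and a ratio of 0.25, this returns 10 W active and 30 W idle.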
| 26 | + |
| 27 | +### Attribution Principle |
| 28 | + |
| 29 | +Once Kepler knows the active power available, it distributes it proportionally: |
| 30 | + |
| 31 | +```text |
| 32 | +Process Power = (Process CPU Time / Total CPU Time) × Active Power |
| 33 | +``` |
| 34 | + |
| 35 | +This ensures that processes consuming more CPU get more power attribution, while the total never exceeds what hardware actually measured. |
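
The formula above can be sketched as a small helper. This is an illustration only: the function name, the map keyed by PID, and the choice to derive total CPU time by summing the per-process deltas are assumptions, not Kepler's API.

```go
package main

// attributeActivePower returns each process's share of activePower
// (microwatts), proportional to its CPU time delta (seconds).
// Assumption: total CPU time is the sum of the per-process deltas.
func attributeActivePower(cpuTimeDeltas map[int]float64, activePower float64) map[int]float64 {
	var total float64
	for _, delta := range cpuTimeDeltas {
		total += delta
	}

	shares := make(map[int]float64, len(cpuTimeDeltas))
	if total == 0 {
		// No CPU activity in this interval: nothing to attribute.
		return shares
	}
	for pid, delta := range cpuTimeDeltas {
		shares[pid] = delta / total * activePower
	}
	return shares
}
```

For two processes with CPU time deltas of 3 s and 1 s and 10 W of active power, the shares are 7.5 W and 2.5 W, which add up to the 10 W that was available.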
| 36 | + |
| 37 | +## Overview |
| 38 | + |
| 39 | +Kepler uses a hierarchical power attribution system that starts with hardware energy measurements and distributes power proportionally based on CPU utilization. The system ensures energy conservation while providing fair attribution across workloads. |
| 40 | + |
| 41 | + |
| 42 | + |
| 43 | +*Figure 1: Power attribution flow showing how 40W total power is split between active (10W) and idle (30W) components, then distributed to workloads based on CPU usage ratios.* |
| 44 | + |
| 45 | +### Real-World Example |
| 46 | + |
| 47 | +Using the diagram above: |
| 48 | + |
| 49 | +- **Hardware reports**: 40W total system power |
| 50 | +- **System analysis**: 25% CPU usage ratio |
| 51 | +- **Power split**: 40W × 25% = 10W active, 30W idle |
| 52 | +- **VM attribution**: VM uses 100% of active CPU → gets all 10W active power |
| 53 | +- **Container breakdown**: Within the VM, containers get proportional shares of the 10W |
| 54 | + |
| 55 | +## Architecture Components |
| 56 | + |
| 57 | +### 1. Hardware Energy Reading (`internal/device/`) |
| 58 | + |
| 59 | +The device layer provides the foundation for all power measurements: |
| 60 | + |
| 61 | +#### Energy Zones |
| 62 | + |
| 63 | +- **Package**: CPU package-level energy consumption |
| 64 | +- **Core**: Individual CPU core energy |
| 65 | +- **DRAM**: Memory subsystem energy |
| 66 | +- **Uncore**: Integrated graphics and other uncore components |
| 67 | +- **PSys**: Platform-level energy (most comprehensive when available) |
| 68 | + |
| 69 | +#### Key Interfaces |
| 70 | + |
| 71 | +- `EnergyZone`: Interface for reading energy from hardware zones |
| 72 | +- `CPUPowerMeter`: Main interface for accessing energy zones |
| 73 | +- `AggregatedZone`: Combines multiple zones of the same type |
| 74 | + |
| 75 | +#### Energy Types |
| 76 | + |
| 77 | +- **Energy**: Measured in microjoules (μJ) as cumulative counters |
| 78 | +- **Power**: Calculated as rate in microwatts (μW) using `Power = ΔEnergy / Δtime` |
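
To make the unit relationship concrete, here is a hedged sketch of deriving average power from two cumulative counter readings. The type and function names are hypothetical, and counter wraparound is deliberately left out because it is covered in the next subsection.

```go
package main

// Energy is a cumulative counter value in microjoules (hypothetical type).
type Energy uint64

// Power is an average rate in microwatts (hypothetical type).
type Power float64

// powerFromCounters returns the average power between two readings.
// Microjoules divided by seconds gives microwatts.
func powerFromCounters(prev, curr Energy, elapsedSeconds float64) Power {
	// This sketch returns 0 for a non-positive interval or a counter
	// that went backwards; wraparound needs separate handling.
	if elapsedSeconds <= 0 || curr < prev {
		return 0
	}
	return Power(float64(curr-prev) / elapsedSeconds)
}
```

A counter that advances by 200,000,000 μJ over 5 seconds corresponds to 40,000,000 μW, that is 40 W.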
| 79 | + |
| 80 | +#### Wraparound Handling |
| 81 | + |
| 82 | +Hardware energy counters have maximum values and wrap around to zero. |
| 83 | +Kepler handles this in `calculateEnergyDelta()`: |
| 84 | + |
| 85 | +```go |
| 86 | +func calculateEnergyDelta(current, previous, maxJoules Energy) Energy { |
| 87 | +	if current >= previous { |
| 88 | +		return current - previous |
| 89 | +	} |
| 90 | +	// Handle counter wraparound |
| 91 | +	if maxJoules > 0 { |
| 92 | +		return (maxJoules - previous) + current |
| 93 | +	} |
| 94 | +	return 0 // Unable to calculate delta |
| 95 | +} |
| 96 | +``` |
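
One way to apply the same wraparound rule across successive readings is a small stateful tracker. The `zoneTracker` type below is hypothetical and exists only to illustrate the idea; it is not part of Kepler's device layer.

```go
package main

// Energy is a cumulative counter value in microjoules (hypothetical type).
type Energy uint64

// zoneTracker remembers the previous reading of one energy zone.
type zoneTracker struct {
	prev   Energy
	max    Energy // counter maximum; 0 means unknown
	primed bool   // false until the first reading has been stored
}

// delta returns the energy consumed since the previous call, applying
// the same wraparound rule as calculateEnergyDelta above.
func (z *zoneTracker) delta(current Energy) Energy {
	if !z.primed {
		// The first reading only establishes a baseline.
		z.prev, z.primed = current, true
		return 0
	}
	var d Energy
	switch {
	case current >= z.prev:
		d = current - z.prev
	case z.max > 0:
		// The counter wrapped: remainder up to max plus the new value.
		d = (z.max - z.prev) + current
	}
	z.prev = current
	return d
}
```

With a maximum of 1000, the readings 900, 950, 30, 40 produce deltas of 0, 50, 80, and 10: the third reading wraps, so the delta is (1000 - 950) + 30 = 80.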
| 97 | + |
| 98 | +### 2. Node-Level Power Calculation (`internal/monitor/node.go`) |
| 99 | + |
| 100 | +The node calculation is the first step in power attribution, splitting total hardware energy into active and idle components. |
| 101 | + |
| 102 | +#### CPU Usage Calculation |
| 103 | + |
| 104 | +```go |
| 105 | +nodeCPUTimeDelta := pm.resources.Node().ProcessTotalCPUTimeDelta |
| 106 | +nodeCPUUsageRatio := pm.resources.Node().CPUUsageRatio |
| 107 | +``` |
| 108 | + |
| 109 | +#### Energy Split Algorithm |
| 110 | + |
| 111 | +For each energy zone, Kepler calculates: |
| 112 | + |
| 113 | +```go |
| 114 | +deltaEnergy := calculateEnergyDelta(absEnergy, prevZone.EnergyTotal, zone.MaxEnergy()) |
| 115 | +activeEnergy = Energy(float64(deltaEnergy) * nodeCPUUsageRatio) |
| 116 | +idleEnergy := deltaEnergy - activeEnergy |
| 117 | +``` |
| 118 | + |
| 119 | +**Key Principle**: Active energy represents the portion consumed by CPU-intensive work, while idle energy represents baseline system power consumption. |
| 120 | + |
| 121 | +#### Power Calculation |
| 122 | + |
| 123 | +```go |
| 124 | +powerF64 := float64(deltaEnergy) / float64(timeDiff) |
| 125 | +power = Power(powerF64) |
| 126 | +activePower = Power(powerF64 * nodeCPUUsageRatio) |
| 127 | +idlePower = power - activePower |
| 128 | +``` |
| 129 | + |
| 130 | +### 3. Process Power Attribution (`internal/monitor/process.go`) |
| 131 | + |
| 132 | +Individual processes receive power proportional to their CPU time usage relative to total system CPU time. |
| 133 | + |
| 134 | +#### Attribution Formula |
| 135 | + |
| 136 | +For each running process: |
| 137 | + |
| 138 | +```go |
| 139 | +cpuTimeRatio := proc.CPUTimeDelta / nodeCPUTimeDelta |
| 140 | +activeEnergy := Energy(cpuTimeRatio * float64(nodeZoneUsage.activeEnergy)) |
| 141 | +``` |
| 142 | + |
| 143 | +#### Power Assignment |
| 144 | + |
| 145 | +```go |
| 146 | +process.Zones[zone] = Usage{ |
| 147 | +	Power:       Power(cpuTimeRatio * nodeZoneUsage.ActivePower.MicroWatts()), |
| 148 | +	EnergyTotal: absoluteEnergy, |
| 149 | +} |
| 150 | +``` |
| 151 | + |
| 152 | +#### Cumulative Energy Tracking |
| 153 | + |
| 154 | +Process energy accumulates over time: |
| 155 | + |
| 156 | +```go |
| 157 | +absoluteEnergy := activeEnergy |
| 158 | +if prev, exists := prev.Processes[pid]; exists { |
| 159 | +	if prevUsage, hasZone := prev.Zones[zone]; hasZone { |
| 160 | +		absoluteEnergy += prevUsage.EnergyTotal |
| 161 | +	} |
| 162 | +} |
| 163 | +``` |
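
The accumulation step can be pictured as carrying totals from one snapshot to the next. The following is a simplified sketch with hypothetical names that tracks a single zone and keys processes by integer PID; the real snapshot structure is richer than this.

```go
package main

// Energy is an amount of energy in microjoules (hypothetical type).
type Energy uint64

// accumulate builds the next snapshot's cumulative totals from the
// previous totals and the energy attributed during this interval.
func accumulate(prevTotals, intervalEnergy map[int]Energy) map[int]Energy {
	next := make(map[int]Energy, len(intervalEnergy))
	for pid, energy := range intervalEnergy {
		// A PID missing from prevTotals reads as zero, so a newly
		// seen process starts from this interval's energy.
		next[pid] = prevTotals[pid] + energy
	}
	return next
}
```

In this sketch, a process that no longer appears in the interval data is simply absent from the next snapshot.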
| 164 | + |
| 165 | +## Attribution Flow Example |
| 166 | + |
| 167 | +Using the diagram as reference, here's how 40W total power gets attributed: |
| 168 | + |
| 169 | +### Step 1: Hardware Measurement |
| 170 | + |
| 171 | +- RAPL sensors report total energy consumption for the measurement interval |
| 172 | +- Convert to power: `40W total power` |
| 173 | + |
| 174 | +### Step 2: Node CPU Usage Analysis |
| 175 | + |
| 176 | +- System reports 25% CPU usage ratio |
| 177 | +- Split power: `40W × 25% = 10W active`, `40W - 10W = 30W idle` |
| 178 | + |
| 179 | +### Step 3: Process Attribution |
| 180 | + |
| 181 | +- VM process uses 100% of active CPU time |
| 182 | +- VM gets: `10W × (100% CPU usage) = 10W` |
| 183 | +- Container processes within VM get proportional shares of the 10W |
| 184 | + |
| 185 | +### Step 4: Workload Aggregation |
| 186 | + |
| 187 | +- **Container power** = sum of constituent process power |
| 188 | +- **VM power** = sum of all processes in the VM |
| 189 | +- **Pod power** = sum of container power (in Kubernetes) |
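
The aggregation rules listed above can be sketched as two summing passes. The `process` struct, the function names, and the container-to-pod lookup map are assumptions for illustration; they do not reflect Kepler's internal types.

```go
package main

// process is a hypothetical record of one process's attributed power.
type process struct {
	containerID     string  // empty when the process is not in a container
	powerMicroWatts float64 // attributed active power
}

// containerPower sums process power per container.
func containerPower(procs []process) map[string]float64 {
	totals := make(map[string]float64)
	for _, p := range procs {
		if p.containerID == "" {
			continue // not containerized: excluded from container totals
		}
		totals[p.containerID] += p.powerMicroWatts
	}
	return totals
}

// podPower sums container power per pod, using a container-to-pod lookup.
func podPower(containerToPod map[string]string, containers map[string]float64) map[string]float64 {
	totals := make(map[string]float64)
	for id, power := range containers {
		if pod, ok := containerToPod[id]; ok {
			totals[pod] += power
		}
	}
	return totals
}
```

Because each level is a plain sum of the level below, the pod total always equals the sum of its processes' power.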
| 190 | + |
| 191 | +## Key Principles |
| 192 | + |
| 193 | +### 1. Energy Conservation |
| 194 | + |
| 195 | +The total attributed power always equals the measured hardware power: |
| 196 | + |
| 197 | +```text |
| 198 | +Σ(Process Power) + Idle Power = Total Hardware Power |
| 199 | +``` |
| 200 | + |
| 201 | +### 2. Proportional Attribution |
| 202 | + |
| 203 | +Power distribution is strictly proportional to CPU time usage: |
| 204 | + |
| 205 | +```text |
| 206 | +Process Power = (Process CPU Time / Total CPU Time) × Active Power |
| 207 | +``` |
| 208 | + |
| 209 | +### 3. Hierarchical Aggregation |
| 210 | + |
| 211 | +Higher-level workloads inherit power from their constituent processes: |
| 212 | + |
| 213 | +- **Pods** = sum of container power |
| 214 | +- **Containers** = sum of process power |
| 215 | +- **VMs** = sum of process power |
| 216 | + |
| 217 | +### 4. Idle Power Handling |
| 218 | + |
| 219 | +Idle power represents baseline system consumption and is tracked separately but not attributed to individual workloads. |
| 220 | + |
| 221 | +## Implementation Details |
| 222 | + |
| 223 | +### Thread Safety |
| 224 | + |
| 225 | +- **Device Layer**: Not required to be thread-safe (single monitor goroutine) |
| 226 | +- **Monitor Layer**: All public methods except `Init()` must be thread-safe |
| 227 | +- **Singleflight Pattern**: Prevents redundant power calculations during concurrent requests |
| 228 | + |
| 229 | +### Data Freshness |
| 230 | + |
| 231 | +- A configurable staleness threshold caps the maximum age of cached power data |
| 232 | +- Atomic snapshots provide consistent power readings across all workloads |
| 233 | + |
| 234 | +### Terminated Process Handling |
| 235 | + |
| 236 | +- Terminated processes are tracked in a separate collection |
| 237 | +- Power attribution continues until the next export cycle |
| 238 | +- Priority-based retention manages memory usage |
| 239 | + |
| 240 | +### Error Handling |
| 241 | + |
| 242 | +- Individual zone read failures don't stop attribution |
| 243 | +- Graceful degradation when hardware sensors are unavailable |
| 244 | +- Comprehensive logging for debugging attribution issues |
| 245 | + |
| 246 | +## Configuration |
| 247 | + |
| 248 | +### Key Settings |
| 249 | + |
| 250 | +- **Collection Interval**: How frequently to read hardware sensors |
| 251 | +- **Staleness Threshold**: Maximum age of cached power data |
| 252 | +- **Zone Filtering**: Which RAPL zones to use for attribution |
| 253 | +- **Fake Meter**: Development mode for when hardware sensors are unavailable |
| 254 | + |
| 255 | +### Development Mode |
| 256 | + |
| 257 | +```bash |
| 258 | +# Use fake CPU meter for development |
| 259 | +sudo ./bin/kepler --dev.fake-cpu-meter.enabled --config hack/config.yaml |
| 260 | +``` |
| 261 | + |
| 262 | +## Monitoring and Debugging |
| 263 | + |
| 264 | +### Metrics Access |
| 265 | + |
| 266 | +- **Local**: `http://localhost:28282/metrics` |
| 267 | +- **Compose**: `http://localhost:28283/metrics` |
| 268 | +- **Grafana**: `http://localhost:23000` |
| 269 | + |
| 270 | +### Debug Options |
| 271 | + |
| 272 | +```bash |
| 273 | +# Enable debug logging |
| 274 | +--log.level=debug |
| 275 | + |
| 276 | +# Use stdout exporter for immediate inspection |
| 277 | +--exporter.stdout |
| 278 | + |
| 279 | +# Enable performance profiling |
| 280 | +--debug.pprof |
| 281 | +``` |
| 282 | + |
| 283 | +### Key Metrics |
| 284 | + |
| 285 | +- `kepler_node_cpu_watts{}`: Total node power consumption |
| 286 | +- `kepler_process_cpu_watts{}`: Individual process power |
| 287 | +- `kepler_container_cpu_watts{}`: Container-level aggregation |
| 288 | +- `kepler_vm_cpu_watts{}`: Virtual machine power attribution |
| 289 | + |
| 290 | +## Conclusion |
| 291 | + |
| 292 | +Kepler's power attribution system provides accurate, proportional distribution of hardware energy consumption to individual workloads. By using CPU utilization as the primary attribution factor and maintaining strict energy conservation, Kepler enables fine-grained energy accounting for modern containerized and virtualized environments. |
| 293 | + |
| 294 | +The implementation balances accuracy with performance, providing thread-safe concurrent access while minimizing the overhead of continuous power monitoring. |