| 1 | +# EP-001: Redfish Power Monitoring Support |
| 2 | + |
| 3 | +- **Status**: Draft |
| 4 | +- **Author**: Sunil Thaha |
| 5 | +- **Created**: 2025-08-14 |
| 6 | + |
| 7 | +## Summary |
| 8 | + |
| 9 | +Add Redfish BMC power monitoring to Kepler for platform-level power consumption data, |
| 10 | +complementing existing RAPL CPU monitoring to provide comprehensive server power |
| 11 | +visibility. |
| 12 | + |
| 13 | +## Problem |
| 14 | + |
| 15 | +Kepler currently measures only CPU power via Intel RAPL, missing: |
| 16 | + |
| 17 | +- Platform power (PSU, cooling, storage, network) |
| 18 | +- Multi-vendor support (AMD, ARM systems) |
| 19 | +- BMC integration capabilities already present in data centers |
| 20 | + |
| 21 | +## Goals |
| 22 | + |
| 23 | +- Add Redfish BMC power monitoring capability |
| 24 | +- Support Kubernetes, bare metal, and standalone deployments |
| 25 | +- Integrate with existing Kepler architecture |
| 26 | +- Maintain security best practices |
| 27 | + |
| 28 | +## Non-Goals |
| 29 | + |
| 30 | +- Replace RAPL monitoring (complementary) |
| 31 | +- Support non-Redfish protocols (IPMI) initially |
| 32 | +- Implement power control features |
| 33 | +- Advanced resilience patterns in v1 |
| 34 | + |
| 35 | +## Solution |
| 36 | + |
| 37 | +Add a platform service layer that collects BMC power data via Redfish and exposes it |
| 38 | +through Prometheus collectors, separately from CPU power attribution. |
| 39 | + |
| 40 | +```mermaid |
| 41 | +C4Container |
| 42 | + title Container Diagram - Kepler Power Monitoring |
| 43 | +
|
| 44 | + Person(user, "User", "Prometheus/Grafana") |
| 45 | +
|
| 46 | + Container_Boundary(kepler, "Kepler") { |
| 47 | + Component(rapl, "RAPL Reader", "Go", "CPU power") |
| 48 | + Component(redfish, "Redfish Client", "Go", "Platform power") |
| 49 | + Component(monitor, "Power Monitor", "Go", "Attribution") |
| 50 | + Component(platform, "Platform Collector", "Go", "Platform metrics") |
| 51 | + Component(exporter, "Prometheus Exporter", "Go", "Metrics endpoint") |
| 52 | + } |
| 53 | +
|
| 54 | + System_Ext(bmc, "BMC", "Redfish API") |
| 55 | + System_Ext(kernel, "Linux", "RAPL sysfs") |
| 56 | +
|
| 57 | + Rel(rapl, kernel, "Reads") |
| 58 | + Rel(redfish, bmc, "Queries") |
| 59 | + Rel(monitor, rapl, "Uses") |
| 60 | + Rel(platform, redfish, "Uses") |
| 61 | + Rel(exporter, monitor, "Collects") |
| 62 | + Rel(exporter, platform, "Collects") |
| 63 | + Rel(user, exporter, "Scrapes") |
| 64 | +``` |
| 65 | + |
| 66 | +## Node Identification |
| 67 | + |
| 68 | +Each node is identified by the `--platform.redfish.node-id` flag or the `platform.redfish.nodeID` |
| 69 | +config option; the value must match a node identifier in the BMC configuration file. For example: |
| 70 | + |
| 71 | +```bash |
| 72 | +kepler --platform.redfish.node-id=worker-1 |
| 73 | +``` |
| 74 | + |
| 75 | +The equivalent setting in the configuration file is `platform.redfish.nodeID`: |
| 76 | + |
| 77 | +```yaml |
| 78 | +# config.yaml |
| 79 | + |
| 80 | +platform: |
| 81 | +  redfish: |
| 82 | +    nodeID: worker-1 |
| 83 | +``` |
| 84 | +
|
| 85 | +```mermaid |
| 86 | +flowchart LR |
| 87 | + A[Node Start] --> B{Node ID?} |
| 88 | + B -->|Yes| C[Load BMC Config] |
| 89 | + B -->|No| D[RAPL Only] |
| 90 | + C --> E{BMC Found?} |
| 91 | + E -->|Yes| F[Connect & Monitor] |
| 92 | + E -->|No| D |
| 93 | +``` |
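The flowchart above can be read as a two-step lookup with a RAPL-only fallback. The following is a minimal sketch, assuming hypothetical `BMCConfig` and `BMCDetail` types that mirror the proposed BMC configuration file; neither the type names nor `resolveBMC` come from the Kepler codebase.

```go
package main

import "fmt"

// BMCDetail and BMCConfig are hypothetical types mirroring the proposed
// BMC configuration file: a node-to-BMC map plus per-BMC connection details.
type BMCDetail struct {
	Endpoint string
	Username string
	Password string
	Insecure bool
}

type BMCConfig struct {
	Nodes map[string]string
	BMCs  map[string]BMCDetail
}

// resolveBMC follows the flowchart: an empty node ID, an unmapped node, or
// a mapping to an undefined BMC all yield ok=false ("RAPL only").
func resolveBMC(cfg BMCConfig, nodeID string) (string, BMCDetail, bool) {
	if nodeID == "" {
		return "", BMCDetail{}, false
	}
	bmcID, ok := cfg.Nodes[nodeID]
	if !ok {
		return "", BMCDetail{}, false
	}
	detail, ok := cfg.BMCs[bmcID]
	if !ok {
		return "", BMCDetail{}, false
	}
	return bmcID, detail, true
}

func main() {
	cfg := BMCConfig{
		Nodes: map[string]string{"baremetal-worker-1": "bmc-1"},
		BMCs:  map[string]BMCDetail{"bmc-1": {Endpoint: "https://192.168.1.100"}},
	}
	for _, id := range []string{"baremetal-worker-1", "unknown-node", ""} {
		if bmcID, detail, ok := resolveBMC(cfg, id); ok {
			fmt.Printf("%q: monitor %s at %s\n", id, bmcID, detail.Endpoint)
		} else {
			fmt.Printf("%q: RAPL only\n", id)
		}
	}
}
```

Returning a boolean rather than an error keeps the fallback path cheap for callers; a real implementation would probably also log which step failed.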
| 94 | + |
| 95 | +## Implementation |
| 96 | + |
| 97 | +### Package Structure |
| 98 | + |
| 99 | +```mermaid |
| 100 | +graph TD |
| 101 | + subgraph "internal/" |
| 102 | + A[platform/redfish/<br/>service.go<br/>config.go<br/>client.go] |
| 103 | + B[exporter/prometheus/<br/>collector/<br/>platform_collector.go] |
| 104 | + end |
| 105 | + A --> B |
| 106 | +``` |
| 107 | + |
| 108 | +### Service Interfaces |
| 109 | + |
| 110 | +Implements standard Kepler patterns: |
| 111 | + |
| 112 | +- `service.Initializer`: Configuration and connection setup |
| 113 | +- `service.Runner`: Periodic power collection with context |
| 114 | +- `service.Shutdowner`: Clean resource release |
| 115 | + |
| 116 | +### Configuration |
| 117 | + |
| 118 | +**Kepler Config Structure:** |
| 119 | + |
| 120 | +```go |
| 121 | +type Platform struct { |
| 122 | +	Redfish Redfish `yaml:"redfish"` |
| 123 | +} |
| 124 | + |
| 125 | +type Redfish struct { |
| 126 | +	Enabled    *bool  `yaml:"enabled"` |
| 127 | +	NodeID     string `yaml:"nodeID"` |
| 128 | +	ConfigFile string `yaml:"configFile"` |
| 129 | +} |
| 130 | +``` |
| 131 | + |
| 132 | +**CLI Flags:** |
| 133 | + |
| 134 | +```bash |
| 135 | +--platform.redfish.enabled=true |
| 136 | +--platform.redfish.node-id=worker-1 |
| 137 | +--platform.redfish.config=/etc/kepler/redfish.yaml |
| 138 | +``` |
| 139 | + |
| 140 | +**Main Configuration (`hack/config.yaml`):** |
| 141 | + |
| 142 | +```yaml |
| 143 | +# ... existing config sections ... |
| 144 | + |
| 145 | +platform: |
| 146 | +  redfish: |
| 147 | +    enabled: true |
| 148 | +    nodeID: "worker-1" # Node identifier for BMC mapping |
| 149 | +    configFile: "/etc/kepler/redfish.yaml" |
| 150 | +``` |
| 151 | +
|
| 152 | +**BMC Configuration (`/etc/kepler/redfish.yaml`):** |
| 153 | + |
| 154 | +The configuration separates node-to-BMC mappings from BMC credentials for several reasons: |
| 155 | + |
| 156 | +- **Multi-tenant deployments**: Multiple VMs/nodes can share the same BMC (blade servers, virtualized environments) |
| 157 | +- **Credential reuse**: Same BMC credentials can be shared across multiple node mappings |
| 158 | +- **Operational flexibility**: Easy to reassign nodes to different BMCs without credential changes |
| 159 | + |
| 160 | +```yaml |
| 161 | +nodes: |
| 162 | +  baremetal-worker-1: bmc-1 |
| 163 | +  baremetal-worker-2: bmc-2 |
| 164 | +  vm_worker-3: BMC_2_VM |
| 165 | +  vm_worker-4: BMC_2_VM |
| 166 | + |
| 167 | +bmcs: |
| 168 | +  bmc-1: |
| 169 | +    endpoint: "https://192.168.1.100" |
| 170 | +    username: "admin" |
| 171 | +    password: "secret" |
| 172 | +    insecure: true # Skip TLS certificate verification |
| 173 | + |
| 174 | +  bmc-2: |
| 175 | +    endpoint: "https://192.168.1.101" |
| 176 | +    username: "admin" |
| 177 | +    password: "secret456" |
| 178 | +    insecure: false # Verify TLS certificates |
| 179 | + |
| 180 | +  BMC_2_VM: |
| 181 | +    endpoint: "https://192.168.1.103" |
| 182 | +    username: "admin" |
| 183 | +    password: "secret456" |
| 184 | +    insecure: false |
| 185 | + |
| 186 | +``` |
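Because nodes and BMCs live in separate maps, a node can point at a BMC that was never defined. A load-time check along these lines could catch that early. This is only a sketch: the `BMC` type and `validateMappings` are hypothetical names, and the `bmc-3` entry is a deliberate typo added for illustration.

```go
package main

import (
	"fmt"
	"sort"
)

// BMC is a hypothetical type mirroring one entry under `bmcs`.
type BMC struct {
	Endpoint string
	Username string
	Password string
	Insecure bool
}

// validateMappings returns one message per problem, sorted for stable output.
func validateMappings(nodes map[string]string, bmcs map[string]BMC) []string {
	var problems []string
	for node, bmcID := range nodes {
		bmc, ok := bmcs[bmcID]
		switch {
		case !ok:
			problems = append(problems,
				fmt.Sprintf("node %q references undefined BMC %q", node, bmcID))
		case bmc.Endpoint == "":
			problems = append(problems,
				fmt.Sprintf("BMC %q (node %q) has no endpoint", bmcID, node))
		}
	}
	sort.Strings(problems)
	return problems
}

func main() {
	nodes := map[string]string{
		"baremetal-worker-1": "bmc-1",
		"baremetal-worker-2": "bmc-3", // illustrative typo: bmc-3 is not defined
		"vm_worker-3":        "BMC_2_VM",
		"vm_worker-4":        "BMC_2_VM",
	}
	bmcs := map[string]BMC{
		"bmc-1":    {Endpoint: "https://192.168.1.100"},
		"BMC_2_VM": {Endpoint: "https://192.168.1.103"},
	}
	for _, p := range validateMappings(nodes, bmcs) {
		fmt.Println(p)
	}
}
```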
| 187 | + |
| 188 | +## Metrics |
| 189 | + |
| 190 | +Platform-level metrics are introduced in a separate metric namespace to distinguish them from |
| 191 | +node-level power attribution. While Kepler's existing metrics attribute power consumption |
| 192 | +to workloads running on a node, platform metrics represent the total power consumed by |
| 193 | +the underlying bare metal server (via BMC), regardless of whether Kepler runs on bare |
| 194 | +metal or within a VM. This separation enables: |
| 195 | + |
| 196 | +- Multiple VMs on the same bare metal to report the same platform power |
| 197 | +- Clear distinction between attributed workload power and total platform power |
| 198 | +- Aggregation by BMC ID to get actual bare metal consumption: `max by(bmc) (kepler_platform_watts)` |
| 199 | + |
| 200 | +**Important**: This implementation uses a **power-only (Watts) approach**. |
| 201 | +Energy counters (`kepler_platform_joules_total`) are not supported because: |
| 202 | + |
| 203 | +- Redfish does not provide native energy counters |
| 204 | +- BMC polling is intermittent (every 10 seconds), so energy integrated from these samples would be less accurate than a continuous counter |
| 205 | + |
| 206 | +```prometheus |
| 207 | +# Platform power metrics (bare metal power consumption) |
| 208 | +kepler_platform_watts{source="redfish",node_name="worker-1",bmc="bmc-1",chassis_id="System.Embedded.1"} 450.5 |
| 209 | +
|
| 210 | +# Existing node metrics unchanged (workload attribution) |
| 211 | +kepler_node_cpu_watts{zone="package",node_name="worker-1"} 125.2 |
| 212 | +``` |
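The reason for `max by(bmc)` rather than a plain sum is that every VM on a shared BMC reports the same platform reading. The sketch below reproduces that aggregation in Go, using hypothetical `Sample` values (the 600 W readings are invented for illustration).

```go
package main

import "fmt"

// Sample is a hypothetical flattened view of one kepler_platform_watts series.
type Sample struct {
	Node  string
	BMC   string
	Watts float64
}

// maxByBMC keeps the highest reading per BMC, like `max by(bmc) (...)`.
func maxByBMC(samples []Sample) map[string]float64 {
	out := make(map[string]float64)
	for _, s := range samples {
		if cur, ok := out[s.BMC]; !ok || s.Watts > cur {
			out[s.BMC] = s.Watts
		}
	}
	return out
}

// sum adds up all values in the map.
func sum(m map[string]float64) float64 {
	var total float64
	for _, v := range m {
		total += v
	}
	return total
}

func main() {
	samples := []Sample{
		{Node: "baremetal-worker-1", BMC: "bmc-1", Watts: 450.5},
		{Node: "vm_worker-3", BMC: "BMC_2_VM", Watts: 600}, // same host...
		{Node: "vm_worker-4", BMC: "BMC_2_VM", Watts: 600}, // ...same reading
	}

	var naive float64
	for _, s := range samples {
		naive += s.Watts
	}
	fmt.Printf("naive sum:         %.1f W\n", naive)                  // double-counts BMC_2_VM
	fmt.Printf("sum of max by bmc: %.1f W\n", sum(maxByBMC(samples))) // one value per BMC
}
```

In PromQL the second figure corresponds to wrapping the per-BMC maximum in an outer `sum(...)`.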
| 213 | + |
| 214 | +## Error Handling |
| 215 | + |
| 216 | +- Connection failures: Log errors and continue to run (instead of terminating) |
| 217 | +- Authentication errors: Retry once, then disable for node |
| 218 | +- Timeouts: 30-second context timeout for BMC requests |
| 219 | +- Graceful degradation: fall back to RAPL-only monitoring when BMCs are unavailable |
| 220 | + |
| 221 | +## Security |
| 222 | + |
| 223 | +- Credentials in Kubernetes secrets or secure files (mode 0600) |
| 224 | +- No credential logging |
| 225 | +- Require explicit opt-in via configuration |
| 226 | + |
| 227 | +## Implementation Phases |
| 228 | + |
| 229 | +1. **Foundation**: Dependencies, service structure, config parsing |
| 230 | +2. **Core**: Gofish integration, power collection, service interface |
| 231 | +3. **Metrics**: Platform collector, Prometheus registration |
| 232 | +4. **Testing**: Unit, integration, multi-vendor validation |
| 233 | +5. **Release**: Documentation, migration guides |
| 234 | + |
| 235 | +## Testing Strategy |
| 236 | + |
| 237 | +- Unit tests with mocked Redfish responses |
| 238 | +- Integration tests with Redfish simulator |
| 239 | +- Performance impact validation (<2% overhead target compared to baseline Kepler) |
| 240 | + |
| 241 | +## Migration |
| 242 | + |
| 243 | +- **Backward Compatible**: No breaking changes, opt-in feature |
| 244 | +- **Phased Rollout**: Test subset before full deployment |
| 245 | +- **Rollback**: Disable via config flag; Kepler continues with RAPL-only monitoring |
| 246 | + |
| 247 | +## Risks and Mitigations |
| 248 | + |
| 249 | +| Risk | Mitigation | |
| 250 | +|----------------------|-----------------------------------------| |
| 251 | +| BMC connectivity | Retry logic, graceful degradation | |
| 252 | +| Vendor compatibility | Multi-vendor testing | |
| 253 | +| Performance impact | <2% overhead validation | |
| 254 | +| Security | Secure credential handling, TLS default | |
| 255 | + |
| 256 | +## Future Enhancements |
| 257 | + |
| 258 | +- Circuit breaker patterns |
| 259 | +- Exponential backoff strategies |
| 260 | +- External secret integration |
| 261 | +- Chassis sub-component power zones |
| 262 | + |
| 263 | +## Open Questions |
| 264 | + |
| 265 | +1. Multi-chassis server handling? |
| 266 | +2. Sub-component power exposure (PSU, fans)? |