Commit a65dd68

Merge pull request #2245 from sthaha/feat-redfish
doc: enhancement proposal for supporting Redfish
2 parents ca4d92d + 7696418 commit a65dd68

2 files changed

+267
-0
lines changed

Lines changed: 266 additions & 0 deletions
@@ -0,0 +1,266 @@
# EP-001: Redfish Power Monitoring Support

- **Status**: Draft
- **Author**: Sunil Thaha
- **Created**: 2025-08-14

## Summary

Add Redfish BMC power monitoring to Kepler for platform-level power consumption data,
complementing existing RAPL CPU monitoring to provide comprehensive server power
visibility.

## Problem

Kepler currently measures only CPU power via Intel RAPL, missing:

- Platform power (PSU, cooling, storage, network)
- Multi-vendor support (AMD, ARM systems)
- BMC integration capabilities already present in data centers

## Goals

- Add Redfish BMC power monitoring capability
- Support Kubernetes, bare metal, and standalone deployments
- Integrate with existing Kepler architecture
- Maintain security best practices

## Non-Goals

- Replace RAPL monitoring (complementary)
- Support non-Redfish protocols (IPMI) initially
- Implement power control features
- Advanced resilience patterns in v1

35+
## Solution
36+
37+
Add platform service layer collecting BMC power data via Redfish, exposed through
38+
Prometheus collectors separately from CPU power attribution.
39+
40+
```mermaid
41+
C4Container
42+
title Container Diagram - Kepler Power Monitoring
43+
44+
Person(user, "User", "Prometheus/Grafana")
45+
46+
Container_Boundary(kepler, "Kepler") {
47+
Component(rapl, "RAPL Reader", "Go", "CPU power")
48+
Component(redfish, "Redfish Client", "Go", "Platform power")
49+
Component(monitor, "Power Monitor", "Go", "Attribution")
50+
Component(platform, "Platform Collector", "Go", "Platform metrics")
51+
Component(exporter, "Prometheus Exporter", "Go", "Metrics endpoint")
52+
}
53+
54+
System_Ext(bmc, "BMC", "Redfish API")
55+
System_Ext(kernel, "Linux", "RAPL sysfs")
56+
57+
Rel(rapl, kernel, "Reads")
58+
Rel(redfish, bmc, "Queries")
59+
Rel(monitor, rapl, "Uses")
60+
Rel(platform, redfish, "Uses")
61+
Rel(exporter, monitor, "Collects")
62+
Rel(exporter, platform, "Collects")
63+
Rel(user, exporter, "Scrapes")
64+
```
65+
## Node Identification

Each node is identified by the `--platform.redfish.node-id` flag or the
`platform.redfish.nodeID` config option. The value must match a node identifier in
the BMC configuration file. For example, on the command line:

```bash
kepler --platform.redfish.node-id=worker-1
```

The same node ID set through configuration (`platform.redfish.nodeID`):

```yaml
# config.yaml

platform:
  redfish:
    nodeID: worker-1
```

If no node ID is set, or no BMC is mapped to it, Kepler runs with RAPL only:

```mermaid
flowchart LR
    A[Node Start] --> B{Node ID?}
    B -->|Yes| C[Load BMC Config]
    B -->|No| D[RAPL Only]
    C --> E{BMC Found?}
    E -->|Yes| F[Connect & Monitor]
    E -->|No| D
```
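
The decision flow in the chart can be sketched in Go as below. This is a minimal illustration, not the proposed implementation: `Mode`, `BMCMap`, and `resolveMode` are hypothetical names, and the BMC configuration is reduced to a plain node-ID-to-endpoint map.

```go
package main

import "fmt"

// Mode says which power sources a node ends up using (hypothetical type).
type Mode string

const (
	ModeRAPLOnly Mode = "rapl-only"
	ModeRedfish  Mode = "rapl+redfish"
)

// BMCMap is a simplified, assumed view of the BMC configuration:
// node ID -> BMC endpoint.
type BMCMap map[string]string

// resolveMode mirrors the flowchart: a missing node ID or a node ID
// without a mapped BMC falls back to RAPL only; otherwise the BMC
// endpoint to connect to is returned.
func resolveMode(nodeID string, bmcs BMCMap) (Mode, string) {
	if nodeID == "" {
		return ModeRAPLOnly, ""
	}
	endpoint, ok := bmcs[nodeID]
	if !ok {
		return ModeRAPLOnly, ""
	}
	return ModeRedfish, endpoint
}

func main() {
	bmcs := BMCMap{"worker-1": "https://192.168.1.100"}
	for _, id := range []string{"worker-1", "worker-9", ""} {
		mode, endpoint := resolveMode(id, bmcs)
		fmt.Printf("node=%q mode=%s endpoint=%q\n", id, mode, endpoint)
	}
}
```

Both fallback branches return the same result, so callers only need to distinguish "Redfish available" from "RAPL only".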

## Implementation

### Package Structure

```mermaid
graph TD
    subgraph "internal/"
        A[platform/redfish/<br/>service.go<br/>config.go<br/>client.go]
        B[exporter/prometheus/<br/>collector/<br/>platform_collector.go]
    end
    A --> B
```

### Service Interfaces

Implements standard Kepler patterns:

- `service.Initializer`: Configuration and connection setup
- `service.Runner`: Periodic power collection with context
- `service.Shutdowner`: Clean resource release
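
A rough sketch of how a Redfish service could follow this lifecycle is shown below. The three interfaces are local stand-ins whose method sets are assumed, not copied from Kepler's `service` package; `RedfishService`, `PowerReader`, and `newDemoService` are hypothetical names, and the reader is a stub rather than a real BMC client.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// Assumed stand-ins for Kepler's service interfaces; the real
// definitions may differ.
type Initializer interface{ Init() error }
type Runner interface{ Run(ctx context.Context) error }
type Shutdowner interface{ Shutdown() error }

// PowerReader is a hypothetical abstraction over the Redfish client.
type PowerReader func(ctx context.Context) (float64, error)

// RedfishService collects platform power readings on a fixed interval.
type RedfishService struct {
	read     PowerReader
	interval time.Duration
	samples  []float64
	ready    bool
}

// Init validates configuration before the service is started.
func (s *RedfishService) Init() error {
	if s.read == nil {
		return fmt.Errorf("redfish: no power reader configured")
	}
	s.ready = true
	return nil
}

// Run polls the reader until the context is cancelled.
func (s *RedfishService) Run(ctx context.Context) error {
	ticker := time.NewTicker(s.interval)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return nil
		case <-ticker.C:
			watts, err := s.read(ctx)
			if err != nil {
				continue // a failed read is skipped; the loop keeps running
			}
			s.samples = append(s.samples, watts)
		}
	}
}

// Shutdown releases resources; here it only clears the ready flag.
func (s *RedfishService) Shutdown() error {
	s.ready = false
	return nil
}

// Compile-time checks that all three interfaces are satisfied.
var (
	_ Initializer = (*RedfishService)(nil)
	_ Runner      = (*RedfishService)(nil)
	_ Shutdowner  = (*RedfishService)(nil)
)

// newDemoService returns a service backed by a constant stub reading.
func newDemoService() *RedfishService {
	return &RedfishService{
		read:     func(context.Context) (float64, error) { return 450.5, nil },
		interval: 10 * time.Millisecond,
	}
}

func main() {
	svc := newDemoService()
	if err := svc.Init(); err != nil {
		panic(err)
	}
	ctx, cancel := context.WithTimeout(context.Background(), 55*time.Millisecond)
	defer cancel()
	_ = svc.Run(ctx)
	_ = svc.Shutdown()
	fmt.Println("samples collected:", len(svc.samples))
}
```

Tying the polling loop to the context lets the caller stop collection without any extra signalling.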

### Configuration

**Kepler Config Structure:**

```go
// Platform groups platform-level power monitoring settings.
type Platform struct {
	Redfish Redfish `yaml:"redfish"`
}

// Redfish configures BMC power monitoring via Redfish.
type Redfish struct {
	Enabled    *bool  `yaml:"enabled"`    // pointer, so "unset" differs from "false"
	NodeID     string `yaml:"nodeID"`     // node identifier used for BMC mapping
	ConfigFile string `yaml:"configFile"` // path to the BMC configuration file
}
```

**CLI Flags:**

```bash
--platform.redfish.enabled=true
--platform.redfish.node-id=worker-1
--platform.redfish.config=/etc/kepler/redfish.yaml
```

**Main Configuration (`hack/config.yaml`):**

```yaml
# ... existing config sections ...

platform:
  redfish:
    enabled: true
    nodeID: "worker-1" # Node identifier for BMC mapping
    configFile: "/etc/kepler/redfish.yaml"
```

**BMC Configuration (`/etc/kepler/redfish.yaml`):**

The configuration separates node-to-BMC mappings from BMC credentials for several reasons:

- **Multi-tenant deployments**: Multiple VMs/nodes can share the same BMC (blade servers, virtualized environments)
- **Credential reuse**: Same BMC credentials can be shared across multiple node mappings
- **Operational flexibility**: Easy to reassign nodes to different BMCs without credential changes

```yaml
nodes:
  baremetal-worker-1: bmc-1
  baremetal-worker-2: bmc-2
  vm_worker-3: BMC_2_VM
  vm_worker-4: BMC_2_VM

bmcs:
  bmc-1:
    endpoint: "https://192.168.1.100"
    username: "admin"
    password: "secret"
    insecure: true # Skip TLS certificate verification

  bmc-2:
    endpoint: "https://192.168.1.101"
    username: "admin"
    password: "secret456"
    insecure: false # Verify TLS certificates

  BMC_2_VM:
    endpoint: "https://192.168.1.103"
    username: "admin"
    password: "secret456"
    insecure: false
```
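
Because nodes and BMCs are kept in separate maps, a node can reference a BMC ID that is not defined. The sketch below shows one way to model the two maps in Go and catch such dangling references at load time. `BMC`, `BMCConfig`, `Validate`, and `BMCFor` are illustrative names, not part of the proposal, and YAML parsing is left out so the example needs only the standard library.

```go
package main

import (
	"fmt"
	"sort"
)

// BMC holds connection details for one BMC (field names assumed).
type BMC struct {
	Endpoint string
	Username string
	Password string
	Insecure bool // true skips TLS certificate verification
}

// BMCConfig mirrors the two top-level maps of the YAML file.
type BMCConfig struct {
	Nodes map[string]string // node ID -> BMC ID
	BMCs  map[string]BMC    // BMC ID -> connection details
}

// Validate reports nodes that reference an undefined BMC.
func (c BMCConfig) Validate() error {
	var dangling []string
	for node, id := range c.Nodes {
		if _, ok := c.BMCs[id]; !ok {
			dangling = append(dangling, node)
		}
	}
	if len(dangling) > 0 {
		sort.Strings(dangling) // stable error text despite map iteration order
		return fmt.Errorf("nodes reference undefined BMCs: %v", dangling)
	}
	return nil
}

// BMCFor resolves a node ID to its BMC ID and connection details.
func (c BMCConfig) BMCFor(nodeID string) (string, BMC, bool) {
	id, ok := c.Nodes[nodeID]
	if !ok {
		return "", BMC{}, false
	}
	b, ok := c.BMCs[id]
	return id, b, ok
}

func main() {
	cfg := BMCConfig{
		Nodes: map[string]string{
			"vm_worker-3": "BMC_2_VM",
			"vm_worker-4": "BMC_2_VM",
		},
		BMCs: map[string]BMC{
			"BMC_2_VM": {Endpoint: "https://192.168.1.103", Username: "admin"},
		},
	}
	if err := cfg.Validate(); err != nil {
		fmt.Println("invalid config:", err)
		return
	}
	id, b, _ := cfg.BMCFor("vm_worker-4")
	fmt.Println(id, b.Endpoint)
}
```

Two VM nodes resolving to the same BMC entry is the credential-reuse case described above.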

## Metrics

Platform-level metrics are introduced as a separate metric namespace to distinguish them
from node-level power attribution. While Kepler's existing metrics attribute power
consumption to workloads running on a node, platform metrics represent the total power
consumed by the underlying bare metal server (via BMC), regardless of whether Kepler runs
on bare metal or within a VM. This separation enables:

- Multiple VMs on the same bare metal to report the same platform power
- Clear distinction between attributed workload power and total platform power
- Aggregation by BMC ID to get actual bare metal consumption: `max by(bmc) (kepler_platform_watts)`
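
The reason for aggregating with `max by(bmc)` instead of a plain sum can be seen in a small Go sketch. The sample values are made up for illustration, and `Sample`, `maxByBMC`, and `totalWatts` are hypothetical helpers. Two VMs that share a BMC report the same reading, so summing raw series would count that server twice.

```go
package main

import "fmt"

// Sample is one kepler_platform_watts series value (illustrative).
type Sample struct {
	Node  string
	BMC   string
	Watts float64
}

// maxByBMC keeps the highest reading per BMC, like `max by(bmc) (...)`.
func maxByBMC(samples []Sample) map[string]float64 {
	out := make(map[string]float64)
	for _, s := range samples {
		if cur, ok := out[s.BMC]; !ok || s.Watts > cur {
			out[s.BMC] = s.Watts
		}
	}
	return out
}

// totalWatts adds up the per-BMC values to get fleet-wide power.
func totalWatts(perBMC map[string]float64) float64 {
	total := 0.0
	for _, w := range perBMC {
		total += w
	}
	return total
}

func main() {
	samples := []Sample{
		{Node: "baremetal-worker-1", BMC: "bmc-1", Watts: 450.5},
		{Node: "vm_worker-3", BMC: "BMC_2_VM", Watts: 600},
		{Node: "vm_worker-4", BMC: "BMC_2_VM", Watts: 600}, // same server as vm_worker-3
	}

	naive := 0.0
	for _, s := range samples {
		naive += s.Watts
	}
	perBMC := maxByBMC(samples)

	fmt.Println("naive sum over nodes:", naive)
	fmt.Println("per-BMC readings:", len(perBMC))
	fmt.Println("sum of per-BMC maxima:", totalWatts(perBMC))
}
```

In PromQL, the fleet-wide equivalent of the last line would be `sum(max by(bmc) (kepler_platform_watts))`.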

**Important**: This implementation uses a **power-only (Watts) approach**.
Energy counters (`kepler_platform_joules_total`) are not supported because:

- Redfish does not provide native energy counters
- BMC polling is intermittent (every 10 seconds) rather than continuous

```prometheus
# Platform power metrics (bare metal power consumption)
kepler_platform_watts{source="redfish",node_name="worker-1",bmc="bmc-1",chassis_id="System.Embedded.1"} 450.5

# Existing node metrics unchanged (workload attribution)
kepler_node_cpu_watts{zone="package",node_name="worker-1"} 125.2
```

## Error Handling

- Connection failures: Log errors and continue to run (instead of terminating)
- Authentication errors: Retry once, then disable for node
- Timeouts: 30-second context timeout for BMC requests
- Graceful degradation when BMCs unavailable
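
The timeout and retry-once rules above could look roughly like the following sketch. `ErrAuth`, `readFunc`, `readWithPolicy`, `flakyRead`, and `demo` are hypothetical names; a real client would map Redfish HTTP 401/403 responses to the authentication error, which is assumed here rather than shown.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// ErrAuth is a hypothetical sentinel for BMC authentication failures.
var ErrAuth = errors.New("authentication failed")

// readFunc stands in for a single Redfish power query.
type readFunc func(ctx context.Context) (float64, error)

// readWithPolicy gives each attempt its own timeout and retries once on
// an authentication error. disable is true only when authentication
// fails on both attempts; other errors are returned for the caller to
// log while it keeps running.
func readWithPolicy(parent context.Context, timeout time.Duration, read readFunc) (watts float64, disable bool, err error) {
	for attempt := 0; attempt < 2; attempt++ {
		ctx, cancel := context.WithTimeout(parent, timeout)
		watts, err = read(ctx)
		cancel()
		if !errors.Is(err, ErrAuth) {
			return watts, false, err
		}
	}
	return 0, true, err
}

// flakyRead returns a reader that fails authentication on its first n calls.
func flakyRead(n int) readFunc {
	calls := 0
	return func(ctx context.Context) (float64, error) {
		calls++
		if calls <= n {
			return 0, ErrAuth
		}
		return 450.5, nil
	}
}

// demo runs the policy with the 30-second timeout from the list above.
func demo(failures int) (float64, bool, error) {
	return readWithPolicy(context.Background(), 30*time.Second, flakyRead(failures))
}

func main() {
	for _, failures := range []int{0, 1, 2} {
		watts, disable, err := demo(failures)
		fmt.Printf("failures=%d watts=%.1f disable=%t err=%v\n", failures, watts, disable, err)
	}
}
```

Returning `disable` separately from `err` lets the caller tell "stop polling this BMC" apart from a transient failure that should only be logged.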

## Security

- Credentials in Kubernetes secrets or secure files (mode 0600)
- No credential logging
- Require explicit opt-in via configuration
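
One possible way to enforce the file-mode requirement is to refuse credential files that group or other users can access. The proposal does not specify such a check, so the sketch below is an assumption, and `tooPermissive` and `checkCredentialFile` are hypothetical helpers.

```go
package main

import (
	"fmt"
	"io/fs"
	"os"
)

// tooPermissive reports whether group or others have any access bits set.
func tooPermissive(mode fs.FileMode) bool {
	return mode.Perm()&0o077 != 0
}

// checkCredentialFile rejects files that are accessible beyond the owner.
func checkCredentialFile(path string) error {
	info, err := os.Stat(path)
	if err != nil {
		return fmt.Errorf("stat %s: %w", path, err)
	}
	if tooPermissive(info.Mode()) {
		return fmt.Errorf("%s has mode %04o; expected 0600 or stricter",
			path, uint32(info.Mode().Perm()))
	}
	return nil
}

func main() {
	// Demonstrate on a temporary file instead of a real credential file.
	f, err := os.CreateTemp("", "redfish-*.yaml")
	if err != nil {
		panic(err)
	}
	defer os.Remove(f.Name())
	f.Close()

	for _, mode := range []fs.FileMode{0o644, 0o600} {
		if err := os.Chmod(f.Name(), mode); err != nil {
			panic(err)
		}
		fmt.Printf("mode %04o -> %v\n", uint32(mode), checkCredentialFile(f.Name()))
	}
}
```

The check applies to Unix permission bits; credentials mounted from Kubernetes secrets would need their mode set on the volume instead.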

## Implementation Phases

1. **Foundation**: Dependencies, service structure, config parsing
2. **Core**: Gofish integration, power collection, service interface
3. **Metrics**: Platform collector, Prometheus registration
4. **Testing**: Unit, integration, multi-vendor validation
5. **Release**: Documentation, migration guides

## Testing Strategy

- Unit tests with mocked Redfish responses
- Integration tests with Redfish simulator
- Performance impact validation (<2% overhead target compared to baseline Kepler)

## Migration

- **Backward Compatible**: No breaking changes, opt-in feature
- **Phased Rollout**: Test on a subset of nodes before full deployment
- **Rollback**: Disable via config flag; Kepler continues with RAPL-only monitoring

## Risks and Mitigations

| Risk                 | Mitigation                              |
|----------------------|-----------------------------------------|
| BMC connectivity     | Retry logic, graceful degradation       |
| Vendor compatibility | Multi-vendor testing                    |
| Performance impact   | <2% overhead validation                 |
| Security             | Secure credential handling, TLS default |

## Future Enhancements

- Circuit breaker patterns
- Exponential backoff strategies
- External secret integration
- Chassis sub-component power zones

## Open Questions

1. Multi-chassis server handling?
2. Sub-component power exposure (PSU, fans)?

docs/developer/proposal/index.md

Lines changed: 1 addition & 0 deletions
@@ -7,6 +7,7 @@ This directory contains Enhancement Proposals (EPs) for major features and chang
| ID                                           | Title                            | Status   | Author                  | Created    |
|----------------------------------------------|----------------------------------|----------|-------------------------|------------|
| [EP-000](EP_TEMPLATE.md)                     | Enhancement Proposal Template    | Accepted | Sunil Thaha             | 2025-01-18 |
| [EP-001](EP_001-redfish-support.md)          | Redfish Power Monitoring Support | Draft    | Sunil Thaha             | 2025-08-14 |
| [EP-002](EP-002-MSR-Fallback-Power-Meter.md) | MSR Fallback for CPU Power Meter | Draft    | Kepler Development Team | 2025-08-12 |

## Proposal Status
