Commit ae0016f

Sunil Thaha authored
Merge pull request #2311 from sthaha/feat-redfish-better-api
refactor(redfish): enhance power reading with PowerSubsystem API support
2 parents 231a01b + f890d52 commit ae0016f

45 files changed, +2144 -3029 lines changed

docs/developer/proposal/EP_001-redfish-support.md

Lines changed: 50 additions & 36 deletions
@@ -4,7 +4,7 @@
 - **Maturity**: Experimental
 - **Author**: Sunil Thaha
 - **Created**: 2025-08-14
-- **Updated**: 2025-08-28
+- **Updated**: 2025-09-03

 ## Summary

@@ -124,12 +124,16 @@ Implements standard Kepler patterns:

 ### Implementation Details

-**Simplified On-Demand Architecture with Caching:**
+**Hybrid PowerSubsystem with Power API Fallback Architecture:**

-- `Power()`: synchronous method returning detailed PowerControl readings from all chassis
+- `Power()`: synchronous method returning power readings from all chassis with automatic API selection
+  - **Primary**: Modern Redfish PowerSubsystem API (PowerSupplies collection)
+  - **Fallback**: Deprecated Power API (PowerControl array) for backward compatibility
+  - Automatic fallback when PowerSubsystem is unavailable or returns errors
 - Simple staleness-based caching to reduce BMC API calls
-- Individual PowerControl entry exposure with detailed labeling (chassis_id, power_control_id, power_control_name)
+- Individual power source exposure with generic labeling (chassis_id, source_id, source_name, source_type)
 - BMC API calls only when cached data is stale or unavailable
+- Source type differentiation enables clear identification of API used

 **Service Lifecycle:**

@@ -223,17 +227,17 @@ The Redfish service implements a **on-demand collection with caching**:
 - Implements simple caching with staleness-based expiration to
   support multiple Prometheus scrapes in a short period (High Availability)
 - Returns cached data if available and fresh, otherwise collects fresh data
-- Returns all chassis with detailed PowerControl readings in a single call
-- Each PowerControl entry identified by `chassis_id`, `power_control_id`, and `power_control_name` for granular metric labeling
+- Returns all chassis with detailed power supply readings in a single call via PowerSubsystem
+- Each power supply entry identified by `chassis_id`, `source_id`, and `source_name` for granular metric labeling

-### Multiple Chassis and PowerControl Support
+### Multiple Chassis and PowerSupply Support

-- `Power()` method returns `*PowerReading` (single reading containing multiple chassis with detailed PowerControl data)
-- `PowerReading` struct contains `[]Chassis` slice, each with `ID` and `[]Reading` for individual PowerControl entries
-- Iterates through all available chassis on the BMC and their PowerControl arrays
-- Filters and returns only PowerControl entries with valid power readings
-- Each reading includes `ControlID`, `Name`, and `Power` for granular power domain monitoring
-- Exposes individual PowerControl entries as separate metrics (e.g., Server Power Control, CPU Sub-system Power, Memory Power)
+- `Power()` method returns `*PowerReading` (single reading containing multiple chassis with detailed power supply data via PowerSubsystem)
+- `PowerReading` struct contains `[]Chassis` slice, each with `ID` and `[]Reading` for individual power supply entries
+- Iterates through all available chassis on the BMC and their PowerSubsystem → PowerSupplies arrays
+- Filters and returns only power supply entries with valid power output readings
+- Each reading includes `SourceID`, `SourceName`, and `Power` for granular power supply monitoring
+- Exposes individual power supply entries as separate metrics (e.g., Power Supply 1, Power Supply 2, Power Supply Bay 1)

 ## Metrics

@@ -243,17 +247,17 @@ to workloads running on a node, platform metrics represent individual power doma
 the underlying bare metal server (via BMC), regardless of whether Kepler runs on bare
 metal or within a VM.

-**PowerControl Granularity**: Each PowerControl entry from the BMC's PowerControl array is
-exposed as an individual metric with detailed labels. This approach avoids making assumptions
-about power topology (whether PowerControl entries should be summed or represent independent
-power domains) and allows users to understand their specific hardware's power structure.
+**PowerSupply Granularity**: Each power supply from the BMC's PowerSubsystem → PowerSupplies collection is
+exposed as an individual metric with detailed labels. This approach provides direct visibility into
+individual power supply output and allows users to understand their hardware's power supply topology
+and redundancy configuration.

 This separation enables:

 - Multiple VMs on the same bare metal to report the same platform power
-- Clear distinction between attributed workload power and platform power domains
-- Granular monitoring of power subsystems (CPU, memory, storage, etc.)
-- Flexible aggregation based on understanding of specific hardware topology
+- Clear distinction between attributed workload power and platform power supplies
+- Granular monitoring of individual power supplies and their redundancy status
+- Direct visibility into power supply efficiency and utilization

 **Important**: This implementation uses a **power-only (Watts) approach**.
 Energy counters (`kepler_platform_joules_total`) are not supported because:
@@ -262,10 +266,14 @@ Energy counters (`kepler_platform_joules_total`) are not supported because:
 - Collection frequency varies based on demand and configuration

 ```prometheus
-# Platform power metrics (bare metal power consumption) - individual PowerControl entries exposed
-kepler_platform_watts{source="redfish",node_name="worker-1",bmc_id="bmc-1",chassis_id="System.Embedded.1",power_control_id="PC1",power_control_name="Server Power Control"} 450.5
-kepler_platform_watts{source="redfish",node_name="worker-1",bmc_id="bmc-1",chassis_id="System.Embedded.1",power_control_id="PC2",power_control_name="CPU Sub-system Power"} 85.2
-kepler_platform_watts{source="redfish",node_name="worker-1",bmc_id="bmc-1",chassis_id="Enclosure.Internal.0-1",power_control_id="PC1",power_control_name="Enclosure Power Control"} 125.3
+# Platform power metrics (bare metal power consumption) - hybrid API approach with source_type differentiation
+
+# Modern PowerSubsystem API (PowerSupplies)
+kepler_platform_watts{source="redfish",node_name="worker-1",bmc_id="bmc-1",chassis_id="System.Embedded.1",source_id="PS1",source_name="Power Supply 1",source_type="PowerSupply"} 245.0
+kepler_platform_watts{source="redfish",node_name="worker-1",bmc_id="bmc-1",chassis_id="System.Embedded.1",source_id="PS2",source_name="Power Supply 2",source_type="PowerSupply"} 0.0
+
+# Fallback Power API (PowerControl) when PowerSubsystem unavailable
+kepler_platform_watts{source="redfish",node_name="worker-1",bmc_id="bmc-1",chassis_id="System.Embedded.1",source_id="0",source_name="System Power Control",source_type="PowerControl"} 189.5

 # Existing node metrics unchanged (workload attribution)
 kepler_node_cpu_watts{zone="package",node_name="worker-1"} 125.2
@@ -289,12 +297,14 @@ kepler_node_cpu_watts{zone="package",node_name="worker-1"} 125.2

 **✅ Implemented and Available (Experimental):**

-1. **Core**: Full Gofish integration with simplified on-demand power collection and service interfaces
-2. **Metrics**: Platform collector integrated with Prometheus exporter
-3. **Configuration**: CLI flags and YAML configuration with automatic node ID resolution
-4. **Testing**: Unit tests with mock server covering multiple vendor scenarios
-5. **Caching**: Staleness-based caching to reduce BMC API calls
-6. **Multiple Chassis and PowerControl**: Support for collecting detailed power data from all chassis and individual PowerControl entries
+1. **Core**: Full Gofish integration with hybrid PowerSubsystem/Power API collection and service interfaces
+2. **API Hybrid Approach**: PowerSubsystem API (modern) with automatic fallback to Power API (deprecated) for backward compatibility
+3. **Metrics**: Platform collector with generic source_id/source_name/source_type labels for power data
+4. **Configuration**: CLI flags and YAML configuration with automatic node ID resolution
+5. **Testing**: Unit tests with mock server including PowerSubsystem, Power API, and fallback scenarios
+6. **Caching**: Staleness-based caching to reduce BMC API calls
+7. **Multiple Chassis and Sources**: Support for collecting detailed power data from all chassis via both APIs with source differentiation
+8. **Fallback Logic**: Automatic detection and fallback when PowerSubsystem is unavailable

 **Current State:**

@@ -314,13 +324,16 @@ kepler_node_cpu_watts{zone="package",node_name="worker-1"} 125.2

 **Implemented Testing:**

-- **Unit tests**: Full test coverage with mocked Redfish responses
-- **Mock server**: HTTP server simulating BMC Redfish API endpoints for different vendors
-- **Multi-vendor scenarios**: Dell, HPE, Lenovo, and Generic response variations
-- **Error conditions**: Connection failures, authentication errors, timeouts, missing chassis
+- **Unit tests**: Full test coverage with mocked PowerSubsystem and Power API responses
+- **Mock server**: HTTP server simulating BMC Redfish PowerSubsystem and Power API endpoints for different vendors
+- **PowerSupply fixtures**: Dell, HPE, Lenovo PowerSupply collection response variations
+- **PowerControl fixtures**: PowerControl array response variations for fallback testing
+- **Fallback scenarios**: Comprehensive testing of PowerSubsystem → Power API fallback logic
+- **Error conditions**: Connection failures, authentication errors, timeouts, missing chassis/power supplies/power subsystems
+- **Source type validation**: Testing proper source_type assignment for PowerSupply vs PowerControl
 - **Concurrency testing**: Race detection and thread safety validation
 - **Caching behavior**: Staleness-based caching and cache expiry testing
-- **Service lifecycle**: Complete Init, ChassisPower, and Shutdown testing
+- **Service lifecycle**: Complete Init, Power (hybrid approach), and Shutdown testing

 **Testing Infrastructure:**

@@ -375,7 +388,8 @@ kepler_node_cpu_watts{zone="package",node_name="worker-1"} 125.2
 - Power-only metrics (no energy counters due to intermittent BMC polling)
 - Basic staleness-based caching (more advanced cache management could be added)
 - BMC calls during Prometheus scrape when cache is stale (mitigated by built-in caching)
-- Tested with mock servers (Dell, HPE, Lenovo, Generic scenarios)
+- Optimal for modern Redfish implementations with PowerSubsystem support (gracefully falls back to deprecated Power API)
+- Tested with mock servers simulating both PowerSubsystem and Power API scenarios

 ## Future Enhancements

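Review note: the PowerSubsystem-first collection with Power API fallback that this proposal describes can be sketched as below. This is an illustration under stated assumptions, not the commit's implementation and not the Gofish API. `chassisSource`, `readChassis`, `tag`, and `errNoReadings` are hypothetical names, `Watts` stands in for the real power type, and `PowerSupplySource` is assumed by analogy with the `redfish.PowerControlSource` constant visible elsewhere in this commit. The sketch also assumes an empty PowerSupplies result counts as "unavailable".

```go
package main

import (
	"errors"
	"fmt"
)

// SourceType records which Redfish API produced a reading.
type SourceType string

const (
	PowerSupplySource  SourceType = "PowerSupply"  // assumed constant name
	PowerControlSource SourceType = "PowerControl" // name appears in the diff
)

// Reading is a simplified stand-in for the service's reading type.
type Reading struct {
	SourceID   string
	SourceName string
	SourceType SourceType
	Watts      float64
}

// chassisSource (hypothetical) abstracts the two ways to query one chassis.
type chassisSource struct {
	PowerSupplies func() ([]Reading, error) // modern PowerSubsystem API
	PowerControls func() ([]Reading, error) // deprecated Power API
}

var errNoReadings = errors.New("no power readings available")

// readChassis tries PowerSubsystem first and falls back to the Power API
// when the primary call fails or yields nothing (the latter is an assumption).
func readChassis(c chassisSource) ([]Reading, error) {
	if rs, err := c.PowerSupplies(); err == nil && len(rs) > 0 {
		return tag(rs, PowerSupplySource), nil
	}
	rs, err := c.PowerControls()
	if err != nil {
		return nil, err
	}
	if len(rs) == 0 {
		return nil, errNoReadings
	}
	return tag(rs, PowerControlSource), nil
}

// tag returns a copy of rs with SourceType set on every reading.
func tag(rs []Reading, t SourceType) []Reading {
	out := make([]Reading, len(rs))
	for i, r := range rs {
		r.SourceType = t
		out[i] = r
	}
	return out
}

func main() {
	// A BMC that only implements the deprecated Power API.
	legacyBMC := chassisSource{
		PowerSupplies: func() ([]Reading, error) { return nil, errNoReadings },
		PowerControls: func() ([]Reading, error) {
			return []Reading{{SourceID: "0", SourceName: "System Power Control", Watts: 189.5}}, nil
		},
	}
	rs, err := readChassis(legacyBMC)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	for _, r := range rs {
		fmt.Printf("%s %s %s %.1f\n", r.SourceID, r.SourceName, r.SourceType, r.Watts)
	}
}
```

Tagging the readings at the point of selection is what lets the `source_type` label tell consumers which API a value came from.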
docs/user/metrics.md

Lines changed: 4 additions & 3 deletions
@@ -263,14 +263,15 @@ These experimental metrics provide platform-level power information from BMC sou
 #### kepler_platform_watts

 - **Type**: GAUGE
-- **Description**: Current platform power consumption in watts from BMC PowerControl entries
+- **Description**: Current platform power in watts from BMC (PowerSubsystem or deprecated Power API)
 - **Labels**:
   - `source`
   - `node_name`
   - `bmc_id`
   - `chassis_id`
-  - `power_control_id`
-  - `power_control_name`
+  - `source_id`
+  - `source_name`
+  - `source_type`

 ---

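Review note: to make the new label set concrete, the sketch below renders one sample line in the text format used by the examples in the proposal. It deliberately avoids the Prometheus client library and is not how the exporter builds metrics. `platformSample` and its fields are hypothetical names chosen to mirror the seven documented labels.

```go
package main

import "fmt"

// platformSample (hypothetical) carries the seven labels documented for
// kepler_platform_watts plus the gauge value.
type platformSample struct {
	Source     string
	NodeName   string
	BMCID      string
	ChassisID  string
	SourceID   string
	SourceName string
	SourceType string
	Watts      float64
}

// String renders the sample as a single exposition-style line.
// %q is adequate for simple label values like these; it is not a full
// implementation of Prometheus label escaping.
func (s platformSample) String() string {
	return fmt.Sprintf(
		"kepler_platform_watts{source=%q,node_name=%q,bmc_id=%q,chassis_id=%q,source_id=%q,source_name=%q,source_type=%q} %g",
		s.Source, s.NodeName, s.BMCID, s.ChassisID,
		s.SourceID, s.SourceName, s.SourceType, s.Watts,
	)
}

func main() {
	s := platformSample{
		Source:     "redfish",
		NodeName:   "worker-1",
		BMCID:      "bmc-1",
		ChassisID:  "System.Embedded.1",
		SourceID:   "PS1",
		SourceName: "Power Supply 1",
		SourceType: "PowerSupply",
		Watts:      245.0,
	}
	fmt.Println(s)
}
```

Dashboards or alerts that still select on `power_control_id` or `power_control_name` would need to move to `source_id` and `source_name`, since this commit removes the old labels.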
go.mod

Lines changed: 1 addition & 1 deletion
@@ -14,7 +14,7 @@ require (
 	github.com/prometheus/client_model v0.6.1
 	github.com/prometheus/exporter-toolkit v0.14.0
 	github.com/prometheus/procfs v0.15.1
-	github.com/stmcginnis/gofish v0.15.0
+	github.com/stmcginnis/gofish v0.20.0
 	github.com/stretchr/testify v1.10.0
 	go.uber.org/zap v1.26.0
 	golang.org/x/sync v0.12.0

go.sum

Lines changed: 2 additions & 2 deletions
@@ -135,8 +135,8 @@ github.com/rogpeppe/go-internal v1.12.0 h1:exVL4IDcn6na9z1rAb56Vxr+CgyK3nn3O+epU
 github.com/rogpeppe/go-internal v1.12.0/go.mod h1:E+RYuTGaKKdloAfM02xzb0FW3Paa99yedzYV+kq4uf4=
 github.com/spf13/pflag v1.0.5 h1:iy+VFUOCP1a+8yFto/drg2CJ5u0yRoB7fZw3DKv/JXA=
 github.com/spf13/pflag v1.0.5/go.mod h1:McXfInJRrz4CZXVZOBLb0bTZqETkiAhM9Iw0y3An2Bg=
-github.com/stmcginnis/gofish v0.15.0 h1:8TG41+lvJk/0Nf8CIIYErxbMlQUy80W0JFRZP3Ld82A=
-github.com/stmcginnis/gofish v0.15.0/go.mod h1:BLDSFTp8pDlf/xDbLZa+F7f7eW0E/CHCboggsu8CznI=
+github.com/stmcginnis/gofish v0.20.0 h1:hH2V2Qe898F2wWT1loApnkDUrXXiLKqbSlMaH3Y1n08=
+github.com/stmcginnis/gofish v0.20.0/go.mod h1:PzF5i8ecRG9A2ol8XT64npKUunyraJ+7t0kYMpQAtqU=
 github.com/stretchr/objx v0.1.0/go.mod h1:HFkY916IF+rwdDfMAkV7OtwuqBVzrE8GR6GFx+wExME=
 github.com/stretchr/objx v0.4.0/go.mod h1:YvHI0jy2hoMjB+UWwv71VJQ9isScKT/TqJzVSSt89Yw=
 github.com/stretchr/objx v0.5.0/go.mod h1:Yh+to48EsGEfYuaHDzXPcE3xhTkx73EhmCGUpEOglKo=

hack/gen-metric-docs/main.go

Lines changed: 12 additions & 9 deletions
@@ -66,24 +66,27 @@ func (m *MockRedfishService) Power() (*redfish.PowerReading, error) {
 				ID: "System.Embedded.1",
 				Readings: []redfish.Reading{
 					{
-						ControlID: "PC1",
-						Name:      "System Power Control",
-						Power:     245.0 * device.Watt, // Dell 245W scenario
+						SourceID:   "PC1",
+						SourceName: "System Power Control",
+						SourceType: redfish.PowerControlSource,
+						Power:      245.0 * device.Watt, // Dell 245W scenario
 					},
 				},
 			},
 			{
 				ID: "Enclosure.Internal.0-1",
 				Readings: []redfish.Reading{
 					{
-						ControlID: "PC1",
-						Name:      "Enclosure Power Control",
-						Power:     189.5 * device.Watt, // HPE 189W scenario
+						SourceID:   "PC1",
+						SourceName: "Enclosure Power Control",
+						SourceType: redfish.PowerControlSource,
+						Power:      189.5 * device.Watt, // HPE 189W scenario
 					},
 					{
-						ControlID: "PC2",
-						Name:      "CPU Sub-system Power",
-						Power:     167.8 * device.Watt, // Lenovo 167W scenario
+						SourceID:   "PC2",
+						SourceName: "CPU Sub-system Power",
+						SourceType: redfish.PowerControlSource,
+						Power:      167.8 * device.Watt, // Lenovo 167W scenario
 					},
 				},
 			},

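Review note: a consumer of the mock above would typically walk every chassis and every reading to produce one labeled row per power source. The sketch below shows that traversal with local stand-in types. The field name `Chassis` on `PowerReading`, the `float64` `Watts` field (the real code uses `device.Power`), and the helper `flatten` are assumptions for illustration only.

```go
package main

import "fmt"

// Local stand-ins that mirror the field names visible in the diff.
type Reading struct {
	SourceID   string
	SourceName string
	SourceType string
	Watts      float64 // the real type is device.Power
}

type Chassis struct {
	ID       string
	Readings []Reading
}

type PowerReading struct {
	Chassis []Chassis // field name assumed
}

// labeledReading pairs a reading with the chassis it belongs to.
type labeledReading struct {
	ChassisID string
	Reading
}

// flatten (hypothetical) returns one row per power source across all
// chassis, preserving chassis order and reading order.
func flatten(pr *PowerReading) []labeledReading {
	if pr == nil {
		return nil
	}
	var out []labeledReading
	for _, ch := range pr.Chassis {
		for _, r := range ch.Readings {
			out = append(out, labeledReading{ChassisID: ch.ID, Reading: r})
		}
	}
	return out
}

func main() {
	pr := &PowerReading{Chassis: []Chassis{
		{ID: "System.Embedded.1", Readings: []Reading{
			{SourceID: "PC1", SourceName: "System Power Control", SourceType: "PowerControl", Watts: 245.0},
		}},
		{ID: "Enclosure.Internal.0-1", Readings: []Reading{
			{SourceID: "PC1", SourceName: "Enclosure Power Control", SourceType: "PowerControl", Watts: 189.5},
			{SourceID: "PC2", SourceName: "CPU Sub-system Power", SourceType: "PowerControl", Watts: 167.8},
		}},
	}}
	for _, row := range flatten(pr) {
		fmt.Printf("%s/%s (%s): %.1f W\n", row.ChassisID, row.SourceID, row.SourceType, row.Watts)
	}
}
```

Note that `SourceID` alone is not unique here: both chassis carry a `PC1` entry, which is why `chassis_id` is part of the metric's label set.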