Commit c4eb8bb

Author: Sunil Thaha
docs: update redfish proposal to match implementation
Signed-off-by: Sunil Thaha <[email protected]>
1 parent 0b9fc01 commit c4eb8bb

1 file changed (+50, -36 lines changed)

docs/developer/proposal/EP_001-redfish-support.md

Lines changed: 50 additions & 36 deletions
@@ -4,7 +4,7 @@
44
- **Maturity**: Experimental
55
- **Author**: Sunil Thaha
66
- **Created**: 2025-08-14
7-
- **Updated**: 2025-08-28
7+
- **Updated**: 2025-09-03
88

99
## Summary
1010

@@ -124,12 +124,16 @@ Implements standard Kepler patterns:
124124

125125
### Implementation Details
126126

127-
**Simplified On-Demand Architecture with Caching:**
127+
**Hybrid PowerSubsystem with Power API Fallback Architecture:**
128128

129-
- `Power()`: synchronous method returning detailed PowerControl readings from all chassis
129+
- `Power()`: synchronous method returning power readings from all chassis with automatic API selection
130+
- **Primary**: Modern Redfish PowerSubsystem API (PowerSupplies collection)
131+
- **Fallback**: Deprecated Power API (PowerControl array) for backward compatibility
132+
- Automatic fallback when PowerSubsystem is unavailable or returns errors
130133
- Simple staleness-based caching to reduce BMC API calls
131-
- Individual PowerControl entry exposure with detailed labeling (chassis_id, power_control_id, power_control_name)
134+
- Individual power source exposure with generic labeling (chassis_id, source_id, source_name, source_type)
132135
- BMC API calls only when cached data is stale or unavailable
136+
- Source type differentiation enables clear identification of API used
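
The primary/fallback selection described in the added lines can be sketched roughly as follows. This is a minimal illustration, not the code from the Kepler repository: `Reading`, `fetchFunc`, `readWithFallback`, and `demoFallback` are hypothetical names, the BMC queries are replaced by plain functions, and treating an empty primary result as "unavailable" is an assumption.

```go
package main

import "fmt"

// Reading is an illustrative shape for one power source; the field names
// are assumptions modelled on the labels listed above.
type Reading struct {
	SourceID   string
	SourceName string
	SourceType string // "PowerSupply" or "PowerControl"
	Watts      float64
}

// fetchFunc is a hypothetical stand-in for one BMC query path.
type fetchFunc func(chassisID string) ([]Reading, error)

// readWithFallback tries the primary path first and uses the fallback
// when the primary errors or yields no readings (the latter is an assumption).
func readWithFallback(chassisID string, primary, fallback fetchFunc) ([]Reading, error) {
	readings, err := primary(chassisID)
	if err == nil && len(readings) > 0 {
		return readings, nil
	}
	fbReadings, fbErr := fallback(chassisID)
	if fbErr != nil {
		return nil, fmt.Errorf("primary: %v; fallback: %w", err, fbErr)
	}
	return fbReadings, nil
}

// demoFallback simulates a BMC without PowerSubsystem support and returns
// the source type of the first reading that comes back.
func demoFallback() string {
	primary := func(string) ([]Reading, error) {
		return nil, fmt.Errorf("PowerSubsystem not found")
	}
	fallback := func(string) ([]Reading, error) {
		return []Reading{{
			SourceID:   "0",
			SourceName: "System Power Control",
			SourceType: "PowerControl",
			Watts:      189.5,
		}}, nil
	}
	readings, err := readWithFallback("System.Embedded.1", primary, fallback)
	if err != nil || len(readings) == 0 {
		return ""
	}
	return readings[0].SourceType
}

func main() {
	fmt.Println(demoFallback())
}
```

Because each reading carries its own `SourceType`, a consumer can tell which API produced a value without knowing whether the fallback ran.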
133137

134138
**Service Lifecycle:**
135139

@@ -223,17 +227,17 @@ The Redfish service implements an **on-demand collection with caching**:
223227
- Implements simple caching with staleness-based expiration to
224228
support multiple Prometheus scrapes in a short period (High Availability)
225229
- Returns cached data if available and fresh, otherwise collects fresh data
226-
- Returns all chassis with detailed PowerControl readings in a single call
227-
- Each PowerControl entry identified by `chassis_id`, `power_control_id`, and `power_control_name` for granular metric labeling
230+
- Returns all chassis with detailed power supply readings in a single call via PowerSubsystem
231+
- Each power supply entry identified by `chassis_id`, `source_id`, and `source_name` for granular metric labeling
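
The staleness-based caching in the bullets above might look something like the sketch below. It assumes a single cached reading guarded by a mutex; `cachedReader`, its fields, the 30-second threshold, and `demoCollectCalls` are illustrative and not taken from the implementation.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// PowerReading is reduced to a timestamp and one value for this sketch.
type PowerReading struct {
	Timestamp time.Time
	Watts     float64
}

// cachedReader wraps a collect function with a staleness check.
type cachedReader struct {
	mu        sync.Mutex
	staleness time.Duration
	now       func() time.Time              // injectable clock for testing
	collect   func() (*PowerReading, error) // hypothetical BMC call
	cached    *PowerReading
}

// Power returns the cached reading while it is fresh and collects a new
// one only when the cache is empty or older than the staleness threshold.
func (c *cachedReader) Power() (*PowerReading, error) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if c.cached != nil && c.now().Sub(c.cached.Timestamp) < c.staleness {
		return c.cached, nil
	}
	r, err := c.collect()
	if err != nil {
		return nil, err
	}
	c.cached = r
	return r, nil
}

// demoCollectCalls runs three scrapes against a fake clock and reports
// how many of them reached the (simulated) BMC.
func demoCollectCalls() int {
	clock := time.Unix(0, 0)
	calls := 0
	c := &cachedReader{
		staleness: 30 * time.Second,
		now:       func() time.Time { return clock },
	}
	c.collect = func() (*PowerReading, error) {
		calls++
		return &PowerReading{Timestamp: clock, Watts: 245.0}, nil
	}
	_, _ = c.Power()                   // cold cache: collects
	clock = clock.Add(10 * time.Second) // reading is 10s old
	_, _ = c.Power()                   // still fresh: served from cache
	clock = clock.Add(30 * time.Second) // reading is 40s old
	_, _ = c.Power()                   // stale: collects again
	return calls
}

func main() {
	fmt.Println("BMC calls:", demoCollectCalls())
}
```

Injecting the clock keeps expiry tests deterministic, which fits the cache expiry testing mentioned later in the proposal.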
228232

229-
### Multiple Chassis and PowerControl Support
233+
### Multiple Chassis and PowerSupply Support
230234

231-
- `Power()` method returns `*PowerReading` (single reading containing multiple chassis with detailed PowerControl data)
232-
- `PowerReading` struct contains `[]Chassis` slice, each with `ID` and `[]Reading` for individual PowerControl entries
233-
- Iterates through all available chassis on the BMC and their PowerControl arrays
234-
- Filters and returns only PowerControl entries with valid power readings
235-
- Each reading includes `ControlID`, `Name`, and `Power` for granular power domain monitoring
236-
- Exposes individual PowerControl entries as separate metrics (e.g., Server Power Control, CPU Sub-system Power, Memory Power)
235+
- `Power()` method returns `*PowerReading` (single reading containing multiple chassis with detailed power supply data via PowerSubsystem)
236+
- `PowerReading` struct contains `[]Chassis` slice, each with `ID` and `[]Reading` for individual power supply entries
237+
- Iterates through all available chassis on the BMC and their PowerSubsystem → PowerSupplies arrays
238+
- Filters and returns only power supply entries with valid power output readings
239+
- Each reading includes `SourceID`, `SourceName`, and `Power` for granular power supply monitoring
240+
- Exposes individual power supply entries as separate metrics (e.g., Power Supply 1, Power Supply 2, Power Supply Bay 1)
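
The data shape in the added lines could be sketched as below. Only `PowerReading`, `Chassis`, `ID`, `Reading`, `SourceID`, `SourceName`, and `Power` come from the proposal text; the `Readings` field name, the `float64` watt type, the `rawSupply` input type, and the rule that "valid" means a non-nil output value are assumptions made for illustration.

```go
package main

import "fmt"

// Reading holds one power supply entry, using the field names from the proposal.
type Reading struct {
	SourceID   string
	SourceName string
	Power      float64 // watts (assumed unit and type)
}

// Chassis groups the readings collected for one chassis.
type Chassis struct {
	ID       string
	Readings []Reading // field name is an assumption
}

// PowerReading is the single value returned by Power().
type PowerReading struct {
	Chassis []Chassis
}

// rawSupply is a hypothetical stand-in for a BMC power supply record whose
// output value may be missing.
type rawSupply struct {
	ID          string
	Name        string
	OutputWatts *float64
}

// toChassis keeps only supplies that report an output value. A reported
// 0.0 W is kept, since an idle redundant supply is still a valid reading.
func toChassis(id string, supplies []rawSupply) Chassis {
	ch := Chassis{ID: id}
	for _, s := range supplies {
		if s.OutputWatts == nil {
			continue
		}
		ch.Readings = append(ch.Readings, Reading{
			SourceID:   s.ID,
			SourceName: s.Name,
			Power:      *s.OutputWatts,
		})
	}
	return ch
}

// demoValidCount builds one chassis from three raw entries, one of which
// has no output value, and returns how many readings survive filtering.
func demoValidCount() int {
	active, idle := 245.0, 0.0
	ch := toChassis("System.Embedded.1", []rawSupply{
		{ID: "PS1", Name: "Power Supply 1", OutputWatts: &active},
		{ID: "PS2", Name: "Power Supply 2", OutputWatts: &idle},
		{ID: "PS3", Name: "Power Supply Bay 1", OutputWatts: nil},
	})
	reading := PowerReading{Chassis: []Chassis{ch}}
	return len(reading.Chassis[0].Readings)
}

func main() {
	fmt.Println("valid readings:", demoValidCount())
}
```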
237241

238242
## Metrics
239243

@@ -243,17 +247,17 @@ to workloads running on a node, platform metrics represent individual power doma
243247
the underlying bare metal server (via BMC), regardless of whether Kepler runs on bare
244248
metal or within a VM.
245249

246-
**PowerControl Granularity**: Each PowerControl entry from the BMC's PowerControl array is
247-
exposed as an individual metric with detailed labels. This approach avoids making assumptions
248-
about power topology (whether PowerControl entries should be summed or represent independent
249-
power domains) and allows users to understand their specific hardware's power structure.
250+
**PowerSupply Granularity**: Each power supply from the BMC's PowerSubsystem → PowerSupplies collection is
251+
exposed as an individual metric with detailed labels. This approach provides direct visibility into
252+
individual power supply output and allows users to understand their hardware's power supply topology
253+
and redundancy configuration.
250254

251255
This separation enables:
252256

253257
- Multiple VMs on the same bare metal to report the same platform power
254-
- Clear distinction between attributed workload power and platform power domains
255-
- Granular monitoring of power subsystems (CPU, memory, storage, etc.)
256-
- Flexible aggregation based on understanding of specific hardware topology
258+
- Clear distinction between attributed workload power and platform power supplies
259+
- Granular monitoring of individual power supplies and their redundancy status
260+
- Direct visibility into power supply efficiency and utilization
257261

258262
**Important**: This implementation uses a **power-only (Watts) approach**.
259263
Energy counters (`kepler_platform_joules_total`) are not supported because:
@@ -262,10 +266,14 @@ Energy counters (`kepler_platform_joules_total`) are not supported because:
262266
- Collection frequency varies based on demand and configuration
263267

264268
```prometheus
265-
# Platform power metrics (bare metal power consumption) - individual PowerControl entries exposed
266-
kepler_platform_watts{source="redfish",node_name="worker-1",bmc_id="bmc-1",chassis_id="System.Embedded.1",power_control_id="PC1",power_control_name="Server Power Control"} 450.5
267-
kepler_platform_watts{source="redfish",node_name="worker-1",bmc_id="bmc-1",chassis_id="System.Embedded.1",power_control_id="PC2",power_control_name="CPU Sub-system Power"} 85.2
268-
kepler_platform_watts{source="redfish",node_name="worker-1",bmc_id="bmc-1",chassis_id="Enclosure.Internal.0-1",power_control_id="PC1",power_control_name="Enclosure Power Control"} 125.3
269+
# Platform power metrics (bare metal power consumption) - hybrid API approach with source_type differentiation
270+
271+
# Modern PowerSubsystem API (PowerSupplies)
272+
kepler_platform_watts{source="redfish",node_name="worker-1",bmc_id="bmc-1",chassis_id="System.Embedded.1",source_id="PS1",source_name="Power Supply 1",source_type="PowerSupply"} 245.0
273+
kepler_platform_watts{source="redfish",node_name="worker-1",bmc_id="bmc-1",chassis_id="System.Embedded.1",source_id="PS2",source_name="Power Supply 2",source_type="PowerSupply"} 0.0
274+
275+
# Fallback Power API (PowerControl) when PowerSubsystem unavailable
276+
kepler_platform_watts{source="redfish",node_name="worker-1",bmc_id="bmc-1",chassis_id="System.Embedded.1",source_id="0",source_name="System Power Control",source_type="PowerControl"} 189.5
269277
270278
# Existing node metrics unchanged (workload attribution)
271279
kepler_node_cpu_watts{zone="package",node_name="worker-1"} 125.2
@@ -289,12 +297,14 @@ kepler_node_cpu_watts{zone="package",node_name="worker-1"} 125.2
289297

290298
**✅ Implemented and Available (Experimental):**
291299

292-
1. **Core**: Full Gofish integration with simplified on-demand power collection and service interfaces
293-
2. **Metrics**: Platform collector integrated with Prometheus exporter
294-
3. **Configuration**: CLI flags and YAML configuration with automatic node ID resolution
295-
4. **Testing**: Unit tests with mock server covering multiple vendor scenarios
296-
5. **Caching**: Staleness-based caching to reduce BMC API calls
297-
6. **Multiple Chassis and PowerControl**: Support for collecting detailed power data from all chassis and individual PowerControl entries
300+
1. **Core**: Full Gofish integration with hybrid PowerSubsystem/Power API collection and service interfaces
301+
2. **API Hybrid Approach**: PowerSubsystem API (modern) with automatic fallback to Power API (deprecated) for backward compatibility
302+
3. **Metrics**: Platform collector with generic source_id/source_name/source_type labels for power data
303+
4. **Configuration**: CLI flags and YAML configuration with automatic node ID resolution
304+
5. **Testing**: Unit tests with mock server including PowerSubsystem, Power API, and fallback scenarios
305+
6. **Caching**: Staleness-based caching to reduce BMC API calls
306+
7. **Multiple Chassis and Sources**: Support for collecting detailed power data from all chassis via both APIs with source differentiation
307+
8. **Fallback Logic**: Automatic detection and fallback when PowerSubsystem is unavailable
298308

299309
**Current State:**
300310

@@ -314,13 +324,16 @@ kepler_node_cpu_watts{zone="package",node_name="worker-1"} 125.2
314324

315325
**Implemented Testing:**
316326

317-
- **Unit tests**: Full test coverage with mocked Redfish responses
318-
- **Mock server**: HTTP server simulating BMC Redfish API endpoints for different vendors
319-
- **Multi-vendor scenarios**: Dell, HPE, Lenovo, and Generic response variations
320-
- **Error conditions**: Connection failures, authentication errors, timeouts, missing chassis
327+
- **Unit tests**: Full test coverage with mocked PowerSubsystem and Power API responses
328+
- **Mock server**: HTTP server simulating BMC Redfish PowerSubsystem and Power API endpoints for different vendors
329+
- **PowerSupply fixtures**: Dell, HPE, Lenovo PowerSupply collection response variations
330+
- **PowerControl fixtures**: PowerControl array response variations for fallback testing
331+
- **Fallback scenarios**: Comprehensive testing of PowerSubsystem → Power API fallback logic
332+
- **Error conditions**: Connection failures, authentication errors, timeouts, missing chassis/power supplies/power subsystems
333+
- **Source type validation**: Testing proper source_type assignment for PowerSupply vs PowerControl
321334
- **Concurrency testing**: Race detection and thread safety validation
322335
- **Caching behavior**: Staleness-based caching and cache expiry testing
323-
- **Service lifecycle**: Complete Init, ChassisPower, and Shutdown testing
336+
- **Service lifecycle**: Complete Init, Power (hybrid approach), and Shutdown testing
324337

325338
**Testing Infrastructure:**
326339

@@ -375,7 +388,8 @@ kepler_node_cpu_watts{zone="package",node_name="worker-1"} 125.2
375388
- Power-only metrics (no energy counters due to intermittent BMC polling)
376389
- Basic staleness-based caching (more advanced cache management could be added)
377390
- BMC calls during Prometheus scrape when cache is stale (mitigated by built-in caching)
378-
- Tested with mock servers (Dell, HPE, Lenovo, Generic scenarios)
391+
- Optimal for modern Redfish implementations with PowerSubsystem support (gracefully falls back to deprecated Power API)
392+
- Tested with mock servers simulating both PowerSubsystem and Power API scenarios
379393

380394
## Future Enhancements
381395
