
Commit 9eaf6a3

Author: Sunil Thaha
docs(kepler-operator): restructure OpenShift docs into focused guides
Break down OpenShift installation document into smaller guides.

- Overview page with prerequisites and navigation
- Quick start guide for OperatorHub installation
- CLI installation guide with automation scripts
- Configuration guide with comprehensive options
- Monitoring & troubleshooting guide
- Migration guide for Kepler CRD to PowerMonitor

Updated navigation structure in mkdocs.yml and moved images to appropriate directory structure.

Signed-off-by: Sunil Thaha <[email protected]>
1 parent 13627ee commit 9eaf6a3

18 files changed, +1773 -449 lines changed
Lines changed: 340 additions & 0 deletions
@@ -0,0 +1,340 @@
# PowerMonitor Configuration Guide

This guide covers comprehensive configuration options for the PowerMonitor Custom
Resource Definition (CRD). Use these settings to customize your Kepler deployment
for different environments, security requirements, and monitoring needs.

## Basic Configuration

### Metric Levels

Control which types of metrics Kepler exports. Choose based on your monitoring
requirements and performance considerations:

```yaml
spec:
  kepler:
    config:
      metricLevels:
        - node       # Node-level power consumption (recommended)
        - pod        # Pod-level power consumption (recommended)
        - vm         # Virtual machine power consumption
        - process    # Process-level power consumption (high overhead)
        - container  # Container-level power consumption (high overhead)
```

**Recommendations:**

- **Production**: Use `node` and `pod` for balanced monitoring
- **Development**: Add `container` for detailed analysis
- **Minimal overhead**: Use `node` only
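
As a minimal sketch of the lowest-overhead recommendation, the fragment below lists only the `node` level. It reuses the `metricLevels` field shown in this section and assumes that levels left out of the list are simply not exported.

```yaml
spec:
  kepler:
    config:
      metricLevels:
        - node  # Only node-level metrics; assumes omitted levels are not exported
```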

### Timing Configuration

Configure how frequently Kepler samples and reports metrics:

```yaml
spec:
  kepler:
    config:
      sampleRate: 5s    # Interval between samples
      staleness: 500ms  # How long before values are considered stale
```

**Performance tuning:**

`sampleRate` is an interval, so a larger value means less frequent sampling.

- **Lower CPU usage**: Raise `sampleRate` to `10s` or `15s`
- **Higher precision**: Lower `sampleRate` to `3s` (minimum recommended)
- **Network optimization**: Adjust `staleness` based on network latency
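
To illustrate the lower-CPU option, here is a sketch that samples every 15 seconds. The `staleness` value is carried over unchanged from the example in this section; whether it should scale with the interval is not stated in this guide, so treat it as a starting point rather than a tuned setting.

```yaml
spec:
  kepler:
    config:
      sampleRate: 15s   # Longer interval: fewer samples, lower CPU usage
      staleness: 500ms  # Unchanged from the example above (assumption: no adjustment needed)
```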

### Logging Configuration

Set logging levels for troubleshooting and monitoring:

```yaml
spec:
  kepler:
    config:
      logLevel: info  # Options: debug, info, warn, error
```

**Log levels:**

- **`debug`**: Verbose logging for troubleshooting (not for production)
- **`info`**: Standard operational logging (recommended)
- **`warn`**: Only warnings and errors
- **`error`**: Only error messages

## Resource Management

### Terminated Workload Tracking

Control how Kepler handles metrics for terminated pods and containers:

```yaml
spec:
  kepler:
    config:
      maxTerminated: 500  # Track top 500 terminated workloads by power consumption
```

**Options:**

- **`500`** (default): Track top 500 terminated workloads
- **`0`**: Disable terminated workload tracking (saves memory)
- **`-1`**: Track unlimited terminated workloads (use with caution)
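
For memory-constrained clusters, the `0` option from the list above might look like this. The fragment is a sketch that uses only the `maxTerminated` field already shown.

```yaml
spec:
  kepler:
    config:
      maxTerminated: 0  # Disable terminated workload tracking to save memory
```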

### Additional ConfigMaps

Use external ConfigMaps for advanced configuration:

```yaml
spec:
  kepler:
    config:
      additionalConfigMaps:
        - name: custom-model-config
        - name: advanced-power-settings
```

This allows you to:

- Override power models
- Customize CPU frequency mappings
- Configure platform-specific settings

## Security Configuration

### RBAC Mode

Configure security and service account permissions:

```yaml
spec:
  kepler:
    deployment:
      security:
        mode: rbac  # Options: none, rbac
        allowedSANames:
          - "monitoring-sa"
          - "custom-service-account"
```

**Security modes:**

- **`none`**: No additional security restrictions (default for development)
- **`rbac`**: Enable Role-Based Access Control (recommended for production)

### Service Account Configuration

When using RBAC mode, specify allowed service accounts:

```yaml
spec:
  kepler:
    deployment:
      security:
        mode: rbac
        allowedSANames:
          - "prometheus-service-account"
          - "grafana-service-account"
```

## Node Selection and Scheduling

### Node Selectors

Control which nodes run Kepler pods:

```yaml
spec:
  kepler:
    deployment:
      nodeSelector:
        kubernetes.io/os: linux
        node-type: worker
        monitoring: enabled
```

Common selectors:

- **OS selection**: `kubernetes.io/os: linux`
- **Node roles**: `node-role.kubernetes.io/worker: ""`
- **Custom labels**: `monitoring: enabled`
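
The selectors above can be combined. A node must carry every listed label to match, so the sketch below restricts Kepler to Linux nodes that also have the worker role label. Both labels come from the list above; whether they exist on your nodes depends on the cluster.

```yaml
spec:
  kepler:
    deployment:
      nodeSelector:
        kubernetes.io/os: linux             # OS selection
        node-role.kubernetes.io/worker: ""  # Node role; the label has an empty value
```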

### Tolerations

Allow Kepler to run on nodes with specific taints:

```yaml
spec:
  kepler:
    deployment:
      tolerations:
        - key: node-role.kubernetes.io/master
          operator: Exists
          effect: NoSchedule
        - key: dedicated
          operator: Equal
          value: monitoring
          effect: NoSchedule
```

**Common tolerations:**

- **Master nodes**: Allow monitoring on control plane nodes
- **Dedicated nodes**: Run on nodes dedicated to monitoring
- **Special hardware**: Nodes with specific hardware requirements
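
The special-hardware case is not covered by the example above, so here is a sketch of what it could look like. The taint key `example.com/special-hardware` is hypothetical; substitute the key actually applied to your nodes.

```yaml
spec:
  kepler:
    deployment:
      tolerations:
        - key: example.com/special-hardware  # Hypothetical taint key, not from this guide
          operator: Exists                   # Tolerate the taint regardless of its value
          effect: NoSchedule
```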

## Complete Configuration Examples

### Development Environment

```yaml
apiVersion: kepler.system.sustainable.computing.io/v1alpha1
kind: PowerMonitor
metadata:
  name: power-monitor-dev
spec:
  kepler:
    config:
      logLevel: debug
      metricLevels:
        - node
        - pod
        - container
      sampleRate: 3s
      staleness: 200ms
      maxTerminated: 100
    deployment:
      security:
        mode: none
```

### Production Environment

```yaml
apiVersion: kepler.system.sustainable.computing.io/v1alpha1
kind: PowerMonitor
metadata:
  name: power-monitor-prod
spec:
  kepler:
    config:
      logLevel: warn
      metricLevels:
        - node
        - pod
      sampleRate: 10s
      staleness: 1s
      maxTerminated: 1000
    deployment:
      security:
        mode: rbac
        allowedSANames:
          - "prometheus-operator-prometheus"
      nodeSelector:
        kubernetes.io/os: linux
        node-role.kubernetes.io/worker: ""
      tolerations:
        - key: dedicated
          operator: Equal
          value: monitoring
          effect: NoSchedule
```

### High-Performance Monitoring

```yaml
apiVersion: kepler.system.sustainable.computing.io/v1alpha1
kind: PowerMonitor
metadata:
  name: power-monitor-hpc
spec:
  kepler:
    config:
      logLevel: info
      metricLevels:
        - node
        - pod
        - vm
        - process
      sampleRate: 1s     # Below the 3s minimum recommended earlier; expect higher overhead
      staleness: 100ms
      maxTerminated: -1  # Track all terminated workloads
      additionalConfigMaps:
        - name: hpc-power-models
    deployment:
      security:
        mode: rbac
      nodeSelector:
        node-type: compute-intensive
      tolerations:
        - key: high-performance
          operator: Exists
          effect: NoSchedule
```

## Configuration Validation

### Check Configuration

Verify your PowerMonitor configuration. The commands in this section assume the
resource is named `power-monitor`; substitute your own resource name if it differs.

```bash
# Check PowerMonitor status
oc get powermonitor power-monitor -o yaml

# Verify conditions
oc describe powermonitor power-monitor

# Check generated resources
oc get all -n power-monitor
```

### Common Configuration Issues

**Issue**: Pods not scheduling on intended nodes

```bash
# Check node labels
oc get nodes --show-labels

# Update nodeSelector
oc patch powermonitor power-monitor --type='merge' -p='
{
  "spec": {
    "kepler": {
      "deployment": {
        "nodeSelector": {
          "kubernetes.io/os": "linux"
        }
      }
    }
  }
}'
```

**Issue**: High resource usage

```bash
# Reduce metric levels and lengthen the sampling interval
oc patch powermonitor power-monitor --type='merge' -p='
{
  "spec": {
    "kepler": {
      "config": {
        "metricLevels": ["node", "pod"],
        "sampleRate": "10s"
      }
    }
  }
}'
```

## Advanced Topics

For more advanced configuration scenarios, see:

- **[Monitoring & Troubleshooting](monitoring-troubleshooting.md)** - Advanced monitoring setup
- **[Migration Guide](../migration/index.md)** - Migrating from legacy configurations
