Commit e351e9c

Updates based on first PRR review
Signed-off-by: James Sturtevant <[email protected]>
1 parent 5e96687 commit e351e9c

File tree

1 file changed

+48
-17
lines changed
  • keps/sig-windows/4885-windows-cpu-and-memory-affinity

keps/sig-windows/4885-windows-cpu-and-memory-affinity/README.md

Lines changed: 48 additions & 17 deletions
@@ -68,10 +68,16 @@ Items marked with (R) are required *prior to targeting to a milestone / release*
6868

6969
This kep outlines how to add support for the CPU, Memory and Topology Managers in kubelet for Windows.
7070
The Managers are already available and supported in kubelet on Linux and there have been requests to sig-windows
71-
to add support on Windows to help with workloads that require co-located workloads. The goal of the kep is to
71+
to add support on Windows to help with workloads that require co-located workloads. The goal of the KEP is to
7272
add Windows support without significant changes to the Managers logic while providing the same feature sets available
7373
on Linux today.
7474

75+
The existing KEPs are:
76+
77+
https://github.com/kubernetes/enhancements/tree/master/keps/sig-node/3570-cpumanager
78+
https://github.com/kubernetes/enhancements/tree/master/keps/sig-node/1769-memory-manager
79+
https://github.com/kubernetes/enhancements/tree/master/keps/sig-node/693-topology-manager
80+
7581
## Motivation
7682

7783
Currently enabling low latency workloads co-hosted on the same nodes in Windows Server create noisy neighbor behaviors
@@ -165,7 +171,7 @@ One difference between the Windows API and Linux is the concept of [Processor gr
165171
On Windows systems with more than 64 cores the CPUs will be split into groups,
166172
each processor is identified by its group number and its group-relative processor number.
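
As a rough illustration of the group-relative addressing described above, the Go sketch below folds a set of processors, each named by group and group-relative number, into one 64-bit mask per group. The type and function names (`LogicalProcessor`, `GroupAffinity`, `groupAffinities`) are hypothetical and are not taken from the KEP or from the kubelet; the only assumption about Windows is that a group-relative processor number is in the range 0 to 63.

```go
package main

import (
	"fmt"
	"sort"
)

// LogicalProcessor names a CPU by processor group and group-relative number.
// Hypothetical type, for illustration only.
type LogicalProcessor struct {
	Group  uint32
	Number uint8 // group-relative processor number, assumed to be 0..63
}

// GroupAffinity is a hypothetical per-group affinity entry: one bit per
// processor in the group.
type GroupAffinity struct {
	Group uint32
	Mask  uint64
}

// groupAffinities builds one bitmask per processor group from a CPU list.
// The result is sorted by group number so the output is deterministic.
func groupAffinities(cpus []LogicalProcessor) ([]GroupAffinity, error) {
	masks := make(map[uint32]uint64)
	for _, c := range cpus {
		if c.Number >= 64 {
			return nil, fmt.Errorf("processor number %d out of range for group %d", c.Number, c.Group)
		}
		masks[c.Group] |= uint64(1) << c.Number
	}
	out := make([]GroupAffinity, 0, len(masks))
	for g, m := range masks {
		out = append(out, GroupAffinity{Group: g, Mask: m})
	}
	sort.Slice(out, func(i, j int) bool { return out[i].Group < out[j].Group })
	return out, nil
}
```

For example, processors 0 and 3 of group 0 plus processor 63 of group 1 would yield two entries: mask `0x9` for group 0 and mask `1<<63` for group 1.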
167173

168-
In Cri we will add the following structure to the `WindowsContainerResources` in CRI:
174+
We will add the following structure to `WindowsContainerResources` in CRI:
169175

170176
```protobuf
171177
message WindowsCpuGroupAffinity {
@@ -268,7 +274,7 @@ Integration tests do not run on Windows. Functionality will be covered by unit a
268274

269275
##### e2e tests
270276

271-
- e2e_node will need to be enabled for windows to add coverage
277+
- e2e_node will need to be enabled for Windows to add coverage. We plan to enable just e2e tests that relate to memory/cpu/topology manager, not the full suite.
272278

273279
### Graduation Criteria
274280

@@ -305,10 +311,19 @@ N/A
305311

306312
### Version Skew Strategy
307313

308-
N/A
314+
This feature is kubelet specific, so version skew strategy is N/A.
309315

310316
## Production Readiness Review Questionnaire
311317

318+
This KEP discusses the changes required to enable the various managers on Windows.
319+
This means many of the PRR questions for these features have already been covered and implemented
320+
as part of those KEPs. We try to give details relevant to Windows but do not plan to change any of the
321+
details of the feature enablement in those KEPs unless it is required because of a difference in Windows.
322+
323+
https://github.com/kubernetes/enhancements/tree/master/keps/sig-node/1769-memory-manager#production-readiness-review-questionnaire
324+
https://github.com/kubernetes/enhancements/tree/master/keps/sig-node/693-topology-manager#production-readiness-review-questionnaire
325+
https://github.com/kubernetes/enhancements/tree/master/keps/sig-node/3570-cpumanager#production-readiness-review-questionnaire
326+
312327
### Feature Enablement and Rollback
313328

314329
<!--
@@ -329,19 +344,31 @@ well as the [existing list] of feature gates.
329344

330345
- [x] Feature gate (also fill in values in `kep.yaml`)
331346
- Feature gate name: WindowsCPUAndMemoryAffinity
332-
- Components depending on the feature gate:
333-
- [ ] Other
334-
- Describe the mechanism:
347+
- Components depending on the feature gate: Kubelet
335348
- Will enabling / disabling the feature require downtime of the control
336349
plane?
337350
No
338351
- Will enabling / disabling the feature require downtime or reprovisioning
339352
of a node?
340-
Yes it uses a feature gate. Memory and CPU managers have a state file that requires cleanup.
353+
This behavior is the same as the features are implemented today in the existing KEPs:
354+
355+
https://github.com/kubernetes/enhancements/tree/master/keps/sig-node/3570-cpumanager#troubleshooting
356+
https://github.com/kubernetes/enhancements/tree/master/keps/sig-node/1769-memory-manager#feature-enablement-and-rollback
357+
358+
Yes, it uses a feature gate. The Memory and CPU managers have a state file that requires cleanup. After changing the CPU manager policy from none to static or the other way around, you must remove the CPU manager state file (/var/lib/kubelet/cpu_manager_state) before starting the kubelet again; otherwise the kubelet will fail to start. Startup failures for this reason will be logged in the kubelet log.
359+
360+
Details for the steps to reset a state file are in https://kubernetes.io/docs/tasks/administer-cluster/cpu-management-policies/#changing-the-cpu-manager-policy. Memory manager has the same steps for resetting.
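
The cleanup step above can be sketched in Go as follows. This is only an illustration, not kubelet code: `removeStateFile` is a hypothetical helper, the caller is assumed to have stopped the kubelet first (as the linked documentation describes), and the path passed in is whatever state file location the node actually uses (the text above names `/var/lib/kubelet/cpu_manager_state`; the Windows location may differ).

```go
package main

import (
	"errors"
	"io/fs"
	"os"
)

// removeStateFile deletes a manager state file so that the kubelet can
// recreate it on its next start. It reports whether a file was removed.
// A missing file is not treated as an error, since the goal (no stale
// state) is already met. Hypothetical helper, for illustration only.
func removeStateFile(path string) (bool, error) {
	err := os.Remove(path)
	if err == nil {
		return true, nil
	}
	if errors.Is(err, fs.ErrNotExist) {
		return false, nil
	}
	return false, err
}
```

A caller would run this once per state file (CPU manager and, if used, Memory manager) between stopping and restarting the kubelet.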
341361

342362
###### Does enabling the feature change any default behavior?
343363

344-
No, Additional settings are required to enable the features. The default policies for CPU/Memory manager will be `None`, meaning that they will not interact with running of pods.
364+
No. Additional settings are required to enable the features. The default policies for CPU/Memory manager will be `None`, meaning that they will not interact with running of pods. The cluster administrator will need to set specific CPU/Memory/Topology manager policies
365+
to enable any features described here.
366+
367+
See feature details in:
368+
369+
https://github.com/kubernetes/enhancements/tree/master/keps/sig-node/3570-cpumanager#feature-enablement-and-rollback
370+
https://github.com/kubernetes/enhancements/tree/master/keps/sig-node/1769-memory-manager#feature-enablement-and-rollback
371+
https://github.com/kubernetes/enhancements/tree/master/keps/sig-node/693-topology-manager#feature-enablement-and-rollback
345372

346373
###### Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)?
347374

@@ -356,12 +383,18 @@ feature.
356383
NOTE: Also set `disable-supported` to `true` or `false` in `kep.yaml`.
357384
-->
358385

359-
Yes. Restarting of the pods will be required to remove the CPU/Memory affinity.
386+
Yes. A rolling restart (delete or delete and redeploy) of the pods will be required to remove the CPU/Memory affinity
387+
from running pods. Restarting kubelet after changing the feature will not affect any running pods but new pods created will be
388+
affected by the changes.
360389

361390
###### What happens if we reenable the feature if it was previously rolled back?
362391

392+
The Memory and CPU managers use a state file to track assignments. If the state file is not valid, it must be removed and the kubelet restarted. For example, the state file might become invalid when the kube/system reserved values have changed (increased), which may lead to a situation where some containers cannot be started.
393+
363394
###### Are there any tests for feature enablement/disablement?
364395

396+
Yes, there are a number of unit tests for state file validation.
397+
365398
<!--
366399
The e2e framework does not currently support enabling or disabling feature
367400
gates. However, unit tests in each component dealing with managing data, created
@@ -466,13 +499,9 @@ The memory/cpu manager will be under the pod resources API. And there are propos
466499

467500
###### How can someone using this feature know that it is working for their instance?
468501

469-
- [x] Events
470-
- Event Reason:
471-
- [ ] API .status
472-
- Condition name:
473-
- Other field:
474-
- [ ] Other (treat as last resort)
475-
- Details:
502+
- [X] Other (treat as last resort)
503+
- Details: check the kubelet metric `cpu_manager_pinning_requests_total`
504+
- check the kubelet metric `memory_manager_pinning_requests_total`
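
One way to act on the two metrics above is to scrape the kubelet metrics endpoint and read the counters from the Prometheus text output. The Go sketch below only does the parsing part; `sumMetric` is a hypothetical helper, and matching by name suffix is an assumption made so that the lookup still works if the exposed metric name carries a component prefix.

```go
package main

import (
	"strconv"
	"strings"
)

// sumMetric scans Prometheus text-format output and sums every sample whose
// metric name ends with nameSuffix. The boolean reports whether any sample
// matched. Hypothetical helper, for illustration only.
func sumMetric(exposition, nameSuffix string) (float64, bool) {
	var total float64
	found := false
	for _, line := range strings.Split(exposition, "\n") {
		line = strings.TrimSpace(line)
		if line == "" || strings.HasPrefix(line, "#") {
			continue // skip blank lines and HELP/TYPE comments
		}
		// The name ends at the label block or at the first space.
		end := strings.IndexAny(line, "{ ")
		if end < 0 {
			continue
		}
		name, rest := line[:end], line[end:]
		if strings.HasPrefix(rest, "{") {
			closing := strings.LastIndexByte(rest, '}')
			if closing < 0 {
				continue // malformed label block
			}
			rest = rest[closing+1:]
		}
		if !strings.HasSuffix(name, nameSuffix) {
			continue
		}
		fields := strings.Fields(rest)
		if len(fields) == 0 {
			continue
		}
		v, err := strconv.ParseFloat(fields[0], 64)
		if err != nil {
			continue
		}
		total += v
		found = true
	}
	return total, found
}
```

A non-zero value for `cpu_manager_pinning_requests_total` or `memory_manager_pinning_requests_total` after scheduling a pod that qualifies for pinning would suggest the manager is handling requests on that node.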
476505

477506
###### What are the reasonable SLOs (Service Level Objectives) for the enhancement?
478507

@@ -592,3 +621,5 @@ Use this section if you need things from the project/SIG. Examples include a
592621
new subproject, repos requested, or GitHub details. Listing these here allows a
593622
SIG to get the process for these resources started right away.
594623
-->
624+
625+
N/A. Windows will use the existing testing infrastructure.
