Merge pull request #5117 from swatisehgal/distribute-cpus-across-numa-to-beta

k8s-ci-robot · web-flow · commit 06d6d0d9e581 · 2025-02-07T09:07:55.000-08:00
KEP-2902: Promote CPUManager policy option to distribute CPUs across NUMA nodes to Beta
diff --git a/keps/prod-readiness/sig-node/2902.yaml b/keps/prod-readiness/sig-node/2902.yaml
@@ -1,3 +1,5 @@
 kep-number: 2902
 alpha:
   approver: "@johnbelamaric"
+beta:
+  approver: "@johnbelamaric"
diff --git a/keps/sig-node/2902-cpumanager-distribute-cpus-policy-option/README.md b/keps/sig-node/2902-cpumanager-distribute-cpus-policy-option/README.md
@@ -9,7 +9,12 @@
 - [Proposal](#proposal)
   - [Risks and Mitigations](#risks-and-mitigations)
 - [Design Details](#design-details)
+  - [Compatibility with <code>full-pcpus-only</code> policy options](#compatibility-with-full-pcpus-only-policy-options)
   - [Test Plan](#test-plan)
+      - [Prerequisite testing updates](#prerequisite-testing-updates)
+      - [Unit tests](#unit-tests)
+      - [Integration tests](#integration-tests)
+      - [e2e tests](#e2e-tests)
   - [Graduation Criteria](#graduation-criteria)
     - [Alpha](#alpha)
     - [Beta](#beta)
@@ -18,8 +23,11 @@
   - [Version Skew Strategy](#version-skew-strategy)
 - [Production Readiness Review Questionnaire](#production-readiness-review-questionnaire)
   - [Feature Enablement and Rollback](#feature-enablement-and-rollback)
+  - [Rollout, Upgrade and Rollback Planning](#rollout-upgrade-and-rollback-planning)
   - [Monitoring Requirements](#monitoring-requirements)
+  - [Dependencies](#dependencies)
   - [Scalability](#scalability)
+  - [Troubleshooting](#troubleshooting)
 - [Implementation History](#implementation-history)
 <!-- /toc -->
 
@@ -75,7 +83,7 @@ When enabled, this will trigger the `CPUManager` to evenly distribute CPUs acros
 ### Risks and Mitigations
 
 The risks of adding this new feature are quite low.
-It is isolated to a specific policy option within the `CPUManager`, and is protected both by the option itself, as well as the `CPUManagerPolicyAlphaOptions` feature gate (which is disabled by default).
+It is isolated to a specific policy option within the `CPUManager`, and is protected both by the option itself and by the `CPUManagerPolicyBetaOptions` feature gate (which is enabled by default, but can be disabled).
 
 | Risk                                             | Impact | Mitigation |
 | -------------------------------------------------| -------| ---------- |
@@ -112,10 +120,39 @@ If none of the above conditions can be met, resort back to a best-effort fit of
 
 NOTE: The striping operation after all CPUs have been evenly distributed will be performed such that the overall distribution of CPUs across those NUMA nodes remains as balanced as possible.
 
+### Compatibility with `full-pcpus-only` policy options
+
+| Compatibility | alpha | beta | GA |
+| --- | --- | --- | --- |
+| full-pcpus-only | x | x | x |
+
 ### Test Plan
 
 We will extend both the unit test suite and the E2E test suite to cover the new policy option described in this KEP.
 
+[x] I/we understand the owners of the involved components may require updates to
+existing tests to make this code solid enough prior to committing the changes necessary
+to implement this enhancement.
+
+##### Prerequisite testing updates
+
+##### Unit tests
+
+- `k8s.io/kubernetes/pkg/kubelet/cm/cpumanager`: `20250205` - 85.5% of statements
+
+##### Integration tests
+
+Not Applicable as Kubelet features don't have integration tests. We use a mix of `e2e_node` and `e2e` tests.
+
+##### e2e tests
+
+Currently no e2e tests are present for this particular policy option. E2E tests will be added as part of Beta graduation.
+
+The plan is to add e2e tests to cover the basic flows for cases below:
+1. `distribute-cpus-across-numa` option is enabled: The test will ensure that the allocated CPUs are distributed across NUMA nodes according to the policy.
+1. `distribute-cpus-across-numa` option is disabled: The test will verify that the allocated CPUs are packed according to the default behavior.
+1. Interaction with the `full-pcpus-only` policy option: The test will cover this option with `full-pcpus-only` both enabled and disabled.
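As an illustration of the kind of assertion such e2e tests could make, the following is a minimal sketch, not the actual test code. The helper name `isBalanced` and the `tolerance` parameter are hypothetical; the sketch assumes the test has already counted how many allocated CPUs landed on each NUMA node.

```go
package main

import "fmt"

// isBalanced reports whether the per-NUMA-node CPU counts of one allocation
// differ by at most `tolerance` across the NUMA nodes that received CPUs.
// NUMA nodes with a count of zero are ignored, so a single-node allocation
// is trivially balanced. (Hypothetical helper, for illustration only.)
func isBalanced(perNode []int, tolerance int) bool {
	lo, hi := -1, 0
	for _, n := range perNode {
		if n == 0 {
			continue
		}
		if lo == -1 || n < lo {
			lo = n
		}
		if n > hi {
			hi = n
		}
	}
	return lo == -1 || hi-lo <= tolerance
}

func main() {
	// Assumed scenario: a request for 10 CPUs on a node with 2 NUMA nodes
	// of 8 CPUs each.
	fmt.Println(isBalanced([]int{5, 5}, 1)) // option enabled: even split
	fmt.Println(isBalanced([]int{8, 2}, 1)) // option disabled: packed
}
```

A tolerance of 1 is an assumption; a larger value may be appropriate when CPUs are handed out in groups, for example together with `full-pcpus-only`.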
+
 ### Graduation Criteria
 
 #### Alpha
@@ -149,7 +186,9 @@ No changes needed
 ###### How can this feature be enabled / disabled in a live cluster?
 
 - [X] Feature gate (also fill in values in `kep.yaml`)
+  - Feature gate name: `CPUManagerPolicyOptions`
   - Feature gate name: `CPUManagerPolicyAlphaOptions`
+  - Feature gate name: `CPUManagerPolicyBetaOptions`
   - Components depending on the feature gate: `kubelet`
 - [X] Change the kubelet configuration to set a `CPUManager` policy of `static` and a `CPUManager` policy option of `distribute-cpus-across-numa`
   - Will enabling / disabling the feature require downtime of the control
@@ -161,14 +200,14 @@ No changes needed
 ###### Does enabling the feature change any default behavior?
 
 No. In order to trigger any of the new logic, three things have to be true:
-1. The `CPUManagerPolicyAlphaOptions` feature gate must be enabled
+1. The `CPUManagerPolicyBetaOptions` feature gate must be enabled
 1. The `static` `CPUManager` policy must be selected
 1. The new `distribute-cpus-across-numa` policy option must be selected
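The three conditions above can be sketched as a single predicate. This is a hedged illustration rather than kubelet code: the `kubeletSettings` type and the function name are hypothetical, and the feature gate map is assumed to hold effective values (defaults already applied).

```go
package main

import "fmt"

// kubeletSettings is a hypothetical, trimmed-down view of the kubelet
// configuration. It is not the real KubeletConfiguration type.
type kubeletSettings struct {
	FeatureGates     map[string]bool   // effective gate values (assumption)
	CPUManagerPolicy string            // e.g. "none" or "static"
	PolicyOptions    map[string]string // e.g. {"distribute-cpus-across-numa": "true"}
}

// distributeAcrossNUMAActive returns true only when all three conditions
// listed above hold at the same time.
func distributeAcrossNUMAActive(s kubeletSettings) bool {
	return s.FeatureGates["CPUManagerPolicyBetaOptions"] &&
		s.CPUManagerPolicy == "static" &&
		s.PolicyOptions["distribute-cpus-across-numa"] == "true"
}

func main() {
	s := kubeletSettings{
		FeatureGates:     map[string]bool{"CPUManagerPolicyBetaOptions": true},
		CPUManagerPolicy: "static",
		PolicyOptions:    map[string]string{"distribute-cpus-across-numa": "true"},
	}
	fmt.Println(distributeAcrossNUMAActive(s))

	// Changing any one of the three inputs switches the new logic off.
	s.CPUManagerPolicy = "none"
	fmt.Println(distributeAcrossNUMAActive(s))
}
```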
 
 ###### Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)?
 
 Yes, the feature can be disabled by either:
-1. Disabling the `CPUManagerPolicyAlphaOptions` feature gate
+1. Disabling the `CPUManagerPolicyBetaOptions` feature gate
 1. Switching the `CPUManager` policy to `none`
 1. Removing `distribute-cpus-across-numa` from the list of `CPUManager` policy options
 
@@ -182,12 +221,34 @@ No changes. Existing container will not see their allocation changed. New contai
 
 - A specific e2e test will demonstrate that the default behaviour is preserved when the feature gate is disabled, or when the feature is not used (2 separate tests)
 
+### Rollout, Upgrade and Rollback Planning
+
+###### How can a rollout or rollback fail? Can it impact already running workloads?
+
+- A rollout or rollback can fail if the feature gate and the policy option are not configured properly and kubelet fails to start.
+
+###### What specific metrics should inform a rollback?
+
+As part of the graduation of this feature, we plan to add the metric `cpu_manager_numa_allocation_spread`, which shows how CPUs are distributed across NUMA nodes.
+An unexpected CPU distribution across NUMA nodes in this metric can help inform a rollback decision.
+
+###### Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?
+
+Not Applicable. This policy option only affects pods that meet certain conditions and are scheduled after the upgrade. Running pods will be unaffected
+by any change.
+
+###### Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?
+
+No
+
 ### Monitoring Requirements
 
 ###### How can an operator determine if the feature is in use by workloads?
 
 Inspect the kubelet configuration of a node -- check for the presence of the feature gate and usage of the new policy option.
 
+In addition, we can check the metric `cpu_manager_numa_allocation_spread` to determine how allocated CPUs are spread across NUMA nodes.
+
 ###### How can someone using this feature know that it is working for their instance?
 
 In order to verify this feature is working, one should:
@@ -201,6 +262,31 @@ To verify the list of CPUs allocated to the container, one can either:
 - `exec` into the container and run `taskset -cp 1` (assuming this command is available in the container).
 - Call the `GetCPUs()` method of the `CPUsProvider` interface in the `kubelet`'s [podresources API](https://pkg.go.dev/k8s.io/kubernetes/pkg/kubelet/apis/podresources#CPUsProvider).
 
+Also, we can check the `cpu_manager_numa_allocation_spread` metric. We plan to add this metric to track how CPUs are distributed across NUMA nodes,
+with labels/buckets representing NUMA nodes (numa_node=0, numa_node=1, ..., numa_node=N).
+
+With packed allocation (default, option off), the allocated CPUs should mostly be concentrated on a single NUMA node, with a small tail on a second NUMA node (and possibly more)
+in cases of severe fragmentation. Users can compare this spread metric with the `container_aligned_compute_resources_count` metric to determine
+whether they are getting aligned packed allocation or just packed allocation due to implementation details.
+
+For example, if a node has 2 NUMA nodes and a pod requests 8 CPUs (with no other pods requesting exclusive CPUs on the node), the metric would look like this:
+
+cpu_manager_numa_allocation_spread{numa_node="0"} = 8
+cpu_manager_numa_allocation_spread{numa_node="1"} = 0
+
+
+When the option is enabled, we would expect a more even distribution of CPUs across NUMA nodes, with no sharp peaks as seen with packed allocation.
+Users can also check the `container_aligned_compute_resources_count` metric to assess resource alignment and system behavior.
+
+In this case, the metric would show:
+cpu_manager_numa_allocation_spread{numa_node="0"} = 4
+cpu_manager_numa_allocation_spread{numa_node="1"} = 4
+
+
+Note: This example is simplified to clearly highlight the difference between the two cases. Existing pods may slightly skew the counts, but the general
+trend of peaks and troughs will still provide a good indication of CPU distribution across NUMA nodes, allowing users to determine if the policy option
+is enabled or not.
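The two-NUMA-node example above can be reproduced with a small sketch that counts allocated CPUs per NUMA node. The topology (16 CPUs, CPUs 0-7 on NUMA node 0 and 8-15 on NUMA node 1), the concrete CPU IDs, and the helper name `numaSpread` are assumptions made for illustration; this is not how the kubelet computes the planned metric.

```go
package main

import "fmt"

// numaSpread counts how many of the allocated CPUs fall on each NUMA node.
// cpuToNUMA is a hypothetical CPU ID -> NUMA node ID map. CPUs that are
// missing from the map or map to an out-of-range node are skipped.
func numaSpread(allocated []int, cpuToNUMA map[int]int, numNUMA int) []int {
	spread := make([]int, numNUMA)
	for _, cpu := range allocated {
		node, ok := cpuToNUMA[cpu]
		if !ok || node < 0 || node >= numNUMA {
			continue
		}
		spread[node]++
	}
	return spread
}

func main() {
	// Assumed topology: CPUs 0-7 on NUMA node 0, CPUs 8-15 on NUMA node 1.
	cpuToNUMA := make(map[int]int)
	for cpu := 0; cpu < 16; cpu++ {
		cpuToNUMA[cpu] = cpu / 8
	}

	packed := []int{0, 1, 2, 3, 4, 5, 6, 7}        // option off
	distributed := []int{0, 1, 2, 3, 8, 9, 10, 11} // option on

	for node, n := range numaSpread(packed, cpuToNUMA, 2) {
		fmt.Printf("packed:      numa_node=%d cpus=%d\n", node, n)
	}
	for node, n := range numaSpread(distributed, cpuToNUMA, 2) {
		fmt.Printf("distributed: numa_node=%d cpus=%d\n", node, n)
	}
}
```

Under these assumptions the packed case yields 8 and 0 CPUs on the two NUMA nodes, and the distributed case yields 4 and 4, matching the example values above.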
+
 ###### What are the reasonable SLOs (Service Level Objectives) for the enhancement?
 
 There are no specific SLOs for this feature.
@@ -212,13 +298,20 @@ None
 
 ###### Are there any missing metrics that would be useful to have to improve observability of this feature?
 
-None
+Yes, as part of the graduation of this feature to Beta, we plan to add the `cpu_manager_numa_allocation_spread` metric
+to provide data on how the CPUs are distributed across NUMA nodes.
 
 ###### Does this feature depend on any specific services running in the cluster?
 
 This feature is `linux` specific, and requires a version of CRI that includes the `LinuxContainerResources.CpusetCpus` field.
 This has been available since `v1alpha2`.
 
+### Dependencies
+
+###### Does this feature depend on any specific services running in the cluster?
+
+No
+
 ### Scalability
 
 ###### Will enabling / using this feature result in any new API calls?
@@ -249,9 +342,26 @@ This delay should be minimal.
 
 No, the algorithm will run on a single `goroutine` with minimal memory requirements.
 
+### Troubleshooting
+
+###### How does this feature react if the API server and/or etcd is unavailable?
+
+No impact. The behavior of the feature does not change when API Server and/or etcd is unavailable since the feature is node local.
+
+###### What are other known failure modes?
+
+Because of how already-allocated CPUs are distributed across NUMA nodes, a distributed allocation might not be possible, e.g. if all available CPUs are
+on the same NUMA node.
+
+In that case we fall back to a best-effort fit of packing CPUs into NUMA nodes wherever they can fit.
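The fallback behaviour can be illustrated with a deliberately simplified sketch. It is not the kubelet's allocation algorithm: it only works with free-CPU counts per NUMA node (ignoring sockets, cores and CPU IDs), and the function name `distributeOrPack` is hypothetical. It tries the smallest number of NUMA nodes on which an even split fits, and otherwise packs greedily.

```go
package main

import (
	"errors"
	"fmt"
	"sort"
)

// distributeOrPack returns how many CPUs to take from each NUMA node.
// free[i] is the number of free CPUs on NUMA node i (assumed input).
// The bool result reports whether an even distribution was achieved.
func distributeOrPack(free []int, request int) ([]int, bool, error) {
	total := 0
	for _, f := range free {
		total += f
	}
	if request <= 0 || request > total {
		return nil, false, errors.New("request cannot be satisfied")
	}

	// Visit NUMA nodes from most to least free CPUs.
	order := make([]int, len(free))
	for i := range order {
		order[i] = i
	}
	sort.SliceStable(order, func(a, b int) bool {
		return free[order[a]] > free[order[b]]
	})

	// Try the smallest node count k for which even shares all fit.
	for k := 1; k <= len(order); k++ {
		base, extra := request/k, request%k
		take := make([]int, len(free))
		fits := true
		for i := 0; i < k; i++ {
			share := base
			if i < extra {
				share++ // spread the remainder over the first nodes
			}
			if free[order[i]] < share {
				fits = false
				break
			}
			take[order[i]] = share
		}
		if fits {
			return take, true, nil
		}
	}

	// Fallback: best-effort packing wherever CPUs are free.
	take := make([]int, len(free))
	remaining := request
	for _, node := range order {
		n := free[node]
		if n > remaining {
			n = remaining
		}
		take[node] = n
		remaining -= n
	}
	return take, false, nil
}

func main() {
	take, even, _ := distributeOrPack([]int{8, 8}, 10)
	fmt.Println(take, even) // even split across both NUMA nodes

	take, even, _ = distributeOrPack([]int{6, 1}, 7)
	fmt.Println(take, even) // no even split fits, so CPUs are packed
}
```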
+
+###### What steps should be taken if SLOs are not being met to determine the problem?
+
 ## Implementation History
 
 - 2021-08-26: Initial KEP created
 - 2021-08-30: Updates to fill out more sections, answer PRR questions
 - 2021-09-08: Change feature gate from `CPUManagerPolicyOptions` to `CPUManagerPolicyExperimentalOptions`
 - 2021-10-11: Change feature gate from `CPUManagerPolicyExperimentalOptions` to `CPUManagerPolicyAlphaOptions`
+- 2025-01-30: KEP update for Beta graduation of the policy option
+- 2025-02-05: KEP update to the latest template
diff --git a/keps/sig-node/2902-cpumanager-distribute-cpus-policy-option/kep.yaml b/keps/sig-node/2902-cpumanager-distribute-cpus-policy-option/kep.yaml
@@ -2,39 +2,49 @@ title: CPUManager Policy Option to Distribute CPUs Across NUMA Nodes Instead of
 kep-number: 2902
 authors:
   - "@klueska"
+  - "@swatisehgal" # For Beta graduation
 owning-sig: sig-node
 participating-sigs: []
 status: implementable
 creation-date: "2021-08-26"
+last-updated: "2025-01-31"
 reviewers:
-  - "@fromani"
+  - "@ffromani"
 approvers:
   - "@sig-node-tech-leads"
 see-also:
   - "keps/sig-node/2625-cpumanager-policies-thread-placement"
+  - "keps/sig-node/3545-improved-multi-numa-alignment/"
+  - "keps/sig-node/4176-cpumanager-spread-cpus-preferred-policy/"
+  - "keps/sig-node/4540-strict-cpu-reservation"
+  - "keps/sig-node/4622-topologymanager-max-allowable-numa-nodes/"
+  - "keps/sig-node/4800-cpumanager-split-uncorecache/"
 replaces: []
 
 # The target maturity stage in the current dev cycle for this KEP.
-stage: alpha
+stage: beta
 
 # The most recent milestone for which work toward delivery of this KEP has been
 # done. This can be the current (upcoming) milestone, if it is being actively
 # worked on.
-latest-milestone: "v1.23"
+latest-milestone: "v1.33"
 
 # The milestone at which this feature was, or is targeted to be, at each stage.
 milestone:
   alpha: "v1.23"
-  beta: "v1.24"
-  stable: "v1.25"
+  beta: "v1.33"
+  stable: "v1.35"
 
 # The following PRR answers are required at alpha release
 # List the feature gate name and the components for which it must be enabled
 feature-gates:
+  - name: "CPUManagerPolicyOptions"
   - name: "CPUManagerPolicyAlphaOptions"
+  - name: "CPUManagerPolicyBetaOptions"
     components:
       - kubelet
 disable-supported: true
 
 # The following PRR answers are required at beta release
-metrics: []
+metrics:
+  - cpu_manager_numa_allocation_spread