
Commit a69f81d

Merge pull request #48938 from hacktivist123/merged-main-dev-1.32
Merge main branch into dev-1.32
2 parents 3f720f4 + 9a29b37 commit a69f81d

84 files changed: +1570 additions, −318 deletions


Makefile

Lines changed: 2 additions & 2 deletions
@@ -56,7 +56,7 @@ non-production-build: module-check ## Build the non-production site, which adds
 	GOMAXPROCS=1 hugo --cleanDestinationDir --enableGitInfo --environment nonprod
 
 serve: module-check ## Boot the development server.
-	hugo server --buildFuture --environment development
+	hugo server --buildDrafts --buildFuture --environment development
 
 docker-image:
 	@echo -e "$(CCRED)**** The use of docker-image is deprecated. Use container-image instead. ****$(CCEND)"
@@ -107,7 +107,7 @@ container-build: module-check
 container-serve: module-check ## Boot the development server using container.
 	$(CONTAINER_RUN) --cap-drop=ALL --cap-add=AUDIT_WRITE --read-only \
 	--mount type=tmpfs,destination=/tmp,tmpfs-mode=01777 -p 1313:1313 $(CONTAINER_IMAGE) \
-	hugo server --buildFuture --environment development --bind 0.0.0.0 --destination /tmp/public --cleanDestinationDir --noBuildLock
+	hugo server --buildDrafts --buildFuture --environment development --bind 0.0.0.0 --destination /tmp/public --cleanDestinationDir --noBuildLock
 
 test-examples:
 	scripts/test_examples.sh install

OWNERS_ALIASES

Lines changed: 3 additions & 0 deletions
@@ -133,8 +133,11 @@ aliases:
     - Okabe-Junya
   sig-docs-ja-reviews: # PR reviews for Japanese content
     - atoato88
+    - b1gb4by
     - bells17
+    - inductor
     - kakts
+    - nasa9084
     - Okabe-Junya
     - t-inu
   sig-docs-ko-owners: # Admins for Korean content
Lines changed: 72 additions & 0 deletions
@@ -0,0 +1,72 @@
---
layout: blog
title: "Kubernetes v1.32: Memory Manager Goes GA"
date: 2024-11-11
slug: memory-manager-goes-ga
author: >
  [Talor Itzhak](https://github.com/Tal-or) (Red Hat)
draft: true
---

With Kubernetes 1.32, the memory manager has officially graduated to General Availability (GA),
marking a significant milestone in the journey toward efficient and predictable memory allocation for containerized applications.
Since graduating to beta in Kubernetes v1.22, the memory manager has proved itself reliable, stable, and a good complement to the
[CPU Manager](/docs/tasks/administer-cluster/cpu-management-policies/).

As part of the kubelet's workload admission process,
the memory manager provides topology hints
to optimize memory allocation and alignment.
This enables users to allocate exclusive
memory for Pods in the [Guaranteed](/docs/concepts/workloads/pods/pod-qos/#guaranteed) QoS class.
More details about the process can be found in the [Memory Manager moves to beta blog post](/blog/2021/08/11/kubernetes-1-22-feature-memory-manager-moves-to-beta/).
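
To make that prerequisite concrete, here is a minimal sketch (not from the post; the pod name and image are illustrative) of a pod that lands in the Guaranteed QoS class because its requests equal its limits. On a node where the kubelet runs the memory manager with the `Static` policy, such a pod is a candidate for exclusive, NUMA-aligned memory:

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func main() {
	// Identical requests and limits for every resource put the pod in the
	// Guaranteed QoS class, a prerequisite for exclusive memory allocation.
	res := corev1.ResourceList{
		corev1.ResourceCPU:    resource.MustParse("2"), // integer CPU count
		corev1.ResourceMemory: resource.MustParse("2Gi"),
	}
	pod := corev1.Pod{
		ObjectMeta: metav1.ObjectMeta{Name: "guaranteed-example"}, // hypothetical name
		Spec: corev1.PodSpec{
			Containers: []corev1.Container{{
				Name:  "app",
				Image: "registry.k8s.io/pause:3.9",
				Resources: corev1.ResourceRequirements{
					Requests: res,
					Limits:   res, // requests == limits => Guaranteed
				},
			}},
		},
	}
	fmt.Println("QoS-eligible pod:", pod.Name)
}
```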

Most of the changes introduced since the beta are bug fixes, internal refactoring, and
observability improvements, such as metrics and better logging.

## Observability improvements

As part of the effort to increase the observability of the memory manager,
new metrics have been added to provide statistics on memory allocation patterns.

* **memory_manager_pinning_requests_total** -
  tracks the number of times the pod spec required the memory manager to pin memory pages.

* **memory_manager_pinning_errors_total** -
  tracks the number of times the pod spec required the memory manager
  to pin memory pages, but the allocation failed.
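
As a hedged sketch of how you might inspect these counters (the node name is a placeholder, and this assumes your kubeconfig may reach the node `proxy` subresource), the kubelet's metrics endpoint can be read through the API server proxy:

```go
package main

import (
	"context"
	"fmt"
	"strings"

	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Load the local kubeconfig.
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	cs := kubernetes.NewForConfigOrDie(cfg)

	// Read the kubelet's /metrics endpoint through the API server node proxy.
	raw, err := cs.CoreV1().RESTClient().Get().
		Resource("nodes").Name("worker-0"). // hypothetical node name
		SubResource("proxy").Suffix("metrics").
		DoRaw(context.Background())
	if err != nil {
		panic(err)
	}

	// Print only the memory manager pinning counters.
	for _, line := range strings.Split(string(raw), "\n") {
		if strings.HasPrefix(line, "memory_manager_pinning_") {
			fmt.Println(line)
		}
	}
}
```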

## Improving memory manager reliability and consistency

The kubelet does not guarantee pod ordering
when admitting pods after a restart or reboot.

In certain edge cases, this behavior could cause
the memory manager to reject some pods,
and in more extreme cases, it could cause the kubelet to fail upon restart.

Previously, the beta implementation lacked certain checks and logic to prevent
these issues.

To stabilize the memory manager for general availability (GA) readiness,
small but critical refinements have been
made to the algorithm, improving its robustness and handling of edge cases.
## Future development

There is more to come for the future of the Topology Manager in general,
and the memory manager in particular.
Notably, ongoing efforts are underway
to extend [memory manager support to Windows](https://github.com/kubernetes/kubernetes/pull/128560),
enabling CPU and memory affinity on the Windows operating system.
## Getting involved

This feature is driven by the [SIG Node](https://github.com/Kubernetes/community/blob/master/sig-node/README.md) community.
Please join us to connect with the community
and share your ideas and feedback on this feature and beyond.
We look forward to hearing from you!
Lines changed: 93 additions & 0 deletions
@@ -0,0 +1,93 @@
---
layout: blog
title: 'Enhancing Kubernetes API Server Efficiency with API Streaming'
date: 2024-12-11
draft: true
slug: kube-apiserver-api-streaming
author: >
  Stefan Schimanski (Upbound),
  Wojciech Tyczynski (Google),
  Lukasz Szaszkiewicz (Red Hat)
---

Managing Kubernetes clusters efficiently is critical, especially as clusters grow in size.
A significant challenge with large clusters is the memory overhead caused by **list** requests.

In the existing implementation, the kube-apiserver processes **list** requests by assembling the entire response in memory before transmitting any data to the client.
But what if the response body is substantial, say hundreds of megabytes? Additionally, imagine a scenario where multiple **list** requests flood in simultaneously, perhaps after a brief network outage.
While [API Priority and Fairness](/docs/concepts/cluster-administration/flow-control) has proven to reasonably protect the kube-apiserver from CPU overload, its impact is visibly smaller for memory protection.
This can be explained by the differing nature of resource consumption by a single API request: the CPU usage at any given time is capped by a constant, whereas memory, being incompressible, can grow proportionally with the number of processed objects and is unbounded.
This situation poses a genuine risk, potentially overwhelming and crashing any kube-apiserver within seconds due to out-of-memory (OOM) conditions. To better visualize the issue, let's consider the graph below.

{{< figure src="kube-apiserver-memory_usage.png" alt="Monitoring graph showing kube-apiserver memory usage" >}}

The graph shows the memory usage of a kube-apiserver during a synthetic test
(see the [synthetic test](#the-synthetic-test) section for more details).
The results clearly show that increasing the number of informers significantly boosts the server's memory consumption.
Notably, at approximately 16:40, the server crashed while serving only 16 informers.

## Why does kube-apiserver allocate so much memory for list requests?

Our investigation revealed that this substantial memory allocation occurs because, before sending the first byte to the client, the server must:

* fetch data from the database,
* deserialize the data from its stored format,
* and finally construct the final response by converting and serializing the data into the client-requested format.

This sequence results in significant temporary memory consumption.
The actual usage depends on many factors, such as the page size, applied filters (e.g. label selectors), query parameters, and the sizes of individual objects.
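
As a minimal sketch of the underlying effect (an illustration under our own assumptions, not apiserver code), compare a handler that marshals a whole collection before writing with one that encodes items one at a time; the former's peak memory grows with the collection, the latter's with the largest single item:

```go
package main

import (
	"encoding/json"
	"io"
)

type Item struct{ Payload []byte }

// buffered mirrors a classic list response: the entire collection is
// serialized into one in-memory buffer before a single byte is written.
func buffered(w io.Writer, items []Item) error {
	out, err := json.Marshal(items) // whole response held in memory at once
	if err != nil {
		return err
	}
	_, err = w.Write(out)
	return err
}

// streamed writes one item at a time, so peak memory stays near-constant,
// bounded by the largest single item rather than the whole collection.
func streamed(w io.Writer, items []Item) error {
	enc := json.NewEncoder(w)
	for _, it := range items {
		if err := enc.Encode(it); err != nil {
			return err
		}
	}
	return nil
}

func main() {
	items := make([]Item, 400)
	for i := range items {
		items[i] = Item{Payload: make([]byte, 1<<20)} // ~1 MB each
	}
	_ = buffered(io.Discard, items) // transiently allocates the full ~400 MB
	_ = streamed(io.Discard, items) // transient allocation stays near ~1 MB
}
```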
Unfortunately, neither [API Priority and Fairness](/docs/concepts/cluster-administration/flow-control) nor Go's garbage collection or Go memory limits can prevent the system from exhausting memory under these conditions.
The memory is allocated suddenly and rapidly, and just a few requests can quickly deplete the available memory, leading to resource exhaustion.

Depending on how the API server is run on the node, it might either be killed through OOM by the kernel when exceeding the configured memory limits during these uncontrolled spikes, or, if limits are not configured, it might have an even worse impact on the control plane node.
Worse still, after the first API server failure, the same requests will likely hit another control plane node in an HA setup, probably with the same impact.
This is a situation that is potentially hard to diagnose and hard to recover from.

## Streaming list requests

Today, we're excited to announce a major improvement.
With the graduation of the _watch list_ feature to beta in Kubernetes 1.32, client-go users can opt in (after explicitly enabling the `WatchListClient` feature gate)
to streaming lists by switching from **list** to (a special kind of) **watch** requests.

**Watch** requests are served from the _watch cache_, an in-memory cache designed to improve the scalability of read operations.
By streaming each item individually instead of returning the entire collection, the new method maintains constant memory overhead.
The API server is bound by the maximum allowed size of an object in etcd plus a few additional allocations.
This approach drastically reduces the temporary memory usage compared to traditional **list** requests, ensuring a more efficient and stable system,
especially in clusters with a large number of objects of a given type or with large average object sizes, where memory consumption used to be high despite paging.

Building on the insight gained from the synthetic test (see the [synthetic test](#the-synthetic-test) section), we developed an automated performance test to systematically evaluate the impact of the _watch list_ feature.
This test replicates the same scenario, generating a large number of Secrets with a large payload and scaling the number of informers to simulate heavy **list** request patterns.
The automated test is executed periodically to monitor memory usage of the server with the feature enabled and disabled.

The results showed significant improvements with the _watch list_ feature enabled.
With the feature turned on, the kube-apiserver's memory consumption stabilized at approximately **2 GB**.
By contrast, with the feature disabled, memory usage increased to approximately **20 GB**, a **10x** increase!
These results confirm the effectiveness of the new streaming API, which reduces the temporary memory footprint.

## Enabling API Streaming for your component

Upgrade to Kubernetes 1.32. Make sure your cluster uses etcd in version 3.4.31+ or 3.5.13+.
Change your client software to use watch lists. If your client code is written in Go, you'll want to enable `WatchListClient` for client-go.
For details on enabling that feature, read [Introducing Feature Gates to Client-Go: Enhancing Flexibility and Control](/blog/2024/08/12/feature-gates-in-client-go).

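As a hedged sketch (assuming the `KUBE_FEATURE_*` environment-variable mechanism that client-go's feature gates read, described in the post linked above), a program can opt in by exporting `KUBE_FEATURE_WatchListClient=true` before client-go first consults its gates; setting it at the top of `main` has the same effect:

```go
package main

import (
	"os"
	"time"

	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Assumption: client-go reads feature gates from KUBE_FEATURE_* environment
	// variables; exporting this in the launching shell works equally well.
	os.Setenv("KUBE_FEATURE_WatchListClient", "true")

	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	cs := kubernetes.NewForConfigOrDie(cfg)

	// With the gate on, informers perform their initial sync via streaming
	// watch-list requests instead of a buffered list request.
	factory := informers.NewSharedInformerFactory(cs, 10*time.Minute)
	_ = factory.Core().V1().Secrets().Informer()

	stop := make(chan struct{})
	defer close(stop)
	factory.Start(stop)
	factory.WaitForCacheSync(stop)
}
```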
## What's next?

In Kubernetes 1.32, the feature is enabled in kube-controller-manager by default despite its beta state.
This will eventually be expanded to other core components like kube-scheduler or the kubelet once the feature becomes generally available, if not earlier.
Other third-party components are encouraged to opt in to the feature during the beta phase, especially when they are at risk of accessing a large number of resources or kinds with potentially large object sizes.

For the time being, [API Priority and Fairness](/docs/concepts/cluster-administration/flow-control) assigns a reasonably small cost to **list** requests.
This is necessary to allow enough parallelism for the average case where **list** requests are cheap enough.
But it does not match the spiky, exceptional situation of many large objects.
Once the majority of the Kubernetes ecosystem has switched to _watch list_, the **list** cost estimation can be changed to larger values without risking degraded performance in the average case,
thereby increasing the protection against this kind of request, which can still hit the API server in the future.

## The synthetic test

In order to reproduce the issue, we conducted a manual test to understand the impact of **list** requests on kube-apiserver memory usage.
In the test, we created 400 Secrets, each containing 1 MB of data, and used informers to retrieve all Secrets.
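
A hedged sketch of the setup half of that test (the namespace and object names are illustrative, and it assumes a kubeconfig pointing at a disposable test cluster):

```go
package main

import (
	"bytes"
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	cs := kubernetes.NewForConfigOrDie(cfg)

	payload := bytes.Repeat([]byte("x"), 1<<20) // ~1 MB per Secret
	for i := 0; i < 400; i++ {
		s := &corev1.Secret{
			ObjectMeta: metav1.ObjectMeta{
				Name:      fmt.Sprintf("load-test-%03d", i), // hypothetical name
				Namespace: "watchlist-test",                 // hypothetical namespace
			},
			Data: map[string][]byte{"blob": payload},
		}
		if _, err := cs.CoreV1().Secrets("watchlist-test").Create(
			context.Background(), s, metav1.CreateOptions{}); err != nil {
			panic(err)
		}
	}
}
```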
The results were alarming: only 16 informers were needed to cause the test server to run out of memory and crash, demonstrating how quickly memory consumption can grow under such conditions.

Special shout out to [@deads2k](https://github.com/deads2k) for his help in shaping this feature.
[Binary image file added, 74.6 KB; preview not shown]
