Commit ff78594

Merge pull request kubernetes#3575 from harche/evented_pleg_update
Update Evented PLEG enhancement to include PodStatus
2 parents 77b5840 + f23e1a8 commit ff78594

10 files changed

+139
-36
lines changed

keps/sig-node/3386-kubelet-evented-pleg/README.md

Lines changed: 95 additions & 34 deletions
Original file line numberDiff line numberDiff line change
@@ -13,9 +13,9 @@
1313
- [Risks and Mitigations](#risks-and-mitigations)
1414
- [Design Details](#design-details)
1515
- [Feature Gate](#feature-gate)
16+
- [Timestamp of the Pod Status](#timestamp-of-the-pod-status)
1617
- [Runtime Service Changes](#runtime-service-changes)
17-
- [Events Filter](#events-filter)
18-
- [Kubelet Changes](#kubelet-changes)
18+
- [Pod Status Update in the Cache](#pod-status-update-in-the-cache)
1919
- [Test Plan](#test-plan)
2020
- [Prerequisite testing updates](#prerequisite-testing-updates)
2121
- [Unit tests](#unit-tests)
@@ -134,15 +134,71 @@ This might be a good place to talk about core concepts and how they relate.
134134
- Users can disable this feature to make kubelet use existing relisting based PLEG.
135135
- Another risk is the CRI implementation could have a buggy event emitting system, and miss pod lifecycle events.
136136
- A mitigation is a `kube_pod_missed_events` metric, which the Kubelet could report when a lifecycle event is registered that wasn't triggered by an event, but rather by changes of state between lists.
137+
- While using the Evented implementation, the periodic relisting functionality would still be used with an increased interval which should work as a fallback mechanism for missed events in case of any disruptions.
138+
- Evented PLEG will need to update the global cache timestamp periodically so that pod workers don't get stuck at [GetNewerThan](https://github.com/kubernetes/kubernetes/blob/4a894be926adfe51fd8654dcceef4ece89a4259f/pkg/kubelet/pod_workers.go#L924) if Evented PLEG misses an event for any unforeseen reason.
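
The role of the global cache timestamp can be illustrated with a simplified model of the lookup that pod workers wait on. This is only a sketch: `miniCache`, `setPod`, `newerThan` and `updateGlobalTime` are hypothetical names, timestamps are passed as Unix nanoseconds for brevity, and the model returns a boolean where the real `GetNewerThan` blocks until data is available.

```go
package main

import (
	"fmt"
	"time"
)

// miniCache is a hypothetical, simplified model of the kubelet pod cache.
// It tracks only timestamps, not the pod statuses themselves.
type miniCache struct {
	modified map[string]time.Time // per-pod update time
	global   time.Time            // time up to which every pod is considered fresh
}

func newMiniCache() *miniCache {
	return &miniCache{modified: make(map[string]time.Time)}
}

// setPod records the update time of a single pod.
func (c *miniCache) setPod(id string, nanos int64) {
	c.modified[id] = time.Unix(0, nanos)
}

// newerThan reports whether a waiter asking for data newer than minNanos
// could be served for the given pod.
func (c *miniCache) newerThan(id string, minNanos int64) bool {
	minTime := time.Unix(0, minNanos)
	if c.global.After(minTime) {
		return true
	}
	m, ok := c.modified[id]
	return ok && m.After(minTime)
}

// updateGlobalTime advances the global timestamp; it never moves it backwards.
func (c *miniCache) updateGlobalTime(nanos int64) {
	if t := time.Unix(0, nanos); t.After(c.global) {
		c.global = t
	}
}

func main() {
	c := newMiniCache()
	c.setPod("pod-a", 100)
	fmt.Println(c.newerThan("pod-a", 200)) // false: the event for pod-a was missed
	c.updateGlobalTime(300)                // periodic update by Evented PLEG
	fmt.Println(c.newerThan("pod-a", 200)) // true: the waiter is unblocked
}
```

In this model a waiter for `pod-a` would be stuck until either a newer per-pod entry arrives or the global timestamp moves past the requested time, which is why the periodic update matters when an event is lost.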
137139

138140
## Design Details
139141

142+
Kubelet generates [PodLifecycleEvent](https://github.com/kubernetes/kubernetes/blob/release-1.24/pkg/kubelet/pleg/pleg.go#L41) using [relisting](https://github.com/kubernetes/kubernetes/blob/050f930f8968874855eb215f0c0f0877bcdaa0e8/pkg/kubelet/pleg/generic.go#L150). These `PodLifecycleEvents` get [used](https://github.com/kubernetes/kubernetes/blob/050f930f8968874855eb215f0c0f0877bcdaa0e8/pkg/kubelet/kubelet.go#L2060) in kubelet's sync loop to infer the state of the container, e.g. to determine if the [container has died](https://github.com/kubernetes/kubernetes/blob/050f930f8968874855eb215f0c0f0877bcdaa0e8/pkg/kubelet/kubelet.go#L2118).
143+
144+
The idea behind this enhancement is that kubelet will receive the [CRI events](#runtime-service-changes) described below from the CRI runtime and generate the corresponding `PodLifecycleEvent`. This will reduce kubelet's dependency on relisting to generate `PodLifecycleEvent`, and the event will be immediately available within the sync loop instead of waiting for relisting to finish. Kubelet will still do relisting but with a reduced frequency.
140145
### Feature Gate
141-
This feature can only be enabled using the feature gate `EventedPLEG`.
146+
This feature can only be used when the `EventedPLEG` feature gate is enabled.
147+
148+
### Timestamp of the Pod Status
149+
![Existing Generic PLEG](./existing-generic-pleg.png)
150+
151+
Kubelet cache saves the [pod status with the timestamp](https://github.com/kubernetes/kubernetes/blob/c012d901d8bee86ef3e3c9472a1a4a0368a34775/pkg/kubelet/pleg/generic.go#L426). The value of this timestamp is calculated [within the kubelet process](https://github.com/kubernetes/kubernetes/blob/c012d901d8bee86ef3e3c9472a1a4a0368a34775/pkg/kubelet/pleg/generic.go#L399). This works fine when there is only Generic PLEG at work as it will calculate the timestamp first and then fetch the `PodStatus` to save it in the cache.
152+
153+
As of today, the `PodStatus` is saved in the cache without any validation of the existing status against the current timestamp. This works well when there is only `Generic PLEG` setting the `PodStatus` in the cache.
154+
155+
If multiple entities, such as `Evented PLEG`, set the `PodStatus` in the cache, we may run into racy timestamps if each of them calculates the timestamp in its own execution flow. While `Generic PLEG` calculates this timestamp and then gets the `PodStatus`, `Evented PLEG` can only calculate the corresponding timestamp after the event has been received by the kubelet. Any disruption in receiving the events, such as errors in the gRPC connection, might skew the time calculated in the kubelet for `Evented PLEG`.
156+
157+
To address the issues above, we propose that the existing `Generic PLEG` as well as `Evented PLEG` rely on the CRI Runtime for the timestamp of the `PodStatus`. This way the timestamp of the `PodStatus` would also be closer to the actual time at which the statuses of the `Sandboxes` and `Containers` were provided by the CRI Runtime. It will enable us to correctly compare the timestamps before saving a status in the cache, avoiding erroneous behaviour. This should also prevent any old buffered `PodStatus` (accumulated during disruptions or failures) from overriding a newer entry in the cache.
158+
159+
![Modified Generic PLEG](./modified-generic-pleg.png "Modified Generic PLEG")
160+
161+
![Evented PLEG](./evented-pleg.png)
162+
142163

143164
### Runtime Service Changes
144165

145-
A new RPC will be introduced in the [CRI Runtime Service](https://github.com/kubernetes/kubernetes/blob/6efd6582df2011f1ec8c146ef711b3348ae07d60/staging/src/k8s.io/cri-api/pkg/apis/runtime/v1/api.proto#L34),
166+
Instead of getting the `Sandbox` and `Container` statuses independently and using the timestamp calculated from the kubelet process, `Generic PLEG` can fetch the `PodStatus` directly from the CRI Runtime using the modified [PodSandboxStatus](https://github.com/kubernetes/kubernetes/blob/4a894be926adfe51fd8654dcceef4ece89a4259f/staging/src/k8s.io/cri-api/pkg/apis/runtime/v1/api.proto#L58) rpc of the RuntimeService.
167+
168+
The modified `PodSandboxStatusRequest` will have a field `includeContainers` to indicate whether `PodSandboxStatusResponse` should include `ContainerStatuses` and the corresponding timestamp.
169+
170+
```protobuf=
171+
message PodSandboxStatusRequest {
172+
// ID of the PodSandbox for which to retrieve status.
173+
string pod_sandbox_id = 1;
174+
// Verbose indicates whether to return extra information about the pod sandbox.
175+
bool verbose = 2;
176+
// IncludeContainers indicates whether to include ContainerStatuses and timestamp in the PodSandboxStatusResponse
177+
bool includeContainers = 3;
178+
}
179+
```
180+
181+
```protobuf=
182+
message PodSandboxStatusResponse {
183+
// Status of the PodSandbox.
184+
PodSandboxStatus status = 1;
185+
186+
// Info is extra information of the PodSandbox. The key could be arbitrary string, and
187+
// value should be in json format. The information could include anything useful for
188+
// debug, e.g. network namespace for linux container based container runtime.
189+
// It should only be returned non-empty when Verbose is true.
190+
map<string, string> info = 2;
191+
192+
// ContainerStatuses need to be included if includeContainers is set true in PodSandboxStatusRequest
193+
repeated ContainerStatus containerStatues = 3;
194+
195+
// Timestamp needs to be included if includeContainers is set true in PodSandboxStatusRequest
196+
int64 timestamp = 4;
197+
198+
}
199+
```
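
As an illustration of how the kubelet side might consume the new `timestamp` field, here is a minimal Go sketch. It makes several assumptions that the proposal does not state: the struct is a hand-written stand-in rather than the generated CRI type, the timestamp is treated as Unix nanoseconds, and an unset (zero) timestamp falls back to the kubelet's own clock. `cacheTimestamp` is a hypothetical helper name.

```go
package main

import (
	"fmt"
	"time"
)

// podSandboxStatusResponse is a hypothetical Go mirror of the two new
// response fields; it is not the generated CRI type.
type podSandboxStatusResponse struct {
	ContainerStatuses []string // placeholder for repeated ContainerStatus
	Timestamp         int64    // assumed here to be Unix nanoseconds
}

// cacheTimestamp picks the time to store alongside the PodStatus.
// It prefers the runtime-provided timestamp and falls back to the kubelet's
// own clock when the runtime left the field unset (an assumed convention).
func cacheTimestamp(resp podSandboxStatusResponse, kubeletNowNanos int64) time.Time {
	if resp.Timestamp > 0 {
		return time.Unix(0, resp.Timestamp)
	}
	return time.Unix(0, kubeletNowNanos)
}

func main() {
	now := time.Now().UnixNano()
	fromRuntime := cacheTimestamp(podSandboxStatusResponse{Timestamp: 1000}, now)
	fallback := cacheTimestamp(podSandboxStatusResponse{}, now)
	fmt.Println(fromRuntime.UnixNano())       // timestamp reported by the runtime
	fmt.Println(fallback.UnixNano() == now)   // kubelet clock used as fallback
}
```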
200+
201+
Another RPC will be introduced in the [CRI Runtime Service](https://github.com/kubernetes/kubernetes/blob/6efd6582df2011f1ec8c146ef711b3348ae07d60/staging/src/k8s.io/cri-api/pkg/apis/runtime/v1/api.proto#L34),
146202

147203
```protobuf=
148204
// GetContainerEvents gets container events from the CRI runtime
@@ -160,10 +216,18 @@ message ContainerEventResponse {
160216
// Creation timestamp of this event
161217
int64 created_at = 3;
162218
163-
// ID of the sandbox container
164-
string sandbox_id = 4;
219+
// Metadata of the pod sandbox
220+
PodSandboxMetadata pod_sandbox_metadata = 4;
221+
222+
// Sandbox status of the pod
223+
PodSandboxStatus pod_sandbox_status = 5;
224+
225+
// Container statuses of the pod
226+
repeated ContainerStatus containers_statuses = 6;
165227
}
228+
166229
```
230+
Creation timestamp of the event will be used when saving the `PodStatus` in the kubelet cache.
167231

168232
```protobuf=
169233
enum ContainerEventType {
@@ -180,35 +244,33 @@ enum ContainerEventType {
180244
CONTAINER_DELETED_EVENT = 3;
181245
}
182246
```
183-
### Events Filter
184-
Events can be filtered to retrieve only subset of events,
247+
### Pod Status Update in the Cache
185248

186-
```protobuf=
187-
message GetEventsRequest {
188-
// Optional to filter a list of events.
189-
GetEventsFilter filter = 1;
190-
}
191-
```
249+
While using `Evented PLEG`, the existing `Generic PLEG` is set to relist with an increased period. However, if `Evented PLEG` faces a temporary disruption in the gRPC connection with the runtime, then once the connection is restored the incoming buffered events (which are now outdated) might overwrite the latest pod status that `Generic PLEG` wrote to the cache. A cache setter that only updates an entry when the cached pod status is older than the incoming pod status mitigates this issue.
192250

193-
```protobuf=
194-
// GetEventsFilter is used to filter a list of events.
195-
// All those fields are combined with 'AND'
196-
message GetEventsFilter {
197-
// ID of the container, sandbox.
198-
string id = 1;
199-
// LabelSelector to select matches.
200-
// Only api.MatchLabels is supported for now and the requirements
201-
// are ANDed. MatchExpressions is not supported yet.
202-
map<string, string> label_selector = 2;
203-
}
204-
```
251+
At present kubelet updates the cache using the [Set function](https://github.com/kubernetes/kubernetes/blob/7f129f1c9af62cc3cd4f6b754dacdf5932f39d5c/pkg/kubelet/container/cache.go#L101).
205252

206-
### Kubelet Changes
253+
Pod status should be updated in the cache only if the new status update has a timestamp newer than the timestamp of the status already present in the cache.
207254

208-
Kubelet generates [PodLifecycleEvent](https://github.com/kubernetes/kubernetes/blob/release-1.24/pkg/kubelet/pleg/pleg.go#L41) using [relisting](https://github.com/kubernetes/kubernetes/blob/050f930f8968874855eb215f0c0f0877bcdaa0e8/pkg/kubelet/pleg/generic.go#L150). These `PodLifecycleEvents` get [used](https://github.com/kubernetes/kubernetes/blob/050f930f8968874855eb215f0c0f0877bcdaa0e8/pkg/kubelet/kubelet.go#L2060) in kubelet's sync loop to infer the state of the container. e.g. to determine if the [container has died](https://github.com/kubernetes/kubernetes/blob/050f930f8968874855eb215f0c0f0877bcdaa0e8/pkg/kubelet/kubelet.go#L2118).
255+
![Modified Cache Setter](./modified-cache-setter.png)
256+
257+
```go
258+
func (c *cache) Set(id types.UID, status *PodStatus, err error, timestamp time.Time) (updated bool) {
259+
c.lock.Lock()
260+
defer c.lock.Unlock()
261+
// Set the value in the cache only if it's not present already
262+
// or the timestamp in the cache is older than the current update timestamp
263+
if val, ok := c.pods[id]; !ok || val.modified.Before(timestamp) {
264+
c.pods[id] = &data{status: status, err: err, modified: timestamp}
265+
c.notify(id, timestamp)
266+
return true
267+
}
268+
return false
269+
}
270+
```
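
The effect of the timestamp check can be seen by replaying a late, buffered event against a fresher relist result. The following is a self-contained sketch under simplifying assumptions: `statusCache`, `update` and `entry` are hypothetical stand-ins for the kubelet types, pod state is reduced to a string, and timestamps are passed as Unix nanoseconds.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// update is a hypothetical pod status update carrying a runtime timestamp.
type update struct {
	PodID     string
	State     string
	UnixNanos int64
}

type entry struct {
	state    string
	modified time.Time
}

// statusCache is a simplified stand-in for the kubelet pod cache.
type statusCache struct {
	mu   sync.Mutex
	pods map[string]entry
}

func newStatusCache() *statusCache {
	return &statusCache{pods: make(map[string]entry)}
}

// set applies u only if the pod has no entry yet or the cached entry is older.
func (c *statusCache) set(u update) bool {
	c.mu.Lock()
	defer c.mu.Unlock()
	ts := time.Unix(0, u.UnixNanos)
	if cur, ok := c.pods[u.PodID]; !ok || cur.modified.Before(ts) {
		c.pods[u.PodID] = entry{state: u.State, modified: ts}
		return true
	}
	return false
}

// state returns the cached state for a pod, or "" if none is cached.
func (c *statusCache) state(id string) string {
	c.mu.Lock()
	defer c.mu.Unlock()
	return c.pods[id].state
}

func main() {
	c := newStatusCache()
	// Generic PLEG relists and records the latest state.
	fmt.Println(c.set(update{PodID: "pod-a", State: "Exited", UnixNanos: 2000})) // true
	// A buffered event from before the disruption arrives late and is rejected.
	fmt.Println(c.set(update{PodID: "pod-a", State: "Running", UnixNanos: 1000})) // false
	fmt.Println(c.state("pod-a"))                                                // Exited
}
```

Because the comparison uses `Before`, an update carrying exactly the same timestamp as the cached entry is also rejected; only strictly newer statuses replace the cached one.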
209271

272+
This has no impact on the existing `Generic PLEG` when used without `Evented PLEG` because it is the only entity that sets the cache and it does so every second (if needed) for a given pod.
210273

211-
The idea behind this enhancment is, kubelet will receive the [CRI events](###Runtime-Service-Changes) mentioned above from the CRI runtime and generate the corresponding `PodLifecycleEvent`. This will reduce kubelet's dependency on relisting to generate `PodLifecycleEvent` and that event will be immediately available within sync loop instead of waiting for relisting to finish. Kubelet will still do relisting but with a reduced frequency.
212274

213275
### Test Plan
214276

@@ -472,12 +534,11 @@ and creating new ones, as well as about cluster-level services (e.g. DNS):
472534
-->
473535
- CRI Runtime
474536
- CRI runtimes that are capable of emitting CRI events must be installed and running.
475-
- Impact of its outage on the feature: Kubelet will detect the outage and fall back on the current default relisting period to make sure the pod statuses are updated in time.
537+
- Impact of its outage on the feature: Kubelet will detect the outage and fall back on the `Generic PLEG` with the default relisting period to make sure the pod statuses are updated correctly.
476538
- Impact of its degraded performance or high-error rates on the feature:
477-
- Any instability with the CRI runtime events stream that results in an error can be detected by the kubelet. Such an error will result in the kubelet falling back to the current default relisting period to make sure the pod statuses are updated in time.
478-
- If the instability is only of the form degraded performance but does not result in an error then the kubelet will not fall back to the current default relisting period and will continue to use the CRI runtime events stream. This will result in the kubelet updating the pod statuses with either the CRI runtime events or the increased relisting period, whichever is less.
479-
- Without the stable stream CRI events this feature will suffer, and kubelet will fall back to relisting with the current default relisting period.
480-
- Kubelet should emit a metric `kube_pod_missed_events` when it detects pods changing state between relist periods not caught by an event.
539+
- Any instability with the CRI runtime events stream that results in an error can be detected by the kubelet. Such an error will result in the kubelet falling back to the `Generic PLEG` with the default relisting period to make sure the pod statuses are updated in time.
540+
- If the instability only takes the form of degraded performance and does not result in an error, the kubelet will not be able to fall back to the `Generic PLEG` with the default relisting period and will continue to use the CRI runtime events stream. The changes proposed in the section [Pod Status update in the Cache](#pod-status-update-in-the-cache) should help in handling this scenario.
541+
- Kubelet should emit a metric `kube_pod_missed_events` when it detects pods changing state between relist periods not caught by an event.
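
The fallback behaviour described in the bullets above can be sketched as a small decision function. This is an illustration only: `relistPeriod` and `errStreamBroken` are hypothetical names, the one-second default follows the "every second" relisting mentioned earlier, and the increased period of five minutes is an assumed value that the proposal does not specify.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// Illustrative periods; the increased value is an assumption, not from the KEP.
const (
	defaultRelistPeriod   = 1 * time.Second
	increasedRelistPeriod = 5 * time.Minute
)

// errStreamBroken is a hypothetical error reported by the CRI events stream.
var errStreamBroken = errors.New("CRI events stream broken")

// relistPeriod returns the Generic PLEG relist period to use, given the last
// error (if any) observed on the CRI events stream.
func relistPeriod(streamErr error) time.Duration {
	if streamErr != nil {
		// The stream reported an error: fall back to the default period.
		return defaultRelistPeriod
	}
	// No error was observed: keep relying on events and relist less often.
	return increasedRelistPeriod
}

func main() {
	fmt.Println(relistPeriod(nil))             // 5m0s
	fmt.Println(relistPeriod(errStreamBroken)) // 1s
}
```

Degraded performance that never surfaces as an error leaves `streamErr` nil in this sketch, so the period stays increased; that is the case the timestamp-checked cache setter is meant to cover.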
481542
### Scalability
482543
###### Will enabling / using this feature result in any new API calls?
483544

Lines changed: 7 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,7 @@
1+
sequenceDiagram
2+
Title: Evented PLEG
3+
Runtime-->>+Evented PLEG: Send CRI Event
4+
Evented PLEG-->>Evented PLEG: Generate PodStatus from received Event
5+
Evented PLEG-->>Evented PLEG: Set pod cache update time to the Event creation time
6+
Evented PLEG-->>Evented PLEG: Update the PodStatus in the cache
7+
Evented PLEG-->>PLEG Channel: Send PodLifeCycleEvent
Lines changed: 15 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,15 @@
1+
sequenceDiagram
2+
Title: Existing Generic PLEG
3+
Generic PLEG-->>Generic PLEG: Global cache update timestamp = time.Now()
4+
Generic PLEG->>+Runtime: GetPods()
5+
Runtime-->>-Generic PLEG: List of Pods
6+
Generic PLEG-->>Generic PLEG: Generate PodLifeCycleEvents for updated Pods
7+
loop EventsbyPodID
8+
Generic PLEG-->>Generic PLEG: Pod cache update timestamp = time.now()
9+
Generic PLEG->>+Runtime: Get Sandbox Statuses
10+
Runtime-->>-Generic PLEG:
11+
Generic PLEG->>+Runtime: Get Container Statuses
12+
Runtime-->>-Generic PLEG:
13+
Generic PLEG-->>Generic PLEG: Update the PodStatus in the cache
14+
Generic PLEG-->>PLEG Channel: Send PodLifeCycleEvent
15+
end

keps/sig-node/3386-kubelet-evented-pleg/kep.yaml

Lines changed: 2 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -18,11 +18,11 @@ stage: alpha
1818
# The most recent milestone for which work toward delivery of this KEP has been
1919
# done. This can be the current (upcoming) milestone, if it is being actively
2020
# worked on.
21-
latest-milestone: "v1.25"
21+
latest-milestone: "v1.26"
2222

2323
# The milestone at which this feature was, or is targeted to be, at each stage.
2424
milestone:
25-
alpha: "v1.25"
25+
alpha: "v1.26"
2626

2727
# The following PRR answers are required at alpha release
2828
# List the feature gate name and the components for which it must be enabled
Lines changed: 5 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,5 @@
1+
sequenceDiagram
2+
Title: Modified Cache Setter
3+
Evented PLEG->>Cache: Set PodStatus
4+
Generic PLEG->>Cache: Set PodStatus
5+
Note over Cache: Update the PodStatus only if it is newer than Cache
Lines changed: 15 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,15 @@
1+
sequenceDiagram
2+
Title: Modified Generic PLEG
3+
Generic PLEG-->>Generic PLEG: Global cache update timestamp = time.Now()
4+
Generic PLEG->>+Runtime: GetPods()
5+
Runtime-->>-Generic PLEG: List of Pods
6+
Generic PLEG-->>Generic PLEG: Generate PodLifeCycleEvents for updated Pods
7+
loop EventsbyPodID
8+
Generic PLEG-->>+Runtime: PodSandboxStatus(includeContainers=true)
9+
Runtime->>Runtime: Get Sandbox Status
10+
Runtime->>Runtime: Get Container Statuses
11+
Runtime->>Runtime: timestamp = time.now()
12+
Runtime-->>-Generic PLEG:
13+
Generic PLEG-->>Generic PLEG: Update the PodStatus in the cache
14+
Generic PLEG-->>PLEG Channel: Send PodLifeCycleEvent
15+
end