@@ -134,15 +134,71 @@ This might be a good place to talk about core concepts and how they relate.
134
134
- Users can disable this feature to make kubelet use existing relisting based PLEG.
135
135
- Another risk is the CRI implementation could have a buggy event emitting system, and miss pod lifecycle events.
136
136
- A mitigation is a `kube_pod_missed_events` metric, which the Kubelet could report when a lifecycle event is registered that wasn't triggered by an event, but rather by changes of state between lists.
137
+
- While using the Evented implementation, periodic relisting would still run, but at an increased interval, so that it acts as a fallback for any events missed during a disruption.
138
+
- Evented PLEG will need to update the global cache timestamp periodically so that pod workers don't get stuck at [GetNewerThan](https://github.com/kubernetes/kubernetes/blob/4a894be926adfe51fd8654dcceef4ece89a4259f/pkg/kubelet/pod_workers.go#L924) if Evented PLEG misses an event for any unforeseen reason.
137
139
138
140
## Design Details
139
141
142
+
Kubelet generates [PodLifecycleEvent](https://github.com/kubernetes/kubernetes/blob/release-1.24/pkg/kubelet/pleg/pleg.go#L41) using [relisting](https://github.com/kubernetes/kubernetes/blob/050f930f8968874855eb215f0c0f0877bcdaa0e8/pkg/kubelet/pleg/generic.go#L150). These `PodLifecycleEvents` get [used](https://github.com/kubernetes/kubernetes/blob/050f930f8968874855eb215f0c0f0877bcdaa0e8/pkg/kubelet/kubelet.go#L2060) in kubelet's sync loop to infer the state of the container. e.g. to determine if the [container has died](https://github.com/kubernetes/kubernetes/blob/050f930f8968874855eb215f0c0f0877bcdaa0e8/pkg/kubelet/kubelet.go#L2118).
143
+
144
+
The idea behind this enhancement is that the kubelet will receive the [CRI events](#Runtime-Service-Changes) mentioned above from the CRI runtime and generate the corresponding `PodLifecycleEvent`. This reduces the kubelet's dependency on relisting to generate `PodLifecycleEvent`, and each event becomes available to the sync loop immediately instead of only after relisting finishes. The kubelet will still relist, but at a reduced frequency.
140
145
### Feature Gate
141
-
This feature can only be enabled using the feature gate `EventedPLEG`.
146
+
This feature can only be used when the `EventedPLEG` feature gate is enabled.
Kubelet cache saves the [pod status with the timestamp](https://github.com/kubernetes/kubernetes/blob/c012d901d8bee86ef3e3c9472a1a4a0368a34775/pkg/kubelet/pleg/generic.go#L426). The value of this timestamp is calculated [within the kubelet process](https://github.com/kubernetes/kubernetes/blob/c012d901d8bee86ef3e3c9472a1a4a0368a34775/pkg/kubelet/pleg/generic.go#L399). This works fine when there is only Generic PLEG at work as it will calculate the timestamp first and then fetch the `PodStatus` to save it in the cache.
152
+
153
+
As of today, the `PodStatus` is saved in the cache without any validation of the existing status against the current timestamp. This works well when there is only `Generic PLEG` setting the `PodStatus` in the cache.
154
+
155
+
If multiple entities, such as `Evented PLEG`, set the `PodStatus` in the cache, we may run into racy timestamps, because each entity would calculate the timestamp in its own execution flow. `Generic PLEG` calculates its timestamp before it fetches the `PodStatus`, whereas `Evented PLEG` can only calculate the corresponding timestamp after the kubelet has received the event. Any disruption in receiving events, such as errors in the gRPC connection, might skew the time calculated in the kubelet for `Evented PLEG`.
156
+
157
+
To address the issues above, we propose that the existing `Generic PLEG` as well as `Evented PLEG` rely on the CRI Runtime for the timestamp of the `PodStatus`. This way the timestamp of the `PodStatus` would be closer to the actual time at which the statuses of the `Sandboxes` and `Containers` were provided by the CRI Runtime. It also lets us compare timestamps correctly before saving a status in the cache, which avoids the erroneous behaviour and prevents any old buffered `PodStatus` (accumulated during disruptions or failures) from overriding a newer entry in the cache.
A new RPC will be introduced in the [CRI Runtime Service](https://github.com/kubernetes/kubernetes/blob/6efd6582df2011f1ec8c146ef711b3348ae07d60/staging/src/k8s.io/cri-api/pkg/apis/runtime/v1/api.proto#L34),
166
+
Instead of getting the `Sandbox` and `Container` statuses independently and using the timestamp calculated from the kubelet process, `Generic PLEG` can fetch the `PodStatus` directly from the CRI Runtime using the modified [PodSandboxStatus](https://github.com/kubernetes/kubernetes/blob/4a894be926adfe51fd8654dcceef4ece89a4259f/staging/src/k8s.io/cri-api/pkg/apis/runtime/v1/api.proto#L58) rpc of the RuntimeService.
167
+
168
+
The modified `PodSandboxStatusRequest` will have a field `includeContainers` to indicate whether the `PodSandboxStatusResponse` should include `ContainerStatuses` and the corresponding timestamp.
169
+
170
+
```protobuf=
171
+
message PodSandboxStatusRequest {
172
+
// ID of the PodSandbox for which to retrieve status.
173
+
string pod_sandbox_id = 1;
174
+
// Verbose indicates whether to return extra information about the pod sandbox.
175
+
bool verbose = 2;
176
+
// IncludeContainers indicates whether to include ContainerStatuses and timestamp in the PodSandboxStatusResponse
177
+
bool includeContainers = 3;
178
+
}
179
+
```
180
+
181
+
```protobuf=
182
+
message PodSandboxStatusResponse {
183
+
// Status of the PodSandbox.
184
+
PodSandboxStatus status = 1;
185
+
186
+
// Info is extra information of the PodSandbox. The key could be arbitrary string, and
187
+
// value should be in json format. The information could include anything useful for
188
+
// debug, e.g. network namespace for linux container based container runtime.
189
+
// It should only be returned non-empty when Verbose is true.
190
+
map<string, string> info = 2;
191
+
192
+
// ContainerStatuses need to be included if includeContainers is set to true in PodSandboxStatusRequest
193
+
repeated ContainerStatus containerStatuses = 3;
194
+
195
+
// Timestamp needs to be included if includeContainers is set true in PodSandboxStatusRequest
196
+
int64 timestamp = 4;
197
+
198
+
}
199
+
```
200
+
201
+
Another RPC will be introduced in the [CRI Runtime Service](https://github.com/kubernetes/kubernetes/blob/6efd6582df2011f1ec8c146ef711b3348ae07d60/staging/src/k8s.io/cri-api/pkg/apis/runtime/v1/api.proto#L34),
146
202
147
203
```protobuf=
148
204
// GetContainerEvents gets container events from the CRI runtime
Creation timestamp of the event will be used when saving the `PodStatus` in the kubelet cache.
167
231
168
232
```protobuf=
169
233
enum ContainerEventType {
@@ -180,35 +244,33 @@ enum ContainerEventType {
180
244
CONTAINER_DELETED_EVENT = 3;
181
245
}
182
246
```
183
-
### Events Filter
184
-
Events can be filtered to retrieve only subset of events,
247
+
### Pod Status Update in the Cache
185
248
186
-
```protobuf=
187
-
message GetEventsRequest {
188
-
// Optional to filter a list of events.
189
-
GetEventsFilter filter = 1;
190
-
}
191
-
```
249
+
While `Evented PLEG` is in use, the existing `Generic PLEG` relists with an increased period. If `Evented PLEG` faces a temporary disruption in the gRPC connection with the runtime, then once the connection is restored the incoming buffered events (which are by then outdated) might overwrite the latest pod status that `Generic PLEG` has written to the cache. A cache setter that only updates an entry when the cached pod status is older than the incoming pod status mitigates this issue.
192
250
193
-
```protobuf=
194
-
// GetEventsFilter is used to filter a list of events.
195
-
// All those fields are combined with 'AND'
196
-
message GetEventsFilter {
197
-
// ID of the container, sandbox.
198
-
string id = 1;
199
-
// LabelSelector to select matches.
200
-
// Only api.MatchLabels is supported for now and the requirements
201
-
// are ANDed. MatchExpressions is not supported yet.
202
-
map<string, string> label_selector = 2;
203
-
}
204
-
```
251
+
At present kubelet updates the cache using the [Set function](https://github.com/kubernetes/kubernetes/blob/7f129f1c9af62cc3cd4f6b754dacdf5932f39d5c/pkg/kubelet/container/cache.go#L101).
205
252
206
-
### Kubelet Changes
253
+
The pod status should be updated in the cache only if the new status has a timestamp newer than that of the status already present in the cache.
207
254
208
-
Kubelet generates [PodLifecycleEvent](https://github.com/kubernetes/kubernetes/blob/release-1.24/pkg/kubelet/pleg/pleg.go#L41) using [relisting](https://github.com/kubernetes/kubernetes/blob/050f930f8968874855eb215f0c0f0877bcdaa0e8/pkg/kubelet/pleg/generic.go#L150). These `PodLifecycleEvents` get [used](https://github.com/kubernetes/kubernetes/blob/050f930f8968874855eb215f0c0f0877bcdaa0e8/pkg/kubelet/kubelet.go#L2060) in kubelet's sync loop to infer the state of the container. e.g. to determine if the [container has died](https://github.com/kubernetes/kubernetes/blob/050f930f8968874855eb215f0c0f0877bcdaa0e8/pkg/kubelet/kubelet.go#L2118).
This has no impact on the existing `Generic PLEG` when used without `Evented PLEG`, because it is then the only entity that sets the cache, and it does so every second (if needed) for a given pod.
210
273
211
-
The idea behind this enhancment is, kubelet will receive the [CRI events](###Runtime-Service-Changes) mentioned above from the CRI runtime and generate the corresponding `PodLifecycleEvent`. This will reduce kubelet's dependency on relisting to generate `PodLifecycleEvent` and that event will be immediately available within sync loop instead of waiting for relisting to finish. Kubelet will still do relisting but with a reduced frequency.
212
274
213
275
### Test Plan
214
276
@@ -472,12 +534,11 @@ and creating new ones, as well as about cluster-level services (e.g. DNS):
472
534
-->
473
535
- CRI Runtime
474
536
- CRI runtimes that are capable of emitting CRI events must be installed and running.
475
-
- Impact of its outage on the feature: Kubelet will detect the outage and fall back on the current default relisting period to make sure the pod statuses are updated in time.
537
+
- Impact of its outage on the feature: Kubelet will detect the outage and fall back on the `Generic PLEG` with the default relisting period to make sure the pod statuses are updated correctly.
476
538
- Impact of its degraded performance or high-error rates on the feature:
477
-
- Any instability with the CRI runtime events stream that results in an error can be detected by the kubelet. Such an error will result in the kubelet falling back to the current default relisting period to make sure the pod statuses are updated in time.
478
-
- If the instability is only of the form degraded performance but does not result in an error then the kubelet will not fall back to the current default relisting period and will continue to use the CRI runtime events stream. This will result in the kubelet updating the pod statuses with either the CRI runtime events or the increased relisting period, whichever is less.
479
-
- Without the stable stream CRI events this feature will suffer, and kubelet will fall back to relisting with the current default relisting period.
480
-
- Kubelet should emit a metric `kube_pod_missed_events` when it detects pods changing state between relist periods not caught by an event.
539
+
- Any instability in the CRI runtime events stream that results in an error can be detected by the kubelet. Such an error will cause the kubelet to fall back to the `Generic PLEG` with the default relisting period to make sure the pod statuses are updated in time.
540
+
- If the instability takes the form of degraded performance only and does not result in an error, the kubelet will not be able to fall back to the `Generic PLEG` with the default relisting period and will continue to use the CRI runtime events stream. The changes proposed in the section [Pod Status Update in the Cache](#pod-status-update-in-the-cache) should help in handling this scenario.
541
+
- Kubelet should emit a metric `kube_pod_missed_events` when it detects pods changing state between relist periods not caught by an event.
481
542
### Scalability
482
543
###### Will enabling / using this feature result in any new API calls?