Commit 4885227

Add KEP for enabling CRIContainerLogRotation by default
Signed-off-by: Urvashi Mohnani <[email protected]>
1 parent 40cb8db commit 4885227

3 files changed

+350
-0
lines changed

Lines changed: 3 additions & 0 deletions
@@ -0,0 +1,3 @@
1+
kep-number: 2411
2+
stable:
3+
approver: "@deads2k"
Lines changed: 303 additions & 0 deletions
@@ -0,0 +1,303 @@
1+
# KEP-2411: CRI Container Log Rotation
2+
3+
<!-- toc -->
4+
- [Release Signoff Checklist](#release-signoff-checklist)
5+
- [Summary](#summary)
6+
- [Motivation](#motivation)
7+
- [Goals](#goals)
8+
- [Non-Goals](#non-goals)
9+
- [Proposal](#proposal)
10+
- [User Stories (Optional)](#user-stories-optional)
11+
- [Story 1](#story-1)
12+
- [Story 2](#story-2)
13+
- [Notes/Constraints/Caveats (Optional)](#notesconstraintscaveats-optional)
14+
- [Risks and Mitigations](#risks-and-mitigations)
15+
- [Design Details](#design-details)
16+
- [Test Plan](#test-plan)
17+
- [Graduation Criteria](#graduation-criteria)
18+
- [Alpha -> Beta Graduation](#alpha---beta-graduation)
19+
- [Beta -> GA Graduation](#beta---ga-graduation)
20+
- [Upgrade / Downgrade Strategy](#upgrade--downgrade-strategy)
21+
- [Version Skew Strategy](#version-skew-strategy)
22+
- [Production Readiness Review Questionnaire](#production-readiness-review-questionnaire)
23+
- [Feature Enablement and Rollback](#feature-enablement-and-rollback)
24+
- [Rollout, Upgrade and Rollback Planning](#rollout-upgrade-and-rollback-planning)
25+
- [Monitoring Requirements](#monitoring-requirements)
26+
- [Dependencies](#dependencies)
27+
- [Scalability](#scalability)
28+
- [Troubleshooting](#troubleshooting)
29+
- [Implementation History](#implementation-history)
30+
<!-- /toc -->
31+
32+
## Release Signoff Checklist
33+
34+
Items marked with (R) are required *prior to targeting to a milestone / release*.
35+
36+
- [ ] (R) Enhancement issue in release milestone, which links to KEP dir in [kubernetes/enhancements] (not the initial KEP PR)
37+
- [ ] (R) KEP approvers have approved the KEP status as `implementable`
38+
- [x] (R) Design details are appropriately documented
39+
- [x] (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input
40+
- [x] (R) Graduation criteria is in place
41+
- [ ] (R) Production readiness review completed
42+
- [ ] (R) Production readiness review approved
43+
- [x] "Implementation History" section is up-to-date for milestone
44+
- [ ] User-facing documentation has been created in [kubernetes/website], for publication to [kubernetes.io]
45+
- [x] Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes
46+
47+
[kubernetes.io]: https://kubernetes.io/
48+
[kubernetes/enhancements]: https://git.k8s.io/enhancements
49+
[kubernetes/kubernetes]: https://git.k8s.io/kubernetes
50+
[kubernetes/website]: https://git.k8s.io/website
51+
52+
## Summary
53+
54+
The CRIContainerLogRotation feature gate was implemented in v1.10 and has been in Beta since v1.11. Because the feature has already been in production use for quite some time, we would like to identify any remaining gaps in its implementation so that we can promote it to stable. With this feature gate enabled, the kubelet is in charge of managing the container log directory structure and of rotating the logs when a user-configurable limit is reached.
55+
56+
## Motivation
57+
58+
Container runtimes that communicate with the kubelet via the Container Runtime Interface (CRI) needed a container log management system. The kubelet was already in charge of determining the container log file path and passing it down to the container runtime so that the runtime can write the container logs there. Putting the kubelet in charge of rotating the container logs therefore allows it to manage and access the logs directly without having to call the container runtime. An added advantage is that logging agents can ingest the files directly without any further integration with the container runtime.
59+
60+
### Goals
61+
62+
- The kubelet assigns the log path for a container and the runtime writes the container output to that path
63+
- The kubelet periodically checks the disk space occupied by the container logs and rotates them if necessary
64+
- After the logs are rotated, the kubelet sends a signal to the container runtime to re-open the log file
65+
- The kubelet exposes a consistent log directory structure with metadata so that any logging agent can integrate with it
66+
67+
### Non-Goals
68+
69+
- Shipping the logs directly to a remote storage
70+
- Supporting container runtimes that run on a different virtual/physical machine from the kubelet
71+
- Allow kubelet to manage the lifecycle of the logs to pave the way for better disk management in the future. This implies that the lifecycle of containers and their logs need to be decoupled.
72+
73+
## Proposal
74+
75+
Graduate the CRIContainerLogRotation feature gate from Beta to Stable. The kubelet already decides the container log directory structure and passes that down to the container runtime. The container runtime then writes the container's logs to this location. This makes the kubelet the best candidate to manage the rotation of the container logs, as it already knows the log directory structure and has access to it. The CRIContainerLogRotation feature gate implementation adds a container log manager package, which manages and rotates the container logs. It also adds two flags to the kubelet that allow the user to configure the maximum size of each log file and the maximum number of log files to retain. These flags are **--container-log-max-size** and **--container-log-max-files**.
76+
77+
### User Stories (Optional)
78+
79+
#### Story 1
80+
81+
As a Kubernetes user, I want to use a CRI container runtime and want the container logs to be managed and rotated by the kubelet, so I don't have to worry about logs filling up my disk space and can access older logs when needed. I also want to be able to configure the size of my log file and how many log files to retain.
82+
83+
#### Story 2
84+
85+
As a Kubernetes user, I want to integrate a logging agent that aggregates the logs to a remote storage, so I can easily access my logs in a centralized location without needing to access each node in my cluster to view the logs.
86+
87+
### Notes/Constraints/Caveats (Optional)
88+
89+
### Risks and Mitigations
90+
91+
- Loss of some logs during log rotation. There is an open issue on this with a suggested fix: https://github.com/kubernetes/kubernetes/issues/64760. Fixing it has been added to the graduation criteria as well.
92+
93+
## Design Details
94+
95+
This implementation adds a container log manager package, which the kubelet uses to manage and rotate the logs. The container log manager will only start up when the container runtime being used is one that communicates with the kubelet via the CRI, e.g. CRI-O, containerd, etc.
96+
97+
The rotated logs are compressed with gzip. The latest rotated log is not compressed as a logging agent, such as fluentd, might still be reading it right after rotation and/or the container runtime might still be writing to it shortly after getting the path to the new log file. The kubelet periodically checks the amount of disk space being used by the container logs and rotates them if the max value has been reached. After the logs are rotated, the kubelet sends a signal to the container runtime to re-open the log file.
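The rotation behavior described above can be illustrated with a minimal Python sketch. It is not the kubelet's implementation: the function name `rotate_log`, the `reopen` callback (standing in for the signal sent to the container runtime), the timestamp-suffixed file names, and the choice to count the live log toward the file limit are all assumptions made for illustration.

```python
import gzip
import os
import shutil
import time


def rotate_log(log_path, max_size, max_files, reopen=lambda: None, stamp=None):
    """Rotate log_path once it reaches max_size bytes; return True if rotated.

    Hypothetical sketch: names, suffix format and retention policy are
    assumptions, not the kubelet's actual code.
    """
    if os.path.getsize(log_path) < max_size:
        return False

    directory, base = os.path.split(log_path)

    def rotated_logs():
        # Rotated files are assumed to be named "<base>.<timestamp>[.gz]";
        # the fixed-width timestamp makes lexical order equal age order.
        return sorted(f for f in os.listdir(directory) if f.startswith(base + "."))

    # Compress rotated logs from earlier rounds; by now no reader or writer
    # should still be using them.
    for name in rotated_logs():
        if not name.endswith(".gz"):
            src = os.path.join(directory, name)
            with open(src, "rb") as fin, gzip.open(src + ".gz", "wb") as fout:
                shutil.copyfileobj(fin, fout)
            os.remove(src)

    # Rename the live log. It stays uncompressed because an agent may still
    # be reading it and the runtime may still be writing to it.
    stamp = stamp or time.strftime("%Y%m%d-%H%M%S")
    os.rename(log_path, f"{log_path}.{stamp}")

    # Ask the runtime to re-open the log file at the original path.
    reopen()

    # Retention: assume the live log counts toward max_files, so keep at
    # most max_files - 1 rotated logs and delete the oldest ones.
    keep = max(max_files - 1, 0)
    names = rotated_logs()
    for name in names[: max(len(names) - keep, 0)]:
        os.remove(os.path.join(directory, name))
    return True
```

A caller would invoke `rotate_log` periodically for each container log file, passing a `reopen` callback that tells the runtime to start writing to a fresh file at the original path.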
98+
99+
The user can configure the maximum size of a log file and the maximum number of log files to retain with the **--container-log-max-size** and **--container-log-max-files** flags. The default values are **10Mi** for the max file size and **5** for the max number of log files. These parameters are only applied if a CRI container runtime is being used; they are ignored for dockershim.
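To make the defaults concrete, the following Python sketch converts a size such as `10Mi` to bytes and estimates the uncompressed log budget per container. The helper names `parse_size` and `approx_log_budget` are hypothetical, and the parser handles only a small subset of binary suffixes rather than the full quantity syntax accepted by the kubelet.

```python
# Binary suffixes handled by this sketch (an assumed subset).
_BINARY_SUFFIXES = {"Ki": 1024, "Mi": 1024**2, "Gi": 1024**3}


def parse_size(value: str) -> int:
    """Convert a size such as '10Mi' (or a plain byte count) to bytes."""
    for suffix, factor in _BINARY_SUFFIXES.items():
        if value.endswith(suffix):
            return int(value[: -len(suffix)]) * factor
    return int(value)


def approx_log_budget(max_size: str, max_files: int) -> int:
    """Rough per-container ceiling in bytes, ignoring gzip savings.

    A live log can overshoot max_size between periodic checks, so this is
    an estimate rather than a hard limit.
    """
    return parse_size(max_size) * max_files
```

With the defaults, `approx_log_budget("10Mi", 5)` gives 52428800 bytes (50 MiB) per container before compression of the older rotated files is taken into account.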
100+
101+
The kubelet exposes a consistent log directory structure with embedded metadata so that logging agents can integrate with it. Since the kubelet is in charge of setting the log directory structure and can directly access and manage the log files, logging agents can work directly with the files the kubelet manages without any further integration with the container runtime in use.
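As an illustration of how a logging agent might recover metadata from such a structure, the Python sketch below parses a log path. The layout `/var/log/pods/<namespace>_<pod-name>_<pod-uid>/<container-name>/<restart-count>.log` is an assumption for this example and is not specified in this KEP; `LogRef` and `parse_log_path` are hypothetical names.

```python
from pathlib import PurePosixPath
from typing import NamedTuple


class LogRef(NamedTuple):
    namespace: str
    pod: str
    uid: str
    container: str
    restart: int


def parse_log_path(path: str) -> LogRef:
    """Extract pod metadata from a container log path (assumed layout)."""
    p = PurePosixPath(path)
    # Pod directory: "<namespace>_<pod-name>_<pod-uid>"; none of the three
    # parts is expected to contain an underscore.
    namespace, pod, uid = p.parent.parent.name.split("_")
    # File name: "<restart-count>.log", optionally followed by a rotation
    # suffix, so only the part before the first dot is used.
    restart = int(p.name.split(".")[0])
    return LogRef(namespace, pod, uid, p.parent.name, restart)
```

For example, `parse_log_path("/var/log/pods/default_web-0_1234-abcd/nginx/2.log")` yields the namespace `default`, pod `web-0`, container `nginx`, and restart count `2`.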
102+
103+
### Test Plan
104+
105+
- There are currently unit tests and node E2E integration tests for container log rotation
106+
- Get feedback on performance and stability of the CRI log format on other products aside from OpenShift and GKE
107+
108+
### Graduation Criteria
109+
110+
#### Alpha -> Beta Graduation
111+
112+
- Unit and node E2E tests are consistently passing
113+
- Logging agents can easily integrate with Kubernetes and push rotated logs to a remote storage
114+
- Successful log rotations by the kubelet
115+
116+
#### Beta -> GA Graduation
117+
118+
- Successfully run in production
119+
- Solicit feedback from the SIG Node community to confirm that there are no issues with individual distributions' production usage (OpenShift and GKE both report no major issues)
120+
- Fix https://github.com/kubernetes/kubernetes/issues/64760, which is an issue on loss of some logs during log rotation
121+
122+
### Upgrade / Downgrade Strategy
123+
124+
On upgrade: the feature will be available to use as it already is, but will be promoted to GA.
125+
126+
On downgrade: the feature will be available to use when the feature gate is set, but will be moved back to Beta.
127+
128+
### Version Skew Strategy
129+
130+
Since this feature was promoted to Beta in v1.11, it will still be available with an n-2 kubelet. No coordination with the control plane is required. Changes to any other components on the node are not needed.
131+
132+
## Production Readiness Review Questionnaire
133+
134+
### Feature Enablement and Rollback
135+
136+
_This section must be completed when targeting alpha to a release._
137+
138+
- **How can this feature be enabled / disabled in a live cluster?**
139+
- [x] Feature gate (also fill in values in `kep.yaml`)
140+
- Feature gate name: CRIContainerLogRotation
141+
- Components depending on the feature gate: Kubelet
142+
- [ ] Other
143+
- Describe the mechanism:
144+
- Will enabling / disabling the feature require downtime of the control
145+
plane?
146+
- Will enabling / disabling the feature require downtime or reprovisioning
147+
of a node? (Do not assume `Dynamic Kubelet Config` feature is enabled).
148+
149+
- **Does enabling the feature change any default behavior?**
150+
151+
With the dockershim, the docker daemon was in charge of managing and rotating the logs. With a CRI container runtime, the kubelet is in charge of managing and rotating logs. There is no real change to default behavior apart from the fact that the log rotation will depend on which container runtime is being used.
152+
153+
- **Can the feature be disabled once it has been enabled (i.e. can we roll back
154+
the enablement)?**
155+
156+
Yes, but if disabled the container logs will not be rotated when using a CRI container runtime.
157+
158+
- **What happens if we reenable the feature if it was previously rolled back?**
159+
160+
No impact, container log rotation will work for CRI container runtimes.
161+
162+
- **Are there any tests for feature enablement/disablement?**
163+
164+
There are already unit and node e2e tests in place for this feature.
165+
166+
### Rollout, Upgrade and Rollback Planning
167+
168+
_This section must be completed when targeting beta graduation to a release._
169+
170+
- **How can a rollout fail? Can it impact already running workloads?**
171+
172+
A rollout can fail if the container log manager doesn't start up and rotate the logs as expected. Restarts shouldn't affect this, as it is the container runtime that writes the logs; the kubelet is in charge of checking the log size and rotating when needed.
173+
174+
- **What specific metrics should inform a rollback?**
175+
176+
- There is major loss of logs on nodes that use a CRI runtime.
177+
- Logging agents are unable to integrate with k8s and aggregate logs to a remote storage.
178+
179+
- **Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?**
180+
181+
Whatever manual testing took place was done when the feature was initially implemented in https://github.com/kubernetes/kubernetes/pull/59898.
182+
183+
- **Is the rollout accompanied by any deprecations and/or removals of features, APIs,
184+
fields of API types, flags, etc.?**
185+
186+
No
187+
188+
### Monitoring Requirements
189+
190+
_This section must be completed when targeting beta graduation to a release._
191+
192+
- **How can an operator determine if the feature is in use by workloads?**
193+
194+
When a CRI container runtime is used, the logs are rotated and the rotated logs are stored in the log directory structure in gzip format.
195+
196+
- **What are the SLIs (Service Level Indicators) an operator can use to determine
197+
the health of the service?**
198+
- [ ] Metrics
199+
- Metric name:
200+
- [Optional] Aggregation method:
201+
- Components exposing the metric:
202+
- [x] Other (treat as last resort)
203+
- Details: Error messages logged in the kubelet journal when there is a failure to rotate the logs or delete old log files.
204+
205+
- **What are the reasonable SLOs (Service Level Objectives) for the above SLIs?**
206+
207+
100% of logs are rotated according to the configured max size and max number of files without any loss.
208+
209+
- **Are there any missing metrics that would be useful to have to improve observability
210+
of this feature?**
211+
212+
N/A
213+
214+
### Dependencies
215+
216+
_This section must be completed when targeting beta graduation to a release._
217+
218+
- **Does this feature depend on any specific services running in the cluster?**
219+
220+
- [Kubelet]
221+
- Usage description: Responsible for managing the log directory structure and rotating the logs.
222+
- Impact of its outage on the feature: Logs will not be rotated. Container runtime may continue to write logs to the file even after the max size has been reached. This could bring the cluster down, if the logs continue to grow uncontrolled without pod eviction enabled.
223+
- Impact of its degraded performance or high-error rates on the feature: Loss of logs during rotation. Logging agents may have issues aggregating the logs.
224+
225+
### Scalability
226+
227+
_For alpha, this section is encouraged: reviewers should consider these questions
228+
and attempt to answer them._
229+
230+
_For beta, this section is required: reviewers must answer these questions._
231+
232+
_For GA, this section is required: approvers should be able to confirm the
233+
previous answers based on experience in the field._
234+
235+
- **Will enabling / using this feature result in any new API calls?**
236+
Describe them, providing:
237+
- Re-open container log file after logs are rotated
238+
- Not much, only the container ID
239+
- Kubelet
240+
- None
241+
- This will be triggered after the max file size for logs has been reached and the kubelet has rotated the logs
242+
243+
244+
- **Will enabling / using this feature result in introducing new API types?**
245+
246+
No
247+
248+
- **Will enabling / using this feature result in any new calls to the cloud
249+
provider?**
250+
251+
No
252+
253+
- **Will enabling / using this feature result in increasing size or count of
254+
the existing API objects?**
255+
256+
No
257+
258+
- **Will enabling / using this feature result in increasing time taken by any
259+
operations covered by [existing SLIs/SLOs]?**
260+
261+
No
262+
263+
- **Will enabling / using this feature result in non-negligible increase of
264+
resource usage (CPU, RAM, disk, IO, ...) in any components?**
265+
266+
No
267+
268+
### Troubleshooting
269+
270+
The Troubleshooting section currently serves the `Playbook` role. We may consider
271+
splitting it into a dedicated `Playbook` document (potentially with some monitoring
272+
details). For now, we leave it here.
273+
274+
_This section must be completed when targeting beta graduation to a release._
275+
276+
- **How does this feature react if the API server and/or etcd is unavailable?**
277+
278+
Container logs are written to a path on disk that the kubelet directly manages, so there should be no impact if etcd and/or the API server is unavailable.
279+
280+
- **What are other known failure modes?**
281+
For each of them, fill in the following information by copying the below template:
282+
- [Failure mode brief description]
283+
- Detection: How can it be detected via metrics? Stated another way:
284+
how can an operator troubleshoot without logging into a master or worker node?
285+
- Mitigations: What can be done to stop the bleeding, especially for already
286+
running user workloads?
287+
- Diagnostics: What are the useful log messages and their required logging
288+
levels that could help debug the issue?
289+
Not required until feature graduated to beta.
290+
- Testing: Are there any tests for failure mode? If not, describe why.
291+
292+
- **What steps should be taken if SLOs are not being met to determine the problem?**
293+
294+
[supported limits]: https://git.k8s.io/community//sig-scalability/configs-and-limits/thresholds.md
295+
[existing SLIs/SLOs]: https://git.k8s.io/community/sig-scalability/slos/slos.md#kubernetes-slisslos
296+
297+
## Implementation History
298+
299+
Original Issue: https://github.com/kubernetes/kubernetes/issues/58823
300+
First PR with implementation: https://github.com/kubernetes/kubernetes/pull/59898
301+
Original design doc with solutions considered: https://docs.google.com/document/d/1oQe8dFiLln7cGyrRdholMsgogliOtpAzq6-K3068Ncg/edit#
302+
Follow up PR: https://github.com/kubernetes/kubernetes/pull/58899
303+
Graduation to Beta: https://github.com/kubernetes/kubernetes/pull/64046
Lines changed: 44 additions & 0 deletions
@@ -0,0 +1,44 @@
1+
title: CRI Container Log Rotation
2+
kep-number: 2411
3+
authors:
4+
- "@umohnani8"
5+
owning-sig: sig-node
6+
participating-sigs:
7+
- sig-instrumentation
8+
- sig-storage
9+
status: implemented
10+
creation-date: 2018-01-25
11+
reviewers:
12+
- "@mrunalp"
13+
- "@SergeyKanzhelev"
14+
approvers:
15+
- "@mrunalp"
16+
- "@derekwaynecarr"
17+
prr-approvers:
18+
- "@deads2k"
19+
20+
# The target maturity stage in the current dev cycle for this KEP.
21+
stage: stable
22+
23+
# The most recent milestone for which work toward delivery of this KEP has been
24+
# done. This can be the current (upcoming) milestone, if it is being actively
25+
# worked on.
26+
latest-milestone: "v1.21"
27+
28+
# The milestone at which this feature was, or is targeted to be, at each stage.
29+
milestone:
30+
alpha: "v1.10"
31+
beta: "v1.11"
32+
stable: "v1.21"
33+
34+
# The following PRR answers are required at alpha release
35+
# List the feature gate name and the components for which it must be enabled
36+
feature-gates:
37+
- name: CRIContainerLogRotation
38+
components:
39+
- kubelet
40+
disable-supported: true
41+
42+
# The following PRR answers are required at beta release
43+
metrics:
44+
- N/A
