Commit 5e445c5

Merge pull request kubernetes#2271 from aravindhp/KEP-2558
KEP 2258: Node service log viewer
2 parents 8b8aa9c + dd5073b commit 5e445c5

3 files changed

+424
-0
lines changed

Lines changed: 3 additions & 0 deletions
@@ -0,0 +1,3 @@
1+
kep-number: 2258
2+
alpha:
3+
approver: "@johnbelamaric"
Lines changed: 377 additions & 0 deletions
@@ -0,0 +1,377 @@
1+
# KEP-2258: Node service log viewer
2+
3+
<!-- toc -->
4+
- [Release Signoff Checklist](#release-signoff-checklist)
5+
- [Summary](#summary)
6+
- [Motivation](#motivation)
7+
- [Goals](#goals)
8+
- [Non-Goals](#non-goals)
9+
- [Proposal](#proposal)
10+
- [Implement client for logs endpoint viewer (OS agnostic)](#implement-client-for-logs-endpoint-viewer-os-agnostic)
11+
- [Linux distros with systemd / journald](#linux-distros-with-systemd--journald)
12+
- [Linux distributions without systemd / journald](#linux-distributions-without-systemd--journald)
13+
- [Windows](#windows)
14+
- [User Stories](#user-stories)
15+
- [Risks and Mitigations](#risks-and-mitigations)
16+
- [Large log files and events](#large-log-files-and-events)
17+
- [Wider access to all node level service logs](#wider-access-to-all-node-level-service-logs)
18+
- [Design Details](#design-details)
19+
- [kubelet](#kubelet)
20+
- [kubectl](#kubectl)
21+
- [Test Plan](#test-plan)
22+
- [Graduation Criteria](#graduation-criteria)
23+
  - [Alpha -> Beta Graduation](#alpha---beta-graduation)
24+
  - [Beta -> GA Graduation](#beta---ga-graduation)
25+
- [Upgrade / Downgrade Strategy](#upgrade--downgrade-strategy)
26+
- [Version Skew Strategy](#version-skew-strategy)
27+
- [Production Readiness Review Questionnaire](#production-readiness-review-questionnaire)
28+
- [Feature Enablement and Rollback](#feature-enablement-and-rollback)
29+
- [Rollout, Upgrade and Rollback Planning](#rollout-upgrade-and-rollback-planning)
30+
- [Monitoring Requirements](#monitoring-requirements)
31+
- [Dependencies](#dependencies)
32+
- [Scalability](#scalability)
33+
- [Troubleshooting](#troubleshooting)
34+
- [Implementation History](#implementation-history)
35+
- [Drawbacks](#drawbacks)
36+
- [Alternatives](#alternatives)
37+
<!-- /toc -->
38+
39+
## Release Signoff Checklist
40+
41+
Items marked with (R) are required *prior to targeting to a milestone / release*.
42+
43+
- [x] (R) Enhancement issue in release milestone, which links to KEP dir in [kubernetes/enhancements] (not the initial KEP PR)
44+
- [ ] (R) KEP approvers have approved the KEP status as `implementable`
45+
- [x] (R) Design details are appropriately documented
46+
- [x] (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input
47+
- [x] (R) Graduation criteria is in place
48+
- [x] (R) Production readiness review completed
49+
- [ ] (R) Production readiness review approved
50+
- [x] "Implementation History" section is up-to-date for milestone
51+
- [ ] User-facing documentation has been created in [kubernetes/website], for publication to [kubernetes.io]
52+
- [ ] Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes
53+
54+
[kubernetes.io]: https://kubernetes.io/
55+
[kubernetes/enhancements]: https://git.k8s.io/enhancements
56+
[kubernetes/kubernetes]: https://git.k8s.io/kubernetes
57+
[kubernetes/website]: https://git.k8s.io/website
58+
59+
## Summary
60+
61+
A Kubernetes cluster administrator has to log in to the relevant control-plane
62+
or worker nodes to view the logs of the API server, kubelet etc. Or they would
63+
have to implement a client side reader. A simpler and more elegant method would
64+
be to allow them to use the kubectl CLI to also view these logs similar to
65+
using it for other interactions with the cluster. Given the sensitive nature of
66+
the information in node logs, this feature will only be available to cluster
67+
administrators.
68+
69+
## Motivation
70+
71+
Troubleshooting issues with control-plane and worker nodes typically requires
72+
a cluster administrator to SSH into the nodes for debugging. While certain
73+
issues will require being on the node, issues with the kube-proxy or kubelet,
74+
to name a couple, could be solved by perusing their logs. However this
75+
too requires the administrator to SSH into the nodes. Having a way for
76+
them to view the logs using kubectl will significantly simplify their
77+
troubleshooting.
78+
79+
80+
### Goals
81+
Provide a cluster administrator with a streaming view of logs using kubectl
82+
without them having to implement a client side reader or log in to the node.
83+
This would work for:
84+
- Services on Linux worker and control plane nodes:
85+
- That have systemd / journald support.
86+
- That have services that log to `/var/log/`
87+
- Windows worker nodes (all supported variants) that log to `C:\var\log`,
88+
System and Application logs, Windows Event Logs and Event Tracing (ETW).
89+
90+
### Non-Goals
91+
- Providing support for non-systemd Linux distributions.
92+
- Reporting logs for nodes that have config or connection issues with the
93+
cluster.
94+
- Getting logs from services that do not use /var/log/.
95+
96+
## Proposal
97+
98+
### Implement client for logs endpoint viewer (OS agnostic)
99+
- Extend `kubectl logs` to work with node objects.
100+
- Implement a client for the `/var/log/` kubelet endpoint viewer.
101+
102+
### Linux distros with systemd / journald
103+
Supplement the `/var/log/` endpoint viewer on the kubelet with a thin shim
104+
over the `journal` directory that shells out to journalctl. Then extend
105+
`kubectl logs` to also work with node objects.
106+
107+
### Linux distributions without systemd / journald
108+
Running the new "kubectl logs nodes" command against services on nodes that do
109+
not use systemd / journald should return "OS not supported". However getting
110+
logs from `/var/log/` should work on all systems.
111+
112+
### Windows
113+
Reuse the kubelet API for querying the Linux journal for invoking the
114+
`Get-WinEvent` cmdlet in a PowerShell.
115+
116+
### User Stories
117+
118+
Consider a scenario where pods / containers are refusing to come up on certain
119+
nodes. As mentioned in the motivation section, troubleshooting this scenario
120+
requires the cluster administrator to SSH into nodes to scan the logs. Allowing
121+
them to use `kubectl logs` to do the same as they would to debug issues with a
122+
pod / container would greatly simplify their debug workflow. This also opens up
123+
opportunities for tooling and simplifying automated log gathering. The feature
124+
can also be used to debug issues with Kubernetes services especially in Windows
125+
nodes that run as native Windows services and not as DaemonSets or Deployments.
126+
127+
Here are some examples of how a cluster administrator would use this feature:
128+
```
129+
# Show kubelet and crio journal logs from all masters
130+
kubectl logs nodes --role master -s kubelet -s crio
131+
132+
# Show kubelet log file (/var/log/kubelet/kubelet.log) from all Windows worker nodes
133+
kubectl logs nodes --label kubernetes.io/os=windows -s kubelet
134+
135+
# Display docker runtime WinEvent log entries from a specific Windows worker node
136+
kubectl logs nodes <node-name> --service docker
137+
```
138+
139+
### Risks and Mitigations
140+
141+
#### Large log files and events
142+
If the requested log is very large (GBs), there is
143+
potential for the node performance to be degraded. To mitigate this we can
144+
document that node logs should always be rotated in clusters that enable this
145+
feature. We should also take into account nodes that don't take advantage of
146+
journald's rate limiting options. We can then take real world feedback around
147+
this for better mitigation when graduating the feature from alpha to beta.
148+
149+
#### Wider access to all node level service logs
150+
The cluster administrator can now view all logs in /var/log/, systemd/journald
151+
services and Windows services. Given that the cluster administrator can log
152+
into the nodes and view the same information this should not be an issue.
153+
However there is potential for scenarios where the cluster administrator does
154+
not have access to the infrastructure. This again would benefit from real world
155+
usage feedback.
156+
157+
## Design Details
158+
159+
#### kubelet
160+
161+
The kubelet already has a `/var/log/` [endpoint viewer](https://github.com/kubernetes/kubernetes/blob/b184272e278571d1e6650605dd4c39be897eaaa2/pkg/kubelet/kubelet.go#L1403)
162+
that is lacking a client. Given its existence we can supplement that with a
163+
wafer thin shim over the /journal directory that shells out to journalctl. This
164+
allows us to extend the endpoint for getting logs from the system journal on
165+
Linux systems that support systemd. To enable filtering of logs, we can reuse
166+
the existing filters supported by journalctl. The `kubectl logs` will have
167+
command line options for specifying these filters when interacting with node
168+
objects.
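
As a rough illustration of how such filters could be turned into a `journalctl` invocation, here is a minimal sketch. The function name and parameters are hypothetical and are not taken from the reference implementation; the `journalctl` flags used (`--unit`, `--since`, `--until`, `--lines`, `--grep`, `--case-sensitive`, `--output`, `--no-pager`) are standard ones. The sketch only builds the argument vector and does not run anything.

```python
from typing import List, Optional


def build_journalctl_args(
    services: List[str],
    since: Optional[str] = None,
    until: Optional[str] = None,
    tail: int = 0,
    grep: Optional[str] = None,
    case_sensitive: bool = True,
    output: Optional[str] = None,
) -> List[str]:
    """Translate log-query filters into a journalctl argument vector.

    Hypothetical helper for illustration only.
    """
    args = ["journalctl", "--no-pager"]
    # journalctl accepts --unit more than once to select several services.
    for service in services:
        args += ["--unit", service]
    if since:
        args += ["--since", since]
    if until:
        args += ["--until", until]
    if tail > 0:
        args += ["--lines", str(tail)]
    if grep:
        flag = "true" if case_sensitive else "false"
        args += ["--grep", grep, f"--case-sensitive={flag}"]
    if output:
        args += ["--output", output]
    return args
```

Returning a list rather than a single command string means the values can be passed to a process launcher without going through a shell, which avoids quoting problems with user-supplied patterns.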
169+
170+
On the Windows side viewing of logs from services that use `C:\var\log` will
171+
be supported by the existing endpoint. For Windows services that log to the
172+
System and Application logs, Windows Event Logs and Event Tracing (ETW),
173+
we can leverage the [Get-WinEvent cmdlet](https://docs.microsoft.com/en-us/powershell/module/microsoft.powershell.diagnostics/get-winevent?view=powershell-7.1)
174+
that supports getting logs from all these sources. The cmdlet has filtering
175+
options that can be leveraged to filter the logs in the same manner we do
176+
with the journal logs.
177+
178+
Please note that filtering will not be available for logs in `/var/log/` or
179+
`C:\var\log\`.
180+
181+
The feature now enables the cluster administrator to interrogate all services.
182+
This could be prevented by having a whitelist of allowed services. But this
183+
comes with severe disadvantages as there could be nodes (especially with
184+
Windows) that have other services to support networking and monitoring.
185+
These services are variable and will depend on how the nodes have been
186+
configured. Here are some examples:
187+
- [hybrid-overlay-node](https://github.com/ovn-org/ovn-kubernetes/tree/master/go-controller/hybrid-overlay)
188+
- [windows-exporter](https://github.com/prometheus-community/windows_exporter).
189+
190+
191+
The `/var/log/` endpoint is enabled using the `enableSystemLogHandler` kubelet
192+
configuration option. To gain access to this new feature this option needs to
193+
be enabled. In addition when introducing this feature it will be hidden behind a
194+
`NodeLogs` feature gate in the kubelet that needs to be explicitly enabled. So
195+
you need to enable both options to get access to this new feature and disabling
196+
`enableSystemLogHandler` will disable the new feature irrespective of the
197+
`NodeLogs` feature gate.
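
The relationship between the two switches amounts to a logical AND. A minimal sketch, using a hypothetical function name that does not appear in the kubelet:

```python
def node_log_viewer_enabled(enable_system_log_handler: bool, node_logs_gate: bool) -> bool:
    """Return True only when both kubelet switches are on.

    Hypothetical helper: enable_system_log_handler mirrors the
    `enableSystemLogHandler` configuration option and node_logs_gate
    mirrors the `NodeLogs` feature gate.
    """
    return enable_system_log_handler and node_logs_gate
```

With `enableSystemLogHandler` off, the result is `False` whatever the state of the feature gate.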
198+
199+
A reference implementation of this feature without the feature gate is
200+
available [here](https://github.com/kubernetes/kubernetes/pull/96120).
201+
202+
#### kubectl
203+
204+
`kubectl` has an existing `logs` command that is used to view the logs for a
205+
container in a pod or a specified resource. The sub-command looks at resource
206+
types, so can be extended to work with node objects to view the logs of services
207+
on the nodes. Given that the `logs` command depends on RBAC policies for access
208+
to appropriate resource type and associated endpoints, it will allow us to
209+
restrict node logs access to only cluster administrators as long as the cluster
210+
is set up in that manner. Access to the `node/logs` sub-resource needs to be
211+
explicitly granted as a user with access to `nodes` will not automatically have
212+
access to `node/logs`.
213+
214+
The `logs` sub-command for node objects will follow a heuristic approach when
215+
asked to query for logs from a Windows or Linux service. If asked to get the
216+
logs from a service `foobar`, it will first assume `foobar` logs to the Linux
217+
journal / Windows eventing mechanisms (Application, System, and ETW). If unable
218+
to get logs from these, it will attempt to get logs from `/var/log/foobar.log`,
219+
`/var/log/foobar/foobar.log`, `/var/log/foobar*INFO` or
220+
`/var/log/foobar/foobar*INFO` in that order.
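
The file-based part of this fallback can be sketched as follows. This is only an illustration of the lookup order described above: the function names are hypothetical, the journal / Windows eventing step is left out, and picking the first match in sorted order when a wildcard pattern matches several files is an assumption.

```python
import glob
import os
from typing import List, Optional


def candidate_patterns(service: str, log_root: str = "/var/log") -> List[str]:
    """Return the file patterns to try for a service, in lookup order."""
    # Escape glob metacharacters so only the literal '*' below acts as a wildcard.
    root = glob.escape(log_root)
    name = glob.escape(service)
    return [
        os.path.join(root, f"{name}.log"),
        os.path.join(root, name, f"{name}.log"),
        os.path.join(root, f"{name}*INFO"),
        os.path.join(root, name, f"{name}*INFO"),
    ]


def find_service_log(service: str, log_root: str = "/var/log") -> Optional[str]:
    """Return the first log file that matches, or None if nothing matches."""
    for pattern in candidate_patterns(service, log_root):
        matches = sorted(glob.glob(pattern))
        if matches:
            # Assumption: take the first match in sorted order.
            return matches[0]
    return None
```
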
221+
Here are some examples and an explanation of the options that will be added.
222+
```
223+
Examples:
224+
# Show kubelet logs from all masters
225+
kubectl logs nodes --role master -s kubelet
226+
227+
# Show docker logs from Windows nodes
228+
kubectl logs nodes -l kubernetes.io/os=windows -s docker
229+
230+
Options:
231+
--case-sensitive=true: Filters are case sensitive by default. Pass --case-sensitive=false to do a case insensitive filter.
232+
-g, --grep='': Filter log entries by the provided regex pattern. Only applies to node journal logs.
233+
-o, --output='': Display journal logs in an alternate format (short, cat, json, short-unix). Only applies to node journal logs.
234+
--raw=false: Perform no transformation of the returned data.
235+
--role='': Set a label selector by node role.
236+
-l, --selector='': Selector (label query) to filter on.
237+
--since='': Return logs after a specific ISO timestamp or relative date. Only applies to node journal or Get-WinEvent logs.
238+
--tail=0: Return up to this many lines (not more than 100k) from the end of the log. Only applies to node journal or Get-WinEvent logs.
239+
--sort=timestamp: Interleave logs by sorting the output. Defaults on when viewing node journal logs.
240+
-s, --service=[]: Return log entries from the specified service(s).
241+
--until='': Return logs before a specific ISO timestamp or relative date.
242+
```
243+
244+
The `--sort=timestamp` feature will introduce log unification across node
245+
objects by timestamps which can be extended to pod logs. This will allow users
246+
to see logs across nodes from the same time. Similarly for pods, it will allow
247+
seeing logs across containers aligned by time.
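
One way to picture this interleaving is a k-way merge over per-node streams that are each already in time order. The sketch below is an assumption about how it could work, not the reference implementation: the entry layout and function name are hypothetical, and timestamps are assumed to be UTC strings in one fixed ISO 8601 format so that string order matches time order.

```python
import heapq
from typing import Iterable, Iterator, Tuple

# Hypothetical entry layout: (timestamp, node name, message).
LogEntry = Tuple[str, str, str]


def interleave_by_timestamp(*streams: Iterable[LogEntry]) -> Iterator[LogEntry]:
    """Lazily merge per-node log streams into a single time-ordered stream.

    Each input stream must already be sorted by timestamp.
    """
    return heapq.merge(*streams, key=lambda entry: entry[0])


node_a = [
    ("2021-05-05T10:00:00Z", "node-a", "kubelet started"),
    ("2021-05-05T10:00:02Z", "node-a", "pod sync failed"),
]
node_b = [
    ("2021-05-05T10:00:01Z", "node-b", "kubelet started"),
]
merged = list(interleave_by_timestamp(node_a, node_b))
```

Because `heapq.merge` is lazy, it only holds one pending entry per stream, which suits streaming output from many nodes.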
248+
249+
Given that the feature will be introduced behind a feature gate, by default
250+
`kubectl logs nodes` will return a feature not enabled message. When the
251+
feature is enabled in alpha phase, `kubectl logs nodes` will display a
252+
warning message that the feature is in alpha. When the `--service` option
253+
is used against Linux nodes that do not support systemd/journald and the service
254+
does not log to `/var/log`, an OS not supported message will be returned.
255+
256+
### Test Plan
257+
Add unit tests to kubelet and kubectl that exercise the new arguments that
258+
have been added. A reference implementation of the tests can be seen
259+
[here](https://github.com/kubernetes/kubernetes/pull/96120/commits/c606a38ec38ccfe486033495a1dc433279ce71f8#diff-1d703a87c6d6156adf2d0785ec0174bb365855d4883f5758c05fda1fee8f7f1bR1)
260+
261+
### Graduation Criteria
262+
263+
The plan is to introduce the feature as alpha in the v1.22 time frame behind the
264+
`NodeLogs` feature gate.
265+
266+
#### Alpha -> Beta Graduation
267+
268+
The plan is to graduate the feature to beta in the v1.23 time frame. At that
269+
point we would have collected feedback from cluster administrators and
270+
developers who have enabled the feature. Based on this feedback and issues
271+
opened we should consider adding a kubelet side throttle for viewing the
272+
logs. In addition we will garner feedback on the heuristic approach and based on
273+
that we will decide if we need to introduce options to explicitly differentiate
274+
between file vs journal / WinEvent logs.
275+
276+
#### Beta -> GA Graduation
277+
278+
The plan is to graduate the feature to GA in the v1.24 time frame at which point
279+
any major issues should have been surfaced and addressed during the alpha and
280+
beta phases.
281+
282+
### Upgrade / Downgrade Strategy
283+
284+
### Version Skew Strategy
285+
286+
If a kubectl version that has the new `logs nodes` option is used against a node
287+
that is using a kubelet that does not have the extended `/var/log` endpoint
288+
viewer, the result should be "feature not supported".
289+
290+
## Production Readiness Review Questionnaire
291+
292+
### Feature Enablement and Rollback
293+
294+
* **How can this feature be enabled / disabled in a live cluster?**
295+
- [x] Feature gate
296+
- Feature gate name: NodeLogs
297+
- Components depending on the feature gate: kubelet
298+
299+
* **Does enabling the feature change any default behavior?** No
300+
301+
* **Can the feature be disabled once it has been enabled (i.e. can we roll back
302+
the enablement)?** Yes. It can be disabled by disabling the `NodeLogs` feature
303+
gate in the kubelet.
304+
305+
* **What happens if we reenable the feature if it was previously rolled back?**
306+
There will be no adverse effects of enabling the feature gate after it was
307+
disabled.
308+
309+
* **Are there any tests for feature enablement/disablement?** No
310+
311+
### Rollout, Upgrade and Rollback Planning
312+
313+
_This section must be completed when targeting beta graduation to a release._
314+
315+
### Monitoring Requirements
316+
317+
_This section must be completed when targeting beta graduation to a release._
318+
319+
### Dependencies
320+
321+
_This section must be completed when targeting beta graduation to a release._
322+
323+
* **Does this feature depend on any specific services running in the cluster?**
324+
- kubelet
325+
- Usage description:
326+
- Impact of its outage on the feature: If kubelet is not running on the
327+
node this feature will not work.
328+
- Impact of its degraded performance or high-error rates on the feature:
329+
If the kubelet is degraded this feature will also be degraded i.e. the
330+
node logs will not be returned.
331+
332+
### Scalability
333+
334+
* **Will enabling / using this feature result in any new API calls?**
335+
No
336+
337+
* **Will enabling / using this feature result in introducing new API types?**
338+
Yes. We will need to add a `NodeLogOptions` counterpart to
339+
[PodLogOptions](https://github.com/kubernetes/kubernetes/blob/548ad1b8d35d51e6d33ea21dcc75d60a789b00e6/pkg/apis/core/types.go#L4409)
340+
341+
* **Will enabling / using this feature result in any new calls to the cloud
342+
provider?**
343+
No
344+
345+
* **Will enabling / using this feature result in increasing size or count of
346+
the existing API objects?**
347+
No
348+
349+
* **Will enabling / using this feature result in increasing time taken by any
350+
operations covered by [existing SLIs/SLOs]?**
351+
No
352+
353+
* **Will enabling / using this feature result in non-negligible increase of
354+
resource usage (CPU, RAM, disk, IO, ...) in any components?**
355+
In the case of large logs, there is potential for an increase in RAM and CPU
356+
usage on the node when an attempt is made to stream them. Feedback from the
357+
field during alpha will provide more clarity as we graduate from alpha to
358+
beta.
359+
360+
### Troubleshooting
361+
362+
## Implementation History
363+
364+
- Created on Jan 14, 2021
365+
- Updated on May 5th, 2021
366+
367+
## Drawbacks
368+
369+
## Alternatives
370+
371+
Alternatively we could use a client side reader on the nodes to redirect the
372+
logs. The Windows side would require privileged container support. However this
373+
would not help scenarios where containers are not launching successfully on the
374+
nodes.
375+
376+
For the kubectl changes an alternative to extending `kubectl logs` would be to
377+
introduce a plugin or add a new sub-command under `kubectl alpha`.
