| 1 | +# KEP-2258: Node service log viewer |
| 2 | + |
| 3 | +<!-- toc --> |
| 4 | +- [Release Signoff Checklist](#release-signoff-checklist) |
| 5 | +- [Summary](#summary) |
| 6 | +- [Motivation](#motivation) |
| 7 | + - [Goals](#goals) |
| 8 | + - [Non-Goals](#non-goals) |
| 9 | +- [Proposal](#proposal) |
| 10 | + - [Implement client for logs endpoint viewer (OS agnostic)](#implement-client-for-logs-endpoint-viewer-os-agnostic) |
| 11 | + - [Linux distros with systemd / journald](#linux-distros-with-systemd--journald) |
| 12 | + - [Linux distributions without systemd / journald](#linux-distributions-without-systemd--journald) |
| 13 | + - [Windows](#windows) |
| 14 | + - [User Stories](#user-stories) |
| 15 | + - [Risks and Mitigations](#risks-and-mitigations) |
| 16 | + - [Large log files and events](#large-log-files-and-events) |
| 17 | + - [Wider access to all node level service logs](#wider-access-to-all-node-level-service-logs) |
| 18 | +- [Design Details](#design-details) |
| 19 | + - [kubelet](#kubelet) |
| 20 | + - [kubectl](#kubectl) |
| 21 | + - [Test Plan](#test-plan) |
| 22 | + - [Graduation Criteria](#graduation-criteria) |
| 23 | + - [Alpha -> Beta Graduation](#alpha---beta-graduation) |
| 24 | + - [Beta -> GA Graduation](#beta---ga-graduation) |
| 25 | + - [Upgrade / Downgrade Strategy](#upgrade--downgrade-strategy) |
| 26 | + - [Version Skew Strategy](#version-skew-strategy) |
| 27 | +- [Production Readiness Review Questionnaire](#production-readiness-review-questionnaire) |
| 28 | + - [Feature Enablement and Rollback](#feature-enablement-and-rollback) |
| 29 | + - [Rollout, Upgrade and Rollback Planning](#rollout-upgrade-and-rollback-planning) |
| 30 | + - [Monitoring Requirements](#monitoring-requirements) |
| 31 | + - [Dependencies](#dependencies) |
| 32 | + - [Scalability](#scalability) |
| 33 | + - [Troubleshooting](#troubleshooting) |
| 34 | +- [Implementation History](#implementation-history) |
| 35 | +- [Drawbacks](#drawbacks) |
| 36 | +- [Alternatives](#alternatives) |
| 37 | +<!-- /toc --> |
| 38 | + |
| 39 | +## Release Signoff Checklist |
| 40 | + |
| 41 | +Items marked with (R) are required *prior to targeting to a milestone / release*. |
| 42 | + |
| 43 | +- [x] (R) Enhancement issue in release milestone, which links to KEP dir in [kubernetes/enhancements] (not the initial KEP PR) |
| 44 | +- [ ] (R) KEP approvers have approved the KEP status as `implementable` |
| 45 | +- [x] (R) Design details are appropriately documented |
| 46 | +- [x] (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input |
| 47 | +- [x] (R) Graduation criteria is in place |
| 48 | +- [x] (R) Production readiness review completed |
| 49 | +- [ ] (R) Production readiness review approved |
| 50 | +- [x] "Implementation History" section is up-to-date for milestone |
| 51 | +- [ ] User-facing documentation has been created in [kubernetes/website], for publication to [kubernetes.io] |
| 52 | +- [ ] Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes |
| 53 | + |
| 54 | +[kubernetes.io]: https://kubernetes.io/ |
| 55 | +[kubernetes/enhancements]: https://git.k8s.io/enhancements |
| 56 | +[kubernetes/kubernetes]: https://git.k8s.io/kubernetes |
| 57 | +[kubernetes/website]: https://git.k8s.io/website |
| 58 | + |
| 59 | +## Summary |
| 60 | + |
| 61 | +A Kubernetes cluster administrator has to log in to the relevant control-plane |
| 62 | +or worker nodes to view the logs of the API server, kubelet etc., or they have |
| 63 | +to implement a client-side reader. A simpler and more elegant method would |
| 64 | +be to allow them to use the kubectl CLI to view these logs, just as they use |
| 65 | +it for other interactions with the cluster. Given the sensitive nature of |
| 66 | +the information in node logs, this feature will only be available to cluster |
| 67 | +administrators. |
| 68 | + |
| 69 | +## Motivation |
| 70 | + |
| 71 | +Troubleshooting issues with control-plane and worker nodes typically requires |
| 72 | +a cluster administrator to SSH into the nodes for debugging. While certain |
| 73 | +issues will require being on the node, issues with the kube-proxy or kubelet, |
| 74 | +to name a couple, could be solved by perusing their logs. However, this |
| 75 | +too requires the administrator to SSH into the nodes. Having a way for |
| 76 | +them to view the logs using kubectl will significantly simplify their |
| 77 | +troubleshooting. |
| 78 | + |
| 79 | + |
| 80 | +### Goals |
| 81 | +Provide a cluster administrator with a streaming view of logs using kubectl |
| 82 | +without them having to implement a client side reader or log into the node. |
| 83 | +This would work for: |
| 84 | +- Services on Linux worker and control plane nodes: |
| 85 | + - That have systemd / journald support. |
| 86 | + - That have services that log to `/var/log/` |
| 87 | +- Windows worker nodes (all supported variants) that log to `C:\var\log`, |
| 88 | + System and Application logs, Windows Event Logs and Event Tracing (ETW). |
| 89 | + |
| 90 | +### Non-Goals |
| 91 | +- Providing support for non-systemd Linux distributions. |
| 92 | +- Reporting logs for nodes that have config or connection issues with the |
| 93 | + cluster. |
| 94 | +- Getting logs from services that do not use /var/log/. |
| 95 | + |
| 96 | +## Proposal |
| 97 | + |
| 98 | +### Implement client for logs endpoint viewer (OS agnostic) |
| 99 | +- Extend `kubectl logs` to work with node objects. |
| 100 | +- Implement a client for the `/var/log/` kubelet endpoint viewer. |
| 101 | + |
| 102 | +### Linux distros with systemd / journald |
| 102 | +Supplement the `/var/log/` endpoint viewer on the kubelet with a thin shim |
| 104 | +over the `journal` directory that shells out to journalctl. Then extend |
| 105 | +`kubectl logs` to also work with node objects. |
| 106 | + |
| 107 | +### Linux distributions without systemd / journald |
| 108 | +Running the new "kubectl logs nodes" command against services on nodes that do |
| 109 | +not use systemd / journald should return "OS not supported". However getting |
| 110 | +logs from `/var/log/` should work on all systems. |
| 111 | + |
| 112 | +### Windows |
| 112 | +Reuse the kubelet API for querying the Linux journal to invoke the |
| 113 | +`Get-WinEvent` cmdlet in a PowerShell session. |
| 115 | + |
| 116 | +### User Stories |
| 117 | + |
| 118 | +Consider a scenario where pods / containers are refusing to come up on certain |
| 119 | +nodes. As mentioned in the motivation section, troubleshooting this scenario |
| 119 | +requires the cluster administrator to SSH into nodes to scan the logs. Allowing |
| 120 | +them to use `kubectl logs` to do the same as they would to debug issues with a |
| 121 | +pod / container would greatly simplify their debug workflow. This also opens up |
| 123 | +opportunities for tooling and simplifying automated log gathering. The feature |
| 124 | +can also be used to debug issues with Kubernetes services especially in Windows |
| 125 | +nodes that run as native Windows services and not as DaemonSets or Deployments. |
| 126 | + |
| 126 | +Here are some examples of how a cluster administrator would use this feature: |
| 128 | +``` |
| 129 | +# Show kubelet and crio journal logs from all masters |
| 130 | +kubectl logs nodes --role master -s kubelet -s crio |
| 131 | + |
| 132 | +# Show kubelet log file (/var/log/kubelet/kubelet.log) from all Windows worker nodes |
| 133 | +kubectl logs nodes --selector kubernetes.io/os=windows -s kubelet |
| 134 | + |
| 135 | +# Display docker runtime WinEvent log entries from a specific Windows worker node |
| 136 | +kubectl logs nodes <node-name> --service docker |
| 137 | +``` |
| 138 | + |
| 139 | +### Risks and Mitigations |
| 140 | + |
| 141 | +#### Large log files and events |
| 142 | +If the log being viewed is very large (GBs), there is potential for node |
| 143 | +performance to be degraded. To mitigate this we can |
| 144 | +document that node logs should always be rotated in clusters that enable this |
| 145 | +feature. We should also take into account nodes that don't take advantage of |
| 146 | +journald's rate limiting options. We can then take real world feedback around |
| 147 | +this for better mitigation when graduating the feature from alpha to beta. |
| 148 | + |
| 149 | +#### Wider access to all node level service logs |
| 150 | +The cluster administrator can now view all logs in /var/log/, systemd/journald |
| 151 | +services and Windows services. Given that the cluster administrator can log |
| 152 | +into the nodes and view the same information this should not be an issue. |
| 153 | +However there is potential for scenarios where the cluster administrator does |
| 154 | +not have access to the infrastructure. This again would benefit from real world |
| 155 | +usage feedback. |
| 156 | + |
| 157 | +## Design Details |
| 158 | + |
| 159 | +### kubelet |
| 160 | + |
| 161 | +The kubelet already has a `/var/log/` [endpoint viewer](https://github.com/kubernetes/kubernetes/blob/b184272e278571d1e6650605dd4c39be897eaaa2/pkg/kubelet/kubelet.go#L1403) |
| 162 | +that is lacking a client. Given its existence we can supplement that with a |
| 163 | +wafer thin shim over the /journal directory that shells out to journalctl. This |
| 164 | +allows us to extend the endpoint for getting logs from the system journal on |
| 165 | +Linux systems that support systemd. To enable filtering of logs, we can reuse |
| 166 | +the existing filters supported by journalctl. The `kubectl logs` command will have |
| 167 | +command line options for specifying these filters when interacting with node |
| 168 | +objects. |
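As a rough illustration of the shim described above, the sketch below maps the proposed filter options onto a `journalctl` argument list. It is a minimal sketch under assumptions, not the reference implementation: the function name `build_journalctl_args` and its parameters are hypothetical, and translating ISO timestamps into a form `journalctl` accepts is deliberately left out. The `journalctl` flags used (`--unit`, `--since`, `--until`, `--lines`, `--grep`, `--case-sensitive`, `--output`, `--no-pager`) are existing options.

```python
from typing import List, Optional


def build_journalctl_args(
    services: List[str],
    since: Optional[str] = None,
    until: Optional[str] = None,
    tail: int = 0,
    grep: Optional[str] = None,
    case_sensitive: bool = True,
    output: Optional[str] = None,
) -> List[str]:
    """Translate node log query options into a journalctl argument list.

    Hypothetical helper for illustration; timestamp values are passed
    through unchanged, so callers must supply a format journalctl accepts.
    """
    # --no-pager keeps journalctl from waiting on an interactive pager.
    args = ["journalctl", "--no-pager"]
    for service in services:
        # Repeating --unit shows entries from any of the listed units.
        args.append(f"--unit={service}")
    if since:
        args.append(f"--since={since}")
    if until:
        args.append(f"--until={until}")
    if tail > 0:
        # Cap at 100k lines, matching the limit stated for --tail.
        args.append(f"--lines={min(tail, 100_000)}")
    if grep:
        args.append(f"--grep={grep}")
        if not case_sensitive:
            args.append("--case-sensitive=false")
    if output:
        args.append(f"--output={output}")
    return args
```

Returning an argument list rather than a shell string means a service name supplied by the client is never interpreted by a shell, which seems a sensible default for anything that shells out on the node.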
| 169 | + |
| 170 | +On the Windows side viewing of logs from services that use `C:\var\log` will |
| 171 | +be supported by the existing endpoint. For Windows services that log to the |
| 172 | +System and Application logs, Windows Event Logs and Event Tracing (ETW), |
| 173 | +we can leverage the [Get-WinEvent cmdlet](https://docs.microsoft.com/en-us/powershell/module/microsoft.powershell.diagnostics/get-winevent?view=powershell-7.1) |
| 174 | +that supports getting logs from all these sources. The cmdlet has filtering |
| 175 | +options that can be leveraged to filter the logs in the same manner we do |
| 176 | +with the journal logs. |
| 177 | + |
| 178 | +Please note that filtering will not be available for logs in `/var/log/` or |
| 179 | +`C:\var\log\`. |
| 180 | + |
| 181 | +The feature now enables the cluster administrator to interrogate all services. |
| 182 | +This could be prevented by having a whitelist of allowed services. But this |
| 183 | +comes with severe disadvantages as there could be nodes (especially with |
| 184 | +Windows) that have other services to support networking and monitoring. |
| 185 | +These services are variable and will depend on how the nodes have been |
| 186 | +configured. Here are some examples: |
| 187 | +- [hybrid-overlay-node](https://github.com/ovn-org/ovn-kubernetes/tree/master/go-controller/hybrid-overlay) |
| 188 | +- [windows-exporter](https://github.com/prometheus-community/windows_exporter) |
| 189 | + |
| 190 | + |
| 191 | +The `/var/log/` endpoint is enabled using the `enableSystemLogHandler` kubelet |
| 192 | +configuration option. To gain access to this new feature this option needs to |
| 193 | +be enabled. In addition when introducing this feature it will be hidden behind a |
| 194 | +`NodeLogs` feature gate in the kubelet that needs to be explicitly enabled. So |
| 195 | +you need to enable both options to get access to this new feature and disabling |
| 196 | +`enableSystemLogHandler` will disable the new feature irrespective of the |
| 197 | +`NodeLogs` feature gate. |
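The interaction between the two switches can be summarised as a small truth table. The sketch below encodes one reading of the paragraph above; the function name and the returned keys are hypothetical and do not correspond to kubelet code.

```python
def available_log_sources(enable_system_log_handler: bool, node_logs_gate: bool) -> dict:
    """Return which log sources a kubelet would serve for a given configuration.

    Illustrative only: assumes the existing /var/log/ endpoint keeps working
    without the NodeLogs gate, as it does today.
    """
    return {
        # The existing /var/log/ endpoint depends only on enableSystemLogHandler.
        "var_log_files": enable_system_log_handler,
        # The journal / Get-WinEvent query needs both switches enabled.
        "service_query": enable_system_log_handler and node_logs_gate,
    }
```

For example, `available_log_sources(False, True)` reports both sources as unavailable, reflecting that disabling `enableSystemLogHandler` overrides the feature gate.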
| 198 | + |
| 199 | +A reference implementation of this feature without the feature gate is |
| 200 | +available [here](https://github.com/kubernetes/kubernetes/pull/96120). |
| 201 | + |
| 202 | +### kubectl |
| 203 | + |
| 204 | +`kubectl` has an existing `logs` command that is used to view the logs for a |
| 205 | +container in a pod or a specified resource. The sub-command looks at resource |
| 206 | +types, so can be extended to work with node objects to view the logs of services |
| 207 | +on the nodes. Given that the `logs` command depends on RBAC policies for access |
| 208 | +to appropriate resource type and associated endpoints, it will allow us to |
| 209 | +restrict node logs access to only cluster administrators as long as the cluster |
| 210 | +is set up in that manner. Access to the `node/logs` sub-resource needs to be |
| 211 | +explicitly granted as a user with access to `nodes` will not automatically have |
| 212 | +access to `node/logs`. |
| 213 | + |
| 214 | +The `logs` sub-command for node objects will follow a heuristic approach when |
| 215 | +asked to query for logs from a Windows or Linux service. If asked to get the |
| 216 | +logs from a service `foobar`, it will first assume `foobar` logs to the Linux |
| 217 | +journal / Windows eventing mechanisms (Application, System, and ETW). If unable |
| 218 | +to get logs from these, it will attempt to get logs from `/var/log/foobar.log`, |
| 219 | +`/var/log/foobar/foobar.log`, `/var/log/foobar*INFO` or |
| 220 | +`/var/log/foobar/foobar*INFO` in that order. |
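The lookup order described above can be sketched as follows. This is an illustrative model only: `candidate_log_paths` and `resolve_log_source` are hypothetical names, and the two callables stand in for whatever checks the journal / Windows event log and the filesystem in a real implementation.

```python
from typing import Callable, List, Optional


def candidate_log_paths(service: str) -> List[str]:
    """File patterns tried, in order, when the journal / event log has nothing."""
    return [
        f"/var/log/{service}.log",
        f"/var/log/{service}/{service}.log",
        f"/var/log/{service}*INFO",
        f"/var/log/{service}/{service}*INFO",
    ]


def resolve_log_source(
    service: str,
    has_native_logs: Callable[[str], bool],
    find_file: Callable[[str], Optional[str]],
) -> Optional[str]:
    """Pick the first source that has logs for the service, or None.

    has_native_logs: hypothetical probe for journal / Windows event logs.
    find_file: hypothetical lookup returning a matching path for a pattern.
    """
    # Journal (Linux) or eventing mechanisms (Windows) are tried first.
    if has_native_logs(service):
        return f"native:{service}"
    # Otherwise fall back to the file patterns in the documented order.
    for pattern in candidate_log_paths(service):
        match = find_file(pattern)
        if match is not None:
            return f"file:{match}"
    return None
```

Keeping the ordered pattern list in one function makes the heuristic easy to revisit if feedback during alpha favours explicit options over guessing.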
| 221 | +Here are some examples and an explanation of the options that will be added. |
| 222 | +``` |
| 223 | +Examples: |
| 224 | + # Show kubelet logs from all masters |
| 225 | + kubectl logs nodes --role master -s kubelet |
| 226 | + |
| 227 | + # Show docker logs from Windows nodes |
| 228 | + kubectl logs nodes -l kubernetes.io/os=windows -s docker |
| 229 | + |
| 230 | +Options: |
| 231 | + --case-sensitive=true: Filters are case sensitive by default. Pass --case-sensitive=false to do a case insensitive filter. |
| 232 | + -g, --grep='': Filter log entries by the provided regex pattern. Only applies to node journal logs. |
| 233 | + -o, --output='': Display journal logs in an alternate format (short, cat, json, short-unix). Only applies to node journal logs. |
| 234 | + --raw=false: Perform no transformation of the returned data. |
| 235 | + --role='': Set a label selector by node role. |
| 236 | + -l, --selector='': Selector (label query) to filter on. |
| 237 | + --since='': Return logs after a specific ISO timestamp or relative date. Only applies to node journal or Get-WinEvent logs. |
| 238 | + --tail=0: Return up to this many lines (not more than 100k) from the end of the log. Only applies to node journal or Get-WinEvent logs. |
| 239 | + --sort=timestamp: Interleave logs by sorting the output. Defaults on when viewing node journal logs. |
| 240 | + -s, --service=[]: Return log entries from the specified service(s). |
| 241 | + --until='': Return logs before a specific ISO timestamp or relative date. |
| 242 | +``` |
| 243 | + |
| 244 | +The `--sort=timestamp` feature will introduce log unification across node |
| 245 | +objects by timestamps which can be extended to pod logs. This will allow users |
| 246 | +to see logs across nodes from the same time. Similarly for pods, it will allow |
| 247 | +seeing logs across containers aligned by time. |
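One way to picture the `--sort=timestamp` behaviour is a k-way merge of per-node streams. The sketch below is an assumption about how such interleaving could work rather than the proposed implementation; the `LogEntry` shape and the function name are hypothetical, and each input stream is assumed to be already ordered by timestamp.

```python
import heapq
from typing import Iterable, Iterator, Tuple

# (timestamp, node name, message). Timestamps are assumed to be strings that
# sort chronologically, such as RFC 3339 values in UTC.
LogEntry = Tuple[str, str, str]


def interleave_by_timestamp(streams: Iterable[Iterable[LogEntry]]) -> Iterator[LogEntry]:
    """Lazily merge per-node streams that are each already in timestamp order."""
    # heapq.merge holds only one pending entry per stream in memory at a time.
    return heapq.merge(*streams, key=lambda entry: entry[0])
```

Because the merge is lazy, output can begin before any single node's log has been read in full, which matters given the large-log risk noted earlier.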
| 248 | + |
| 249 | +Given that the feature will be introduced behind a feature gate, by default |
| 250 | +`kubectl logs nodes` will return a feature not enabled message. When the |
| 251 | +feature is enabled in alpha phase, `kubectl logs nodes` will display a |
| 252 | +warning message that the feature is in alpha. When the `--service` option |
| 253 | +is used against Linux nodes that do not support systemd/journald and the service |
| 254 | +does not log to `/var/log`, an OS not supported message will be returned. |
| 255 | + |
| 256 | +### Test Plan |
| 257 | +Add unit tests to kubelet and kubectl that exercise the new arguments that |
| 258 | +have been added. A reference implementation of the tests can be seen |
| 259 | +[here](https://github.com/kubernetes/kubernetes/pull/96120/commits/c606a38ec38ccfe486033495a1dc433279ce71f8#diff-1d703a87c6d6156adf2d0785ec0174bb365855d4883f5758c05fda1fee8f7f1bR1) |
| 260 | + |
| 261 | +### Graduation Criteria |
| 262 | + |
| 263 | +The plan is to introduce the feature as alpha in the v1.22 time frame behind the |
| 264 | +`NodeLogs` feature gate. |
| 265 | + |
| 266 | +#### Alpha -> Beta Graduation |
| 267 | + |
| 268 | +The plan is to graduate the feature to beta in the v1.23 time frame. At that |
| 269 | +point we would have collected feedback from cluster administrators and |
| 270 | +developers who have enabled the feature. Based on this feedback and issues |
| 271 | +opened we should consider adding a kubelet side throttle for viewing the |
| 272 | +logs. In addition we will garner feedback on the heuristic approach and based on |
| 273 | +that we will decide if we need to introduce options to explicitly differentiate |
| 274 | +between file vs journal / WinEvent logs. |
| 275 | + |
| 276 | +#### Beta -> GA Graduation |
| 277 | + |
| 278 | +The plan is to graduate the feature to GA in the v1.24 time frame at which point |
| 279 | +any major issues should have been surfaced and addressed during the alpha and |
| 280 | +beta phases. |
| 281 | + |
| 282 | +### Upgrade / Downgrade Strategy |
| 283 | + |
| 284 | +### Version Skew Strategy |
| 285 | + |
| 286 | +If a kubectl version that has the new `logs nodes` option is used against a node |
| 287 | +that is using a kubelet that does not have the extended `/var/log` endpoint |
| 288 | +viewer, the result should be "feature not supported". |
| 289 | + |
| 290 | +## Production Readiness Review Questionnaire |
| 291 | + |
| 292 | +### Feature Enablement and Rollback |
| 293 | + |
| 294 | +* **How can this feature be enabled / disabled in a live cluster?** |
| 295 | + - [x] Feature gate |
| 296 | + - Feature gate name: NodeLogs |
| 297 | + - Components depending on the feature gate: kubelet |
| 298 | + |
| 299 | +* **Does enabling the feature change any default behavior?** No |
| 300 | + |
| 301 | +* **Can the feature be disabled once it has been enabled (i.e. can we roll back |
| 302 | + the enablement)?** Yes. It can be disabled by disabling the `NodeLogs` feature |
| 303 | + gate in the kubelet. |
| 304 | + |
| 305 | +* **What happens if we reenable the feature if it was previously rolled back?** |
| 306 | + There will be no adverse effects of enabling the feature gate after it was |
| 307 | + disabled. |
| 308 | + |
| 309 | +* **Are there any tests for feature enablement/disablement?** No |
| 310 | + |
| 311 | +### Rollout, Upgrade and Rollback Planning |
| 312 | + |
| 313 | +_This section must be completed when targeting beta graduation to a release._ |
| 314 | + |
| 315 | +### Monitoring Requirements |
| 316 | + |
| 317 | +_This section must be completed when targeting beta graduation to a release._ |
| 318 | + |
| 319 | +### Dependencies |
| 320 | + |
| 321 | +_This section must be completed when targeting beta graduation to a release._ |
| 322 | + |
| 323 | +* **Does this feature depend on any specific services running in the cluster?** |
| 324 | + - kubelet |
| 325 | + - Usage description: |
| 326 | + - Impact of its outage on the feature: If kubelet is not running on the |
| 327 | + node this feature will not work. |
| 328 | + - Impact of its degraded performance or high-error rates on the feature: |
| 329 | + If the kubelet is degraded this feature will also be degraded i.e. the |
| 330 | + node logs will not be returned. |
| 331 | + |
| 332 | +### Scalability |
| 333 | + |
| 334 | +* **Will enabling / using this feature result in any new API calls?** |
| 335 | + No |
| 336 | + |
| 337 | +* **Will enabling / using this feature result in introducing new API types?** |
| 338 | + Yes. We will need to add a `NodeLogOptions` counterpart to |
| 339 | + [PodLogOptions](https://github.com/kubernetes/kubernetes/blob/548ad1b8d35d51e6d33ea21dcc75d60a789b00e6/pkg/apis/core/types.go#L4409) |
| 340 | + |
| 341 | +* **Will enabling / using this feature result in any new calls to the cloud |
| 342 | +provider?** |
| 343 | + No |
| 344 | + |
| 345 | +* **Will enabling / using this feature result in increasing size or count of |
| 346 | +the existing API objects?** |
| 347 | + No |
| 348 | + |
| 349 | +* **Will enabling / using this feature result in increasing time taken by any |
| 350 | +operations covered by [existing SLIs/SLOs]?** |
| 351 | + No |
| 352 | + |
| 353 | +* **Will enabling / using this feature result in non-negligible increase of |
| 354 | +resource usage (CPU, RAM, disk, IO, ...) in any components?** |
| 355 | + In the case of large logs, there is potential for an increase in RAM and CPU |
| 356 | + usage on the node when an attempt is made to stream them. Feedback from the |
| 357 | + field during alpha will provide more clarity as we graduate from alpha to |
| 358 | + beta. |
| 359 | + |
| 360 | +### Troubleshooting |
| 361 | + |
| 362 | +## Implementation History |
| 363 | + |
| 364 | +- Created on Jan 14, 2021 |
| 365 | +- Updated on May 5th, 2021 |
| 366 | + |
| 367 | +## Drawbacks |
| 368 | + |
| 369 | +## Alternatives |
| 370 | + |
| 371 | +Alternatively we could use a client side reader on the nodes to redirect the |
| 372 | +logs. The Windows side would require privileged container support. However this |
| 373 | +would not help scenarios where containers are not launching successfully on the |
| 374 | +nodes. |
| 375 | + |
| 376 | +For the kubectl changes an alternative to extending `kubectl logs` would be to |
| 377 | +introduce a plugin or add a new sub-command under `kubectl alpha`. |