Propose log tracking KEP #1961

# Log tracking for K8s component log

<!-- toc -->
- [Release Signoff Checklist](#release-signoff-checklist)
- [Summary](#summary)
  - [Three new unique logging metadata](#three-new-unique-logging-metadata)
  - [Note](#note)
- [Motivation](#motivation)
  - [Goals](#goals)
  - [Non-Goals](#non-goals)
- [Proposal](#proposal)
  - [User Stories (Optional)](#user-stories-optional)
    - [Story 1](#story-1)
    - [Story 2](#story-2)
  - [Summary of Cases](#summary-of-cases)
  - [Logging metadata](#logging-metadata)
- [Design Details](#design-details)
  - [Prerequisite](#prerequisite)
  - [Design of ID propagation in apiserver](#design-of-id-propagation-in-apiserver)
  - [Design of ID propagation (controller)](#design-of-id-propagation-controller)
  - [Design of ID propagation in <a href="https://github.com/kubernetes/client-go">client-go</a>](#design-of-id-propagation-in-client-go)
  - [Design of Mutating webhook (out of tree)](#design-of-mutating-webhookout-of-tree)
  - [Risks and Mitigations](#risks-and-mitigations)
  - [Test Plan](#test-plan)
  - [Graduation Criteria](#graduation-criteria)
    - [Alpha](#alpha)
    - [Beta](#beta)
    - [GA](#ga)
- [Production Readiness Review Questionnaire](#production-readiness-review-questionnaire)
  - [Feature Enablement and Rollback](#feature-enablement-and-rollback)
  - [Rollout, Upgrade and Rollback Planning](#rollout-upgrade-and-rollback-planning)
  - [Monitoring Requirements](#monitoring-requirements)
  - [Dependencies](#dependencies)
  - [Scalability](#scalability)
  - [Troubleshooting](#troubleshooting)
- [Implementation History](#implementation-history)
<!-- /toc -->

## Release Signoff Checklist

Items marked with (R) are required *prior to targeting to a milestone / release*.

- [ ] (R) Enhancement issue in release milestone, which links to KEP dir in [kubernetes/enhancements] (not the initial KEP PR)
- [ ] (R) KEP approvers have approved the KEP status as `implementable`
- [ ] (R) Design details are appropriately documented
- [ ] (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input
- [ ] (R) Graduation criteria is in place
- [ ] (R) Production readiness review completed
- [ ] Production readiness review approved
- [ ] "Implementation History" section is up-to-date for milestone
- [ ] User-facing documentation has been created in [kubernetes/website], for publication to [kubernetes.io]
- [ ] Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes

## Summary

This KEP proposes a method for adding three new unique logging metadata fields to K8s component logs.
These metadata make it easier to identify the specific logs related to a user request (such as `kubectl apply`) and to an object (such as a Pod or Deployment).
This is expected to greatly reduce investigation cost when troubleshooting.

### Three new unique logging metadata

We use three metadata fields. They have different characteristics and support troubleshooting from different perspectives.

| metadata name | characteristics |
| ------ | ------ |
| traceid | spans a user request; unique per user request |
| spanid | spans a controller action; unique per controller action |
| initialtraceid | spans the entire object lifecycle; unique for a set of related objects |

### Note

This KEP covers **how** a component could add metadata to logs. To actually add metadata to K8s component logs, the following steps are also necessary:

- Open issues for each component, and discuss them with the SIGs that own that component.

  > **Reviewer:** We cannot push work onto other SIGs and expect that they will do it on the basis that it's important for sig-instrumentation. We should scope the work to a set of components that will allow it to be beneficial and finishable by sig-instrumentation.
  >
  > **Author:** For our logging implementation, I think we can just open issues for each component and discuss them with the SIGs that own that component. Do you think this plan is "pushing work onto other SIGs"?
  >
  > **Reviewer:** It's generally easier to target SIGs individually, since KEPs can easily balloon (causing unnecessary bikeshedding). Insofar as practically applying @serathius' suggestion, it may make sense to separate 'core functionality', i.e. library code for tracing, from the component integration story.

- After getting agreement, use this KEP's feature to change the source code that outputs each log so that the metadata is added to those logs.
  Please note that this KEP alone does not change the log format (it does not add metadata to logs).

## Motivation

Tracking logs across Kubernetes components for a specific user operation and its objects is very hard today.
Logs basically have to be matched up using timestamps and object names as hints.
If multiple users issue many API requests at the same time, it becomes very difficult to track the related logs across each Kubernetes component's log.

> **Reviewer:** I understand that this is a currently unsolved problem, but could you provide some info on why this problem is worth solving? Why is tracking a user operation important? Examples: debugging, auditing, reproducing problems?

### Goals

- Implement a method that propagates the new logging metadata across K8s components
- Design and implement so as not to interfere with the [Tracing KEP](https://github.com/kubernetes/enhancements/pull/1458)
  - e.g. the implementation of initialtraceid, adding traceid to an object annotation via a mutating webhook, etc.

### Non-Goals

- Adding the new logging metadata to actual K8s component logs

  > **Reviewer:** If the goal of this KEP is introducing trace information into logs, then we skip this process. If the main benefit of the KEP is introducing this information, then the KEP should include this work, possibly in reduced scope, but enough to provide a benefit. For example, the structured logging KEP's goal was not to rewrite all logs, but to provide an interface and use it in enough places to provide benefit (25 log line changes were enough to improve 99.9% of log output). Maybe the same approach would work here?
  >
  > **Author:** As you said, we are now considering separating tracing and logging. If we separate them, I think "Add new logging metadata into actual K8s component logs" is a non-goal. In the future, we want to realize this work. Thank you for your suggestion about the method you used in structured logging; maybe it will help our work.

  - This task will be done by opening issues after completing this KEP
- Centrally managing the logs of each Kubernetes component by Request-ID (this can be realized with existing OSS such as Kibana, so there is no need to implement it in Kubernetes components)

## Proposal

### User Stories (Optional)

#### Story 1

A suspicious user operation (e.g. unknown pod operations) or cluster processing (e.g. unexpected pod migration to another node) is detected.
Users want to get the whole picture and the root cause.
As part of the investigation, it may be necessary to scrutinize the relevant logs of each component in order to reconstruct the series of cluster processing steps.
Without this log tracking feature it takes a long time to scrutinize the relevant logs, because component logs are independent of each other, and it is difficult to find related logs and link them.

This is similar to [Auditing](https://kubernetes.io/docs/tasks/debug-application-cluster/audit/), except for the following points:

- Audit only collects information about HTTP requests sent and received by kube-apiserver, so it can't track the internal work of each component.
- Audit logs can't be associated with the logs of a user operation (a kubectl operation), because the auditID is different for each HTTP request.

> **Reviewer:** I think this was mentioned before, but one other difference is that audit logging is designed to be tamper-proof. The design we have been discussing (object annotations) would make it easy for an attacker to overwrite any metadata associated with objects they are messing with. A better story that wouldn't imply security guarantees might be an accidental deployment that causes problems: "AHH, I did something and I need to know what it broke."

#### Story 2

A PV fails to attach to a pod.
Prerequisite: it has been confirmed that the PV was created successfully.
In this case, volume creation on the storage side succeeded, so the mount process into the pod's container may have failed.
In order to identify the cause, it is necessary to look for the problem area while checking the component (kubelet) log as well as the node's syslog and mount-related settings.

This log tracking feature is useful to identify the logs related to a specific user operation and cluster processing, and can reduce investigation cost in such cases.

> **Reviewer:** This story describes the need to identify all logs related to a PV across multiple Kubernetes components. Is object tracing needed in such a case? I think having an annotation/label referencing the PV name would be enough to easily find the kubelet logs referring to the PV, wouldn't it? Adding trace information would allow us to separate requests from each other, but I don't think that's needed in this case.

### Summary of Cases

- Given a component log (such as an error log), find the API request that caused this (error) log.

  > **Reviewer:** I don't think you can get back the API request this way; you will only get a request id. Possibly it would be useful to be able to correlate component logs to audit logs, which store the original request content.
  >
  > **Author:** I'm sorry, the current sentence seems to have been misleading.

- Given an API request (such as a suspicious API request), find the resulting component logs.

### Logging metadata

We use three logging metadata fields and propagate them to each K8s component by using OpenTelemetry.
OpenTelemetry has SpanContext and [Baggage](https://github.com/open-telemetry/opentelemetry-specification/blob/master/specification/baggage/api.md), which are used for propagation between K8s components.

| metadata name | characteristics |
| ------ | ------ |
| traceid | We use SpanContext.TraceID as traceid.<br>traceid spans a user request.<br>traceid is unique per user request. |
| spanid | We use SpanContext.SpanID as spanid.<br>spanid spans a controller action.<br>spanid is unique per controller action. |
| initialtraceid | We add a new ID (InitialTraceID) to SpanContext.<br>We use [Baggage](https://github.com/open-telemetry/opentelemetry-specification/blob/master/specification/baggage/api.md) to propagate initialtraceid.<br>We use the UID of the root object as initialtraceid.<br>initialtraceid spans the entire object lifecycle.<br>initialtraceid is unique for a set of related objects. |

All three IDs come into existence at object creation and die with object deletion.
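
As a concrete illustration of how these three fields could surface in component code, here is a minimal sketch of reading them from a request context, written against the current OpenTelemetry Go API (the KEP text references an older, pre-1.0 API surface). The `initial-trace-id` baggage key is a hypothetical name; the KEP does not fix one.

```go
package logtracking

import (
	"context"

	"go.opentelemetry.io/otel/baggage"
	"go.opentelemetry.io/otel/trace"
)

// IDsFromContext pulls the three logging metadata fields out of a context:
// traceid and spanid come from the active SpanContext, while initialtraceid
// rides alongside them in Baggage.
func IDsFromContext(ctx context.Context) (traceID, spanID, initialTraceID string) {
	sc := trace.SpanContextFromContext(ctx)
	if sc.IsValid() {
		traceID = sc.TraceID().String()
		spanID = sc.SpanID().String()
	}
	// Hypothetical baggage key; the KEP does not define the exact name.
	initialTraceID = baggage.FromContext(ctx).Member("initial-trace-id").Value()
	return
}
```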

## Design Details

### Prerequisite

We need to consider three cases:

- Case 1: Requests from kubectl that create an object

  > **Reviewer:** I read the section below, but I can't tell why it matters if a request comes from kubectl or not. What behavior changes for requests from kubectl vs. requests from something else?
  >
  > **Author:** I updated the "Design of ID propagation (controller)" section. It works differently depending on whether we use `kubectl` or not.
  >
  > **Reviewer:** The mention of kubectl sounds like a detail that can be skipped. If the mechanism is meant to work with any Kubernetes client, we should skip the mention of `kubectl`.

- Case 2: Requests from kubectl other than creating an object (e.g. updating, deleting)
- Case 3: Requests from controllers

The design below is based on the above three cases.

The following picture shows our design. *(Design overview figure not reproduced here.)*

We don't make any modifications to kubectl in this design.

### Design of ID propagation in apiserver

**1. Do othttp's original [Extract](https://pkg.go.dev/go.opentelemetry.io/otel/api/propagators#TraceContext.Extract)() to get the [SpanContext](https://pkg.go.dev/go.opentelemetry.io/otel/api/trace#SpanContext)**

- For a request from kubectl, the SpanContext is null; do [Start](https://github.com/open-telemetry/opentelemetry-go/blob/3a9f5fe15f50a35ad8da5c5396a9ed3bbb82360c/sdk/trace/tracer.go#L38)() to start a new SpanContext (new traceid and spanid)
- For a request from a controller, we can get a valid SpanContext (including traceid and spanid); do [Start](https://github.com/open-telemetry/opentelemetry-go/blob/3a9f5fe15f50a35ad8da5c5396a9ed3bbb82360c/sdk/trace/tracer.go#L38)() to update the SpanContext (new spanid)

**2. Chain the SpanContext and initialtraceid to `r.ctx`**

- We use `r.ctx` to propagate those IDs to the next handler

**3. Make up a new outgoing [request](#design-of-id-propagation-in-client-go)**
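The three steps above could look roughly like the following filter sketch. The handler name is hypothetical, and the post-1.0 OpenTelemetry Go API is used rather than the othttp package referenced above; this is a sketch of the idea, not the actual apiserver filter code.

```go
package filters

import (
	"net/http"

	"go.opentelemetry.io/otel/propagation"
	"go.opentelemetry.io/otel/trace"
)

// WithLogTracking is a hypothetical apiserver filter implementing steps 1-3.
func WithLogTracking(handler http.Handler, tracer trace.Tracer) http.Handler {
	prop := propagation.NewCompositeTextMapPropagator(
		propagation.TraceContext{}, // carries traceid/spanid
		propagation.Baggage{},      // carries initialtraceid
	)
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		// Step 1: Extract restores any incoming SpanContext and Baggage.
		ctx := prop.Extract(r.Context(), propagation.HeaderCarrier(r.Header))
		// Step 2: Start allocates a new spanid, and a new traceid as well
		// when the extracted SpanContext is empty (request from kubectl).
		ctx, span := tracer.Start(ctx, "apiserver.request")
		defer span.End()
		// Step 3: the enriched context flows to the next handler, which
		// builds outgoing requests from it (see the client-go section).
		handler.ServeHTTP(w, r.WithContext(ctx))
	})
}
```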

### Design of ID propagation (controller)

| APIs | file name |
| --- | --- |
| createPods() | pkg/controller/controller_utils.go |
| CreatePodsWithControllerRef() | pkg/controller/controller_utils.go<br>pkg/controller/replication/conversion.go<br>pkg/controller/daemon/daemon_controller.go |

> **Reviewer:** New context propagation library functions:
> `a. func WithObject(ctx context.Context, meta metav1.ObjectMeta) context.Context`
> Would attaching a logr instance which then contains that object as an additional key/value item also work? It's more flexible.
> The downside is that code then has to be converted to structured logging (but that's the goal anyway, right?) and that the object cannot be retrieved separately anymore (which may or may not be relevant).
>
> **Author:** Thanks for your comment. I want to know more details about your suggestion; maybe I did not understand it.
> Do you mean that we should add a logr instance to the library function? (Like `func WithObject(ctx context.Context, meta metav1.ObjectMeta, l logr.Logger) context.Context`.)
> The final goal is to output the tracking information in structured logging format. However, I assume that is out of scope for this KEP. Of course, this KEP and its implementation should be designed with structured logging in mind.
> I'm sorry if this question is off the mark.
>
> **Reviewer:** The goal is to have something like https://github.com/kubernetes-sigs/controller-runtime/blob/master/pkg/log/log.go in klog (perhaps as klog/kcontext/kcontext.go?), except that the `FromContext` function in klog falls back to the global klog instance wrapped by klogr, so a caller can always be sure that it gets a logger and thus doesn't need to check for nil.
> Then packages can gradually be converted from logging via klog to logging via a logr instance that they get with that `FromContext` function -> structured logging, with additional information attached to each log message by the caller. As an added bonus, the caller can decide to raise the log level for the call.
>
> > The final goal is to output the tracking information in structured logging format. However, I assume that it is out of scope in this KEP.
>
> That depends. If you are proposing that each log site should do something like "retrieve objects, log with klog.InfoS", then I think we should consider whether that is the best solution for the problem. We don't want to propose one solution now (this KEP) and then another solution later (a KEP for passing a logger via the context).
> As @dashpole said, it will be easier to discuss when you show how the proposed enhancement will be used by code in Kubernetes.
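
A sketch of the `FromContext` pattern the reviewer describes, with the fallback to the global klog instance; the package name and context key are assumptions:

```go
package kcontext

import (
	"context"

	"github.com/go-logr/logr"
	"k8s.io/klog/v2/klogr"
)

type loggerKey struct{}

// WithLogger stores a logger in the context for downstream callers.
func WithLogger(ctx context.Context, l logr.Logger) context.Context {
	return context.WithValue(ctx, loggerKey{}, l)
}

// FromContext returns the context's logger, falling back to the global klog
// instance wrapped by klogr, so callers never need to check for nil.
func FromContext(ctx context.Context) logr.Logger {
	if l, ok := ctx.Value(loggerKey{}).(logr.Logger); ok {
		return l
	}
	return klogr.New()
}
```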

> **Author:** From your comments so far, I feel that your main concern is how we modify the logs with the feature proposed in this KEP, right?
> I just want to clarify that in our initial vision, "how we change logs" is out of this KEP's scope. Following David's earlier suggestion, we decided to split it up and first focus on "bringing K8s a new feature which can generate and propagate unique IDs across different components and objects", and then propose changing logs after that, because that requires more consideration and discussion with SIGs.
> But I realize that what you commented on is worth considering; we will show some examples to answer your points, thanks.
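
For reference, the `WithObject` helper named at the top of this thread could be sketched as follows. Only the `WithObject` signature appears in the discussion; the key type and the retrieval function are assumptions.

```go
package kcontext

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

type objectKey struct{}

// WithObject attaches the metadata of the object being processed, so that
// downstream code (and eventually log calls) can pick it up.
func WithObject(ctx context.Context, meta metav1.ObjectMeta) context.Context {
	return context.WithValue(ctx, objectKey{}, meta)
}

// ObjectFrom retrieves the attached metadata, if any.
func ObjectFrom(ctx context.Context) (metav1.ObjectMeta, bool) {
	meta, ok := ctx.Value(objectKey{}).(metav1.ObjectMeta)
	return meta, ok
}
```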

> **Reviewer (on an outdated diff):** nit: helper functions that modify a context should also take a context as a parameter, e.g. `ctx = Foo(ctx, obj)`.
>
> **Author:** Thank you for your comment. I will fix it.

> **Reviewer (on an outdated diff):** We should do this by using the `WithPropagators()` option when using othttp in client-go.
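
A sketch of what that could look like via client-go's transport wrapping; otelhttp is the successor of the othttp package the comment names, and the exact wiring here is an assumption, not the KEP's settled design:

```go
package tracing

import (
	"net/http"

	"go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp"
	"go.opentelemetry.io/otel/propagation"
	"k8s.io/client-go/rest"
)

// WithTracing wraps the rest.Config transport so every outgoing apiserver
// request carries the trace context and baggage headers.
func WithTracing(cfg *rest.Config) *rest.Config {
	cfg.WrapTransport = func(rt http.RoundTripper) http.RoundTripper {
		return otelhttp.NewTransport(rt,
			otelhttp.WithPropagators(propagation.NewCompositeTextMapPropagator(
				propagation.TraceContext{}, // traceid/spanid headers
				propagation.Baggage{},      // initialtraceid baggage header
			)),
		)
	}
	return cfg
}
```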

> **Reviewer (on an outdated diff):** I think this needs some introduction here. Why do we need an admission controller at all? Reasons I can think of include:
>
> - Ease of use. Using client-go with a context.Context is easier than adding an annotation. The webhook takes care of writing the annotation.
> - Object-to-object context propagation. Without the mutating admission controller, we can only associate actions from a single object. With the mutating admission controller, the logging metadata would be added to objects modified by controllers of the initial object (e.g. metadata added to a deployment annotation would appear in pod logs).
>
> **Reviewer:** This should mention that the metadata comes from an annotation. That is probably the most important part of this proposal.
>
> **Author:** Understood, we will update it later.
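
To make the annotation-writing mechanism discussed above concrete, here is a hypothetical sketch of such a mutating webhook handler. The annotation key, the handler wiring, and where the ID value comes from are all assumptions, not the KEP's settled design.

```go
package webhook

import (
	"encoding/json"
	"net/http"
	"strings"

	admissionv1 "k8s.io/api/admission/v1"
)

// Hypothetical annotation key; the KEP does not fix a name.
const annotationKey = "tracking.example.io/initial-trace-id"

// Mutate copies a tracking ID into an object annotation so that later
// controller actions on the object can pick it up.
func Mutate(w http.ResponseWriter, r *http.Request) {
	var review admissionv1.AdmissionReview
	if err := json.NewDecoder(r.Body).Decode(&review); err != nil {
		http.Error(w, err.Error(), http.StatusBadRequest)
		return
	}
	// In a real webhook the ID would be taken from the request's trace
	// headers or baggage; a placeholder value stands in here.
	// JSON Pointer requires '~' and '/' in keys to be escaped.
	escaped := strings.NewReplacer("~", "~0", "/", "~1").Replace(annotationKey)
	patch, _ := json.Marshal([]map[string]interface{}{{
		"op":    "add",
		"path":  "/metadata/annotations/" + escaped,
		"value": "initial-trace-id-from-request-context",
	}})
	pt := admissionv1.PatchTypeJSONPatch
	review.Response = &admissionv1.AdmissionResponse{
		UID:       review.Request.UID,
		Allowed:   true,
		Patch:     patch,
		PatchType: &pt,
	}
	json.NewEncoder(w).Encode(review)
}
```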