# KEP-1961: Log tracking for K8s component log

<!-- toc -->
- [Release Signoff Checklist](#release-signoff-checklist)
- [Summary](#summary)
- [Motivation](#motivation)
- [Goals](#goals)
- [Non-Goals](#non-goals)
- [Proposal](#proposal)
- [User Stories (Optional)](#user-stories-optional)
- [Story 1](#story-1)
- [Story 2](#story-2)
- [Notes/Constraints/Caveats (Optional)](#notesconstraintscaveats-optional)
- [Risks and Mitigations](#risks-and-mitigations)
- [Design Details](#design-details)
- [Logging metadata](#logging-metadata)
- [Prerequisite](#prerequisite)
- [Design of ID propagation (incoming request to webhook)](#design-of-id-propagation-incoming-request-to-webhook)
- [Design of Mutating webhook](#design-of-mutating-webhook)
- [Design of ID propagation (controller)](#design-of-id-propagation-controller)
- [Test Plan](#test-plan)
- [Graduation Criteria](#graduation-criteria)
- [Upgrade / Downgrade Strategy](#upgrade--downgrade-strategy)
- [Version Skew Strategy](#version-skew-strategy)
- [Production Readiness Review Questionnaire](#production-readiness-review-questionnaire)
- [Feature Enablement and Rollback](#feature-enablement-and-rollback)
- [Rollout, Upgrade and Rollback Planning](#rollout-upgrade-and-rollback-planning)
- [Monitoring Requirements](#monitoring-requirements)
- [Dependencies](#dependencies)
- [Scalability](#scalability)
- [Troubleshooting](#troubleshooting)
- [Implementation History](#implementation-history)
- [Drawbacks](#drawbacks)
- [Alternatives](#alternatives)
- [Infrastructure Needed (Optional)](#infrastructure-needed-optional)
<!-- /toc -->

## Release Signoff Checklist

Items marked with (R) are required *prior to targeting to a milestone / release*.

- [ ] (R) Enhancement issue in release milestone, which links to KEP dir in [kubernetes/enhancements] (not the initial KEP PR)
- [ ] (R) KEP approvers have approved the KEP status as `implementable`
- [ ] (R) Design details are appropriately documented
- [ ] (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input
- [ ] (R) Graduation criteria is in place
- [ ] (R) Production readiness review completed
- [ ] Production readiness review approved
- [ ] "Implementation History" section is up-to-date for milestone
- [ ] User-facing documentation has been created in [kubernetes/website], for publication to [kubernetes.io]
- [ ] Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes

## Summary

This KEP proposes a method for adding three new unique logging metadata fields to Kubernetes component logs.
This makes it easier to identify the logs related to a specific user request (such as `kubectl apply`) or object (such as a Pod or Deployment).
It is expected to greatly reduce investigation cost when troubleshooting.

### Three new unique logging metadata fields

We use three metadata fields. They have different characteristics and support troubleshooting from different perspectives.

| metadata name | feature |
| ------ | ------ |
| trace-id | spans a user request; unique per user request |
| span-id | spans a controller action; unique per controller action |
| initial-trace-id | spans the entire object lifecycle; unique for related objects |

### Note

This KEP covers **how** a component can add metadata to its logs. To actually add metadata to Kubernetes component logs, the following additional steps are necessary:
- Open issues for each component, and discuss them with the SIGs that own that component.
- After reaching agreement, use this KEP's feature to change the log-emitting source code so that metadata is added to those logs.

Please note that this KEP alone does not change the log format (it does not add metadata to logs).

## Motivation

Tracking logs across Kubernetes components related to a specific user operation and its objects is very tough work.
It is necessary to match logs, basically using timestamps and object names as hints.
If multiple users send many API requests at the same time, it is very difficult to track logs across the Kubernetes component logs.

### Goals

- Implement a method that propagates the new logging metadata across Kubernetes components
- Design and implement so as not to interfere with the [Tracing KEP](https://github.com/kubernetes/enhancements/pull/1458)
  - e.g. the implementation of initial-trace-id, adding trace-id to object annotations in the mutating webhook, etc.

### Non-Goals

- Add the new logging metadata to actual Kubernetes component logs
  - This task will be done by opening issues after this KEP is completed
- Centrally manage the logs of each Kubernetes component by request ID (this can be done with existing OSS such as Kibana, so there is no need to implement it in Kubernetes components)

## Proposal

<!--
This is where we get down to the specifics of what the proposal actually is.
This should have enough detail that reviewers can understand exactly what
you're proposing, but should not include things like API designs or
implementation. The "Design Details" section below is for the real
nitty-gritty.
-->

### User Stories (Optional)

- Given a component log (such as an error log), find the API request that caused it.
- Given an API request (such as a suspicious one), find the resulting component logs.

#### Story 1

A suspicious user operation (e.g. an unknown Pod operation) or cluster behavior (e.g. an unexpected Pod migration to another node) is detected.
Users want to understand the whole picture and the root cause.
As part of the investigation, it may be necessary to scrutinize the relevant logs of each component in order to reconstruct the series of cluster events.
Without this log tracking feature, scrutinizing the relevant logs takes a long time, because component logs are independent of each other, and it is difficult to find related logs and link them.

This is similar to [Auditing](https://kubernetes.io/docs/tasks/debug-application-cluster/audit/), except for the following points:

- Audit only collects information about HTTP requests sent and received by kube-apiserver, so it can't track the internal work of each component.
- Audit logs can't be associated with the logs of a user operation (a kubectl operation), because the auditID differs for each HTTP request.

#### Story 2

A PV fails to attach to a Pod.
Prerequisite: it has been confirmed that the PV was created successfully.
In this case, volume creation on the storage side succeeded, but the mount process into the Pod's container may have failed.
To identify the cause, it is necessary to look for the problem area while checking the component (kubelet) logs as well as the node's syslog and mount-related settings.

This log tracking feature helps identify the logs related to a specific user operation or cluster event, and can greatly reduce investigation cost in such cases.

### Notes/Constraints/Caveats (Optional)

TBD

### Risks and Mitigations

TBD

## Design Details

### Logging metadata

We use three logging metadata fields and propagate them across Kubernetes components by using OpenTelemetry.
OpenTelemetry provides SpanContext, which is used for propagation between Kubernetes components.

| metadata name | feature |
| ------ | ------ |
| trace-id | We use SpanContext.TraceID as trace-id.<br>trace-id spans a user request and is unique per user request. |
| span-id | We use SpanContext.SpanID as span-id.<br>span-id spans a controller action and is unique per controller action. |
| initial-trace-id | We add a new ID (InitialTraceID) to SpanContext and use SpanContext.InitialTraceID as initial-trace-id.<br>initial-trace-id spans the entire object lifecycle and is unique for related objects. |

All three IDs begin at object creation and end with object deletion.
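
For illustration only (the key names here are hypothetical, and this KEP itself does not define a log format), a structured component log line carrying all three IDs might look like:

```
I0908 10:15:42.123456       1 scheduler.go:123] "Pod scheduled" pod="default/nginx" traceID=0af7651916cd43dd8448eb211c80319c spanID=b7ad6b7169203331 initialTraceID=4bf92f3577b34da6a3ce929d0e0e4736
```

Searching for `initialTraceID=4bf9…` across component logs would then surface every log line related to this object's lifecycle.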

### Prerequisite

We need to consider three cases:
- Case 1: Requests from kubectl that create an object
- Case 2: Requests from kubectl other than creation of an object (e.g. update, delete)
- Case 3: Requests from controllers

The design below is based on these three cases.

### Design of ID propagation (incoming request to webhook)

**1. Incoming request to the apiserver from kubectl or a controller**
- A request from kubectl has no trace-id, span-id, or initial-trace-id in its header
- A request from a controller has trace-id, span-id, and initial-trace-id in its header

**2. Preprocessing handler (othttp handler)**
- 2.1 Do othttp's original Extract() and get the SpanContext
  - For a request from kubectl, the result is null (no trace-id, span-id, or initial-trace-id)
  - For a request from a controller, we get trace-id, span-id, and initial-trace-id
- 2.2 Create/Update the SpanContext
  - For a request from kubectl
    - Since we get no SpanContext, do StartSpan() to start a new trace (new trace-id and span-id)
    - The new SpanContext is saved in the request's context `r.ctx`
  - For a request from a controller
    - Since we get a SpanContext, do StartSpanWithRemoteParent() to update the SpanContext (new span-id)
    - The updated SpanContext is saved in the request's context `r.ctx`

**3. Creation handler**
- 3.1 Do our new Extract() to get the initial-trace-id from the request header into a golang ctx
  - For a request from kubectl, we can't get an initial-trace-id
  - For a request from a controller, we can
- 3.2 Get the SpanContext from `r.ctx` into the golang ctx

Notice that in this creation handler the request will be consumed, so we need the golang ctx to carry our information for propagation within the apiserver.

**4. Make a new request for sending to the webhook**
- 4.1 Call othttp's original Inject() to inject the trace-id and span-id from the golang ctx into the header
- 4.2 Call our new Inject() to inject the initial-trace-id from the golang ctx into the header
  - For a request from kubectl we have no initial-trace-id, so do nothing
  - For a request from a controller we can do this

The order of 4.1 and 4.2 does not matter.

### Design of Mutating webhook
Check the request's header:
- If there is an initial-trace-id, add trace-id, span-id, and initial-trace-id to the object's annotations (this is the case for requests from controllers).
- If there is no initial-trace-id, check the request's operation:
  - If the operation is create, copy the trace-id as the initial-trace-id, and add trace-id, span-id, and initial-trace-id to the annotations (this is the case for `kubectl create`).
  - If the operation is not create, add trace-id and span-id to the annotations (this is the case for kubectl requests other than create).
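
The webhook's decision table above can be sketched as a pure function. The annotation keys used here are hypothetical placeholders, not names defined by this KEP:

```go
package main

import "fmt"

// traceAnnotations decides which trace annotations the mutating webhook
// would set on an object, given the IDs from the request header and the
// admission operation. Annotation keys are illustrative only.
func traceAnnotations(traceID, spanID, initialTraceID, operation string) map[string]string {
	ann := map[string]string{
		"trace.kubernetes.io/trace-id": traceID,
		"trace.kubernetes.io/span-id":  spanID,
	}
	switch {
	case initialTraceID != "":
		// Request from a controller: propagate the existing lifecycle ID.
		ann["trace.kubernetes.io/initial-trace-id"] = initialTraceID
	case operation == "CREATE":
		// kubectl create: the trace-id seeds the object's lifecycle ID.
		ann["trace.kubernetes.io/initial-trace-id"] = traceID
	}
	// kubectl update/delete: only trace-id and span-id are annotated.
	return ann
}

func main() {
	fmt.Println(traceAnnotations("t1", "s1", "", "CREATE"))
	fmt.Println(traceAnnotations("t1", "s1", "", "UPDATE"))
	fmt.Println(traceAnnotations("t1", "s1", "i0", "UPDATE"))
}
```

Keeping this logic as a pure function of the header values makes the three request cases from the Prerequisite section easy to unit-test in isolation.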

### Design of ID propagation (controller)
When controllers create/update/delete an object A based on another object B, we propagate the context from B to A, e.g.:
```
ctx = traceutil.WithObject(ctx, objB)
err = r.KubeClient.CoreV1().Create(ctx, objA...)
```
We do this propagation across objects without adding traces to those components.

### Test Plan

TBD

### Graduation Criteria

TBD

#### Alpha -> Beta Graduation

TBD

#### Beta -> GA Graduation

TBD

#### Removing a Deprecated Flag

TBD

### Upgrade / Downgrade Strategy

TBD

### Version Skew Strategy

TBD

## Production Readiness Review Questionnaire

TBD

### Feature Enablement and Rollback

TBD

### Rollout, Upgrade and Rollback Planning

TBD

### Monitoring Requirements

TBD

### Dependencies

TBD

### Scalability

TBD

### Troubleshooting

TBD

## Implementation History

TBD

## Drawbacks

TBD

## Alternatives

TBD

## Infrastructure Needed (Optional)

TBD