Commit 1515af5

Add unresolved comments, and annotated unresolved with target stage
Signed-off-by: Laura Lorenz <[email protected]>
1 parent 9eacadb commit 1515af5

1 file changed, 21 additions and 17 deletions

keps/sig-node/4603-tune-crashloopbackoff/README.md

@@ -528,6 +528,10 @@ of ~5 QPS when deploying 110 mass crashing pods for our tests, even with
 instantly crashing pods and instantaneously restarting CrashLoopBackOff behavior,
 `/pods` API requests quickly normalized to ~2 QPS. In the same tests, runtime
 CPU usage increased by x10 and the API server CPU usage increased by 2x.
+<<[UNRESOLVED non blocking]If you were testing this on a small cluster without a lot of
+additional load, the 2x increase in apiserver cpu usage is probably not a
+particularly useful metric. Might be worth mentioning the raw numbers here
+instead.>> <<[/UNRESOLVED]>>
 
 For both of these changes, by passing these changes through the existing
 SIG-scalability tests, while pursuing manual and more detailed periodic
@@ -587,8 +591,8 @@ excess restarts every 5 minutes after that; each crashing pod would be
 contributing an excess of ~1550 pod state transition API requests, and fully
 saturated node with a full 110 crashing pods would be adding 170,500 new pod
 transition API requests every five minutes, which is an excess of ~568
-requests/10s. <<[!UNRESOLVED kubernetes default for the kubelet client rate
-limit and how this changes by machine size]>> <<[UNRESOLVED]>>
+requests/10s. <<[!UNRESOLVED non blocking: kubernetes default for the kubelet
+client rate limit and how this changes by machine size]>> <<[UNRESOLVED]>>
 
 
 ## Design Details
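
The request arithmetic in the hunk above can be checked with a short calculation. This is a minimal sketch that takes the ~1550 excess requests per pod, the 110 pods per node, and the five-minute window directly from the text; the variable names are illustrative and not part of the KEP. Note that dividing the per-node total by the 300-second window gives roughly 568 requests per second, so the "requests/10s" unit in the text may be worth checking against the window the author intended.

```python
# Figures taken from the KEP text; the names are illustrative only.
EXCESS_REQUESTS_PER_POD = 1550  # excess pod state transition API requests per crashing pod
PODS_PER_NODE = 110             # a fully saturated node
WINDOW_SECONDS = 5 * 60         # the five-minute window used in the text

# Total excess requests contributed by one saturated node in one window.
excess_per_node = EXCESS_REQUESTS_PER_POD * PODS_PER_NODE

# The same load expressed as a rate (assumes the load is spread evenly).
excess_per_second = excess_per_node / WINDOW_SECONDS
excess_per_10s = excess_per_second * 10

print(f"excess requests per node per window: {excess_per_node:,}")
print(f"excess requests per second: {excess_per_second:.1f}")
print(f"excess requests per 10 seconds: {excess_per_10s:.1f}")
```
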
@@ -866,7 +870,7 @@ behaviors common to all pod restarts"](code-diagram-for-restarts.png "Kubelet
 and Container Runtime restart code paths")
 
 ```
-<<[UNRESOLVED answer these question from original PR]>>
+<<[UNRESOLVED non blocking answer these question from original PR or make new bugs]>>
 >Does this [old container cleanup using containerd] include cleaning up the image filesystem? There might be room for some optimization here, if we can reuse the RO layers.
 to answer question: looks like it is per runtime. need to check about leasees. also part of the value of this is to restart the sandbox.
 ```
@@ -1089,11 +1093,9 @@ extending the production code to implement this enhancement.
 -->
 
 
-- <<[UNRESOLVED whats up with this]>>
-`kubelet/kuberuntime/kuberuntime_manager_test`: **could not find a successful
+- `kubelet/kuberuntime/kuberuntime_manager_test`: **could not find a successful
 coverage run on
 [prow](https://prow.k8s.io/view/gs/kubernetes-jenkins/logs/ci-kubernetes-coverage-unit/1800947623675301888)**
-<<[/UNRESOLVED]>>
 
 ##### Integration tests
 
@@ -1165,6 +1167,8 @@ feature gates set as per the [Conflict Resolution](#conflict-resolution) policy
 - Test proving `KubeletConfiguration` objects will silently drop unrecognized
 fields in the `config.validation_test` package
 ([ref](https://github.com/kubernetes/kubernetes/blob/master/pkg/kubelet/apis/config/validation/validation_test.go)).
+- <<[UNRESOLVED non blocking]>>Is this also the expected behavior when the feature gate
+is disabled?<<[/UNRESOLVED]>>
 - Test coverage of proper requeue behavior; see
 https://github.com/kubernetes/kubernetes/issues/123602
 - Actually fix https://github.com/kubernetes/kubernetes/issues/123602 if this
@@ -1559,7 +1563,7 @@ rollout. Similarly, consider large clusters and how enablement/disablement
 will rollout across nodes.
 -->
 
-<<[UNRESOLVED]>> Fill out when targeting beta to a release. <<[/UNRESOLVED]>>
+<<[UNRESOLVED beta]>> Fill out when targeting beta to a release. <<[/UNRESOLVED]>>
 
 ###### What specific metrics should inform a rollback?
 
@@ -1598,15 +1602,15 @@ Longer term, we may want to require automated upgrade/rollback tests, but we
 are missing a bunch of machinery and tooling and can't do that now.
 -->
 
-<<[UNRESOLVED]>> Fill out when targeting beta to a release. <<[/UNRESOLVED]>>
+<<[UNRESOLVED beta]>> Fill out when targeting beta to a release. <<[/UNRESOLVED]>>
 
 ###### Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?
 
 <!--
 Even if applying deprecation policies, they may still surprise some users.
 -->
 
-<<[UNRESOLVED]>> Fill out when targeting beta to a release. <<[/UNRESOLVED]>>
+<<[UNRESOLVED beta]>> Fill out when targeting beta to a release. <<[/UNRESOLVED]>>
 
 ### Monitoring Requirements
 
@@ -1625,7 +1629,7 @@ checking if there are objects with field X set) may be a last resort. Avoid
 logs or events for this purpose.
 -->
 
-<<[UNRESOLVED]>> Fill out when targeting beta to a release. <<[/UNRESOLVED]>>
+<<[UNRESOLVED beta]>> Fill out when targeting beta to a release. <<[/UNRESOLVED]>>
 
 ###### How can someone using this feature know that it is working for their instance?
 
@@ -1638,7 +1642,7 @@ and operation of this feature.
 Recall that end users cannot usually observe component logs or access metrics.
 -->
 
-<<[UNRESOLVED]>> Fill out when targeting beta to a release.
+<<[UNRESOLVED beta]>> Fill out when targeting beta to a release.
 - [ ] Events
 - Event Reason:
 - [ ] API .status
@@ -1666,15 +1670,15 @@ These goals will help you determine what you need to measure (SLIs) in the next
 question.
 -->
 
-<<[UNRESOLVED]>> Fill out when targeting beta to a release. <<[/UNRESOLVED]>>
+<<[UNRESOLVED beta]>> Fill out when targeting beta to a release. <<[/UNRESOLVED]>>
 
 ###### What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?
 
 <!--
 Pick one more of these and delete the rest.
 -->
 
-<<[UNRESOLVED]>> Fill out when targeting beta to a release.
+<<[UNRESOLVED beta]>> Fill out when targeting beta to a release.
 
 - [ ] Metrics
 - Metric name:
@@ -1692,7 +1696,7 @@ Describe the metrics themselves and the reasons why they weren't added (e.g., co
 implementation difficulties, etc.).
 -->
 
-<<[UNRESOLVED]>> Fill out when targeting beta to a release. <<[/UNRESOLVED]>>
+<<[UNRESOLVED beta]>> Fill out when targeting beta to a release. <<[/UNRESOLVED]>>
 
 ### Dependencies
 
@@ -1717,7 +1721,7 @@ and creating new ones, as well as about cluster-level services (e.g. DNS):
 - Impact of its degraded performance or high-error rates on the feature:
 -->
 
-<<[UNRESOLVED]>> Fill out when targeting beta to a release. <<[/UNRESOLVED]>>
+<<[UNRESOLVED beta]>> Fill out when targeting beta to a release. <<[/UNRESOLVED]>>
 
 ### Scalability
 
@@ -1855,7 +1859,7 @@ details). For now, we leave it here.
 
 ###### How does this feature react if the API server and/or etcd is unavailable?
 
-<<[UNRESOLVED]>> Fill out when targeting beta to a release. <<[/UNRESOLVED]>>
+<<[UNRESOLVED beta]>> Fill out when targeting beta to a release. <<[/UNRESOLVED]>>
 
 ###### What are other known failure modes?
 
@@ -1874,7 +1878,7 @@ For each of them, fill in the following information by copying the below templat
 
 ###### What steps should be taken if SLOs are not being met to determine the problem?
 
-<<[UNRESOLVED]>> Fill out when targeting beta to a release. <<[/UNRESOLVED]>>
+<<[UNRESOLVED beta]>> Fill out when targeting beta to a release. <<[/UNRESOLVED]>>
 
 ## Implementation History
 