@@ -81,7 +81,11 @@ SIG Architecture for cross-cutting KEPs).
81
81
- [ Version Skew Strategy] ( #version-skew-strategy )
82
82
- [ Production Readiness Review Questionnaire] ( #production-readiness-review-questionnaire )
83
83
- [ Feature Enablement and Rollback] ( #feature-enablement-and-rollback )
84
+ - [ Rollout, Upgrade and Rollback Planning] ( #rollout-upgrade-and-rollback-planning )
85
+ - [ Monitoring Requirements] ( #monitoring-requirements )
86
+ - [ Dependencies] ( #dependencies )
84
87
- [ Scalability] ( #scalability )
88
+ - [ Troubleshooting] ( #troubleshooting )
85
89
- [ Implementation History] ( #implementation-history )
86
90
- [ Drawbacks] ( #drawbacks )
87
91
- [ Alternatives] ( #alternatives )
@@ -280,6 +284,7 @@ Unit, integration, and end-to-end tests will be added to test that:
280
284
### Graduation Criteria
281
285
282
286
#### Alpha -> Beta Graduation
287
+ * Metrics providing observability into the Job controller are available
283
288
* Implemented feedback from alpha testers
284
289
285
290
#### Beta -> GA Graduation
@@ -383,80 +388,91 @@ field.
383
388
Jobs that have the flag set will be suspended, and new jobs or updates to existing
384
389
ones to the field will be persisted.
385
390
386
- * ** Are there any tests for feature enablement/disablement?** No.
387
-
388
- <!-- Uncomment when targeting beta graduation
391
+ * ** Are there any tests for feature enablement/disablement?** Yes. Integration
392
+ tests exhaustively exercise switching between feature enablement states
393
+ while the feature is in use. Unit tests and end-to-end tests cover feature
394
+ enablement too.
389
395
390
396
### Rollout, Upgrade and Rollback Planning
391
397
392
398
_ This section must be completed when targeting beta graduation to a release._
393
399
394
- * **How can a rollout fail? Can it impact already running workloads?**
395
- Try to be as paranoid as possible - e.g., what if some components will restart
396
- mid-rollout?
397
-
398
- * **What specific metrics should inform a rollback?**
399
-
400
- * **Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?**
400
+ * ** How can a rollout fail? Can it impact already running workloads?** Existing
401
+ Jobs that did not use this feature in alpha cannot be affected by a
402
+ rollout. In workloads using the feature in an older version, suspended
403
+ Jobs may inadvertently be resumed (or Jobs may be inadvertently suspended) if
404
+ there are storage-related issues arising from components crashing
405
+ mid-rollout.
406
+
407
+ * ** What specific metrics should inform a rollback?** ` job_sync_duration_seconds `
408
+ and ` job_sync_total ` should be observed. Unexpected spikes in these metrics with
409
+ labels ` result=error ` and ` action=pods_deleted ` are potential indicators
410
+ that:
411
+ 1. Job suspension is producing errors in the Job controller,
412
+ 1. Jobs are getting suspended when they shouldn't be, or
413
+ 1. Job sync latency is high when Jobs are suspended.
414
+ While the above list isn't exhaustive, these are signals in favour of a rollback.
415
+
416
+ * ** Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path
417
+ tested?** <!-- I'll answer this after implementation.
401
418
Describe manual testing that was done and the outcomes.
402
419
Longer term, we may want to require automated upgrade/rollback tests, but we
403
- are missing a bunch of machinery and tooling and can't do that now.
420
+ are missing a bunch of machinery and tooling and can't do that now. -->
404
421
405
- * **Is the rollout accompanied by any deprecations and/or removals of features, APIs,
406
- fields of API types, flags, etc.?**
407
- Even if applying deprecation policies, they may still surprise some users.
422
+ * ** Is the rollout accompanied by any deprecations and/or removals of features,
423
+ APIs, fields of API types, flags, etc.?** No.
408
424
409
425
### Monitoring Requirements
410
426
411
427
_ This section must be completed when targeting beta graduation to a release._
412
428
413
429
* ** How can an operator determine if the feature is in use by workloads?**
414
- Ideally, this should be a metric. Operations against the Kubernetes API (e.g.,
415
- checking if there are objects with field X set) may be a last resort. Avoid
416
- logs or events for this purpose.
417
-
418
- * **What are the SLIs (Service Level Indicators) an operator can use to determine
419
- the health of the service?**
420
- - [ ] Metrics
421
- - Metric name:
422
- - [Optional] Aggregation method:
430
+ The ` .spec.suspend ` field is set to true on Jobs that use the feature. The status conditions of a
431
+ Job can also be used to determine whether a Job is using the feature (look
432
+ for a condition of type "Suspended").
433
+
434
+ * ** What are the SLIs (Service Level Indicators) an operator can use to
435
+ determine the health of the service?**
436
+ - [x] Metrics
437
+ - Metric name: The metrics ` job_sync_duration_seconds ` and ` job_sync_total `
438
+ get a new label named ` action ` to allow operators to filter Job sync
439
+ latency and error rate, respectively, by the action performed. There are
440
+ four mutually-exclusive values possible for this label:
441
+ - ` reconciling ` when the Job's pod creation/deletion expectations are
442
+ unsatisfied and the controller is waiting for issued Pod
443
+ creation/deletions to complete.
444
+ - ` tracking ` when the Job's pod creation/deletion expectations are
445
+ satisfied and the number of active Pods matches expectations (i.e. no
446
+ pod creation/deletions issued in this sync). This is expected to be
447
+ the action in most of the syncs.
448
+ - ` pods_created ` when the controller creates Pods. This can happen
449
+ when the number of active Pods is less than the wanted Job
450
+ parallelism.
451
+ - ` pods_deleted ` when the controller deletes Pods. This can happen if a
452
+ Job is suspended or if the number of active Pods is more than
453
+ parallelism.
454
+ Each sample of the two metrics will have exactly one of the above values
455
+ for the ` action ` label.
423
456
- Components exposing the metric:
424
- - [ ] Other (treat as last resort)
425
- - Details:
426
-
427
- * **What are the reasonable SLOs (Service Level Objectives) for the above SLIs?**
428
- At a high level, this usually will be in the form of "high percentile of SLI
429
- per day <= X". It's impossible to provide comprehensive guidance, but at the very
430
- high level (needs more precise definitions) those may be things like:
431
- - per-day percentage of API calls finishing with 5XX errors <= 1%
432
- - 99% percentile over day of absolute value from (job creation time minus expected
433
- job creation time) for cron job <= 10%
434
- - 99,9% of /health requests per day finish with 200 code
435
-
436
- * **Are there any missing metrics that would be useful to have to improve observability
437
- of this feature?**
438
- Describe the metrics themselves and the reasons why they weren't added (e.g., cost,
439
- implementation difficulties, etc.).
457
+ - kube-controller-manager
458
+
459
+ * ** What are the reasonable SLOs (Service Level Objectives) for the above
460
+ SLIs?**
461
+ - per-day percentage of ` job_sync_total ` with labels ` result=error ` and
462
+ ` action=pods_deleted ` <= 1%
463
+ - 99th percentile over a day for ` job_sync_duration_seconds ` with label
464
+ ` action=pods_deleted ` is <= 15s, assuming a client-side QPS limit of 50
465
+ calls per second
466
+
467
+ * ** Are there any missing metrics that would be useful to have to improve
468
+ observability of this feature?** No.
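
As a rough illustration of the `action` label and the error-rate SLO described above, here is a minimal Python sketch. It is not the Job controller's Go implementation. The function names, arguments, and the precedence order between the four values are assumptions made for this example, as is the `success` value used for the `result` label in the usage note. The sketch also reads the SLO as "errors with `action=pods_deleted` divided by all syncs with `action=pods_deleted`", which is one possible interpretation.

```python
from typing import Mapping, Tuple

# The four label values come from the KEP text; everything else here
# (names, arguments, precedence order) is hypothetical.
ACTIONS = ("reconciling", "tracking", "pods_created", "pods_deleted")


def sync_action(expectations_satisfied: bool, created: int, deleted: int) -> str:
    """Pick exactly one `action` value for a single Job sync."""
    if not expectations_satisfied:
        # Still waiting for previously issued Pod creations/deletions.
        return "reconciling"
    if deleted > 0:
        return "pods_deleted"
    if created > 0:
        return "pods_created"
    # Expectations satisfied and nothing issued in this sync.
    return "tracking"


def error_ratio(counts: Mapping[Tuple[str, str], float], action: str) -> float:
    """Share of syncs for `action` that ended with result=error.

    `counts` maps (action, result) to the increase of job_sync_total
    over the observation window, for example one day.
    """
    total = sum(n for (a, _), n in counts.items() if a == action)
    errors = sum(n for (a, r), n in counts.items() if a == action and r == "error")
    return errors / total if total else 0.0


def meets_error_slo(counts: Mapping[Tuple[str, str], float], limit: float = 0.01) -> bool:
    """Check the 1% error-rate objective for action=pods_deleted."""
    return error_ratio(counts, "pods_deleted") <= limit
```

For example, with one error among 200 `pods_deleted` syncs in a day, `error_ratio` yields 0.005 and `meets_error_slo` returns `True`.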
440
469
441
470
### Dependencies
442
471
443
472
_ This section must be completed when targeting beta graduation to a release._
444
473
445
474
* ** Does this feature depend on any specific services running in the cluster?**
446
- Think about both cluster-level services (e.g. metrics-server) as well
447
- as node-level agents (e.g. specific version of CRI). Focus on external or
448
- optional services that are needed. For example, if this feature depends on
449
- a cloud provider API, or upon an external software-defined storage or network
450
- control plane.
451
-
452
- For each of these, fill in the following—thinking about running existing user workloads
453
- and creating new ones, as well as about cluster-level services (e.g. DNS):
454
- - [Dependency name]
455
- - Usage description:
456
- - Impact of its outage on the feature:
457
- - Impact of its degraded performance or high-error rates on the feature:
458
-
459
- -->
475
+ No. The feature is restricted to kube-apiserver and kube-controller-manager.
460
476
461
477
### Scalability
462
478
@@ -480,8 +496,6 @@ _This section must be completed when targeting beta graduation to a release._
480
496
* ** Will enabling / using this feature result in non-negligible increase of
481
497
resource usage (CPU, RAM, disk, IO, ...) in any components?** No.
482
498
483
- <!-- Uncomment when targeting beta graduation
484
-
485
499
### Troubleshooting
486
500
487
501
The Troubleshooting section currently serves the ` Playbook ` role. We may consider
@@ -491,29 +505,33 @@ details). For now, we leave it here.
491
505
_ This section must be completed when targeting beta graduation to a release._
492
506
493
507
* ** How does this feature react if the API server and/or etcd is unavailable?**
494
-
495
- * **What are other known failure modes?**
496
- For each of them, fill in the following information by copying the below template:
497
- - [Failure mode brief description]
498
- - Detection: How can it be detected via metrics? Stated another way:
499
- how can an operator troubleshoot without logging into a master or worker node?
500
- - Mitigations: What can be done to stop the bleeding, especially for already
501
- running user workloads?
502
- - Diagnostics: What are the useful log messages and their required logging
503
- levels that could help debug the issue?
504
- Not required until feature graduated to beta.
505
- - Testing: Are there any tests for failure mode? If not, describe why.
506
-
507
- * **What steps should be taken if SLOs are not being met to determine the problem?**
508
-
509
- -->
508
+ Updates to suspend or resume a Job will not work. The controller will not be
509
+ able to create or delete Pods. Events, logs, and status conditions for Jobs
510
+ will not be updated to reflect their suspended status.
511
+
512
+ * ** What are other known failure modes?** None. The API server, etcd, and the
513
+ controller manager are the only possible points of failure.
514
+
515
+ * ** What steps should be taken if SLOs are not being met to determine the
516
+ problem?**
517
+ - Verify that kube-apiserver and etcd are healthy. If not, the Job controller
518
+ cannot operate, so you must fix those problems first.
519
+ - Check whether ` job_sync_total ` is unexpectedly high for ` result=error ` and
520
+ ` action=pods_deleted ` in comparison to other actions.
521
+ - Check whether ` job_sync_duration_seconds ` is noticeably larger for
522
+ ` action=pods_deleted ` in comparison to the other actions.
523
+ - If control plane components are starved for CPU, which is a potential
524
+ reason behind Job sync latency spikes, consider increasing the control
525
+ plane's resources.
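
The comparison of `action=pods_deleted` against the other actions in the steps above could be sketched as follows. This is only an illustrative Python helper, not part of any Kubernetes tooling. The function name and the `factor` threshold of 2.0 are assumptions, and the input is expected to be a per-action value the operator has already computed (for example an error count or a mean sync duration).

```python
from typing import Mapping


def pods_deleted_stands_out(by_action: Mapping[str, float], factor: float = 2.0) -> bool:
    """Return True when the pods_deleted value exceeds `factor` times
    the largest value seen for any other action.

    `by_action` maps an `action` label value to a number such as an
    error count or a mean sync duration in seconds.
    """
    target = by_action.get("pods_deleted")
    others = [v for a, v in by_action.items() if a != "pods_deleted"]
    if target is None or not others:
        # Nothing to compare against, so no conclusion can be drawn.
        return False
    return target > factor * max(others)
```

A result of `True` suggests that suspension-related syncs are the likely source of the regression. A result of `False` points toward a problem that affects all actions, such as control plane resource starvation.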
510
526
511
527
[ supported limits ] : https://git.k8s.io/community//sig-scalability/configs-and-limits/thresholds.md
512
528
[ existing SLIs/SLOs ] : https://git.k8s.io/community/sig-scalability/slos/slos.md#kubernetes-slisslos
513
529
514
530
## Implementation History
515
531
516
532
2021-02-01: Initial KEP merged, alpha targeted for 1.21
533
+ 2021-03-08: Implementation merged in 1.21 with feature gate disabled by default
534
+ 2021-04-22: KEP updated for beta graduation in 1.22
517
535
518
536
## Drawbacks
519
537