Commit 6ae9f53

Merge pull request kubernetes#2646 from adtac/suspend-beta
suspended jobs: graduate to beta
2 parents 54e4141 + 6aae0b5 commit 6ae9f53

3 files changed: +96 -74 lines changed

Lines changed: 2 additions & 0 deletions
Original file line number | Diff line number | Diff line change
@@ -1,3 +1,5 @@
11
kep-number: 2232
22
alpha:
33
approver: "@wojtek-t"
4+
beta:
5+
approver: "@wojtek-t"

keps/sig-apps/2232-suspend-jobs/README.md

Lines changed: 89 additions & 71 deletions
@@ -81,7 +81,11 @@ SIG Architecture for cross-cutting KEPs).
8181
- [Version Skew Strategy](#version-skew-strategy)
8282
- [Production Readiness Review Questionnaire](#production-readiness-review-questionnaire)
8383
- [Feature Enablement and Rollback](#feature-enablement-and-rollback)
84+
- [Rollout, Upgrade and Rollback Planning](#rollout-upgrade-and-rollback-planning)
85+
- [Monitoring Requirements](#monitoring-requirements)
86+
- [Dependencies](#dependencies)
8487
- [Scalability](#scalability)
88+
- [Troubleshooting](#troubleshooting)
8589
- [Implementation History](#implementation-history)
8690
- [Drawbacks](#drawbacks)
8791
- [Alternatives](#alternatives)
@@ -280,6 +284,7 @@ Unit, integration, and end-to-end tests will be added to test that:
280284
### Graduation Criteria
281285

282286
#### Alpha -> Beta Graduation
287+
* Metrics providing observability into the Job controller are available
283288
* Implemented feedback from alpha testers
284289

285290
#### Beta -> GA Graduation
@@ -383,80 +388,91 @@ field.
383388
Jobs that have the flag set will be suspended, and new jobs or updates to existing
384389
ones to the field will be persisted.
385390

386-
* **Are there any tests for feature enablement/disablement?** No.
387-
388-
<!-- Uncomment when targeting beta graduation
391+
* **Are there any tests for feature enablement/disablement?** Yes. Integration
392+
tests have exhaustive testing switching between different feature enablement
393+
states whilst using the feature at the same time. Unit tests and end-to-end
394+
tests test feature enablement too.
389395

390396
### Rollout, Upgrade and Rollback Planning
391397

392398
_This section must be completed when targeting beta graduation to a release._
393399

394-
* **How can a rollout fail? Can it impact already running workloads?**
395-
Try to be as paranoid as possible - e.g., what if some components will restart
396-
mid-rollout?
397-
398-
* **What specific metrics should inform a rollback?**
399-
400-
* **Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?**
400+
* **How can a rollout fail? Can it impact already running workloads?** Impact
401+
to existing Jobs that previously didn't use this feature in alpha is
402+
impossible. In workloads using the feature in an older version, suspended
403+
Jobs may inadvertently be resumed (or Jobs may be inadvertently suspended) if
404+
there are storage-related issues arising from components crashing
405+
mid-rollout.
406+
407+
* **What specific metrics should inform a rollback?** `job_sync_duration_seconds`
408+
and `job_sync_total` should be observed. Unexpected spikes in the metric with
409+
labels `result=error` and `action=pods_deleted` are potentially an indicator
410+
that:
411+
1. Job suspension is producing errors in the Job controller,
412+
1. Jobs are getting suspended when they shouldn't be, or
413+
1. Job sync latency is high when Jobs are suspended.
414+
While the above list isn't exhaustive, these are signals in favour of a rollback.
415+
416+
* **Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path
417+
tested?** <!-- I'll answer this after implementation.
401418
Describe manual testing that was done and the outcomes.
402419
Longer term, we may want to require automated upgrade/rollback tests, but we
403-
are missing a bunch of machinery and tooling and can't do that now.
420+
are missing a bunch of machinery and tooling and can't do that now. -->
404421

405-
* **Is the rollout accompanied by any deprecations and/or removals of features, APIs,
406-
fields of API types, flags, etc.?**
407-
Even if applying deprecation policies, they may still surprise some users.
422+
* **Is the rollout accompanied by any deprecations and/or removals of features,
423+
APIs, fields of API types, flags, etc.?** No.
408424

409425
### Monitoring Requirements
410426

411427
_This section must be completed when targeting beta graduation to a release._
412428

413429
* **How can an operator determine if the feature is in use by workloads?**
414-
Ideally, this should be a metric. Operations against the Kubernetes API (e.g.,
415-
checking if there are objects with field X set) may be a last resort. Avoid
416-
logs or events for this purpose.
417-
418-
* **What are the SLIs (Service Level Indicators) an operator can use to determine
419-
the health of the service?**
420-
- [ ] Metrics
421-
- Metric name:
422-
- [Optional] Aggregation method:
430+
Jobs that use the feature have their `.spec.suspend` field set to true. The status conditions of a
431+
Job can also be used to determine whether a Job is using the feature (look
432+
for a condition of type "Suspended").
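
The two signals described in this answer can be sketched in Go. This is a minimal illustration, not code from the KEP or from Kubernetes: `Job` and `JobCondition` are simplified, hypothetical stand-ins for the real API types, and `usesSuspend` is an assumed helper name.

```go
package main

import "fmt"

// JobCondition and Job are simplified, hypothetical stand-ins for the
// batch/v1 API types; only the fields needed for this check are modelled.
type JobCondition struct {
	Type   string
	Status string // "True", "False" or "Unknown"
}

type Job struct {
	Name       string
	Suspend    *bool // mirrors .spec.suspend; nil means the field is unset
	Conditions []JobCondition
}

// usesSuspend reports whether a Job shows either signal: .spec.suspend set
// to true, or a status condition of type "Suspended" (with any status).
func usesSuspend(j Job) bool {
	if j.Suspend != nil && *j.Suspend {
		return true
	}
	for _, c := range j.Conditions {
		if c.Type == "Suspended" {
			return true
		}
	}
	return false
}

func main() {
	yes := true
	jobs := []Job{
		{Name: "suspended-now", Suspend: &yes},
		{Name: "resumed", Conditions: []JobCondition{{Type: "Suspended", Status: "False"}}},
		{Name: "never-suspended"},
	}
	for _, j := range jobs {
		fmt.Printf("%s uses suspend: %v\n", j.Name, usesSuspend(j))
	}
}
```

A Job that was suspended and later resumed may still carry the condition, so the sketch treats the presence of the condition as evidence of use regardless of its status.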
433+
434+
* **What are the SLIs (Service Level Indicators) an operator can use to
435+
determine the health of the service?**
436+
- [x] Metrics
437+
- Metric name: The metrics `job_sync_duration_seconds` and `job_sync_total`
438+
get a new label named `action` to allow operators to filter Job sync
439+
latency and error rate, respectively, by the action performed. There are
440+
four mutually-exclusive values possible for this label:
441+
- `reconciling` when the Job's pod creation/deletion expectations are
442+
unsatisfied and the controller is waiting for issued Pod
443+
creation/deletions to complete.
444+
- `tracking` when the Job's pod creation/deletion expectations are
445+
satisfied and the number of active Pods matches expectations (i.e. no
446+
pod creation/deletions issued in this sync). This is expected to be
447+
the action in most of the syncs.
448+
- `pods_created` when the controller creates Pods. This can happen
449+
when the number of active Pods is less than the wanted Job
450+
parallelism.
451+
- `pods_deleted` when the controller deletes Pods. This can happen if a
452+
Job is suspended or if the number of active Pods is more than
453+
parallelism.
454+
Each sample of the two metrics will have exactly one of the above values
455+
for the `action` label.
423456
- Components exposing the metric:
424-
- [ ] Other (treat as last resort)
425-
- Details:
426-
427-
* **What are the reasonable SLOs (Service Level Objectives) for the above SLIs?**
428-
At a high level, this usually will be in the form of "high percentile of SLI
429-
per day <= X". It's impossible to provide comprehensive guidance, but at the very
430-
high level (needs more precise definitions) those may be things like:
431-
- per-day percentage of API calls finishing with 5XX errors <= 1%
432-
- 99% percentile over day of absolute value from (job creation time minus expected
433-
job creation time) for cron job <= 10%
434-
- 99,9% of /health requests per day finish with 200 code
435-
436-
* **Are there any missing metrics that would be useful to have to improve observability
437-
of this feature?**
438-
Describe the metrics themselves and the reasons why they weren't added (e.g., cost,
439-
implementation difficulties, etc.).
457+
- kube-controller-manager
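
The four mutually exclusive `action` values listed above can be illustrated with a small Go sketch. The function `classifyAction` and its parameters are hypothetical and are not the Job controller's actual code; the sketch also assumes that a single sync issues either creations or deletions, never both.

```go
package main

import "fmt"

// classifyAction is a hypothetical helper that maps the outcome of one Job
// sync onto exactly one value of the `action` label.
//
// expectationsSatisfied: whether earlier Pod creations/deletions completed.
// created, deleted: number of Pods created or deleted in this sync.
func classifyAction(expectationsSatisfied bool, created, deleted int) string {
	switch {
	case !expectationsSatisfied:
		// Still waiting for previously issued creations/deletions.
		return "reconciling"
	case deleted > 0:
		// For example, the Job was suspended or has too many active Pods.
		return "pods_deleted"
	case created > 0:
		// Fewer active Pods than the wanted parallelism.
		return "pods_created"
	default:
		// Nothing to do; expected to be the most common case.
		return "tracking"
	}
}

func main() {
	fmt.Println(classifyAction(false, 0, 0))
	fmt.Println(classifyAction(true, 0, 0))
	fmt.Println(classifyAction(true, 3, 0))
	fmt.Println(classifyAction(true, 0, 2))
}
```

Because the `switch` returns from the first matching case, every sync receives exactly one label value, which matches the requirement that each sample carries exactly one `action`.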
458+
459+
* **What are the reasonable SLOs (Service Level Objectives) for the above
460+
SLIs?**
461+
- per-day percentage of `job_sync_total` with labels `result=error` and
462+
`action=pods_deleted` <= 1%
463+
- 99th percentile over a day for `job_sync_duration_seconds` with label
464+
`action=pods_deleted` is <= 15s, assuming a client-side QPS limit of 50
465+
calls per second
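
As a rough sketch, the two SLOs above could be checked as follows. The sketch rests on several assumptions that the KEP does not state: the error percentage is taken relative to all `action=pods_deleted` syncs in the day, the percentile is computed by nearest rank from raw duration samples (the real metric is a histogram, so production checks would work from its buckets), and all function names are hypothetical.

```go
package main

import (
	"fmt"
	"math"
	"sort"
)

// errorPercent returns the share of syncs that ended in error, in percent.
func errorPercent(errors, total float64) float64 {
	if total == 0 {
		return 0
	}
	return 100 * errors / total
}

// percentile returns the nearest-rank p-th percentile (0 < p <= 100) of the
// given samples, or 0 when there are no samples.
func percentile(samples []float64, p float64) float64 {
	if len(samples) == 0 {
		return 0
	}
	sorted := append([]float64(nil), samples...) // copy; leave the input untouched
	sort.Float64s(sorted)
	rank := int(math.Ceil(p * float64(len(sorted)) / 100))
	if rank < 1 {
		rank = 1
	}
	return sorted[rank-1]
}

// meetsSLO checks both objectives for action=pods_deleted over one day:
// error percentage <= 1% and 99th percentile sync duration <= 15s.
func meetsSLO(errors, total float64, durationsSeconds []float64) bool {
	return errorPercent(errors, total) <= 1 && percentile(durationsSeconds, 99) <= 15
}

func main() {
	durations := []float64{0.2, 0.4, 1.5, 3.0}
	fmt.Println("SLO met:", meetsSLO(1, 100, durations))
}
```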
466+
467+
* **Are there any missing metrics that would be useful to have to improve
468+
observability of this feature?** No.
440469

441470
### Dependencies
442471

443472
_This section must be completed when targeting beta graduation to a release._
444473

445474
* **Does this feature depend on any specific services running in the cluster?**
446-
Think about both cluster-level services (e.g. metrics-server) as well
447-
as node-level agents (e.g. specific version of CRI). Focus on external or
448-
optional services that are needed. For example, if this feature depends on
449-
a cloud provider API, or upon an external software-defined storage or network
450-
control plane.
451-
452-
For each of these, fill in the following—thinking about running existing user workloads
453-
and creating new ones, as well as about cluster-level services (e.g. DNS):
454-
- [Dependency name]
455-
- Usage description:
456-
- Impact of its outage on the feature:
457-
- Impact of its degraded performance or high-error rates on the feature:
458-
459-
-->
475+
Feature is restricted to kube-apiserver and kube-controller-manager.
460476

461477
### Scalability
462478

@@ -480,8 +496,6 @@ _This section must be completed when targeting beta graduation to a release._
480496
* **Will enabling / using this feature result in non-negligible increase of
481497
resource usage (CPU, RAM, disk, IO, ...) in any components?** No.
482498

483-
<!-- Uncomment when targeting beta graduation
484-
485499
### Troubleshooting
486500

487501
The Troubleshooting section currently serves the `Playbook` role. We may consider
@@ -491,29 +505,33 @@ details). For now, we leave it here.
491505
_This section must be completed when targeting beta graduation to a release._
492506

493507
* **How does this feature react if the API server and/or etcd is unavailable?**
494-
495-
* **What are other known failure modes?**
496-
For each of them, fill in the following information by copying the below template:
497-
- [Failure mode brief description]
498-
- Detection: How can it be detected via metrics? Stated another way:
499-
how can an operator troubleshoot without logging into a master or worker node?
500-
- Mitigations: What can be done to stop the bleeding, especially for already
501-
running user workloads?
502-
- Diagnostics: What are the useful log messages and their required logging
503-
levels that could help debug the issue?
504-
Not required until feature graduated to beta.
505-
- Testing: Are there any tests for failure mode? If not, describe why.
506-
507-
* **What steps should be taken if SLOs are not being met to determine the problem?**
508-
509-
-->
508+
Updates to suspend or resume a Job will not work. The controller will not be
509+
able to create or delete Pods. Events, logs, and status conditions for Jobs
510+
will not be updated to reflect their suspended status.
511+
512+
* **What are other known failure modes?** None. The API server, etcd, and the
513+
controller manager are the only possible points of failure.
514+
515+
* **What steps should be taken if SLOs are not being met to determine the
516+
problem?**
517+
- Verify that kube-apiserver and etcd are healthy. If not, the Job controller
518+
cannot operate, so you must fix those problems first.
519+
- Check whether `job_sync_total` is unexpectedly high for `result=error` and
520+
`action=pods_deleted` in comparison to other actions.
521+
- Check whether `job_sync_duration_seconds` is noticeably larger for
522+
`action=pods_deleted` in comparison to the other actions.
523+
- If control plane components are starved for CPU, which could be a potential
524+
reason behind Job sync latency spikes, consider increasing the control
525+
plane's resources.
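
The comparison of `action=pods_deleted` against the other actions, described in the steps above, might be sketched like this. The `syncStats` type, the `suspectSuspend` helper, and the `factor` threshold are all hypothetical; the KEP does not define what counts as "unexpectedly high" or "noticeably larger", so any threshold is an operator's choice.

```go
package main

import "fmt"

// syncStats is a hypothetical per-action summary derived from
// job_sync_total and job_sync_duration_seconds.
type syncStats struct {
	Errors      float64 // syncs with result=error
	Total       float64 // all syncs for this action
	MeanSeconds float64 // mean sync duration
}

func errorRatio(s syncStats) float64 {
	if s.Total == 0 {
		return 0
	}
	return s.Errors / s.Total
}

// suspectSuspend reports whether pods_deleted stands out: its error ratio or
// mean latency exceeds factor times the highest value seen among the other
// actions. The factor is an assumed, operator-chosen threshold.
func suspectSuspend(byAction map[string]syncStats, factor float64) bool {
	del, ok := byAction["pods_deleted"]
	if !ok || del.Total == 0 {
		return false
	}
	var maxRatio, maxSeconds float64
	for action, s := range byAction {
		if action == "pods_deleted" {
			continue
		}
		if r := errorRatio(s); r > maxRatio {
			maxRatio = r
		}
		if s.MeanSeconds > maxSeconds {
			maxSeconds = s.MeanSeconds
		}
	}
	return errorRatio(del) > factor*maxRatio || del.MeanSeconds > factor*maxSeconds
}

func main() {
	stats := map[string]syncStats{
		"pods_deleted": {Errors: 10, Total: 100, MeanSeconds: 8},
		"pods_created": {Errors: 1, Total: 100, MeanSeconds: 1},
		"tracking":     {Errors: 1, Total: 1000, MeanSeconds: 0.1},
	}
	fmt.Println("pods_deleted stands out:", suspectSuspend(stats, 2))
}
```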
510526

511527
[supported limits]: https://git.k8s.io/community//sig-scalability/configs-and-limits/thresholds.md
512528
[existing SLIs/SLOs]: https://git.k8s.io/community/sig-scalability/slos/slos.md#kubernetes-slisslos
513529

514530
## Implementation History
515531

516532
2021-02-01: Initial KEP merged, alpha targeted for 1.21
533+
2021-03-08: Implementation merged in 1.21 with feature gate disabled by default
534+
2021-04-22: KEP updated for beta graduation in 1.22
517535

518536
## Drawbacks
519537

keps/sig-apps/2232-suspend-jobs/kep.yaml

Lines changed: 5 additions & 3 deletions
@@ -17,12 +17,12 @@ prr-approvers:
1717
- "@wojtek-t"
1818

1919
# The target maturity stage in the current dev cycle for this KEP.
20-
stage: alpha
20+
stage: beta
2121

2222
# The most recent milestone for which work toward delivery of this KEP has been
2323
# done. This can be the current (upcoming) milestone, if it is being actively
2424
# worked on.
25-
latest-milestone: "v1.21"
25+
latest-milestone: "v1.22"
2626

2727
# The milestone at which this feature was, or is targeted to be, at each stage.
2828
milestone:
@@ -40,4 +40,6 @@ feature-gates:
4040
disable-supported: true
4141

4242
# The following PRR answers are required at beta release
43-
metrics: []
43+
metrics:
44+
- job_sync_duration_seconds
45+
- job_sync_total
