@@ -492,6 +492,23 @@ Recall that end users cannot usually observe component logs or access metrics.
  - [ ] Other (treat as last resort)
  - Details:

+ ###### What are the reasonable SLOs (Service Level Objectives) for the enhancement?
+
+ <!--
+ This is your opportunity to define what "normal" quality of service looks like
+ for a feature.
+
+ It's impossible to provide comprehensive guidance, but at the very
+ high level (needs more precise definitions) those may be things like:
+ - per-day percentage of API calls finishing with 5XX errors <= 1%
+ - 99% percentile over day of absolute value from (job creation time minus expected
+ job creation time) for cron job <= 10%
+ - 99.9% of /health requests per day finish with 200 code
+
+ These goals will help you determine what you need to measure (SLIs) in the next
+ question.
+ -->
+
  ###### What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?

  <!--
<!--
@@ -505,18 +522,6 @@ Pick one more of these and delete the rest.
  - [ ] Other (treat as last resort)
  - Details:

- ###### What are the reasonable SLOs (Service Level Objectives) for the above SLIs?
-
- <!--
- At a high level, this usually will be in the form of "high percentile of SLI
- per day <= X". It's impossible to provide comprehensive guidance, but at the very
- high level (needs more precise definitions) those may be things like:
- - per-day percentage of API calls finishing with 5XX errors <= 1%
- - 99% percentile over day of absolute value from (job creation time minus expected
- job creation time) for cron job <= 10%
- - 99,9% of /health requests per day finish with 200 code
- -->
-
  ###### Are there any missing metrics that would be useful to have to improve observability of this feature?

  <!--
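To illustrate the kind of SLO the template comment describes, here is a minimal sketch of how two of the example goals (per-day 5XX percentage <= 1%, and 99.9% of /health requests returning 200) might be checked against daily request counts. It is not part of this change: the names `DailyCounts`, `error_rate_percent`, `meets_error_slo`, and `meets_health_slo` are hypothetical, and it assumes the daily counts have already been aggregated from whatever metrics the component exposes.

```python
from dataclasses import dataclass


@dataclass
class DailyCounts:
    """Hypothetical per-day aggregate of request outcomes."""
    total: int   # all requests observed during the day
    failed: int  # requests that missed the goal (e.g. 5XX, or non-200 on /health)


def error_rate_percent(counts: DailyCounts) -> float:
    """Return the percentage of failed requests; 0.0 when there was no traffic."""
    if counts.total == 0:
        return 0.0
    return 100.0 * counts.failed / counts.total


def meets_error_slo(counts: DailyCounts, max_error_percent: float = 1.0) -> bool:
    """Example SLO: per-day percentage of calls finishing with 5XX <= 1%."""
    return error_rate_percent(counts) <= max_error_percent


def meets_health_slo(counts: DailyCounts, min_success_percent: float = 99.9) -> bool:
    """Example SLO: at least 99.9% of /health requests per day return 200."""
    if counts.total == 0:
        return True  # assumption: a day without traffic does not violate the SLO
    # Integer comparison avoids floating-point rounding at the boundary.
    succeeded = counts.total - counts.failed
    return succeeded * 100.0 >= min_success_percent * counts.total
```

Defining the objective first, as the reordered questions suggest, makes it clear which counters (the SLIs) such a check needs as input.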