Commit 6100c06

Merge pull request kubernetes#2934 from bswartz/volume-populators-beta2
KEP-1495: Update Volume Populators to beta
2 parents b531ed6 + 4dd99c9 commit 6100c06

3 files changed, +73 -32 lines changed
Lines changed: 2 additions & 0 deletions
@@ -1,3 +1,5 @@
 kep-number: 1495
 alpha:
   approver: "@deads2k"
+beta:
+  approver: "@deads2k"

keps/sig-storage/1495-volume-populators/README.md

Lines changed: 68 additions & 29 deletions
@@ -402,7 +402,7 @@ _This section must be completed when targeting alpha to a release._

 * **Can the feature be disabled once it has been enabled (i.e. can we roll back
 the enablement)?**
-Yes, dropping or ignoring the new field returns the system to it's previous
+Yes, dropping or ignoring the new field returns the system to its previous
 state. Worst case, some PVCs which were trying to use the new field might need
 to be deleted because they will never have anything happen to them after the
 feature is disabled.
@@ -447,37 +447,55 @@ _This section must be completed when targeting beta graduation to a release._
 same mechanisms as the volume snapshot controller for installation and
 health monitoring.

+Additionally, the volume-data-source-validator controller will supply metrics
+on the number of volumes it validates and the outcomes of those validations,
+called `volume_data_source_validator_operation_count`.
+
+Individual populators can generate metrics, and the supplied populator library
+will supply population duration metrics called
+`volume_populator_operation_seconds` and number of operations and results
+including errors called `volume_populator_operation_count`.
+
 * **What are the SLIs (Service Level Indicators) an operator can use to determine
 the health of the service?**
-- [ ] Metrics
-  - Metric name:
+- [X] Metrics
+  - Metric name: The `volume_data_source_validator_operation_count` metric will
+    tally operations and include how many were valid/invalid. A significant
+    number of invalid operations would be cause for concern. The
+    `volume_populator_operation_seconds` metric will expose how long individual
+    population operations are taking and problems can be detected by deviations
+    from expected values. The `volume_populator_operation_count` metric
+    will include error counts, and any errors would be cause for concern.
   - [Optional] Aggregation method:
-  - Components exposing the metric:
+  - Components exposing the metric: volume-data-source-validator controller
+    and each populator that uses the lib-volume-populator library.
 - [ ] Other (treat as last resort)
   - Details:

 * **What are the reasonable SLOs (Service Level Objectives) for the above SLIs?**
-At a high level, this usually will be in the form of "high percentile of SLI
-per day <= X". It's impossible to provide comprehensive guidance, but at the very
-high level (needs more precise definitions) those may be things like:
-- per-day percentage of API calls finishing with 5XX errors <= 1%
-- 99% percentile over day of absolute value from (job creation time minus expected
-job creation time) for cron job <= 10%
-- 99,9% of /health requests per day finish with 200 code
+
+For `volume_data_source_validator_operation_count` counts of PVCs with invalid data
+sources should be very low, ideally zero. A value other than zero indicates that
+a user tried to use a volume populator that didn't exist, which suggests a
+mistake by either the deployer or the user. Small numbers might be ignorable
+if users are likely to be experimenting or playing around.
+For `volume_populator_operation_seconds` the reasonable times will depend on
+what the populator is doing. Some data sources might be reasonably populated in
+under 1 second, while others might frequently require a minute or more (if large
+amounts of data copying are involved).
+For `volume_populator_operation_count` any errors for a populator would suggest
+that specific populator had problems worth investigating.
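
As a rough illustration of the SLO guidance above (not part of the commit), the invalid-data-source check could be sketched as follows. This assumes the samples for `volume_data_source_validator_operation_count` have already been scraped into a mapping keyed by a hypothetical `result` label with values `valid` and `invalid`; the KEP text does not name the labels, and the 1% threshold is an arbitrary placeholder.

```python
from typing import Mapping


def invalid_ratio(counts: Mapping[str, float]) -> float:
    """Return the fraction of validation operations that were invalid.

    `counts` maps a hypothetical `result` label value ("valid"/"invalid")
    to the counter value scraped for that label.
    """
    total = sum(counts.values())
    if total == 0:
        # No validations recorded, so there is no ratio to report.
        return 0.0
    return counts.get("invalid", 0.0) / total


def needs_attention(counts: Mapping[str, float], threshold: float = 0.01) -> bool:
    """True when invalid data sources exceed the (assumed) tolerated ratio."""
    return invalid_ratio(counts) > threshold


if __name__ == "__main__":
    sample = {"valid": 3, "invalid": 1}
    print(invalid_ratio(sample))    # 0.25
    print(needs_attention(sample))  # True
```

A ratio is used here rather than a raw count only because the text allows that small numbers might be ignorable; a deployer who wants the "ideally zero" reading could set the threshold to 0.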

 * **Are there any missing metrics that would be useful to have to improve observability
 of this feature?**
-Describe the metrics themselves and the reasons why they weren't added (e.g., cost,
-implementation difficulties, etc.).
-* Counter for number of PVCs with no data sources
-* Counter for number of PVCs with valid data sources
-* Counter for number of PVCs with invalid data sources
+
+No

 ### Dependencies

 * **Does this feature depend on any specific services running in the cluster?**
 This feature depends on the VolumePopulator CRD being installed, and the
-associated data-source-validator controller.
+associated volume-data-source-validator controller.

 ### Scalability

@@ -519,21 +537,41 @@ resource usage (CPU, RAM, disk, IO, ...) in any components?**
 n/a

 * **What are other known failure modes?**
-For each of them, fill in the following information by copying the below template:
-- [Failure mode brief description]
-  - Detection: How can it be detected via metrics? Stated another way:
-    how can an operator troubleshoot without logging into a master or worker node?
-  - Mitigations: What can be done to stop the bleeding, especially for already
-    running user workloads?
-  - Diagnostics: What are the useful log messages and their required logging
-    levels that could help debug the issue?
-    Not required until feature graduated to beta.
-  - Testing: Are there any tests for failure mode? If not, describe why.
+- No feedback on invalid data sources
+  - Detection: volume-data-source-validator controller not installed, or
+    VolumePopulator CRD not installed, or complete absence of
+    `volume_data_source_validator_operation_count` metrics.
+  - Mitigations: Install the controller and CRD
+  - Diagnostics: None
+  - Testing: No, lack of feedback is expected result of not installing
+    the controller.
+- PVCs using invalid data source
+  - Detection: Non-zero `volume_data_source_validator_operation_count` errors
+  - Mitigations: Install appropriate populator if possible
+  - Diagnostics: Examine data source group/kind values for affected PVCs
+    to determine what populator is missing.
+  - Testing: Yes.
+- PVCs won't bind
+  - Detection: Non-zero `volume_populator_operation_count` errors
+  - Mitigations: Depends on the specific populator. If it was bad enough
+    the populator could be uninstalled, preventing future PVCs with the
+    matching data source group/kind from binding at all.
+  - Diagnostics: Investigate the logs for the specific populator.
+  - Testing: This will vary per-implementation of populators.

 * **What steps should be taken if SLOs are not being met to determine the problem?**

-[supported limits]: https://git.k8s.io/community//sig-scalability/configs-and-limits/thresholds.md
-[existing SLIs/SLOs]: https://git.k8s.io/community/sig-scalability/slos/slos.md#kubernetes-slisslos
+First ensure that all the required components are installed. Most problems are
+likely to result from a specific populator not being installed when users
+expect it (although it is up to deployers to decide what to include and
+communicate reasonable expectations to users) or from the
+volume-data-source-validator and the VolumePopulator CRD not being installed.
+
+Assuming all of the necessary components are installed, the next important
+step is to identify which populator is affected, by looking at the data
+sources of the PVCs that are having problems. Once the specific populator
+is determined, a populator-specific investigation will be needed, starting
+from looking at the logs for that populator.
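
The "identify which populator is affected" step above could look roughly like the following (an illustration, not part of the commit). It is a minimal sketch, assuming the problem PVCs have already been fetched as plain dictionaries in the shape of the Kubernetes API objects (`spec.dataSourceRef.apiGroup`, `spec.dataSourceRef.kind`, `status.phase`). The helper name and the sample group/kind values are hypothetical.

```python
from collections import Counter
from typing import Iterable, Mapping


def pending_by_data_source(pvcs: Iterable[Mapping]) -> Counter:
    """Count Pending PVCs per (apiGroup, kind) of their data source reference."""
    counts: Counter = Counter()
    for pvc in pvcs:
        if pvc.get("status", {}).get("phase") != "Pending":
            continue  # only PVCs that have not bound are of interest
        ref = pvc.get("spec", {}).get("dataSourceRef")
        if not ref:
            continue  # no data source reference, so no populator is involved
        key = (ref.get("apiGroup") or "", ref.get("kind", ""))
        counts[key] += 1
    return counts


if __name__ == "__main__":
    # Hypothetical sample objects; the group/kind values are placeholders.
    sample = [
        {"spec": {"dataSourceRef": {"apiGroup": "example.com", "kind": "Hello", "name": "a"}},
         "status": {"phase": "Pending"}},
        {"spec": {"dataSourceRef": {"apiGroup": "example.com", "kind": "Hello", "name": "b"}},
         "status": {"phase": "Pending"}},
        {"spec": {"dataSourceRef": {"apiGroup": "other.example.com", "kind": "Backup", "name": "c"}},
         "status": {"phase": "Pending"}},
        {"spec": {}, "status": {"phase": "Bound"}},
    ]
    for (group, kind), n in pending_by_data_source(sample).most_common():
        print(f"{group}/{kind}: {n} pending PVC(s)")
```

The group/kind with the most pending PVCs is a reasonable first candidate for the populator-specific log investigation.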

 ## Implementation History

@@ -551,6 +589,7 @@ resource usage (CPU, RAM, disk, IO, ...) in any components?**
 - Webhook replaced with controller in December 2020
 - KEP updated Feb 2021 for v1.21
 - Redesign with new `DataSourceRef` field in May 2021 for v1.22, still alpha
+- Added metrics, troubleshooting, move to beta in Sep 2021 for v1.23.

 ## Alternatives

keps/sig-storage/1495-volume-populators/kep.yaml

Lines changed: 3 additions & 3 deletions
@@ -7,7 +7,7 @@ participating-sigs:
 - sig-api-machinery
 status: implementable
 creation-date: 2019-12-03
-last-updated: 2021-05-13
+last-updated: 2021-09-02
 reviewers:
 - "@thockin"
 - "@saad-ali"
@@ -20,8 +20,8 @@ prr-approvers:
 - "@deads2k"
 replaces:
 - "/keps/sig-storage/20200120-generic-data-populators.md"
-stage: alpha
-latest-milestone: "v1.22"
+stage: beta
+latest-milestone: "v1.23"
 milestone:
   alpha: "v1.18"
   beta: "v1.23"
