@@ -402,7 +402,7 @@ _This section must be completed when targeting alpha to a release._

  * **Can the feature be disabled once it has been enabled (i.e. can we roll back
  the enablement)?**
- Yes, dropping or ignoring the new field returns the system to it's previous
+ Yes, dropping or ignoring the new field returns the system to its previous
  state. Worst case, some PVCs which were trying to use the new field might need
  to be deleted because they will never have anything happen to them after the
  feature is disabled.
@@ -447,37 +447,55 @@ _This section must be completed when targeting beta graduation to a release._
  same mechanisms as the volume snapshot controller for installation and
  health monitoring.

+ Additionally, the volume-data-source-validator controller will supply metrics
+ on the number of volumes it validates and the outcomes of those validations,
+ called `volume_data_source_validator_operation_count`.
+
+ Individual populators can generate their own metrics, and the supplied populator
+ library will supply a population duration metric called
+ `volume_populator_operation_seconds` and a count of operations and their results,
+ including errors, called `volume_populator_operation_count`.
+
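As an illustration of how an operator might read the metrics named above, here is a minimal sketch that pulls them out of a Prometheus text-format scrape. It assumes the components expose the standard text exposition format; the `parse_samples` helper and any label names in the usage are hypothetical and are not defined by this KEP.

```python
import re

# Matches sample lines such as: metric_name{label="value"} 3
# An optional trailing integer timestamp is accepted and ignored.
SAMPLE_RE = re.compile(
    r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)(?:\s+-?\d+)?$'
)
# Matches one label pair inside the braces; escape sequences inside
# label values are kept as written, not decoded.
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_samples(text, wanted_prefixes):
    """Return (name, labels, value) tuples for metrics with a wanted prefix.

    wanted_prefixes is a string or a tuple of strings, as accepted by
    str.startswith.
    """
    samples = []
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and HELP/TYPE comments.
        if not line or line.startswith("#"):
            continue
        match = SAMPLE_RE.match(line)
        if match is None:
            continue
        name, raw_labels, value = match.groups()
        if not name.startswith(wanted_prefixes):
            continue
        labels = dict(LABEL_RE.findall(raw_labels or ""))
        samples.append((name, labels, float(value)))
    return samples
```

Matching on prefixes rather than exact names means that, if a metric is exported as a histogram, its `_bucket`, `_sum` and `_count` series are picked up as well.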
  * **What are the SLIs (Service Level Indicators) an operator can use to determine
  the health of the service?**
-   - [ ] Metrics
-     - Metric name:
+   - [X] Metrics
+     - Metric name: The `volume_data_source_validator_operation_count` metric will
+       tally operations and include how many were valid/invalid. A significant
+       number of invalid operations would be cause for concern. The
+       `volume_populator_operation_seconds` metric will expose how long individual
+       population operations are taking and problems can be detected by deviations
+       from expected values. The `volume_populator_operation_count` metric
+       will include error counts, and any errors would be cause for concern.
      - [Optional] Aggregation method:
-     - Components exposing the metric:
+     - Components exposing the metric: volume-data-source-validator controller
+       and each populator that uses the lib-volume-populator library.
    - [ ] Other (treat as last resort)
      - Details:

  * **What are the reasonable SLOs (Service Level Objectives) for the above SLIs?**
- At a high level, this usually will be in the form of "high percentile of SLI
- per day <= X". It's impossible to provide comprehensive guidance, but at the very
- high level (needs more precise definitions) those may be things like:
- - per-day percentage of API calls finishing with 5XX errors <= 1%
- - 99% percentile over day of absolute value from (job creation time minus expected
- job creation time) for cron job <= 10%
- - 99,9% of /health requests per day finish with 200 code
+
+ For `volume_data_source_validator_operation_count`, counts of PVCs with invalid data
+ sources should be very low, ideally zero. A value other than zero indicates that
+ a user tried to use a volume populator that didn't exist, which suggests a
+ mistake by either the deployer or the user. Small numbers might be ignorable
+ if users are likely to be experimenting or playing around.
+ For `volume_populator_operation_seconds`, the reasonable times will depend on
+ what the populator is doing. Some data sources might be reasonably populated in
+ under 1 second, while others might frequently require a minute or more (if large
+ amounts of data copying are involved).
+ For `volume_populator_operation_count`, any errors for a populator would suggest
+ that the specific populator has problems worth investigating.
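The SLO guidance above can be expressed as a small check. The following is a minimal sketch, assuming the metric samples have already been collected into plain dictionaries; the `invalid` and `error` result values, the per-populator keys, and the `evaluate_slos` helper are all hypothetical, since the KEP text does not specify label names. The per-populator duration limits are supplied by the caller because reasonable times depend on what each populator does.

```python
from collections import defaultdict


def evaluate_slos(validator_counts, populator_counts, durations, max_seconds):
    """Return a list of human-readable findings; empty means nothing to flag.

    validator_counts: {result: count}, e.g. {"valid": 40, "invalid": 2}
    populator_counts: {(populator, result): count}
    durations:        {populator: [seconds, ...]} observed population times
    max_seconds:      {populator: expected upper bound in seconds}
    """
    findings = []

    # Invalid data sources should ideally be zero.
    invalid = validator_counts.get("invalid", 0)
    if invalid > 0:
        findings.append(f"{invalid} PVC(s) with invalid data sources")

    # Any population error points at a specific populator.
    errors = defaultdict(int)
    for (populator, result), count in populator_counts.items():
        if result == "error":
            errors[populator] += count
    for populator in sorted(errors):
        if errors[populator] > 0:
            findings.append(f"{populator}: {errors[populator]} population error(s)")

    # Durations are compared against a limit chosen per populator;
    # populators without a configured limit are skipped.
    for populator in sorted(durations):
        limit = max_seconds.get(populator)
        if limit is None:
            continue
        slow = [s for s in durations[populator] if s > limit]
        if slow:
            findings.append(
                f"{populator}: {len(slow)} population(s) slower than {limit}s"
            )
    return findings
```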

  * **Are there any missing metrics that would be useful to have to improve observability
  of this feature?**
- Describe the metrics themselves and the reasons why they weren't added (e.g., cost,
- implementation difficulties, etc.).
- * Counter for number of PVCs with no data sources
- * Counter for number of PVCs with valid data sources
- * Counter for number of PVCs with invalid data sources
+
+ No

  ### Dependencies

  * **Does this feature depend on any specific services running in the cluster?**
  This feature depends on the VolumePopulator CRD being installed, and the
- associated data-source-validator controller.
+ associated volume-data-source-validator controller.

  ### Scalability

@@ -519,21 +537,41 @@ resource usage (CPU, RAM, disk, IO, ...) in any components?**
  n/a

  * **What are other known failure modes?**
- For each of them, fill in the following information by copying the below template:
-   - [Failure mode brief description]
-     - Detection: How can it be detected via metrics? Stated another way:
-       how can an operator troubleshoot without logging into a master or worker node?
-     - Mitigations: What can be done to stop the bleeding, especially for already
-       running user workloads?
-     - Diagnostics: What are the useful log messages and their required logging
-       levels that could help debug the issue?
-       Not required until feature graduated to beta.
-     - Testing: Are there any tests for failure mode? If not, describe why.
+   - No feedback on invalid data sources
+     - Detection: volume-data-source-validator controller not installed, or
+       VolumePopulator CRD not installed, or complete absence of
+       `volume_data_source_validator_operation_count` metrics.
+     - Mitigations: Install the controller and CRD
+     - Diagnostics: None
+     - Testing: No, lack of feedback is the expected result of not installing
+       the controller.
+   - PVCs using invalid data source
+     - Detection: Non-zero `volume_data_source_validator_operation_count` errors
+     - Mitigations: Install appropriate populator if possible
+     - Diagnostics: Examine data source group/kind values for affected PVCs
+       to determine what populator is missing.
+     - Testing: Yes.
+   - PVCs won't bind
+     - Detection: Non-zero `volume_populator_operation_count` errors
+     - Mitigations: Depends on the specific populator. If the problem is severe
+       enough, the populator could be uninstalled, preventing future PVCs with the
+       matching data source group/kind from binding at all.
+     - Diagnostics: Investigate the logs for the specific populator.
+     - Testing: This will vary per-implementation of populators.

  * **What steps should be taken if SLOs are not being met to determine the problem?**

- [supported limits]: https://git.k8s.io/community//sig-scalability/configs-and-limits/thresholds.md
- [existing SLIs/SLOs]: https://git.k8s.io/community/sig-scalability/slos/slos.md#kubernetes-slisslos
+ First ensure that all the required components are installed. Most problems are
+ likely to result from a specific populator not being installed when users
+ expect it (although it is up to deployers to decide what to include and
+ communicate reasonable expectations to users) or from the
+ volume-data-source-validator and the VolumePopulator CRD not being installed.
+
+ Assuming all of the necessary components are installed, the next important
+ step is to identify which populator is affected, by looking at the data
+ sources of the PVCs that are having problems. Once the specific populator
+ is determined, a populator-specific investigation will be needed, starting
+ from looking at the logs for that populator.
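The step of identifying the affected populator from the data sources of problem PVCs could be sketched as follows. This is an illustrative helper, not part of the KEP: it assumes the PVC objects were loaded as dictionaries, for example from the `items` list of `kubectl get pvc --all-namespaces -o json`, and the `count_by_data_source` name is hypothetical.

```python
from collections import Counter


def count_by_data_source(pvcs):
    """Count PVCs that are not Bound, grouped by dataSourceRef group/kind.

    Returns a list of ((apiGroup, kind), count) pairs, most frequent first.
    PVCs without a dataSourceRef are ignored.
    """
    counts = Counter()
    for pvc in pvcs:
        # Bound PVCs are not "having problems" in the sense used here.
        if pvc.get("status", {}).get("phase") == "Bound":
            continue
        ref = pvc.get("spec", {}).get("dataSourceRef")
        if not ref:
            continue
        counts[(ref.get("apiGroup", ""), ref.get("kind", ""))] += 1
    return counts.most_common()
```

The group/kind with the most unbound PVCs is a reasonable first candidate when deciding which populator's logs to read.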

  ## Implementation History

@@ -551,6 +589,7 @@ resource usage (CPU, RAM, disk, IO, ...) in any components?**
  - Webhook replaced with controller in December 2020
  - KEP updated Feb 2021 for v1.21
  - Redesign with new `DataSourceRef` field in May 2021 for v1.22, still alpha
+ - Added metrics, troubleshooting, move to beta in Sep 2021 for v1.23.

  ## Alternatives
