@@ -447,31 +447,49 @@ _This section must be completed when targeting beta graduation to a release._
  same mechanisms as the volume snapshot controller for installation and
  health monitoring.

+ Additionally, the volume-data-source-validator controller will supply a metric,
+ `volume_data_source_validator_operation_count`, that counts the volumes it
+ validates and the outcomes of those validations.
+
+ Individual populators can generate their own metrics, and the supplied populator
+ library will provide `volume_populator_operation_seconds` for population
+ duration and `volume_populator_operation_count` for the number of operations
+ and their results, including errors.
+
  * **What are the SLIs (Service Level Indicators) an operator can use to determine
    the health of the service?**
-   - [ ] Metrics
-     - Metric name:
+   - [X] Metrics
+     - Metric name: The `volume_data_source_validator_operation_count` metric will
+       tally operations and include how many were valid/invalid. A significant
+       number of invalid operations would be cause for concern. The
+       `volume_populator_operation_seconds` metric will expose how long individual
+       population operations are taking, and problems can be detected by deviations
+       from expected values. The `volume_populator_operation_count` metric
+       will include error counts, and any errors would be cause for concern.
      - [Optional] Aggregation method:
-     - Components exposing the metric:
+     - Components exposing the metric: volume-data-source-validator controller
+       and each populator that uses the lib-volume-populator library.
    - [ ] Other (treat as last resort)
      - Details:

  * **What are the reasonable SLOs (Service Level Objectives) for the above SLIs?**
-   At a high level, this usually will be in the form of "high percentile of SLI
-   per day <= X". It's impossible to provide comprehensive guidance, but at the very
-   high level (needs more precise definitions) those may be things like:
-     - per-day percentage of API calls finishing with 5XX errors <= 1%
-     - 99% percentile over day of absolute value from (job creation time minus expected
-       job creation time) for cron job <= 10%
-     - 99,9% of /health requests per day finish with 200 code
+
+   For `volume_data_source_validator_operation_count`, counts of PVCs with invalid data
+   sources should be very low, ideally zero. A value other than zero indicates that
+   a user tried to use a volume populator that didn't exist, which suggests a
+   mistake by either the deployer or the user. Small numbers might be ignorable
+   if users are likely to be experimenting or playing around.
+   For `volume_populator_operation_seconds`, the reasonable times will depend on
+   what the populator is doing. Some data sources might be reasonably populated in
+   under 1 second, while others might frequently require a minute or more (if large
+   amounts of data copying are involved).
+   For `volume_populator_operation_count`, any errors for a populator would suggest
+   that the specific populator had problems worth investigating.
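
As a rough illustration of the objectives above, the following Python sketch evaluates one snapshot of the three metrics. The result keys (`valid`, `invalid`, `error`), the `SloReport` type, and both thresholds are hypothetical: this KEP does not define label values or numeric targets, and a reasonable duration limit depends on what the populator does.

```python
from dataclasses import dataclass


@dataclass
class SloReport:
    invalid_ratio: float   # share of validations that found an invalid data source
    populator_errors: int  # populator operations that ended in an error
    slow_operations: int   # populations that exceeded the duration limit
    healthy: bool


def evaluate_slos(validator_counts, populator_counts, durations_seconds,
                  max_invalid_ratio=0.0, max_duration_seconds=60.0):
    """Check one snapshot of the metrics against illustrative thresholds.

    The dictionary keys are assumed result values, not names defined by the KEP.
    """
    total = sum(validator_counts.values())
    invalid = validator_counts.get("invalid", 0)
    invalid_ratio = invalid / total if total else 0.0

    populator_errors = populator_counts.get("error", 0)
    slow = sum(1 for d in durations_seconds if d > max_duration_seconds)

    healthy = (invalid_ratio <= max_invalid_ratio
               and populator_errors == 0
               and slow == 0)
    return SloReport(invalid_ratio, populator_errors, slow, healthy)


report = evaluate_slos(
    validator_counts={"valid": 98, "invalid": 2},
    populator_counts={"success": 10, "error": 0},
    durations_seconds=[0.5, 12.0],
    max_invalid_ratio=0.05,  # assumed tolerance for users who are experimenting
)
print(report)
```

The default `max_invalid_ratio=0.0` mirrors the "ideally zero" target; a deployer who expects experimentation could loosen it, as the example call does.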

  * **Are there any missing metrics that would be useful to have to improve observability
    of this feature?**
-   Describe the metrics themselves and the reasons why they weren't added (e.g., cost,
-   implementation difficulties, etc.).
-   * Counter for number of PVCs with no data sources
-   * Counter for number of PVCs with valid data sources
-   * Counter for number of PVCs with invalid data sources
+
+   No.

  ### Dependencies

@@ -519,21 +537,41 @@ resource usage (CPU, RAM, disk, IO, ...) in any components?**
  n/a

  * **What are other known failure modes?**
-   For each of them, fill in the following information by copying the below template:
-   - [Failure mode brief description]
-     - Detection: How can it be detected via metrics? Stated another way:
-       how can an operator troubleshoot without logging into a master or worker node?
-     - Mitigations: What can be done to stop the bleeding, especially for already
-       running user workloads?
-     - Diagnostics: What are the useful log messages and their required logging
-       levels that could help debug the issue?
-       Not required until feature graduated to beta.
-     - Testing: Are there any tests for failure mode? If not, describe why.
+   - No feedback on invalid data sources
+     - Detection: The volume-data-source-validator controller is not installed, the
+       VolumePopulator CRD is not installed, or there is a complete absence of
+       `volume_data_source_validator_operation_count` metrics.
+     - Mitigations: Install the controller and CRD.
+     - Diagnostics: None.
+     - Testing: No; lack of feedback is the expected result of not installing
+       the controller.
+   - PVCs using invalid data source
+     - Detection: Non-zero `volume_data_source_validator_operation_count` errors.
+     - Mitigations: Install the appropriate populator if possible.
+     - Diagnostics: Examine data source group/kind values for affected PVCs
+       to determine what populator is missing.
+     - Testing: Yes.
+   - PVCs won't bind
+     - Detection: Non-zero `volume_populator_operation_count` errors.
+     - Mitigations: Depends on the specific populator. If the problem is bad enough,
+       the populator could be uninstalled, preventing future PVCs with the
+       matching data source group/kind from binding at all.
+     - Diagnostics: Investigate the logs for the specific populator.
+     - Testing: This will vary per populator implementation.
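
The first failure mode is detected by the complete absence of a metric, which can be checked against a scraped metrics page. The sketch below is one possible way to do that in Python, assuming the endpoint serves the Prometheus text exposition format without per-sample timestamps. The `result` label in the sample text and the `metric_samples` helper are hypothetical; this KEP names the metrics but not their labels.

```python
def metric_samples(exposition_text, metric_name):
    """Return (labels_text, value) pairs for one metric name.

    Assumes Prometheus text format lines of the form
    `name{labels} value` or `name value`, with no trailing timestamp.
    """
    samples = []
    for line in exposition_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        name_and_labels, _, value = line.rpartition(" ")
        name, _, labels = name_and_labels.partition("{")
        if name == metric_name:
            samples.append((labels.rstrip("}"), float(value)))
    return samples


# Hypothetical scrape output; the label name and values are assumptions.
scraped = """\
# TYPE volume_data_source_validator_operation_count counter
volume_data_source_validator_operation_count{result="valid"} 7
volume_data_source_validator_operation_count{result="invalid"} 1
other_metric 3
"""

validator = metric_samples(scraped, "volume_data_source_validator_operation_count")
if not validator:
    print("validator metric absent: controller or CRD may not be installed")
else:
    print("validator operations observed:", sum(v for _, v in validator))
```

An empty result for the validator metric corresponds to the "No feedback on invalid data sources" case; an empty result for a populator metric would only mean that this particular page does not expose it.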

  * **What steps should be taken if SLOs are not being met to determine the problem?**

- [supported limits]: https://git.k8s.io/community//sig-scalability/configs-and-limits/thresholds.md
- [existing SLIs/SLOs]: https://git.k8s.io/community/sig-scalability/slos/slos.md#kubernetes-slisslos
+ First, ensure that all the required components are installed. Most problems are
+ likely to result from a specific populator not being installed when users
+ expect it (although it is up to deployers to decide what to include and
+ communicate reasonable expectations to users) or from the
+ volume-data-source-validator and the VolumePopulator CRD not being installed.
+
+ Assuming all of the necessary components are installed, the next important
+ step is to identify which populator is affected by looking at the data
+ sources of the PVCs that are having problems. Once the specific populator
+ is determined, a populator-specific investigation will be needed, starting
+ with the logs for that populator.
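
The second step, identifying which populator is affected, amounts to grouping the problem PVCs by the group/kind of their data source. A minimal Python sketch follows, assuming the PVCs have already been filtered to the ones having problems and are available as dictionaries shaped like the items of `kubectl get pvc -o json`. The `affected_populators` helper and the example group/kind values are hypothetical.

```python
from collections import Counter


def affected_populators(problem_pvcs):
    """Count problem PVCs per (apiGroup, kind) of spec.dataSourceRef.

    PVCs without a data source reference are ignored, since no
    populator is involved in provisioning them.
    """
    counts = Counter()
    for pvc in problem_pvcs:
        ref = pvc.get("spec", {}).get("dataSourceRef") or {}
        kind = ref.get("kind") or ""
        if kind:
            counts[(ref.get("apiGroup") or "", kind)] += 1
    # Most frequently affected group/kind first.
    return counts.most_common()


# Hypothetical PVCs that are stuck; only the relevant fields are shown.
stuck = [
    {"spec": {"dataSourceRef": {"apiGroup": "example.com", "kind": "Hello", "name": "a"}}},
    {"spec": {"dataSourceRef": {"apiGroup": "example.com", "kind": "Hello", "name": "b"}}},
    {"spec": {"dataSourceRef": {"apiGroup": "other.example.com", "kind": "Foo", "name": "c"}}},
    {"spec": {}},
]

for (group, kind), n in affected_populators(stuck):
    print(f"{group}/{kind}: {n} PVC(s)")
```

The group/kind at the top of the list points to the populator whose installation and logs are worth checking first.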

  ## Implementation History

@@ -551,6 +589,7 @@ resource usage (CPU, RAM, disk, IO, ...) in any components?**
  - Webhook replaced with controller in December 2020
  - KEP updated Feb 2021 for v1.21
  - Redesign with new `DataSourceRef` field in May 2021 for v1.22, still alpha
+ - Added metrics, troubleshooting, move to beta in Sep 2021 for v1.23.

  ## Alternatives
