Commit 4dd99c9

committed
Add metrics and troubleshooting
1 parent 4614e4a commit 4dd99c9

1 file changed

+66
-27
lines changed

keps/sig-storage/1495-volume-populators/README.md

Lines changed: 66 additions & 27 deletions
@@ -447,31 +447,49 @@ _This section must be completed when targeting beta graduation to a release._
447447
same mechanisms as the volume snapshot controller for installation and
448448
health monitoring.
449449

450+
Additionally, the volume-data-source-validator controller will supply metrics
451+
on the number of volumes it validates and the outcomes of those validations,
452+
called `volume_data_source_validator_operation_count`.
453+
454+
Individual populators can generate their own metrics, and the supplied populator library
455+
will supply a population duration metric called
456+
`volume_populator_operation_seconds` and a count of operations and their results,
457+
including errors, called `volume_populator_operation_count`.
458+
450459
* **What are the SLIs (Service Level Indicators) an operator can use to determine
451460
the health of the service?**
452-
- [ ] Metrics
453-
- Metric name:
461+
- [X] Metrics
462+
- Metric name: The `volume_data_source_validator_operation_count` metric will
463+
tally operations and include how many were valid/invalid. A significant
464+
number of invalid operations would be cause for concern. The
465+
`volume_populator_operation_seconds` metric will expose how long individual
466+
population operations are taking and problems can be detected by deviations
467+
from expected values. The `volume_populator_operation_count` metric
468+
will include error counts, and any errors would be cause for concern.
454469
- [Optional] Aggregation method:
455-
- Components exposing the metric:
470+
- Components exposing the metric: volume-data-source-validator controller
471+
and each populator that uses the lib-volume-populator library.
456472
- [ ] Other (treat as last resort)
457473
- Details:
458474

459475
* **What are the reasonable SLOs (Service Level Objectives) for the above SLIs?**
460-
At a high level, this usually will be in the form of "high percentile of SLI
461-
per day <= X". It's impossible to provide comprehensive guidance, but at the very
462-
high level (needs more precise definitions) those may be things like:
463-
- per-day percentage of API calls finishing with 5XX errors <= 1%
464-
- 99% percentile over day of absolute value from (job creation time minus expected
465-
job creation time) for cron job <= 10%
466-
- 99,9% of /health requests per day finish with 200 code
476+
477+
For `volume_data_source_validator_operation_count`, counts of PVCs with invalid data
478+
sources should be very low, ideally zero. A value other than zero indicates that
479+
a user tried to use a volume populator that didn't exist, which suggests a
480+
mistake by either the deployer or the user. Small numbers might be ignorable
481+
if users are likely to be experimenting or playing around.
482+
For `volume_populator_operation_seconds` the reasonable times will depend on
483+
what the populator is doing. Some data sources might be reasonably populated in
484+
under 1 second, while others might frequently require a minute or more (if large
485+
amounts of data copying are involved).
486+
For `volume_populator_operation_count` any errors for a populator would suggest
487+
that specific populator had problems worth investigating.
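
The SLO guidance above could be sketched as a small check over values already scraped from the three metrics. This is only an illustration under stated assumptions: the `MetricSnapshot` type, the `slo_violations` function, and the `invalid_tolerance` and `max_expected_seconds` parameters are hypothetical names and not part of the KEP or any library, and the right duration budget depends on the populator.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class MetricSnapshot:
    """Hypothetical container for values read from the three metrics."""
    validator_invalid_count: int      # invalid results in volume_data_source_validator_operation_count
    populator_error_count: int        # error results in volume_populator_operation_count
    populator_durations: List[float]  # samples of volume_populator_operation_seconds


def slo_violations(snap: MetricSnapshot,
                   invalid_tolerance: int = 0,
                   max_expected_seconds: float = 60.0) -> List[str]:
    """Return a list of SLO concerns found in one snapshot (empty if none)."""
    concerns = []
    # Invalid data sources should ideally be zero; a small tolerance may be
    # acceptable where users are known to be experimenting.
    if snap.validator_invalid_count > invalid_tolerance:
        concerns.append(
            f"{snap.validator_invalid_count} PVCs with invalid data sources")
    # Any populator error is worth investigating.
    if snap.populator_error_count > 0:
        concerns.append(f"{snap.populator_error_count} populator errors")
    # Durations are judged against a populator-specific budget (assumed here).
    slow = [d for d in snap.populator_durations if d > max_expected_seconds]
    if slow:
        concerns.append(
            f"{len(slow)} population operations exceeded {max_expected_seconds}s")
    return concerns
```

A deployer whose users experiment often might pass a small non-zero `invalid_tolerance`, while a populator that copies large amounts of data would need a larger `max_expected_seconds`.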
467488

468489
* **Are there any missing metrics that would be useful to have to improve observability
469490
of this feature?**
470-
Describe the metrics themselves and the reasons why they weren't added (e.g., cost,
471-
implementation difficulties, etc.).
472-
* Counter for number of PVCs with no data sources
473-
* Counter for number of PVCs with valid data sources
474-
* Counter for number of PVCs with invalid data sources
491+
492+
No
475493

476494
### Dependencies
477495

@@ -519,21 +537,41 @@ resource usage (CPU, RAM, disk, IO, ...) in any components?**
519537
n/a
520538

521539
* **What are other known failure modes?**
522-
For each of them, fill in the following information by copying the below template:
523-
- [Failure mode brief description]
524-
- Detection: How can it be detected via metrics? Stated another way:
525-
how can an operator troubleshoot without logging into a master or worker node?
526-
- Mitigations: What can be done to stop the bleeding, especially for already
527-
running user workloads?
528-
- Diagnostics: What are the useful log messages and their required logging
529-
levels that could help debug the issue?
530-
Not required until feature graduated to beta.
531-
- Testing: Are there any tests for failure mode? If not, describe why.
540+
- No feedback on invalid data sources
541+
- Detection: volume-data-source-validator controller not installed, or
542+
VolumePopulator CRD not installed, or complete absence of
543+
`volume_data_source_validator_operation_count` metrics.
544+
- Mitigations: Install the controller and CRD
545+
- Diagnostics: None
546+
- Testing: No, lack of feedback is the expected result of not installing
547+
the controller.
548+
- PVCs using invalid data source
549+
- Detection: Non-zero `volume_data_source_validator_operation_count` errors
550+
- Mitigations: Install appropriate populator if possible
551+
- Diagnostics: Examine data source group/kind values for affected PVCs
552+
to determine what populator is missing.
553+
- Testing: Yes.
554+
- PVCs won't bind
555+
- Detection: Non-zero `volume_populator_operation_count` errors
556+
- Mitigations: Depends on the specific populator. If the problem is severe enough,
557+
the populator could be uninstalled, preventing future PVCs with the
558+
matching data source group/kind from binding at all.
559+
- Diagnostics: Investigate the logs for the specific populator.
560+
- Testing: This will vary per-implementation of populators.
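
The three failure modes above map fairly directly onto what the metrics show, which could be sketched as a small triage helper. This is a minimal sketch, assuming the metric values have already been collected; the `triage` function and its parameters are hypothetical, with `None` standing in for the validator metric being completely absent.

```python
from typing import List, Optional


def triage(validator_invalid: Optional[int], populator_errors: int) -> List[str]:
    """Map observed metric values to the failure modes listed above.

    validator_invalid is None when no
    volume_data_source_validator_operation_count metrics exist at all.
    """
    findings = []
    if validator_invalid is None:
        findings.append(
            "No feedback on invalid data sources: install the "
            "volume-data-source-validator controller and VolumePopulator CRD")
    elif validator_invalid > 0:
        findings.append(
            "PVCs using invalid data source: examine data source group/kind "
            "values for affected PVCs")
    if populator_errors > 0:
        findings.append(
            "PVCs won't bind: investigate the logs for the specific populator")
    return findings
```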
532561

533562
* **What steps should be taken if SLOs are not being met to determine the problem?**
534563

535-
[supported limits]: https://git.k8s.io/community//sig-scalability/configs-and-limits/thresholds.md
536-
[existing SLIs/SLOs]: https://git.k8s.io/community/sig-scalability/slos/slos.md#kubernetes-slisslos
564+
First ensure that all the required components are installed. Most problems are
565+
likely to result from a specific populator not being installed when users
566+
expect it (although it is up to deployers to decide what to include and
567+
communicate reasonable expectations to users) or from the
568+
volume-data-source-validator and the VolumePopulator CRD not being installed.
569+
570+
Assuming all of the necessary components are installed, the next important
571+
step is to identify which populator is affected, by looking at the data
572+
sources of the PVCs that are having problems. Once the specific populator
573+
is determined, a populator-specific investigation will be needed, starting
574+
from looking at the logs for that populator.
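
Identifying the affected populator from the data sources of problem PVCs could be sketched as follows. This is an illustration only: it assumes the problem PVCs are available as plain dictionaries shaped like Kubernetes PVC JSON (with the data source under `spec.dataSourceRef`), and the `affected_populators` function name is hypothetical.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def affected_populators(pvcs: Iterable[dict]) -> List[Tuple[Tuple[str, str], int]]:
    """Count problem PVCs per data source (apiGroup, kind), most frequent first."""
    counts = Counter()
    for pvc in pvcs:
        ref = pvc.get("spec", {}).get("dataSourceRef")
        if not ref:
            # PVCs without a data source are not handled by any populator.
            continue
        group = ref.get("apiGroup") or ""
        kind = ref.get("kind") or ""
        counts[(group, kind)] += 1
    return counts.most_common()
```

The group/kind pair with the highest count is a reasonable place to start the populator-specific investigation.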
537575

538576
## Implementation History
539577

@@ -551,6 +589,7 @@ resource usage (CPU, RAM, disk, IO, ...) in any components?**
551589
- Webhook replaced with controller in December 2020
552590
- KEP updated Feb 2021 for v1.21
553591
- Redesign with new `DataSourceRef` field in May 2021 for v1.22, still alpha
592+
- Added metrics and troubleshooting, moved to beta in Sep 2021 for v1.23.
554593

555594
## Alternatives
556595
