 - [Provider configuration](#provider-configuration)
 - [Provider input format](#provider-input-format)
 - [Provider output format](#provider-output-format)
+- [Metrics](#metrics)
 - [Risks and Mitigations](#risks-and-mitigations)
 - [Client authentication to the binary](#client-authentication-to-the-binary)
 - [Invalid credentials before cache expiry](#invalid-credentials-before-cache-expiry)
@@ -493,6 +494,114 @@ credentials).
 PEM format. The certificate must be valid at the time of execution. These
 credentials are used for mTLS handshakes.

+### Metrics
+
+As discussed [below](#rollout-upgrade-and-rollback-planning), there are 4
+primary metrics used by this feature set.
+
+```golang
+var (
+	execPluginCertTTL = k8smetrics.NewGaugeFunc(
+		k8smetrics.GaugeOpts{
+			Name: "rest_client_exec_plugin_ttl_seconds",
+			Help: "Gauge of the shortest TTL (time-to-live) of the client " +
+				"certificate(s) managed by the auth exec plugin. The value " +
+				"is in seconds until certificate expiry (negative if " +
+				"already expired). If auth exec plugins are unused or manage no " +
+				"TLS certificates, the value will be +INF.",
+		},
+		func() float64 {
+			if execPluginCertTTLAdapter.e == nil {
+				return math.Inf(1)
+			}
+			return execPluginCertTTLAdapter.e.Sub(time.Now()).Seconds()
+		},
+	)
+
+	execPluginCertRotation = k8smetrics.NewHistogram(
+		&k8smetrics.HistogramOpts{
+			Name: "rest_client_exec_plugin_certificate_rotation_age",
+			Help: "Histogram of the number of seconds the last auth exec " +
+				"plugin client certificate lived before being rotated. " +
+				"If auth exec plugin client certificates are unused, " +
+				"histogram will contain no data.",
+			// There are three sets of ranges these buckets intend to capture:
+			//   - 10-60 minutes: captures a rotation cadence which is
+			//     happening too quickly.
+			//   - 4 hours - 1 month: captures an ideal rotation cadence.
+			//   - 3 months - 4 years: captures a rotation cadence which
+			//     is probably too slow or much too slow.
+			Buckets: []float64{
+				600,       // 10 minutes
+				1800,      // 30 minutes
+				3600,      // 1 hour
+				14400,     // 4 hours
+				86400,     // 1 day
+				604800,    // 1 week
+				2592000,   // 1 month
+				7776000,   // 3 months
+				15552000,  // 6 months
+				31104000,  // 1 year
+				124416000, // 4 years
+			},
+		},
+	)
+
+	execPluginCalls = k8smetrics.NewCounterVec(
+		&k8smetrics.CounterOpts{
+			Name: "rest_client_exec_plugin_calls",
+			Help: "Number of calls to an exec plugin.",
+		},
+		[]string{},
+	)
+
+	execPluginFailedCalls = k8smetrics.NewCounterVec(
+		&k8smetrics.CounterOpts{
+			Name: "rest_client_exec_plugin_failed_calls",
+			Help: "Number of calls to an exec plugin, partitioned by exit code.",
+		},
+		[]string{"exitCode"},
+	)
+)
+```
+
+As is common practice, these metrics will be hidden behind abstract global
+variables (interfaces) that the exec plugin code calls.
+```golang
+// DurationMetric is a measurement of some amount of time.
+type DurationMetric interface {
+	Observe(duration time.Duration)
+}
+
+// ExpiryMetric sets some time of expiry. If nil, assume not relevant.
+type ExpiryMetric interface {
+	Set(expiry *time.Time)
+}
+
+// CallsMetric counts calls that take place for a specific exec plugin.
+type CallsMetric interface {
+	// Increment increments a counter. Implementations may ignore exitCode,
+	// so that this interface can also be used when a call takes place
+	// but the exit code does not matter.
+	Increment(exitCode int)
+}
+
+var (
+	// ClientCertExpiry is the expiry time of a client certificate.
+	ClientCertExpiry ExpiryMetric = noopExpiry{}
+	// ClientCertRotationAge is the age of a certificate that has just been rotated.
+	ClientCertRotationAge DurationMetric = noopDuration{}
+	// ExecPluginCalls is the number of calls made to an exec plugin.
+	ExecPluginCalls CallsMetric = noopCalls{}
+	// ExecPluginFailedCalls is the number of calls made to an exec plugin that fail,
+	// i.e., when the binary returns a non-zero exit code.
+	ExecPluginFailedCalls CallsMetric = noopCalls{}
+)
+```
601
+
602
+ The ` "exitCode" ` label of these metrics is an attempt to elucidate the exec
603
+ plugin failure mode to the user.
604
+
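As a rough illustration of how the no-op defaults and a registered implementation could interact, here is a small self-contained sketch. Only the `CallsMetric` interface and the names `noopCalls`, `ExecPluginCalls`, and `ExecPluginFailedCalls` come from the text above; the body of `noopCalls`, the in-memory `countingCalls` type, and the `recordCall` call site are assumptions made for this example, not the proposed implementation.

```go
package main

import (
	"fmt"
	"sync"
)

// CallsMetric matches the interface shown above.
type CallsMetric interface {
	Increment(exitCode int)
}

// noopCalls is an assumed no-op default; only its name appears above.
type noopCalls struct{}

func (noopCalls) Increment(int) {}

// countingCalls is a hypothetical in-memory CallsMetric that partitions
// counts by exit code, standing in for a real counter vector.
type countingCalls struct {
	mu     sync.Mutex
	counts map[int]int
}

func (c *countingCalls) Increment(exitCode int) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if c.counts == nil {
		c.counts = make(map[int]int)
	}
	c.counts[exitCode]++
}

// count returns how many calls were recorded for one exit code.
func (c *countingCalls) count(exitCode int) int {
	c.mu.Lock()
	defer c.mu.Unlock()
	return c.counts[exitCode]
}

// total returns the number of calls recorded across all exit codes.
func (c *countingCalls) total() int {
	c.mu.Lock()
	defer c.mu.Unlock()
	n := 0
	for _, v := range c.counts {
		n += v
	}
	return n
}

var (
	ExecPluginCalls       CallsMetric = noopCalls{}
	ExecPluginFailedCalls CallsMetric = noopCalls{}
)

// recordCall is a hypothetical call site: every call is counted, and
// calls with a non-zero exit code are additionally counted as failures.
func recordCall(exitCode int) {
	ExecPluginCalls.Increment(exitCode)
	if exitCode != 0 {
		ExecPluginFailedCalls.Increment(exitCode)
	}
}

func main() {
	recordCall(1) // safe with the no-op defaults: nothing is recorded

	calls, failed := &countingCalls{}, &countingCalls{}
	ExecPluginCalls, ExecPluginFailedCalls = calls, failed
	for _, code := range []int{0, 0, 1, 2, 1} {
		recordCall(code)
	}
	fmt.Println(calls.total(), failed.count(1), failed.count(2)) // 5 2 1
}
```

Keeping the globals typed as interfaces means the calling code never needs to know whether metrics have been registered, which is the point of the no-op defaults.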
 ### Risks and Mitigations

 #### Client authentication to the binary
@@ -532,6 +641,7 @@ Unit tests to confirm:
   that structs are kept up to date
 - Helper methods properly create `"k8s.io/client-go/rest".Config` from
   `"k8s.io/client-go/pkg/apis/clientauthentication".Cluster` and vice versa
+- Metrics are reported as expected

 Integration (or e2e CLI) tests to confirm:

@@ -546,6 +656,7 @@ Integration (or e2e CLI) tests to confirm:
   + Cert based auth
 - Interactive login flows work
   + TTY forwarding between client and executable works
+- Metrics are reported as expected

 ### Graduation Criteria

@@ -565,6 +676,7 @@ Feature is already in Beta.
 - Address known bugs and add tests to prevent regressions
 - Docs are up-to-date with latest version of APIs
 - Docs describe set of best practices (i.e. do not mutate `kubeconfig`)
+- Sufficient metrics

 Note: this feature set does not need conformance tests because it is inherently
 opt-in on the client-side and it relies on an extra binary to be present.
@@ -674,7 +786,10 @@ The downsides of this approach compared to exec model are:
   authenticator could be behaving incorrectly. For example, if the certificate
   expiration time is constantly increasing upon every authentication to the API, then
   perhaps the exec plugin authenticator is refreshing the certificate credential too
-  often.
+  often. Furthermore, the certificate's age (i.e., the time since the certificate's
+  `NotBefore` field) will be emitted as a metric. If this value is frequently much smaller
+  than the certificate's expected lifetime, then the exec plugin authenticator may be
+  rotating credentials too quickly, which may point to a bug.
 - The total number of calls to the exec plugin would also be helpful to obtain. This
   metric should increase each time a credential is refreshed (see previous bullet point
   for when this happens). If this number increases rapidly, then the exec plugin
@@ -710,16 +825,19 @@ _This section must be completed when targeting beta graduation to a release._

 * **What are the SLIs (Service Level Indicators) an operator can use to
   determine the health of the service?**
-  - [ ] Metrics
-    - Metric name:
-    - [Optional] Aggregation method:
-    - Components exposing the metric:
-  - [x] Other (treat as last resort)
+  - [X] Metrics
+    - Metric name: `rest_client_exec_plugin_ttl_seconds`, `rest_client_exec_plugin_certificate_rotation_age`,
+      `rest_client_exec_plugin_calls`, `rest_client_exec_plugin_failed_calls`
+    - Components exposing the metric: client-go
+  - [ ] Other (treat as last resort)
     - Details:
       - This feature set operates on the client-side.

 * **What are the reasonable SLOs (Service Level Objectives) for the above SLIs?**
-  - This feature set operates on the client-side.
+  - `rest_client_exec_plugin_ttl_seconds`: the expected lifetime of client-side certificates, in seconds
+  - `rest_client_exec_plugin_certificate_rotation_age`: the expected lifetime of client-side certificates, in seconds
+  - `rest_client_exec_plugin_calls`: 1 per the lifetime of the credential returned by the exec plugin
+  - `rest_client_exec_plugin_failed_calls`: 0, or a very low number compared to `rest_client_exec_plugin_calls`

 * **Are there any missing metrics that would be useful to have to improve
   observability of this feature?**