Commit 35617dc

exec credential provider: first pass at metrics design
Signed-off-by: Andrew Keesler <[email protected]>
1 parent ffdf4b8 commit 35617dc

File tree

1 file changed

+125
-7
lines changed
  • keps/sig-auth/541-external-credential-providers

keps/sig-auth/541-external-credential-providers/README.md

Lines changed: 125 additions & 7 deletions
@@ -13,6 +13,7 @@
1313
- [Provider configuration](#provider-configuration)
1414
- [Provider input format](#provider-input-format)
1515
- [Provider output format](#provider-output-format)
16+
- [Metrics](#metrics)
1617
- [Risks and Mitigations](#risks-and-mitigations)
1718
- [Client authentication to the binary](#client-authentication-to-the-binary)
1819
- [Invalid credentials before cache expiry](#invalid-credentials-before-cache-expiry)
@@ -493,6 +494,114 @@ credentials).
493494
PEM format. The certificate must be valid at the time of execution. These
494495
credentials are used for mTLS handshakes.
495496

497+
### Metrics
498+
499+
As discussed [below](#rollout-upgrade-and-rollback-planning), there are 4
500+
primary metrics used by this feature set.
501+
502+
```golang
var (
	execPluginCertTTL = k8smetrics.NewGaugeFunc(
		k8smetrics.GaugeOpts{
			Name: "rest_client_exec_plugin_ttl_seconds",
			Help: "Gauge of the shortest TTL (time-to-live) of the client " +
				"certificate(s) managed by the auth exec plugin. The value " +
				"is in seconds until certificate expiry (negative if " +
				"already expired). If auth exec plugins are unused or manage no " +
				"TLS certificates, the value will be +INF.",
		},
		func() float64 {
			if execPluginCertTTLAdapter.e == nil {
				return math.Inf(1)
			}
			return execPluginCertTTLAdapter.e.Sub(time.Now()).Seconds()
		},
	)

	execPluginCertRotation = k8smetrics.NewHistogram(
		&k8smetrics.HistogramOpts{
			Name: "rest_client_exec_plugin_certificate_rotation_age",
			Help: "Histogram of the number of seconds the last auth exec " +
				"plugin client certificate lived before being rotated. " +
				"If auth exec plugin client certificates are unused, " +
				"histogram will contain no data.",
			// There are three sets of ranges these buckets intend to capture:
			//   - 10-60 minutes: captures a rotation cadence which is
			//     happening too quickly.
			//   - 4 hours - 1 month: captures an ideal rotation cadence.
			//   - 3 months - 4 years: captures a rotation cadence which is
			//     probably too slow or much too slow.
			Buckets: []float64{
				600,       // 10 minutes
				1800,      // 30 minutes
				3600,      // 1 hour
				14400,     // 4 hours
				86400,     // 1 day
				604800,    // 1 week
				2592000,   // 1 month
				7776000,   // 3 months
				15552000,  // 6 months
				31104000,  // 1 year
				124416000, // 4 years
			},
		},
	)

	execPluginCalls = k8smetrics.NewCounterVec(
		&k8smetrics.CounterOpts{
			Name: "rest_client_exec_plugin_calls",
			Help: "Number of calls to an exec plugin.",
		},
		[]string{},
	)

	execPluginFailedCalls = k8smetrics.NewCounterVec(
		&k8smetrics.CounterOpts{
			Name: "rest_client_exec_plugin_failed_calls",
			Help: "Number of failed calls to an exec plugin, partitioned by exit code.",
		},
		[]string{"exitCode"},
	)
)
```
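The TTL gauge's three regimes (no certificate, valid certificate, expired certificate) can be illustrated in isolation. The following is a minimal sketch, not the KEP's implementation: `ttlSeconds` and `exampleTTLs` are hypothetical names, and passing `now` explicitly is an assumption made so the result is deterministic.

```go
package main

import (
	"fmt"
	"math"
	"time"
)

// ttlSeconds mirrors the gauge callback in the design above: +Inf when no
// expiry is tracked, otherwise the seconds from now until expiry (negative
// once the certificate has expired). Hypothetical helper for illustration.
func ttlSeconds(expiry *time.Time, now time.Time) float64 {
	if expiry == nil {
		return math.Inf(1)
	}
	return expiry.Sub(now).Seconds()
}

// exampleTTLs evaluates the three regimes against a fixed clock.
func exampleTTLs() (none, valid, expired float64) {
	now := time.Date(2021, time.January, 1, 0, 0, 0, 0, time.UTC)
	inTwoHours := now.Add(2 * time.Hour)
	halfHourAgo := now.Add(-30 * time.Minute)
	return ttlSeconds(nil, now), ttlSeconds(&inTwoHours, now), ttlSeconds(&halfHourAgo, now)
}

func main() {
	none, valid, expired := exampleTTLs()
	fmt.Println("no certificate tracked:", none)
	fmt.Println("expires in two hours:", valid)
	fmt.Println("expired half an hour ago:", expired)
}
```

Reporting +Inf for the unused case keeps an alert such as "TTL below some threshold" from firing on clients that never configured an exec plugin.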
567+
568+
As is common practice, these metrics will be hidden behind abstract global
569+
variables that will be called by the exec plugin code.
570+
```golang
// DurationMetric is a measurement of some amount of time.
type DurationMetric interface {
	Observe(duration time.Duration)
}

// ExpiryMetric sets some time of expiry. If nil, assume not relevant.
type ExpiryMetric interface {
	Set(expiry *time.Time)
}

// CallsMetric counts calls that take place for a specific exec plugin.
type CallsMetric interface {
	// Increment increments a counter. The provided exitCode is optional,
	// so that this interface can be used for when a call takes place
	// but the exit code does not matter.
	Increment(exitCode int)
}

var (
	// ClientCertExpiry is the expiry time of a client certificate.
	ClientCertExpiry ExpiryMetric = noopExpiry{}
	// ClientCertRotationAge is the age of a certificate that has just been rotated.
	ClientCertRotationAge DurationMetric = noopDuration{}
	// ExecPluginCalls is the number of calls made to an exec plugin.
	ExecPluginCalls CallsMetric = noopCalls{}
	// ExecPluginFailedCalls is the number of calls made to an exec plugin that fail.
	// I.e., when the binary returns a non-zero exit code.
	ExecPluginFailedCalls CallsMetric = noopCalls{}
)
```
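The no-op defaults referenced above are not spelled out in this section. A minimal sketch of how they might look, and of how a real backend could replace them, follows. The interface and variable names come from the design; `countingCalls` and `recordFailures` are hypothetical and exist only for illustration.

```go
package main

import (
	"fmt"
	"time"
)

// Interfaces copied from the design above.
type ExpiryMetric interface{ Set(expiry *time.Time) }
type CallsMetric interface{ Increment(exitCode int) }

// No-op implementations: safe defaults when no metrics backend is wired in.
type noopExpiry struct{}

func (noopExpiry) Set(*time.Time) {}

type noopCalls struct{}

func (noopCalls) Increment(int) {}

var (
	ClientCertExpiry      ExpiryMetric = noopExpiry{}
	ExecPluginFailedCalls CallsMetric  = noopCalls{}
)

// countingCalls is a hypothetical in-memory CallsMetric that tallies
// increments per exit code. A real backend would forward to a counter.
type countingCalls struct{ byExitCode map[int]int }

func (c *countingCalls) Increment(exitCode int) { c.byExitCode[exitCode]++ }

// recordFailures swaps the global for the counting implementation, reports
// the given exit codes through it, and returns the resulting tally.
func recordFailures(codes ...int) map[int]int {
	c := &countingCalls{byExitCode: map[int]int{}}
	ExecPluginFailedCalls = c
	for _, code := range codes {
		ExecPluginFailedCalls.Increment(code)
	}
	return c.byExitCode
}

func main() {
	ExecPluginFailedCalls.Increment(1) // still the no-op default: nothing is recorded
	fmt.Println(recordFailures(1, 1, 2))
}
```

Because callers only see the interface, the exec plugin code can call `Increment` unconditionally, and the cost is a no-op until a consumer registers a real implementation.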
601+
602+
The `"exitCode"` label on the failed-calls metric is an attempt to elucidate the exec
603+
plugin failure mode to the user.
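One way the label value could be derived from the error returned by running the plugin is sketched below, using the standard library's `*exec.ExitError`. The helper names and the choice of `-1` for errors that carry no exit status (for example, a binary that could not be started) are assumptions, not part of this design.

```go
package main

import (
	"errors"
	"fmt"
	"os/exec"
	"strconv"
)

// errNotStarted stands in for a failure that occurs before the plugin
// produces an exit status. Hypothetical, for illustration only.
var errNotStarted = errors.New("plugin could not be started")

// exitCodeOf maps the error from running a plugin to an exit code:
// 0 on success, the process exit code when one is available, and -1
// (an assumed sentinel) for any other kind of failure.
func exitCodeOf(err error) int {
	if err == nil {
		return 0
	}
	var exitErr *exec.ExitError
	if errors.As(err, &exitErr) {
		return exitErr.ExitCode()
	}
	return -1
}

// exitCodeLabel renders the exit code as a metric label value.
func exitCodeLabel(err error) string {
	return strconv.Itoa(exitCodeOf(err))
}

func main() {
	wrapped := fmt.Errorf("refreshing credentials: %w", errNotStarted)
	fmt.Println("success label:", exitCodeLabel(nil))
	fmt.Println("start failure label:", exitCodeLabel(wrapped))
}
```

Using `errors.As` rather than a type assertion means the exit code is still found when the plugin error has been wrapped with additional context.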
604+
496605
### Risks and Mitigations
497606

498607
#### Client authentication to the binary
@@ -532,6 +641,7 @@ Unit tests to confirm:
532641
that structs are kept up to date
533642
- Helper methods properly create `"k8s.io/client-go/rest".Config` from
534643
`"k8s.io/client-go/pkg/apis/clientauthentication".Cluster` and vice versa
644+
- Metrics are reported as expected
535645

536646
Integration (or e2e CLI) tests to confirm:
537647

@@ -546,6 +656,7 @@ Integration (or e2e CLI) tests to confirm:
546656
+ Cert based auth
547657
- Interactive login flows work
548658
+ TTY forwarding between client and executable works
659+
- Metrics are reported as expected
549660

550661
### Graduation Criteria
551662

@@ -565,6 +676,7 @@ Feature is already in Beta.
565676
- Address known bugs and add tests to prevent regressions
566677
- Docs are up-to-date with latest version of APIs
567678
- Docs describe set of best practices (i.e. do not mutate `kubeconfig`)
679+
- Sufficient metrics
568680

569681
Note: this feature set does not need conformance tests because it is inherently
570682
opt-in on the client-side and it relies on an extra binary to be present.
@@ -674,7 +786,10 @@ The downsides of this approach compared to exec model are:
674786
authenticator could be behaving incorrectly. For example, if the certificate
675787
expiration time is constantly increasing upon every authentication to the API, then
676788
perhaps the exec plugin authenticator is refreshing the certificate credential too
677-
often.
789+
often. Furthermore, the certificate's age (i.e., the time since the certificate's
790+
`NotBefore` field) will be emitted as a metric. If this value is frequently much smaller
791+
than the certificate's expected lifetime, then the exec plugin authenticator may be
792+
rotating credentials too quickly, which may point to a bug.
678793
- The total number of calls to the exec plugin would also be helpful to obtain. This
679794
metric should increase each time a credential is refreshed (see previous bullet point
680795
for when this happens). If this number increases rapidly, then the exec plugin
@@ -710,16 +825,19 @@ _This section must be completed when targeting beta graduation to a release._
710825

711826
* **What are the SLIs (Service Level Indicators) an operator can use to
712827
determine the health of the service?**
713-
- [ ] Metrics
714-
- Metric name:
715-
- [Optional] Aggregation method:
716-
- Components exposing the metric:
717-
- [x] Other (treat as last resort)
828+
- [X] Metrics
829+
- Metric name: `rest_client_exec_plugin_ttl_seconds`, `rest_client_exec_plugin_certificate_rotation_age`,
830+
`rest_client_exec_plugin_calls`, `rest_client_exec_plugin_failed_calls`
831+
- Components exposing the metric: client-go
832+
- [ ] Other (treat as last resort)
718833
- Details:
719834
- This feature set operates on the client-side.
720835

721836
* **What are the reasonable SLOs (Service Level Objectives) for the above SLIs?**
722-
- This feature set operates on the client-side.
837+
- `rest_client_exec_plugin_ttl_seconds`: the expected lifetime of client-side certificates, in seconds
838+
- `rest_client_exec_plugin_certificate_rotation_age`: the expected lifetime of client-side certificates, in seconds
839+
- `rest_client_exec_plugin_calls`: 1 per the lifetime of the credential returned by the exec plugin
840+
- `rest_client_exec_plugin_failed_calls`: 0, or a very low number compared to `rest_client_exec_plugin_calls`
723841

724842
* **Are there any missing metrics that would be useful to have to improve
725843
observability of this feature?**
