Proposal: Expose all client-go metrics by default #3202

@ahmetb

Summary

Expose more of the client-side metrics offered by client-go in the controller process by default, similar to what the Kubernetes built-in controllers and apiserver do.

Time and time again, the lack of these metrics in our internal controllers has prevented us from monitoring how long we get stuck in the client-side rate limiter, or what latency the REST client observes for its requests (without writing our own instrumented REST transport wrapper).

Details

client-go currently exposes hooks that a metrics collector can register (https://github.com/kubernetes/client-go/blob/v0.33.0/tools/metrics/metrics.go#L114-L127), which back the following metrics:

| Metric Name | Type | Dimensions | Description |
|---|---|---|---|
| rest_client_request_duration_seconds | Histogram | verb, host | Request latency in seconds. Buckets: [0.005, 0.025, 0.1, 0.25, 0.5, 1.0, 2.0, 4.0, 8.0, 15.0, 30.0, 60.0] |
| rest_client_dns_resolution_duration_seconds | Histogram | host | DNS resolver latency in seconds. Buckets: [0.005, 0.025, 0.1, 0.25, 0.5, 1.0, 2.0, 4.0, 8.0, 15.0, 30.0] |
| rest_client_request_size_bytes | Histogram | verb, host | Request size in bytes. Buckets: [64, 256, 512, 1024, 4096, 16384, 65536, 262144, 1048576, 4194304, 16777216] |
| rest_client_response_size_bytes | Histogram | verb, host | Response size in bytes. Buckets: [64, 256, 512, 1024, 4096, 16384, 65536, 262144, 1048576, 4194304, 16777216] |
| rest_client_rate_limiter_duration_seconds | Histogram | verb, host | Client-side rate limiter latency in seconds. Buckets: [0.005, 0.025, 0.1, 0.25, 0.5, 1.0, 2.0, 4.0, 8.0, 15.0, 30.0, 60.0] |
| rest_client_requests_total | Counter | code, method, host | Number of HTTP requests. |
| rest_client_request_retries_total | Counter | code, verb, host | Number of request retries. |
| rest_client_transport_cache_entries | Gauge | (none) | Number of transport entries in the internal cache. |
| rest_client_transport_create_calls_total | Counter | result | Number of calls to get a new transport, partitioned by the result of the operation. |

Among these, the only metric currently exposed by controller-runtime is rest_client_requests_total. Some other metrics were previously removed (#1587) due to unbounded dimension cardinality; however, with recent overhauls to the metrics, the highest-cardinality dimension left is host (which is presumably just however many apiserver host:port pairs you have).
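To make the hook shape and the bounded host dimension concrete, here is a minimal stdlib-only sketch of a latency observer keyed by (verb, host). The latencyMetric interface is a local stand-in that mirrors the shape of the latency hook in k8s.io/client-go/tools/metrics; hostHistogram, recordRequest and the accessor methods are hypothetical names for illustration, and the toy histogram is not how component-base or controller-runtime actually store samples. In a real controller the observer would presumably be handed to client-go's registration call rather than invoked directly.

```go
package main

import (
	"context"
	"net/url"
	"sort"
	"sync"
	"time"
)

// latencyMetric is a local stand-in mirroring the shape of the latency hook
// in k8s.io/client-go/tools/metrics (assumption: not imported in this sketch).
type latencyMetric interface {
	Observe(ctx context.Context, verb string, u url.URL, latency time.Duration)
}

// latencyBuckets are bucket upper bounds in seconds, copied from the
// rest_client_rate_limiter_duration_seconds row of the table.
var latencyBuckets = []float64{0.005, 0.025, 0.1, 0.25, 0.5, 1.0, 2.0, 4.0, 8.0, 15.0, 30.0, 60.0}

// labelKey holds the two bounded dimensions.
type labelKey struct{ verb, host string }

// hostHistogram is a toy, non-cumulative histogram keyed by (verb, host).
type hostHistogram struct {
	mu     sync.Mutex
	counts map[labelKey][]uint64 // one slot per bucket plus a final +Inf slot
}

func newHostHistogram() *hostHistogram {
	return &hostHistogram{counts: map[labelKey][]uint64{}}
}

// Observe keeps only u.Host as a label value and ignores the URL path.
func (h *hostHistogram) Observe(_ context.Context, verb string, u url.URL, latency time.Duration) {
	key := labelKey{verb: verb, host: u.Host}
	// Index of the first bucket whose upper bound is >= the observation;
	// len(latencyBuckets) means it only fits the +Inf slot.
	i := sort.SearchFloat64s(latencyBuckets, latency.Seconds())
	h.mu.Lock()
	defer h.mu.Unlock()
	if h.counts[key] == nil {
		h.counts[key] = make([]uint64, len(latencyBuckets)+1)
	}
	h.counts[key][i]++
}

// labelSets reports how many distinct (verb, host) pairs have been observed.
func (h *hostHistogram) labelSets() int {
	h.mu.Lock()
	defer h.mu.Unlock()
	return len(h.counts)
}

// count returns the number of observations recorded in one bucket slot.
func (h *hostHistogram) count(verb, host string, bucket int) uint64 {
	h.mu.Lock()
	defer h.mu.Unlock()
	c := h.counts[labelKey{verb: verb, host: host}]
	if bucket < 0 || bucket >= len(c) {
		return 0
	}
	return c[bucket]
}

// recordRequest is a hypothetical helper that feeds one observation to a hook.
func recordRequest(m latencyMetric, verb, rawURL string, millis int64) error {
	u, err := url.Parse(rawURL)
	if err != nil {
		return err
	}
	m.Observe(context.Background(), verb, *u, time.Duration(millis)*time.Millisecond)
	return nil
}

// Compile-time check that the toy histogram satisfies the hook shape.
var _ latencyMetric = (*hostHistogram)(nil)
```

Two requests to different paths on the same apiserver land in the same (verb, host) label set, which is why the number of series grows with the number of apiservers rather than with the number of objects.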

Proposal

  1. controller-runtime starts exposing all of the listed metrics by default (by copying them from k8s.io/component-base).

  2. Existing rest_client_requests_total metric should remain unmodified.

  3. The ExecPluginCalls hook (i.e. the rest_client_exec_plugin_call_total metric) should be left out, as it is rarely, if ever, useful for a controller process.

Considerations

  1. Stability: ALL of the metrics listed above are marked ALPHA in component-base and in the k8s.io Metrics Documentation, presumably for components like kube-scheduler, kube-controller-manager etc. Do we also offer them as stable? Or do we break users later?

  2. Cardinality: Some histogram metrics have 10-12 buckets. In a large cluster setup with 10 apiservers x 4 verbs, a single histogram metric can easily reach 400+ time series (still bounded, though).

  3. Future improvements: client-go offers a url value in one of the hook functions. This url is free of the resource {namespace,name} (i.e. it has bounded cardinality for us!) but is available in only one metric hook 😢. component-base basically uses that url.URL value to find the host label.

    If client-go some day starts providing a url label for every metric, the metrics would be even more useful, but we'd likely need to break the existing ones to adopt it.
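The cardinality estimate in consideration 2 can be worked through with a small sketch. It assumes the Prometheus histogram exposition, where each label combination produces one _bucket series per explicit bucket, one more for the implicit +Inf bucket, plus a _sum and a _count series. The function name seriesPerHistogram is hypothetical, and the 10 apiservers x 4 verbs inputs are the figures from the consideration above, not measurements.

```go
package main

// seriesPerHistogram estimates how many time series a single histogram metric
// produces for a given number of hosts, verbs and explicit buckets.
// Per (host, verb) pair: `buckets` explicit _bucket series, one +Inf _bucket
// series, one _sum series and one _count series.
func seriesPerHistogram(hosts, verbs, buckets int) int {
	const extraSeries = 3 // +Inf bucket, _sum, _count
	return hosts * verbs * (buckets + extraSeries)
}
```

Under these assumptions, seriesPerHistogram(10, 4, 12) evaluates to 600 and seriesPerHistogram(10, 4, 11) to 560, so counting the extra series pushes the total somewhat above 400 while keeping it bounded by the number of apiservers and verbs.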

/kind design
/cc @alvaroaleman

Labels: kind/design, lifecycle/stale