Commit 030fa28
fix: remove ambiguous metrics registry keys (#3987)
### 🖼️ background
the linkerd2 proxy implements, registers, and exports Prometheus metrics using a variety of systems, for historical reasons. new metrics broadly rely upon the official [`prometheus-client`](https://github.com/prometheus/client_rust/) library, whose interfaces are reexported for internal consumption in the [`linkerd_metrics::prom`](https://github.com/linkerd/linkerd2-proxy/blob/main/linkerd/metrics/src/lib.rs#L30-L60) namespace.
other metrics predate this library, however, and rely on the metrics registry implemented in the workspace's [`linkerd-metrics`](https://github.com/linkerd/linkerd2-proxy/tree/main/linkerd/metrics) library.
### 🐛 bug report
* linkerd/linkerd2#13821
linkerd/linkerd2#13821 reported a bug in which duplicate metrics could be observed and subsequently dropped by Prometheus when upgrading the control plane via helm with an existing workload running.
### 🦋 reproduction example
for posterity, i'll note the reproduction steps here.
i used these steps to identify the `2025.3.2` edge release as the affected release. upgrading from `2025.2.3` to `2025.3.1` did not exhibit this behavior. see below for more discussion about the cause.
generate certificates via <https://linkerd.io/2.18/tasks/generate-certificates/>
using these two deployments, courtesy of @GTRekter:
<details>
<summary>**💾 click to expand: app deployment**</summary>
```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: simple-app
  annotations:
    linkerd.io/inject: enabled
---
apiVersion: v1
kind: Service
metadata:
  name: simple-app-v1
  namespace: simple-app
spec:
  selector:
    app: simple-app-v1
    version: v1
  ports:
    - port: 80
      targetPort: 5678
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: simple-app-v1
  namespace: simple-app
spec:
  replicas: 1
  selector:
    matchLabels:
      app: simple-app-v1
      version: v1
  template:
    metadata:
      labels:
        app: simple-app-v1
        version: v1
    spec:
      containers:
        - name: http-app
          image: hashicorp/http-echo:latest
          args:
            - "-text=Simple App v1"
          ports:
            - containerPort: 5678
```
</details>
<details>
<summary>**🤠 click to expand: client deployment**</summary>
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: traffic
  namespace: simple-app
spec:
  replicas: 1
  selector:
    matchLabels:
      app: traffic
  template:
    metadata:
      labels:
        app: traffic
    spec:
      containers:
        - name: traffic
          image: curlimages/curl:latest
          command:
            - /bin/sh
            - -c
            - |
              while true; do
                TIMESTAMP_SEND=$(date '+%Y-%m-%d %H:%M:%S')
                PAYLOAD="{\"timestamp\":\"$TIMESTAMP_SEND\",\"test_id\":\"sniff_me\",\"message\":\"hello-world\"}"
                echo "$TIMESTAMP_SEND - Sending payload: $PAYLOAD"
                RESPONSE=$(curl -s -X POST \
                  -H "Content-Type: application/json" \
                  -d "$PAYLOAD" \
                  http://simple-app-v1.simple-app.svc.cluster.local:80)
                TIMESTAMP_RESPONSE=$(date '+%Y-%m-%d %H:%M:%S')
                echo "$TIMESTAMP_RESPONSE - RESPONSE: $RESPONSE"
                sleep 1
              done
```
</details>
and this prometheus configuration:
<details>
<summary>**🔥 click to expand: prometheus configuration**</summary>
```yaml
global:
  scrape_interval: 10s
scrape_configs:
  - job_name: 'pod'
    scrape_interval: 10s
    static_configs:
      - targets: ['localhost:4191']
        labels:
          group: 'traffic'
```
</details>
we will perform the following steps:
```sh
# install the edge release
# specify the versions we'll migrate between.
export FROM="2025.3.1"
export TO="2025.3.2"

# create a cluster, and add the helm charts.
kind create cluster
helm repo add linkerd-edge https://helm.linkerd.io/edge

# install linkerd's crd's and control plane.
helm install linkerd-crds linkerd-edge/linkerd-crds \
  -n linkerd --create-namespace --version $FROM
helm install linkerd-control-plane \
  -n linkerd \
  --set-file identityTrustAnchorsPEM=cert/ca.crt \
  --set-file identity.issuer.tls.crtPEM=cert/issuer.crt \
  --set-file identity.issuer.tls.keyPEM=cert/issuer.key \
  --version $FROM \
  linkerd-edge/linkerd-control-plane

# install a simple app and a client to drive traffic.
kubectl apply -f duplicate-metrics-simple-app.yml
kubectl apply -f duplicate-metrics-traffic.yml

# bind the traffic pod's metrics port to the host.
# (this command runs in the foreground; use a separate terminal.)
kubectl port-forward -n simple-app deploy/traffic 4191

# start prometheus, begin scraping metrics
prometheus --config.file=prometheus.yml
```
now, open a browser and query `irate(request_total[1m])`.
next, upgrade the control plane:
```sh
helm upgrade linkerd-crds linkerd-edge/linkerd-crds \
  -n linkerd --create-namespace --version $TO
helm upgrade linkerd-control-plane \
  -n linkerd \
  --set-file identityTrustAnchorsPEM=cert/ca.crt \
  --set-file identity.issuer.tls.crtPEM=cert/issuer.crt \
  --set-file identity.issuer.tls.keyPEM=cert/issuer.key \
  --version $TO \
  linkerd-edge/linkerd-control-plane
```
prometheus will begin emitting warnings regarding 34 time series being dropped.
in your browser, querying `irate(request_total[1m])` once more will show that
the rate of requests has stopped, due to the new time series being dropped.
next, restart the workloads...
```sh
kubectl rollout restart deployment -n simple-app simple-app-v1 traffic
```
prometheus warnings will go away, as reported in linkerd/linkerd2#13821.
### 🔍 related changes
* linkerd/linkerd2#13699
* linkerd/linkerd2#13715
in linkerd/linkerd2#13715 and linkerd/linkerd2#13699, we made some changes to the destination controller. from the "Cautions" section of the `2025.3.2` edge release:
> Additionally, this release changes the default for `outbound-transport-mode`
> to `transport-header`, which will result in all traffic between meshed
> proxies flowing on port 4143, rather than using the original destination
> port.
linkerd/linkerd2#13699 (_included in `edge-25.3.1`_) introduced this outbound transport-protocol configuration surface, but maintained the default behavior, while linkerd/linkerd2#13715 (_included in `edge-25.3.2`_) altered the default behavior to route meshed traffic via port 4143.
this is a visible change in behavior that can be observed when upgrading from a version that preceded this change to the mesh. this means that when upgrading across `edge-25.3.2`, such as from the `2025.2.1` to `2025.3.2` versions of the helm charts, or from the `2025.2.3` to the `2025.3.4` versions of the helm charts (_reported upstream in linkerd/linkerd2#13821_), the freshly upgraded destination controller pods will begin routing meshed traffic differently.
i'll state explicitly, _that_ is not a bug! it is, however, an important clue to bear in mind: data plane pods that were started with the previous control plane version, and continue running after the control plane upgrade, will have seen both routing patterns. reporting a duplicate time series for affected metrics indicates that two distinct keys in our metrics registry are being rendered as the same set of labels.
### 🐛 the bug(s)
we define a collection of structures to model labels for inbound and outbound endpoints' metrics:
```rust
// linkerd/app/core/src/metrics.rs

#[derive(Clone, Debug, PartialEq, Eq, Hash)]
pub enum EndpointLabels {
    Inbound(InboundEndpointLabels),
    Outbound(OutboundEndpointLabels),
}

#[derive(Clone, Debug, PartialEq, Eq, Hash)]
pub struct InboundEndpointLabels {
    pub tls: tls::ConditionalServerTls,
    pub authority: Option<http::uri::Authority>,
    pub target_addr: SocketAddr,
    pub policy: RouteAuthzLabels,
}

#[derive(Clone, Debug, PartialEq, Eq, Hash)]
pub struct OutboundEndpointLabels {
    pub server_id: tls::ConditionalClientTls,
    pub authority: Option<http::uri::Authority>,
    pub labels: Option<String>,
    pub zone_locality: OutboundZoneLocality,
    pub target_addr: SocketAddr,
}
```
\- <https://github.com/linkerd/linkerd2-proxy/blob/main/linkerd/app/core/src/metrics.rs>
pay particular attention to the derived `Hash` implementation. note the `tls::ConditionalClientTls` and `tls::ConditionalServerTls` types used in each of these labels. these are used by some of our types like `TlsConnect` to emit prometheus labels, using our legacy system's `FmtLabels` trait:
```rust
// linkerd/app/core/src/transport/labels.rs

impl FmtLabels for TlsConnect<'_> {
    fn fmt_labels(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self.0 {
            Conditional::None(tls::NoClientTls::Disabled) => {
                write!(f, "tls=\"disabled\"")
            }
            Conditional::None(why) => {
                write!(f, "tls=\"no_identity\",no_tls_reason=\"{}\"", why)
            }
            Conditional::Some(tls::ClientTls { server_id, .. }) => {
                write!(f, "tls=\"true\",server_id=\"{}\"", server_id)
            }
        }
    }
}
```
\- <https://github.com/linkerd/linkerd2-proxy/blob/99316f78987975a074ea63453c0dd21546fa4a48/linkerd/app/core/src/transport/labels.rs#L151-L165>
note the `ClientTls` case, which ignores fields in the client tls information:
```rust
// linkerd/tls/src/client.rs

/// A stack parameter that configures a `Client` to establish a TLS connection.
#[derive(Clone, Debug, Eq, PartialEq, Hash)]
pub struct ClientTls {
    pub server_name: ServerName,
    pub server_id: ServerId,
    pub alpn: Option<AlpnProtocols>,
}
```
\- <https://github.com/linkerd/linkerd2-proxy/blob/99316f78987975a074ea63453c0dd21546fa4a48/linkerd/tls/src/client.rs#L20-L26>
this means that there is potential for an identical set of labels to be emitted given two `ClientTls` structures with distinct server names or ALPN protocols. for brevity, i'll elide the equivalent issue with `ServerTls`, and its corresponding `TlsAccept<'_>` label implementation, though it exhibits the same issue.
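to make the failure mode concrete, here is a minimal, self-contained sketch. it is not the proxy's code: `ClientTls` is a simplified stand-in with plain `String` fields, the `HashMap` stands in for the legacy registry, and `fmt_labels`, `render`, and `colliding_keys` are hypothetical helpers with illustrative values. it shows two keys that are distinct under the derived `Hash`/`Eq`, yet render to the same label set.

```rust
use std::collections::HashMap;

// Hypothetical, simplified stand-in for `tls::ClientTls`. The real fields use
// richer types; plain strings keep this sketch self-contained.
#[derive(Clone, Debug, PartialEq, Eq, Hash)]
struct ClientTls {
    server_name: String,
    server_id: String,
    alpn: Option<String>,
}

/// Mirrors the shape of the `TlsConnect` formatter: only `server_id` is
/// written, and the `..` rest pattern silently drops the other fields.
fn fmt_labels(tls: &ClientTls) -> String {
    let ClientTls { server_id, .. } = tls;
    format!("tls=\"true\",server_id=\"{}\"", server_id)
}

/// Renders one line per registry entry, roughly as a scrape would see them.
fn render(registry: &HashMap<ClientTls, u64>) -> Vec<String> {
    let mut lines: Vec<String> = registry
        .iter()
        .map(|(key, count)| format!("request_total{{{}}} {}", fmt_labels(key), count))
        .collect();
    lines.sort();
    lines
}

/// Two keys that differ only in ALPN (illustrative values, not real ones).
fn colliding_keys() -> (ClientTls, ClientTls) {
    let before = ClientTls {
        server_name: "simple-app-v1.example".to_string(),
        server_id: "default.simple-app.example".to_string(),
        alpn: None,
    };
    let after = ClientTls {
        alpn: Some("example-alpn".to_string()),
        ..before.clone()
    };
    (before, after)
}
```

inserting both keys from `colliding_keys()` yields two registry entries, and `render` then produces two lines with identical labels, which is the kind of duplicate series that prometheus drops.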
### 🔨 the fix
this pull request introduces two new types: `ClientTlsLabels` and `ServerTlsLabels`. these continue to implement `Hash`, for use as a key in our metrics registry, and for use in formatting labels.
`ClientTlsLabels` and `ServerTlsLabels` each resemble `ClientTls` and `ServerTls`, respectively, but do not contain any fields that are elided in label formatting, to prevent duplicate metrics from being emitted.
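as a rough sketch of that idea, assuming simplified types: the `ClientTls` below is a stand-in with `String` fields, and the single-field `ClientTlsLabels`, its `From` impl, and the `record` helper are hypothetical illustrations rather than the definitions introduced by this change. the point is only that the registry key carries exactly the fields that end up in the formatted labels.

```rust
use std::collections::HashMap;

// Hypothetical, simplified stand-in for `tls::ClientTls`.
#[derive(Clone, Debug, PartialEq, Eq, Hash)]
struct ClientTls {
    server_name: String,
    server_id: String,
    alpn: Option<String>,
}

// Hypothetical label type: it holds only what the formatter writes, so two
// values that hash differently can never format identically.
#[derive(Clone, Debug, PartialEq, Eq, Hash)]
struct ClientTlsLabels {
    server_id: String,
}

impl From<&ClientTls> for ClientTlsLabels {
    fn from(tls: &ClientTls) -> Self {
        Self {
            server_id: tls.server_id.clone(),
        }
    }
}

/// Formats every field of the label type; nothing is elided.
fn fmt_labels(labels: &ClientTlsLabels) -> String {
    let ClientTlsLabels { server_id } = labels;
    format!("tls=\"true\",server_id=\"{}\"", server_id)
}

/// Increments the counter keyed on the labels, not on the full TLS params.
fn record(registry: &mut HashMap<ClientTlsLabels, u64>, tls: &ClientTls) {
    *registry.entry(ClientTlsLabels::from(tls)).or_insert(0) += 1;
}
```

with this keying, two connections that differ only in server name or ALPN land in the same registry entry, so only one series is emitted for them.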
relatedly, #3988 audits our existing `FmtLabels` implementations and makes use of exhaustive bindings, to prevent this category of problem in the short-term future. ideally, we might eventually consider replacing the metrics interfaces in `linkerd-metrics`, but that is strictly kept out-of-scope for the purposes of this particular fix.
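for readers unfamiliar with the term, an "exhaustive binding" here means destructuring a struct without the `..` rest pattern. the sketch below uses a hypothetical single-field `ClientTlsLabels` and `Display` in place of our `FmtLabels` trait, purely to stay self-contained; it is an illustration of the pattern, not the code from #3988.

```rust
use std::fmt;

// Hypothetical label type used only for this illustration.
struct ClientTlsLabels {
    server_id: String,
}

impl fmt::Display for ClientTlsLabels {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        // Exhaustive destructuring: there is no `..` here, so adding a field
        // to `ClientTlsLabels` makes this pattern fail to compile until the
        // new field is either formatted or explicitly ignored (`field: _`).
        let Self { server_id } = self;
        write!(f, "tls=\"true\",server_id=\"{}\"", server_id)
    }
}
```

the compiler then forces a decision about every new field at the formatting site, rather than letting it silently become part of the key while being absent from the labels.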
---
* fix: do not key transport metrics registry on `ClientTls`
Signed-off-by: katelyn martin <[email protected]>
* fix: do not key transport metrics registry on `ServerTls`
Signed-off-by: katelyn martin <[email protected]>
---------
Signed-off-by: katelyn martin <[email protected]>

1 parent 085be99, commit 030fa28
File tree (17 files changed, +143 -56 lines changed):
- linkerd
- app
- admin/src
- core/src
- transport
- inbound/src
- http
- metrics
- policy
- outbound/src
- http
- endpoint
- opaq
- tls
- tls/src