Skip to content

Conversation

@cratelyn
Copy link
Member

@cratelyn cratelyn commented Apr 9, 2025

this commit fixes a bug discovered by @alpeb, which was introduced in proxy
v2.288.0.

The associated metric is outbound_http_route_request_statuses_total:

$ linkerd dg proxy-metrics -n booksapp deploy/webapp|rg outbound_http_route_request_statuses_total.*authors
outbound_http_route_request_statuses_total{parent_group="core",parent_kind="Service",parent_namespace="booksapp",parent_name="authors",parent_port="7001",parent_section_name="",route_group="",route_kind="default",route_namespace="",route_name="http",hostname="",http_status="204",error=""} 5
outbound_http_route_request_statuses_total{parent_group="core",parent_kind="Service",parent_namespace="booksapp",parent_name="authors",parent_port="7001",parent_section_name="",route_group="",route_kind="default",route_namespace="",route_name="http",hostname="",http_status="201",error="UNKNOWN"} 5
outbound_http_route_request_statuses_total{parent_group="core",parent_kind="Service",parent_namespace="booksapp",parent_name="authors",parent_port="7001",parent_section_name="",route_group="",route_kind="default",route_namespace="",route_name="http",hostname="",http_status="200",error="UNKNOWN"} 10

The problem was introduced in edge-25.3.4, with the proxy v2.288.0.
Before that the metrics looked like:

$ linkerd dg proxy-metrics -n booksapp deploy/webapp|rg outbound_http_route_request_statuses_total.*authors
outbound_http_route_request_statuses_total{parent_group="core",parent_kind="Service",parent_namespace="booksapp",parent_name="authors",parent_port="7001",parent_section_name="",route_group="",route_kind="default",route_namespace="",route_name="http",hostname="",http_status="200",error=""} 193
outbound_http_route_request_statuses_total{parent_group="core",parent_kind="Service",parent_namespace="booksapp",parent_name="authors",parent_port="7001",parent_section_name="",route_group="",route_kind="default",route_namespace="",route_name="http",hostname="",http_status="204",error=""} 96
outbound_http_route_request_statuses_total{parent_group="core",parent_kind="Service",parent_namespace="booksapp",parent_name="authors",parent_port="7001",parent_section_name="",route_group="",route_kind="default",route_namespace="",route_name="http",hostname="",http_status="201",error=""} 96

So the difference is the non-empty value for error=UNKNOWN even
when https_status is 2xx, which linkerd viz stat-outbound
interprets as failed requests.

in #3086 we introduced a suite of route- and backend-level metrics. that
subsystem contains a body middleware that will report itself as having
reached the end-of-stream by delegating directly down to its inner
body's is_end_stream() hint.

this is roughly correct, but is slightly distinct from the actual invariant: a
linkerd_http_prom::record_response::ResponseBody<B> must call its
end_stream helper to classify the outcome and increment the corresponding
time series in the outbound_http_route_request_statuses_total metric family.

in #3504 we upgraded our hyper dependency. while doing so, we neglected to
include a call to end_stream if a data frame is yielded and the inner body
reports itself as having reached the end-of-stream.

this meant that instrumented bodies would be polled until the end is reached,
but were being dropped before a None was encountered.

this commit fixes this issue in two ways, to be defensive:

-        let res =
-            futures::ready!(this.inner.as_mut().poll_data(cx)).map(|res| res.map_err(Into::into));
-        if let Some(Err(error)) = res.as_ref() {
-            end_stream(this.state, Err(error));
-        } else if (*this.inner).is_end_stream() {
-            end_stream(this.state, Ok(None));
+
+        // Poll the inner body for the next frame.
+        let poll = this.inner.as_mut().poll_frame(cx);
+        let frame = futures::ready!(poll).map(|res| res.map_err(Error::from));
+
+        match &frame {
+            Some(Ok(frame)) => {
+                if let trls @ Some(_) = frame.trailers_ref() {
+                    end_stream(this.state, Ok(trls));
+                }
+            }
+            Some(Err(error)) => end_stream(this.state, Err(error)),
+            None => end_stream(this.state, Ok(None)),
  • rather than delegating to the inner <B as Body>::is_end_stream() method,
    report the end-of-stream being reached by inspecting whether or not the inner
    response state has been taken. this is the state that directly indicates
    whether or not the ResponseBody<B> middleware is finished.

this adds a linkerd-mock-http-body development dependency, so we can use this
mock body type in the outbound proxy's unit tests. additional test coverage for
http and grpc routes' metrics is added to demonstrate that this fix works as
expected.

X-ref: #3504
X-ref: #3086
X-ref: linkerd/linkerd2#8733

cratelyn added 4 commits April 9, 2025 14:07
this adds a development dependency, so we can use this mock body type in
the outbound proxy's unit tests.

Signed-off-by: katelyn martin <[email protected]>
this commit fixes a bug discovered by @alpeb, which was introduced in
proxy v2.288.0.

> The associated metric is `outbound_http_route_request_statuses_total`:
>
> ```
> $ linkerd dg proxy-metrics -n booksapp deploy/webapp|rg outbound_http_route_request_statuses_total.*authors
> outbound_http_route_request_statuses_total{parent_group="core",parent_kind="Service",parent_namespace="booksapp",parent_name="authors",parent_port="7001",parent_section_name="",route_group="",route_kind="default",route_namespace="",route_name="http",hostname="",http_status="204",error=""} 5
> outbound_http_route_request_statuses_total{parent_group="core",parent_kind="Service",parent_namespace="booksapp",parent_name="authors",parent_port="7001",parent_section_name="",route_group="",route_kind="default",route_namespace="",route_name="http",hostname="",http_status="201",error="UNKNOWN"} 5
> outbound_http_route_request_statuses_total{parent_group="core",parent_kind="Service",parent_namespace="booksapp",parent_name="authors",parent_port="7001",parent_section_name="",route_group="",route_kind="default",route_namespace="",route_name="http",hostname="",http_status="200",error="UNKNOWN"} 10
> ```
>
> The problem was introduced in `edge-25.3.4`, with the proxy `v2.288.0`.
> Before that the metrics looked like:
>
> ```
> $ linkerd dg proxy-metrics -n booksapp deploy/webapp|rg outbound_http_route_request_statuses_total.*authors
> outbound_http_route_request_statuses_total{parent_group="core",parent_kind="Service",parent_namespace="booksapp",parent_name="authors",parent_port="7001",parent_section_name="",route_group="",route_kind="default",route_namespace="",route_name="http",hostname="",http_status="200",error=""} 193
> outbound_http_route_request_statuses_total{parent_group="core",parent_kind="Service",parent_namespace="booksapp",parent_name="authors",parent_port="7001",parent_section_name="",route_group="",route_kind="default",route_namespace="",route_name="http",hostname="",http_status="204",error=""} 96
> outbound_http_route_request_statuses_total{parent_group="core",parent_kind="Service",parent_namespace="booksapp",parent_name="authors",parent_port="7001",parent_section_name="",route_group="",route_kind="default",route_namespace="",route_name="http",hostname="",http_status="201",error=""} 96
> ```
>
> So the difference is the non-empty value for `error=UNKNOWN` even
> when `https_status` is 2xx, which `linkerd viz stat-outbound`
> interprets as failed requests.

in #3086 we introduced a suite of route- and backend-level metrics. that
subsystem contains a body middleware that will report itself as having
reached the end-of-stream by delegating directly down to its inner
body's `is_end_stream()` hint.

this is roughly correct, but is slightly distinct from the actual
invariant: a `linkerd_http_prom::record_response::ResponseBody<B>` must
call its `end_stream` helper to classify the outcome and increment the
corresponding time series in the
`outbound_http_route_request_statuses_total` metric family.

in #3504 we upgraded our hyper dependency. while doing so, we neglected
to include a call to `end_stream` if a data frame is yielded and the
inner body reports itself as having reached the end-of-stream.

this meant that instrumented bodies would be polled until the end is
reached, but were being dropped before a `None` was encountered.

this commit fixes this issue in two ways, to be defensive:

* invoke `end_stream()` if a non-trailers frame is yielded, and the
  inner body now reports itself as having ended. this restores the
  behavior in place prior to #3504. see the relevant component of that
  diff, here:
  <https://github.com/linkerd/linkerd2-proxy/pull/3504/files#diff-45d0bc344f76c111551a8eaf5d3f0e0c22ee6e6836a626e46402a6ae3cbc0035L262-R274>

* rather than delegating to the inner `<B as Body>::is_end_stream()`
  method, report the end-of-stream being reached by inspecting whether
  or not the inner response state has been taken. this is the state that
  directly indicates whether or not the `ResponseBody<B>` middleware is
  finished.

X-ref: #3504
X-ref: #3086
X-ref: linkerd/linkerd2#8733
Signed-off-by: katelyn martin <[email protected]>
@cratelyn cratelyn requested a review from a team as a code owner April 9, 2025 18:42
@cratelyn cratelyn merged commit 6426c38 into main Apr 9, 2025
15 checks passed
@cratelyn cratelyn deleted the kate/http-prom.fix-eos-unknown-error branch April 9, 2025 19:30
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants