Conversation

krinkinmu
Contributor

@krinkinmu krinkinmu commented Oct 24, 2024

Commit Message:
This change fixes #28826. Additional discussion for context can be found in proxy-wasm/proxy-wasm-cpp-host#423.

The issue reported in #28826 happens when a proxy-wasm plugin calls proxy_send_local_response during both HTTP request processing and HTTP response processing.

This happens because, in an attempt to mitigate a use-after-free issue (see #23049), we added logic to proxy-wasm that avoids calling sendLocalReply multiple times.

So now when a proxy-wasm plugin calls proxy_send_local_response, only the first call results in sendLocalReply, while all subsequent calls get ignored. At the same time, when a proxy-wasm plugin calls proxy_send_local_response (because it's used to report an error in the plugin), proxy-wasm also stops iteration.

During HTTP request processing this leads to the following chain of events:

  1. During request processing the proxy-wasm plugin calls proxy_send_local_response
  2. proxy_send_local_response calls sendLocalReply, which schedules the local reply to be processed later through the filter chain
  3. The request processing filter chain gets aborted and Envoy sends the previously created local reply through the filter chain
  4. The proxy-wasm plugin gets called to process the response it generated and calls proxy_send_local_response again
  5. proxy_send_local_response does not call sendLocalReply, because proxy-wasm currently prevents multiple calls to sendLocalReply
  6. proxy-wasm stops iteration

So in the end, filter chain iteration is stopped for the response, and because proxy_send_local_response does not actually call sendLocalReply, we don't send another locally generated response either.

I think we can do slightly better and close the connection in this case. This change includes the following parts:

  1. Partial rollback of #23049 ("fix sendLocalResponse in wasm")
  2. Tests covering this case and some others, using the actual FilterManager.

The most important question is: why is it safe to roll back #23049 now?

The reason it's safe is that since the introduction of prepareLocalReplyViaFilterChain in #24367, calling sendLocalReply multiple times is safe - that PR addresses the issue in a generic way for all plugins, so a proxy-wasm-specific fix is no longer needed.

On top of being safe, there are additional benefits to making this change:

  1. We don't end up with a stuck connection in case of errors, which is slightly better
  2. We remove a failure mode from proxy_send_local_response that was introduced in #23049 ("fix sendLocalResponse in wasm") - which is good, because proxy-wasm plugins don't have a good fallback when proxy_send_local_response fails.

Finally, why replace the current mocks with a real FilterManager?

The mock implementation of sendLocalReply works fine for tests that just need to assert that sendLocalReply gets called. In this case, however, we rely on the fact that it's safe to call sendLocalReply multiple times and that it will do the right thing, and we want to assert that the connection gets closed in the end. That cannot be tested by just checking that sendLocalReply gets called, or by relying on a simplistic mock implementation of sendLocalReply.

Additional Description:
Risk Level: low
Testing: Manually, by reproducing the case reported in #28826. I also added new unit tests and verified that they pass and aren't flaky:

bazel test --runs_per_test=1000 //test/extensions/common/wasm:all --config=docker-clang

Docs Changes: N/A
Release Notes: N/A
Platform Specific Features: N/A
Fixes #28826

@krinkinmu krinkinmu requested a review from kyessenov as a code owner October 24, 2024 14:55
@krinkinmu
Contributor Author

+cc folks for visibility and additional comments: @PiotrSikora @johnlanni @wbpcode @mpwarres @martijneken

@krinkinmu
Contributor Author

krinkinmu commented Oct 24, 2024

Sorry, I will remove the logic that closes connections during panic from this PR for now.

I still think we can do better and not hang the connection during a panic, but I didn't think it through well enough and the current change does not do what it should.

@krinkinmu
Contributor Author

Prechecks failed with seemingly unrelated issue:

WARNING: Download from https://downloads.apache.org/kafka/3.8.0/kafka_2.13-3.8.0.tgz failed: class java.io.IOException connect timed out
ERROR: An error occurred during the fetch of repository 'kafka_server_binary':
   Traceback (most recent call last):
	File "/build/bazel_root/base/external/bazel_tools/tools/build_defs/repo/http.bzl", line 132, column 45, in _http_archive_impl
		download_info = ctx.download_and_extract(
Error in download_and_extract: java.io.IOException: Error downloading [https://downloads.apache.org/kafka/3.8.0/kafka_2.13-3.8.0.tgz] to /build/bazel_root/base/external/kafka_server_binary/temp2658821464613602825/kafka_2.13-3.8.0.tgz: connect timed out
ERROR: /source/WORKSPACE:17:19: fetching http_archive rule //external:kafka_server_binary: Traceback (most recent call last):
	File "/build/bazel_root/base/external/bazel_tools/tools/build_defs/repo/http.bzl", line 132, column 45, in _http_archive_impl
		download_info = ctx.download_and_extract(
Error in download_and_extract: java.io.IOException: Error downloading [https://downloads.apache.org/kafka/3.8.0/kafka_2.13-3.8.0.tgz] to /build/bazel_root/base/external/kafka_server_binary/temp2658821464613602825/kafka_2.13-3.8.0.tgz: connect timed out

I checked that the archive Bazel is trying to download is generally available, so at first glance this appears to be a transient issue. I want to re-run the tests, but will wait until the other checks finish before doing that.

@krinkinmu
Contributor Author

/retest

@krinkinmu
Contributor Author

This happened again, re-running the tests for now, since the issue with Kafka downloads is not related to the content of this change (I also looked at the precheck status dashboard and it seems this failure mode is not exactly new; other PRs have also encountered it).

@krinkinmu
Contributor Author

/retest

Contributor

@kyessenov kyessenov left a comment

Generally LGTM, and thanks for the background explanation. Just a couple of questions in the test.

code = TestEnvironment::readFileToStringForTest(TestEnvironment::substitute(absl::StrCat(
"{{ test_rundir }}/test/extensions/common/wasm/test_data/test_context_cpp.wasm")));
} else {
// Skip the Null plugin
Contributor

Why skip null plugin for double local reply?

Contributor Author

It was a case of temporary madness; we don't need to skip the null VM in tests that don't have a crash. Fixed now. Thank you for spotting it!

EXPECT_CALL(filter_manager_callbacks_, encodeHeaders(_, _));
EXPECT_CALL(filter_manager_callbacks_, endStream());
filter_manager_->decodeHeaders(*request_headers_, false);
filter_manager_->decodeData(request_data_, false);
Contributor

I think it's more realistic to call decodeData with end_stream set to true.

Contributor Author

I changed it, but TBH I don't understand why. If you don't mind, can you elaborate a bit, just for my education?

Contributor

It indicates the end of a stream. Otherwise, the stream is left hanging, so it's taking some error path on destruction, which is useful to exercise, but not the canonical code path.

@@ -1640,10 +1629,8 @@ WasmResult Context::sendLocalResponse(uint32_t response_code, std::string_view b
}
decoder_callbacks_->sendLocalReply(static_cast<Envoy::Http::Code>(response_code), body_text,
Contributor

@mpwarres The spec is a bit unclear about the semantics of send_local_reply. It looks like it is delayed in the Envoy implementation, i.e. nothing happens until after the callback completes, and then the response callbacks are invoked later. Is this consistent with your thinking?

Contributor Author

FWIW, I think your understanding is correct; it's delayed in Envoy (actually in two places for some code paths: here in context.cc and in prepareLocalReplyViaFilterChain). It has been the behavior ever since proxy-wasm was upstreamed (see 319a9a6).

Calling send_local_reply synchronously may cause re-entering the proxy-wasm plugin, which does not seem to be a safe thing to do in general (even if some plugins may be fine with it), though I cannot come up with a specific example where it would cause an issue off the top of my head.

Contributor

I'll follow up with a spec update, because it is ambiguous as currently specified.

@kyessenov kyessenov merged commit 68c4d96 into envoyproxy:main Oct 31, 2024
20 checks passed
kyessenov added a commit to proxy-wasm/spec that referenced this pull request Oct 31, 2024
Successfully merging this pull request may close these issues.

WASM Calling proxy_send_local_response twice will stuck remote http client(e.g. curl) forever until timeout or interrupted