-
Notifications
You must be signed in to change notification settings - Fork 5.1k
[proxy-wasm] Prevent stuck connections in case of multiple local replies #36809
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
+cc folks for visibility @PiotrSikora @johnlanni @wbpcode @mpwarres @martijneken and additional comments |
Sorry, I will remove the logic that closes connections during panic from this PR for now. I still think we can do better and not hang connection during panic, but I didn't really think it through well enough and the current change does not do what it should. |
This change fixes envoyproxy#28826. Some additional discussions for context can be found in proxy-wasm/proxy-wasm-cpp-host#423. The issue reported in envoyproxy#28826 happens when proxy-wasm plugin calls proxy_send_local_response during the HTTP request proessing and HTTP response processing. This happens because in attempt to mitigate a use-after-free issue (see envoyproxy#23049) we added logic to proxy-wasm that avoids calling sendLocalReply multiple times. So now when proxy-wasm plugin calls proxy_send_local_response only the first call will result in sendLocalReply, while all subsequent calls will get ignored. At the same time, when proxy-wasm plugins call proxy_send_local_response, because it's used to report an error in the plugin, proxy-wasm also stops iteration. During HTTP request processing this leads to the following chain of events: 1. During request proxy-wasm plugin calls proxy_send_local_response 2. proxy_send_local_response calls sendLocalReply, which schedules the local reply to be processed later through the filter chain 3. Request processing filter chain gets aborted and Envoy sends the previous created local reply though the filter chain 4. Proxy-wasm plugin gets called to process the response it generated and it calls proxy_send_local_response 5. proxy_send_local_response **does not** call sendLocalReply, because proxy-wasm prevents multiple calls to sendLocalReply currently 6. proxy-wasm stops iteration So in the end the filter chain iteration is stopped for the response and because proxy_send_local_respose does not actually call sendLocalReply we don't send another locally generated response either. I think we can do slightly better and close the connection in this case. This change includes the following parts: 1. Partial rollback of envoyproxy#23049 2. Tests covering this case and some other using the actual FilterManager. The most important question is why rolling back envoyproxy#23049 now is safe? The reason why it's safe, is that since introduction of prepareLocalReplyViaFilterChain in envoyproxy#24367, calling sendLocalReply multiple times is safe - that PR basically address the issue in a generic way for all the plugins, so a proxy-wasm specific fix is not needed anymore. On top of being safe, there are additional benefits to making this change: 1. We don't end up with a stuck connection in case of errors, which is slightly better 2. We remove a failure mode from proxy_send_local_response that was introduced in envoyproxy#23049 - which is good, because proxy-wasm plugins don't have a good fallback when proxy_send_local_response is failing. Finally, why replace the current mocks with a real FilterManager? Mock implementation of sendLocalReply works fine for tests that just need to assert that sendLocalReply gets called. However, in this case we rely on the fact that it's safe to call sendLocalReply multiple times and it will do the right thing and we want to assert that the connection will get closed in the end - that cannot be tested by just checking that the sendLocalReply gets called or by relying on a simplistic mock implementation of sendLocalReply. Signed-off-by: Mikhail Krinkin <[email protected]>
39251ed
to
1c1a86b
Compare
Prechecks failed with seemingly unrelated issue:
I checked that the archive Bazel is trying to download is in general available, so it appears at the first glance to be a transient issue, so I want to re-run the tests, but will wait until other checks finish before doing that. |
/retest |
Signed-off-by: Mikhail Krinkin <[email protected]>
067e5ff
to
d44e3d4
Compare
This happened again, re-runing the tests for now, since the issue with kafka downloads is not related to the content of this change (I also looked at the precheck status dashboard and it seems that this failure mode is not exactly new and other PRs also encountered it). |
/retest |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Generally LGTM, and thanks for the background explanation. Just a couple of questions in the test.
code = TestEnvironment::readFileToStringForTest(TestEnvironment::substitute(absl::StrCat( | ||
"{{ test_rundir }}/test/extensions/common/wasm/test_data/test_context_cpp.wasm"))); | ||
} else { | ||
// Skip the Null plugin |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why skip null plugin for double local reply?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It was a case of temporary madness, we don't need to skip null VM in tests that don't have a crash. Fixed now. Thank you for spotting it!
EXPECT_CALL(filter_manager_callbacks_, encodeHeaders(_, _)); | ||
EXPECT_CALL(filter_manager_callbacks_, endStream()); | ||
filter_manager_->decodeHeaders(*request_headers_, false); | ||
filter_manager_->decodeData(request_data_, false); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's more realistic to call decodeData with end_stream set to true.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I changed it, but TBH I don't understand why? If you don't mind, can you elaborate a bit, just for my education?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It indicates the end of a stream. Otherwise, the stream is left hanging, so it's taking some error path on destruction, which is useful to exercise, but not the canonical code path.
@@ -1640,10 +1629,8 @@ WasmResult Context::sendLocalResponse(uint32_t response_code, std::string_view b | |||
} | |||
decoder_callbacks_->sendLocalReply(static_cast<Envoy::Http::Code>(response_code), body_text, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@mpwarres The spec is a bit unclear about the semantics of send_local_reply. It looks like it is delayed in Envoy implementation, e.g. nothing happens until after the callback completes, and then response callbacks are invoked later. Is this consistent with your thinking?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FWIW, I think your understanding is correct, it's delayed in Envoy (actually in two places for some code paths, here in the context.cc and in prepareLocalReplyViaFilterChain). It was the behavior ever since proxy-wasm was upstreamed (see 319a9a6).
Calling send_local_reply synchronously may cause re-entering proxy-wasm plugin, which does not seem to be a safe thing to do generally (even if some plugins may be fine with that), though I cannot come up with any specific example where it would cause an issue from the top of my head.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll follow up with a spec update because it is ambiguous at it is specified.
Signed-off-by: Mikhail Krinkin <[email protected]>
Signed-off-by: Mikhail Krinkin <[email protected]>
Per discussion in envoyproxy/envoy#36809
Per discussion in envoyproxy/envoy#36809 Signed-off-by: Kuat Yessenov <[email protected]>
Commit Message:
This change fixes #28826. Some additional discussions for context can be found in proxy-wasm/proxy-wasm-cpp-host#423.
The issue reported in #28826 happens when proxy-wasm plugin calls proxy_send_local_response during the HTTP request proessing and HTTP response processing.
This happens because in attempt to mitigate a use-after-free issue (see #23049) we added logic to proxy-wasm that avoids calling sendLocalReply multiple times.
So now when proxy-wasm plugin calls proxy_send_local_response only the first call will result in sendLocalReply, while all subsequent calls will get ignored. At the same time, when proxy-wasm plugins call proxy_send_local_response, because it's used to report an error in the plugin, proxy-wasm also stops iteration.
During HTTP request processing this leads to the following chain of events:
So in the end the filter chain iteration is stopped for the response and because proxy_send_local_respose does not actually call sendLocalReply we don't send another locally generated response either.
I think we can do slightly better and close the connection in this case. This change includes the following parts:
The most important question is why rolling back
#23049 now is safe?
The reason why it's safe, is that since introduction of prepareLocalReplyViaFilterChain in
#24367, calling sendLocalReply multiple times is safe - that PR basically address the issue in a generic way for all the plugins, so a proxy-wasm specific fix is not needed anymore.
On top of being safe, there are additional benefits to making this change:
Finally, why replace the current mocks with a real FilterManager?
Mock implementation of sendLocalReply works fine for tests that just need to assert that sendLocalReply gets called. However, in this case we rely on the fact that it's safe to call sendLocalReply multiple times and it will do the right thing and we want to assert that the connection will get closed in the end - that cannot be tested by just checking that the sendLocalReply gets called or by relying on a simplistic mock implementation of sendLocalReply.
Additional Description:
Risk Level: low
Testing: Manually, by reproducing the case reported in #28826. I also added new unit tests and verified that they pass and aren't flaky:
Docs Changes: N/A
Release Notes: N/A
Platform Specific Features: N/A
Fixes #28826