-
Notifications
You must be signed in to change notification settings - Fork 487
Frontend peek sequencing -- statement logging #34305
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Frontend peek sequencing -- statement logging #34305
Conversation
502d232 to
5790f83
Compare
bb49ead to
a12d26b
Compare
| /// executing) the state necessary to perform this work is bundled in | ||
| /// the `ExecuteContextExtra` object (today, it is simply empty). | ||
| /// executing). The state necessary to perform this work is bundled in | ||
| /// the `ExecuteContextExtra` object. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(The functionality that this comment covers is not actually being modified. The comment is being modified because it was extremely outdated.)
| ) { | ||
| let ws_id = self.controller.install_compute_watch_set(objects, t); | ||
| ) -> Result<(), CollectionLookupError> { | ||
| let ws_id = self.controller.install_compute_watch_set(objects, t)?; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(Had to make these fallible for the same reason as a lot of things that the frontend peek sequencing calls have to be fallible: the query's dependencies might disappear during sequencing at any moment. This was not the case in the old peek sequencing, when the sequencing itself was occupying the main coordinator task.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I refactored large parts of statement logging into a more modular form, so that the old and the new peek sequencing can share a lot of code. I moved the shared code mostly to the other statement_logging.rs, which is outside coord, to make it clear that it doesn't need a Coordinator self.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As noted on the other statement_logging.rs (which is under coord), much of the code in this statement_logging.rs is code that got factored out from the original statement logging code, and will now be shared between the frontend's statement logging and the old statement logging.
| compute_instance: ComputeInstanceId, | ||
| ) -> Result<&mut mz_compute_client::controller::instance::Client<Timestamp>, InstanceMissing> | ||
| { | ||
| ) -> Result<mz_compute_client::controller::instance::Client<Timestamp>, InstanceMissing> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(Was made to not return a &mut to be able to call call_coordinator between ensure_compute_instance_client and client.peek in implement_fast_path_peek_plan.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(We had very little testing for statement logging metrics, so adding some here.)
a12d26b to
50ebb84
Compare
|
There is a Edit: It's just https://github.com/MaterializeInc/database-issues/issues/7304, see below. |
|
Ok, the stack trace looks very much like it happened from a connection drop, so then it's just https://github.com/MaterializeInc/database-issues/issues/7304. I'm planning to attend to that problem today, in a separate PR. |
50ebb84 to
907caa9
Compare
80730c5 to
c856b54
Compare
c856b54 to
fbc92be
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To clarify my knowledge, for the initial peek query, we're recording that attempt in the statement log and if we fallback, we record the fallback attempt as well? This makes sense to me.
Code looks good!
| ); | ||
| return Ok(None); | ||
| } | ||
| // This must happen BEFORE statement logging setup to avoid orphaned execution records. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
From this comment, it's not clear why doing this before statement logging setup avoids orphaned execution records.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The intention of the new frontend peek sequencing code is to only begin logging a statement once we are sure that we won't fall back to the old sequencing (i.e., that we have a peek; the new sequencing supports only peeks).
This is not yet true as of this PR, but there are several follow-up PRs, which make this true. Those PRs have also been approved, and I'm about to merge them, right after merging this one.
So, this comment is on top of the code block that does the only allowed bailouts (i.e., when the statement is not a peek). And then we start the statement logging, and then we definitely run the frontend peek sequencing all the way, without falling back to the old sequencing. Does this extra context, with the follow-up PRs make the comment clear?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah that makes sense to me!
This implements the largest outstanding item from the frontend peek sequencing work stream (https://github.com/MaterializeInc/database-issues/issues/9593): Adding statement logging.
I had to add a number of new Coordinator Commands. When statement logging is on, this hurts performance by tens of percents (see Nightly benchmarks) with the current CI default sampling/throttling settings (which is I think the same as the cloud defaults, but for self-managed it's completely turned off by default).
However, when statement logging is either turned off or heavily throttled, then it has a smaller impact. (If I disable statement logging, then all the Nightly benchmarks are green.) In that case, the only thing slowing us down is the
RegisterFrontendPeekCommand, which is sent for every fast-path peek, regardless of whether the statement is being logged or not. This is because it's needed not just for statement logging, but also for query cancellation.This PR also addresses a large part of the "Query cancellation" todo item of https://github.com/MaterializeInc/database-issues/issues/9593, because the code overlaps a lot with statement logging, see
RegisterFrontendPeek.(The PR doesn't address the old "execute context ... dropped without being properly retired" bug https://github.com/MaterializeInc/database-issues/issues/7304, which is planned as a follow-up. We'll probably need to solve this before widely rolling out the frontend peek sequencing, because this problem will probably become more common due to
awaiting more in the frontend task, and thus having more time for a drop to come at the wrong moment. However, I haven't seen evidence of this in CI. I did have some CI fails during development that looked like they could be this issue, but they turned out to be probably due to different bugs during the frontend peek sequencing development, which I have fixed since then. Edit: There was a related CI fail in the meantime, see below.)I did several Nightly runs with close to final states of the code, and they looked clean:
As mentioned above, we'll have to just accept the benchmark regressions, because statement logging is inherently more work. If a user needs very high QPS, they should set stronger sampling/throttling. We might also decide to revisit the default throttling in cloud.
(Before merging, I'll remove the commit that turns on frontend peek sequencing in CI, so that we keep testing mainly the old peek sequencing for now, e.g. that I have not broken statement logging there.)