
Conversation

sulami (Contributor) commented Aug 24, 2025

Does your PR solve an issue?

First of all, this is not mergeable as-is; it's a request for feedback/advice.

We're using sqlx with AWS Aurora, and have noticed an issue with Aurora's "zero-downtime" restarts, which wipe prepared statements but preserve TCP connections. In this scenario, all existing connections are effectively poisoned without notice, breaking queries until the connections eventually age out of the pool. This has forced us to disable query caching, at a significant performance cost.

I've had a look at how ActiveRecord, our other main stack, handles this: it wipes the client-side cache and retries when it gets this error back. While retrying right away would be nicer, since it rescues the in-flight query, getting the connection back into a functional state for future queries is the next best thing, and rather easy to implement.

The main problem is testing the change. I've included a second commit with a working but flawed integration-test setup. It's flawed because it needs to expose additional public interfaces so the test can delete the server-side prepared statements while leaving the client-side cache intact, something the library API should not allow, for consistency. I feel an integration test is the better way to test this scenario, but neither sqlx nor MySQL exposes a way to set up the required conditions. Specifically, statements prepared via the binary protocol are not named, and thus cannot be deallocated via DEALLOCATE PREPARED.

I would like to invite feedback & advice on:

  • the fix in the first place, I think it's a valid fix, albeit for a niche situation
  • a better way to test this
  • assuming the above are sorted, specifics of the fix

If this or a similar approach gets accepted, the same change might also be needed for Postgres, which I suspect exhibits the same issue.

Is this a breaking change?

Assuming all the issues get resolved, no, I don't think it would be breaking.

The worst-case outcome I can think of is a false-positive match on the error, which would needlessly discard the statement cache and force re-preparation. It might be nice to match more closely on the specific error, but given the state of this change, this is a functional PoC.

sulami added 2 commits on August 24, 2025 14:00:
Some cloud offerings like AWS Aurora allow for "zero-downtime" restarts
for patches, which preserves existing TCP connections but wipes out a
lot of server state, including the statement cache. In that scenario,
trying to execute a previously prepared statement causes an error
response with

```
HY000 Unknown prepared statement handler (<id>) given to mysql_stmt_precheck
```

which appears to be the only way to detect this scenario.
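This detection can be sketched as a small predicate over the error's SQLSTATE and message. The function name and matching strategy here are illustrative, not sqlx's actual code:

```rust
/// Heuristic check for a stale server-side prepared statement.
/// Illustrative only: the real fix matches inside sqlx's error
/// handling, and matching a dedicated error code would be
/// preferable to substring matching.
fn is_stale_statement_error(sqlstate: &str, message: &str) -> bool {
    // MySQL reports SQLSTATE HY000 (general error) with a message
    // naming the unknown statement handler.
    sqlstate == "HY000" && message.contains("Unknown prepared statement handler")
}

fn main() {
    let msg = "Unknown prepared statement handler (3) given to mysql_stmt_precheck";
    assert!(is_stale_statement_error("HY000", msg));
    // Unrelated errors, e.g. constraint violations, must not match.
    assert!(!is_stale_statement_error("23000", "Duplicate entry '1' for key 'PRIMARY'"));
}
```

If a numeric code is preferred over the message text, MySQL raises this error as ER_UNKNOWN_STMT_HANDLER (error 1243).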

To avoid subsequent errors on the same connection, we can clear the
client-side statement cache, causing all queries to be re-prepared.
This is basically what ActiveRecord does,[0] and we can confirm
ActiveRecord is not vulnerable to the same issue. An even better
solution would be to re-prepare and retry the query on the spot, as
ActiveRecord does.

[0]: https://github.com/rails/rails/blob/main/activerecord/lib/active_record/connection_adapters/mysql2/database_statements.rb#L66-L78
I think it's best to test this issue through integration tests, but
there seems to be no way to clear the server-side prepared statements
without access to the raw packet stream, as they're not named and thus
cannot be deleted via DEALLOCATE PREPARED.

To make the test work, additional interfaces have to be made available,
which should not be shipped as part of the library for regular use.
sulami force-pushed the sulami/push-pmrnxvsoyqty branch from 5f8f07a to f1fd80a on August 24, 2025 at 23:06
sulami changed the title from "fix(sqlx-mysql): RFC Handle missing server-side prepared statements" to "fix(mysql): RFC Handle missing server-side prepared statements" on Aug 24, 2025
abonander (Collaborator) commented:

I'm not hostile to this change, but I would consider the wiping of prepared statements to be a huge bug on Aurora's part. We've run into issues with "compatible" providers not handling prepared statements correctly before, e.g. PgBouncer, and in many cases they've actually agreed it was a bug/misfeature and fixed it.

We can merge a fix for this, but I'd like to see at least a little pressure to get this fixed upstream.

sulami (Contributor, Author) commented Oct 5, 2025

I agree, it's very surprising behaviour, and there seems to be no way of opting out whatsoever if you're using Aurora. I'm happy to open a support ticket with AWS on our organisation's behalf/complain to our account manager, but I'm afraid we're not significant enough to be in a position to actually exert any meaningful pressure on AWS.

Fwiw, in the meantime we're using a cheap query in before_acquire which detects the same issue by virtue of being prepared itself, so when the zero-downtime patch occurs that query fails and incrementally wipes the connection pool. The nice part about this is that it actually rescues the query without any need for a retry, but at the cost of an extra round-trip for every acquire.
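The pool-level workaround can be illustrated without sqlx (in sqlx itself this would hang off `PoolOptions::before_acquire`): the pool validates each connection with a cheap prepared query before handing it out, and discards poisoned ones. A minimal sketch with hypothetical types:

```rust
/// Hypothetical connection whose server-side prepared statements may
/// have been wiped by a zero-downtime restart.
struct Conn {
    poisoned: bool,
}

impl Conn {
    /// Stand-in for executing a cheap prepared query like `SELECT 1`.
    /// Fails if the server-side statement has vanished.
    fn ping_prepared(&self) -> Result<(), &'static str> {
        if self.poisoned {
            Err("Unknown prepared statement handler")
        } else {
            Ok(())
        }
    }
}

/// Acquire a connection, validating it first; connections that fail
/// validation are dropped, incrementally flushing the pool after a
/// restart, at the cost of one extra round-trip per acquire.
fn acquire(pool: &mut Vec<Conn>) -> Option<Conn> {
    while let Some(conn) = pool.pop() {
        if conn.ping_prepared().is_ok() {
            return Some(conn);
        }
        // Validation failed: drop the connection instead of handing it out.
    }
    None
}

fn main() {
    let mut pool = vec![Conn { poisoned: false }, Conn { poisoned: true }];
    // The poisoned connection is skipped and discarded.
    let conn = acquire(&mut pool).expect("one healthy connection");
    assert!(!conn.poisoned);
    assert!(pool.is_empty());
}
```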

abonander (Collaborator) commented:

> I'm happy to open a support ticket with AWS on our organisation's behalf/complain to our account manager, but I'm afraid we're not significant enough to be in a position to actually exert any meaningful pressure on AWS.

That's fine, it generally takes many people complaining over time. As long as they're made aware of it. It's just become my pet peeve that all these third-party implementations advertise themselves as "compatible" with a database's wire protocol but then break spectacularly when you try to use any feature that isn't part of the basic query flow.

```rust
        .intersects(Status::SERVER_STATUS_IN_TRANS)
}

pub async fn nuke_cached_statements(&mut self) -> Result<(), Error> {
```
Collaborator commented:
Of course, this shouldn't just be a public method on the connection. Users might confuse it for the method that they're supposed to use to clear the prepared statement cache.

The easiest route would be to just mark this #[doc(hidden)].
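For reference, hiding a public method from generated docs looks like this; the type and body are stand-ins, not the actual sqlx connection:

```rust
/// Stand-in for the real connection type.
struct MySqlConnection;

impl MySqlConnection {
    /// Test-only escape hatch: `#[doc(hidden)]` keeps the method out
    /// of the rustdoc output so users don't mistake it for the
    /// supported cache-clearing API. (Sketch; the real method is
    /// async and actually deallocates server-side statements.)
    #[doc(hidden)]
    pub fn nuke_cached_statements(&mut self) -> Result<(), String> {
        Ok(())
    }
}

fn main() {
    let mut conn = MySqlConnection;
    // The method remains callable; it is only hidden from the docs.
    assert!(conn.nuke_cached_statements().is_ok());
}
```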

```diff
 // query response is a meta-packet which may be one of:
 // Ok, Err, ResultSet, or (unhandled) LocalInfileRequest
-let mut packet = self.inner.stream.recv_packet().await?;
+let mut packet = self.inner.stream.recv_packet().await.inspect_err(|_| {
```
Collaborator commented:
Yeah, if you could find the exact error code that's returned, that would probably be better. This is a huge pessimization.

Keep in mind we also get errors here for "normal" errors like constraint violations.

This would leak any valid prepared statements on the server side.

```diff
 // Ok, Err, ResultSet, or (unhandled) LocalInfileRequest
-let mut packet = self.inner.stream.recv_packet().await?;
+let mut packet = self.inner.stream.recv_packet().await.inspect_err(|_| {
+    // if a prepared statement vanished on the server side, we get an error here
```
Collaborator commented:
It would be good to link to the specific issue for context, if you created one (I didn't look).

Collaborator commented:
This PR is also fine.
