
Resolves #1076: Incorrect continuation when scanning by EndpointType.PREFIX_STRING #3523


Open
hoangmaihuy wants to merge 1 commit into main

Conversation

hoangmaihuy commented

KeyValueCursor includes the null terminator \x00 in the computed prefix when both endpoints are PREFIX_STRING, which causes the continuation to be missing one byte from the last key.

In the prefixString unit test, I populated 5 records ("apple", 0) through ("apple", 4) and scanned with the "a" prefix. The continuation after scanning the first two records is ple\x00... instead of pple\x00, and the calculated continuationBytes in KeyValueCursorBase is ...a\x00ple\x00... instead of ...apple\x00....
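For concreteness, here is a minimal sketch (not part of the PR) of the byte-level arithmetic behind that off-by-one, using the FDB Java Tuple API; the class name and printout are illustrative only:

    import com.apple.foundationdb.tuple.ByteArrayUtil;
    import com.apple.foundationdb.tuple.Tuple;

    public class PrefixLengthSketch {
        public static void main(String[] args) {
            // A string tuple element packs as \x02 + UTF-8 bytes + \x00 terminator.
            byte[] key = Tuple.from("apple", 0).pack();  // \x02apple\x00\x14
            byte[] prefix = Tuple.from("a").pack();      // \x02a\x00

            // For PREFIX_STRING endpoints, the trailing \x00 must not count toward
            // the shared prefix: the correct prefix is \x02a (2 bytes), not
            // \x02a\x00 (3 bytes).
            int wrongPrefixLength = prefix.length;       // 3 -> remainder "ple\x00..."
            int rightPrefixLength = prefix.length - 1;   // 2 -> remainder "pple\x00..."

            System.out.println(ByteArrayUtil.printable(
                    java.util.Arrays.copyOfRange(key, wrongPrefixLength, key.length)));
            System.out.println(ByteArrayUtil.printable(
                    java.util.Arrays.copyOfRange(key, rightPrefixLength, key.length)));
        }
    }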

hoangmaihuy changed the title from "Resolves FoundationDB#1076: Incorrect continuation when scanning by EndpointType.PREFIX_STRING" to "Resolves #1076: Incorrect continuation when scanning by EndpointType.PREFIX_STRING" Aug 2, 2025
ngbinh commented Aug 5, 2025

@alecgrieser can you take a look? Thanks

alecgrieser (Collaborator) left a comment

Thanks for taking a look at this! Definitely would be good to get this issue resolved.

Comment on lines +204 to +209
fdb.database().run(tr -> {
    for (int i = 0; i < 5; i++) {
        tr.set(subspace.pack(Tuple.from("apple", i)), Tuple.from("apple", i).pack());
    }
    return null;
});

This loop should probably be brought outside the run lambda. At present, this will run the block in a nested loop, which doesn't really mean anything semantically. Alternatively, we could run the body within the outer loop, which would set the data within the transaction, which would then be read by the scans. (FDB transactions can read their own writes.)
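For illustration, a sketch of both options; the outer loop mentioned above isn't visible in the excerpt, so the surrounding structure here is assumed:

    // Option 1: populate once, before the (assumed) outer loop that exercises
    // the scans, in a single committed transaction.
    fdb.database().run(tr -> {
        for (int i = 0; i < 5; i++) {
            tr.set(subspace.pack(Tuple.from("apple", i)), Tuple.from("apple", i).pack());
        }
        return null;
    });

    // Option 2: set the data inside the outer loop's transaction context `tr`.
    // FDB transactions can read their own writes, so scans issued later in the
    // same transaction will observe these keys without an intervening commit.
    for (int i = 0; i < 5; i++) {
        tr.set(subspace.pack(Tuple.from("apple", i)), Tuple.from("apple", i).pack());
    }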

@@ -246,6 +246,11 @@ protected void prepare() {
        // left (inclusive) and another on the right (exclusive).
        prefixLength = calculatePrefixLength();

        // When both endpoints are prefix strings, we should strip off the last byte \x00 from Tuple
        if (lowEndpoint == EndpointType.PREFIX_STRING && highEndpoint == EndpointType.PREFIX_STRING) {
            prefixLength--;
        }

Thanks! This looks right to me, in principle. We could probably use a bit more testing to make sure it handles some edge cases properly. I have a few suggestions on additional tests in the other file.

There's a second concern, and it has to do with cross-version compatibility. (This is, for what it's worth, what ended up being the problem with the previous attempt to fix this, #1106.) We expect the user to be able to take the continuation generated from one run and hand it to another instance of the Record Layer, which resumes from where the first left off. We expect this to work even if the software versions of the two Record Layer instances are different. (This is because we expect people to deploy fleets of different instances of the Record Layer that all access the same data, and then load balance requests across them. When they then perform a rolling upgrade of the fleet, they might pick up a continuation generated from an older instance and hand it to a newer one, or vice versa. Once the upgrade is completed, this shouldn't be an issue.)

The concern is that we're implicitly changing the format here. In particular, if you had such a scan over a PREFIX_STRING range, then when the continuation is transferred between versions, the resuming instance will incorrectly interpret the prefix length.

The counterargument to caring about this (in this case) is that the prefix was already being handled incorrectly, so it was already an error to try to continue a prefix string scan. If the worst that can happen is that we now fail in a different way (or even the same way), maybe that's fine.

We do have tests that run multiple versions of the Record Layer and validate the behavior across versions, including handling continuations. They use the YAML tests framework, which is set up to run a series of SQL queries against a schema template definition and validate the expected results. It also runs each query in a mode where it stops after each row and resumes the query with a continuation, and in multi-server mode, it alternates between different server versions. I've added some tests of almost that behavior in this PR: #3536. The problem is that the SQL planner isn't currently smart enough to translate expressions like abc% into [PREFIX_STRING 'abc'] scan comparisons, so it doesn't actually exercise this codepath.

We have a few options here:

  1. Convince ourselves this is fine without a test. I'm not thrilled with this option, but I could almost buy that, because the continuations were already broken, we don't have any problem here. It's notable that the continuations themselves haven't been modified; it's just the way the continuations get interpreted that has changed. Maybe that's enough to convince us that this is fine?
  2. Update SQL so that we can turn those scans into PREFIX_STRING ranges. It would be nice to have that anyway, as it would make the queries more efficient. It would be a more involved change, though, as the SQL stuff has a sharper learning curve, and we'd also have to worry about how this interacted with the "plan cache". (In order to avoid costly re-planning, we maintain a cache of the plan associated with a given query. We also strip literals from the plan before caching it so that we can re-use plans for queries that are the same but with different constants. However, you wouldn't want to reuse a plan for SELECT * FROM T WHERE s LIKE 'abc%', which can use a PREFIX_STRING, with one for SELECT * FROM T WHERE s LIKE 'abc%d', which cannot without additional compensation. So whatever solution we come up with would need to take that into account.)
  3. Do mixed-mode testing in some other way. We currently don't have any other way of doing mixed-mode testing, but we could theoretically come up with something. That's probably a larger discussion, though.

Comment on lines +211 to +214
        Tuple.from("a"),
        Tuple.from("a"),
        EndpointType.PREFIX_STRING,
        EndpointType.PREFIX_STRING

It would probably be good to have some more tests. I can think of a few:

  1. An empty string as the prefix string
  2. A scan where there are substrings involved (e.g., both app and apple are in the list, and we search for a range that covers one or both of them)
  3. A special case of the above where there are null characters involved. For example, app and app\0 and app\0le would be an interesting case because of the way strings are encoded: we insert a final \0 byte at the end, so the exact encoding for the first is \x02app\x00, the second is \x02app\x00\xff\x00, and the third is \x02app\x00\xffle\x00. (See the sketch after this list.)
  4. Scans in reverse, validating that the scans still stitch up
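
As a quick, illustrative check of those encodings with the Tuple API (not part of the PR; the comments restate the byte sequences from item 3):

    import com.apple.foundationdb.tuple.ByteArrayUtil;
    import com.apple.foundationdb.tuple.Tuple;

    public class NullByteEncodings {
        public static void main(String[] args) {
            // Embedded \x00 bytes in a string element are escaped as \x00\xff,
            // and every string element ends with a final \x00 terminator.
            System.out.println(ByteArrayUtil.printable(Tuple.from("app").pack()));          // \x02app\x00
            System.out.println(ByteArrayUtil.printable(Tuple.from("app\u0000").pack()));    // \x02app\x00\xff\x00
            System.out.println(ByteArrayUtil.printable(Tuple.from("app\u0000le").pack())); // \x02app\x00\xffle\x00
        }
    }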

You might need to restructure this to be a parameterized test that takes a prefix string and an expected set of results, and maybe page size and reverse-ness, and then it executes the scan across continuations and validates the results.
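
One possible shape for that parameterized test, as a sketch: scanAcrossContinuations is a hypothetical helper (not an existing method), and the test cases are illustrative.

    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.List;
    import java.util.stream.Stream;

    import org.junit.jupiter.params.ParameterizedTest;
    import org.junit.jupiter.params.provider.Arguments;
    import org.junit.jupiter.params.provider.MethodSource;

    import static org.junit.jupiter.api.Assertions.assertEquals;

    class PrefixStringScanTest {
        static Stream<Arguments> prefixScanCases() {
            // prefix, expected keys in forward byte order, page size, reverse?
            return Stream.of(
                    Arguments.of("app", List.of("app", "app\u0000", "app\u0000le", "apple"), 2, false),
                    Arguments.of("app", List.of("app", "app\u0000", "app\u0000le", "apple"), 1, true),
                    Arguments.of("", List.of("app", "app\u0000", "app\u0000le", "apple"), 3, false));
        }

        @ParameterizedTest
        @MethodSource("prefixScanCases")
        void prefixStringScanStitchesAcrossContinuations(String prefix, List<String> expected,
                                                         int pageSize, boolean reverse) {
            List<String> actual = scanAcrossContinuations(prefix, pageSize, reverse);
            List<String> want = new ArrayList<>(expected);
            if (reverse) {
                Collections.reverse(want);
            }
            assertEquals(want, actual);
        }

        // Hypothetical helper: runs a PREFIX_STRING scan with the given page size,
        // resuming from each continuation until exhaustion, and collects the
        // string keys seen (in scan order).
        private List<String> scanAcrossContinuations(String prefix, int pageSize, boolean reverse) {
            throw new UnsupportedOperationException("sketch only");
        }
    }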

alecgrieser added the bug fix label Aug 8, 2025
alecgrieser added a commit that referenced this pull request Aug 11, 2025
This adds an additional set of tests to `like.yamsql` that are designed
to handle prefix scans where the prefix can be satisfied by an index.
Our logic currently doesn't handle transforming `LIKE '${str}%'` into a
tuple scan with a range like `[PREFIX_STRING ${str}]`, which means that
we don't use the index as efficiently as we could. This is shown in the
`explain`s here, where we only use the index in order to get a set of
covering fields, but we don't add it to a scan range.

Note that this is relevant to #3523 (addressing #1076), as if we did
turn those kinds of `LIKE` expressions into prefix string scans, we'd
get mixed-mode coverage of the fix for the bug being addressed there.
AFAIK, there isn't any other SQL comparison that we have that could
be used to generate prefix string scans.