-
Notifications
You must be signed in to change notification settings - Fork 11
[Postgres] Optimise IN and NOT_IN Queries for Primitive and ARRAY Fields #251
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Postgres] Optimise IN and NOT_IN Queries for Primitive and ARRAY Fields #251
Conversation
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## main #251 +/- ##
============================================
+ Coverage 80.49% 80.50% +0.01%
- Complexity 1162 1210 +48
============================================
Files 217 224 +7
Lines 5551 5626 +75
Branches 490 487 -3
============================================
+ Hits 4468 4529 +61
Misses 753 753
- Partials 330 344 +14
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
|
|
||
| // Extract array type if available | ||
| String arrayTypeCast = null; | ||
| if (expression.getLhs() instanceof ArrayIdentifierExpression) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should we try getting rid of instanceof here. Should we use visitor?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, will refactor.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@puneet-traceable Refactored to use visitors
| .collect(Collectors.joining(", ")); | ||
|
|
||
| // Use array overlap operator for array fields | ||
| if (arrayTypeCast != null) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: arrayType is probably better
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed.
|
LGTM |
|
|
||
| @Override | ||
| public String visit(JsonIdentifierExpression expression) { | ||
| return null; // JSON fields don't have array type |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we be throwing exceptions from here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed, done.
Description
This PR optimises
INandNOT_INqueries both for primitives/array fields in PG.Current State and Scope of Optimisation
Currently Generated SQL Queries
Primitive Fields
_id)SELECT COUNT(*) FROM (SELECT * FROM "myTestFlat" WHERE ARRAY["_id"]::text[] && ARRAY[('1'::int4), ('3'::int4), ('5'::int4)]::text[]) p(countWithParser)item)SELECT COUNT(*) FROM (SELECT * FROM "myTestFlat" WHERE ARRAY["item"]::text[] && ARRAY[('Soap'), ('Shampoo')]::text[]) p(countWithParser)price)SELECT COUNT(*) FROM (SELECT * FROM "myTestFlat" WHERE ARRAY["price"]::text[] && ARRAY[('5'::int4), ('10'::int4)]::text[]) p(countWithParser)_id)SELECT COUNT(*) FROM (SELECT * FROM "myTestFlat" WHERE "_id" IS NULL OR NOT (ARRAY["_id"]::text[] && ARRAY[('1'::int4), ('3'::int4), ('5'::int4)]::text[])) p(countWithParser)item)SELECT COUNT(*) FROM (SELECT * FROM "myTestFlat" WHERE "item" IS NULL OR NOT (ARRAY["item"]::text[] && ARRAY[('Soap')]::text[])) p(countWithParser)Array Fields
tags)SELECT COUNT(*) FROM (SELECT * FROM "myTestFlat" WHERE ARRAY["tags"]::text[] && ARRAY[('hygiene'), ('grooming')]::text[]) p(countWithParser)numbers)SELECT COUNT(*) FROM (SELECT * FROM "myTestFlat" WHERE ARRAY["numbers"]::text[] && ARRAY[('1'::int4), ('10'::int4)]::text[]) p(countWithParser)tags)SELECT COUNT(*) FROM (SELECT * FROM "myTestFlat" WHERE "tags" IS NULL OR NOT (ARRAY["tags"]::text[] && ARRAY[('hygiene')]::text[])) p(countWithParser)Observations
For primitives, it casts to
::text[]arrays and then uses the overlap operator to evaluate the predicate. This is very efficient because PG cannot use indexes on the LHS col anymore + casting overhead. Instead, we should start generatingINqueries for primitives.For arrays, it casts both LHS and RHS to
::text[]. This again is efficient because PG cannot use index on the casted LHS col.New Queries (after this change)
This PR has the following changes to optimise the queries above:
INoperator with no casting on the LHS col.2.1: Users continue using
IdentifierExpressionfor array columns - This is not supported anymore and will break any existing queries. Users must start usingArrayIdentifierExpressionfor arrays. This is safe because flat collections are not being used by any customers right now.2.2: Users start using
ArrayIdentifierExpressionwithout theArrayType(sodocument-storeknow that this is an array col but cannot tell the type of its objects).2.3: Users start using the
ArrayIdentifierExpressionwith the correspondingArrayType- This is the most optimal case.Primitive Fields (Using Scalar Parser - Optimized)
_id)SELECT COUNT(*) FROM (SELECT * FROM "myTestFlat" WHERE "_id" IN (('1'::int4), ('3'::int4), ('5'::int4))) p(countWithParser)item)SELECT COUNT(*) FROM (SELECT * FROM "myTestFlat" WHERE "item" IN (('Soap'), ('Shampoo'))) p(countWithParser)price)SELECT COUNT(*) FROM (SELECT * FROM "myTestFlat" WHERE "price" IN (('5'::int4), ('10'::int4))) p(countWithParser)_id)SELECT COUNT(*) FROM (SELECT * FROM "myTestFlat" WHERE "_id" IS NULL OR NOT ("_id" IN (('1'::int4), ('3'::int4), ('5'::int4)))) p(countWithParser)item)SELECT COUNT(*) FROM (SELECT * FROM "myTestFlat" WHERE "item" IS NULL OR NOT ("item" IN (('Soap')))) p(countWithParser)Observation: We keep using the older logic of casting both LHS and RHS to
::text[], resulting in the current poor perf.Array Fields (Using [ArrayIdentifierExpression] without [ArrayType])
tags)SELECT * FROM "myTestFlat" WHERE "tags"::text[] && ARRAY[?, ?]::text[]numbers)SELECT * FROM "myTestFlat" WHERE "numbers"::text[] && ARRAY[?, ?]::text[]tags)SELECT * FROM "myTestFlat" WHERE "tags" IS NULL OR NOT ("tags"::text[] && ARRAY[?]::text[])Observation: We keep using the older logic of casting both LHS and RHS to
::text[], resulting in the current poor perf.Array Fields (Using [ArrayIdentifierExpression] with [ArrayType])
tags)SELECT * FROM "myTestFlat" WHERE "tags" && ARRAY[?, ?]::text[]numbers)SELECT * FROM "myTestFlat" WHERE "numbers" && ARRAY[?, ?]tags)SELECT * FROM "myTestFlat" WHERE "tags" IS NULL OR NOT ("tags" && ARRAY[?]::text[])Observation: With the type info in hand, we cast only RHS for
text[]arrays. For arrays of primitive types, we don't cast at all, resulting in the best performance. Note that even with"tags" && ARRAY[?, ?]::text[], PG would be able to use indices for this query. This casting is required because o/w, JDBC binds these params as character varying[] which results in a casting error.Testing
Checklist: