@suddendust suddendust commented Nov 20, 2025

Description

This PR optimises IN and NOT_IN queries for both primitive and array fields in Postgres (PG).

Current State and Scope of Optimisation

Currently Generated SQL Queries

Primitive Fields

| Operation | Field Type | Generated SQL |
|---|---|---|
| IN | INT (`_id`) | `SELECT COUNT(*) FROM (SELECT * FROM "myTestFlat" WHERE ARRAY["_id"]::text[] && ARRAY[('1'::int4), ('3'::int4), ('5'::int4)]::text[]) p(countWithParser)` |
| IN | STRING (`item`) | `SELECT COUNT(*) FROM (SELECT * FROM "myTestFlat" WHERE ARRAY["item"]::text[] && ARRAY[('Soap'), ('Shampoo')]::text[]) p(countWithParser)` |
| IN | NUMERIC (`price`) | `SELECT COUNT(*) FROM (SELECT * FROM "myTestFlat" WHERE ARRAY["price"]::text[] && ARRAY[('5'::int4), ('10'::int4)]::text[]) p(countWithParser)` |
| NOT_IN | INT (`_id`) | `SELECT COUNT(*) FROM (SELECT * FROM "myTestFlat" WHERE "_id" IS NULL OR NOT (ARRAY["_id"]::text[] && ARRAY[('1'::int4), ('3'::int4), ('5'::int4)]::text[])) p(countWithParser)` |
| NOT_IN | STRING (`item`) | `SELECT COUNT(*) FROM (SELECT * FROM "myTestFlat" WHERE "item" IS NULL OR NOT (ARRAY["item"]::text[] && ARRAY[('Soap')]::text[])) p(countWithParser)` |

Array Fields

| Operation | Field Type | Generated SQL |
|---|---|---|
| IN | TEXT[] (`tags`) | `SELECT COUNT(*) FROM (SELECT * FROM "myTestFlat" WHERE ARRAY["tags"]::text[] && ARRAY[('hygiene'), ('grooming')]::text[]) p(countWithParser)` |
| IN | INTEGER[] (`numbers`) | `SELECT COUNT(*) FROM (SELECT * FROM "myTestFlat" WHERE ARRAY["numbers"]::text[] && ARRAY[('1'::int4), ('10'::int4)]::text[]) p(countWithParser)` |
| NOT_IN | TEXT[] (`tags`) | `SELECT COUNT(*) FROM (SELECT * FROM "myTestFlat" WHERE "tags" IS NULL OR NOT (ARRAY["tags"]::text[] && ARRAY[('hygiene')]::text[])) p(countWithParser)` |

Observations

  1. For primitives, it casts both sides to ::text[] arrays and then uses the overlap operator (&&) to evaluate the predicate. This is very inefficient: PG can no longer use indexes on the LHS col, and the casts add overhead. Instead, we should generate IN queries for primitives.

  2. For arrays, it casts both LHS and RHS to ::text[]. This is again inefficient because PG cannot use an index on the cast LHS col.

New Queries (after this change)

This PR has the following changes to optimise the queries above:

  1. For primitives, it uses the IN operator with no casting on the LHS col.
  2. For arrays, we have three cases:
    2.1: Users continue using IdentifierExpression for array columns - This is not supported anymore and will break any existing queries. Users must start using ArrayIdentifierExpression for arrays. This is safe because flat collections are not being used by any customers right now.
    2.2: Users start using ArrayIdentifierExpression without the ArrayType (so document-store knows that this is an array col but cannot tell the type of its elements).
    2.3: Users start using the ArrayIdentifierExpression with the corresponding ArrayType - This is the most optimal case.

Primitive Fields (Using Scalar Parser - Optimized)

| Operation | Field Type | Generated SQL |
|---|---|---|
| IN | INT (`_id`) | `SELECT COUNT(*) FROM (SELECT * FROM "myTestFlat" WHERE "_id" IN (('1'::int4), ('3'::int4), ('5'::int4))) p(countWithParser)` |
| IN | STRING (`item`) | `SELECT COUNT(*) FROM (SELECT * FROM "myTestFlat" WHERE "item" IN (('Soap'), ('Shampoo'))) p(countWithParser)` |
| IN | NUMERIC (`price`) | `SELECT COUNT(*) FROM (SELECT * FROM "myTestFlat" WHERE "price" IN (('5'::int4), ('10'::int4))) p(countWithParser)` |
| NOT_IN | INT (`_id`) | `SELECT COUNT(*) FROM (SELECT * FROM "myTestFlat" WHERE "_id" IS NULL OR NOT ("_id" IN (('1'::int4), ('3'::int4), ('5'::int4)))) p(countWithParser)` |
| NOT_IN | STRING (`item`) | `SELECT COUNT(*) FROM (SELECT * FROM "myTestFlat" WHERE "item" IS NULL OR NOT ("item" IN (('Soap')))) p(countWithParser)` |

Observation: The LHS column is no longer cast and the predicate is a plain IN, so PG can use an index on the column.
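The primitive-field rewrite can be sketched in Java roughly as follows. This is an illustrative sketch under stated assumptions, not the code from this PR: the class `PrimitiveInSqlSketch` and its methods are hypothetical names, and it emits `?` placeholders instead of the inlined literals shown in the table.

```java
import java.util.Collections;

// Hypothetical helper, not part of document-store.
final class PrimitiveInSqlSketch {
  private PrimitiveInSqlSketch() {}

  // Quote a column name as a PostgreSQL identifier, doubling embedded quotes.
  static String quote(String column) {
    return "\"" + column.replace("\"", "\"\"") + "\"";
  }

  // Builds: "col" IN (?, ?, ...). The column itself is left uncast.
  static String in(String column, int valueCount) {
    if (valueCount < 1) {
      throw new IllegalArgumentException("IN needs at least one value");
    }
    String placeholders = String.join(", ", Collections.nCopies(valueCount, "?"));
    return quote(column) + " IN (" + placeholders + ")";
  }

  // NOT_IN keeps the explicit NULL check seen in the generated SQL:
  // without it, rows whose column is NULL would not match the predicate.
  static String notIn(String column, int valueCount) {
    return quote(column) + " IS NULL OR NOT (" + in(column, valueCount) + ")";
  }

  public static void main(String[] args) {
    System.out.println(in("_id", 3));
    System.out.println(notIn("item", 1));
  }
}
```

The point of the design is that the column appears bare on the left-hand side, which is what allows the planner to consider an index on it.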

Array Fields (Using [ArrayIdentifierExpression] without [ArrayType])

| Operation | Field Type | Generated SQL |
|---|---|---|
| IN | TEXT[] (`tags`) | `SELECT * FROM "myTestFlat" WHERE "tags"::text[] && ARRAY[?, ?]::text[]` |
| IN | INTEGER[] (`numbers`) | `SELECT * FROM "myTestFlat" WHERE "numbers"::text[] && ARRAY[?, ?]::text[]` |
| NOT_IN | TEXT[] (`tags`) | `SELECT * FROM "myTestFlat" WHERE "tags" IS NULL OR NOT ("tags"::text[] && ARRAY[?]::text[])` |

Observation: We keep using the older logic of casting both LHS and RHS to ::text[], resulting in the current poor perf.

Array Fields (Using [ArrayIdentifierExpression] with [ArrayType])

| Operation | Field Type | Generated SQL |
|---|---|---|
| IN | TEXT[] (`tags`) | `SELECT * FROM "myTestFlat" WHERE "tags" && ARRAY[?, ?]::text[]` |
| IN | INTEGER[] (`numbers`) | `SELECT * FROM "myTestFlat" WHERE "numbers" && ARRAY[?, ?]` |
| NOT_IN | TEXT[] (`tags`) | `SELECT * FROM "myTestFlat" WHERE "tags" IS NULL OR NOT ("tags" && ARRAY[?]::text[])` |

Observation: With the type info in hand, we cast only the RHS for text[] arrays. For arrays of primitive types, we don't cast at all, which gives the best performance. Note that even with "tags" && ARRAY[?, ?]::text[], PG can still use indexes for this query, because the LHS column is left uncast. The RHS cast is required because otherwise JDBC binds these params as character varying[], which results in a casting error.
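The two array cases (with and without a known element type) can be sketched in Java as below. This is a minimal sketch under assumptions, not the PR's implementation: `ArrayOverlapSqlSketch` and its `ElementType` enum are hypothetical stand-ins for whatever type information `ArrayIdentifierExpression` carries.

```java
import java.util.Collections;
import java.util.Optional;

// Hypothetical helper, not part of document-store.
final class ArrayOverlapSqlSketch {
  private ArrayOverlapSqlSketch() {}

  // Hypothetical stand-in for the element type of an array column.
  enum ElementType { TEXT, INTEGER }

  static String overlap(String column, Optional<ElementType> type, int valueCount) {
    String quoted = "\"" + column.replace("\"", "\"\"") + "\"";
    String array =
        "ARRAY[" + String.join(", ", Collections.nCopies(valueCount, "?")) + "]";
    if (!type.isPresent()) {
      // Element type unknown: fall back to casting both sides to text[].
      return quoted + "::text[] && " + array + "::text[]";
    }
    // Element type known: the column stays uncast; only a text[] column
    // needs a cast on the parameter array.
    return type.get() == ElementType.TEXT
        ? quoted + " && " + array + "::text[]"
        : quoted + " && " + array;
  }

  // NOT_IN wraps the overlap test and keeps the explicit NULL check.
  static String notOverlap(String column, Optional<ElementType> type, int valueCount) {
    String quoted = "\"" + column.replace("\"", "\"\"") + "\"";
    return quoted + " IS NULL OR NOT (" + overlap(column, type, valueCount) + ")";
  }

  public static void main(String[] args) {
    System.out.println(overlap("tags", Optional.empty(), 2));
    System.out.println(overlap("tags", Optional.of(ElementType.TEXT), 2));
    System.out.println(overlap("numbers", Optional.of(ElementType.INTEGER), 2));
  }
}
```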

Testing

  1. Added integration tests for all 3 cases.
  2. Tested them in a live environment.

Checklist:

  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • Any dependent changes have been merged and published in downstream modules

codecov bot commented Nov 20, 2025

Codecov Report

❌ Patch coverage is 83.69565% with 15 lines in your changes missing coverage. Please review.
✅ Project coverage is 80.50%. Comparing base (61e844b) to head (468bd36).
⚠️ Report is 1 commits behind head on main.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| .../parser/filter/PostgresContainsParserSelector.java | 61.53% | 0 Missing and 5 partials ⚠️ |
| ...ery/v1/parser/filter/PostgresInParserSelector.java | 64.28% | 0 Missing and 5 partials ⚠️ |
| .../v1/parser/filter/PostgresNotInParserSelector.java | 64.28% | 0 Missing and 5 partials ⚠️ |

Additional details and impacted files
@@             Coverage Diff              @@
##               main     #251      +/-   ##
============================================
+ Coverage     80.49%   80.50%   +0.01%     
- Complexity     1162     1210      +48     
============================================
  Files           217      224       +7     
  Lines          5551     5626      +75     
  Branches        490      487       -3     
============================================
+ Hits           4468     4529      +61     
  Misses          753      753              
- Partials        330      344      +14     

| Flag | Coverage Δ |
|---|---|
| integration | 80.50% <83.69%> (+0.01%) ⬆️ |
| unit | 57.83% <61.95%> (+0.15%) ⬆️ |


@suddendust suddendust changed the title from "[Draft] [Postgres] Optimise IN and NOT_IN Queries for Primitive and ARRAY Fields" to "[Postgres] Optimise IN and NOT_IN Queries for Primitive and ARRAY Fields" Nov 20, 2025

// Extract array type if available
String arrayTypeCast = null;
if (expression.getLhs() instanceof ArrayIdentifierExpression) {

Should we try getting rid of instanceof here? Should we use a visitor?

Contributor Author

Yes, will refactor.

Contributor Author

@puneet-traceable Refactored to use visitors
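For readers unfamiliar with the pattern, replacing an `instanceof` check with a visitor might look roughly like the sketch below. All types here (`Lhs`, `LhsVisitor`, `PlainIdentifier`, `ArrayIdentifier`, `ArrayTypeExtractor`) are hypothetical stand-ins invented for illustration; the actual document-store expression and visitor interfaces are not shown in this conversation and will differ.

```java
import java.util.Optional;

// Hypothetical illustration of double dispatch; not the PR's code.
final class ArrayTypeVisitorSketch {
  private ArrayTypeVisitorSketch() {}

  // One visit method per expression kind replaces the instanceof chain.
  interface LhsVisitor<T> {
    T visit(PlainIdentifier expression);
    T visit(ArrayIdentifier expression);
  }

  interface Lhs {
    <T> T accept(LhsVisitor<T> visitor);
  }

  static final class PlainIdentifier implements Lhs {
    final String name;
    PlainIdentifier(String name) { this.name = name; }
    @Override public <T> T accept(LhsVisitor<T> visitor) { return visitor.visit(this); }
  }

  static final class ArrayIdentifier implements Lhs {
    final String name;
    final Optional<String> elementType;
    ArrayIdentifier(String name, Optional<String> elementType) {
      this.name = name;
      this.elementType = elementType;
    }
    @Override public <T> T accept(LhsVisitor<T> visitor) { return visitor.visit(this); }
  }

  // Extracts the array element type, if the expression carries one.
  static final class ArrayTypeExtractor implements LhsVisitor<Optional<String>> {
    @Override public Optional<String> visit(PlainIdentifier expression) {
      return Optional.empty(); // scalar columns have no array type
    }
    @Override public Optional<String> visit(ArrayIdentifier expression) {
      return expression.elementType;
    }
  }

  static Optional<String> arrayTypeOf(Lhs lhs) {
    return lhs.accept(new ArrayTypeExtractor());
  }
}
```

With this shape, adding a new expression kind forces every visitor to handle it at compile time, which an `instanceof` chain does not.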

.collect(Collectors.joining(", "));

// Use array overlap operator for array fields
if (arrayTypeCast != null) {

nit: arrayType is probably better

Contributor Author

Fixed.

@puneet-traceable

LGTM

suresh-prakash previously approved these changes Nov 21, 2025

@Override
public String visit(JsonIdentifierExpression expression) {
return null; // JSON fields don't have array type

Should we be throwing exceptions from here?

Contributor Author

Agreed, done.

@suresh-prakash suresh-prakash merged commit d411cbc into hypertrace:main Nov 21, 2025
6 checks passed