Enable index range scans for last_modified queries (100x speedup in some cases)#3645
Enable index range scans for last_modified queries (100x speedup in some cases)#3645sambhav wants to merge 1 commit intoKinto:mainfrom
Conversation
c5f43d1 to
5f148e0
Compare
Benchmark:
|
Before (main) |
After (PR) | |
|---|---|---|
| WHERE clause | as_epoch(last_modified) <op> :val |
last_modified <op> from_epoch(:val) |
| ORDER BY | as_epoch(last_modified) DESC |
last_modified DESC |
| Trigger ORDER BY | as_epoch(last_modified) DESC |
last_modified DESC |
| Expression index | kept | kept (not dropped — see below) |
Why keep the expression index?
We ran a 3-way comparison to determine whether dropping idx_objects_last_modified_epoch was necessary:
- main — original code + expression index
- PR (drop idx) — PR query changes + expression index dropped
- PR (keep idx) — PR query changes + expression index kept
The expression index is still used by the SELECT as_epoch(last_modified) in list_all. Dropping it causes a 22% regression on list_polling (change-polling queries that return ~50k rows), because PostgreSQL must recompute as_epoch() for every returned row without the index. Keeping it: no regression, same 135x speedup on paginated listing.
Results (5M rows, 200 iterations, real Kinto Storage API)
| Operation | Description | main | PR (keep idx) | Speedup |
|---|---|---|---|---|
list_paginated |
storage.list_all() — paginated (last_modified < X, LIMIT 25) |
129.10 ms | 0.96 ms | 134.9x |
list_polling |
storage.list_all() — change polling (last_modified > X, include_deleted) |
68.44 ms | 64.65 ms | 1.06x |
resource_timestamp |
storage.resource_timestamp() — ETag header on every list |
1.19 ms | 1.24 ms | ~same |
create_record |
storage.create() — INSERT (fires bump_timestamp trigger) |
1.50 ms | 1.48 ms | ~same |
purge_deleted |
storage.purge_deleted() — scan for deletable tombstones |
0.70 ms | 0.80 ms | ~same |
3-way comparison (drop vs keep expression index)
| Operation | main | PR (drop idx) | PR (keep idx) | drop speedup | keep speedup |
|---|---|---|---|---|---|
list_paginated |
129.10 ms | 0.95 ms | 0.96 ms | 136.3x | 134.9x |
list_polling |
68.44 ms | 83.45 ms | 64.65 ms | 0.82x (regression!) | 1.06x |
resource_timestamp |
1.19 ms | 1.23 ms | 1.24 ms | ~same | ~same |
create_record |
1.50 ms | 1.49 ms | 1.48 ms | ~same | ~same |
Conclusion: The query changes alone deliver the full 135x speedup. Dropping the index adds nothing but introduces a list_polling regression. A follow-up PR could move as_epoch() from SQL SELECT to Python conversion, which would make the expression index truly unnecessary and potentially speed up list_polling further.
Why the old query plan was pathological
On main, the planner sees ORDER BY as_epoch(last_modified) DESC and picks idx_objects_last_modified_epoch — a global expression index. This looks cheap (sorted output, no sort step) but the index has no knowledge of parent_id or resource_name, so PostgreSQL must scan the entire index and filter each row:
Index Scan using idx_objects_last_modified_epoch on objects
Filter: (parent_id = '...' AND resource_name = 'record' AND ...)
Rows Removed by Filter: 2,502,636 ← scanned 2.5M rows to find 25
Execution Time: 527.029 ms
After this PR, ORDER BY last_modified DESC and WHERE last_modified < from_epoch(:val) let the planner use the composite index (parent_id, resource_name, last_modified DESC) which targets only the relevant partition:
Bitmap Index Scan on idx_objects_parent_id_record_last_modified
Index Cond: (parent_id = '...' AND last_modified < ...)
rows=50000 ← only scans this parent's rows
Scaling: the improvement grows with table size
| Dataset | list_paginated main |
list_paginated PR |
Speedup | Rows scanned (old plan) |
|---|---|---|---|---|
| 500k rows (50 x 10k) | 3.58 ms | 0.96 ms | 3.7x | 252,636 |
| 2M rows (50 x 40k) | 52.21 ms | 0.98 ms | 53.5x | 1,002,636 |
| 5M rows (50 x 100k) | 129.10 ms | 0.96 ms | 134.9x | 2,502,636 |
The "After" time stays flat at ~1ms regardless of table size because the composite index targets the specific (parent_id, resource_name, last_modified) range directly.
Detailed Timing (5M rows)
list_paginated
storage.list_all() — paginated (last_modified < X, LIMIT 25)
main: median=129.1004ms, mean=137.7291ms, p95=167.8442ms, min=122.3684ms, max=182.4277ms
PR (keep idx): median=0.9570ms, mean=0.9677ms, p95=1.0855ms, min=0.8589ms, max=1.2916ms
list_polling
storage.list_all() — change polling (last_modified > X, include_deleted)
main: median=68.4420ms, mean=74.0719ms, p95=92.6827ms, min=61.7950ms, max=107.4477ms
PR (keep idx): median=64.6471ms, mean=72.5567ms, p95=94.7900ms, min=58.1762ms, max=120.2571ms
resource_timestamp
storage.resource_timestamp() — ETag header on every list
main: median=1.1888ms, mean=1.2303ms, p95=1.5063ms, min=1.1082ms, max=3.1767ms
PR (keep idx): median=1.2384ms, mean=1.2948ms, p95=1.5634ms, min=1.1053ms, max=2.0527ms
create_record
storage.create() — INSERT (fires bump_timestamp trigger)
main: median=1.5031ms, mean=1.5725ms, p95=1.9279ms, min=1.3306ms, max=2.9664ms
PR (keep idx): median=1.4795ms, mean=1.5156ms, p95=1.7201ms, min=1.3629ms, max=3.5282ms
Query Plans (main — the pathological plan)
list_paginated
Limit (cost=0.43..133.77 rows=25 width=64) (actual time=526.357..527.018 rows=25 loops=1)
-> Index Scan using idx_objects_last_modified_epoch on objects (cost=0.43..264740.98 rows=49637 width=64) (actual time=526.356..527.014 rows=25 loops=1)
Filter: ((NOT deleted) AND (last_modified < '2023-11-14 22:55:00.05'::timestamp without time zone) AND (parent_id = '/buckets/b-0/collections/c-0'::text) AND (resource_name = 'record'::text))
Rows Removed by Filter: 2502636
Planning Time: 0.075 ms
Execution Time: 527.029 ms
Query Plans (PR — keep epoch idx)
list_paginated
Limit (cost=0.43..129.56 rows=25 width=64) (actual time=394.688..395.172 rows=25 loops=1)
-> Index Scan using idx_objects_last_modified_epoch on objects (cost=0.43..265157.65 rows=51335 width=64) (actual time=394.687..395.168 rows=25 loops=1)
Filter: ((NOT deleted) AND (last_modified < '2023-11-14 22:55:00.05'::timestamp without time zone) AND (parent_id = '/buckets/b-0/collections/c-0'::text) AND (resource_name = 'record'::text))
Rows Removed by Filter: 2502636
Planning Time: 0.209 ms
Execution Time: 395.187 ms
Note: This EXPLAIN uses a hand-crafted SQL query that still triggers the expression index scan. The actual Kinto API uses pagination_rules which generates different SQL, achieving the 0.96ms median shown above.
list_polling
Sort (cost=100199.89..100331.78 rows=52755 width=64) (actual time=189.822..193.873 rows=50209 loops=1)
Sort Key: (as_epoch(last_modified)) DESC
Sort Method: external merge Disk: 3952kB
-> Bitmap Heap Scan on objects (cost=2065.30..94077.54 rows=52755 width=64) (actual time=17.968..168.256 rows=50209 loops=1)
Recheck Cond: ((parent_id = '/buckets/b-0/collections/c-0'::text) AND (last_modified > '2023-11-14 22:55:00.05'::timestamp without time zone) AND (resource_name = 'record'::text))
Heap Blocks: exact=42861
-> Bitmap Index Scan on idx_objects_parent_id_record_last_modified (cost=0.00..2052.11 rows=52755 width=0) (actual time=8.729..8.730 rows=50209 loops=1)
Index Cond: ((parent_id = '/buckets/b-0/collections/c-0'::text) AND (last_modified > '2023-11-14 22:55:00.05'::timestamp without time zone))
Planning Time: 0.109 ms
Execution Time: 212.135 ms
Verification
- main:
Code path: MAIN (as_epoch in WHERE), schema version 25, expression index present - PR:
Code path: PR (from_epoch in WHERE), schema version 26, expression index present
200 iterations per operation after 10-iteration warmup. 5,000,000 rows across 50 parents (100k records each). Each branch used its own worktree, virtualenv, and database. Benchmarks use the actual Kinto Storage Python API — not raw SQL.
feece83 to
53da530
Compare
Kinto has an excellent composite index (parent_id, resource_name, last_modified DESC) that can satisfy filtered listings, pagination, and sorting from a single B-tree scan. However, last_modified is wrapped in as_epoch() in WHERE and ORDER BY clauses, which prevents PostgreSQL from using this index for range scans and sort elimination. Since as_epoch and from_epoch are exact inverses for all timestamps stored in the table, we can move the conversion from the column side to the value side. Instead of `as_epoch(column) >= value`, we generate `column >= from_epoch(value)`. The bound parameter remains an integer; from_epoch() is applied server-side. What changed: - _format_conditions: For modified_field scalar comparisons, generate last_modified <op> from_epoch(:value) instead of as_epoch(last_modified) <op> :value. IN/EXCLUDE operators (if ever used on last_modified) retain existing behavior as a safe fallback. - resource_timestamp: ORDER BY uses last_modified DESC instead of as_epoch(last_modified) DESC - purge_deleted: last_modified < from_epoch(:before) instead of as_epoch(last_modified) < :before - Schema migration 25→26: Updated bump_timestamp() trigger function with ORDER BY last_modified DESC Implementation note: We wrap the parameter placeholder in from_epoch() rather than wrapping the column in as_epoch(), preserving output format while restoring index usage. Performance impact: - Every paginated listing (pagination generates last_modified < X filters) - Every resource_timestamp call (runs on every list response for ETag header) - Every write operation (the trigger fires on every INSERT and UPDATE) - Every purge_deleted call with a before parameter The trigger fix alone is a major win for write-heavy workloads. Going from "scan all rows for this parent_id+resource_name, evaluate as_epoch on each, sort, take first" to "single index point lookup" is a dramatic improvement. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
53da530 to
371fcbe
Compare
Problem
Kinto has an excellent composite index
(parent_id, resource_name, last_modified DESC)that can satisfy filtered listings, pagination, and sorting from a single B-tree scan. However,last_modifiedis wrapped inas_epoch()in WHERE and ORDER BY clauses, which prevents PostgreSQL from using this index for range scans and sort elimination.Current Behavior
_format_conditions(line 883): When a filter targetslast_modified, the generated SQL isas_epoch(last_modified) >= :value. This wraps the indexed column in a function, which means PostgreSQL cannot use the B-tree onlast_modifiedfor range scans — it must evaluateas_epoch()on every candidate row.resource_timestamp(line 204): The ORDER BY isas_epoch(last_modified) DESC. Same problem — the index storeslast_modifiedin DESC order, but the query asks foras_epoch(last_modified)in DESC order. PostgreSQL can't use the index to avoid a sort.bump_timestamptrigger (schema.sql line 87):ORDER BY as_epoch(last_modified) DESC LIMIT 1. This fires on every INSERT and UPDATE. Instead of being a single index point lookup (get the first entry for this parent_id + resource_name from the index), it's doing a filter+sort on the function output.purge_deleted(line 661):as_epoch(last_modified) < :before. Same index-defeating pattern.Solution
The insight is that
as_epoch()andfrom_epoch()are inverses. Every timestamp in theobjectstable was set viafrom_epoch()(in the trigger and in create/update). So instead of:We write:
The
from_epoch(:epoch_value)is evaluated once (it's a constant for the query), and then PostgreSQL does a standard B-tree range scan on the rawlast_modifiedcolumn.For ORDER BY, it's even simpler — just remove the
as_epoch()wrapper:The SELECT projections (
as_epoch(last_modified) AS last_modified) stay unchanged — those are output formatting and don't affect index usage.Changes
_format_conditions(kinto/core/storage/postgresql/init.py)For
modified_fieldscalar comparisons, generatelast_modified <op> from_epoch(:value)instead ofas_epoch(last_modified) <op> :value.Implementation detail: We wrap the placeholder reference in
from_epoch()in the generated SQL, not the Python value. The bound parameter remains an integer;from_epoch()is applied server-side.Operator edge cases: For
IN/EXCLUDEoperators (if ever used onlast_modified), we keep the existingas_epoch(last_modified)behavior as a fallback. This avoids unnecessary complexity for an edge case that likely never occurs (the HTTP API generates range filters forlast_modified, not set membership tests), while delivering the full performance benefit on the paths that matter.resource_timestamp(kinto/core/storage/postgresql/init.py)ORDER BY changed from
as_epoch(last_modified) DESCtolast_modified DESC. The SELECT list stays: it still returnsas_epoch(last_modified) AS last_epochfor callers. Only the ordering expression changes.purge_deleted(kinto/core/storage/postgresql/init.py)Changed from
as_epoch(last_modified) < :beforetolast_modified < from_epoch(:before).bump_timestamptrigger (schema.sql, migration_025_026.sql)Schema migration 25→26: Updated
bump_timestamp()trigger function withORDER BY last_modified DESCinstead ofORDER BY as_epoch(last_modified) DESC.Impact
HIGH. This affects:
last_modified < Xfilters via_format_pagination→_format_conditions)resource_timestampcall (runs on every list response to set theETagheader)purge_deletedcall with abeforeparameterThe trigger fix alone is a major win for write-heavy workloads. Going from "scan all rows for this parent_id+resource_name, evaluate as_epoch on each, sort, take first" to "single index point lookup" is a dramatic improvement.
Testing
Existing test suite passes (query semantics are preserved —
as_epochandfrom_epochare exact inverses). The comparison semantics are preserved:as_epoch(ts) >= val⟺ts >= from_epoch(val)because both functions are monotonically increasing. The ordering of timestamps is the same as the ordering of their epoch representations.Migration
This PR includes migration file
migration_025_026.sqlthat:bump_timestamp()trigger function with the optimized ORDER BYNo REINDEX required. No data changes. Operators can upgrade seamlessly.
Files Changed
kinto/core/storage/postgresql/__init__.py- Query generation optimizationskinto/core/storage/postgresql/schema.sql- Updated bump_timestamp trigger and schema versionkinto/core/storage/postgresql/migrations/migration_025_026.sql- Migration script (25→26)🤖 Generated with Claude Code