[#28439] DocDB: Use bloom filter when user specified multiple keys
Summary:
After commit 9d2e474/D41176, bloom filters can be modified dynamically during scan operations.
Moreover, commit b97ccc9 / D41475 introduced bloom filter support for simple index scans that do not use HybridScanChoices.
There are scenarios where users query multiple keys, for instance, using an IN clause where HybridScanChoices is used.
This diff adds support for variable bloom filters to handle such cases efficiently.
Core challenge: SeekTuple, used by index scans, calls Seek to move the underlying iterator to the appropriate tuple. However, a scan using HybridScanChoices calls SeekForward to move the iterator across the scan choices, because SeekForward is more efficient than a plain Seek. SeekForward is intentionally a no-op if the iterator is already positioned after the target key/tuple. This requires comparing the iterator's current position with the target key. To avoid incorrectly skipping keys, the iterator must never report a current position after the target key while there are keys between the current position and the target key. Therefore, to support a variable bloom filter, HybridScanChoices must prevent the underlying iterator from moving beyond the current scan choice. Otherwise, the next scan choice may not be found even when it is present.
To support the SeekForward optimization:
1. UpdateFilterKey now takes an extra seek_key argument to reposition the SST file iterators appropriately.
2. HybridScanChoices now uses the upperbound mechanism to iterate through the scan choices one by one.
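The core challenge above can be illustrated with a toy model. This is only a sketch under stated assumptions, not the DocDB implementation: `ToyIterator`, its integer keys, the "file passes only if it contains the filter key" rule, and the two-argument `UpdateFilterKey` shown here are all hypothetical stand-ins. Two "files" hold keys {1, 5} and {2}, and the scan choices are 1 and 2.

```cpp
#include <algorithm>
#include <cassert>
#include <optional>
#include <vector>

// Toy model only (all names hypothetical). Each "file" is a sorted list of
// integer keys; the filter admits a file only if it contains filter_key.
struct ToyIterator {
  std::vector<std::vector<int>> files;
  int filter_key = 0;
  std::optional<int> upper_bound;  // exclusive; nullopt means unbounded
  std::optional<int> current;      // nullopt means not positioned on a key

  bool FilePasses(const std::vector<int>& f) const {
    return std::binary_search(f.begin(), f.end(), filter_key);
  }

  // Position on the smallest visible key >= target, honoring the filter
  // and the upper bound.
  void Seek(int target) {
    current.reset();
    for (const auto& f : files) {
      if (!FilePasses(f)) continue;
      auto it = std::lower_bound(f.begin(), f.end(), target);
      if (it == f.end()) continue;
      if (upper_bound && *it >= *upper_bound) continue;
      if (!current || *it < *current) current = *it;
    }
  }

  void Next() {
    if (current) Seek(*current + 1);
  }

  // Assumed semantics: no-op when already positioned at or after target.
  void SeekForward(int target) {
    if (current && *current >= target) return;
    Seek(target);
  }

  // Hypothetical stand-in: switch the filter and reposition at seek_key.
  void UpdateFilterKey(int new_filter_key, int seek_key) {
    filter_key = new_filter_key;
    Seek(seek_key);
  }
};

// Without an upper bound, the iterator drifts to key 5 while filtering for
// choice 1, so SeekForward(2) is a no-op and key 2 is missed.
bool NaiveFindsSecondChoice() {
  ToyIterator it;
  it.files = {{1, 5}, {2}};
  it.filter_key = 1;
  it.Seek(1);
  it.Next();          // lands on 5; the file holding 2 is filtered out
  it.filter_key = 2;  // filter changed, iterator not repositioned
  it.SeekForward(2);  // no-op because 5 >= 2
  return it.current == 2;
}

// With an upper bound per scan choice, the iterator cannot move past the
// current choice, and the filter update repositions it at the next one.
bool BoundedFindsSecondChoice() {
  ToyIterator it;
  it.files = {{1, 5}, {2}};
  it.filter_key = 1;
  it.upper_bound = 2;  // choice 1 covers only key 1
  it.Seek(1);
  it.Next();           // stops at the bound instead of reaching 5
  it.upper_bound = 3;  // choice 2 covers only key 2
  it.UpdateFilterKey(2, /*seek_key=*/2);
  return it.current == 2;
}
```

In this model `NaiveFindsSecondChoice()` returns false and `BoundedFindsSecondChoice()` returns true, which mirrors why the scan has to be bounded to the current scan choice before the filter key can change.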
Tradeoff: After this change, there are some scenarios with increased seeks.
Example:
CREATE TABLE t (k INT PRIMARY KEY, v INT);
INSERT INTO t VALUES (1000, 1000);
SELECT k FROM t WHERE k IN (0, 1, 2, .., 999);
HybridScanChoices now does a seek for each scan choice. This was not the case before this change. However, users typically do not query for nonexistent keys.
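The trade-off in this example can be sketched by counting seeks in a toy model. The names `CountSeeks`, `SeekCounts`, and `Range` are hypothetical, and the model assumes that an unbounded SeekForward is a no-op once the iterator sits at or past the target, while a bounded scan has to seek once per scan choice. It is a rough illustration, not a measurement of DocDB.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Toy model only (all names hypothetical).
struct SeekCounts {
  std::size_t unbounded;  // seeks when SeekForward may no-op across choices
  std::size_t bounded;    // seeks when each scan choice has an upper bound
};

// Helper: returns {0, 1, ..., n - 1}.
std::vector<int> Range(int n) {
  std::vector<int> v(n);
  std::iota(v.begin(), v.end(), 0);
  return v;
}

// Both inputs are assumed to be sorted in ascending order.
SeekCounts CountSeeks(const std::vector<int>& table_keys,
                      const std::vector<int>& in_list) {
  SeekCounts counts{0, 0};

  // Unbounded: a real seek happens only if the iterator is unpositioned
  // or still before the target.
  bool positioned = false;
  auto pos = table_keys.begin();
  for (int target : in_list) {
    if (positioned && (pos == table_keys.end() || *pos >= target)) continue;
    pos = std::lower_bound(table_keys.begin(), table_keys.end(), target);
    positioned = true;
    ++counts.unbounded;
  }

  // Bounded: the iterator may not rest beyond the current choice, so every
  // choice starts with a real seek.
  for (int target : in_list) {
    auto it = std::lower_bound(table_keys.begin(), table_keys.end(), target);
    (void)it;  // position is irrelevant here; only the seek is counted
    ++counts.bounded;
  }
  return counts;
}
```

For a table holding only key 1000 and an IN list of 0 through 999, `CountSeeks({1000}, Range(1000))` yields 1 unbounded seek and 1000 bounded seeks in this model.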
Performance measurements using newly added `PgSingleTServerTest.BloomFilterPerf` against master (b5373ca), release build, no LTO:
Master: 1.0s
This diff: 0.8s
In one long-running workload (the data set footprint grew larger over a 24-hour period) with a good mix of queries using IN lists on primary key or indexed columns, this optimization improved overall business txns/sec from 115 to 155, i.e. about a 34% improvement.
Also used TPCC to check that there is no regression in scenarios that are not covered by the improved logic.
Jira: DB-18123
Test Plan:
PgSingleTServerTest.BloomFilterIn
PgSingleTServerTest.BloomFilterPerf
Reviewers: timur, rthallam, patnaik.balivada
Reviewed By: timur, patnaik.balivada
Subscribers: smishra, ybase, yql
Tags: #jenkins-ready
Differential Revision: https://phorge.dev.yugabyte.com/D46548