[feature](read-uncommitted) Add READ UNCOMMITTED isolation level (Phase 1)#60586
[feature](read-uncommitted) Add READ UNCOMMITTED isolation level (Phase 1)#60586dataroaring wants to merge 14 commits intomasterfrom
Conversation
…unpublished rowsets (Phase 1) Enable queries to read uncommitted but committed-to-storage rowsets via SET read_uncommitted = true. Supports DUP_KEYS and UNIQUE_KEYS tables in both local and cloud modes. Includes async cross-uncommitted dedup for MoW tables with compaction and schema change awareness.
|
Thank you for your contribution to Apache Doris. Please clearly describe your PR:
|
There was a problem hiding this comment.
Pull request overview
Adds an opt-in READ UNCOMMITTED query mode by propagating a new read_uncommitted session/query option from FE→BE, and introducing a BE-side UncommittedRowsetRegistry to surface uncommitted rowsets (with async MoW dedup/delete-bitmap handling) to the scan path.
Changes:
- Add
read_uncommittedtoTQueryOptionsand FESessionVariable, and forward it to BE runtime state. - Introduce
UncommittedRowsetRegistry(local + cloud engine integration) to track uncommitted rowsets and compute cross-uncommitted delete bitmaps asynchronously. - Hook registry lifecycle into write/commit, publish/rollback cleanup, compaction invalidation, tablet state changes, and OLAP scan source injection.
Reviewed changes
Copilot reviewed 16 out of 16 changed files in this pull request and generated 3 comments.
Show a summary per file
| File | Description |
|---|---|
| gensrc/thrift/PaloInternalService.thrift | Adds read_uncommitted to TQueryOptions for FE→BE propagation. |
| fe/fe-core/src/main/java/org/apache/doris/qe/SessionVariable.java | Introduces session variable and forwards it into thrift query options. |
| be/src/runtime/runtime_state.h | Exposes read_uncommitted() accessor from thrift query options. |
| be/src/pipeline/exec/olap_scan_operator.cpp | Injects dedup-ready uncommitted rowsets (and merges delete bitmaps) into scan read sources when enabled. |
| be/src/olap/uncommitted_rowset_registry.h | Defines registry + entry model for tracking uncommitted rowsets and delete bitmaps. |
| be/src/olap/uncommitted_rowset_registry.cpp | Implements sharded concurrent registry and async cross-uncommitted dedup bitmap computation. |
| be/src/olap/txn_manager.cpp | Unregisters uncommitted rowsets on publish/delete. |
| be/src/olap/storage_engine.h / be/src/olap/storage_engine.cpp | Adds/initializes registry in local storage engine. |
| be/src/olap/rowset_builder.cpp | Registers committed-but-unpublished rowsets into the registry. |
| be/src/olap/compaction.cpp | Notifies registry after compaction to invalidate/recompute dedup state. |
| be/src/olap/base_tablet.cpp | Clears registry entries when tablet leaves RUNNING. |
| be/src/cloud/cloud_txn_delete_bitmap_cache.cpp | Unregisters registry entries when cloud txn info expires/is removed. |
| be/src/cloud/cloud_storage_engine.h / be/src/cloud/cloud_storage_engine.cpp | Adds/initializes registry in cloud storage engine. |
| be/src/cloud/cloud_rowset_builder.cpp | Registers cloud committed-but-unpublished rowsets into the registry. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| public boolean isInDebugMode() { | ||
| return showHiddenColumns || skipDeleteBitmap || skipDeletePredicate || skipDeleteSign || skipStorageEngineMerge | ||
| || skipMissingVersion || skipBadTablet; | ||
| || skipMissingVersion || skipBadTablet || readUncommitted; |
There was a problem hiding this comment.
read_uncommitted is added to DEBUG_VARIABLES/isInDebugMode(), which will unexpectedly put the session into “debug mode” and therefore disables MV rewrite and forbids UPDATE/DELETE (see callers of isInDebugMode()). This looks like a functional behavior change unrelated to READ UNCOMMITTED; consider removing it from DEBUG_VARIABLES and isInDebugMode() (or introduce a separate flag/check) so enabling READ UNCOMMITTED doesn’t block DML or planner features.
| || skipMissingVersion || skipBadTablet || readUncommitted; | |
| || skipMissingVersion || skipBadTablet; |
| { | ||
| std::shared_lock rlock(shard.lock); | ||
| auto it = shard.entries.find(tablet_id); | ||
| if (it == shard.entries.end()) { | ||
| return; | ||
| } | ||
|
|
||
| // Invalidate all cross-delete bitmaps and mark not ready | ||
| for (auto& entry : it->second) { | ||
| if (entry->unique_key_merge_on_write) { | ||
| entry->dedup_ready.store(false, std::memory_order_release); | ||
| entry->cross_delete_bitmap.reset(); | ||
| has_mow_entries = true; | ||
| } |
There was a problem hiding this comment.
on_compaction_completed() mutates UncommittedRowsetEntry state (dedup_ready, cross_delete_bitmap) while only holding a shard shared_lock, and readers access these fields without any locking (e.g. scan injection reads entry->cross_delete_bitmap). This can cause a real data race on the std::shared_ptr assignment/reset. Consider making entry updates copy-on-write (replace the shared_ptr entry object in the vector), or guard bitmap pointers with synchronization (e.g. per-entry mutex or std::atomic<std::shared_ptr<...>>), and use an exclusive shard lock for updates.
| { | |
| std::shared_lock rlock(shard.lock); | |
| auto it = shard.entries.find(tablet_id); | |
| if (it == shard.entries.end()) { | |
| return; | |
| } | |
| // Invalidate all cross-delete bitmaps and mark not ready | |
| for (auto& entry : it->second) { | |
| if (entry->unique_key_merge_on_write) { | |
| entry->dedup_ready.store(false, std::memory_order_release); | |
| entry->cross_delete_bitmap.reset(); | |
| has_mow_entries = true; | |
| } | |
| std::lock_guard wlock(shard.lock); | |
| auto it = shard.entries.find(tablet_id); | |
| if (it == shard.entries.end()) { | |
| return; | |
| } | |
| // Invalidate all cross-delete bitmaps and mark not ready | |
| for (auto& entry : it->second) { | |
| if (entry->unique_key_merge_on_write) { | |
| entry->dedup_ready.store(false, std::memory_order_release); | |
| entry->cross_delete_bitmap.reset(); | |
| has_mow_entries = true; |
| // Invalidate all cross-delete bitmaps and mark not ready | ||
| for (auto& entry : it->second) { | ||
| if (entry->unique_key_merge_on_write) { | ||
| entry->dedup_ready.store(false, std::memory_order_release); | ||
| entry->cross_delete_bitmap.reset(); | ||
| has_mow_entries = true; | ||
| } | ||
| } |
There was a problem hiding this comment.
Compaction changes published rowsets/row-ids, so committed_delete_bitmap (layer 2) computed at commit time against published rowsets can become stale after compaction. on_compaction_completed() currently only clears/recomputes cross_delete_bitmap (layer 3), which risks incorrect MoW dedup under READ UNCOMMITTED after compaction. Consider also invalidating and recomputing the committed-vs-published bitmap (or otherwise ensuring it is mapped to the post-compaction rowsets).
…ompute bitmaps on publish 1. Replace boolean read_uncommitted with MySQL-compatible transaction_isolation session variable (supports READ-COMMITTED, READ-UNCOMMITTED). tx_isolation is an alias. Default changed from REPEATABLE-READ to READ-COMMITTED. 2. Fix correctness: when a rowset is published, remaining uncommitted rowsets need both committed_delete_bitmap (layer 2) and cross_delete_bitmap (layer 3) recomputed, because the published rowset changes the set of visible rowsets. Store pre_rowset_ids in entries for incremental recomputation.
…tes for READ UNCOMMITTED (Option B) Remove the complex async cross-uncommitted dedup machinery in favor of a simpler design that accepts potential duplicate keys across concurrent uncommitted MoW transactions. This eliminates: - CalcDeleteBitmapExecutor integration and thread pool usage - Serial per-tablet dedup mutexes and async task submission - cross_delete_bitmap, pre_rowset_ids, dedup_ready, creation_time fields - Compaction hook (on_compaction_completed) - Bitmap recomputation on publish The registry is now a simple sharded concurrent map. The scan operator adds rowset-ID dedup to prevent double-reading during the publish window, and only merges the committed-vs-published delete bitmap (layer 2). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…ommitted rowsets Delete bitmap handling for READ UNCOMMITTED is complex and not essential for Phase 1. Uncommitted rowsets are now injected without any bitmap — MoW UNIQUE_KEYS tablets may show stale key versions under dirty reads, which is acceptable for this isolation level. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…for MoW tablets The delete bitmap is already computed by the normal MoW write path — we just capture it, no extra computation. Without it, READ UNCOMMITTED on UNIQUE_KEYS tables would show stale rows that should be masked. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…ted rowset management Tablet state handling is not the responsibility of the uncommitted rowset registry — it is handled upstream. Remove on_tablet_state_change and the TABLET_RUNNING guard during scan. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…e expiration only Remove unregister_rowset calls from publish, rollback, and expiration paths in TxnManager and CloudTxnDeleteBitmapCache. Cleanup is now handled by: 1. Expiration: remove_expired_entries() runs periodically (default 30s) 2. Query-time: scan operator removes entries already in published path Add uncommitted_rowset_expire_sec config (default 30s). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
… of set_txn_related_delete_bitmap Registration is not related to delete bitmaps and should run for all table types, not just MoW. Extract to a separate register_uncommitted_rowset method and call it as step 7 in cloud_tablets_channel close flow. Also fixes a bug where MoW tablets with skip_writing_rowset_metadata would return early and skip registration. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…wset Move registration into CloudDeltaWriter::commit_rowset() right after the meta service commit succeeds, which is the natural point where the rowset becomes committed-but-not-published. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…ed files Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…dRowsetRegistry BE unit tests cover register/get/unregister/expiration/sharding logic. Regression test verifies session variable, queries on dup/mow/mor tables, isolation level switching, and predicate pushdown under READ-UNCOMMITTED. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
80f004b to
9afc5a7
Compare
|
run buildall |
Cloud UT Coverage ReportIncrement line coverage Increment coverage report
|
FE UT Coverage ReportIncrement line coverage |
…stry in olap_server olap_server.cpp calls remove_expired_entries() on UncommittedRowsetRegistry but only had a forward declaration via storage_engine.h. Add the full header include to fix the incomplete type error. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
14dd034 to
d37bc8d
Compare
|
run buildall |
Cloud UT Coverage ReportIncrement line coverage Increment coverage report
|
TPC-H: Total hot run time: 30785 ms |
ClickBench: Total hot run time: 28.33 s |
BE UT Coverage ReportIncrement line coverage Increment coverage report
|
BE Regression && UT Coverage ReportIncrement line coverage Increment coverage report
|
FE Regression Coverage ReportIncrement line coverage |
…nfig flag Add mutable bool config `enable_uncommitted_rowset_registry` (default true) to control whether uncommitted rowsets are registered on BE. When disabled, the register calls in both local and cloud rowset builders are skipped, effectively turning off READ UNCOMMITTED visibility without restart. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
|
run buildall |
Cloud UT Coverage ReportIncrement line coverage Increment coverage report
|
TPC-H: Total hot run time: 30646 ms |
TPC-DS: Total hot run time: 190529 ms |
ClickBench: Total hot run time: 28.31 s |
BE UT Coverage ReportIncrement line coverage Increment coverage report
|
BE Regression && UT Coverage ReportIncrement line coverage Increment coverage report
|
FE Regression Coverage ReportIncrement line coverage |
Summary
SET read_uncommitted = truesession variable to enable READ UNCOMMITTED isolation levelUncommittedRowsetRegistry— a sharded concurrent registry that tracks uncommitted rowsets and computes cross-uncommitted delete bitmaps asynchronously for MoW (merge-on-write) UNIQUE_KEYS tablesArchitecture
Three-layer delete bitmap strategy for MoW correctness:
Test plan
SET read_uncommitted=true→ query sees data🤖 Generated with Claude Code