[fix](fe) avoid concurrent tablet stat iteration failures#63298
Conversation
|
Thank you for your contribution to Apache Doris. Please clearly describe your PR:
|
|
run buildall |
1 similar comment
|
run buildall |
TPC-H: Total hot run time: 30762 ms |
TPC-DS: Total hot run time: 167902 ms |
FE UT Coverage ReportIncrement line coverage |
|
run p0 |
|
run buildall |
TPC-H: Total hot run time: 31516 ms |
TPC-DS: Total hot run time: 169444 ms |
|
/review |
be90113 to
4f1a804
Compare
|
run buildall |
…tablet lists Replace defensive-copy-on-read with copy-on-write via volatile snapshot for LocalTablet.replicas and MaterializedIndex.tablets. Writers (synchronized) build a new list and swap the volatile reference; readers take a single volatile read and iterate an immutable snapshot, so getReplicas()/getTablets() can no longer throw ConcurrentModificationException even while a concurrent DDL thread mutates the tablet, and the hot read path no longer copies elements. Also add a staleTabletStatSkipped counter to TabletStatMgr to make the TOCTOU skip observable, harden the regression test with a positive SHOW DATA assertion and orphan-table cleanup, and note the cloud stat-mgr path is covered.
4f1a804 to
9714dbe
Compare
|
run buildall |
1 similar comment
|
run buildall |
TPC-H: Total hot run time: 30725 ms |
TPC-DS: Total hot run time: 169285 ms |
FE Regression Coverage ReportIncrement line coverage |
|
run p0 |
|
run p0 regression |
|
run p0 |
FE Regression Coverage ReportIncrement line coverage |
1 similar comment
FE Regression Coverage ReportIncrement line coverage |
|
run p0 |
FE Regression Coverage ReportIncrement line coverage |
|
run feut |
FE Regression Coverage ReportIncrement line coverage |
FE UT Coverage ReportIncrement line coverage |
|
/review |
There was a problem hiding this comment.
Review result: request changes.
Critical checkpoint conclusions:
- Goal/test: The PR aims to make tablet/stat iteration safe while replicas/tablets are concurrently changed. The new unit tests cover immutable snapshot iteration for the local classes, but the existing review thread already notes the prior end-to-end TabletStatMgr regression did not reliably exercise the daemon path; I did not duplicate that inline concern.
- Scope/focus: The change is mostly focused, but the MaterializedIndex write path introduces a broad performance regression for bulk tablet creation.
- Concurrency: Copy-on-write snapshots avoid ConcurrentModificationException for list iteration, and writer methods are synchronized. No additional deadlock from the new synchronization was found, but the expensive copy is now done while holding the index monitor.
- Lifecycle/static init: No special lifecycle or static-initialization issue found.
- Configuration/compatibility/protocol: No new config, storage format, or FE-BE protocol compatibility issue found.
- Parallel paths: LocalTablet and MaterializedIndex were both changed; CloudTablet is unaffected because it returns a singleton replica list.
- Tests: Unit coverage was added for snapshot behavior. No tests were run locally in this review.
- Observability/transactions/persistence/data writes: No new logging/metrics need identified; persistence shape remains the same. The main data-write concern is FE DDL/restore performance for bulk tablet creation.
User focus: No additional user-provided review focus was specified.
| // (bucket count) is small and creation cost is dominated by replica/RPC work — the | ||
| // copy is negligible, and CME-safe reads on the hot query path are worth it. | ||
| public synchronized void addTablet(Tablet tablet, TabletMeta tabletMeta, boolean isRestore) { | ||
| idToTablets.put(tablet.getId(), tablet); |
There was a problem hiding this comment.
This makes bulk tablet creation quadratic. InternalCatalog.createTablets() loops over distributionInfo.getBucketNum() and calls index.addTablet(...) once per bucket, so this new new ArrayList<>(tablets) copies 0 + 1 + ... + (bucketNum - 1) entries while holding the synchronized index monitor. The default bucket limit is already 768 and max_bucket_num_per_partition can be raised or disabled, so CREATE TABLE / ADD PARTITION / restore with many buckets or rollups now spend avoidable FE time and allocations just rebuilding the same prefix repeatedly. Please avoid per-tablet copy-on-write for bulk creation, for example by adding a bulk publish path that builds the new list once, or by otherwise preserving append as O(1) while still returning reader snapshots.
MaterializedIndex.addTablet performed copy-on-write per tablet, making bulk paths (createTablets, schema change, MV rollup, restore) quadratic inside the synchronized block. Add appendTablets(Collection<Tablet>) that does one O(n) copy + one volatile publish per batch, and convert InternalCatalog.createTablets, CloudInternalCatalog.createCloudTablets, SchemaChangeHandler, MaterializedViewHandler, RestoreJob, CloudRestoreJob, and OlapTable's restore loop to use it. TabletInvertedIndex registration stays per-iteration so Tablet.addReplica still finds the tablet; only the per-index list copy moves out of the loop. Readers remain CME-safe. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
/review |
There was a problem hiding this comment.
I found two remaining concurrency issues in the copy-on-write changes. The list iteration paths are improved, but MaterializedIndex.getTablet() still reads a mutable HashMap concurrently with synchronized writers, and LocalTablet still has direct multi-read uses of the volatile replica list that can observe different snapshots inside one method.
Critical checkpoint conclusions:
- Goal/test: the PR aims to make concurrent tablet/replica reads safe and adds deterministic FE unit tests, but the tests only cover
getTablets()/getReplicas()iteration and miss the remaininggetTablet()and direct-field reader cases. - Scope: the approach is focused, but the snapshot model is incomplete where parallel lookup paths remain.
- Concurrency: yes; writers may run without an OlapTable write lock, and not all shared state is protected by the new volatile snapshot pattern.
- Lifecycle/static initialization: no special lifecycle or static initialization concerns found.
- Configuration: no new configuration items in the final diff.
- Compatibility/storage format: no incompatible persisted format change found.
- Parallel paths:
getTablet()remains a parallel reader path togetTablets(), and directreplicasreaders remain parallel paths togetReplicas(). - Conditional checks: no problematic new conditionals found.
- Test coverage: unit tests cover list iteration snapshots but not map lookup races or direct volatile-field multi-read methods.
- Test results: I did not run tests in this review runner.
- Observability: not applicable beyond existing debug logging.
- Transaction/persistence/data writes: no transaction or EditLog persistence issue found.
- FE/BE variable passing: not applicable.
- Performance: the previous quadratic bulk-publish issue appears addressed by
appendTablets.
User focus: no additional user-provided review focus was present.
| next.addAll(tablets); | ||
| for (Tablet tablet : newTablets) { | ||
| idToTablets.put(tablet.getId(), tablet); | ||
| next.add(tablet); |
There was a problem hiding this comment.
This still mutates idToTablets in place while getTablet() reads the same plain HashMap without synchronization. The PR makes the ordered tablets list a volatile copy-on-write snapshot, but lookup readers such as scheduler/report/proc paths use getTablet() and do not get that protection; at the same time the new comments explicitly document writers that do not necessarily hold the OlapTable write lock. A concurrent appendTablets()/clearTabletsForRestore() can therefore race with getTablet() on HashMap, which is exactly the kind of metadata reader race this change is trying to remove. Please publish idToTablets with the same snapshot discipline (for example a volatile copied map updated together with tablets) or otherwise make all reads/writes consistently synchronized/concurrent.
| @SerializedName(value = "rs", alternate = {"replicas"}) | ||
| private List<Replica> replicas; | ||
| private volatile List<Replica> replicas; | ||
| @SerializedName(value = "lastCheckTime") |
There was a problem hiding this comment.
Making replicas volatile is not enough while methods still read the field directly more than once instead of capturing one snapshot. For example, getRemoteDataSize() iterates replicas looking for cooldownReplicaId, then rereads replicas for stream().max(...).get(); a concurrent deleteReplica*() can publish an empty/different list between those reads and make this throw or compute from a different snapshot. Other direct-field readers such as getReplicaByBackendId(), equals(), and readyToBeRepaired() also bypass the getReplicas() snapshot convention. Please update these readers to use a single local List<Replica> snapshot = replicas (or getReplicas()) per method so the copy-on-write guarantee actually covers all replica readers, not only external iteration.
…metadata Address two follow-up review notes on the COW snapshot work: 1. MaterializedIndex.idToTablets was still a plain HashMap mutated under synchronized writers but read lock-free by getTablet(id) on scheduler, report, proc and other paths. Publish it as a volatile immutable snapshot in lockstep with the tablets list. Ordering preserves the tablets ⊆ idToTablets invariant: append publishes the map before the list (any id visible via iteration is queryable), clearTabletsForRestore drops the list before the map. 2. LocalTablet had readers re-reading the volatile `replicas` field more than once inside a single method (getRemoteDataSize, equals, readyToBeRepaired), defeating the COW snapshot guarantee. Capture one local snapshot at the top of each method and use it throughout. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
/review |
There was a problem hiding this comment.
Review conclusion: I did not find additional distinct blocking issues in the current head beyond the already-known inline review context.
Critical checkpoint conclusions:
- Goal/test: The PR targets FE tablet/replica metadata reader safety under concurrent mutation by using volatile copy-on-write snapshots and adds focused unit tests for snapshot immutability/concurrent iteration.
- Scope: Changes are focused on MaterializedIndex/LocalTablet and required bulk-publish call sites.
- Concurrency: The updated MaterializedIndex map/list and LocalTablet replica list readers now use snapshot-style access; no new deadlock-prone lock ordering was identified.
- Lifecycle/static initialization: No special static/global lifecycle issue found.
- Config/compatibility: No new config, storage format, or FE-BE protocol compatibility issue found.
- Parallel paths: Shared-nothing/cloud create and restore paths were checked; no additional distinct issue found.
- Tests: Added FE unit coverage is relevant. I attempted targeted tests with
mvn -pl fe-core -Dtest=org.apache.doris.catalog.MaterializedIndexTest,org.apache.doris.catalog.TabletTest test -Dskip.doc=true, but the runner could not resolve local artifactorg.apache.doris:fe-foundation:1.2-SNAPSHOT, so tests did not run here. - Observability/transactions/persistence: No new observability or persistence/edit-log concern found for these metadata-only changes.
- User focus: No additional user-provided review focus was specified.
|
run buildall |
|
PR approved by at least one committer and no changes requested. |
TPC-H: Total hot run time: 30729 ms |
TPC-DS: Total hot run time: 169253 ms |
FE Regression Coverage ReportIncrement line coverage |
1 similar comment
FE Regression Coverage ReportIncrement line coverage |
### What problem does this PR solve? Issue Number: #59138 Problem Summary: TabletStatMgr.runAfterCatalogReady() is a periodic master-FE daemon that iterates every tablet/replica to pull statistics. When DDL runs concurrently with this daemon, two races fire: Iteration race (CME). MaterializedIndex.tablets and LocalTablet.replicas were plain ArrayLists whose getters returned the internal list. A concurrent addTablet / addReplica / deleteReplica (clone, repair, schema change, restore, report handler) during iteration triggered the fail-fast iterator and threw ConcurrentModificationException. TOCTOU race. In updateTabletStat, a getTabletMeta(id) != null check is followed by getReplica(id, beId). If the tablet is removed in between, getReplica hits Preconditions.checkState(...) and throws IllegalStateException. When the daemon throws, the current cycle leaves stale tablet/partition/table sizes and skewed MetricRepo metrics until the next cycle. Solution: Close the CME race for good with copy-on-write via a volatile snapshot. A first attempt returned a defensive copy (Lists.newArrayList(...)), but the copy itself iterates the source list and can still CME mid-copy — the window shrank but did not close. This PR instead: Makes LocalTablet.replicas and MaterializedIndex.tablets volatile. Writers (addReplica / deleteReplica / deleteReplicaByBackendId / addTablet / clearTabletsForRestore) are synchronized, build a new list, and atomically swap the volatile reference — they never mutate a list in place. Readers (getReplicas() / getTablets()) do a single volatile read and return an immutable snapshot (Collections.unmodifiableList). Iteration is lock-free and can never CME, and the hot read path no longer copies elements. synchronized on writers is required (not just volatile) because some write paths do not hold the OlapTable write lock — verified by tracing call sites: InternalCatalog.createPartitionWithIndices and RestoreJob.resetPartitionForRestore call addReplica/addTablet without the table write lock, so concurrent writers are real and a plain volatile field would allow lost updates. Writers are infrequent (DDL / repair / restore), so the lock cost is negligible; reads stay lock-free. TOCTOU race is handled by catching IllegalStateException around getReplica (kept from the original fix) and counting the skip via a new TabletStatMgr.staleTabletStatSkipped counter, which makes the race observable (>0 proves the window was actually hit) instead of relying solely on log scraping. Cloud path: CloudTabletStatMgr.updateStatInfo iterates tablet.getReplicas() and is covered by the same snapshot fix; its updateTabletStat uses getReplicasByTabletId (locked, returns empty list, no checkState) and is already safe. ### Release note None ### Check List (For Author) - Test <!-- At least one of them must be included. --> - [ ] Regression test - [ ] Unit Test - [ ] Manual test (add detailed scripts or steps below) - [ ] No need to test or manual test. Explain why: - [ ] This is a refactor/code format and no logic has been changed. - [ ] Previous test can cover this change. - [ ] No code files have been changed. - [ ] Other reason <!-- Add your reason? --> - Behavior changed: - [ ] No. - [ ] Yes. <!-- Explain the behavior change --> - Does this need documentation? - [ ] No. - [ ] Yes. <!-- Add document PR link here. eg: apache/doris-website#1214 --> ### Check List (For Reviewer who merge this PR) - [ ] Confirm the release note - [ ] Confirm test cases - [ ] Confirm document - [ ] Add branch pick label <!-- Add branch pick label that this PR should merge into --> --------- Co-authored-by: morningman <yunyou@selectdb.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
What problem does this PR solve?
Issue Number: #59138
Problem Summary:
TabletStatMgr.runAfterCatalogReady() is a periodic master-FE daemon that iterates every tablet/replica to pull statistics. When DDL runs concurrently with this daemon, two races fire:
Iteration race (CME). MaterializedIndex.tablets and LocalTablet.replicas were plain ArrayLists whose getters returned the internal list. A concurrent addTablet / addReplica / deleteReplica (clone, repair, schema change, restore, report handler) during iteration triggered the fail-fast iterator and threw ConcurrentModificationException.
TOCTOU race. In updateTabletStat, a getTabletMeta(id) != null check is followed by getReplica(id, beId). If the tablet is removed in between, getReplica hits Preconditions.checkState(...) and throws IllegalStateException.
When the daemon throws, the current cycle leaves stale tablet/partition/table sizes and skewed MetricRepo metrics until the next cycle.
Solution:
Close the CME race for good with copy-on-write via a volatile snapshot. A first attempt returned a defensive copy (Lists.newArrayList(...)), but the copy itself iterates the source list and can still CME mid-copy — the window shrank but did not close. This PR instead:
Makes LocalTablet.replicas and MaterializedIndex.tablets volatile.
Writers (addReplica / deleteReplica / deleteReplicaByBackendId / addTablet / clearTabletsForRestore) are synchronized, build a new list, and atomically swap the volatile reference — they never mutate a list in place.
Readers (getReplicas() / getTablets()) do a single volatile read and return an immutable snapshot (Collections.unmodifiableList). Iteration is lock-free and can never CME, and the hot read path no longer copies elements.
synchronized on writers is required (not just volatile) because some write paths do not hold the OlapTable write lock — verified by tracing call sites: InternalCatalog.createPartitionWithIndices and RestoreJob.resetPartitionForRestore call addReplica/addTablet without the table write lock, so concurrent writers are real and a plain volatile field would allow lost updates. Writers are infrequent (DDL / repair / restore), so the lock cost is negligible; reads stay lock-free.
TOCTOU race is handled by catching IllegalStateException around getReplica (kept from the original fix) and counting the skip via a new TabletStatMgr.staleTabletStatSkipped counter, which makes the race observable (>0 proves the window was actually hit) instead of relying solely on log scraping.
Cloud path: CloudTabletStatMgr.updateStatInfo iterates tablet.getReplicas() and is covered by the same snapshot fix; its updateTabletStat uses getReplicasByTabletId (locked, returns empty list, no checkState) and is already safe.
Release note
None
Check List (For Author)
Test
Behavior changed:
Does this need documentation?
Check List (For Reviewer who merge this PR)