[fix](fe) avoid concurrent tablet stat iteration failures #63298

Merged
yx-keith merged 8 commits into apache:master from yx-keith:fix-tablet-stat-concurrency
May 23, 2026

Conversation

@yx-keith
Contributor

@yx-keith yx-keith commented May 15, 2026

What problem does this PR solve?

Issue Number: #59138

Problem Summary:

TabletStatMgr.runAfterCatalogReady() is a periodic master-FE daemon that iterates every tablet/replica to pull statistics. When DDL runs concurrently with this daemon, two races fire:

  1. Iteration race (CME). MaterializedIndex.tablets and LocalTablet.replicas were plain ArrayLists whose getters returned the internal list. A concurrent addTablet / addReplica / deleteReplica (clone, repair, schema change, restore, report handler) during iteration triggered the fail-fast iterator and threw ConcurrentModificationException.
  2. TOCTOU race. In updateTabletStat, a getTabletMeta(id) != null check is followed by getReplica(id, beId). If the tablet is removed in between, getReplica hits Preconditions.checkState(...) and throws IllegalStateException.

When the daemon throws, the current cycle leaves stale tablet/partition/table sizes and skewed MetricRepo metrics until the next cycle.
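
For illustration, the iteration race can be reproduced without any threads, because the fail-fast iterator only compares a modification counter. The sketch below is a minimal stand-in and not the Doris code: the class CmeDemo and its methods are hypothetical names, and a write performed inside the loop simulates a concurrent DDL thread.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.ConcurrentModificationException;
import java.util.List;

// Hypothetical stand-in for a tablet whose getter leaks its internal ArrayList.
final class CmeDemo {
    private final List<Long> replicas = new ArrayList<>(Arrays.asList(1L, 2L, 3L));

    List<Long> getReplicas() {
        return replicas; // returns the mutable internal list itself
    }

    void addReplica(long id) {
        replicas.add(id); // in-place structural modification
    }

    // Returns true if iteration fails once a write interleaves with it.
    static boolean iterationFails() {
        CmeDemo tablet = new CmeDemo();
        try {
            for (Long id : tablet.getReplicas()) {
                if (id == 1L) {
                    tablet.addReplica(4L); // simulates a concurrent writer
                }
            }
            return false;
        } catch (ConcurrentModificationException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println("iteration failed: " + iterationFails());
    }
}
```

The next call to the iterator after the write detects the changed modification count and throws, which is the same failure the daemon hits when a real writer thread interleaves.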

Solution:
Close the CME race for good with copy-on-write via a volatile snapshot. A first attempt returned a defensive copy (Lists.newArrayList(...)), but the copy itself iterates the source list and can still CME mid-copy — the window shrank but did not close. This PR instead:

  • Makes LocalTablet.replicas and MaterializedIndex.tablets volatile.
  • Writers (addReplica / deleteReplica / deleteReplicaByBackendId / addTablet / clearTabletsForRestore) are synchronized, build a new list, and atomically swap the volatile reference; they never mutate a list in place.
  • Readers (getReplicas() / getTablets()) do a single volatile read and return an immutable snapshot (Collections.unmodifiableList). Iteration is lock-free and can never CME, and the hot read path no longer copies elements.
synchronized on writers is required (not just volatile) because some write paths do not hold the OlapTable write lock — verified by tracing call sites: InternalCatalog.createPartitionWithIndices and RestoreJob.resetPartitionForRestore call addReplica/addTablet without the table write lock, so concurrent writers are real and a plain volatile field would allow lost updates. Writers are infrequent (DDL / repair / restore), so the lock cost is negligible; reads stay lock-free.
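
The pattern described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the code in this PR: SnapshotTablet is a hypothetical class, replicas are reduced to Long ids, and the unmodifiable wrapper is applied at write time rather than in the getter.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical sketch of copy-on-write via a volatile snapshot.
final class SnapshotTablet {
    // Readers see whichever immutable list was most recently published.
    private volatile List<Long> replicas = Collections.emptyList();

    // Writers serialize on the monitor so two concurrent adds cannot lose an update.
    synchronized void addReplica(long id) {
        List<Long> next = new ArrayList<>(replicas);
        next.add(id);
        replicas = Collections.unmodifiableList(next); // single volatile publish
    }

    synchronized boolean deleteReplica(long id) {
        List<Long> next = new ArrayList<>(replicas);
        boolean removed = next.remove(Long.valueOf(id));
        if (removed) {
            replicas = Collections.unmodifiableList(next);
        }
        return removed;
    }

    // Lock-free read: one volatile load, no element copy.
    List<Long> getReplicas() {
        return replicas;
    }

    public static void main(String[] args) {
        SnapshotTablet tablet = new SnapshotTablet();
        tablet.addReplica(1L);
        tablet.addReplica(2L);
        int seen = 0;
        for (Long id : tablet.getReplicas()) {
            tablet.addReplica(id + 10); // swaps the reference; the iterated list is untouched
            seen++;
        }
        System.out.println(seen + " iterated, " + tablet.getReplicas().size() + " now present");
    }
}
```

The loop iterates the two-element snapshot it started with, while the two writes made during iteration only become visible to the next getReplicas() call.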

TOCTOU race is handled by catching IllegalStateException around getReplica (kept from the original fix) and counting the skip via a new TabletStatMgr.staleTabletStatSkipped counter, which makes the race observable (>0 proves the window was actually hit) instead of relying solely on log scraping.
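
A rough sketch of the catch-and-count approach is shown below. It is illustrative only: StatUpdater, its map-backed index, and the betweenCheckAndUse hook are hypothetical and exist to make the race window reproducible; only the counter name staleTabletStatSkipped comes from this PR.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch: tolerate a tablet vanishing between check and use, and count it.
final class StatUpdater {
    // Stand-in for the inverted index: tabletId -> replica row count.
    private final Map<Long, Long> index = new ConcurrentHashMap<>();
    private final AtomicLong staleTabletStatSkipped = new AtomicLong();

    void put(long tabletId, long rowCount) { index.put(tabletId, rowCount); }

    void drop(long tabletId) { index.remove(tabletId); }

    // Mirrors a lookup guarded by a precondition: throws if the tablet is gone.
    private long getReplicaRowCount(long tabletId) {
        Long rows = index.get(tabletId);
        if (rows == null) {
            throw new IllegalStateException("tablet " + tabletId + " not found");
        }
        return rows;
    }

    // Returns the row count, or -1 if the tablet is absent or was removed mid-update.
    long updateTabletStat(long tabletId, Runnable betweenCheckAndUse) {
        if (!index.containsKey(tabletId)) {
            return -1;
        }
        betweenCheckAndUse.run(); // hook marking where a concurrent drop could land
        try {
            return getReplicaRowCount(tabletId);
        } catch (IllegalStateException e) {
            staleTabletStatSkipped.incrementAndGet(); // makes the race observable
            return -1;
        }
    }

    long getStaleTabletStatSkipped() { return staleTabletStatSkipped.get(); }

    public static void main(String[] args) {
        StatUpdater updater = new StatUpdater();
        updater.put(7L, 100L);
        updater.updateTabletStat(7L, () -> updater.drop(7L));
        System.out.println("skipped: " + updater.getStaleTabletStatSkipped());
    }
}
```

A counter value above zero shows the window was actually hit, without anyone having to search the logs for the swallowed exception.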

Cloud path: CloudTabletStatMgr.updateStatInfo iterates tablet.getReplicas() and is covered by the same snapshot fix; its updateTabletStat uses getReplicasByTabletId (locked, returns empty list, no checkState) and is already safe.

Release note

None

Check List (For Author)

  • Test

    • Regression test
    • Unit Test
    • Manual test (add detailed scripts or steps below)
    • No need to test or manual test. Explain why:
      • This is a refactor/code format and no logic has been changed.
      • Previous test can cover this change.
      • No code files have been changed.
      • Other reason
  • Behavior changed:

    • No.
    • Yes.
  • Does this need documentation?

    • No.
    • Yes.

Check List (For Reviewer who merge this PR)

  • Confirm the release note
  • Confirm test cases
  • Confirm document
  • Add branch pick label

@hello-stephen
Contributor

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

@yx-keith
Contributor Author

run buildall

@hello-stephen
Contributor

TPC-H: Total hot run time: 30762 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit be901132f6069d74136c6c887890d4831902fc1f, data reload: false

------ Round 1 ----------------------------------
orders	Doris	NULL	NULL	0	0	0	NULL	0	NULL	NULL	2023-12-26 18:27:23	2023-12-26 18:42:55	NULL	utf-8	NULL	NULL	
============================================
q1	17684	3852	3886	3852
q2	q3	10794	1362	839	839
q4	4680	465	348	348
q5	7585	2254	2109	2109
q6	323	171	142	142
q7	942	761	640	640
q8	9454	1679	1601	1601
q9	6740	4943	4899	4899
q10	6445	2108	1803	1803
q11	426	282	236	236
q12	685	422	290	290
q13	18228	3381	2755	2755
q14	257	257	232	232
q15	q16	825	781	705	705
q17	947	958	927	927
q18	6750	5633	5453	5453
q19	1204	1219	1038	1038
q20	524	398	259	259
q21	5630	2617	2327	2327
q22	432	358	307	307
Total cold run time: 100555 ms
Total hot run time: 30762 ms

----- Round 2, with runtime_filter_mode=off -----
orders	Doris	NULL	NULL	150000000	42	6422171781	NULL	22778155	NULL	NULL	2023-12-26 18:27:23	2023-12-26 18:42:55	NULL	utf-8	NULL	NULL	
============================================
q1	4267	4195	4164	4164
q2	q3	4432	4897	4353	4353
q4	2158	2189	1375	1375
q5	4416	4241	4284	4241
q6	224	173	127	127
q7	2090	1899	1611	1611
q8	2633	2153	2143	2143
q9	7781	7702	7721	7702
q10	4570	4477	4082	4082
q11	585	422	370	370
q12	894	750	515	515
q13	3286	3622	3039	3039
q14	301	301	277	277
q15	q16	720	727	651	651
q17	1381	1340	1377	1340
q18	7904	7359	7085	7085
q19	1091	1099	1071	1071
q20	2213	2199	1934	1934
q21	5343	4619	4507	4507
q22	531	459	400	400
Total cold run time: 56820 ms
Total hot run time: 50987 ms

@hello-stephen
Contributor

TPC-DS: Total hot run time: 167902 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit be901132f6069d74136c6c887890d4831902fc1f, data reload: false

query5	4350	651	509	509
query6	342	227	195	195
query7	4300	536	306	306
query8	319	226	218	218
query9	8800	3992	3995	3992
query10	448	332	305	305
query11	5794	2349	2212	2212
query12	180	131	128	128
query13	1311	616	404	404
query14	5900	5342	5043	5043
query14_1	4373	4345	4322	4322
query15	209	208	185	185
query16	989	458	472	458
query17	1161	742	612	612
query18	2512	482	368	368
query19	222	212	170	170
query20	146	135	134	134
query21	222	143	124	124
query22	13636	13525	13451	13451
query23	17127	16371	15996	15996
query23_1	16213	16078	16194	16078
query24	7408	1743	1284	1284
query24_1	1281	1293	1292	1292
query25	539	460	404	404
query26	1306	317	173	173
query27	2703	551	357	357
query28	4431	1964	1935	1935
query29	1001	623	496	496
query30	306	239	200	200
query31	1115	1049	933	933
query32	86	73	74	73
query33	530	354	292	292
query34	1157	1111	639	639
query35	774	769	679	679
query36	1340	1304	1204	1204
query37	164	106	88	88
query38	3212	3131	3038	3038
query39	921	916	903	903
query39_1	871	894	870	870
query40	234	148	125	125
query41	66	66	63	63
query42	108	109	116	109
query43	327	321	284	284
query44	
query45	208	204	189	189
query46	1096	1157	711	711
query47	2329	2332	2167	2167
query48	406	395	297	297
query49	629	489	365	365
query50	956	334	256	256
query51	4302	4293	4238	4238
query52	105	104	94	94
query53	252	286	207	207
query54	314	265	245	245
query55	89	86	86	86
query56	284	292	303	292
query57	1433	1470	1379	1379
query58	327	275	269	269
query59	1601	1723	1501	1501
query60	320	330	315	315
query61	162	154	151	151
query62	669	627	576	576
query63	241	204	211	204
query64	2415	813	621	621
query65	
query66	1742	488	366	366
query67	30252	30162	29001	29001
query68	
query69	455	334	299	299
query70	1015	949	995	949
query71	307	273	267	267
query72	2900	2684	2367	2367
query73	883	737	395	395
query74	5064	4945	4757	4757
query75	2681	2595	2249	2249
query76	2310	1134	773	773
query77	395	414	355	355
query78	12201	12121	11639	11639
query79	1381	1037	718	718
query80	656	581	476	476
query81	467	280	248	248
query82	427	156	124	124
query83	357	277	257	257
query84	268	144	110	110
query85	955	629	450	450
query86	407	345	313	313
query87	3378	3322	3238	3238
query88	3464	2664	2643	2643
query89	432	397	334	334
query90	1936	172	183	172
query91	176	173	143	143
query92	80	78	73	73
query93	1605	1475	881	881
query94	531	352	324	324
query95	681	387	341	341
query96	965	821	325	325
query97	2685	2687	2566	2566
query98	238	224	239	224
query99	1120	1072	945	945
Total cold run time: 251899 ms
Total hot run time: 167902 ms

@hello-stephen
Contributor

FE UT Coverage Report

Increment line coverage 14.29% (2/14) 🎉
Increment coverage report
Complete coverage report

@yx-keith
Contributor Author

run p0

@yx-keith
Contributor Author

run buildall

@hello-stephen
Contributor

TPC-H: Total hot run time: 31516 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit be901132f6069d74136c6c887890d4831902fc1f, data reload: false

------ Round 1 ----------------------------------
orders	Doris	NULL	NULL	0	0	0	NULL	0	NULL	NULL	2023-12-26 18:27:23	2023-12-26 18:42:55	NULL	utf-8	NULL	NULL	
============================================
q1	17807	3974	3976	3974
q2	q3	10830	1375	811	811
q4	4692	479	353	353
q5	7601	2296	2111	2111
q6	233	181	145	145
q7	985	784	633	633
q8	9404	1648	1605	1605
q9	5183	4941	4979	4941
q10	6403	2073	1784	1784
q11	443	276	243	243
q12	627	421	292	292
q13	18115	3388	2807	2807
q14	267	256	261	256
q15	q16	825	772	711	711
q17	1006	974	915	915
q18	7064	5624	5481	5481
q19	1297	1326	1063	1063
q20	515	543	308	308
q21	6340	2885	2748	2748
q22	476	385	335	335
Total cold run time: 100113 ms
Total hot run time: 31516 ms

----- Round 2, with runtime_filter_mode=off -----
orders	Doris	NULL	NULL	150000000	42	6422171781	NULL	22778155	NULL	NULL	2023-12-26 18:27:23	2023-12-26 18:42:55	NULL	utf-8	NULL	NULL	
============================================
q1	5041	4642	4657	4642
q2	q3	4851	5223	4613	4613
q4	2123	2202	1409	1409
q5	4863	4651	4629	4629
q6	233	179	134	134
q7	1955	1727	1492	1492
q8	2466	2163	2107	2107
q9	7740	7521	7219	7219
q10	4463	4401	3975	3975
q11	539	378	369	369
q12	708	716	527	527
q13	3005	3411	2799	2799
q14	267	268	258	258
q15	q16	690	702	626	626
q17	1274	1252	1245	1245
q18	7282	6692	6688	6688
q19	1110	1104	1095	1095
q20	2218	2220	1942	1942
q21	5335	4647	4518	4518
q22	522	456	414	414
Total cold run time: 56685 ms
Total hot run time: 50701 ms

@hello-stephen
Contributor

TPC-DS: Total hot run time: 169444 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit be901132f6069d74136c6c887890d4831902fc1f, data reload: false

query5	4325	666	521	521
query6	343	211	193	193
query7	4231	576	308	308
query8	336	243	246	243
query9	8849	4038	4045	4038
query10	448	332	299	299
query11	5848	2401	2226	2226
query12	179	131	127	127
query13	1308	619	438	438
query14	6052	5348	5071	5071
query14_1	4349	4337	4354	4337
query15	218	204	188	188
query16	1039	451	445	445
query17	1176	730	642	642
query18	2806	491	361	361
query19	224	211	188	188
query20	143	136	135	135
query21	226	144	120	120
query22	13705	13523	13400	13400
query23	17151	16338	16101	16101
query23_1	16172	16153	16262	16153
query24	7514	1792	1308	1308
query24_1	1314	1326	1294	1294
query25	577	512	443	443
query26	1322	317	176	176
query27	2730	571	349	349
query28	4379	1970	1965	1965
query29	1010	643	524	524
query30	304	230	202	202
query31	1124	1068	941	941
query32	90	81	78	78
query33	561	405	298	298
query34	1158	1143	637	637
query35	774	773	680	680
query36	1286	1379	1166	1166
query37	159	108	93	93
query38	3220	3151	3062	3062
query39	931	924	891	891
query39_1	887	895	883	883
query40	239	144	127	127
query41	71	67	63	63
query42	116	116	113	113
query43	324	329	282	282
query44	
query45	213	199	194	194
query46	1057	1202	721	721
query47	2320	2350	2161	2161
query48	376	429	298	298
query49	637	489	381	381
query50	956	356	252	252
query51	4274	4372	4176	4176
query52	105	106	94	94
query53	254	285	207	207
query54	316	267	258	258
query55	94	90	85	85
query56	298	309	307	307
query57	1398	1379	1268	1268
query58	301	282	271	271
query59	1561	1600	1444	1444
query60	322	324	319	319
query61	155	160	157	157
query62	669	612	558	558
query63	241	202	203	202
query64	2370	780	634	634
query65	
query66	1643	486	396	396
query67	30062	29933	29713	29713
query68	
query69	457	349	311	311
query70	1001	999	992	992
query71	319	276	283	276
query72	3039	2736	2433	2433
query73	879	803	422	422
query74	5100	4913	4718	4718
query75	2683	2623	2253	2253
query76	2279	1152	791	791
query77	411	409	341	341
query78	12134	12110	11647	11647
query79	1463	1085	741	741
query80	1250	554	471	471
query81	518	280	242	242
query82	1381	162	127	127
query83	349	281	251	251
query84	258	136	113	113
query85	967	532	459	459
query86	429	327	346	327
query87	3467	3362	3195	3195
query88	3605	2697	2640	2640
query89	458	392	336	336
query90	1823	184	178	178
query91	183	175	145	145
query92	79	77	70	70
query93	1469	1477	823	823
query94	636	369	309	309
query95	682	478	350	350
query96	990	742	319	319
query97	2685	2729	2566	2566
query98	240	239	237	237
query99	1117	1120	1005	1005
Total cold run time: 254349 ms
Total hot run time: 169444 ms

@morningman
Contributor

/review

@yx-keith yx-keith force-pushed the fix-tablet-stat-concurrency branch from be90113 to 4f1a804 May 21, 2026 06:36
@yx-keith
Contributor Author

run buildall

…tablet lists

Replace defensive-copy-on-read with copy-on-write via volatile snapshot for
LocalTablet.replicas and MaterializedIndex.tablets. Writers (synchronized)
build a new list and swap the volatile reference; readers take a single
volatile read and iterate an immutable snapshot, so getReplicas()/getTablets()
can no longer throw ConcurrentModificationException even while a concurrent DDL
thread mutates the tablet, and the hot read path no longer copies elements.

Also add a staleTabletStatSkipped counter to TabletStatMgr to make the TOCTOU
skip observable, harden the regression test with a positive SHOW DATA assertion
and orphan-table cleanup, and note the cloud stat-mgr path is covered.
@yx-keith yx-keith force-pushed the fix-tablet-stat-concurrency branch from 4f1a804 to 9714dbe May 21, 2026 07:06
@yx-keith
Contributor Author

run buildall

@hello-stephen
Contributor

TPC-H: Total hot run time: 30725 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit a149f4a68cd65db5ee46f7f9defef6ee40cc1857, data reload: false

------ Round 1 ----------------------------------
orders	Doris	NULL	NULL	0	0	0	NULL	0	NULL	NULL	2023-12-26 18:27:23	2023-12-26 18:42:55	NULL	utf-8	NULL	NULL	
============================================
q1	17645	4014	3858	3858
q2	q3	10813	1352	809	809
q4	4682	471	343	343
q5	7576	2320	2125	2125
q6	319	180	139	139
q7	1007	760	624	624
q8	9367	1758	1502	1502
q9	6824	4929	4860	4860
q10	6452	2146	1806	1806
q11	438	276	251	251
q12	688	418	292	292
q13	18220	3315	2746	2746
q14	263	257	234	234
q15	q16	815	764	700	700
q17	996	924	904	904
q18	7066	5755	5535	5535
q19	1245	1213	1055	1055
q20	522	394	261	261
q21	5780	2580	2381	2381
q22	441	369	300	300
Total cold run time: 101159 ms
Total hot run time: 30725 ms

----- Round 2, with runtime_filter_mode=off -----
orders	Doris	NULL	NULL	150000000	42	6422171781	NULL	22778155	NULL	NULL	2023-12-26 18:27:23	2023-12-26 18:42:55	NULL	utf-8	NULL	NULL	
============================================
q1	4173	4156	4145	4145
q2	q3	4518	4889	4297	4297
q4	2102	2216	1379	1379
q5	4362	4267	4332	4267
q6	239	178	130	130
q7	2252	1904	1638	1638
q8	2441	2109	2022	2022
q9	7818	7784	7737	7737
q10	4538	4431	4060	4060
q11	613	422	372	372
q12	878	777	558	558
q13	3199	3623	2969	2969
q14	296	319	292	292
q15	q16	687	732	644	644
q17	1282	1304	1297	1297
q18	8011	7240	6988	6988
q19	1117	1126	1061	1061
q20	2208	2215	1934	1934
q21	5236	4581	4382	4382
q22	509	444	398	398
Total cold run time: 56479 ms
Total hot run time: 50570 ms

@hello-stephen
Contributor

TPC-DS: Total hot run time: 169285 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit a149f4a68cd65db5ee46f7f9defef6ee40cc1857, data reload: false

query5	4316	673	517	517
query6	339	219	202	202
query7	4334	561	305	305
query8	326	233	225	225
query9	8847	4046	4011	4011
query10	468	344	305	305
query11	5769	2401	2188	2188
query12	189	134	130	130
query13	1292	590	419	419
query14	6049	5521	5207	5207
query14_1	4509	4494	4460	4460
query15	210	210	189	189
query16	1022	487	427	427
query17	1146	753	623	623
query18	2480	497	362	362
query19	221	213	170	170
query20	136	134	132	132
query21	217	139	123	123
query22	13657	13661	13414	13414
query23	17322	16364	15992	15992
query23_1	16109	16134	16197	16134
query24	7549	1832	1295	1295
query24_1	1297	1313	1308	1308
query25	565	483	447	447
query26	1311	340	176	176
query27	2660	541	347	347
query28	4475	1926	1942	1926
query29	1020	666	526	526
query30	320	249	203	203
query31	1121	1082	955	955
query32	95	78	75	75
query33	554	375	311	311
query34	1180	1151	643	643
query35	797	821	684	684
query36	1292	1352	1231	1231
query37	154	107	89	89
query38	3226	3175	3049	3049
query39	927	932	901	901
query39_1	874	863	857	857
query40	232	147	130	130
query41	67	63	61	61
query42	108	107	114	107
query43	322	338	294	294
query44	
query45	206	198	193	193
query46	1109	1203	720	720
query47	2302	2301	2267	2267
query48	385	411	291	291
query49	629	485	372	372
query50	953	341	259	259
query51	4339	4295	4285	4285
query52	106	106	99	99
query53	261	295	206	206
query54	347	270	255	255
query55	94	89	84	84
query56	297	316	308	308
query57	1403	1400	1318	1318
query58	314	281	288	281
query59	1595	1698	1483	1483
query60	376	332	313	313
query61	160	163	160	160
query62	683	641	574	574
query63	259	206	211	206
query64	2476	815	690	690
query65	
query66	1750	480	354	354
query67	29513	29369	29780	29369
query68	
query69	483	338	306	306
query70	1071	1013	1006	1006
query71	313	282	264	264
query72	3022	2658	2438	2438
query73	873	837	437	437
query74	5054	4937	4719	4719
query75	2663	2603	2234	2234
query76	2271	1190	793	793
query77	399	420	343	343
query78	12222	12191	11592	11592
query79	1526	1070	730	730
query80	707	547	450	450
query81	473	277	252	252
query82	1395	159	124	124
query83	318	278	258	258
query84	259	143	108	108
query85	881	542	451	451
query86	400	336	331	331
query87	3493	3324	3222	3222
query88	3589	2659	2638	2638
query89	447	378	333	333
query90	1927	189	187	187
query91	180	169	140	140
query92	80	81	75	75
query93	1468	1423	932	932
query94	553	349	296	296
query95	685	409	368	368
query96	1011	774	338	338
query97	2705	2718	2551	2551
query98	233	231	226	226
query99	1121	1107	1006	1006
Total cold run time: 253454 ms
Total hot run time: 169285 ms

@hello-stephen
Contributor

FE Regression Coverage Report

Increment line coverage 33.90% (20/59) 🎉
Increment coverage report
Complete coverage report

@yx-keith
Contributor Author

run p0

@yx-keith
Contributor Author

run p0 regression

@yx-keith
Contributor Author

run p0

@hello-stephen
Contributor

FE Regression Coverage Report

Increment line coverage 33.90% (20/59) 🎉
Increment coverage report
Complete coverage report

@yx-keith
Contributor Author

run p0

@hello-stephen
Contributor

FE Regression Coverage Report

Increment line coverage 30.77% (20/65) 🎉
Increment coverage report
Complete coverage report

@yx-keith
Contributor Author

run feut

@hello-stephen
Contributor

FE Regression Coverage Report

Increment line coverage 22.09% (19/86) 🎉
Increment coverage report
Complete coverage report

@hello-stephen
Contributor

FE UT Coverage Report

Increment line coverage 67.27% (37/55) 🎉
Increment coverage report
Complete coverage report

@morningman
Contributor

/review

@github-actions github-actions Bot removed the approved label May 22, 2026
Contributor

@github-actions github-actions Bot left a comment

Review result: request changes.

Critical checkpoint conclusions:

  • Goal/test: The PR aims to make tablet/stat iteration safe while replicas/tablets are concurrently changed. The new unit tests cover immutable snapshot iteration for the local classes, but the existing review thread already notes the prior end-to-end TabletStatMgr regression did not reliably exercise the daemon path; I did not duplicate that inline concern.
  • Scope/focus: The change is mostly focused, but the MaterializedIndex write path introduces a broad performance regression for bulk tablet creation.
  • Concurrency: Copy-on-write snapshots avoid ConcurrentModificationException for list iteration, and writer methods are synchronized. No additional deadlock from the new synchronization was found, but the expensive copy is now done while holding the index monitor.
  • Lifecycle/static init: No special lifecycle or static-initialization issue found.
  • Configuration/compatibility/protocol: No new config, storage format, or FE-BE protocol compatibility issue found.
  • Parallel paths: LocalTablet and MaterializedIndex were both changed; CloudTablet is unaffected because it returns a singleton replica list.
  • Tests: Unit coverage was added for snapshot behavior. No tests were run locally in this review.
  • Observability/transactions/persistence/data writes: No new logging/metrics need identified; persistence shape remains the same. The main data-write concern is FE DDL/restore performance for bulk tablet creation.

User focus: No additional user-provided review focus was specified.

// (bucket count) is small and creation cost is dominated by replica/RPC work — the
// copy is negligible, and CME-safe reads on the hot query path are worth it.
public synchronized void addTablet(Tablet tablet, TabletMeta tabletMeta, boolean isRestore) {
    idToTablets.put(tablet.getId(), tablet);
Contributor

This makes bulk tablet creation quadratic. InternalCatalog.createTablets() loops over distributionInfo.getBucketNum() and calls index.addTablet(...) once per bucket, so the newly added new ArrayList<>(tablets) call copies 0 + 1 + ... + (bucketNum - 1) entries in total while holding the synchronized index monitor. The default bucket limit is already 768 and max_bucket_num_per_partition can be raised or disabled, so CREATE TABLE / ADD PARTITION / restore with many buckets or rollups now spend avoidable FE time and allocations rebuilding the same prefix repeatedly. Please avoid per-tablet copy-on-write for bulk creation, for example by adding a bulk publish path that builds the new list once, or by otherwise preserving append as O(1) while still returning reader snapshots.
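
To make the cost concrete, the following sketch counts how many element references get re-copied under each strategy. It is an illustration under stated assumptions: CountingCowList is a hypothetical class, not MaterializedIndex, and the counter exists only for measurement.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical copy-on-write list that records how many existing elements each write re-copies.
final class CountingCowList {
    private volatile List<Integer> items = Collections.emptyList();
    private long elementsCopied = 0; // guarded by the monitor

    // Per-element publish: every call copies the whole current list.
    synchronized void add(int value) {
        elementsCopied += items.size();
        List<Integer> next = new ArrayList<>(items);
        next.add(value);
        items = Collections.unmodifiableList(next);
    }

    // Bulk publish: the current list is copied once for the whole batch.
    synchronized void addAll(List<Integer> values) {
        elementsCopied += items.size();
        List<Integer> next = new ArrayList<>(items.size() + values.size());
        next.addAll(items);
        next.addAll(values);
        items = Collections.unmodifiableList(next);
    }

    synchronized long elementsCopied() { return elementsCopied; }

    int size() { return items.size(); }

    public static void main(String[] args) {
        int buckets = 768;
        CountingCowList perTablet = new CountingCowList();
        List<Integer> batch = new ArrayList<>();
        for (int i = 0; i < buckets; i++) {
            perTablet.add(i);
            batch.add(i);
        }
        CountingCowList bulk = new CountingCowList();
        bulk.addAll(batch);
        System.out.println("per-tablet re-copies: " + perTablet.elementsCopied());
        System.out.println("bulk re-copies: " + bulk.elementsCopied());
    }
}
```

With 768 buckets the per-tablet path re-copies 768 * 767 / 2 = 294528 references in total, while a single bulk publish into an empty index re-copies none.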

MaterializedIndex.addTablet performed copy-on-write per tablet, making
bulk paths (createTablets, schema change, MV rollup, restore) quadratic
inside the synchronized block. Add appendTablets(Collection<Tablet>) that
does one O(n) copy + one volatile publish per batch, and convert
InternalCatalog.createTablets, CloudInternalCatalog.createCloudTablets,
SchemaChangeHandler, MaterializedViewHandler, RestoreJob, CloudRestoreJob,
and OlapTable's restore loop to use it. TabletInvertedIndex registration
stays per-iteration so Tablet.addReplica still finds the tablet; only the
per-index list copy moves out of the loop. Readers remain CME-safe.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@morningman
Copy link
Copy Markdown
Contributor

/review

Contributor

@github-actions github-actions Bot left a comment

I found two remaining concurrency issues in the copy-on-write changes. The list iteration paths are improved, but MaterializedIndex.getTablet() still reads a mutable HashMap concurrently with synchronized writers, and LocalTablet still has direct multi-read uses of the volatile replica list that can observe different snapshots inside one method.

Critical checkpoint conclusions:

  • Goal/test: the PR aims to make concurrent tablet/replica reads safe and adds deterministic FE unit tests, but the tests only cover getTablets()/getReplicas() iteration and miss the remaining getTablet() and direct-field reader cases.
  • Scope: the approach is focused, but the snapshot model is incomplete where parallel lookup paths remain.
  • Concurrency: yes; writers may run without an OlapTable write lock, and not all shared state is protected by the new volatile snapshot pattern.
  • Lifecycle/static initialization: no special lifecycle or static initialization concerns found.
  • Configuration: no new configuration items in the final diff.
  • Compatibility/storage format: no incompatible persisted format change found.
  • Parallel paths: getTablet() remains a parallel reader path to getTablets(), and direct replicas readers remain parallel paths to getReplicas().
  • Conditional checks: no problematic new conditionals found.
  • Test coverage: unit tests cover list iteration snapshots but not map lookup races or direct volatile-field multi-read methods.
  • Test results: I did not run tests in this review runner.
  • Observability: not applicable beyond existing debug logging.
  • Transaction/persistence/data writes: no transaction or EditLog persistence issue found.
  • FE/BE variable passing: not applicable.
  • Performance: the previous quadratic bulk-publish issue appears addressed by appendTablets.

User focus: no additional user-provided review focus was present.

next.addAll(tablets);
for (Tablet tablet : newTablets) {
    idToTablets.put(tablet.getId(), tablet);
    next.add(tablet);
Contributor

This still mutates idToTablets in place while getTablet() reads the same plain HashMap without synchronization. The PR makes the ordered tablets list a volatile copy-on-write snapshot, but lookup readers such as scheduler/report/proc paths use getTablet() and do not get that protection; at the same time the new comments explicitly document writers that do not necessarily hold the OlapTable write lock. A concurrent appendTablets()/clearTabletsForRestore() can therefore race with getTablet() on HashMap, which is exactly the kind of metadata reader race this change is trying to remove. Please publish idToTablets with the same snapshot discipline (for example a volatile copied map updated together with tablets) or otherwise make all reads/writes consistently synchronized/concurrent.

  @SerializedName(value = "rs", alternate = {"replicas"})
- private List<Replica> replicas;
+ private volatile List<Replica> replicas;
  @SerializedName(value = "lastCheckTime")
Contributor

Making replicas volatile is not enough while methods still read the field directly more than once instead of capturing one snapshot. For example, getRemoteDataSize() iterates replicas looking for cooldownReplicaId, then rereads replicas for stream().max(...).get(); a concurrent deleteReplica*() can publish an empty/different list between those reads and make this throw or compute from a different snapshot. Other direct-field readers such as getReplicaByBackendId(), equals(), and readyToBeRepaired() also bypass the getReplicas() snapshot convention. Please update these readers to use a single local List<Replica> snapshot = replicas (or getReplicas()) per method so the copy-on-write guarantee actually covers all replica readers, not only external iteration.
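
The difference between the two reader shapes can be sketched as below. This is a hypothetical reduction and not LocalTablet itself: ReplicaSizes and its methods are illustrative names, and replicas are reduced to their remote sizes.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical sketch contrasting a double volatile read with a single captured snapshot.
final class ReplicaSizes {
    private volatile List<Long> remoteSizes = Collections.emptyList();

    synchronized void publish(List<Long> sizes) {
        remoteSizes = Collections.unmodifiableList(new ArrayList<>(sizes));
    }

    // Racy shape: the two field reads may observe two different lists.
    long maxSizeRacy() {
        if (remoteSizes.isEmpty()) {         // volatile read #1
            return 0L;
        }
        // volatile read #2: Collections.max throws NoSuchElementException
        // if a writer published an empty list between the two reads.
        return Collections.max(remoteSizes);
    }

    // Safe shape: every step works on the same captured list.
    long maxSize() {
        List<Long> snapshot = remoteSizes;   // the only volatile read
        if (snapshot.isEmpty()) {
            return 0L;
        }
        return Collections.max(snapshot);
    }

    public static void main(String[] args) {
        ReplicaSizes sizes = new ReplicaSizes();
        List<Long> input = new ArrayList<>();
        input.add(3L);
        input.add(9L);
        input.add(4L);
        sizes.publish(input);
        System.out.println("max remote size: " + sizes.maxSize());
    }
}
```

Both methods return the same value when no writer interleaves; only the snapshot version keeps that guarantee when a delete lands between the emptiness check and the maximum.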

…metadata

Address two follow-up review notes on the COW snapshot work:

1. MaterializedIndex.idToTablets was still a plain HashMap mutated under
   synchronized writers but read lock-free by getTablet(id) on scheduler,
   report, proc and other paths. Publish it as a volatile immutable
   snapshot in lockstep with the tablets list. Ordering preserves the
   tablets ⊆ idToTablets invariant: append publishes the map before the
   list (any id visible via iteration is queryable), clearTabletsForRestore
   drops the list before the map.

2. LocalTablet had readers re-reading the volatile `replicas` field more
   than once inside a single method (getRemoteDataSize, equals,
   readyToBeRepaired), defeating the COW snapshot guarantee. Capture one
   local snapshot at the top of each method and use it throughout.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
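
The publish ordering from item 1 of the commit message above can be sketched roughly as follows. This is an assumption-laden illustration, not the merged code: IndexSnapshot is a hypothetical class, tablets are reduced to Long ids, and the map values are placeholder strings.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: an id map and an ordered list published as paired volatile snapshots.
final class IndexSnapshot {
    private volatile Map<Long, String> idToTablets = Collections.emptyMap();
    private volatile List<Long> tablets = Collections.emptyList();

    synchronized void appendTablets(List<Long> newIds) {
        Map<Long, String> nextMap = new HashMap<>(idToTablets);
        List<Long> nextList = new ArrayList<>(tablets);
        for (Long id : newIds) {
            nextMap.put(id, "tablet-" + id);
            nextList.add(id);
        }
        // Map first, list second: an id visible through the list is already in the map.
        idToTablets = Collections.unmodifiableMap(nextMap);
        tablets = Collections.unmodifiableList(nextList);
    }

    synchronized void clear() {
        // List first, map second: the published list never holds an id the published map lacks.
        tablets = Collections.emptyList();
        idToTablets = Collections.emptyMap();
    }

    String getTablet(long id) { return idToTablets.get(id); } // lock-free lookup

    List<Long> getTablets() { return tablets; }               // lock-free iteration source

    public static void main(String[] args) {
        IndexSnapshot index = new IndexSnapshot();
        List<Long> ids = new ArrayList<>();
        ids.add(1L);
        ids.add(2L);
        index.appendTablets(ids);
        for (Long id : index.getTablets()) {
            System.out.println(id + " -> " + index.getTablet(id));
        }
    }
}
```

The ordering keeps the published list a subset of the published map at each instant. A reader that takes the list and performs a lookup after a later clear can still get null, so callers would need to tolerate a missing entry.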
@morningman
Contributor

/review

Contributor

@github-actions github-actions Bot left a comment

Review conclusion: I did not find additional distinct blocking issues in the current head beyond the already-known inline review context.

Critical checkpoint conclusions:

  • Goal/test: The PR targets FE tablet/replica metadata reader safety under concurrent mutation by using volatile copy-on-write snapshots and adds focused unit tests for snapshot immutability/concurrent iteration.
  • Scope: Changes are focused on MaterializedIndex/LocalTablet and required bulk-publish call sites.
  • Concurrency: The updated MaterializedIndex map/list and LocalTablet replica list readers now use snapshot-style access; no new deadlock-prone lock ordering was identified.
  • Lifecycle/static initialization: No special static/global lifecycle issue found.
  • Config/compatibility: No new config, storage format, or FE-BE protocol compatibility issue found.
  • Parallel paths: Shared-nothing/cloud create and restore paths were checked; no additional distinct issue found.
  • Tests: Added FE unit coverage is relevant. I attempted targeted tests with mvn -pl fe-core -Dtest=org.apache.doris.catalog.MaterializedIndexTest,org.apache.doris.catalog.TabletTest test -Dskip.doc=true, but the runner could not resolve local artifact org.apache.doris:fe-foundation:1.2-SNAPSHOT, so tests did not run here.
  • Observability/transactions/persistence: No new observability or persistence/edit-log concern found for these metadata-only changes.
  • User focus: No additional user-provided review focus was specified.

@morningman
Contributor

run buildall

@github-actions github-actions Bot added the `approved` label (indicates a PR has been approved by one committer) on May 22, 2026
@github-actions
Contributor

PR approved by at least one committer and no changes requested.

@hello-stephen
Contributor

TPC-H: Total hot run time: 30729 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit 676d21ebfe1314d23e8d66f584ae8946d15ecc8d, data reload: false

------ Round 1 ----------------------------------
orders	Doris	NULL	NULL	0	0	0	NULL	0	NULL	NULL	2023-12-26 18:27:23	2023-12-26 18:42:55	NULL	utf-8	NULL	NULL	
============================================
q1	17616	3870	3821	3821
q2	q3	10748	1333	832	832
q4	4682	472	353	353
q5	7602	2237	2131	2131
q6	284	175	136	136
q7	924	758	653	653
q8	9352	1765	1533	1533
q9	6825	4827	4859	4827
q10	6456	2120	1781	1781
q11	436	283	244	244
q12	688	426	288	288
q13	18217	3404	2773	2773
q14	263	254	238	238
q15	q16	815	769	702	702
q17	1193	985	891	891
q18	6742	5609	5575	5575
q19	1281	1270	919	919
q20	519	404	265	265
q21	5690	2629	2462	2462
q22	419	355	305	305
Total cold run time: 100752 ms
Total hot run time: 30729 ms

----- Round 2, with runtime_filter_mode=off -----
orders	Doris	NULL	NULL	150000000	42	6422171781	NULL	22778155	NULL	NULL	2023-12-26 18:27:23	2023-12-26 18:42:55	NULL	utf-8	NULL	NULL	
============================================
q1	4166	4111	4085	4085
q2	q3	4485	4885	4273	4273
q4	2097	2221	1415	1415
q5	4432	4245	4260	4245
q6	225	174	130	130
q7	2057	1964	1661	1661
q8	2366	2147	2052	2052
q9	7859	7581	7673	7581
q10	4528	4491	4006	4006
q11	558	416	394	394
q12	744	744	655	655
q13	3288	3582	3037	3037
q14	301	312	296	296
q15	q16	731	724	626	626
q17	1337	1317	1301	1301
q18	7989	7406	6886	6886
q19	1098	1081	1066	1066
q20	2211	2223	1931	1931
q21	5287	4608	4490	4490
q22	509	487	420	420
Total cold run time: 56268 ms
Total hot run time: 50550 ms

@hello-stephen
Contributor

TPC-DS: Total hot run time: 169253 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit 676d21ebfe1314d23e8d66f584ae8946d15ecc8d, data reload: false

query5	4334	660	507	507
query6	342	222	210	210
query7	4219	578	307	307
query8	327	230	229	229
query9	8856	3971	3964	3964
query10	465	350	310	310
query11	5801	2408	2252	2252
query12	181	131	124	124
query13	1321	597	425	425
query14	5928	5361	5056	5056
query14_1	4380	4347	4357	4347
query15	211	201	184	184
query16	997	471	476	471
query17	1151	743	630	630
query18	2543	492	347	347
query19	213	195	174	174
query20	136	129	126	126
query21	222	136	120	120
query22	13587	13656	13419	13419
query23	17265	16408	16015	16015
query23_1	16106	16108	16195	16108
query24	7490	1752	1284	1284
query24_1	1312	1278	1302	1278
query25	533	469	410	410
query26	1308	306	171	171
query27	2717	547	327	327
query28	4468	1958	1932	1932
query29	1015	603	485	485
query30	306	239	200	200
query31	1102	1063	932	932
query32	88	73	71	71
query33	539	340	288	288
query34	1160	1095	640	640
query35	767	795	669	669
query36	1322	1341	1190	1190
query37	168	100	91	91
query38	3161	3148	3087	3087
query39	925	921	898	898
query39_1	866	874	876	874
query40	228	142	123	123
query41	69	65	65	65
query42	109	108	118	108
query43	323	327	283	283
query44	
query45	217	199	207	199
query46	1045	1168	702	702
query47	2290	2377	2247	2247
query48	410	395	306	306
query49	616	483	380	380
query50	966	356	251	251
query51	4344	4325	4275	4275
query52	105	104	94	94
query53	279	283	213	213
query54	329	289	273	273
query55	98	103	90	90
query56	314	321	308	308
query57	1429	1465	1339	1339
query58	314	282	275	275
query59	1523	1594	1404	1404
query60	337	336	316	316
query61	187	180	178	178
query62	675	626	577	577
query63	245	200	210	200
query64	2471	867	697	697
query65	
query66	1779	494	366	366
query67	29989	29849	29839	29839
query68	
query69	463	343	333	333
query70	1002	1017	960	960
query71	304	272	264	264
query72	3058	2692	2369	2369
query73	822	717	444	444
query74	5064	4899	4719	4719
query75	2652	2583	2264	2264
query76	2272	1139	774	774
query77	401	398	333	333
query78	12103	12188	11547	11547
query79	1484	1088	762	762
query80	1334	531	452	452
query81	504	281	242	242
query82	1379	160	125	125
query83	346	269	246	246
query84	255	140	106	106
query85	970	538	445	445
query86	452	322	359	322
query87	3397	3377	3197	3197
query88	3549	2683	2644	2644
query89	445	379	329	329
query90	1812	181	178	178
query91	178	173	162	162
query92	80	72	71	71
query93	1518	1419	827	827
query94	687	368	324	324
query95	661	483	358	358
query96	979	759	320	320
query97	2676	2699	2546	2546
query98	233	226	228	226
query99	1130	1101	978	978
Total cold run time: 253660 ms
Total hot run time: 169253 ms

@hello-stephen
Contributor

FE Regression Coverage Report

Increment line coverage 52.63% (60/114) 🎉
Increment coverage report
Complete coverage report

@yx-keith yx-keith merged commit 962b88f into apache:master May 23, 2026
33 checks passed
github-actions Bot pushed a commit that referenced this pull request May 23, 2026
### What problem does this PR solve?

Issue Number: #59138

Problem Summary:

TabletStatMgr.runAfterCatalogReady() is a periodic master-FE daemon that
iterates every tablet/replica to pull statistics. When DDL runs
concurrently with this daemon, two races fire:

- Iteration race (CME). MaterializedIndex.tablets and
  LocalTablet.replicas were plain ArrayLists whose getters returned the
  internal list. A concurrent addTablet / addReplica / deleteReplica
  (clone, repair, schema change, restore, report handler) during
  iteration triggered the fail-fast iterator and threw
  ConcurrentModificationException.
- TOCTOU race. In updateTabletStat, a getTabletMeta(id) != null check is
  followed by getReplica(id, beId). If the tablet is removed in between,
  getReplica hits Preconditions.checkState(...) and throws
  IllegalStateException.

When the daemon throws, the current cycle leaves stale
tablet/partition/table sizes and skewed MetricRepo metrics until the
next cycle.
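
The fail-fast behavior behind the iteration race can be reproduced in a few lines. The sketch below is a single-threaded illustration with hypothetical names (`FailFastDemo`), not code from Doris: it modifies an `ArrayList` while a for-each loop is iterating it, which trips the same modification-count check that a concurrent writer trips in the daemon. Across threads the check is best-effort, so the real failure is intermittent rather than guaranteed.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.ConcurrentModificationException;
import java.util.List;

// Hypothetical stand-in for a holder whose getter exposes its internal ArrayList.
final class FailFastDemo {
    // Returns true if a structural modification during iteration throws CME.
    static boolean iterationFailsOnModification() {
        List<Long> replicas = new ArrayList<>(Arrays.asList(1L, 2L, 3L));
        try {
            for (Long id : replicas) {
                // Simulates a writer adding a replica while a reader iterates.
                replicas.add(id + 100);
            }
            return false;
        } catch (ConcurrentModificationException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println("CME thrown: " + iterationFailsOnModification());
    }
}
```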

Solution:
Close the CME race for good with copy-on-write via a volatile snapshot.
A first attempt returned a defensive copy (Lists.newArrayList(...)), but
the copy itself iterates the source list and can still CME mid-copy —
the window shrank but did not close. This PR instead:

- Makes LocalTablet.replicas and MaterializedIndex.tablets volatile.
- Writers (addReplica / deleteReplica / deleteReplicaByBackendId /
  addTablet / clearTabletsForRestore) are synchronized, build a new
  list, and atomically swap the volatile reference; they never mutate a
  list in place.
- Readers (getReplicas() / getTablets()) do a single volatile read and
  return an immutable snapshot (Collections.unmodifiableList).
  Iteration is lock-free and can never CME, and the hot read path no
  longer copies elements.

synchronized on writers is required (not just volatile) because some
write paths do not hold the OlapTable write lock. Tracing call sites
confirms this: InternalCatalog.createPartitionWithIndices and
RestoreJob.resetPartitionForRestore call addReplica/addTablet without
the table write lock, so concurrent writers are real and a plain
volatile field would allow lost updates. Writers are infrequent (DDL /
repair / restore), so the lock cost is negligible; reads stay lock-free.
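
The writer/reader split described above can be sketched as follows. This is an illustration under stated assumptions, not the code merged in this PR: `CowReplicaList` and its methods are hypothetical, and replicas are represented by plain `Long` ids. Writers copy, modify the copy, and publish it with one volatile write; readers return the current reference without copying.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical copy-on-write holder; names are illustrative only.
final class CowReplicaList {
    // Always references an unmodifiable list that is never mutated in place.
    private volatile List<Long> replicas = Collections.emptyList();

    // Writers are synchronized so two concurrent adds cannot lose an update.
    synchronized void add(long id) {
        List<Long> next = new ArrayList<>(replicas);
        next.add(id);
        replicas = Collections.unmodifiableList(next);
    }

    synchronized boolean remove(long id) {
        List<Long> next = new ArrayList<>(replicas);
        boolean removed = next.remove(Long.valueOf(id));
        if (removed) {
            replicas = Collections.unmodifiableList(next);
        }
        return removed;
    }

    // Readers do a single volatile read; no lock and no copy.
    List<Long> snapshot() {
        return replicas;
    }

    // Iterates a snapshot while writing; returns {elements seen, final size}.
    static int[] demo() {
        CowReplicaList list = new CowReplicaList();
        list.add(1L);
        list.add(2L);
        List<Long> snap = list.snapshot();
        int seen = 0;
        for (Long id : snap) {
            list.add(id + 100); // publishes a new list; snap is untouched
            seen++;
        }
        return new int[] {seen, list.snapshot().size()};
    }

    public static void main(String[] args) {
        int[] r = demo();
        System.out.println("seen=" + r[0] + " finalSize=" + r[1]);
    }
}
```

The loop in `demo` sees exactly the two elements that were in the list when the snapshot was taken, while the holder itself ends up with four. The trade-off is an O(n) copy on every write, which fits the stated assumption that writers are infrequent.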

TOCTOU race is handled by catching IllegalStateException around
getReplica (kept from the original fix) and counting the skip via a new
TabletStatMgr.staleTabletStatSkipped counter, which makes the race
observable (>0 proves the window was actually hit) instead of relying
solely on log scraping.
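
A rough sketch of the catch-and-count handling for the TOCTOU window follows. Everything here is hypothetical (`StaleStatSkipDemo`, the map-backed index, the `AtomicLong` counter type); the source only states that the IllegalStateException around getReplica is caught and that a counter named staleTabletStatSkipped records the skip.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical stand-in for the stat updater and the index it queries.
final class StaleStatSkipDemo {
    private final Map<Long, String> replicaByTablet = new ConcurrentHashMap<>();
    // Assumed counter type; the PR only names the counter, not its type.
    final AtomicLong staleSkipped = new AtomicLong();

    void put(long tabletId, String replica) {
        replicaByTablet.put(tabletId, replica);
    }

    void drop(long tabletId) {
        replicaByTablet.remove(tabletId);
    }

    // Mirrors a lookup that asserts the tablet still exists.
    String getReplica(long tabletId) {
        String replica = replicaByTablet.get(tabletId);
        if (replica == null) {
            throw new IllegalStateException("tablet " + tabletId + " not found");
        }
        return replica;
    }

    // Returns true if the stat was applied, false if the tablet vanished.
    boolean updateStat(long tabletId) {
        try {
            getReplica(tabletId);
            return true;
        } catch (IllegalStateException e) {
            // The tablet was removed after the existence check; skip and count.
            staleSkipped.incrementAndGet();
            return false;
        }
    }

    public static void main(String[] args) {
        StaleStatSkipDemo demo = new StaleStatSkipDemo();
        demo.put(1L, "replica-a");
        demo.updateStat(1L);
        demo.drop(1L);
        demo.updateStat(1L);
        System.out.println("skipped=" + demo.staleSkipped.get());
    }
}
```

Counting the skip rather than only logging it lets a test or a metric confirm that the window was actually hit.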

Cloud path: CloudTabletStatMgr.updateStatInfo iterates
tablet.getReplicas() and is covered by the same snapshot fix; its
updateTabletStat uses getReplicasByTabletId (locked, returns empty list,
no checkState) and is already safe.


### Release note

None

### Check List (For Author)

- Test <!-- At least one of them must be included. -->
    - [ ] Regression test
    - [ ] Unit Test
    - [ ] Manual test (add detailed scripts or steps below)
    - [ ] No need to test or manual test. Explain why:
- [ ] This is a refactor/code format and no logic has been changed.
        - [ ] Previous test can cover this change.
        - [ ] No code files have been changed.
        - [ ] Other reason <!-- Add your reason?  -->

- Behavior changed:
    - [ ] No.
    - [ ] Yes. <!-- Explain the behavior change -->

- Does this need documentation?
    - [ ] No.
- [ ] Yes. <!-- Add document PR link here. eg:
apache/doris-website#1214 -->

### Check List (For Reviewer who merge this PR)

- [ ] Confirm the release note
- [ ] Confirm test cases
- [ ] Confirm document
- [ ] Add branch pick label <!-- Add branch pick label that this PR
should merge into -->

---------

Co-authored-by: morningman <yunyou@selectdb.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Labels

approved Indicates a PR has been approved by one committer. dev/4.0.x dev/4.0.x-conflict dev/4.1.x reviewed

5 participants