@koarz
Contributor

@koarz koarz commented Dec 19, 2025

During the execution of init_file_cache_factory, the following call path is triggered:

init_file_cache_factory -> FileCacheFactory::create_file_cache -> cache->initialize() -> initialize_unlocked -> _storage->init(this) -> FSFileCacheStorage::init()

(At this point, a thread named _cache_background_load_thread is created, and the remaining operations run within this thread)
-> upgrade_cache_dir_if_necessary -> read_file_cache_version -> FileSystem::open_file -> open_file_impl -> LocalFileReader::LocalFileReader -> BeConfDataDirReader::get_data_dir_by_file_path

After FSFileCacheStorage::init returns (having spawned _cache_background_load_thread), ExecEnv::_init goes on to call doris::io::BeConfDataDirReader::init_be_conf_data_dir, which appends entries to be_config_data_dir_list.
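
As a rough illustration of the control flow above (not the Doris code itself), the sketch below uses a hypothetical `BackgroundLoader` class whose `init()` only spawns a thread and returns, so whatever the caller does next runs concurrently with the thread body. All names here are assumptions made for the example.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Hypothetical stand-in for a storage object whose init() only *starts*
// background work and then returns to its caller.
class BackgroundLoader {
public:
    ~BackgroundLoader() { join(); }

    // Spawns the background thread and returns immediately; the thread body
    // may still be running (or not yet started) when init() returns.
    void init() {
        assert(!_thread.joinable());  // init() is meant to be called once
        _thread = std::thread([this] {
            // In the scenario from this PR, this is where the background
            // thread ends up reading the shared data-dir list.
            _load_done.store(true, std::memory_order_release);
        });
    }

    // Blocks until the background work has finished.
    void join() {
        if (_thread.joinable()) _thread.join();
    }

    bool load_done() const { return _load_done.load(std::memory_order_acquire); }

private:
    std::atomic<bool> _load_done{false};
    std::thread _thread;
};
```

Anything the caller touches between `init()` and `join()` that the thread body also touches needs synchronization; in this PR that shared state is be_config_data_dir_list.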

Meanwhile, BeConfDataDirReader::get_data_dir_by_file_path, running in the background thread, iterates over the same be_config_data_dir_list without synchronization. This is a data race: if init_be_conf_data_dir inserts an element while the vector is being read, two problems arise:

  1. Modifying be_config_data_dir_list while another thread iterates over it with a range-based for loop is undefined behavior (UB).
  2. If an insertion causes be_config_data_dir_list to reallocate (grow), the concurrent reader is left holding dangling iterators and references into the freed buffer, which shows up as a heap-use-after-free error.
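
The reallocation case in item 2 can be shown without any threads. The sketch below is a minimal, single-threaded illustration with hypothetical helper names: a `std::vector` replaces its storage whenever `push_back` is called with `size() == capacity()`, and at that moment every reference, pointer, and iterator obtained earlier becomes invalid.

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <vector>

// True when the next push_back must allocate new storage, which invalidates
// every reference, pointer, and iterator into the vector.
bool next_push_back_reallocates(const std::vector<std::string>& dirs) {
    return dirs.size() == dirs.capacity();
}

// Appends `path` and reports whether the capacity changed, i.e. whether the
// vector moved its elements into a new buffer and freed the old one.
bool append_and_report_reallocation(std::vector<std::string>& dirs, std::string path) {
    const auto old_capacity = dirs.capacity();
    dirs.push_back(std::move(path));
    assert(dirs.capacity() >= old_capacity);  // push_back never shrinks capacity
    return dirs.capacity() != old_capacity;
}
```

A reader thread that is halfway through a range-based for loop when `append_and_report_reallocation` returns true is still walking the old, freed buffer, which is the heap-use-after-free described above.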

Since init_be_conf_data_dir depends on the cache_paths produced by init_file_cache_factory, it cannot simply be moved ahead of init_file_cache_factory; the order and synchronization of these steps must be arranged carefully to prevent these errors.
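
One common way to make such a shared list safe is a reader-writer lock around it. The sketch below (C++17) is only an illustration under that assumption and is not necessarily the change made in this PR; `DataDirRegistry`, `find_by_file_path`, and `fill_while_reading` are hypothetical names, and the prefix matching is deliberately simplified.

```cpp
#include <cassert>
#include <cstddef>
#include <mutex>
#include <shared_mutex>
#include <string>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical wrapper; the names are illustrative, not the Doris API.
class DataDirRegistry {
public:
    // Writer: exclusive lock, so no reader can observe a reallocation in progress.
    void add(std::string dir) {
        std::unique_lock<std::shared_mutex> lock(_mutex);
        _dirs.push_back(std::move(dir));
    }

    // Reader: shared lock while iterating. Returns a copy so the caller never
    // holds a reference into the vector after the lock is released.
    // Simplified matching: the first registered dir that is a prefix of file_path.
    std::string find_by_file_path(const std::string& file_path) const {
        std::shared_lock<std::shared_mutex> lock(_mutex);
        for (const std::string& dir : _dirs) {
            if (file_path.compare(0, dir.size(), dir) == 0) return dir;
        }
        return std::string();
    }

    std::size_t size() const {
        std::shared_lock<std::shared_mutex> lock(_mutex);
        return _dirs.size();
    }

private:
    mutable std::shared_mutex _mutex;
    std::vector<std::string> _dirs;
};

// Appends `count` directories on one thread while another thread performs
// lookups, then returns the final number of registered directories.
std::size_t fill_while_reading(DataDirRegistry& registry, int count) {
    assert(count >= 0);
    std::thread writer([&registry, count] {
        for (int i = 0; i < count; ++i) registry.add("/data" + std::to_string(i) + "/");
    });
    std::thread reader([&registry, count] {
        for (int i = 0; i < count; ++i) (void)registry.find_by_file_path("/data0/cache/file");
    });
    writer.join();
    reader.join();
    return registry.size();
}
```

An alternative with no locking on the read path would be to finish populating the list before any thread that reads it is started; whether that is possible here depends on the ordering constraint described above.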

@hello-stephen
Contributor

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

@koarz koarz changed the title from "[fix](core)be core when vec realloc" to "[fix](core)be core when vector realloc" Dec 19, 2025
@koarz
Contributor Author

koarz commented Dec 19, 2025

run buildall

@koarz
Contributor Author

koarz commented Dec 22, 2025

run buildall

@koarz
Contributor Author

koarz commented Dec 23, 2025

run buildall

@hello-stephen
Contributor

BE UT Coverage Report

Increment line coverage 0.00% (0/2) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 53.39% (18945/35487)
Line Coverage 39.25% (175654/447564)
Region Coverage 33.84% (135967/401749)
Branch Coverage 34.72% (58634/168855)

@hello-stephen
Contributor

BE Regression && UT Coverage Report

Increment line coverage 60.00% (3/5) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 51.93% (18015/34690)
Line Coverage 38.04% (169768/446291)
Region Coverage 32.23% (130842/405942)
Branch Coverage 33.47% (56703/169433)

@koarz
Contributor Author

koarz commented Dec 24, 2025

run buildall

@doris-robot

BE UT Coverage Report

Increment line coverage 0.00% (0/5) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 53.38% (18942/35488)
Line Coverage 39.22% (175549/447559)
Region Coverage 33.81% (135810/401745)
Branch Coverage 34.70% (58599/168855)

@hello-stephen
Contributor

BE Regression && UT Coverage Report

Increment line coverage 100.00% (8/8) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 73.93% (25646/34691)
Line Coverage 61.32% (273641/446286)
Region Coverage 56.14% (227910/405938)
Branch Coverage 58.12% (98479/169433)

@koarz
Contributor Author

koarz commented Dec 24, 2025

run performance

@doris-robot

TPC-H: Total hot run time: 35023 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit 863ca3dcc5d03aeb3ff5c2b7e0390964bd7da326, data reload: false

------ Round 1 ----------------------------------
query	cold (ms)	hot 1 (ms)	hot 2 (ms)	best hot (ms)
q1	17633	4233	4056	4056
q2	1994	345	236	236
q3	10222	1322	724	724
q4	10210	857	323	323
q5	7506	2122	1911	1911
q6	187	165	137	137
q7	973	853	710	710
q8	9342	1409	1103	1103
q9	6864	5332	5358	5332
q10	6858	2410	1982	1982
q11	516	324	311	311
q12	664	780	552	552
q13	17780	3729	3032	3032
q14	285	283	279	279
q15	599	504	517	504
q16	701	679	631	631
q17	693	828	571	571
q18	7458	7002	7066	7002
q19	1108	960	633	633
q20	403	364	248	248
q21	4239	3874	3795	3795
q22	1088	1005	951	951
Total cold run time: 107323 ms
Total hot run time: 35023 ms

----- Round 2, with runtime_filter_mode=off -----
query	cold (ms)	hot 1 (ms)	hot 2 (ms)	best hot (ms)
q1	4093	4047	4049	4047
q2	329	407	332	332
q3	2118	2689	2233	2233
q4	1337	1750	1261	1261
q5	4208	4436	4756	4436
q6	222	180	141	141
q7	2073	2017	1759	1759
q8	2659	2618	2515	2515
q9	7610	7471	7543	7471
q10	3091	3260	2876	2876
q11	632	536	535	535
q12	698	775	760	760
q13	3512	4072	3173	3173
q14	296	317	269	269
q15	532	498	527	498
q16	669	690	632	632
q17	1303	1495	1707	1495
q18	7749	7639	7789	7639
q19	847	865	965	865
q20	1983	2083	2009	2009
q21	4963	4416	4219	4219
q22	1108	1036	952	952
Total cold run time: 52032 ms
Total hot run time: 50117 ms

@doris-robot

TPC-DS: Total hot run time: 179761 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit 863ca3dcc5d03aeb3ff5c2b7e0390964bd7da326, data reload: false

query	cold (ms)	hot 1 (ms)	hot 2 (ms)	best hot (ms)
query5	4393	606	454	454
query6	317	229	209	209
query7	4233	475	282	282
query8	329	254	253	253
query9	8766	2559	2546	2546
query10	516	375	323	323
query11	15514	15365	15108	15108
query12	178	117	114	114
query13	1269	509	386	386
query14	5690	3089	2933	2933
query14_1	2723	2698	2747	2698
query15	207	195	181	181
query16	881	472	434	434
query17	1170	745	617	617
query18	2441	465	356	356
query19	232	234	207	207
query20	119	116	117	116
query21	223	144	125	125
query22	3900	3965	3873	3873
query23	16601	16154	15958	15958
query23_1	16128	16203	15913	15913
query24	7499	1672	1252	1252
query24_1	1275	1229	1243	1229
query25	600	484	419	419
query26	1251	258	159	159
query27	2795	468	307	307
query28	4471	2119	2117	2117
query29	819	551	460	460
query30	298	233	211	211
query31	812	706	620	620
query32	78	69	65	65
query33	530	331	278	278
query34	897	899	547	547
query35	764	795	707	707
query36	871	913	846	846
query37	129	104	82	82
query38	3001	3064	2995	2995
query39	781	750	736	736
query39_1	708	693	709	693
query40	233	139	123	123
query41	67	67	62	62
query42	110	113	114	113
query43	432	441	410	410
query44	1355	751	743	743
query45	195	197	185	185
query46	891	1018	611	611
query47	1671	1709	1662	1662
query48	306	330	255	255
query49	612	421	344	344
query50	670	298	235	235
query51	3860	3833	3928	3833
query52	107	110	101	101
query53	319	345	288	288
query54	278	270	263	263
query55	77	76	71	71
query56	289	305	304	304
query57	1175	1161	1088	1088
query58	269	254	264	254
query59	2354	2422	2388	2388
query60	331	317	290	290
query61	164	156	155	155
query62	746	728	677	677
query63	339	295	300	295
query64	5056	1303	1001	1001
query65	4012	3938	3975	3938
query66	1468	445	309	309
query67	15095	14810	14652	14652
query68	4715	1043	742	742
query69	514	341	307	307
query70	1110	990	991	990
query71	365	306	282	282
query72	6088	5172	5125	5125
query73	677	597	308	308
query74	8921	9007	8768	8768
query75	3192	3181	2834	2834
query76	3824	1166	762	762
query77	513	392	310	310
query78	9368	9526	8878	8878
query79	2047	848	621	621
query80	718	663	562	562
query81	523	266	233	233
query82	203	130	101	101
query83	271	255	242	242
query84	253	119	106	106
query85	900	515	451	451
query86	383	297	276	276
query87	3212	3130	3086	3086
query88	4532	2288	2280	2280
query89	475	425	395	395
query90	2211	164	159	159
query91	173	164	141	141
query92	82	69	67	67
query93	2217	927	554	554
query94	490	305	278	278
query95	584	336	305	305
query96	608	479	218	218
query97	2255	2298	2228	2228
query98	219	200	198	198
query99	1342	1375	1282	1282
Total cold run time: 258810 ms
Total hot run time: 179761 ms

@doris-robot

ClickBench: Total hot run time: 27.22 s
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/clickbench-tools
ClickBench test result on commit 863ca3dcc5d03aeb3ff5c2b7e0390964bd7da326, data reload: false

query	cold (s)	hot 1 (s)	hot 2 (s)
query1	0.06	0.04	0.05
query2	0.10	0.05	0.05
query3	0.26	0.09	0.09
query4	1.60	0.12	0.11
query5	0.27	0.24	0.26
query6	1.17	0.64	0.63
query7	0.03	0.02	0.03
query8	0.05	0.04	0.04
query9	0.56	0.49	0.49
query10	0.54	0.55	0.54
query11	0.15	0.11	0.11
query12	0.14	0.11	0.11
query13	0.63	0.60	0.60
query14	0.99	0.98	0.97
query15	0.81	0.80	0.79
query16	0.41	0.40	0.39
query17	1.07	1.07	1.06
query18	0.23	0.21	0.22
query19	1.85	1.72	1.76
query20	0.02	0.01	0.01
query21	15.46	0.29	0.14
query22	5.01	0.05	0.05
query23	16.02	0.28	0.10
query24	1.31	0.33	0.19
query25	0.10	0.08	0.06
query26	0.14	0.13	0.14
query27	0.06	0.06	0.05
query28	3.14	1.22	1.03
query29	12.64	3.95	3.19
query30	0.28	0.15	0.14
query31	2.82	0.62	0.40
query32	3.23	0.53	0.45
query33	3.00	2.97	3.08
query34	17.71	5.30	4.60
query35	4.64	4.66	4.63
query36	0.66	0.52	0.51
query37	0.12	0.07	0.07
query38	0.07	0.04	0.04
query39	0.05	0.03	0.03
query40	0.17	0.14	0.13
query41	0.09	0.04	0.03
query42	0.05	0.03	0.03
query43	0.04	0.03	0.04
Total cold run time: 97.75 s
Total hot run time: 27.22 s

@koarz koarz changed the title from "[fix](core)be core when vector realloc" to "[fix](core)be core when BeConfDataDirReader::get_data_dir_by_file_path" Dec 24, 2025
gavinchou previously approved these changes Dec 24, 2025
@github-actions github-actions bot added the approved label Dec 24, 2025
@github-actions
Contributor

PR approved by at least one committer and no changes requested.

@github-actions
Contributor

PR approved by anyone and no changes requested.

Contributor

@freemandealer freemandealer left a comment

LGTM, nice work!

@koarz
Contributor Author

koarz commented Dec 24, 2025

run buildall

@github-actions github-actions bot removed the approved label Dec 24, 2025
@doris-robot

TPC-H: Total hot run time: 34965 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit 8ed49e1c87ba4d13b03403d82b8fe3a847ffcaf0, data reload: false

------ Round 1 ----------------------------------
query	cold (ms)	hot 1 (ms)	hot 2 (ms)	best hot (ms)
q1	17633	4208	4057	4057
q2	2024	356	241	241
q3	10194	1325	721	721
q4	10219	910	313	313
q5	7514	2124	1949	1949
q6	190	169	138	138
q7	988	868	694	694
q8	9360	1447	1142	1142
q9	7066	5296	5315	5296
q10	6833	2401	1986	1986
q11	527	352	297	297
q12	646	714	575	575
q13	17754	3660	3053	3053
q14	291	289	275	275
q15	604	516	517	516
q16	678	679	625	625
q17	683	804	552	552
q18	7689	7202	7086	7086
q19	1094	948	630	630
q20	389	350	243	243
q21	4172	3957	3615	3615
q22	1047	990	961	961
Total cold run time: 107595 ms
Total hot run time: 34965 ms

----- Round 2, with runtime_filter_mode=off -----
query	cold (ms)	hot 1 (ms)	hot 2 (ms)	best hot (ms)
q1	4033	4007	4289	4007
q2	325	406	324	324
q3	2157	2711	2294	2294
q4	1326	1763	1283	1283
q5	4217	4564	4605	4564
q6	230	181	137	137
q7	2055	1957	1809	1809
q8	2693	2464	2524	2464
q9	7623	7475	7487	7475
q10	3059	3311	2791	2791
q11	645	572	526	526
q12	686	930	621	621
q13	3512	3913	3386	3386
q14	306	313	283	283
q15	553	526	515	515
q16	682	696	641	641
q17	1146	1497	1468	1468
q18	7960	7753	7578	7578
q19	882	827	887	827
q20	2088	2053	1842	1842
q21	4528	4187	4147	4147
q22	1105	1053	991	991
Total cold run time: 51811 ms
Total hot run time: 49973 ms

@doris-robot

TPC-DS: Total hot run time: 179596 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit 8ed49e1c87ba4d13b03403d82b8fe3a847ffcaf0, data reload: false

query	cold (ms)	hot 1 (ms)	hot 2 (ms)	best hot (ms)
query5	4395	584	459	459
query6	305	218	203	203
query7	4226	459	276	276
query8	309	249	225	225
query9	8762	2559	2536	2536
query10	505	381	322	322
query11	15442	15366	14839	14839
query12	178	119	117	117
query13	1265	520	417	417
query14	5918	2995	2753	2753
query14_1	2674	2669	2664	2664
query15	209	193	180	180
query16	867	456	460	456
query17	1124	733	607	607
query18	2442	446	350	350
query19	245	233	208	208
query20	122	118	115	115
query21	221	145	120	120
query22	4166	4183	4192	4183
query23	16646	16242	16028	16028
query23_1	16152	16151	16235	16151
query24	7371	1656	1241	1241
query24_1	1259	1233	1240	1233
query25	597	493	446	446
query26	1249	275	166	166
query27	2754	479	317	317
query28	4493	2124	2122	2122
query29	848	582	522	522
query30	312	235	211	211
query31	821	708	597	597
query32	74	72	66	66
query33	531	332	287	287
query34	919	897	530	530
query35	764	809	711	711
query36	899	921	811	811
query37	126	87	73	73
query38	3058	3023	2997	2997
query39	762	751	716	716
query39_1	701	701	702	701
query40	222	139	118	118
query41	65	68	62	62
query42	109	107	102	102
query43	432	425	396	396
query44	1335	738	735	735
query45	190	190	183	183
query46	892	978	615	615
query47	1687	1707	1630	1630
query48	313	322	242	242
query49	621	420	348	348
query50	659	292	220	220
query51	3833	3837	3768	3768
query52	100	106	98	98
query53	323	359	289	289
query54	284	255	248	248
query55	83	79	75	75
query56	297	298	282	282
query57	1180	1156	1101	1101
query58	272	262	254	254
query59	2398	2462	2412	2412
query60	309	302	293	293
query61	160	157	196	157
query62	716	695	702	695
query63	332	300	310	300
query64	5089	1327	996	996
query65	4041	3980	3969	3969
query66	1460	448	326	326
query67	15302	15024	14862	14862
query68	6233	1020	725	725
query69	500	346	321	321
query70	1045	988	945	945
query71	372	303	283	283
query72	6058	4980	5006	4980
query73	681	583	306	306
query74	8626	8909	8650	8650
query75	3186	3188	2820	2820
query76	3880	1149	765	765
query77	530	398	294	294
query78	9347	9630	8863	8863
query79	1189	890	614	614
query80	733	674	541	541
query81	503	269	236	236
query82	224	136	101	101
query83	267	257	241	241
query84	253	121	103	103
query85	893	519	473	473
query86	338	321	284	284
query87	3299	3258	3174	3174
query88	3238	2298	2295	2295
query89	474	430	406	406
query90	1990	166	160	160
query91	181	168	143	143
query92	71	65	62	62
query93	1078	931	563	563
query94	468	325	298	298
query95	573	374	313	313
query96	595	471	216	216
query97	2279	2330	2217	2217
query98	206	198	202	198
query99	1309	1328	1316	1316
Total cold run time: 257050 ms
Total hot run time: 179596 ms

@doris-robot

ClickBench: Total hot run time: 27.01 s
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/clickbench-tools
ClickBench test result on commit 8ed49e1c87ba4d13b03403d82b8fe3a847ffcaf0, data reload: false

query	cold (s)	hot 1 (s)	hot 2 (s)
query1	0.05	0.04	0.05
query2	0.09	0.05	0.05
query3	0.27	0.08	0.09
query4	1.60	0.12	0.11
query5	0.27	0.26	0.27
query6	1.16	0.64	0.62
query7	0.03	0.03	0.02
query8	0.05	0.04	0.04
query9	0.57	0.48	0.50
query10	0.54	0.53	0.54
query11	0.15	0.10	0.11
query12	0.16	0.12	0.12
query13	0.62	0.60	0.60
query14	0.99	0.98	1.00
query15	0.82	0.81	0.81
query16	0.42	0.42	0.38
query17	0.99	1.07	1.02
query18	0.24	0.22	0.21
query19	1.94	1.69	1.76
query20	0.01	0.02	0.01
query21	15.45	0.28	0.14
query22	4.79	0.05	0.04
query23	15.94	0.29	0.10
query24	0.91	0.85	0.31
query25	0.11	0.13	0.06
query26	0.14	0.13	0.14
query27	0.06	0.05	0.04
query28	4.88	1.21	1.03
query29	12.55	3.98	3.21
query30	0.29	0.14	0.11
query31	2.82	0.64	0.39
query32	3.24	0.54	0.46
query33	2.98	2.97	2.97
query34	16.94	5.20	4.48
query35	4.54	4.55	4.59
query36	0.67	0.52	0.49
query37	0.12	0.07	0.06
query38	0.07	0.04	0.04
query39	0.05	0.03	0.03
query40	0.18	0.14	0.13
query41	0.09	0.04	0.03
query42	0.05	0.03	0.03
query43	0.05	0.03	0.03
Total cold run time: 97.89 s
Total hot run time: 27.01 s

@hello-stephen
Contributor

BE UT Coverage Report

Increment line coverage 36.36% (4/11) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 53.38% (18942/35488)
Line Coverage 39.22% (175554/447565)
Region Coverage 33.81% (135821/401751)
Branch Coverage 34.70% (58590/168859)

@hello-stephen
Contributor

BE Regression && UT Coverage Report

Increment line coverage 81.82% (9/11) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 72.18% (25041/34691)
Line Coverage 58.90% (262857/446289)
Region Coverage 53.83% (218515/405944)
Branch Coverage 55.32% (93731/169437)

@koarz
Contributor Author

koarz commented Dec 26, 2025

run nonConcurrent

@hello-stephen
Contributor

BE Regression && UT Coverage Report

Increment line coverage 81.82% (9/11) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 72.19% (25043/34691)
Line Coverage 58.91% (262904/446289)
Region Coverage 53.82% (218470/405944)
Branch Coverage 55.34% (93761/169437)

@github-actions github-actions bot added the approved label Dec 30, 2025
@github-actions
Contributor

PR approved by at least one committer and no changes requested.

@gavinchou gavinchou merged commit 13ea84d into apache:master Dec 30, 2025
26 of 27 checks passed
github-actions bot pushed a commit that referenced this pull request Dec 30, 2025
#59204)

During the execution of `init_file_cache_factory`, the following call
path is triggered:

```txt
init_file_cache_factory -> FileCacheFactory::create_file_cache -> cache->initialize() -> initialize_unlocked -> _storage->init(this) -> FSFileCacheStorage::init()

```

(At this point, a thread named `_cache_background_load_thread` is
created, and the remaining operations run within this thread)
`-> upgrade_cache_dir_if_necessary -> read_file_cache_version ->
FileSystem::open_file -> open_file_impl ->
LocalFileReader::LocalFileReader ->
BeConfDataDirReader::get_data_dir_by_file_path`

After `FSFileCacheStorage::init` returns (having spawned
`_cache_background_load_thread`), `ExecEnv::_init` goes on to call
`doris::io::BeConfDataDirReader::init_be_conf_data_dir`, which appends
entries to `be_config_data_dir_list`.

Meanwhile, `BeConfDataDirReader::get_data_dir_by_file_path`, running in
the background thread, iterates over the same `be_config_data_dir_list`
without synchronization. This is a data race: if
`init_be_conf_data_dir` inserts an element while the vector is being
read, two problems arise:

1. Modifying `be_config_data_dir_list` while another thread iterates
over it with a range-based for loop is **undefined behavior (UB)**.
2. If an insertion causes `be_config_data_dir_list` to reallocate
(grow), the concurrent reader is left holding dangling iterators and
references into the freed buffer, which shows up as a
**heap-use-after-free** error.

Since `init_be_conf_data_dir` depends on the `cache_paths` produced by
`init_file_cache_factory`, it cannot simply be moved ahead of
`init_file_cache_factory`; the order and synchronization of these steps
must be arranged carefully to prevent these errors.
github-actions bot pushed a commit that referenced this pull request Dec 30, 2025
#59204)

yiguolei pushed a commit that referenced this pull request Dec 30, 2025
Labels

approved, dev/3.1.x, dev/4.0.3-merged, reviewed

6 participants