Conversation

@kaka11chen (Contributor) commented Nov 24, 2025

What problem does this PR solve?

Release note

Summary

  1. Fix a self-deadlock that can occur in `_dispatch_thread` when a destructor path attempts to re-acquire a mutex already held by the thread. The root cause is destructors, triggered while holding `_lock`, performing operations that try to re-acquire the same mutex. The fix ensures that destructors which may call `remove_task()` run outside the `_lock` scope instead of re-locking the same mutex inside destructor paths.
  2. Call `_task_executor->wait()` in `TaskExecutorSimplifiedScanScheduler::stop()` (see the sketch after this list).
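
A stand-in analogue of item 2 (hypothetical `MiniScheduler`; the real `TaskExecutorSimplifiedScanScheduler` and `TaskExecutor` APIs differ): stopping a scheduler should not return until its worker threads have exited, which is what the added `_task_executor->wait()` call provides.

```cpp
// Stand-in analogue, not Doris code: stop() both signals shutdown and waits
// for the worker to exit, mirroring the added _task_executor->wait() call.
#include <atomic>
#include <thread>

class MiniScheduler {
public:
    void start() {
        _worker = std::thread([this] {
            while (!_stop) { /* run tasks (busy-wait only for the sketch) */ }
        });
    }
    void stop() {
        _stop = true;
        if (_worker.joinable()) _worker.join();  // analogue of _task_executor->wait():
                                                 // do not return until the thread exits
    }
private:
    std::atomic<bool> _stop{false};
    std::thread _worker;
};

int main() {
    MiniScheduler s;
    s.start();
    s.stop();  // returns only after the worker has fully exited
}
```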

Details / Reproduction steps

  1. `std::shared_ptr<PrioritizedSplitRunner> split = _tokenless->_entries->take();`
  2. `l.lock();` (`_dispatch_thread` acquires `_lock`).
  3. After the `while` loop finishes, `split` goes out of scope and the `shared_ptr` is destroyed.
  4. The `PrioritizedSplitRunner` destructor runs → destroys `_split_runner` (`ScannerSplitRunner`).
  5. The `ScannerSplitRunner::_scan_func` destructor runs → destroys the captured `ctx` (`std::shared_ptr<ScannerContext>`).
  6. `ScannerContext::~ScannerContext()` calls `remove_task()`.
  7. `remove_task()` attempts to acquire `_lock`.
  8. Result: self-deadlock, because `_lock` is already held by `_dispatch_thread`. (A stand-alone repro of this pattern follows.)
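
The pattern can be reproduced outside Doris in a few lines (a stand-in sketch, not the actual code: `g_lock` plays the role of `_lock`, and `Task` plays the `ScannerContext` at the end of the destructor chain):

```cpp
// Minimal repro of the pattern: an object whose destructor locks the same
// mutex that the owning thread already holds.
#include <iostream>
#include <memory>
#include <mutex>

std::mutex g_lock;  // stands in for TimeSharingTaskExecutor::_lock

struct Task {                                   // stands in for ScannerContext
    ~Task() {
        std::lock_guard<std::mutex> g(g_lock);  // like remove_task() re-locking _lock
        std::cout << "task removed\n";
    }
};

int main() {
    std::shared_ptr<Task> split = std::make_shared<Task>();  // step 1: take()
    std::unique_lock<std::mutex> l(g_lock);                  // step 2: l.lock()
    split.reset();  // steps 3-8: ~Task() locks g_lock while we already hold it;
                    // double-locking a non-recursive std::mutex is undefined
                    // behavior and in practice hangs (self-deadlock)
}
```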

Solution

We must explicitly release `split` BEFORE acquiring `_lock` to avoid the self-deadlock. The destructor chain (`PrioritizedSplitRunner` -> `ScannerSplitRunner` -> `_scan_func` lambda -> captured `ScannerContext`) may call `remove_task()`, which tries to acquire `_lock`. Since `_lock` is not a recursive mutex, this would deadlock. (The corresponding reordering is sketched below.)
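
The fix reorders the two operations. Reusing `Task` and `g_lock` from the repro above (a sketch of the ordering only, not the actual patch):

```cpp
int main() {
    std::shared_ptr<Task> split = std::make_shared<Task>();
    split.reset();                           // release `split` BEFORE locking:
                                             // ~Task() locks and unlocks g_lock safely
    std::unique_lock<std::mutex> l(g_lock);  // analogue of l.lock(); cannot self-deadlock
}
```

Releasing the reference first means any lock taken inside the destructor chain is acquired and released while the dispatcher holds nothing, so no recursive mutex is needed.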

Check List (For Author)

  • Test

    • Regression test
    • Unit Test
    • Manual test (add detailed scripts or steps below)
    • No need to test or manual test. Explain why:
      • This is a refactor/code format and no logic has been changed.
      • Previous test can cover this change.
      • No code files have been changed.
      • Other reason
  • Behavior changed:

    • No.
    • Yes.
  • Does this need documentation?

    • No.
    • Yes.

Check List (For Reviewer who merge this PR)

  • Confirm the release note
  • Confirm test cases
  • Confirm document
  • Add branch pick label

@Thearas (Contributor) commented Nov 24, 2025

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

@kaka11chen (Contributor, Author)

run nonConcurrent

@kaka11chen force-pushed the time_sharing_task_executor_timeout_test branch from 90c5711 to 4fab545 on November 24, 2025 03:30
@kaka11chen (Contributor, Author)

run nonConcurrent

@kaka11chen (Contributor, Author)

run buildall

@doris-robot

TPC-H: Total hot run time: 34076 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit 4fab5453ab5dddd7f5f7cfc6eaeea975af002f91, data reload: false
(columns per query: cold run, two hot runs, and the faster hot run, all in ms)

------ Round 1 ----------------------------------
q1	17664	5011	4848	4848
q2	2038	340	215	215
q3	10289	1278	699	699
q4	10261	923	369	369
q5	7503	2383	2267	2267
q6	182	164	134	134
q7	885	730	626	626
q8	9365	1341	1152	1152
q9	7138	5333	5336	5333
q10	6805	2195	1826	1826
q11	487	302	269	269
q12	331	359	226	226
q13	17827	3626	3069	3069
q14	235	239	223	223
q15	558	505	506	505
q16	1023	1000	943	943
q17	575	861	362	362
q18	7533	7082	7015	7015
q19	1237	961	547	547
q20	360	345	247	247
q21	3823	3166	2245	2245
q22	1033	981	956	956
Total cold run time: 107152 ms
Total hot run time: 34076 ms

----- Round 2, with runtime_filter_mode=off -----
q1	4824	5015	4822	4822
q2	328	406	319	319
q3	2166	2713	2318	2318
q4	1341	1736	1285	1285
q5	4162	4327	4574	4327
q6	217	185	140	140
q7	2054	1956	1969	1956
q8	2661	2453	2476	2453
q9	7646	7539	7593	7539
q10	2998	3199	2874	2874
q11	620	524	525	524
q12	717	785	682	682
q13	3592	3898	3379	3379
q14	304	311	284	284
q15	564	499	506	499
q16	1092	1123	1069	1069
q17	1143	1563	1386	1386
q18	7832	7717	7616	7616
q19	763	824	851	824
q20	1975	2043	1976	1976
q21	5054	4354	4165	4165
q22	1080	1033	998	998
Total cold run time: 53133 ms
Total hot run time: 51435 ms

@doris-robot

ClickBench: Total hot run time: 27.91 s
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/clickbench-tools
ClickBench test result on commit 4fab5453ab5dddd7f5f7cfc6eaeea975af002f91, data reload: false
(columns per query: cold run and two hot runs, in seconds)

query1	0.06	0.05	0.04
query2	0.09	0.06	0.05
query3	0.26	0.08	0.09
query4	1.60	0.12	0.11
query5	0.27	0.25	0.24
query6	1.17	0.66	0.64
query7	0.04	0.03	0.02
query8	0.06	0.04	0.05
query9	0.57	0.51	0.52
query10	0.59	0.57	0.57
query11	0.17	0.12	0.11
query12	0.15	0.12	0.12
query13	0.62	0.60	0.62
query14	1.01	1.01	1.00
query15	0.85	0.82	0.85
query16	0.38	0.38	0.39
query17	1.04	1.10	1.03
query18	0.21	0.19	0.20
query19	1.95	1.80	1.82
query20	0.02	0.01	0.01
query21	15.44	0.19	0.13
query22	5.11	0.07	0.05
query23	15.67	0.26	0.10
query24	2.38	1.14	0.76
query25	0.07	0.06	0.07
query26	0.15	0.14	0.12
query27	0.06	0.06	0.06
query28	4.89	1.20	0.95
query29	12.68	3.92	3.28
query30	0.29	0.14	0.12
query31	2.82	0.57	0.40
query32	3.23	0.55	0.47
query33	3.13	3.06	3.15
query34	15.65	5.19	4.57
query35	4.59	4.60	4.64
query36	0.68	0.51	0.49
query37	0.09	0.06	0.07
query38	0.06	0.04	0.04
query39	0.04	0.03	0.02
query40	0.17	0.14	0.13
query41	0.08	0.03	0.03
query42	0.04	0.02	0.02
query43	0.05	0.03	0.03
Total cold run time: 98.48 s
Total hot run time: 27.91 s

@hello-stephen (Contributor)

BE UT Coverage Report

Increment line coverage 79.17% (76/96) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 52.62% (18281/34739)
Line Coverage 38.07% (166407/437155)
Region Coverage 33.07% (129692/392165)
Branch Coverage 33.81% (55486/164124)

@kaka11chen (Contributor, Author)

run nonConcurrent

7 similar comments from @kaka11chen: run nonConcurrent

@kaka11chen force-pushed the time_sharing_task_executor_timeout_test branch from 4fab545 to d3e7247 on December 8, 2025 11:17
@kaka11chen (Contributor, Author)

run buildall

@kaka11chen force-pushed the time_sharing_task_executor_timeout_test branch from d3e7247 to f3cc40b on December 8, 2025 11:51
@kaka11chen (Contributor, Author)

run buildall

@doris-robot

TPC-H: Total hot run time: 35954 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit f3cc40bfa76361b83e233aaaec04d5dd4ba69a35, data reload: false
(columns per query: cold run, two hot runs, and the faster hot run, all in ms)

------ Round 1 ----------------------------------
q1	17663	5007	4853	4853
q2	2057	321	210	210
q3	10248	1325	740	740
q4	10292	895	324	324
q5	7610	2451	2164	2164
q6	189	170	135	135
q7	965	777	635	635
q8	9397	1475	1094	1094
q9	7151	5355	5309	5309
q10	6821	2229	1764	1764
q11	586	319	286	286
q12	343	377	239	239
q13	17780	3661	3055	3055
q14	243	235	235	235
q15	576	522	509	509
q16	877	887	810	810
q17	702	848	458	458
q18	7525	7942	7923	7923
q19	1202	993	639	639
q20	391	372	236	236
q21	4697	4024	3277	3277
q22	1110	1062	1059	1059
Total cold run time: 108425 ms
Total hot run time: 35954 ms

----- Round 2, with runtime_filter_mode=off -----
q1	5262	5065	5127	5065
q2	329	398	338	338
q3	2398	2836	2573	2573
q4	1423	1935	1398	1398
q5	4692	4539	4664	4539
q6	209	174	128	128
q7	2008	1988	1837	1837
q8	2661	2471	2498	2471
q9	7539	7611	7589	7589
q10	3035	3340	2846	2846
q11	597	530	493	493
q12	691	734	546	546
q13	3254	3620	3036	3036
q14	266	281	253	253
q15	528	482	487	482
q16	852	875	854	854
q17	1156	1455	1396	1396
q18	7249	7227	7048	7048
q19	859	819	845	819
q20	1916	1989	1808	1808
q21	4746	4268	4222	4222
q22	1081	1028	970	970
Total cold run time: 52751 ms
Total hot run time: 50711 ms

@doris-robot

TPC-DS: Total hot run time: 186512 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit f3cc40bfa76361b83e233aaaec04d5dd4ba69a35, data reload: false
(columns per query: cold run, two hot runs, and the faster hot run, all in ms)

query5	5389	800	670	670
query6	409	299	272	272
query7	4734	513	319	319
query8	355	297	278	278
query9	8870	3296	3297	3296
query10	618	443	429	429
query11	15758	14886	14938	14886
query12	243	168	168	168
query13	1738	530	414	414
query14	7842	3353	3222	3222
query14_1	3071	3027	3100	3027
query15	309	212	195	195
query16	7837	512	484	484
query17	1607	788	636	636
query18	2096	468	387	387
query19	299	227	210	210
query20	174	166	162	162
query21	223	143	121	121
query22	4035	3957	3861	3861
query23	16850	16461	15925	15925
query23_1	16025	16083	16117	16083
query24	6391	1674	1278	1278
query24_1	1250	1252	1264	1252
query25	651	569	530	530
query26	1263	307	218	218
query27	2826	486	342	342
query28	4441	2379	2336	2336
query29	826	605	511	511
query30	327	245	223	223
query31	810	701	643	643
query32	146	128	129	128
query33	704	480	448	448
query34	892	896	598	598
query35	891	867	786	786
query36	903	942	863	863
query37	181	157	148	148
query38	3988	3930	3841	3841
query39	749	761	727	727
query39_1	708	718	704	704
query40	283	187	176	176
query41	68	61	63	61
query42	178	150	148	148
query43	459	464	413	413
query44	1375	881	872	872
query45	233	222	221	221
query46	942	990	662	662
query47	1747	1768	1635	1635
query48	428	368	273	273
query49	818	552	506	506
query50	710	329	269	269
query51	3890	3900	3843	3843
query52	157	147	142	142
query53	288	277	228	228
query54	462	432	425	425
query55	149	135	128	128
query56	499	467	461	461
query57	1215	1173	1123	1123
query58	469	439	423	423
query59	2369	2386	2339	2339
query60	518	485	470	470
query61	209	203	197	197
query62	793	693	644	644
query63	272	222	223	222
query64	4656	1371	1099	1099
query65	4100	4007	3965	3965
query66	1316	523	437	437
query67	15415	15131	14823	14823
query68	7366	943	680	680
query69	603	459	432	432
query70	1170	1062	1063	1062
query71	492	413	416	413
query72	6014	4913	4895	4895
query73	705	582	346	346
query74	8583	8745	8704	8704
query75	2973	2938	2514	2514
query76	3351	1217	874	874
query77	579	521	446	446
query78	9397	9844	8913	8913
query79	1518	841	631	631
query80	783	689	616	616
query81	520	273	247	247
query82	228	169	161	161
query83	296	276	263	263
query84	270	119	99	99
query85	913	545	503	503
query86	438	351	342	342
query87	4147	4137	3920	3920
query88	3018	2325	2310	2310
query89	429	360	326	326
query90	2273	258	249	249
query91	177	176	143	143
query92	139	136	122	122
query93	1858	1074	699	699
query94	874	353	333	333
query95	671	510	497	497
query96	564	566	250	250
query97	2585	2677	2590	2590
query98	293	243	236	236
query99	1344	1380	1255	1255
Total cold run time: 274815 ms
Total hot run time: 186512 ms

@doris-robot

ClickBench: Total hot run time: 27.2 s
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/clickbench-tools
ClickBench test result on commit f3cc40bfa76361b83e233aaaec04d5dd4ba69a35, data reload: false
(columns per query: cold run and two hot runs, in seconds)

query1	0.05	0.05	0.05
query2	0.10	0.05	0.05
query3	0.26	0.09	0.09
query4	1.61	0.11	0.11
query5	0.27	0.25	0.25
query6	1.18	0.64	0.62
query7	0.03	0.03	0.02
query8	0.06	0.04	0.04
query9	0.57	0.50	0.50
query10	0.57	0.55	0.56
query11	0.16	0.11	0.11
query12	0.15	0.12	0.11
query13	0.62	0.61	0.61
query14	0.99	0.98	0.99
query15	0.82	0.79	0.81
query16	0.39	0.38	0.37
query17	1.05	1.01	0.99
query18	0.23	0.21	0.22
query19	1.91	1.78	1.84
query20	0.02	0.01	0.02
query21	15.44	0.32	0.14
query22	4.54	0.06	0.04
query23	16.04	0.29	0.10
query24	1.48	0.45	0.18
query25	0.09	0.11	0.06
query26	0.13	0.14	0.13
query27	0.08	0.06	0.05
query28	4.21	1.25	1.02
query29	12.56	4.10	3.20
query30	0.27	0.13	0.12
query31	2.81	0.63	0.39
query32	3.24	0.55	0.46
query33	3.06	3.03	3.05
query34	16.23	5.24	4.62
query35	4.64	4.60	4.58
query36	0.65	0.50	0.49
query37	0.11	0.07	0.06
query38	0.07	0.05	0.04
query39	0.05	0.03	0.04
query40	0.17	0.16	0.13
query41	0.08	0.03	0.04
query42	0.05	0.03	0.03
query43	0.05	0.04	0.03
Total cold run time: 97.09 s
Total hot run time: 27.2 s

@kaka11chen (Contributor, Author)

run nonConcurrent

@kaka11chen force-pushed the time_sharing_task_executor_timeout_test branch from f3cc40b to 8651d73 on December 9, 2025 04:26
@kaka11chen (Contributor, Author)

run buildall

@kaka11chen force-pushed the time_sharing_task_executor_timeout_test branch from 8651d73 to 79a2525 on December 9, 2025 06:33
@kaka11chen (Contributor, Author)

run buildall

@hello-stephen (Contributor)

BE UT Coverage Report

Increment line coverage 80.79% (122/151) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 53.35% (18712/35071)
Line Coverage 39.07% (173097/443010)
Region Coverage 33.81% (134707/398414)
Branch Coverage 34.64% (57648/166420)

@kaka11chen (Contributor, Author)

run nonConcurrent

@kaka11chen force-pushed the time_sharing_task_executor_timeout_test branch from 70ed1bd to e1e8ec2 on January 12, 2026 11:23
@kaka11chen (Contributor, Author)

run buildall

@doris-robot

TPC-H: Total hot run time: 32094 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit e1e8ec2289e5a0bba31e4e16112b3fa3014e529a, data reload: false
(columns per query: cold run, two hot runs, and the faster hot run, all in ms)

------ Round 1 ----------------------------------
q1	17598	4251	4035	4035
q2	2024	367	237	237
q3	10167	1276	708	708
q4	10229	904	325	325
q5	7511	2051	1907	1907
q6	186	168	138	138
q7	939	793	666	666
q8	9258	1348	1110	1110
q9	4935	4624	4665	4624
q10	6791	1829	1406	1406
q11	511	320	270	270
q12	682	722	596	596
q13	17784	3813	3126	3126
q14	287	298	275	275
q15	584	511	510	510
q16	671	680	645	645
q17	684	794	540	540
q18	6683	6441	6983	6441
q19	1182	1051	618	618
q20	424	401	265	265
q21	3233	2695	2591	2591
q22	1137	1136	1061	1061
Total cold run time: 103500 ms
Total hot run time: 32094 ms

----- Round 2, with runtime_filter_mode=off -----
q1	4318	4274	4329	4274
q2	330	402	334	334
q3	2341	2771	2403	2403
q4	1487	1853	1428	1428
q5	4507	4353	4365	4353
q6	227	175	131	131
q7	2003	1986	1851	1851
q8	2526	2390	2388	2388
q9	7334	7135	7163	7135
q10	2441	2490	2120	2120
q11	546	463	437	437
q12	675	685	579	579
q13	3408	3833	3112	3112
q14	290	277	260	260
q15	529	492	487	487
q16	624	641	603	603
q17	1079	1242	1304	1242
q18	7391	7361	7187	7187
q19	835	802	816	802
q20	1876	1954	1792	1792
q21	4500	4260	4173	4173
q22	1049	1030	975	975
Total cold run time: 50316 ms
Total hot run time: 48066 ms

@doris-robot

TPC-DS: Total hot run time: 172925 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit e1e8ec2289e5a0bba31e4e16112b3fa3014e529a, data reload: false
(columns per query: cold run, two hot runs, and the faster hot run, all in ms)

query5	4372	618	465	465
query6	326	248	207	207
query7	4237	458	273	273
query8	339	250	236	236
query9	8703	2887	2858	2858
query10	518	382	333	333
query11	15266	15169	14929	14929
query12	181	117	115	115
query13	1279	476	393	393
query14	6169	3058	2804	2804
query14_1	2755	2689	2701	2689
query15	207	193	172	172
query16	994	492	478	478
query17	1111	672	574	574
query18	2475	443	348	348
query19	236	224	198	198
query20	131	116	115	115
query21	217	137	118	118
query22	4087	3817	3796	3796
query23	15936	15508	15246	15246
query23_1	15482	15500	15395	15395
query24	7148	1570	1186	1186
query24_1	1218	1235	1204	1204
query25	563	481	426	426
query26	1239	267	161	161
query27	2778	453	290	290
query28	4500	2170	2154	2154
query29	788	559	476	476
query30	307	246	207	207
query31	821	648	559	559
query32	89	74	76	74
query33	536	354	326	326
query34	908	869	527	527
query35	731	749	713	713
query36	867	844	842	842
query37	134	98	84	84
query38	2716	2688	2622	2622
query39	769	764	729	729
query39_1	712	700	717	700
query40	212	129	116	116
query41	65	64	62	62
query42	106	103	102	102
query43	484	443	417	417
query44	1320	726	726	726
query45	184	181	179	179
query46	860	945	586	586
query47	1443	1463	1363	1363
query48	302	331	241	241
query49	593	412	329	329
query50	627	277	207	207
query51	3793	3767	3736	3736
query52	111	111	96	96
query53	298	330	278	278
query54	293	267	277	267
query55	89	84	79	79
query56	311	309	290	290
query57	1040	1041	892	892
query58	274	262	253	253
query59	2093	2078	1981	1981
query60	351	356	317	317
query61	159	157	156	156
query62	391	343	297	297
query63	300	269	271	269
query64	4892	1293	982	982
query65	3775	3696	3744	3696
query66	1467	414	322	322
query67	15649	15009	14690	14690
query68	7170	977	697	697
query69	517	361	322	322
query70	1035	936	949	936
query71	375	313	298	298
query72	6072	3383	3449	3383
query73	780	717	302	302
query74	8760	8816	8668	8668
query75	2812	2805	2436	2436
query76	3336	1081	658	658
query77	528	387	294	294
query78	9921	9728	9251	9251
query79	1644	891	585	585
query80	673	585	479	479
query81	507	266	236	236
query82	214	140	110	110
query83	262	252	245	245
query84	255	114	102	102
query85	897	503	460	460
query86	397	326	297	297
query87	2968	2856	2762	2762
query88	3141	2258	2252	2252
query89	386	353	327	327
query90	2159	161	151	151
query91	165	166	138	138
query92	86	78	75	75
query93	1425	896	536	536
query94	570	311	307	307
query95	589	337	373	337
query96	583	478	202	202
query97	2331	2371	2290	2290
query98	241	201	199	199
query99	632	568	522	522
Total cold run time: 253998 ms
Total hot run time: 172925 ms

@doris-robot

BE UT Coverage Report

Increment line coverage 83.33% (5/6) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 53.06% (18993/35796)
Line Coverage 39.12% (176077/450054)
Region Coverage 33.71% (136395/404636)
Branch Coverage 34.72% (58938/169743)

@hello-stephen (Contributor)

BE Regression && UT Coverage Report

Increment line coverage 100.00% (6/6) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 73.92% (25928/35074)
Line Coverage 61.36% (275790/449457)
Region Coverage 56.29% (230294/409137)
Branch Coverage 58.18% (99192/170504)

@kaka11chen marked this pull request as ready for review on January 13, 2026 02:13
@kaka11chen changed the title from "[only-for-debug-test] time sharing task executor timeout test." to "[fix](executor) Fix time sharing task executor hang." on Jan 13, 2026
@github-actions bot added the `approved` label (indicates a PR has been approved by one committer) on Jan 13, 2026
@github-actions (Contributor)

PR approved by at least one committer and no changes requested.

@github-actions (Contributor)

PR approved by anyone and no changes requested.

@kaka11chen changed the title from "[fix](executor) Fix time sharing task executor hang." to "[fix](executor) Fix rare self-deadlock that can cause the time-sharing task executor to hang." on Jan 13, 2026
@morningman merged commit 0a6270b into apache:master on Jan 13, 2026
30 of 34 checks passed
github-actions bot pushed a commit that referenced this pull request Jan 13, 2026
…g task executor to hang. (#58273)

zzzxl1993 pushed a commit to zzzxl1993/doris that referenced this pull request Jan 13, 2026
…g task executor to hang. (apache#58273)

kaka11chen added a commit to kaka11chen/doris that referenced this pull request Jan 21, 2026
…g task executor to hang. (apache#58273)

yiguolei pushed a commit that referenced this pull request Jan 21, 2026

Labels

approved (indicates a PR has been approved by one committer), dev/4.0.3-merged, dev/4.1.x, reviewed


7 participants