@wyxxxcat wyxxxcat commented Nov 27, 2025

What problem does this PR solve?

Summary

Add a RECYCLE state for rowset meta (rs meta) and update the recycler to mark metadata as RECYCLE before final deletion. This reduces the risk of accidental data loss.

Problem

The recycler sometimes deletes rs meta too early (for example during race conditions, restarts, or recovery), which can leave metadata and files inconsistent or cause data loss.

Solution

  • Introduce a RECYCLE intermediate state for rs meta.
  • When an item is chosen for cleanup, mark it RECYCLE and record a timestamp.
  • Only perform the final delete after a confirmation window or additional checks.
  • Make recovery/restart logic treat RECYCLE items as recoverable until final deletion.

Main changes

  • Add RECYCLE to the rs meta state enum.
  • Update metadata APIs to set/query RECYCLE.
  • Update recycler to use two-step deletion: mark -> confirm -> abort txn/job and delete.
  • Add logs and tests for the new flow.
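
For illustration, the two-step flow listed above could be modeled roughly as follows. This is a minimal Python sketch, not the code in this PR: `RecycleState.NORMAL`, `RowsetMeta`, `Recycler`, `recycle_pass`, and the `confirm_window_s` parameter are hypothetical names, and the "confirmation window or additional checks" step is reduced to a plain time window.

```python
from dataclasses import dataclass
from enum import Enum, auto
from typing import Dict, Optional


class RecycleState(Enum):
    NORMAL = auto()   # hypothetical name: not selected for cleanup
    RECYCLE = auto()  # intermediate state described in this PR


@dataclass
class RowsetMeta:
    rowset_id: str
    state: RecycleState = RecycleState.NORMAL
    marked_at: Optional[float] = None  # set when the meta is marked RECYCLE


class Recycler:
    """Toy two-step recycler (all names are illustrative assumptions)."""

    def __init__(self, confirm_window_s: float) -> None:
        self.confirm_window_s = confirm_window_s
        self.metas: Dict[str, RowsetMeta] = {}

    def add(self, rowset_id: str) -> None:
        self.metas[rowset_id] = RowsetMeta(rowset_id)

    def recycle_pass(self, rowset_id: str, now: float) -> str:
        meta = self.metas.get(rowset_id)
        if meta is None:
            return "absent"
        if meta.state is not RecycleState.RECYCLE:
            # Step 1: mark and record a timestamp; nothing is deleted yet.
            meta.state = RecycleState.RECYCLE
            meta.marked_at = now
            return "marked"
        if now - meta.marked_at < self.confirm_window_s:
            # Still inside the confirmation window: the meta stays recoverable.
            return "waiting"
        # Step 2: confirmed, so the final delete is allowed.
        del self.metas[rowset_id]
        return "deleted"
```

With a 10-second window, a pass at t=0 marks the rowset, a pass at t=5 leaves it in place, and a pass at t=10 performs the final delete.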

Test case

1. begin_txn -> prepare_rowset -> force_recycle -> commit_rowset -> commit_txn
2. start_job -> prepare_rowset -> force_recycle -> commit_rowset -> finish_job
Expected: the rowset is marked as recycled, which prevents commit_rowset and finishing the job/txn.

3. begin_txn -> prepare_rowset -> commit_rowset -> force_recycle -> commit_txn
4. start_job -> prepare_rowset -> commit_rowset -> force_recycle -> finish_job
Expected: the rowset is marked as recycled, which prevents finishing the job/txn.

5. begin_txn -> prepare_rowset -> force_recycle * 2 -> commit_rowset -> commit_txn
6. start_job -> prepare_rowset -> force_recycle * 2 -> commit_rowset -> finish_job
7. begin_txn -> prepare_rowset -> commit_rowset -> force_recycle * 2 -> commit_txn
8. start_job -> prepare_rowset -> commit_rowset -> force_recycle * 2 -> finish_job
9. delete_job -> commit_rowset -> force_recycle * 2 -> finish_job
10. delete_job -> prepare_rowset -> commit_rowset -> force_recycle * 2 -> finish_job
11. delete_job -> prepare_rowset -> force_recycle * 2 -> commit_rowset -> finish_job
Expected: running the recycle job twice marks the rowset as recycled and aborts the job/txn, then deletes the data and KV.
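
The expectations above can be replayed against a toy model. This is a minimal Python sketch under stated assumptions, not the PR's test code: the `replay` function and its outcome strings are hypothetical, "prevent" is modeled as the operation being rejected, and the real meta-service and recycler track far more state than one flag per rowset.

```python
from typing import List


def replay(ops: List[str]) -> List[str]:
    """Replay an operation sequence for one rowset and return one outcome per op."""
    marked = False   # rowset has been marked as recycled
    deleted = False  # job/txn aborted, data and KV removed
    outcomes: List[str] = []
    for op in ops:
        if op in ("begin_txn", "start_job", "delete_job", "prepare_rowset"):
            outcomes.append("ok")
        elif op == "force_recycle":
            if deleted:
                outcomes.append("noop")
            elif not marked:
                # First recycle pass only marks the rowset.
                marked = True
                outcomes.append("marked")
            else:
                # Second pass aborts the job/txn, then deletes data and KV.
                deleted = True
                outcomes.append("aborted_and_deleted")
        elif op in ("commit_rowset", "commit_txn", "finish_job"):
            # Assumption: a recycled rowset makes these operations fail.
            outcomes.append("rejected" if (marked or deleted) else "ok")
        else:
            raise ValueError(f"unknown op: {op}")
    return outcomes
```

For example, `replay(["begin_txn", "prepare_rowset", "commit_rowset", "force_recycle", "commit_txn"])` accepts commit_rowset but rejects commit_txn, matching case 3.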

Release note

None

Check List (For Author)

  • Test

    • Regression test
    • Unit Test
    • Manual test (add detailed scripts or steps below)
    • No need to test or manual test. Explain why:
      • This is a refactor/code format and no logic has been changed.
      • Previous test can cover this change.
      • No code files have been changed.
      • Other reason
  • Behavior changed:

    • No.
    • Yes.
  • Does this need documentation?

    • No.
    • Yes.

Check List (For Reviewer who merge this PR)

  • Confirm the release note
  • Confirm test cases
  • Confirm document
  • Add branch pick label

@hello-stephen
Contributor

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

optional RecycleStatePB recycle_state = 111;
}

enum RecycleStatePB {
Contributor

Add a scope to this state name; as written, it can easily conflict.

@wyxxxcat force-pushed the recycle_rowsets_loss branch 8 times, most recently from a97bb6b to 331af0c, December 8, 2025 08:10
@wyxxxcat marked this pull request as ready for review December 8, 2025 08:10
@wyxxxcat force-pushed the recycle_rowsets_loss branch 4 times, most recently from 1914e9e to ee7e572, December 9, 2025 07:13
@wyxxxcat marked this pull request as draft December 9, 2025 07:13
@wyxxxcat force-pushed the recycle_rowsets_loss branch from ee7e572 to 97b9243, December 16, 2025 02:57
@wyxxxcat marked this pull request as ready for review December 16, 2025 02:58
@wyxxxcat
Contributor Author

run buildall

@hello-stephen
Contributor

FE UT Coverage Report

Increment line coverage `` 🎉
Increment coverage report
Complete coverage report

@doris-robot

TPC-H: Total hot run time: 36404 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit 97b92434f2c898d0f48eea9333e8b4f81bdde4ce, data reload: false

------ Round 1 ----------------------------------
query	cold (ms)	hot 1 (ms)	hot 2 (ms)	best hot (ms)
q1	17625	4283	4065	4065
q2	2026	363	244	244
q3	10434	1360	756	756
q4	10364	833	306	306
q5	9454	2118	1954	1954
q6	216	168	133	133
q7	1022	862	716	716
q8	9379	1437	1237	1237
q9	7298	5376	5412	5376
q10	6869	2404	1947	1947
q11	517	321	288	288
q12	697	744	589	589
q13	17801	3655	2984	2984
q14	288	292	276	276
q15	590	523	525	523
q16	682	668	624	624
q17	674	825	522	522
q18	7475	8195	7731	7731
q19	1310	1012	606	606
q20	422	381	248	248
q21	4562	4404	4205	4205
q22	1128	1086	1074	1074
Total cold run time: 110833 ms
Total hot run time: 36404 ms

----- Round 2, with runtime_filter_mode=off -----
query	cold (ms)	hot 1 (ms)	hot 2 (ms)	best hot (ms)
q1	4291	4228	4247	4228
q2	430	416	312	312
q3	2502	2816	2397	2397
q4	1452	1887	1466	1466
q5	4669	4270	4620	4270
q6	204	174	126	126
q7	2065	1953	1815	1815
q8	2683	2579	2545	2545
q9	7659	7503	7375	7375
q10	2899	3116	2673	2673
q11	574	499	480	480
q12	624	706	562	562
q13	3272	3595	3023	3023
q14	265	300	261	261
q15	532	507	493	493
q16	614	638	593	593
q17	1131	1487	1368	1368
q18	7216	7042	7015	7015
q19	815	798	808	798
q20	1905	2054	1800	1800
q21	4686	4324	4192	4192
q22	1107	1091	977	977
Total cold run time: 51595 ms
Total hot run time: 48769 ms

@doris-robot

TPC-DS: Total hot run time: 177868 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit 97b92434f2c898d0f48eea9333e8b4f81bdde4ce, data reload: false

query	cold (ms)	hot 1 (ms)	hot 2 (ms)	best hot (ms)
query5	4834	626	477	477
query6	329	228	207	207
query7	4223	478	282	282
query8	331	266	256	256
query9	8805	2559	2594	2559
query10	520	398	337	337
query11	15111	15180	14535	14535
query12	183	118	115	115
query13	1268	497	389	389
query14	5894	3283	2986	2986
query14_1	2868	2894	2915	2894
query15	206	192	182	182
query16	859	478	464	464
query17	1129	726	609	609
query18	2543	454	354	354
query19	230	242	216	216
query20	120	115	112	112
query21	231	136	114	114
query22	3942	4084	3869	3869
query23	16761	16247	15821	15821
query23_1	15995	16128	15960	15960
query24	7381	1662	1244	1244
query24_1	1245	1254	1240	1240
query25	594	498	464	464
query26	1262	278	167	167
query27	2750	470	326	326
query28	4459	2156	2151	2151
query29	836	577	479	479
query30	315	246	221	221
query31	851	700	624	624
query32	88	69	72	69
query33	553	353	340	340
query34	890	911	535	535
query35	756	814	731	731
query36	880	889	824	824
query37	131	94	82	82
query38	2839	2812	2784	2784
query39	766	733	871	733
query39_1	692	695	699	695
query40	222	134	118	118
query41	66	62	63	62
query42	106	102	102	102
query43	415	435	422	422
query44	1308	756	747	747
query45	198	194	181	181
query46	867	967	617	617
query47	1611	1689	1607	1607
query48	313	323	255	255
query49	628	422	353	353
query50	664	285	227	227
query51	3781	4073	3749	3749
query52	106	113	98	98
query53	315	345	294	294
query54	295	261	243	243
query55	77	75	69	69
query56	281	291	305	291
query57	1132	1142	1078	1078
query58	302	253	266	253
query59	2455	2533	2385	2385
query60	328	304	290	290
query61	167	159	155	155
query62	685	672	611	611
query63	330	290	299	290
query64	4983	1319	997	997
query65	3982	3942	3962	3942
query66	1423	450	315	315
query67	15390	14899	14632	14632
query68	8382	1018	724	724
query69	513	342	301	301
query70	1074	981	941	941
query71	372	313	291	291
query72	6051	4882	4908	4882
query73	666	575	318	318
query74	8856	8706	8543	8543
query75	3149	3119	2748	2748
query76	3967	1154	745	745
query77	541	408	282	282
query78	9546	9707	8942	8942
query79	1669	885	641	641
query80	741	663	561	561
query81	514	273	233	233
query82	204	135	111	111
query83	284	258	237	237
query84	259	127	106	106
query85	905	509	464	464
query86	386	291	276	276
query87	3060	3072	3010	3010
query88	3216	2322	2322	2322
query89	472	413	386	386
query90	2132	158	157	157
query91	177	165	143	143
query92	80	69	63	63
query93	1240	902	568	568
query94	475	305	315	305
query95	599	380	308	308
query96	594	467	212	212
query97	2291	2313	2262	2262
query98	216	200	203	200
query99	1301	1290	1217	1217
Total cold run time: 260081 ms
Total hot run time: 177868 ms

@doris-robot

ClickBench: Total hot run time: 28.43 s
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/clickbench-tools
ClickBench test result on commit 97b92434f2c898d0f48eea9333e8b4f81bdde4ce, data reload: false

query	cold (s)	hot 1 (s)	hot 2 (s)
query1	0.05	0.05	0.05
query2	0.15	0.07	0.07
query3	0.38	0.08	0.08
query4	1.61	0.10	0.10
query5	0.26	0.25	0.25
query6	1.17	0.63	0.65
query7	0.04	0.03	0.02
query8	0.07	0.05	0.06
query9	0.60	0.51	0.52
query10	0.56	0.58	0.55
query11	0.26	0.14	0.13
query12	0.26	0.14	0.14
query13	0.63	0.63	0.61
query14	1.02	0.99	1.01
query15	0.90	0.80	0.81
query16	0.39	0.39	0.39
query17	0.98	1.06	1.00
query18	0.24	0.22	0.23
query19	1.85	1.88	1.86
query20	0.02	0.01	0.02
query21	15.39	0.28	0.23
query22	4.96	0.10	0.10
query23	15.43	0.38	0.22
query24	2.43	0.47	0.29
query25	0.10	0.09	0.09
query26	0.19	0.17	0.19
query27	0.10	0.09	0.09
query28	3.89	1.34	1.16
query29	12.59	4.06	3.30
query30	0.34	0.14	0.12
query31	2.81	0.66	0.42
query32	3.24	0.60	0.48
query33	2.95	3.03	3.01
query34	16.89	5.19	4.68
query35	4.65	4.62	4.65
query36	0.62	0.49	0.48
query37	0.24	0.09	0.09
query38	0.20	0.06	0.06
query39	0.07	0.05	0.05
query40	0.20	0.17	0.18
query41	0.13	0.07	0.05
query42	0.08	0.06	0.04
query43	0.06	0.05	0.05
Total cold run time: 99 s
Total hot run time: 28.43 s

@hello-stephen
Contributor

BE UT Coverage Report

Increment line coverage 89.66% (26/29) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 53.43% (18846/35273)
Line Coverage 39.21% (174429/444848)
Region Coverage 33.84% (135000/398983)
Branch Coverage 34.78% (58098/167033)

@github-actions
Contributor

github-actions bot commented Jan 2, 2026

PR approved by at least one committer and no changes requested.

@wyxxxcat force-pushed the recycle_rowsets_loss branch from 311dbe2 to ffcac6b, January 5, 2026 06:53
@github-actions bot removed the approved label Jan 5, 2026
@wyxxxcat
Contributor Author

wyxxxcat commented Jan 5, 2026

run cloudut

@wyxxxcat force-pushed the recycle_rowsets_loss branch 2 times, most recently from 5c98f34 to f6726b2, January 5, 2026 07:38
@wyxxxcat
Contributor Author

wyxxxcat commented Jan 5, 2026

run cloudut

@wyxxxcat force-pushed the recycle_rowsets_loss branch from f6726b2 to 36f9742, January 5, 2026 09:10
@wyxxxcat
Contributor Author

wyxxxcat commented Jan 5, 2026

run cloudut

@doris-robot

Cloud UT Coverage Report

Increment line coverage 63.21% (256/405) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 79.58% (1781/2238)
Line Coverage 64.87% (31675/48830)
Region Coverage 65.38% (15744/24082)
Branch Coverage 56.04% (8382/14956)

@wyxxxcat
Contributor Author

wyxxxcat commented Jan 7, 2026

run buildall

@hello-stephen
Contributor

Cloud UT Coverage Report

Increment line coverage 63.21% (256/405) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 79.57% (1784/2242)
Line Coverage 64.81% (31735/48965)
Region Coverage 65.50% (15791/24107)
Branch Coverage 56.04% (8382/14958)

@doris-robot

TPC-H: Total hot run time: 32118 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit 36f97420b3801690f96e2cde101334142cf86ea1, data reload: false

------ Round 1 ----------------------------------
query	cold (ms)	hot 1 (ms)	hot 2 (ms)	best hot (ms)
q1	17462	4274	4042	4042
q2	2002	363	236	236
q3	10098	1373	725	725
q4	10209	883	317	317
q5	7500	2133	1865	1865
q6	185	176	138	138
q7	987	814	694	694
q8	9305	1453	1171	1171
q9	5062	4561	4639	4561
q10	6812	1807	1400	1400
q11	530	307	265	265
q12	761	753	589	589
q13	17800	3834	3093	3093
q14	288	290	269	269
q15	581	526	503	503
q16	661	685	621	621
q17	685	796	561	561
q18	6757	6532	6980	6532
q19	1258	1022	635	635
q20	426	400	255	255
q21	3271	2685	2606	2606
q22	1122	1095	1040	1040
Total cold run time: 103762 ms
Total hot run time: 32118 ms

----- Round 2, with runtime_filter_mode=off -----
query	cold (ms)	hot 1 (ms)	hot 2 (ms)	best hot (ms)
q1	4450	4204	4262	4204
q2	331	399	324	324
q3	2260	2781	2368	2368
q4	1408	2060	1390	1390
q5	4356	4466	4355	4355
q6	210	167	125	125
q7	1950	1915	1751	1751
q8	2543	2376	2360	2360
q9	7379	7119	7287	7119
q10	2501	2737	2243	2243
q11	547	466	450	450
q12	685	747	639	639
q13	3523	4131	3330	3330
q14	288	292	272	272
q15	542	489	486	486
q16	618	641	592	592
q17	1082	1242	1299	1242
q18	7202	7289	7258	7258
q19	807	781	790	781
q20	1876	1948	1785	1785
q21	4573	4232	4111	4111
q22	1091	1047	960	960
Total cold run time: 50222 ms
Total hot run time: 48145 ms

@doris-robot

TPC-DS: Total hot run time: 172675 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit 36f97420b3801690f96e2cde101334142cf86ea1, data reload: false

query	cold (ms)	hot 1 (ms)	hot 2 (ms)	best hot (ms)
query5	4484	587	449	449
query6	330	228	205	205
query7	4241	467	271	271
query8	354	247	235	235
query9	8754	2669	2684	2669
query10	501	380	327	327
query11	15304	15080	14890	14890
query12	177	121	114	114
query13	1286	490	378	378
query14	6189	2973	2737	2737
query14_1	2645	2617	2691	2617
query15	205	196	175	175
query16	1010	491	454	454
query17	1115	686	570	570
query18	2500	442	349	349
query19	230	226	196	196
query20	122	116	116	116
query21	217	140	122	122
query22	3971	4124	4085	4085
query23	15982	15628	15486	15486
query23_1	15476	15463	15333	15333
query24	7308	1554	1166	1166
query24_1	1187	1190	1191	1190
query25	565	486	422	422
query26	1252	270	165	165
query27	2765	455	290	290
query28	4528	2159	2125	2125
query29	781	567	461	461
query30	316	243	216	216
query31	815	643	537	537
query32	75	69	68	68
query33	550	342	296	296
query34	923	880	518	518
query35	757	782	704	704
query36	907	916	859	859
query37	129	91	85	85
query38	2834	2707	2615	2615
query39	778	781	737	737
query39_1	729	717	708	708
query40	216	134	115	115
query41	68	63	60	60
query42	103	100	108	100
query43	473	473	413	413
query44	1301	727	719	719
query45	184	183	174	174
query46	857	950	582	582
query47	1405	1487	1361	1361
query48	311	317	244	244
query49	602	426	331	331
query50	622	264	200	200
query51	3762	3847	3776	3776
query52	105	110	94	94
query53	291	328	271	271
query54	279	275	240	240
query55	76	75	69	69
query56	290	308	282	282
query57	993	1025	905	905
query58	268	258	249	249
query59	2128	2059	2044	2044
query60	326	349	294	294
query61	161	158	157	157
query62	390	360	310	310
query63	298	265	266	265
query64	4839	1301	988	988
query65	3781	3702	3714	3702
query66	1431	407	296	296
query67	14892	15252	14902	14902
query68	7953	985	696	696
query69	498	345	311	311
query70	1032	945	916	916
query71	356	295	284	284
query72	6149	3441	3460	3441
query73	763	717	306	306
query74	8735	8745	8559	8559
query75	2837	2826	2444	2444
query76	3903	1057	635	635
query77	522	362	271	271
query78	9634	9944	9135	9135
query79	1347	903	591	591
query80	625	599	471	471
query81	506	257	226	226
query82	207	147	111	111
query83	262	257	238	238
query84	255	120	100	100
query85	891	513	452	452
query86	347	325	283	283
query87	2898	2854	2719	2719
query88	3790	2219	2205	2205
query89	385	356	334	334
query90	2204	160	156	156
query91	174	164	138	138
query92	79	72	65	65
query93	1365	934	529	529
query94	559	322	313	313
query95	573	335	312	312
query96	584	472	211	211
query97	2340	2370	2271	2271
query98	223	198	193	193
query99	630	590	541	541
Total cold run time: 254384 ms
Total hot run time: 172675 ms

@doris-robot

ClickBench: Total hot run time: 27.42 s
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/clickbench-tools
ClickBench test result on commit 36f97420b3801690f96e2cde101334142cf86ea1, data reload: false

query	cold (s)	hot 1 (s)	hot 2 (s)
query1	0.06	0.05	0.05
query2	0.10	0.05	0.04
query3	0.26	0.09	0.08
query4	1.60	0.12	0.11
query5	0.28	0.26	0.25
query6	1.14	0.67	0.66
query7	0.03	0.03	0.03
query8	0.05	0.04	0.04
query9	0.56	0.51	0.50
query10	0.56	0.55	0.56
query11	0.14	0.10	0.09
query12	0.14	0.11	0.12
query13	0.61	0.59	0.59
query14	0.96	0.94	0.94
query15	0.79	0.77	0.78
query16	0.39	0.40	0.43
query17	1.07	1.08	1.05
query18	0.23	0.21	0.21
query19	1.95	1.84	1.80
query20	0.02	0.01	0.01
query21	15.43	0.27	0.14
query22	5.17	0.04	0.04
query23	16.22	0.28	0.10
query24	1.29	0.50	0.65
query25	0.16	0.05	0.06
query26	0.14	0.14	0.14
query27	0.08	0.05	0.04
query28	4.07	1.07	0.88
query29	12.57	3.91	3.17
query30	0.27	0.14	0.12
query31	2.83	0.65	0.40
query32	3.24	0.57	0.45
query33	2.98	3.08	3.07
query34	16.96	5.12	4.45
query35	4.46	4.72	5.07
query36	0.70	0.54	0.54
query37	0.11	0.07	0.06
query38	0.07	0.03	0.03
query39	0.04	0.02	0.02
query40	0.18	0.14	0.14
query41	0.09	0.02	0.02
query42	0.04	0.03	0.03
query43	0.03	0.04	0.03
Total cold run time: 98.07 s
Total hot run time: 27.42 s

@hello-stephen
Contributor

BE UT Coverage Report

Increment line coverage 89.66% (26/29) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 53.27% (18988/35646)
Line Coverage 39.22% (175875/448454)
Region Coverage 33.76% (136212/403468)
Branch Coverage 34.76% (58819/169232)

@hello-stephen
Contributor

BE Regression && UT Coverage Report

Increment line coverage 96.55% (28/29) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 73.87% (25742/34846)
Line Coverage 61.20% (273696/447235)
Region Coverage 56.10% (228646/407587)
Branch Coverage 57.96% (98403/169781)

@github-actions
Contributor

github-actions bot commented Jan 7, 2026

PR approved by at least one committer and no changes requested.

@github-actions bot added the approved label Jan 7, 2026
@gavinchou merged commit e3fc71b into apache:master Jan 8, 2026
26 of 29 checks passed
7 participants