
[fix](move-memtable) fix StreamWrite EINVAL error when report tablet load info #60688

Merged
dataroaring merged 1 commit into apache:master from sollhui:brpc_stream_write_fail
Feb 11, 2026

@sollhui sollhui commented Feb 11, 2026

Summary

Fix a bug where _report_tablet_load_info writes an empty IOBuf to a brpc stream,
causing Socket::Write to return EINVAL and log warnings such as:

Fail to write to _fake_socket, Invalid argument

Root Cause

_report_tablet_load_info is called on every ADD_SEGMENT request to report tablet
load info (version count) back to the sender for back-pressure control.

The call chain is:

  1. _report_tablet_load_info gets write_tablet_ids from the index stream
  2. _collect_tablet_load_info_from_tablets iterates over these tablet IDs
  3. collect_tablet_load_rowset_num_info only adds an entry when version_count
    exceeds max_version_config * load_back_pressure_version_threshold / 100
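
The threshold check in step 3 can be sketched as follows. This is a minimal illustration rather than the Doris source: the function name `exceeds_back_pressure_threshold` and its parameter names are hypothetical, and the use of integer arithmetic is an assumption.

```cpp
#include <cstdint>

// Hypothetical helper (not the Doris implementation): returns true when a
// tablet's version count is above the back-pressure threshold described above.
// Assumption: the threshold is a percentage and integer arithmetic is used.
bool exceeds_back_pressure_threshold(int64_t version_count,
                                     int64_t max_version_config,
                                     int64_t threshold_percent) {
    return version_count > max_version_config * threshold_percent / 100;
}
```

With illustrative values of 2000 for the maximum and 80 for the percentage, only a version count above 1600 would produce an entry. Anything at or below that adds nothing to the report.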

In normal cases where no tablet hits the back-pressure threshold, tablet_load_infos
remains empty. A protobuf message with only an empty repeated field serializes to an
empty string (0 bytes). The empty IOBuf is then passed to brpc::StreamWrite, and
brpc's Socket::Write rejects empty data with EINVAL.
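
The failure path described above can be modelled with a small self-contained sketch. Everything here is a stand-in: `TabletLoadInfo`, `serialize`, and `fake_stream_write` are hypothetical names that imitate the behaviour described in this PR (an empty list serializes to zero bytes, and an empty payload is rejected with EINVAL). They are not the protobuf or brpc APIs.

```cpp
#include <cerrno>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical record standing in for one reported tablet entry.
struct TabletLoadInfo {
    int64_t tablet_id;
    int64_t version_count;
};

// Stand-in for message serialization: a message whose only field is an
// empty repeated field produces no bytes at all.
std::string serialize(const std::vector<TabletLoadInfo>& infos) {
    std::string out;
    for (const auto& info : infos) {
        out += std::to_string(info.tablet_id) + ":" +
               std::to_string(info.version_count) + ";";
    }
    return out;
}

// Stand-in for a stream write that, like the behaviour described above,
// rejects an empty payload with EINVAL and accepts anything else.
int fake_stream_write(const std::string& payload) {
    if (payload.empty()) {
        return EINVAL;
    }
    return 0;
}
```

Passing the result of `serialize` on an empty vector straight to `fake_stream_write` returns EINVAL, which mirrors the warning seen on every ADD_SEGMENT request when no tablet is over the threshold.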

This error is harmless (the socket is neither closed nor marked as failed, and subsequent writes
succeed), but it produces noisy WARNING logs on every ADD_SEGMENT request.

Fix

Skip the StreamWrite call when tablet_load_infos is empty after collection,
since there is nothing to report.
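
The shape of the fix can be sketched as an early return before the write. This is an illustration under assumptions, not the merged diff: `LoadInfoEntry`, `FakeStream`, and `report_tablet_load_info` are hypothetical names, and `FakeStream::write` only counts calls in place of the real stream write.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical record standing in for one reported tablet entry.
struct LoadInfoEntry {
    int64_t tablet_id;
    int64_t version_count;
};

// Hypothetical stream that only counts how many writes were issued.
struct FakeStream {
    int writes = 0;
    void write(const std::vector<LoadInfoEntry>& /*infos*/) { ++writes; }
};

// Sketch of the guard: when collection produced no entries there is
// nothing to report, so the write is skipped entirely.
void report_tablet_load_info(FakeStream& stream,
                             const std::vector<LoadInfoEntry>& infos) {
    if (infos.empty()) {
        return;
    }
    stream.write(infos);
}
```

The guard sits after collection and before serialization, so the common case (no tablet over the threshold) avoids both the wasted serialization and the rejected write.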

@hello-stephen

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?


sollhui commented Feb 11, 2026

run buildall


@liaoxin01 liaoxin01 left a comment


LGTM

@github-actions github-actions bot added the approved Indicates a PR has been approved by one committer. label Feb 11, 2026
@github-actions

PR approved by at least one committer and no changes requested.

@github-actions

PR approved by anyone and no changes requested.

@doris-robot

TPC-H: Total hot run time: 29968 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit ecdecb663fd66297aa7c8d10933c125c42c1c4d3, data reload: false

------ Round 1 ----------------------------------
query	cold run (ms)	hot run 1 (ms)	hot run 2 (ms)	best hot run (ms)
q1	17592	4425	4288	4288
q2	2068	348	245	245
q3	10118	1288	704	704
q4	10196	777	307	307
q5	7504	2137	1922	1922
q6	189	176	145	145
q7	886	747	609	609
q8	9261	1343	1088	1088
q9	4703	4658	4590	4590
q10	6789	1958	1549	1549
q11	448	266	249	249
q12	339	385	221	221
q13	17753	4017	3213	3213
q14	240	229	223	223
q15	869	814	824	814
q16	692	668	634	634
q17	693	790	535	535
q18	6519	5768	5621	5621
q19	1238	969	619	619
q20	497	498	380	380
q21	2515	1832	1767	1767
q22	335	284	245	245
Total cold run time: 101444 ms
Total hot run time: 29968 ms

----- Round 2, with runtime_filter_mode=off -----
query	cold run (ms)	hot run 1 (ms)	hot run 2 (ms)	best hot run (ms)
q1	4341	4339	4318	4318
q2	255	350	260	260
q3	2052	2698	2209	2209
q4	1370	1730	1286	1286
q5	4252	4164	4173	4164
q6	214	176	134	134
q7	1884	1798	1634	1634
q8	2477	2510	2559	2510
q9	7600	7403	7537	7403
q10	2841	3154	2620	2620
q11	511	440	410	410
q12	696	757	600	600
q13	3902	4303	3658	3658
q14	292	327	284	284
q15	929	837	844	837
q16	684	712	696	696
q17	1185	1371	1382	1371
q18	8311	8130	7981	7981
q19	886	878	853	853
q20	2081	2133	1974	1974
q21	4792	4405	4334	4334
q22	520	502	442	442
Total cold run time: 52075 ms
Total hot run time: 49978 ms

@doris-robot

ClickBench: Total hot run time: 28.31 s
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/clickbench-tools
ClickBench test result on commit ecdecb663fd66297aa7c8d10933c125c42c1c4d3, data reload: false

query	run 1, cold (s)	run 2 (s)	run 3 (s)
query1	0.06	0.05	0.05
query2	0.10	0.04	0.05
query3	0.25	0.08	0.08
query4	1.61	0.12	0.11
query5	0.27	0.25	0.25
query6	1.16	0.67	0.68
query7	0.03	0.02	0.03
query8	0.05	0.04	0.04
query9	0.56	0.50	0.49
query10	0.55	0.53	0.55
query11	0.15	0.09	0.09
query12	0.14	0.10	0.11
query13	0.64	0.61	0.62
query14	1.05	1.06	1.07
query15	0.88	0.86	0.88
query16	0.41	0.39	0.40
query17	1.08	1.12	1.12
query18	0.22	0.20	0.21
query19	2.09	2.01	2.03
query20	0.02	0.02	0.02
query21	15.42	0.28	0.15
query22	5.16	0.06	0.05
query23	15.97	0.31	0.11
query24	2.25	0.28	0.56
query25	0.12	0.08	0.06
query26	0.15	0.13	0.13
query27	0.08	0.06	0.06
query28	4.71	1.15	0.96
query29	12.59	3.93	3.15
query30	0.27	0.14	0.12
query31	2.81	0.64	0.40
query32	3.25	0.59	0.48
query33	3.22	3.21	3.19
query34	16.26	5.39	4.78
query35	4.82	4.80	4.77
query36	0.65	0.51	0.50
query37	0.11	0.08	0.07
query38	0.07	0.04	0.04
query39	0.05	0.03	0.03
query40	0.20	0.17	0.15
query41	0.08	0.03	0.03
query42	0.04	0.04	0.03
query43	0.04	0.04	0.03
Total cold run time: 99.64 s
Total hot run time: 28.31 s

@hello-stephen

BE UT Coverage Report

Increment line coverage 0.00% (0/3) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 52.77% (19474/36906)
Line Coverage 36.24% (181408/500552)
Region Coverage 32.62% (140681/431246)
Branch Coverage 33.65% (60998/181284)


@dataroaring dataroaring left a comment


LGTM

@hello-stephen

BE Regression && UT Coverage Report

Increment line coverage 100.00% (3/3) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 71.73% (25939/36161)
Line Coverage 54.36% (271434/499313)
Region Coverage 51.75% (225420/435625)
Branch Coverage 53.30% (96992/181988)

@hello-stephen

BE Regression && UT Coverage Report

Increment line coverage 100.00% (3/3) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 71.73% (25939/36161)
Line Coverage 54.36% (271430/499313)
Region Coverage 51.73% (225347/435625)
Branch Coverage 53.29% (96979/181988)

@dataroaring dataroaring merged commit 71139d7 into apache:master Feb 11, 2026
31 of 32 checks passed
sollhui added a commit to sollhui/doris that referenced this pull request Feb 27, 2026
…load info (apache#60688)


Labels

approved (indicates a PR has been approved by one committer), dev/4.0.x, dev/4.0.x-conflict, reviewed


5 participants