performance(compio-net): avoid send_zerocopy for small TCP writes by johnnyshields · Pull Request #771 · compio-rs/compio

johnnyshields · 2026-03-15T18:02:12Z

Fix/Performance for compio-net: avoid send_zerocopy for small TCP writes

Fix

Add a ZEROCOPY_THRESHOLD (8KB). Writes below the threshold use regular send (IORING_OP_SEND, 1 CQE), which completes immediately without waiting for the peer's ACK. Writes at or above the threshold continue using send_zerocopy where the kernel copy savings justify the extra CQE round-trip.

Applied to both write and write_vectored. The send_zerocopy / send_zerocopy_vectored public APIs are unaffected — only the AsyncWrite trait impl uses the threshold.
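The dispatch rule described above could be sketched as follows. This is a minimal illustration and not the code in this PR: `SendPath`, `choose_send_path`, and `choose_send_path_vectored` are hypothetical names, and summing the buffer lengths in the vectored case is an assumption, since the description does not say how `write_vectored` measures size.

```rust
/// Threshold from the PR description: 8 KB.
pub const ZEROCOPY_THRESHOLD: usize = 8 * 1024;

/// Which send path a write would be routed to (hypothetical type).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum SendPath {
    /// Regular send (IORING_OP_SEND): one CQE, the kernel copies the data.
    Regular,
    /// Zerocopy send (IORING_OP_SEND_ZC): two CQEs, no kernel copy.
    ZeroCopy,
}

/// Writes below the threshold use regular send; writes at or above it
/// use zerocopy.
pub fn choose_send_path(len: usize) -> SendPath {
    if len < ZEROCOPY_THRESHOLD {
        SendPath::Regular
    } else {
        SendPath::ZeroCopy
    }
}

/// Vectored variant. Assumption: the decision is based on the total
/// number of bytes across all buffers.
pub fn choose_send_path_vectored(bufs: &[&[u8]]) -> SendPath {
    let total: usize = bufs.iter().map(|b| b.len()).sum();
    choose_send_path(total)
}
```

Under this sketch a 13-byte frame takes the regular path, while an exactly 8192-byte write is the first size routed to zerocopy.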

This fixes two problems at once:

Problem 1

For small data transfers, traditional kernel copies usually outperform zero-copy. See benchmarks below; TL;DR: it's a 1.5-2.5x speedup for transfers under 8KB.

Problem 2

send_zerocopy (IORING_OP_SEND_ZC) requires two CQEs: one for send completion and one for buffer release. The buffer-release CQE is only generated after the peer ACKs the data. On Linux, TCP delayed ACKs can defer this by ~40ms — but only for small writes. Large writes produce multiple MSS-sized segments, and the receiver's delayed-ACK logic fast-ACKs after every second full segment. A small write (e.g. a 13-byte H2 WINDOW_UPDATE) arrives as a single undersized segment with no companion to trigger fast-ACK, so the receiver's delayed-ACK timer runs to expiration. This causes small standalone writes to stall for the full ~40ms waiting for the second CQE.

I found this while working on H2 support (for gRPC): it was tripping up 13-byte H2 WINDOW_UPDATE frames and caused a 3700x latency cliff. The root cause chain:

H2 WINDOW_UPDATE (13 bytes) sent as standalone TCP write
AsyncWrite::write → send_zerocopy → SEND_ZC submitted to io_uring
First CQE (send complete) arrives in ~3µs
Second CQE (buffer release) waits for peer's TCP ACK
Peer's delayed-ACK timer fires after ~40ms
Total: ~40ms for a 13-byte write
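To make the two-CQE behaviour in this chain concrete, here is a small model of how a caller might track a single SEND_ZC request. It is a sketch under stated assumptions, not compio's driver code: `ZcSendState` and `on_cqe` are hypothetical names. The two flag values are the ones defined in the Linux io_uring UAPI header; the handling of a first CQE without `IORING_CQE_F_MORE` (no notification follows) reflects my reading of the `io_uring_prep_send_zc` documentation.

```rust
/// CQE flag bits from the Linux io_uring UAPI header.
pub const IORING_CQE_F_MORE: u32 = 1 << 1; // another CQE will follow for this request
pub const IORING_CQE_F_NOTIF: u32 = 1 << 3; // zerocopy notification: buffer released

/// Tracks the two completions of one zerocopy send (hypothetical type).
#[derive(Debug, Default, PartialEq, Eq)]
pub struct ZcSendState {
    /// Result of the send itself (bytes sent, or a negative errno).
    pub result: Option<i32>,
    /// True once the kernel no longer references the user buffer.
    pub buffer_released: bool,
}

impl ZcSendState {
    /// Feed one CQE into the state machine. Returns true once the
    /// request is fully complete and the buffer may be reused.
    pub fn on_cqe(&mut self, res: i32, flags: u32) -> bool {
        if flags & IORING_CQE_F_NOTIF != 0 {
            // Second CQE: for TCP this is the one that waits for the peer's ACK.
            self.buffer_released = true;
        } else {
            // First CQE: the send result.
            self.result = Some(res);
            if flags & IORING_CQE_F_MORE == 0 {
                // Assumption: without F_MORE no notification will follow.
                self.buffer_released = true;
            }
        }
        self.is_complete()
    }

    pub fn is_complete(&self) -> bool {
        self.result.is_some() && self.buffer_released
    }
}
```

A write that only resolves when `is_complete()` is true inherits the latency of the second CQE, which is why the ~3µs first completion does not help the 13-byte case.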

Benchmark results

I added a new /benches/ dir in this PR -- run cargo bench -p compio-net --bench tcp_write

Size	regular send	send_zerocopy	Speedup
13B	6µs	15µs	2.5x
100B	7µs	15µs	2.1x
1KB	4µs	6µs	1.5x
8KB	12µs	10µs	~1x (zerocopy slightly faster at this size; zerocopy is used from 8KB upward)

The 2-CQE overhead of zerocopy send adds ~8-9µs per small write in steady state. For writes below 8KB, regular send is consistently faster. At 8KB and above, zerocopy breaks even or wins due to kernel copy savings.
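As a quick check on the arithmetic, the speedup and overhead figures can be recomputed from the table. The numbers below are copied from the rows above; the function names are only illustrative.

```rust
/// (size, regular send µs, send_zerocopy µs), copied from the benchmark table.
pub const RESULTS: [(&str, f64, f64); 4] = [
    ("13B", 6.0, 15.0),
    ("100B", 7.0, 15.0),
    ("1KB", 4.0, 6.0),
    ("8KB", 12.0, 10.0),
];

/// How many times faster regular send is than zerocopy (>1 means regular wins).
pub fn speedup(regular_us: f64, zerocopy_us: f64) -> f64 {
    zerocopy_us / regular_us
}

/// Extra latency that zerocopy adds over regular send, in microseconds.
pub fn zerocopy_overhead_us(regular_us: f64, zerocopy_us: f64) -> f64 {
    zerocopy_us - regular_us
}

/// One formatted line per table row.
pub fn report() -> Vec<String> {
    RESULTS
        .iter()
        .map(|&(size, reg, zc)| {
            format!(
                "{size}: speedup {:.2}x, overhead {:+.0}µs",
                speedup(reg, zc),
                zerocopy_overhead_us(reg, zc)
            )
        })
        .collect()
}
```

Computed this way, the 8-9µs overhead corresponds to the 13B and 100B rows; the 1KB row shows a 2µs gap, and at 8KB the ratio drops below 1, meaning zerocopy is ahead.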

(The ~40ms delayed-ACK stall is separate from the steady-state overhead above; it affects p99 latency more than the median.)

Berrysoft · 2026-03-15T18:12:58Z

I have created #770 to make zerocopy explicit. That might be better than a threshold constant.

johnnyshields · 2026-03-15T18:24:57Z

Hmmm... I think it is good to have both an explicit send_zerocopy method and this write method which uses the threshold. I think the threshold is the easiest way for the vast majority of downstream implementations (including my compio-h2 PR to be raised soon)--it just works. Otherwise, we push the necessity to optimize small write handling to each downstream implementation.

Berrysoft · 2026-03-15T18:30:37Z

I mean, I don't want write to call send_zerocopy. It might not always be a good idea.

The threshold logic is not transparent. Users need to read the documentation to know what happens inside.
Not all kernels support SEND_ZC, which means it's a purely negative optimization on old kernels and other platforms (as there is another await point).

> Otherwise, we push the necessity to optimize small write handling to each downstream implementation.

So I want to push the optimization for large buffers to downstreams instead.

johnnyshields · 2026-03-15T18:44:13Z

> So I want to push the optimization for large buffers to downstreams instead.

I would disagree. IMHO the best pattern is for the write method to do the right thing by default. The 8KB threshold is generally sensible; it may cost 1-2µs extra vs a kernel-specific "optimized" implementation, but at least it won't hit a latency cliff with delayed ACKs. We could add some kernel-specific compile-time gating/optimizations to it in the future.

FYI I wasted about 3 hours debugging latency in my compio-h2 implementation b/c compio-net write didn't have this simple optimization, and I blindly expected it would.

Implementers who want "sharp knives" should be able to use send_zerocopy or send_kernel methods as direct alternatives.

Berrysoft · 2026-03-17T06:55:52Z

> FYI I wasted about 3 hours debugging latency in my compio-h2 implementation b/c compio-net write didn't have this simple optimization, and I blindly expected it would.

Sorry to hear that. It was a mistake to make write call send_zerocopy unconditionally, and that's why I want to change it back to send. That doesn't prevent the downstream optimization at all. For example, you can write a wrapper around compio::net::TcpStream and implement AsyncWrite yourself, calling send_zerocopy with a custom threshold and conditions.

You might be right as far as your own performance is concerned, but I would argue that every condition, every branch, and every benchmark has a cost. You will never be able to say that an optimization is general enough. If I set the threshold to 8KB but another user argues it should be 10KB, it becomes hard and complex for us to decide how to satisfy all users. Instead, if we just provide a plain write (which calls send internally) and a plain send_zerocopy without any conditions, users will not be that happy, but at least it will be easy for them to optimize the TcpStream themselves for their specific usage.

> Implementers who want "sharp knives" should be able to use send_zerocopy or send_kernel methods as direct alternatives.

It's always possible to call send_zerocopy directly.

We might consider providing a method to indicate whether the kernel supports send_zerocopy...
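The downstream wrapper suggested in this comment might look roughly like the sketch below. It is deliberately not written against compio's real API: `RawSend` is a hypothetical, synchronous stand-in for a socket that exposes both send paths, and `ThresholdWriter` and `RecordingSocket` are illustrative names. A real wrapper would implement compio's AsyncWrite over compio::net::TcpStream instead.

```rust
use std::io;

/// Hypothetical stand-in for a socket with both send paths.
/// (Synchronous for brevity; this is not compio's trait.)
pub trait RawSend {
    fn send(&mut self, buf: &[u8]) -> io::Result<usize>;
    fn send_zerocopy(&mut self, buf: &[u8]) -> io::Result<usize>;
}

/// Wrapper that picks a send path using a caller-chosen threshold.
pub struct ThresholdWriter<S> {
    inner: S,
    threshold: usize,
}

impl<S: RawSend> ThresholdWriter<S> {
    pub fn new(inner: S, threshold: usize) -> Self {
        Self { inner, threshold }
    }

    /// Small writes use the copying path; writes at or above the
    /// threshold use zerocopy.
    pub fn write(&mut self, buf: &[u8]) -> io::Result<usize> {
        if buf.len() >= self.threshold {
            self.inner.send_zerocopy(buf)
        } else {
            self.inner.send(buf)
        }
    }

    pub fn into_inner(self) -> S {
        self.inner
    }
}

/// Test double that records which path each write took.
#[derive(Debug, Default)]
pub struct RecordingSocket {
    pub regular: Vec<usize>,
    pub zerocopy: Vec<usize>,
}

impl RawSend for RecordingSocket {
    fn send(&mut self, buf: &[u8]) -> io::Result<usize> {
        self.regular.push(buf.len());
        Ok(buf.len())
    }

    fn send_zerocopy(&mut self, buf: &[u8]) -> io::Result<usize> {
        self.zerocopy.push(buf.len());
        Ok(buf.len())
    }
}
```

Because the threshold is a constructor argument, a downstream that prefers 10KB over 8KB only changes one value, which is the flexibility argued for above.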

johnnyshields · 2026-03-17T14:42:46Z

How about the following:

Revert write to alias to send as you suggest (write = kernel-copied send, names follow POSIX nomenclature)
Introduce write_zerocopy, which is what write is on master today, i.e. the buffer-blocking variant of send_zerocopy (non-controversial I think)
Add write_adaptive which is my threshold method in this PR.
(and add *_vectored equivalents for all the above)

Now, regarding the 8KB threshold, I doubt we will get into heated debates over this 😄:

Empirically, my h2 benchmarking shows 8KB works well, in a way I doubt is "specific" to just H2.
Linux and gRPC have references to 10KB and 16KB for MSG_ZEROCOPY (which uses the socket error queue), which is different from SEND_ZC (CQE-based, lower overhead).
It would be possible to make it tunable if someone really wants to, which could be as simple as adding a function arg which the downstream can pass in.
If we learn that 8KB is a bad threshold for a certain OS, or that ZC isn't supported at all, then we compile-time override it for that OS. (All downstreams would reap the benefit of such an override, rather than forcing each downstream to discover and inconsistently handle OS edge cases.)
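The compile-time override mentioned in the last point could be expressed with `cfg` attributes, roughly as below. This is only a sketch of the idea: the constant and function names are hypothetical, and treating every non-Linux target as "no zerocopy" is an assumption made for illustration.

```rust
/// Per-OS default. `None` means "never use zerocopy on this target".
#[cfg(target_os = "linux")]
pub const DEFAULT_ZEROCOPY_THRESHOLD: Option<usize> = Some(8 * 1024);

/// Assumption for illustration: other targets disable zerocopy entirely.
#[cfg(not(target_os = "linux"))]
pub const DEFAULT_ZEROCOPY_THRESHOLD: Option<usize> = None;

/// Tunable form: the caller passes the threshold as a function argument.
pub fn should_use_zerocopy(len: usize, threshold: Option<usize>) -> bool {
    match threshold {
        Some(t) => len >= t,
        None => false,
    }
}

/// Default form: uses the compile-time constant for the current target.
pub fn should_use_zerocopy_default(len: usize) -> bool {
    should_use_zerocopy(len, DEFAULT_ZEROCOPY_THRESHOLD)
}
```

A downstream that disagrees with the default would call `should_use_zerocopy` with its own value, while everyone else picks up any per-OS correction automatically.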

Berrysoft · 2026-03-17T15:04:13Z

The first suggestion will be solved in #770, and the following ones can be done yourself in your own project.

johnnyshields · 2026-03-17T15:13:17Z

Hmmm... it's your project so it's your call, but most users who would consider compio are considering it specifically because of io_uring's net efficiency. Without write_adaptive (or similar) in compio-net, every downstream that uses AsyncWrite will either miss out on SEND_ZC entirely, or independently re-implement the same threshold logic. So it just feels like a waste. But again--your call.

Berrysoft · 2026-03-17T15:28:18Z

> every downstream that uses AsyncWrite will either miss out on SEND_ZC entirely, or independently re-implement the same threshold logic.

Well, yes. That's something of a compromise. In my opinion, I would like them to re-implement the same threshold logic independently. compio is a fundamental crate. It should provide more opportunities rather than a "complete and beautiful" solution (which is somewhat what tokio does; I would not be surprised if someday they optimize their send with MSG_ZEROCOPY). We don't provide our own implementation of basic macros (e.g., join or select) because futures already provides them. We don't provide our own mutexes or channels. We provide low-level driver APIs rather than hiding them inside the high-level runtime. The same logic applies here. We provide simple sockets and platform-specific APIs (zerocopy is actually Linux-only, and we provide a cross-platform API just to make cross-platform code easier), but we should not provide a threshold-based optimization that users could easily implement themselves. Each API is easy to guess and predict: write is write or send, and read is read or recv, that's it. Users who read the API documentation will know that TcpStream supports send_zerocopy, and of course they will not expect write to sometimes call send_zerocopy. It might unexpectedly benefit some newcomers, but it will make advanced users mad. An advanced user might want to adjust the threshold, the if condition, or even the entire optimization logic. A simple write method will be more convenient for both of us.

johnnyshields · 2026-03-17T15:58:14Z

OK, that's fine if that's compio's philosophy: "a literal wrapper around the kernel." I would just ask to make that philosophy crystal-clear in the README and other places.

I do think that as compio's adoption grows, that philosophy will fight against what most implementers want/expect--and there's no reason compio can't provide a "near-optimal default" AND "sharp knives" at the same time.

FYI tokio doesn't use zerocopy in their core at all; it is in their io-uring crate (which has low adoption).

Berrysoft · 2026-03-17T16:19:49Z

Well, things always change. I cannot be sure what will happen in the future, but currently there isn't even another successful example for us to refer to, and I don't think it's easy to provide a balanced, zero-cost, optimized, sharp knife.

Other references:

tokio-uring, only provides send_zc for UDP sockets.
monoio, provides a wrong implementation (which passes MSG_ZEROCOPY to SEND).
glommio, does nothing.

perf(compio-net): avoid send_zerocopy for small TCP writes

26f22ef

johnnyshields changed the title from "Performance: compio-net -- avoid send_zerocopy for small TCP writes" to "performance(compio-net): avoid send_zerocopy for small TCP writes" on Mar 17, 2026

johnnyshields mentioned this pull request Mar 17, 2026

feat(h2): compio H2 implementation #775

Open

Merge branch 'master' into fix-send-zerocopy

b51ffbc

johnnyshields wants to merge 2 commits into compio-rs:master from johnnyshields:fix-send-zerocopy
