performance(compio-net): avoid send_zerocopy for small TCP writes#771

Open
johnnyshields wants to merge 2 commits into compio-rs:master from johnnyshields:fix-send-zerocopy

@johnnyshields
Contributor

@johnnyshields johnnyshields commented Mar 15, 2026

Fix/Performance for compio-net: avoid send_zerocopy for small TCP writes

Fix

Add a ZEROCOPY_THRESHOLD (8KB). Writes below the threshold use regular send (IORING_OP_SEND, 1 CQE), which completes immediately without waiting for the peer's ACK. Writes at or above the threshold continue using send_zerocopy where the kernel copy savings justify the extra CQE round-trip.

Applied to both write and write_vectored. The send_zerocopy / send_zerocopy_vectored public APIs are unaffected — only the AsyncWrite trait impl uses the threshold.
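The dispatch described above can be sketched as a pure decision function. This is a minimal illustration, not the code in this PR: `SendPath`, `choose_path`, and `choose_path_vectored` are hypothetical names, and comparing the summed length of all slices for the vectored case is an assumption.

```rust
/// Cutoff from the PR description: 8 KB.
pub const ZEROCOPY_THRESHOLD: usize = 8 * 1024;

/// Which io_uring opcode a write would be routed to (hypothetical type).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum SendPath {
    /// IORING_OP_SEND: one CQE, the kernel copies the buffer.
    Regular,
    /// IORING_OP_SEND_ZC: two CQEs, no kernel copy.
    ZeroCopy,
}

/// Decision for a single contiguous buffer, as in `write`.
pub fn choose_path(len: usize) -> SendPath {
    if len < ZEROCOPY_THRESHOLD {
        SendPath::Regular
    } else {
        SendPath::ZeroCopy
    }
}

/// Decision for `write_vectored`.
/// Assumption: the total length across all slices is what gets compared.
pub fn choose_path_vectored(bufs: &[&[u8]]) -> SendPath {
    let total: usize = bufs.iter().map(|b| b.len()).sum();
    choose_path(total)
}
```

With this shape, a 13-byte frame takes the regular path and a write of exactly 8KB takes the zero-copy path.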

This fixes two problems at once:

Problem 1

For small data transfers, a traditional kernel copy usually outperforms zero-copy. See the benchmarks below; TL;DR: it's a 1.5-2.5x speedup for transfers under 8KB.

Problem 2

send_zerocopy (IORING_OP_SEND_ZC) requires two CQEs: one for send completion and one for buffer release. The buffer-release CQE is only generated after the peer ACKs the data. On Linux, TCP delayed ACKs can defer this by ~40ms — but only for small writes. Large writes produce multiple MSS-sized segments, and the receiver's delayed-ACK logic fast-ACKs after every second full segment. A small write (e.g. a 13-byte H2 WINDOW_UPDATE) arrives as a single undersized segment with no companion to trigger fast-ACK, so the receiver's delayed-ACK timer runs to expiration. This causes small standalone writes to stall for the full ~40ms waiting for the second CQE.

I found this while working on H2 support (for gRPC): it was tripping up 13-byte H2 WINDOW_UPDATE frames and caused a 3700x latency cliff. The root cause chain:

  1. H2 WINDOW_UPDATE (13 bytes) sent as standalone TCP write
  2. AsyncWrite::write → send_zerocopy → SEND_ZC submitted to io_uring
  3. First CQE (send complete) arrives in ~3µs
  4. Second CQE (buffer release) waits for peer's TCP ACK
  5. Peer's delayed-ACK timer fires after ~40ms
  6. Total: ~40ms for a 13-byte write
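The chain above boils down to one rule: a write that owns its buffer can only return once both CQEs have arrived, so its latency is the later of the two. A toy model of that rule, using the ~3µs and ~40ms figures from the steps above (the `ZcCompletions` type and function name are hypothetical, and this models nothing else about the kernel):

```rust
use std::time::Duration;

/// Arrival times of the two SEND_ZC completions, measured from submission.
/// Hypothetical type for illustration only.
pub struct ZcCompletions {
    /// First CQE: the send itself completed.
    pub send_done: Duration,
    /// Second CQE: the kernel released the buffer (after the peer's ACK).
    pub buffer_released: Duration,
}

/// The write finishes when the later of the two CQEs arrives.
pub fn zerocopy_write_latency(c: &ZcCompletions) -> Duration {
    c.send_done.max(c.buffer_released)
}
```

Plugging in `send_done = 3µs` and `buffer_released = 40ms` gives 40ms for the whole write, which is the stall described in step 6.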

Benchmark results

I added a new /benches/ dir in this PR -- run cargo bench -p compio-net --bench tcp_write

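| Size | regular send | send_zerocopy | Speedup |
|------|--------------|---------------|---------|
| 13B  | 6µs          | 15µs          | 2.5x    |
| 100B | 7µs          | 15µs          | 2.1x    |
| 1KB  | 4µs          | 6µs           | 1.5x    |
| 8KB  | 12µs         | 10µs          | ~1x (zerocopy slightly faster at this size; we use zerocopy from 8KB) |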

The 2-CQE overhead of zerocopy send adds ~8-9µs per small write in steady state. For writes below 8KB, regular send is consistently faster. At 8KB and above, zerocopy breaks even or wins due to kernel copy savings.

(The ~40ms delayed-ACK stall is separate from the steady-state overhead above; it affects p99 latency more than the median.)
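For reference, the speedup column and the per-write overhead can be recomputed from the table rows. This sketch only restates the published numbers; the `Row` type and helper names are hypothetical, and speedup is taken as zerocopy time divided by regular time.

```rust
/// One row of the benchmark table, times in microseconds.
pub struct Row {
    pub size: &'static str,
    pub regular_us: f64,
    pub zerocopy_us: f64,
}

/// The four rows exactly as reported in the table.
pub const ROWS: [Row; 4] = [
    Row { size: "13B", regular_us: 6.0, zerocopy_us: 15.0 },
    Row { size: "100B", regular_us: 7.0, zerocopy_us: 15.0 },
    Row { size: "1KB", regular_us: 4.0, zerocopy_us: 6.0 },
    Row { size: "8KB", regular_us: 12.0, zerocopy_us: 10.0 },
];

/// How many times faster regular send is than send_zerocopy.
pub fn speedup(row: &Row) -> f64 {
    row.zerocopy_us / row.regular_us
}

/// Extra microseconds spent per write when using send_zerocopy.
pub fn overhead_us(row: &Row) -> f64 {
    row.zerocopy_us - row.regular_us
}
```

The 13B and 100B rows give overheads of 9µs and 8µs, the 1KB row gives 2µs, and the 8KB row comes out below 1x, meaning zerocopy is ahead there.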

@Berrysoft
Member

I have created #770 to make zerocopy explicit. That might be better than a threshold constant.

@johnnyshields
Contributor Author

johnnyshields commented Mar 15, 2026

Hmmm... I think it is good to have both an explicit send_zerocopy method and this write method which uses the threshold. I think the threshold is the easiest way for the vast majority of downstream implementations (including my compio-h2 PR to be raised soon)--it just works. Otherwise, we push the necessity to optimize small write handling to each downstream implementation.

@Berrysoft
Member

Berrysoft commented Mar 15, 2026

I mean, I don't want write to call send_zerocopy. It might not always be a good idea.

  • The threshold logic is not transparent. Users need to read the documentation to know what happens inside.
  • Not all kernels support SEND_ZC, which means it's a totally negative optimization on old kernels and other platforms (as there is another await point).

> Otherwise, we push the necessity to optimize small write handling to each downstream implementation.

So I want to push the optimization for large buffers to downstreams instead.

@johnnyshields
Contributor Author

johnnyshields commented Mar 15, 2026

> So I want to push the optimization for large buffers to downstreams instead.

I would disagree. IMHO the best pattern is for the write method to do the right thing by default--the 8KB threshold is generally sensible; it may cost 1-2µs extra vs. a kernel-specific "optimized" implementation, but at least it won't hit a latency cliff with delayed ACKs. We could add some kernel-specific compile-time gating/optimizations to it in the future.

FYI I wasted about 3 hours debugging latency in my compio-h2 implementation b/c compio-net write didn't have this simple optimization, and I blindly expected it would.

Implementers who want "sharp knives" should be able to use send_zerocopy or send_kernel methods as direct alternatives.

@Berrysoft
Member

> FYI I wasted about ~3 hours debugging latency in my compio-h2 implementation b/c compio-net write didn't have this simple optimization, and I blindly expected it would.

Sorry to hear that. It was a mistake to make write call send_zerocopy unconditionally, and that's why I want to change it back to send. That doesn't prevent downstream optimization at all. For example, you can write a wrapper around compio::net::TcpStream and implement AsyncWrite yourself, calling send_zerocopy with a custom threshold and conditions.
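A minimal sketch of what such a downstream wrapper could look like. compio's real API is async and takes ownership of buffers; to stay std-only, `RawSend` here is a hypothetical synchronous stand-in for the stream's two send paths, and `ThresholdWriter` and `Recorder` are illustrative names, not compio types.

```rust
use std::io;

/// Hypothetical stand-in for the two send paths of a TCP stream.
pub trait RawSend {
    fn send(&mut self, buf: &[u8]) -> io::Result<usize>;
    fn send_zerocopy(&mut self, buf: &[u8]) -> io::Result<usize>;
}

/// Wrapper that picks a send path per write, using a caller-chosen threshold.
pub struct ThresholdWriter<S> {
    inner: S,
    threshold: usize,
}

impl<S: RawSend> ThresholdWriter<S> {
    pub fn new(inner: S, threshold: usize) -> Self {
        Self { inner, threshold }
    }

    /// Writes at or above the threshold go zero-copy; smaller ones are copied.
    pub fn write(&mut self, buf: &[u8]) -> io::Result<usize> {
        if buf.len() >= self.threshold {
            self.inner.send_zerocopy(buf)
        } else {
            self.inner.send(buf)
        }
    }

    pub fn into_inner(self) -> S {
        self.inner
    }
}

/// Test double that records the length of each write per path.
#[derive(Default)]
pub struct Recorder {
    pub regular: Vec<usize>,
    pub zerocopy: Vec<usize>,
}

impl RawSend for Recorder {
    fn send(&mut self, buf: &[u8]) -> io::Result<usize> {
        self.regular.push(buf.len());
        Ok(buf.len())
    }

    fn send_zerocopy(&mut self, buf: &[u8]) -> io::Result<usize> {
        self.zerocopy.push(buf.len());
        Ok(buf.len())
    }
}
```

Because the threshold is a constructor argument, each downstream can pick 8KB, 10KB, or any other cutoff without changes to the underlying stream.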

You might be right about performance in your case, but I would argue that every condition, every branch, and every benchmark has a cost. You will never be able to say that an optimization is general enough. If I set the threshold to 8KB but another user argues it should be 10KB, it becomes hard and complex for us to decide how to satisfy all users. Instead, if we just provide a plain write (which calls send internally) and a plain send_zerocopy without any conditions, users may not be that happy, but at least it will be easy for them to optimize the TcpStream themselves for their specific usage.

> Implementers who want "sharp knives" should be able to use send_zerocopy or send_kernel methods as direct alternatives.

It's always possible to call send_zerocopy directly.

We might consider providing a method to indicate whether the kernel supports send_zerocopy...
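One rough way to approximate such a check is a kernel-version heuristic. This sketch assumes IORING_OP_SEND_ZC is available from Linux 6.0 onward and parses a release string like the one `uname -r` prints; both function names are hypothetical, and a real implementation would more likely probe the ring for supported opcodes than trust a version number.

```rust
/// Extracts (major, minor) from a release string such as "6.1.0-13-amd64".
/// Returns None if the string does not start with two numeric components.
pub fn parse_kernel_version(release: &str) -> Option<(u32, u32)> {
    let mut parts = release.split('.');
    let major = parts.next()?.parse().ok()?;
    // The minor component may carry a suffix, e.g. "0-rc1".
    let digits: String = parts
        .next()?
        .chars()
        .take_while(|c| c.is_ascii_digit())
        .collect();
    let minor = digits.parse().ok()?;
    Some((major, minor))
}

/// Heuristic only: true when the release string is 6.0 or newer.
/// Assumption: SEND_ZC support starts at Linux 6.0.
pub fn likely_supports_send_zc(release: &str) -> bool {
    matches!(parse_kernel_version(release), Some(v) if v >= (6, 0))
}
```

Unparseable strings are treated as unsupported, which falls back to the regular send path.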

@johnnyshields johnnyshields changed the title Performance: compio-net -- avoid send_zerocopy for small TCP writes performance(compio-net): avoid send_zerocopy for small TCP writes Mar 17, 2026
@johnnyshields
Contributor Author

johnnyshields commented Mar 17, 2026

How about the following:

  • Revert write to alias to send as you suggest (write = kernel-copied send, names follow POSIX nomenclature)
  • Introduce write_zerocopy, which is what write is on master today, i.e. the buffer-blocking variant of send_zerocopy (non-controversial I think)
  • Add write_adaptive which is my threshold method in this PR.
  • (and add *_vectored equivalents for all the above)

Now, regarding the 8KB threshold, I doubt we will get into heated debates over this 😄:

  • Empirically my h2 benchmarking shows 8KB works well, in a way I doubt is "specific" to just H2.
  • Linux and gRPC have references to 10KB and 16KB for MSG_ZEROCOPY (which uses the socket error queue), which is different from SEND_ZC (CQE-based, lower overhead).
  • It would be possible to make it tunable if someone really wants to, which could be as simple as adding a function arg which the downstream can pass in.
  • If we learn that 8KB is a bad threshold for a certain OS--or that ZC isn't supported at all--then we compile-time override it for that OS. (All downstreams would reap the benefit of such an override, rather than forcing each downstream to discover and inconsistently handle OS edge cases.)
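The last two bullets (a tunable threshold plus a compile-time override per OS) could be sketched as follows. This is an illustration under assumptions: the constant and function names are hypothetical, and using `None` to mean "never take the zero-copy path" on non-Linux targets is one possible encoding, not something decided in this thread.

```rust
/// Per-platform default, chosen at compile time.
/// `None` means "never use the zero-copy path on this target".
pub const DEFAULT_ZEROCOPY_THRESHOLD: Option<usize> = if cfg!(target_os = "linux") {
    Some(8 * 1024)
} else {
    None
};

/// Tunable variant: the caller passes the threshold explicitly.
pub fn use_zerocopy(len: usize, threshold: Option<usize>) -> bool {
    match threshold {
        Some(t) => len >= t,
        None => false,
    }
}

/// Default variant: uses the compile-time value for the current target.
pub fn use_zerocopy_default(len: usize) -> bool {
    use_zerocopy(len, DEFAULT_ZEROCOPY_THRESHOLD)
}
```

A downstream that prefers a 10KB cutoff would call `use_zerocopy(len, Some(10 * 1024))`, while everyone else gets the per-OS default.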

@Berrysoft
Copy link
Member

The first suggestion will be solved in #770, and the following ones could be done yourself in your own project.

@johnnyshields
Contributor Author

Hmmm... it's your project so it's your call, but most users who would consider compio are considering it specifically because of io_uring's net efficiency. Without write_adaptive (or similar) in compio-net, every downstream that uses AsyncWrite will either miss out on SEND_ZC entirely, or independently re-implement the same threshold logic. So it just feels like a waste. But again--your call.

@Berrysoft
Member

> every downstream that uses AsyncWrite will either miss out on SEND_ZC entirely, or independently re-implement the same threshold logic.

Well, yes. That's something of a compromise. In my opinion, I would like them to re-implement the same threshold logic independently. compio is a fundamental crate. It should provide more opportunities rather than a "complete and beautiful" solution (and that's roughly what tokio does - I will not be surprised if someday they optimize their send with MSG_ZEROCOPY). We don't provide our own implementation of basic macros (e.g., join or select) because futures has already provided them. We don't provide our own mutexes or channels. We provide low-level driver APIs rather than hiding them inside the high-level runtime. The same logic applies here. We provide simple sockets and platform-specific APIs (actually zerocopy is Linux-only, and we provide a cross-platform API just to make cross-platform code easier), but we should not provide a threshold-based optimization that users could easily implement themselves. Each API is easy to guess and predict - write is write or send, and read is read or recv, that's it. Users who read the API documentation will know that TcpStream supports send_zerocopy, and of course they will not expect that write sometimes calls send_zerocopy - it might unexpectedly benefit some new users, but it will make advanced users mad. An advanced user might want to adjust the threshold, the if condition, or even the whole optimization logic. A simple write method will be more convenient for both of us.

@johnnyshields
Contributor Author

johnnyshields commented Mar 17, 2026

OK, that's fine if that's compio's philosophy: "a literal wrapper around the kernel." I would just ask to make that philosophy crystal-clear in the README and other places.

I do think that as compio's adoption grows, that philosophy will fight against what most implementers want/expect--and there's no reason compio can't provide a "near-optimal default" AND "sharp knives" at the same time.

FYI tokio doesn't use zerocopy in their core at all--it is in their io-uring crate (which has low adoption).

@Berrysoft
Member

Well, things always change. I cannot be sure what will happen in the future, but currently there isn't even another successful example for us to refer to, and I don't think it's easy to provide a balanced, zero-cost, optimized, sharp knife.

Other references:

  • tokio-uring, only provides send_zc for UDP sockets.
  • monoio, provides a wrong implementation (which passes MSG_ZEROCOPY to SEND).
  • glommio, does nothing.
