performance(compio-net): avoid send_zerocopy for small TCP writes #771
johnnyshields wants to merge 2 commits into compio-rs:master
Conversation
|
I have created #770 to make zerocopy explicit. It might be better than a threshold constant. |
|
Hmmm... I think it is good to have both an explicit |
|
I mean, I don't want
So I want to push the optimization for large buffers to downstreams instead. |
I would disagree. IMHO the best pattern is for
FYI I wasted about 3 hours debugging latency in my compio-h2 implementation b/c compio-net
Implementers who want "sharp knives" should be able to use |
Sorry to hear that. It's a fault to make
You might be right considering your performance, but I would argue that every condition, every branch, and every benchmark has a cost. You will never be able to say that an optimization is general enough. If I set the threshold to 8KB, but another user argues it should be 10KB, it's hard and complex for us to decide how to satisfy all users. Instead, if we just provide a plain
It's always possible to call
We might consider providing a method to indicate whether the kernel supports |
|
How about the following:
Now, regarding the 8kb threshold, I doubt we will get into heated debates over this 😄:
|
|
The first suggestion will be solved in #770, and you could do the following ones yourself in your own project. |
|
Hmmm... it's your project so it's your call, but most users who would consider compio are considering it specifically because of io_uring's net efficiency. Without |
Well, yes. That's something of a compromise. In my opinion, I would like them to re-implement the same threshold logic independently. |
|
OK, that's fine if that's compio's philosophy: "a literal wrapper around the kernel." I would just ask to make that philosophy crystal-clear in the README and other places. I do think that as compio's adoption grows, that philosophy will fight against what most implementers want and expect, and there's no reason compio can't provide a "near-optimal default" AND "sharp knives" at the same time. FYI tokio doesn't use zerocopy in their core at all; it is in their io-uring crate (which has low adoption). |
|
Well, things always change. I cannot be sure what will happen in the future, but currently there isn't even another successful example for us to refer to, and I don't think it's easy to provide a balanced, zero-cost, optimized, sharp knife. Other references:
|
Fix/Performance for compio-net: avoid send_zerocopy for small TCP writes
Fix
Add a ZEROCOPY_THRESHOLD (8KB). Writes below the threshold use regular send (IORING_OP_SEND, 1 CQE), which completes immediately without waiting for the peer's ACK. Writes at or above the threshold continue using send_zerocopy where the kernel copy savings justify the extra CQE round-trip.
Applied to both write and write_vectored. The send_zerocopy/send_zerocopy_vectored public APIs are unaffected; only the AsyncWrite trait impl uses the threshold.
This fixes two problems at once:
Problem 1
For small data transfers, traditional kernel copies usually outperform zero-copy. See the benchmarks below. TL;DR: it's a 1.5-2.5x speedup for transfers under 8KB.
Problem 2
send_zerocopy (IORING_OP_SEND_ZC) requires two CQEs: one for send completion and one for buffer release. The buffer-release CQE is only generated after the peer ACKs the data. On Linux, TCP delayed ACKs can defer this by ~40ms, but only for small writes. Large writes produce multiple MSS-sized segments, and the receiver's delayed-ACK logic fast-ACKs after every second full segment. A small write (e.g. a 13-byte H2 WINDOW_UPDATE) arrives as a single undersized segment with no companion to trigger fast-ACK, so the receiver's delayed-ACK timer runs to expiration. This causes small standalone writes to stall for the full ~40ms waiting for the second CQE.
I found this while working on H2 support (for gRPC): it was tripping up 13-byte H2 WINDOW_UPDATE frames and caused a 3700x latency cliff. The root cause chain:
AsyncWrite::write → send_zerocopy → SEND_ZC submitted to io_uring
Benchmark results
I added a new /benches/ dir in this PR. To run it:
cargo bench -p compio-net --bench tcp_write
The 2-CQE overhead of zerocopy send adds ~8-9µs per small write in steady state. For writes below 8KB, regular send is consistently faster. At 8KB and above, zerocopy breaks even or wins due to kernel copy savings.
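The dispatch rule this implies can be sketched as follows. This is a minimal illustration, not the PR's actual code: only ZEROCOPY_THRESHOLD and its 8KB value come from the description, while SendPath, choose_send_path, and choose_send_path_vectored are hypothetical names. Deciding vectored writes on the summed buffer length is also an assumption.

```rust
/// Threshold from the PR description (8KB). Writes below it use regular send.
const ZEROCOPY_THRESHOLD: usize = 8 * 1024;

/// Hypothetical enum naming the two submission paths discussed in the PR.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum SendPath {
    /// Regular send (IORING_OP_SEND): one CQE, kernel copies the buffer.
    Copy,
    /// Zero-copy send (IORING_OP_SEND_ZC): two CQEs, no kernel copy.
    ZeroCopy,
}

/// Pick a path for a single contiguous buffer of `len` bytes.
fn choose_send_path(len: usize) -> SendPath {
    if len < ZEROCOPY_THRESHOLD {
        SendPath::Copy
    } else {
        SendPath::ZeroCopy
    }
}

/// Pick a path for a vectored write.
/// Assumption: the decision is based on the total length of all buffers.
fn choose_send_path_vectored(bufs: &[&[u8]]) -> SendPath {
    let total: usize = bufs.iter().map(|b| b.len()).sum();
    choose_send_path(total)
}
```

Under this sketch a 13-byte WINDOW_UPDATE frame takes the copy path, while an 8192-byte write takes the zero-copy path.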
(The ~40ms delayed-ACK stall issue is separate from the steady-state overhead above; it affects p99 latency more than median.)
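As a rough illustration of why only small writes hit the delayed-ACK stall, the receiver-side rule from Problem 2 ("fast-ACK after every second full segment") can be modeled as below. This is a deliberately simplified sketch: the function names are hypothetical, the 1460-byte MSS is an assumed value typical of Ethernet, and real kernel behavior depends on more state than the segment count.

```rust
/// Assumed MSS for illustration only (typical for a 1500-byte MTU).
const ASSUMED_MSS: usize = 1460;

/// Number of full MSS-sized segments produced by a write of `len` bytes.
fn full_segments(len: usize, mss: usize) -> usize {
    assert!(mss > 0, "MSS must be non-zero");
    len / mss
}

/// Simplified model: the receiver ACKs promptly once it has seen two full
/// segments; otherwise the ACK may wait for the delayed-ACK timer.
fn may_wait_for_delayed_ack(len: usize, mss: usize) -> bool {
    full_segments(len, mss) < 2
}
```

In this model a 13-byte write yields zero full segments and may wait for the timer, whereas an 8192-byte write yields five full segments and is ACKed promptly. The 8KB threshold itself comes from the benchmark break-even point, not from this model.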