fix(proto): tail-loss probes should always be ack-eliciting #561
Conversation
The `PacketBuilder` gets `is_ack_eliciting` from the `SendableFrames`, but that is empty if data had to be queued specifically for the probe. As a result, `PacketBuilder::finish_and_track` did not record the correct number of in-flight bytes for the congestion controller, and it also did not update `PacketNumberSpace::time_of_last_ack_eliciting_packet`. That field in turn is used by the PTO calculation as the time from which to arm the next PTO timeout. To fix this, every frame now carries information about whether it is ack-eliciting, and if any ack-eliciting frame is written, the `PacketBuilder` keeps track of this internally.
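The shape of the fix can be sketched as follows. This is a minimal, hypothetical model (the `Frame` and `PacketBuilder` names mirror the description, but the actual noq API will differ): each frame reports whether it is ack-eliciting, and the builder accumulates that flag itself instead of trusting the possibly-empty `SendableFrames`.

```rust
// Hypothetical sketch of the fix described above; not the actual noq code.
#[derive(Debug)]
enum Frame {
    Ack,
    Padding,
    Ping,
}

impl Frame {
    /// Per RFC 9002, all frames other than ACK, PADDING and
    /// CONNECTION_CLOSE are ack-eliciting.
    fn is_ack_eliciting(&self) -> bool {
        !matches!(self, Frame::Ack | Frame::Padding)
    }
}

struct PacketBuilder {
    ack_eliciting: bool,
}

impl PacketBuilder {
    fn new() -> Self {
        Self { ack_eliciting: false }
    }

    fn write(&mut self, frame: &Frame) {
        // Track internally: this stays correct even when data was queued
        // specifically for a probe and `SendableFrames` is empty.
        self.ack_eliciting |= frame.is_ack_eliciting();
        // ... actual frame encoding elided ...
    }
}

fn main() {
    let mut builder = PacketBuilder::new();
    builder.write(&Frame::Ack);
    assert!(!builder.ack_eliciting);
    builder.write(&Frame::Ping); // a PING probe makes the packet ack-eliciting
    assert!(builder.ack_eliciting);
}
```

With the flag tracked at write time, `finish_and_track` can consult it to count in-flight bytes and stamp `time_of_last_ack_eliciting_packet` correctly.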
Documentation for this PR has been generated and is available at: https://n0-computer.github.io/noq/pr/561/docs/noq/
Last updated: 2026-04-02T16:41:19Z
Performance Comparison Report
Raw Benchmarks (localhost)
| Scenario | noq | upstream | Delta | CPU (avg/max) |
|---|---|---|---|---|
| large-single | 5644.6 Mbps | 8287.8 Mbps | -31.9% | 96.2% / 107.0% |
| medium-concurrent | 5823.9 Mbps | 7987.7 Mbps | -27.1% | 95.7% / 106.0% |
| medium-single | 4022.1 Mbps | 4716.9 Mbps | -14.7% | 98.4% / 163.0% |
| small-concurrent | 3902.1 Mbps | 5296.1 Mbps | -26.3% | 93.8% / 108.0% |
| small-single | 3635.7 Mbps | 4676.4 Mbps | -22.3% | 93.7% / 108.0% |
Netsim Benchmarks (network simulation)
| Condition | noq | upstream | Delta |
|---|---|---|---|
| ideal | 3184.4 Mbps | 4009.6 Mbps | -20.6% |
| lan | 782.4 Mbps | 810.3 Mbps | -3.4% |
| lossy | 69.8 Mbps | 55.9 Mbps | +25.0% |
| wan | 83.8 Mbps | 83.8 Mbps | ~0% |
Summary
noq is 24.4% slower on average
c187201cb12f3629419d35a9cd7ff3d8d94a767b - artifacts
Raw Benchmarks (localhost)
| Scenario | noq | upstream | Delta | CPU (avg/max) |
|---|---|---|---|---|
| large-single | 5785.7 Mbps | 8088.1 Mbps | -28.5% | 98.6% / 159.0% |
| medium-concurrent | 5715.7 Mbps | 7290.7 Mbps | -21.6% | 97.9% / 159.0% |
| medium-single | 3827.5 Mbps | 4625.2 Mbps | -17.2% | 93.6% / 108.0% |
| small-concurrent | 3889.4 Mbps | 5274.4 Mbps | -26.3% | 95.5% / 109.0% |
| small-single | 3656.9 Mbps | 4650.8 Mbps | -21.4% | 90.5% / 97.9% |
Netsim Benchmarks (network simulation)
| Condition | noq | upstream | Delta |
|---|---|---|---|
| ideal | 3063.1 Mbps | 3605.7 Mbps | -15.0% |
| lan | 782.0 Mbps | 796.4 Mbps | -1.8% |
| lossy | 69.8 Mbps | 55.9 Mbps | +25.0% |
| wan | 83.8 Mbps | 83.8 Mbps | ~0% |
Summary
noq is 22.0% slower on average
This is enough to fix the test that was failing. But really #556 also needs fixing.
This is ready for review now.
```diff
 // Loss probes and CONNECTION_CLOSE should be subject to pacing, even though
 // they are not congestion controlled.
-trace!(?space_id, %path_id, "blocked by pacing");
+trace!(?space_id, %path_id, delay = ?(resume_time - now), "blocked by pacing");
```
Not a fan of this, since `resume_time` might be earlier than `now` (it can't here, but logic errors happen), and the standard library does not guarantee that `resume_time - now` won't panic in that case. One simple option is just using `saturating_duration_since`.
A more involved but better option is to make `pacing_delay` return a delay (a `Duration` is what one would expect from the name), then set the timer by adding it to `now`, and log the delay without any subtraction at all. From what I can tell this is only used here and in tests. That said, I understand this is a change out of scope for the PR.
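The simpler suggestion can be illustrated in isolation: `Instant::saturating_duration_since` is a standard-library method that clamps to `Duration::ZERO` when the argument is later than `self`, instead of risking a panic.

```rust
use std::time::{Duration, Instant};

fn main() {
    let now = Instant::now();

    // An instant 5 ms in the past: `now.saturating_duration_since(later)`
    // would clamp to zero, whereas plain subtraction of a later instant
    // from an earlier one may panic.
    let earlier = now - Duration::from_millis(5);
    assert_eq!(earlier.saturating_duration_since(now), Duration::ZERO);

    // The normal case still yields the expected positive delay.
    let later = now + Duration::from_millis(5);
    assert_eq!(later.saturating_duration_since(now), Duration::from_millis(5));
}
```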
```rust
// The client is confirmed as soon as it managed to compute 1-RTT keys.
self.crypto_state.has_keys(EncryptionLevel::OneRtt)
```
RFC 9001 says:

> At the client, the handshake is considered confirmed when a HANDSHAKE_DONE frame is received.
> Additionally, a client MAY consider the handshake to be confirmed when it receives an acknowledgment for a 1-RTT packet.

Not sure how these lines relate to that.
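For reference, the rule the RFC quote describes could be modeled like this (field names are hypothetical, not noq's): having 1-RTT keys alone is not what confirms the handshake on the client side.

```rust
// Hypothetical sketch of RFC 9001's client-side "handshake confirmed" rule.
struct ClientHandshakeState {
    /// Set when a HANDSHAKE_DONE frame is received.
    handshake_done_received: bool,
    /// Set when an acknowledgment for a 1-RTT packet is received
    /// (a client MAY also treat this as confirmation).
    one_rtt_packet_acked: bool,
}

impl ClientHandshakeState {
    fn is_confirmed(&self) -> bool {
        self.handshake_done_received || self.one_rtt_packet_acked
    }
}

fn main() {
    // Merely computing 1-RTT keys corresponds to neither flag being set,
    // so the handshake is not yet confirmed.
    let s = ClientHandshakeState {
        handshake_done_received: false,
        one_rtt_packet_acked: false,
    };
    assert!(!s.is_confirmed());

    let s = ClientHandshakeState {
        handshake_done_received: true,
        one_rtt_packet_acked: false,
    };
    assert!(s.is_confirmed());
}
```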
Description
Tail-loss probes should also not be sent in the Data space as long as
the handshake has not yet been confirmed. During this time, the Initial
and Handshake spaces' tail-loss probes are the ones that drive the
connection forward, while tail-loss probes in the Data space needlessly
consume congestion controller window and pacing tokens. Moreover, it is
not even known yet whether the peer has the keys to decrypt them.
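The gating rule amounts to something like the following (the function name is hypothetical; the actual noq code will express this differently):

```rust
// Hypothetical sketch of gating Data-space tail-loss probes on
// handshake confirmation.
enum SpaceId {
    Initial,
    Handshake,
    Data,
}

/// Whether a tail-loss probe may be scheduled in the given packet
/// number space. Before the handshake is confirmed, only the Initial
/// and Handshake spaces probe; Data-space probes would waste congestion
/// window and pacing tokens on packets the peer may not yet decrypt.
fn may_send_tail_loss_probe(space: SpaceId, handshake_confirmed: bool) -> bool {
    match space {
        SpaceId::Data => handshake_confirmed,
        SpaceId::Initial | SpaceId::Handshake => true,
    }
}

fn main() {
    assert!(!may_send_tail_loss_probe(SpaceId::Data, false));
    assert!(may_send_tail_loss_probe(SpaceId::Data, true));
    assert!(may_send_tail_loss_probe(SpaceId::Handshake, false));
}
```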
Breaking Changes
n/a
Notes & open questions
n/a