fix(proto): tail-loss probes should always be ack-eliciting#561

Open
flub wants to merge 2 commits into main from flub/pto-always-ack-eliciting

Conversation

Collaborator

@flub flub commented Apr 2, 2026

Description

The PacketBuilder gets is_ack_eliciting from the SendableFrames, but
that is empty if data had to be queued specifically for the probe. In
that case PacketBuilder::finish_and_track did not record the correct
number of in-flight bytes for the congestion controller and also did
not update PacketNumberSpace::time_of_last_ack_eliciting_packet. That
field is in turn used by the PTO calculation as the time from which to
arm the next PTO timeout.

To fix this, every frame now carries information about whether it is
ack-eliciting. If the PacketBuilder writes any ack-eliciting frame, it
keeps track of this internally.
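
A minimal sketch of that idea (type and method names here are illustrative, not the PR's actual code): each frame reports whether it is ack-eliciting, and the builder accumulates that flag itself instead of trusting a possibly-empty SendableFrames summary.

```rust
// Hypothetical sketch, not the real PacketBuilder API.
enum Frame {
    Ack,
    Padding,
    Stream,
    Ping,
}

impl Frame {
    // Per RFC 9002, all frames other than ACK, PADDING and
    // CONNECTION_CLOSE are ack-eliciting.
    fn is_ack_eliciting(&self) -> bool {
        !matches!(self, Frame::Ack | Frame::Padding)
    }
}

struct PacketBuilder {
    ack_eliciting: bool,
}

impl PacketBuilder {
    fn new() -> Self {
        Self { ack_eliciting: false }
    }

    // The builder updates its own flag every time a frame is written,
    // so a probe whose data was queued late is still counted.
    fn write_frame(&mut self, frame: Frame) {
        self.ack_eliciting |= frame.is_ack_eliciting();
    }
}

fn main() {
    // A probe packet containing only a PING is still ack-eliciting.
    let mut builder = PacketBuilder::new();
    builder.write_frame(Frame::Ack);
    assert!(!builder.ack_eliciting);
    builder.write_frame(Frame::Ping);
    assert!(builder.ack_eliciting);
    println!("ok");
}
```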

Tail-loss probes should also not be sent in the Data space as long as
the handshake has not yet been confirmed. During this time the
tail-loss probes of the Initial and Handshake spaces are the ones that
drive the connection forward, while probes in the Data space needlessly
consume congestion controller window and pacing tokens. Worse, the peer
may not even have the keys to decrypt them yet.
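
That gating rule can be sketched roughly as follows (a simplified illustration with made-up names, not the PR's code):

```rust
// Hypothetical sketch of the gating logic: no tail-loss probes in the
// application-data (1-RTT) space until the handshake is confirmed.
#[derive(Clone, Copy)]
enum Space {
    Initial,
    Handshake,
    Data,
}

fn may_send_tail_loss_probe(space: Space, handshake_confirmed: bool) -> bool {
    match space {
        // Before confirmation, probes in these spaces drive the
        // handshake forward, so they are always allowed.
        Space::Initial | Space::Handshake => true,
        // Data-space probes wait for confirmation: the peer may not
        // yet have 1-RTT keys, and they would burn cwnd/pacing budget.
        Space::Data => handshake_confirmed,
    }
}

fn main() {
    assert!(!may_send_tail_loss_probe(Space::Data, false));
    assert!(may_send_tail_loss_probe(Space::Data, true));
    assert!(may_send_tail_loss_probe(Space::Handshake, false));
    println!("ok");
}
```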

Breaking Changes

n/a

Notes & open questions

n/a

@flub flub marked this pull request as draft April 2, 2026 12:34
github-actions bot commented Apr 2, 2026

Documentation for this PR has been generated and is available at: https://n0-computer.github.io/noq/pr/561/docs/noq/

Last updated: 2026-04-02T16:41:19Z

github-actions bot commented Apr 2, 2026

Performance Comparison Report

cdce1f8837fffff61503a935e31466b885e987ea - artifacts

Raw Benchmarks (localhost)

| Scenario | noq | upstream | Delta | CPU (avg/max) |
| --- | --- | --- | --- | --- |
| large-single | 5644.6 Mbps | 8287.8 Mbps | -31.9% | 96.2% / 107.0% |
| medium-concurrent | 5823.9 Mbps | 7987.7 Mbps | -27.1% | 95.7% / 106.0% |
| medium-single | 4022.1 Mbps | 4716.9 Mbps | -14.7% | 98.4% / 163.0% |
| small-concurrent | 3902.1 Mbps | 5296.1 Mbps | -26.3% | 93.8% / 108.0% |
| small-single | 3635.7 Mbps | 4676.4 Mbps | -22.3% | 93.7% / 108.0% |

Netsim Benchmarks (network simulation)

| Condition | noq | upstream | Delta |
| --- | --- | --- | --- |
| ideal | 3184.4 Mbps | 4009.6 Mbps | -20.6% |
| lan | 782.4 Mbps | 810.3 Mbps | -3.4% |
| lossy | 69.8 Mbps | 55.9 Mbps | +25.0% |
| wan | 83.8 Mbps | 83.8 Mbps | ~0% |

Summary

noq is 24.4% slower on average

---
c187201cb12f3629419d35a9cd7ff3d8d94a767b - artifacts

Raw Benchmarks (localhost)

| Scenario | noq | upstream | Delta | CPU (avg/max) |
| --- | --- | --- | --- | --- |
| large-single | 5785.7 Mbps | 8088.1 Mbps | -28.5% | 98.6% / 159.0% |
| medium-concurrent | 5715.7 Mbps | 7290.7 Mbps | -21.6% | 97.9% / 159.0% |
| medium-single | 3827.5 Mbps | 4625.2 Mbps | -17.2% | 93.6% / 108.0% |
| small-concurrent | 3889.4 Mbps | 5274.4 Mbps | -26.3% | 95.5% / 109.0% |
| small-single | 3656.9 Mbps | 4650.8 Mbps | -21.4% | 90.5% / 97.9% |

Netsim Benchmarks (network simulation)

| Condition | noq | upstream | Delta |
| --- | --- | --- | --- |
| ideal | 3063.1 Mbps | 3605.7 Mbps | -15.0% |
| lan | 782.0 Mbps | 796.4 Mbps | -1.8% |
| lossy | 69.8 Mbps | 55.9 Mbps | +25.0% |
| wan | 83.8 Mbps | 83.8 Mbps | ~0% |

Summary

noq is 22.0% slower on average

@n0bot n0bot bot added this to iroh Apr 2, 2026
@github-project-automation github-project-automation bot moved this to 🚑 Needs Triage in iroh Apr 2, 2026
This is enough to fix the test that was failing. But really #556 also
needs fixing.
@flub flub marked this pull request as ready for review April 2, 2026 17:47
@flub
Collaborator Author

flub commented Apr 2, 2026

This is ready for review now.

// Loss probes and CONNECTION_CLOSE should be subject to pacing, even though
// they are not congestion controlled.
trace!(?space_id, %path_id, "blocked by pacing");
trace!(?space_id, %path_id, delay = ?(resume_time - now), "blocked by pacing");
Collaborator

Not a fan of this, since resume_time might be earlier than now (it can't be here, but logic errors happen), and the standard library does not guarantee that subtracting Instants this way won't panic. One simple option is just using saturating_duration_since.

A more involved but better option is to make pacing_delay return... a delay. A Duration is what one would expect from the name; the timer would then be set by adding the delay to now, and logging the delay would require no subtraction at all. From what I can tell this is only used here and in tests. That said, I understand this is a change out of scope for the PR.
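
To illustrate the difference (a standalone sketch, with resume_time constructed locally rather than taken from the pacer):

```rust
use std::time::{Duration, Instant};

fn main() {
    let now = Instant::now();

    // Normal case: resume_time is in the future, both forms agree.
    let resume_time = now + Duration::from_millis(5);
    assert_eq!(resume_time - now, Duration::from_millis(5));
    assert_eq!(
        resume_time.saturating_duration_since(now),
        Duration::from_millis(5)
    );

    // Buggy case: resume_time ended up in the past. A plain
    // `resume_time - later` is not guaranteed not to panic, while the
    // saturating variant simply clamps the result to zero.
    let resume_time = now;
    let later = now + Duration::from_millis(5);
    assert_eq!(resume_time.saturating_duration_since(later), Duration::ZERO);
    println!("ok");
}
```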

Comment on lines +7017 to +7018
// The client is confirmed as soon as it manages to compute 1-RTT keys.
self.crypto_state.has_keys(EncryptionLevel::OneRtt)
Collaborator

RFC 9001 says:

> At the client, the handshake is considered confirmed when a HANDSHAKE_DONE frame is received.
>
> Additionally, a client MAY consider the handshake to be confirmed when it receives an acknowledgment for a 1-RTT packet.

Not sure how these lines relate to that.
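
For comparison, the RFC 9001 client-side rule could be sketched like this (field and type names are illustrative, not the PR's code; under this reading, merely having computed 1-RTT keys is not sufficient):

```rust
// Hypothetical sketch of RFC 9001's client-side confirmation rule.
struct ClientHandshake {
    received_handshake_done: bool,
    acked_1rtt_packet: bool,
}

impl ClientHandshake {
    // RFC 9001: the handshake is confirmed when a HANDSHAKE_DONE frame
    // is received; a client MAY additionally treat an acknowledged
    // 1-RTT packet as confirmation.
    fn is_confirmed(&self) -> bool {
        self.received_handshake_done || self.acked_1rtt_packet
    }
}

fn main() {
    // Having 1-RTT keys alone does not confirm the handshake here.
    let just_keys = ClientHandshake {
        received_handshake_done: false,
        acked_1rtt_packet: false,
    };
    assert!(!just_keys.is_confirmed());

    let done = ClientHandshake {
        received_handshake_done: true,
        acked_1rtt_packet: false,
    };
    assert!(done.is_confirmed());
    println!("ok");
}
```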

@dignifiedquire dignifiedquire moved this from 🚑 Needs Triage to 👀 In review in iroh Apr 7, 2026