BREAKING-CHANGE Are Chunked Responses Being Handled Wrong? #818

@dnwiebe

Description

A Route consists of a Vec of CryptData objects. Most of those CryptData objects, when decrypted, are LiveHops, containing the public key of the destination, a Payer object, and a Component enum telling which actor in the target Node will be handling the data. However, the very last CryptData object in a Route can't be decrypted into a LiveHop, because it's actually just a single u32 that's known as a Return Route ID.
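The shape described above can be sketched roughly as follows. This is a simplified illustration, not the Node's real code: the `CryptData` and `Route` definitions here are stand-ins, encryption is left out entirely, and the assumption that the Return Route ID is stored as four big-endian bytes is made only so the example can run.

```rust
use std::convert::TryInto;

// Hypothetical stand-in: an opaque blob that is normally encrypted.
struct CryptData(Vec<u8>);

// Hypothetical stand-in: a Route is a Vec of CryptData objects.
struct Route {
    hops: Vec<CryptData>,
}

// Reads the last element of the Route as a Return Route ID.
// ASSUMPTION: the ID is encoded as exactly four big-endian bytes; the
// real encoding (and the decryption step) is not shown in this issue.
fn return_route_id(route: &Route) -> Option<u32> {
    let last = route.hops.last()?;
    let bytes: [u8; 4] = last.0.as_slice().try_into().ok()?;
    Some(u32::from_be_bytes(bytes))
}
```

Every element before the last would decrypt to a LiveHop instead; only the final element is read this way.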

When a ClientResponsePayload_0v1 comes back to the originating Node from the exit Node (refer to ProxyServer::handle_client_response_payload()), we use the Return Route ID at the very end of the Route to look up the return route information in the Proxy Server. This return route information was stored when we sent the request that resulted in the ClientResponsePayload_0v1, so that we'd know what to do when the response came back. Storing it gave us back an ID, which we put at the end of the Route where we'd see it when the response arrived.

The return route information contains almost everything we need to pay what we owe to all the Nodes along the return route, including each of those Nodes' RatePacks; but we can't compute an actual amount of money until we know how much data they processed for us, which we find out when the ClientResponsePayload_0v1 arrives. So after we use the Return Route ID in the Route to look up the return route information in the Proxy Server's HashMap (actually a TtlHashMap, more on that later), we can combine the two pieces of information and report the services consumed to the Accountant.
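A minimal sketch of that lookup-and-combine step, under stated assumptions: the struct names, the `service_rate` and `byte_rate` fields, and the "flat rate plus per-byte rate" formula are all hypothetical placeholders chosen for illustration, and a plain `HashMap` stands in for the TtlHashMap.

```rust
use std::collections::HashMap;

// Hypothetical, simplified rate pack: a flat charge plus a per-byte charge.
#[derive(Debug, Clone, Copy, PartialEq)]
struct RatePack {
    service_rate: u64,
    byte_rate: u64,
}

// Hypothetical: one Node on the return route that we will owe money.
struct ExpectedService {
    earning_wallet: String,
    rate_pack: RatePack,
}

// Hypothetical: what was stored when the request was sent.
struct ReturnRouteInfo {
    expected_services: Vec<ExpectedService>,
}

#[derive(Debug, PartialEq)]
struct ServiceReport {
    earning_wallet: String,
    amount: u64,
}

// Combines stored return route information with the size of the payload
// that just arrived. Returns None when the ID is unknown, which is the
// situation behind the "bogus return-route ID" log.
fn services_consumed(
    store: &HashMap<u32, ReturnRouteInfo>,
    return_route_id: u32,
    payload_size: u64,
) -> Option<Vec<ServiceReport>> {
    let info = store.get(&return_route_id)?;
    let reports = info
        .expected_services
        .iter()
        .map(|service| ServiceReport {
            earning_wallet: service.earning_wallet.clone(),
            // ASSUMPTION: charge = flat service rate + byte rate * bytes.
            amount: service.rate_pack.service_rate
                + service.rate_pack.byte_rate * payload_size,
        })
        .collect();
    Some(reports)
}
```

The point of the sketch is that neither input is enough alone: the rates come from the stored information, and the byte count comes from the response.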

So here's the issue: when should the Proxy Server forget the return route information it has stored? When does the Proxy Server forget the return route information it has stored? Are those two the same?

At a high level, here are the answers to the first question:

  1. When we're done receiving the response to the request that resulted in the return route information being stored
  2. When the stream over which the request was sent is terminated, even if no response has yet arrived
  3. After a decent amount of time has elapsed, in case the server never responded or the response (or request) got lost along the way, so that we don't endlessly accumulate useless data in the Proxy Server

However, #1 is deceptive in the face of chunked HTTP responses (and possibly similar behavior in protocols other than HTTP). If we remove the return route information (RRI) immediately after receiving the first chunk of a response, then every later chunk will arrive and find no RRI, and we will have no idea how to pay the return-route bill for it.
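The failure mode can be shown with a toy store. This is not a claim about what the Proxy Server currently does (that is what this card asks someone to find out); `ReturnRouteStore`, `take_on_response`, and `peek_on_response` are hypothetical names, and a `String` stands in for the stored information.

```rust
use std::collections::HashMap;

// Hypothetical toy store keyed by Return Route ID.
struct ReturnRouteStore {
    entries: HashMap<u32, String>,
}

impl ReturnRouteStore {
    // Policy A: forget the entry as soon as any response uses it.
    // The first chunk succeeds; every later chunk gets None.
    fn take_on_response(&mut self, id: u32) -> Option<String> {
        self.entries.remove(&id)
    }

    // Policy B: leave the entry in place, so later chunks still find it.
    // Something else must then be responsible for eventual removal.
    fn peek_on_response(&self, id: u32) -> Option<&String> {
        self.entries.get(&id)
    }
}
```

With Policy A, a two-chunk response produces one successful lookup and one miss; with Policy B both chunks succeed, at the cost of needing another removal trigger.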

We're not sure whether #2 is implemented or not; we've looked briefly at the StreamKey-retirement code and haven't found anything about RRI, but we haven't searched exhaustively.

#3 is implemented by the TtlHashMap, which keeps each entry for only two minutes and then removes it. #3 may be taking care of #2 as well.
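For reference, here is a rough sketch of how a time-limited map can cover both #3 (expiry) and #2 (explicit removal). It is not the real TtlHashMap: `TtlStore` and its methods are hypothetical, the caller passes in the current time so the behavior is easy to test, and the expiry clock here is fixed at insertion (whether the real map restarts it on reads is not stated in this issue).

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

// Hypothetical stand-in for a TTL-limited map keyed by Return Route ID.
struct TtlStore<V> {
    ttl: Duration,
    entries: HashMap<u32, (Instant, V)>,
}

impl<V> TtlStore<V> {
    fn new(ttl: Duration) -> Self {
        TtlStore { ttl, entries: HashMap::new() }
    }

    fn insert(&mut self, id: u32, value: V, now: Instant) {
        self.entries.insert(id, (now, value));
    }

    // Case #3: an entry older than the TTL is dropped and reported missing.
    // ASSUMPTION: the clock is not restarted by a successful lookup.
    fn get(&mut self, id: u32, now: Instant) -> Option<&V> {
        let expired = match self.entries.get(&id) {
            Some((stored_at, _)) => now.duration_since(*stored_at) >= self.ttl,
            None => return None,
        };
        if expired {
            self.entries.remove(&id);
            return None;
        }
        self.entries.get(&id).map(|(_, value)| value)
    }

    // Case #2: explicit removal, e.g. when the stream is retired.
    fn remove(&mut self, id: u32) -> Option<V> {
        self.entries.remove(&id).map(|(_, value)| value)
    }
}
```

Under this sketch, a chunk that arrives 121 seconds after the request was stored finds nothing, even if earlier chunks for the same request arrived only moments before.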

What we do know is that we're receiving too many logs like these from Netflix:

2025-05-25 23:13:02.380 Thd8: ERROR: ProxyServer: Can't report services consumed: received response with bogus return-route ID 2257 for client response. Ignoring
2025-05-25 23:13:02.404 Thd8: ERROR: ProxyServer: Can't report services consumed: received response with bogus return-route ID 2255 for client response. Ignoring
2025-05-25 23:13:02.448 Thd8: ERROR: ProxyServer: Can't report services consumed: received response with bogus return-route ID 2256 for client response. Ignoring
2025-05-25 23:13:02.504 Thd8: ERROR: ProxyServer: Can't report services consumed: received response with bogus return-route ID 2254 for client response. Ignoring
2025-05-25 23:13:03.195 Thd8: ERROR: ProxyServer: Can't report services consumed: received response with bogus return-route ID 2257 for client response. Ignoring
2025-05-25 23:13:03.221 Thd8: ERROR: ProxyServer: Can't report services consumed: received response with bogus return-route ID 2255 for client response. Ignoring
2025-05-25 23:13:03.266 Thd8: ERROR: ProxyServer: Can't report services consumed: received response with bogus return-route ID 2256 for client response. Ignoring
2025-05-25 23:13:03.336 Thd8: ERROR: ProxyServer: Can't report services consumed: received response with bogus return-route ID 2254 for client response. Ignoring

That's what happens when a response arrives with a Return Route ID in its Route, but the Proxy Server has forgotten the corresponding return route information. Notice that these eight logs form four pairs: each ID is searched for twice in vain, with the two attempts occurring a little less than a second apart. This suggests that it's not just a random problem with late single responses arriving after the 120-second timeout; there's probably something deeper going on here.
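The pairing is easy to check mechanically on a larger log. The helper below is a small sketch for that purpose only; the function names are made up here, and the parsing assumes lines shaped exactly like the eight shown above (date, then `HH:MM:SS.mmm`, then a message containing `return-route ID <number>`).

```rust
use std::collections::BTreeMap;

// Parses the second whitespace-separated field ("23:13:02.380") into
// milliseconds since midnight. Returns None for lines of another shape.
fn millis_of_day(line: &str) -> Option<u64> {
    let time = line.split_whitespace().nth(1)?;
    let (hms, millis) = time.split_once('.')?;
    let mut parts = hms.split(':').map(|p| p.parse::<u64>());
    let hours = parts.next()?.ok()?;
    let minutes = parts.next()?.ok()?;
    let seconds = parts.next()?.ok()?;
    let millis = millis.parse::<u64>().ok()?;
    Some(((hours * 60 + minutes) * 60 + seconds) * 1000 + millis)
}

// Extracts the number that follows "return-route ID ".
fn bogus_return_route_id(line: &str) -> Option<u32> {
    let rest = line.split("return-route ID ").nth(1)?;
    rest.split_whitespace().next()?.parse().ok()
}

// Groups the times of failed lookups by Return Route ID.
fn misses_by_id(log: &str) -> BTreeMap<u32, Vec<u64>> {
    let mut misses: BTreeMap<u32, Vec<u64>> = BTreeMap::new();
    for line in log.lines() {
        if let (Some(at), Some(id)) = (millis_of_day(line), bogus_return_route_id(line)) {
            misses.entry(id).or_insert_with(Vec::new).push(at);
        }
    }
    misses
}
```

Applied to the eight lines above, this gives two misses for each of the four IDs, with gaps of 815 ms (2257), 817 ms (2255), 818 ms (2256), and 832 ms (2254).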

Task: Investigate the ProxyServer and related code and find out how return route information is actually removed from the ProxyServer, and determine how the design should change to eliminate logs like the ones above. We should not use loads and loads of memory remembering RRI forever, but we should also be able to handle HTTP chunked responses and other protocols that send multiple response packets for a single request packet.

Do not fix the code as part of this card; create a card or series of cards to solve whatever problems you find.

Status

📃 Quality Assurance Unfinished
