Add missing retry strategy reset in streaming #148
Conversation
Can we make a test for that?
call = self._stream_method()
self._retry_strategy.reset()
await sender.send(StreamStartedEvent())
async for msg in call:
    await sender.send(self._transform(msg))
I have a feeling we need to make a flag and put the StreamStartedEvent and the retry_strategy reset inside the loop.
Until we try to iterate over the call, we don't know if it was successful.
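A minimal sketch of that idea, assuming the same internals as the diff above (self._stream_method, self._retry_strategy, sender, self._transform); the flag name is made up, this is not the actual implementation:

call = self._stream_method()
stream_started = False
async for msg in call:
    if not stream_started:
        # Only now do we know the stream actually delivered something.
        self._retry_strategy.reset()
        await sender.send(StreamStartedEvent())
        stream_started = True
    await sender.send(self._transform(msg))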
Especially for some kinds of issues, like when the streaming method on the API fails before sending any data.
I mean, true, but for me this is good enough
StreamStarted sounds like there were messages. StreamOpened might be better in this case.
Mmm, it looks like we need to think a bit more about this, but I think it is very subtle when to consider a stream started. For me, if the stream request was sent, it started, and any error after that is a normal retry. In any case, discussing this could take some time and effort, and I think that is a separate discussion and could be a future improvement, but it is a bug to never reset the retry strategy. So I suggest merging this and opening a different issue for this if we want to change it, because I also think I already spent more time than I should have improving the streaming events 😬, so unless someone else is willing to pick it up, I don't see much happening beyond this PR and #149.
what do you mean?
I thought it was my fault, a misunderstanding.
I was thinking about this, and I don't think putting the reset inside the loop is really an option (let's leave the event for a separate issue, as this PR is not touching on that, but I think the arguments for the reset apply to the event too). A stream could be idle for a long time until something is sent through it. If errors are really delayed until the returned iterator is awaited, then maybe the only option would be to try to get the first element with some very short timeout, but I'm not sure if we have any guarantees. For example, if the other end disappeared, I'm not sure any errors will be raised until the connection really times out for good.
The AI overlords suggest using await call.initial_metadata() to "survey for errors" (or wait_for_connection(), but that one is marked as experimental, and I'm not sure if it is only about the underlying channel connection or if it will also catch other errors that happen after the connection, like a NOT FOUND or something like that).
If the AI overlords are not hallucinating, it seems like it should work.
sequenceDiagram
actor Client
actor Server
%% Phase 0: Client-Side Setup
Note over Client: Phase 0: Client-Side Setup
Client->>Client: Creates gRPC channel
Client->>Client: Instantiates service stub
%% Phase 1: RPC Invocation & Connection Establishment
Note over Client: Phase 1: RPC Invocation & Connection Establishment
Client->>Client: Calls unary_stream method (call = stub.Method())
Client->>Client: Executes await call.initial_metadata()
opt TCP/TLS Connection Not Established
Client->>Server: TCP: SYN
Server-->>Client: TCP: SYN-ACK
Client->>Server: TCP: ACK
Note over Client,Server: TCP Connection Established
Client->>Server: TLS: ClientHello
Server-->>Client: TLS: ServerHello, Certificate, etc.
Client->>Server: TLS: KeyExchange, ChangeCipherSpec
Server-->>Client: TLS: ChangeCipherSpec
Note over Client,Server: TLS Handshake Complete
Client->>Server: HTTP/2: Connection Preface (Magic + SETTINGS)
Server-->>Client: HTTP/2: SETTINGS frame
Note over Client,Server: HTTP/2 Connection Initialized
end
Note over Client,Server: RPC Request over established HTTP/2
Client->>Server: HTTP/2: HEADERS (:method, :path, metadata, content-type=application/grpc)
Client->>Server: HTTP/2: DATA (serialized request_message, END_STREAM=true)
Server-->>Client: HTTP/2: HEADERS (initial server metadata, END_STREAM=false)
Note over Client: await call.initial_metadata() completes
%% Phase 2: Server Streaming Responses
Note over Client,Server: Phase 2: Server Streaming Responses
loop For each message in stream
Note over Client: Enters/awaits in `async for msg in call`
Server-->>Client: HTTP/2: DATA (serialized response_message_N, END_STREAM=false)
Client->>Client: Receives and deserializes msg_N
end
%% Phase 3: Stream Termination (Normal)
Note over Client,Server: Phase 3: Stream Termination (Normal)
Note over Client: Awaits in `async for` (server finished sending)
Server-->>Client: HTTP/2: HEADERS (Trailers: grpc-status=OK, END_STREAM=true)
Note over Client: `async for` loop terminates
Client->>Client: Accesses call.code(), call.details()
%% Alternate Flow: Error During Stream
alt Error During Stream
Note over Client,Server: Alternate Flow: Error During Stream
Note over Client: Awaits next message in `async for msg in call`
Server-->>Client: HTTP/2: HEADERS (Trailers: grpc-status=ERROR_CODE, grpc-message, END_STREAM=true)
Note over Client: AioRpcError raised
Client->>Client: Accesses call.code(), call.details()
end
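If that holds, a rough sketch of how it could be wired into the loop from the earlier diff (same assumed internals; whether all errors actually surface at initial_metadata() is exactly the open question here):

call = self._stream_method()
# initial_metadata() completes once the server has answered with its response
# headers, so a rejected or failed call should already raise AioRpcError here.
await call.initial_metadata()
self._retry_strategy.reset()
await sender.send(StreamStartedEvent())
async for msg in call:
    await sender.send(self._transform(msg))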
Updated. The test still needs to be fixed, because to do this we can't take an arbitrary async iterator anymore; we need to take a proper `UnaryStreamCall`.
Signed-off-by: Leandro Lucarella <[email protected]>
OK, this is ready for review again. After considering where to put the […], I didn't change when the […].
Pull Request Overview
This PR fixes the retry strategy behavior in the GrpcStreamBroadcaster by resetting the retry strategy after a successful stream start, and improves the timing of the StreamStarted event.
- Reset retry strategy after receiving the first message to prevent accumulating delays across retries
- Wait for initial metadata before firing StreamStarted event for better connection indication
- Update channel closing methods and dependency version
Reviewed Changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| src/frequenz/client/base/streaming.py | Add retry strategy reset logic and improve StreamStarted event timing |
| tests/streaming/test_grpc_stream_broadcaster.py | Refactor test mocks and add test for retry strategy reset behavior |
| pyproject.toml | Update frequenz-channels dependency version requirement |
| RELEASE_NOTES.md | Document bug fixes and breaking changes |
After the streaming started successfully we should reset the retry strategy, so next time there is a failure we can retry from the beginning of the retry strategy again. Signed-off-by: Leandro Lucarella <[email protected]>
Sending the `StreamStarted` event as soon as the streaming method is called can give a false impression that the stream is active and working, when in reality it might not be connected to the server at all. Now we wait until we receive the initial metadata from the server before firing the `StreamStarted` event. That should give users a better indication that the stream is actually active and working without having to wait for the first item to be received, which can take a long time for some low-frequency streams. Signed-off-by: Leandro Lucarella <[email protected]>
This removes the need for an extra adaptor lambda when passing the generator as a callable. Signed-off-by: Leandro Lucarella <[email protected]>
The `stream_method` is not a simple async iterator anymore, as we need to also use the `initial_metadata()` method from gRPC calls. Because of this we now need more complex mocking. The new `unary_stream_call_mock()` facilitates the mocking of unary stream gRPC calls. Signed-off-by: Leandro Lucarella <[email protected]>
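A rough sketch of what such a helper could look like (the actual `unary_stream_call_mock()` in the test file may differ; this only shows the two pieces the broadcaster needs, `initial_metadata()` and async iteration):

from unittest.mock import AsyncMock, MagicMock

def unary_stream_call_mock(messages: list[int]) -> MagicMock:
    """Create a mock that behaves like a grpc.aio.UnaryStreamCall."""
    call = MagicMock()
    # The broadcaster awaits this before considering the stream started.
    call.initial_metadata = AsyncMock(return_value=())
    # Iterating over the call yields the given messages.
    call.__aiter__.return_value = messages
    return call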
Same as in the previous commit, we need to mock the `stream_method` with a full `UnaryStreamCall` instead of an `AsyncIterator`. Signed-off-by: Leandro Lucarella <[email protected]>
Before the changes to the errored stream tests, the generator used for error tests was not re-created each time the streamer reconnected. This meant that after the first error, the stream would resume correctly. Previously this was not an issue because the retry strategy was never reset, so when the retry limit was set to 2, it didn't matter that the stream resumed successfully; the next retry would still reduce the number of further retry attempts. Now that the retry strategy is properly reset, we need to make sure to keep the errored channel in an error state, otherwise the retry counter will always be reset and the streamer will never give up. Because of this, the test no longer receives messages after the first failure. Signed-off-by: Leandro Lucarella <[email protected]>
The most common case will probably be that after a failure while iterating, the stream will also fail at `initial_metadata()`. To test this case, we add support for making the errored stream start failing in `initial_metadata()` after it enters the error state. We also keep the old test case where it only fails while iterating; there might be cases where the server consistently sends some wrong/unexpected message that will only fail when the message is received and not during the initial metadata exchange. Signed-off-by: Leandro Lucarella <[email protected]>
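For that new failure mode, the mock could make `initial_metadata()` itself raise; a hedged sketch (the AioRpcError arguments below are plausible placeholder values, not what the test actually uses):

import grpc
from unittest.mock import AsyncMock, MagicMock

failing_call = MagicMock()
# Simulate the stream failing before any message, during the metadata exchange.
failing_call.initial_metadata = AsyncMock(
    side_effect=grpc.aio.AioRpcError(
        grpc.StatusCode.UNAVAILABLE,
        grpc.aio.Metadata(),
        grpc.aio.Metadata(),
        details="stream is in error state",
    )
)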
The overloads for `new_receiver()`'s `include_events` parameter used `Literal[True]` and `Literal[False]` to differentiate the return types, but this requires users to call the method only using literals, which is a bit too restrictive. This commit changes the type hints to use just `bool` for the parameter for the general case. It also simplifies the type hints for the return type. Signed-off-by: Leandro Lucarella <[email protected]>
This makes sure it is clear what this argument is on the call site. Signed-off-by: Leandro Lucarella <[email protected]>
Signed-off-by: Leandro Lucarella <[email protected]>
Signed-off-by: Leandro Lucarella <[email protected]>
This requires bumping the minimum dependency version of 'frequenz-channels' to 1.8.0. Signed-off-by: Leandro Lucarella <[email protected]>
@shsms do you want to give it a look too, as you also reviewed in the past?
Just a comment for future improvement.
    warn_on_overflow: bool = True,
-   include_events: Literal[True],
+   include_events: bool,
) -> channels.Receiver[StreamEvent | OutputT]: ...
But here they can pass False from a variable and still get typed StreamEvent | OutputT. And there's another overload above that would give OutputT when users pass Literal[False]. I understand that it was unusable with variables and that in practice we're just adding support for variables while not providing any narrowing for them. But still, I feel it is a bad idea to do any overloads based on literals, and maybe there's a type-based solution to this, with sentinels, etc. But I can't think of a nice scheme at the moment, so happy to postpone.
No, we are providing narrowing when using a literal, so x.new_receiver(include_events=False) (which is the default) should be typed -> OutputT. The problem is we can't properly narrow the return type when a variable is passed. So I would say it is half-useful, but since this half will probably be used in 95% of the cases (the default), I think it is still pretty useful.
g: GrpcStreamBroadcaster[int, str] = GrpcStreamBroadcaster[int, str](
    stream_name="test",
    stream_method=None,  # type: ignore
)
reveal_type(g.new_receiver())

Gives:
Type of "g.new_receiver()" is "Receiver[str]"
Oh that's true. It is a problem only when people start to mess with things. :D
Yeah, I hit the issue only because I parametrized a test, so I started passing a variable to include_events, but I don't think it will be a common case in practice.
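For reference, a minimal self-contained sketch of the literal-based overload scheme under discussion (simplified stand-ins; not the real new_receiver() signatures in streaming.py, which return channels.Receiver):

from typing import Literal, overload

class StreamEvent:
    """Stand-in for the real stream event type."""

@overload
def new_receiver(*, include_events: Literal[False] = False) -> list[str]: ...
@overload
def new_receiver(*, include_events: Literal[True]) -> list[StreamEvent | str]: ...
@overload
def new_receiver(*, include_events: bool) -> list[StreamEvent | str]: ...
def new_receiver(*, include_events: bool = False) -> list[StreamEvent | str]:
    """Return only outputs, or events and outputs, depending on the flag."""
    return []

# Literals narrow: new_receiver() -> list[str], new_receiver(include_events=True)
# -> list[StreamEvent | str]. A plain bool variable falls through to the last
# overload and gets the wider list[StreamEvent | str] type.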