Current behavior
Description:
Recently, we encountered an issue in Gorouter where a new instance endpoint was registered under the same index as an existing stale endpoint. As a result, Gorouter kept routing requests to the stale, unhealthy endpoint, which produced multiple 502 errors.
Details:
Route-emitter was down and was not sending unregister messages. Because of route integrity, Gorouter retained the endpoint information and did not prune the stale endpoint.
In the meantime, Diego created a replacement instance on another cell with a new instance_id and canonical address (IP:Port). Gorouter treated it as a new endpoint and added it to the routing pool under the same index as the existing stale endpoint. The current Gorouter implementation does not check the index when adding endpoints; it considers only the canonical address and instance_id.
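To make the failure mode concrete, here is a minimal Go sketch of a route pool that deduplicates endpoints only by canonical address and instance_id, as the paragraph above describes. The `Endpoint` and `Pool` types and the `Put` method are hypothetical simplifications for illustration, not Gorouter's actual `route` package API.

```go
package main

import "fmt"

// Endpoint is a simplified, hypothetical stand-in for a Gorouter endpoint.
type Endpoint struct {
	InstanceID string // Diego instance_id
	Index      int    // application instance index
	Addr       string // canonical address, IP:Port
}

// Pool is a hypothetical route pool keyed only by canonical address and
// instance_id; the instance index plays no part in deduplication.
type Pool struct {
	endpoints []*Endpoint
	byKey     map[string]*Endpoint
}

func NewPool() *Pool {
	return &Pool{byKey: make(map[string]*Endpoint)}
}

func key(e *Endpoint) string { return e.Addr + "|" + e.InstanceID }

// Put adds e, or updates the existing entry in place when an endpoint with
// the same address and instance_id is already registered.
func (p *Pool) Put(e *Endpoint) {
	if existing, ok := p.byKey[key(e)]; ok {
		*existing = *e
		return
	}
	p.byKey[key(e)] = e
	p.endpoints = append(p.endpoints, e)
}

func main() {
	p := NewPool()
	p.Put(&Endpoint{InstanceID: "old", Index: 0, Addr: "10.0.0.1:61000"})
	// Route-emitter is down, so no unregister message arrives for "old".
	p.Put(&Endpoint{InstanceID: "new", Index: 0, Addr: "10.0.0.2:61002"})
	// Both index-0 endpoints are now routable, including the stale one.
	fmt.Println(len(p.endpoints)) // prints 2
}
```

Because the new instance has a different address and instance_id, nothing in this keying scheme ties it to the stale entry, so both stay in the pool and the load balancer can keep picking the dead one.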
Because mTLS is enabled for Gorouter-to-app-container traffic, Gorouter does not prune stale endpoints unless the failure matches one of the prunableClassifiers. In addition, the requests were non-idempotent, so Gorouter did not retry them on the healthy endpoint. We eventually observed the prune-endpoint-failed log in Gorouter, but by then it was too late.
Desired behavior
We propose adding logic to Gorouter that checks whether the pool already contains an endpoint with the same index. If a new registration message arrives for an index that is already present, the existing endpoint should be updated or replaced rather than added alongside it, preventing duplicate registrations.
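One possible shape for the proposed check is sketched below in Go. It is a minimal sketch under stated assumptions, not a patch against Gorouter: the `Endpoint`, `Pool`, `Put`, and `remove` names are hypothetical, and the index is scoped by an assumed `AppID` field because several apps can be mapped to the same route and each has its own index 0.

```go
package main

import "fmt"

// Endpoint is a simplified, hypothetical stand-in for a Gorouter endpoint.
type Endpoint struct {
	AppID      string // application GUID (assumed; an index is only unique per app)
	InstanceID string // Diego instance_id
	Index      int    // application instance index
	Addr       string // canonical address, IP:Port
}

type indexKey struct {
	appID string
	index int
}

// Pool is a hypothetical route pool that tracks endpoints both by
// address/instance_id and by (app, index).
type Pool struct {
	endpoints []*Endpoint
	byAddr    map[string]*Endpoint
	byIndex   map[indexKey]*Endpoint
}

func NewPool() *Pool {
	return &Pool{
		byAddr:  make(map[string]*Endpoint),
		byIndex: make(map[indexKey]*Endpoint),
	}
}

func addrKey(e *Endpoint) string { return e.Addr + "|" + e.InstanceID }

// remove drops e from the slice and from both lookup maps.
func (p *Pool) remove(e *Endpoint) {
	delete(p.byAddr, addrKey(e))
	delete(p.byIndex, indexKey{e.AppID, e.Index})
	for i, cur := range p.endpoints {
		if cur == e {
			p.endpoints = append(p.endpoints[:i], p.endpoints[i+1:]...)
			break
		}
	}
}

// Put registers e. Any endpoint already holding the same (app, index), such
// as a stale instance whose unregister message never arrived, is removed
// first, as is a previous registration with the same address and instance_id.
func (p *Pool) Put(e *Endpoint) {
	if old, ok := p.byIndex[indexKey{e.AppID, e.Index}]; ok {
		p.remove(old)
	}
	if old, ok := p.byAddr[addrKey(e)]; ok {
		p.remove(old)
	}
	p.byAddr[addrKey(e)] = e
	p.byIndex[indexKey{e.AppID, e.Index}] = e
	p.endpoints = append(p.endpoints, e)
}

func main() {
	p := NewPool()
	p.Put(&Endpoint{AppID: "app", InstanceID: "old", Index: 0, Addr: "10.0.0.1:61000"})
	p.Put(&Endpoint{AppID: "app", InstanceID: "new", Index: 0, Addr: "10.0.0.2:61002"})
	for _, e := range p.endpoints {
		fmt.Println(e.InstanceID, e.Addr) // prints "new 10.0.0.2:61002"
	}
}
```

One trade-off worth checking before adopting this: during Diego cell evacuation, an evacuating instance and its replacement can briefly run with the same index, so replacing eagerly on index might drop a still-healthy endpoint somewhat early. Updating rather than replacing, or only replacing when the existing endpoint has already failed, are alternatives to weigh.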
Affected Version
0.351.0