Open
Labels: enhancement
Description
What, why, who
As a user of the node service, I would like unhealthy nodes to be timed out so that they are not stressed with unnecessary load and node selection focuses on healthy nodes.
Acceptance criteria
- Count erroneous calls to each node within a sliding window.
  - Error: the node does not respond properly (call times out, i/o timeout, connection refused, ...).
  - A node whose tick lags behind the reliable node tick also counts as an error.
  - Sliding window: call-based (not time-based).
- If the error threshold is reached within the window, time out the node.
- After a time-out, send a probe request to check whether the node is working again.
  - If the probe fails, time out the node again.
  - If the probe succeeds, reset the error counter and treat the node as healthy.
- If a node times out a certain number of times and does not recover on any probe, remove it from the node pool completely (it will be added again if it is still a public peer). This feature is only enabled if the public node strategy is used.
  - Statically configured nodes must never be removed from the node pool!
- Configuration
  - Error threshold (percentage of erroneous calls)
  - Sliding window (number of counted calls)
  - Tick delay that is treated as an error
  - Number of failed probes for node removal
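To make the criteria above more concrete, here is a minimal Go sketch of how per-node health tracking could look. It is not an existing implementation: `Config`, `NodeHealth`, `Record`, `ProbeResult`, and `ErrUnresponsive` are all hypothetical names. It also makes a few assumptions the issue does not settle: the threshold is expressed as a fraction (0..1) rather than a percentage, it is only evaluated once the window is full, `WindowSize` is greater than zero, and the caller decides how long a time-out lasts and when to send the probe. The `removable` flag is meant to be true only for nodes from the public node strategy.

```go
package main

import "errors"

// ErrUnresponsive stands in for any transport failure (call timeout,
// i/o timeout, connection refused, ...). Hypothetical name.
var ErrUnresponsive = errors.New("node did not respond properly")

// Config mirrors the configuration items above. All field names are assumptions.
type Config struct {
	ErrorThreshold  float64 // fraction of erroneous calls (0..1) that triggers a time-out
	WindowSize      int     // number of most recent calls that are counted (must be > 0)
	MaxTickDelay    uint32  // lag behind the reliable tick that counts as an error
	MaxFailedProbes int     // failed probes after which a node is removed
}

type State int

const (
	Healthy State = iota
	TimedOut
	Removed
)

// NodeHealth tracks one node using a call-based sliding window (ring buffer).
type NodeHealth struct {
	cfg          Config
	removable    bool   // false for statically configured nodes
	window       []bool // true marks an erroneous call
	next, filled int
	errCount     int
	failedProbes int
	state        State
}

func NewNodeHealth(cfg Config, removable bool) *NodeHealth {
	return &NodeHealth{cfg: cfg, removable: removable, window: make([]bool, cfg.WindowSize)}
}

func (n *NodeHealth) State() State { return n.state }

// Record stores the outcome of one regular call to the node.
func (n *NodeHealth) Record(callErr error, nodeTick, reliableTick uint32) {
	if n.state != Healthy {
		return // timed-out or removed nodes receive no regular calls
	}
	lagging := reliableTick > nodeTick && reliableTick-nodeTick >= n.cfg.MaxTickDelay
	failed := callErr != nil || lagging
	if n.filled == len(n.window) {
		if n.window[n.next] {
			n.errCount-- // the oldest call drops out of the window
		}
	} else {
		n.filled++
	}
	n.window[n.next] = failed
	if failed {
		n.errCount++
	}
	n.next = (n.next + 1) % len(n.window)
	// Assumption: the threshold is only evaluated once the window is full.
	if n.filled == len(n.window) && float64(n.errCount)/float64(n.filled) >= n.cfg.ErrorThreshold {
		n.state = TimedOut
	}
}

// ProbeResult handles the outcome of a probe sent after a time-out.
func (n *NodeHealth) ProbeResult(ok bool) {
	if n.state != TimedOut {
		return
	}
	if ok {
		// Successful probe: reset all counters and treat the node as healthy.
		for i := range n.window {
			n.window[i] = false
		}
		n.next, n.filled, n.errCount, n.failedProbes = 0, 0, 0, 0
		n.state = Healthy
		return
	}
	n.failedProbes++ // failed probe: the node stays timed out
	if n.removable && n.failedProbes >= n.cfg.MaxFailedProbes {
		n.state = Removed
	}
}
```

In this sketch the node selection would skip every node whose `State()` is not `Healthy`, and the pool would drop nodes that reach `Removed`. Because `removable` is false for statically configured nodes, they can stay timed out indefinitely but are never removed.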
Open questions
- It is unclear where to implement this, because go-qubic also implements node pooling. Where should we add it?
- Should we support time-based sliding windows, too?
- Can we use a library?
References
Circuit breaker example: https://resilience4j.readme.io/docs/circuitbreaker
Status: Backlog