Improve node pooling: circuit breaker #19

@qubicmio

Description

What, why, who

As a user of the node service, I would like unhealthy nodes to be timed out so that they are not stressed with unnecessary load and node selection focuses on healthy nodes.

Acceptance criteria

  • Count erroneous calls to each node within a sliding window.
    • Error: the node does not respond properly (call times out, i/o timeout, connection refused, ...).
    • Treat a node tick that lags behind the reliable node tick by the configured tick delay as an error.
    • Sliding window: call-based (not time-based).
  • If the configured error threshold is reached within the window, time out the node.
  • After a timeout, send a probe request to check whether the node is working again.
    • If the probe fails, time out the node again.
    • If the probe succeeds, reset the error counter and treat the node as healthy.
  • If a node is timed out a configured number of times and does not recover on any probe, remove it from the node pool completely (it will be added again if it is still a public peer). This removal is only enabled if the public node strategy is used.
  • Statically configured nodes must never be removed from the node pool!
  • Configuration
    • Error threshold (percentage of erroneous calls)
    • Sliding window (number of counted calls)
    • Tick delay that is treated as an error
    • Number of failed probes for node removal
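The criteria above could be sketched roughly as follows. This is a minimal, single-node illustration and not an existing go-qubic or node service API: `Config`, `Breaker`, `Record`, `ProbeDue`, and `ProbeResult` are hypothetical names. It also makes three assumptions the issue does not state: a `Timeout` duration before probing, that a node is only timed out once the window is full, and that failed probes are counted consecutively. The sketch is not safe for concurrent use and assumes `WindowSize > 0`.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// Config mirrors the configuration items from the issue; all names are hypothetical.
type Config struct {
	ErrorThreshold  float64       // fraction of erroneous calls (0..1) that triggers a timeout
	WindowSize      int           // number of most recent calls that are counted (must be > 0)
	MaxTickDelay    uint32        // tick lag behind the reliable tick that counts as an error
	MaxFailedProbes int           // consecutive failed probes before removal is requested
	Timeout         time.Duration // assumption: pause before a timed-out node is probed
}

// Breaker tracks the health of a single node.
type Breaker struct {
	cfg          Config
	removable    bool   // false for statically configured nodes
	window       []bool // ring buffer of call outcomes, true = error
	next, count  int    // write position and number of recorded calls
	errs         int    // errors currently inside the window
	timedOut     bool
	since        time.Time // start of the current timeout
	failedProbes int
	now          func() time.Time // injectable clock
}

func NewBreaker(cfg Config, removable bool) *Breaker {
	return &Breaker{cfg: cfg, removable: removable, window: make([]bool, cfg.WindowSize), now: time.Now}
}

// Record stores one call outcome and times the node out once the window is
// full and the error ratio reaches the threshold.
func (b *Breaker) Record(callErr error, nodeTick, reliableTick uint32) {
	lagging := reliableTick > nodeTick && reliableTick-nodeTick >= b.cfg.MaxTickDelay
	failed := callErr != nil || lagging
	if b.count == len(b.window) {
		if b.window[b.next] {
			b.errs-- // the oldest outcome leaves the window
		}
	} else {
		b.count++
	}
	b.window[b.next] = failed
	if failed {
		b.errs++
	}
	b.next = (b.next + 1) % len(b.window)
	full := b.count == len(b.window)
	if !b.timedOut && full && float64(b.errs)/float64(b.count) >= b.cfg.ErrorThreshold {
		b.timedOut, b.since = true, b.now()
	}
}

// Available reports whether the node may be selected for regular calls.
func (b *Breaker) Available() bool { return !b.timedOut }

// ProbeDue reports whether the timeout has elapsed and a probe should be sent.
func (b *Breaker) ProbeDue() bool {
	return b.timedOut && b.now().Sub(b.since) >= b.cfg.Timeout
}

// ProbeResult applies a probe outcome and returns true when the node should
// be removed from the pool. Statically configured nodes are never removed.
func (b *Breaker) ProbeResult(ok bool) (remove bool) {
	if !b.timedOut {
		return false
	}
	if ok { // recovered: reset all counters and treat the node as healthy
		clock := b.now
		*b = *NewBreaker(b.cfg, b.removable)
		b.now = clock
		return false
	}
	b.failedProbes++
	b.since = b.now() // time out again
	return b.removable && b.failedProbes >= b.cfg.MaxFailedProbes
}

var errDemo = errors.New("i/o timeout") // stand-in for a failed node call

func main() {
	cfg := Config{ErrorThreshold: 0.5, WindowSize: 4, MaxTickDelay: 10, MaxFailedProbes: 2}
	b := NewBreaker(cfg, true) // removable: a public peer, not a static node
	for i := 0; i < 4; i++ {
		b.Record(errDemo, 100, 100)
	}
	fmt.Println("available:", b.Available(), "probe due:", b.ProbeDue())
	fmt.Println("remove after probe 1:", b.ProbeResult(false))
	fmt.Println("remove after probe 2:", b.ProbeResult(false))
}
```

A pool would keep one such breaker per node, skip nodes whose `Available` is false during selection, and pass `removable = false` for statically configured nodes or whenever the public node strategy is not in use.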

Open questions

  • It is unclear where to implement this, because go-qubic also implements node pooling. Where should it be added?
  • Should we support time-based sliding windows, too?
  • Can we use a library?
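To make the second question more concrete, here is a rough sketch of what a time-based window could look like: outcomes expire by age instead of by call count. `TimeWindow`, `Add`, `ErrorRatio`, and `fakeClock` are hypothetical names used for illustration only, and the sketch assumes outcomes are added in non-decreasing time order.

```go
package main

import (
	"fmt"
	"time"
)

type outcome struct {
	at     time.Time
	failed bool
}

// TimeWindow keeps only call outcomes that are younger than span.
type TimeWindow struct {
	span     time.Duration
	outcomes []outcome // ordered oldest first
	errs     int       // failed outcomes currently in the window
}

func NewTimeWindow(span time.Duration) *TimeWindow { return &TimeWindow{span: span} }

// Add records an outcome observed at the given time.
func (w *TimeWindow) Add(at time.Time, failed bool) {
	w.evict(at)
	w.outcomes = append(w.outcomes, outcome{at: at, failed: failed})
	if failed {
		w.errs++
	}
}

// ErrorRatio returns the share of failed calls in the window ending at now,
// together with the number of calls the ratio is based on.
func (w *TimeWindow) ErrorRatio(now time.Time) (ratio float64, calls int) {
	w.evict(now)
	if len(w.outcomes) == 0 {
		return 0, 0
	}
	return float64(w.errs) / float64(len(w.outcomes)), len(w.outcomes)
}

// evict drops outcomes that are span or more older than now.
func (w *TimeWindow) evict(now time.Time) {
	cutoff := now.Add(-w.span)
	i := 0
	for i < len(w.outcomes) && !w.outcomes[i].at.After(cutoff) {
		if w.outcomes[i].failed {
			w.errs--
		}
		i++
	}
	w.outcomes = w.outcomes[i:]
}

// fakeClock is a manually advanced clock used only for the demonstration.
type fakeClock struct{ t time.Time }

func newFakeClock() *fakeClock                  { return &fakeClock{t: time.Unix(0, 0)} }
func (c *fakeClock) Now() time.Time             { return c.t }
func (c *fakeClock) Advance(d time.Duration)    { c.t = c.t.Add(d) }

func main() {
	clock := newFakeClock()
	w := NewTimeWindow(10 * time.Second)
	w.Add(clock.Now(), true) // failed call
	clock.Advance(6 * time.Second)
	w.Add(clock.Now(), false) // successful call
	fmt.Println(w.ErrorRatio(clock.Now()))
	clock.Advance(6 * time.Second) // the first outcome is now older than the span
	fmt.Println(w.ErrorRatio(clock.Now()))
}
```

Compared with a call-based window, memory use here depends on the call rate, and a node that receives few calls may have too few samples for a meaningful ratio, which is why `ErrorRatio` also returns the call count.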

References

Circuit breaker example: https://resilience4j.readme.io/docs/circuitbreaker

Metadata

Labels: enhancement (New feature or request)
Status: Backlog