Leader election-based distributed timer and image rescan rate limiting #415

@achimnol

Description

We recently reimplemented our distributed locks and global timers on top of etcd's concurrency API to guarantee active-active HA.

However, some edge cases still require global coordination across all manager processes, such as rate-limited container registry access (e.g., Docker Hub with an anonymous user). Since many manager processes receive API requests in a load-balanced fashion, it is difficult to share rate-limit state among them. This is why lablup/backend.ai-manager#501 is on hold.

Let's localize such globally coordinated state to a single manager process, the leader. To keep high availability, we should periodically check the liveness of the leader and re-elect it when needed; fortunately, etcd provides the facilities to implement this.
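To illustrate the idea, here is a minimal sketch of lease-based leader election with liveness expiry. It is an in-memory stand-in, not etcd's API: the `InMemoryElection` class, its `campaign`/`leader` methods, the TTL value, and the manager names are all hypothetical, and the clock is passed in explicitly so the behavior is deterministic.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Lease:
    holder: str
    expires_at: float


class InMemoryElection:
    """Toy stand-in for a lease-backed election key (NOT etcd's API)."""

    def __init__(self, ttl: float) -> None:
        self.ttl = ttl
        self._lease: Optional[Lease] = None

    def campaign(self, candidate: str, now: float) -> bool:
        """Try to become (or remain) the leader; return True on success."""
        lease = self._lease
        if lease is None or lease.expires_at <= now or lease.holder == candidate:
            # The key is vacant, the lease expired, or the holder is renewing.
            self._lease = Lease(candidate, now + self.ttl)
            return True
        return False

    def leader(self, now: float) -> Optional[str]:
        """Return the current leader, or None if the lease has expired."""
        lease = self._lease
        if lease is not None and lease.expires_at > now:
            return lease.holder
        return None


election = InMemoryElection(ttl=10.0)
assert election.campaign("manager-1", now=0.0)       # first candidate wins
assert not election.campaign("manager-2", now=5.0)   # lease is still held
assert election.campaign("manager-1", now=8.0)       # renewal extends to t=18
assert election.leader(now=17.0) == "manager-1"
assert election.campaign("manager-2", now=20.0)      # manager-1 stopped renewing
assert election.leader(now=20.0) == "manager-2"
```

In a real deployment the lease and the compare-and-set would live in etcd, and each manager would run the campaign/renew step periodically; a leader that stops renewing simply loses the lease after the TTL.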

  • manager: Implement leader election of manager processes with periodic leader status checks.
  • manager: Rewrite the global timer to run on the leader manager process. (When a new leader is elected, it should start the global timers and the old leader, if it is still alive, should stop them.)
  • manager: Add a generic "leader task" message queue based on Redis streams to reroute API requests that are accepted by arbitrary manager processes but must be processed exclusively by the leader.
  • manager: Rewrite lablup/backend.ai-manager#501 ("fix: improve rescan image task to prevent too many requests") to rate-limit its own access to the container registries using a local aiolimiter state. Use the leader task queue to trigger the rescan task.
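The last two items could fit together roughly as sketched below. This is only an illustration under stated assumptions: a `deque` stands in for the Redis stream, `LocalRateLimiter` is a simple fixed-spacing limiter (stricter than aiolimiter's leaky bucket, and not its API), and the `Manager` class, its method names, and the task names are hypothetical. Time is passed in explicitly so the result is deterministic.

```python
from collections import deque
from typing import Deque, List, Tuple


class LocalRateLimiter:
    """Fixed-spacing limiter held in one process (stand-in for aiolimiter)."""

    def __init__(self, max_rate: int, period: float) -> None:
        self.interval = period / max_rate  # minimum spacing between calls
        self._next_free = 0.0

    def reserve(self, now: float) -> float:
        """Book the next free slot and return the time at which it starts."""
        start = max(now, self._next_free)
        self._next_free = start + self.interval
        return start


class Manager:
    def __init__(self, name: str, queue: Deque[str], is_leader: bool = False) -> None:
        self.name = name
        self.queue = queue  # shared "leader task" queue
        self.is_leader = is_leader
        # Rate-limit state is local; only the leader's copy is ever used.
        self.limiter = LocalRateLimiter(max_rate=2, period=60.0)

    def handle_rescan_request(self, task: str) -> None:
        """API handler: any manager accepts the request and enqueues it."""
        self.queue.append(task)

    def drain(self, now: float) -> List[Tuple[float, str]]:
        """Leader only: consume queued tasks and space out registry calls."""
        if not self.is_leader:
            return []
        schedule = []
        while self.queue:
            task = self.queue.popleft()
            schedule.append((self.limiter.reserve(now), task))
        return schedule


queue: Deque[str] = deque()
leader = Manager("manager-1", queue, is_leader=True)
follower = Manager("manager-2", queue)

follower.handle_rescan_request("rescan-1")
leader.handle_rescan_request("rescan-2")
follower.handle_rescan_request("rescan-3")

assert follower.drain(now=0.0) == []  # non-leaders never touch the registry
schedule = leader.drain(now=0.0)
assert schedule == [(0.0, "rescan-1"), (30.0, "rescan-2"), (60.0, "rescan-3")]
```

Because only the leader consumes the queue, the limiter needs no cross-process synchronization; on re-election the new leader starts with a fresh limiter, which may briefly allow more requests than intended.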

JIRA Issue: BA-269
