Recently we have replaced our distributed locks and global timers to use etcd's concurrency API to guarantee active-active HA.
However, there are still edge cases that require global coordination of all manager processes, such as rate-limited container registry access (e.g., Docker Hub with an anonymous user). Since many manager processes receive API requests in a load-balanced fashion, it is difficult to share the rate-limit state between different manager processes. This is why lablup/backend.ai-manager#501 is on hold.
Let's localize such globally coordinated state to a single manager process, the leader. To keep high availability, we should periodically check the liveness of the leader and re-elect it when it goes down; fortunately, etcd provides the facilities to implement this.
- manager: Implement leader election of manager processes with periodic leader status checks (see the election sketch after this list).
- manager: Rewrite the global timer to run only on the leader manager process. (When a new leader is elected, the new one should start the global timers and the old one, if still alive, should stop them.)
- manager: Add a generic "leader task" message queue based on Redis Streams to reroute API requests that are accepted by arbitrary manager processes but must be processed exclusively by the leader (see the queue sketch after this list).
- manager: Rewrite the image rescan task (backend.ai-manager#501, "fix: improve rescan image task to prevent too many requests") to keep a local `aiolimiter` state that implements its own rate limiting toward the container registries, and use the leader task queue to trigger the rescan task (see the rate-limiting sketch after this list).
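A minimal sketch of lease-based leader election, assuming the synchronous `python-etcd3` client; the key name, TTL, and node ID are illustrative, and the actual manager would use its own async etcd layer (or the etcd concurrency/election API directly). The comment inside the leader loop marks where the leader-bound global timers would be started and stopped.

```python
# Leader-election sketch (assumptions: python-etcd3 client; key name, TTL,
# and node ID are illustrative, not the real manager configuration).
import time
import uuid

import etcd3

LEADER_KEY = "/sorna/manager/leader"   # hypothetical key
LEASE_TTL = 10                         # seconds; illustrative
NODE_ID = str(uuid.uuid4())


def try_acquire_leadership(client: etcd3.Etcd3Client):
    """Try to become the leader by creating the leader key under a lease."""
    lease = client.lease(LEASE_TTL)
    acquired, _ = client.transaction(
        # Succeeds only if the leader key does not exist yet.
        compare=[client.transactions.version(LEADER_KEY) == 0],
        success=[client.transactions.put(LEADER_KEY, NODE_ID, lease=lease)],
        failure=[],
    )
    if not acquired:
        lease.revoke()
        return None
    return lease


def election_loop():
    client = etcd3.client()
    while True:
        lease = try_acquire_leadership(client)
        if lease is None:
            # Someone else is the leader; retry after a short wait.
            time.sleep(LEASE_TTL / 2)
            continue
        # We are the leader: keep refreshing the lease (periodic liveness check).
        # The leader-only global timers would be started here and stopped when
        # leadership is lost.
        try:
            while True:
                time.sleep(LEASE_TTL / 3)
                lease.refresh()
        except Exception:
            # Lost connectivity or the lease expired: if this process dies instead,
            # the lease simply expires and another manager wins the next election.
            continue
```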
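A sketch of the "leader task" queue over Redis Streams, assuming redis-py's asyncio client; the stream name, consumer group, and payload format are illustrative and not the actual manager wire format.

```python
# Leader-task queue sketch (assumptions: redis-py asyncio client; stream/group
# names and the JSON payload layout are illustrative).
import json

import redis.asyncio as aioredis
from redis.exceptions import ResponseError

STREAM = "manager.leader-tasks"   # hypothetical stream name
GROUP = "leader"                  # consumed only by the current leader


async def enqueue_leader_task(r: aioredis.Redis, task_name: str, args: dict) -> None:
    """Called by any manager process that accepted an API request
    which must be handled exclusively by the leader."""
    await r.xadd(STREAM, {"task": task_name, "args": json.dumps(args)})


async def leader_task_consumer(r: aioredis.Redis, node_id: str) -> None:
    """Runs only on the current leader; drains the leader task stream."""
    try:
        await r.xgroup_create(STREAM, GROUP, id="$", mkstream=True)
    except ResponseError:
        pass  # the consumer group already exists
    while True:
        entries = await r.xreadgroup(GROUP, node_id, {STREAM: ">"}, count=10, block=5000)
        for _stream, messages in entries:
            for msg_id, fields in messages:
                task = fields[b"task"].decode()
                args = json.loads(fields[b"args"])
                # Dispatch to the actual handler, e.g. the image rescan task.
                print(f"leader task: {task} {args}")
                await r.xack(STREAM, GROUP, msg_id)
```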
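A sketch of local, per-registry rate limiting with `aiolimiter` for the rescan task; the rate numbers and the `fetch_manifest` helper are hypothetical. Because the rescan runs only on the leader, a process-local limiter is sufficient and no cross-manager state sharing is needed.

```python
# Per-registry rate-limiting sketch (assumptions: rate numbers and the
# fetch_manifest helper are illustrative, not the real registry client).
import asyncio

from aiolimiter import AsyncLimiter

# One limiter per registry, local to the leader process.
_registry_limiters: dict[str, AsyncLimiter] = {
    # e.g. roughly 100 requests per 6 hours for anonymous Docker Hub access (illustrative)
    "docker.io": AsyncLimiter(max_rate=100, time_period=6 * 3600),
}


async def fetch_manifest(registry: str, image_ref: str) -> None:
    """Hypothetical helper used by the rescan task to hit a registry API."""
    limiter = _registry_limiters.setdefault(registry, AsyncLimiter(60, 60))
    async with limiter:
        # ... perform the actual HTTP request to the registry here ...
        await asyncio.sleep(0)  # placeholder for the real request
```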
JIRA Issue: BA-269