A distributed health monitoring system designed to handle high-concurrency checks with a non-blocking architecture.
This system isn't just a simple setInterval loop. It's built to scale. The core philosophy was to decouple the scheduling of checks from the execution of checks.
Instead of the application trying to do everything at once (find due monitors -> check them -> save results), we use a Distributed Queue System.
- Thinking Process:
  - The Problem: In a synchronous system, if 10,000 monitors are due at t=0, the event loop would block trying to fire 10,000 requests.
  - The Solution: Application-level flow control. The Scheduler only produces "jobs". The Workers consume them at their own pace.
- Why RabbitMQ?
  - Backpressure: It acts as a shock absorber. If the network is slow, the queue fills up, but the scheduler keeps ticking.
  - Worker Scalability: We can spin up 50 generic worker nodes on different servers, all listening to the same queue.
  - Reliability: If a worker crashes while processing a job, RabbitMQ can re-queue it (via Acknowledgements) so the check isn't lost.
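
The producer/consumer split and the re-queue-on-failure behaviour described above can be sketched without a broker. The block below is a minimal in-memory illustration, not the RabbitMQ API: `JobQueue`, `runWorker`, `demo`, and the simulated failure of monitor 42 are all hypothetical names and values chosen for the example. It assumes each worker holds at most one unacknowledged job at a time.

```javascript
// In-memory stand-in for a broker queue. This is NOT the RabbitMQ API;
// it only mimics the deliver / ack / nack-requeue cycle.
class JobQueue {
  constructor() {
    this.ready = [];          // jobs waiting for a worker
    this.unacked = new Map(); // delivery tag -> job currently held by a worker
    this.nextTag = 1;
  }

  // Scheduler side: enqueue and return immediately.
  publish(job) {
    this.ready.push(job);
  }

  // Worker side: take one job, or null if nothing is waiting.
  take() {
    if (this.ready.length === 0) return null;
    const job = this.ready.shift();
    const tag = this.nextTag++;
    this.unacked.set(tag, job);
    return { tag, job };
  }

  // Job finished: forget it.
  ack(tag) {
    this.unacked.delete(tag);
  }

  // Job failed: put it back so another attempt can pick it up.
  nack(tag) {
    const job = this.unacked.get(tag);
    if (job === undefined) return;
    this.unacked.delete(tag);
    this.ready.push(job);
  }
}

// A worker drains the queue one job at a time, at its own pace.
async function runWorker(queue, checkFn, results) {
  for (let d = queue.take(); d !== null; d = queue.take()) {
    try {
      results.push(await checkFn(d.job));
      queue.ack(d.tag);
    } catch (err) {
      queue.nack(d.tag); // requeue so the check is not lost
    }
  }
}

async function demo() {
  const queue = new JobQueue();
  // Scheduler tick: publishing 10,000 jobs does no network I/O.
  for (let id = 0; id < 10000; id++) queue.publish({ monitorId: id });

  const failedOnce = new Set();
  // Hypothetical check: monitor 42 fails on its first attempt only.
  const check = async (job) => {
    if (job.monitorId === 42 && !failedOnce.has(42)) {
      failedOnce.add(42);
      throw new Error("simulated worker failure");
    }
    return job.monitorId;
  };

  const results = [];
  // Three workers share one queue, so at most 3 checks are in flight.
  await Promise.all([1, 2, 3].map(() => runWorker(queue, check, results)));
  return { results, queue };
}
```

Calling `demo()` resolves once every job has been acknowledged. The job that failed is retried rather than dropped, and the number of concurrent checks is set by the number of workers, not by how many monitors were due.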
We didn't want to scan the entire MongoDB health_monitors collection every second to find what's due. That's an O(N) operation whose cost grows linearly as users add monitors.
- Thinking Process:
  - State vs Stateless: The "schedule" is a stateful entity. We need fast random access and range queries.
  - Efficiency: Redis Sorted Sets (`ZSET`) allow us to store `next_check_at` as a score.
- Mechanism:
  - O(log N) Polling: `ZRANGEBYSCORE` allows us to fetch only the monitors due right now without touching the millions of monitors scheduled for later. Strictly, the cost is O(log N + M), where M is the number of due monitors returned.
  - Concurrency Safe: Individual Redis commands are atomic, which helps prevent race conditions if we were to scale the scheduler to multiple instances (combined with locking).
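
To make the sorted-set polling concrete, here is a minimal in-memory sketch of the idea: members are kept ordered by their `next_check_at` score, and a binary search finds the cut-off for "due now". The `Schedule` class, the `monitor:a` style keys, and the 60-second interval are assumptions for illustration. A real deployment would issue `ZADD` and `ZRANGEBYSCORE` against Redis rather than use this class, and would still need locking or an atomic fetch-and-reschedule step if several schedulers poll at once.

```javascript
// In-memory stand-in for a Redis Sorted Set, kept sorted by score.
// Hypothetical helper for illustration only.
class Schedule {
  constructor() {
    this.entries = []; // [{ score, member }], ascending by score
  }

  // Index of the first entry with score > limit (binary search, O(log N)).
  upperBound(limit) {
    let lo = 0;
    let hi = this.entries.length;
    while (lo < hi) {
      const mid = (lo + hi) >>> 1;
      if (this.entries[mid].score <= limit) lo = mid + 1;
      else hi = mid;
    }
    return lo;
  }

  // Like ZADD: schedule (or reschedule) a monitor at nextCheckAt.
  // Note: this array version is O(N) per insert; Redis ZADD is O(log N).
  add(member, nextCheckAt) {
    const existing = this.entries.findIndex((e) => e.member === member);
    if (existing !== -1) this.entries.splice(existing, 1);
    const at = this.upperBound(nextCheckAt);
    this.entries.splice(at, 0, { score: nextCheckAt, member });
  }

  // Like ZRANGEBYSCORE key -inf now: only members with score <= now.
  due(now) {
    return this.entries.slice(0, this.upperBound(now)).map((e) => e.member);
  }
}

const schedule = new Schedule();
schedule.add("monitor:a", 1000);
schedule.add("monitor:b", 5000);
schedule.add("monitor:c", 1500);

// At t=2000 only monitor:a and monitor:c are due; monitor:b is never touched.
const dueNow = schedule.due(2000);

// After dispatching the jobs, push each monitor's score into the future
// (a 60-second check interval is assumed here).
for (const member of dueNow) schedule.add(member, 2000 + 60000);
```

Rescheduling by overwriting the score means the next poll at the same timestamp returns nothing, so a monitor is not dispatched twice by a single scheduler.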
- MongoDB stores the configuration (`HealthMonitor`) and the historical results (`HealthStatusCheck`).
- Trade-off: For now, we store time-series data (check results) in a standard document collection.
  - Future Upgrade: Move `HealthStatusCheck` to a dedicated Time-Series Database (like InfluxDB or TimescaleDB) for better compression and query performance on large datasets.