Design Discussion: Embedding AlertServer into dolphinscheduler-api Module #18005
Replies: 2 comments 1 reply
-
Hi all, just a gentle follow-up on this design discussion. I wanted to check if anyone had a chance to review the proposed approach for embedding the AlertServer into the API Server. If there are any concerns regarding the leader election strategy, lifecycle integration, or alert processing model, I would really appreciate your feedback before proceeding further with the implementation. Thanks for your time and guidance. Best regards,
-
Been looking at the initialization flow too. The main thing I'd worry about is lifecycle management when both services need to shut down gracefully. Did you consider keeping them separate but sharing the same datasource instead? That's what we did for another module, and it avoided a ton of coupling issues.
-
Hi all,
I’ve been studying the architectural requirements for embedding the AlertServer into the API Server (related to #8975). After reviewing the initialization flows in `dolphinscheduler-alert-server` and `dolphinscheduler-api`, I’d like to discuss a potential design direction and gather feedback.

My goal is to transition the alerting mechanism from a standalone process to an embedded background service while maintaining DolphinScheduler's high-availability and reliability standards.
Proposed Technical Direction
1. Logic Decoupling (Modularization)
Instead of source-code duplication, refactor the core alerting logic (e.g., `AlertBootstrapService`, `AlertSender`) into a reusable library module. The `dolphinscheduler-api` module will consume this as a dependency, ensuring a single source of truth for alerting logic.

2. Lifecycle Integration
Use Spring-managed components and `@PostConstruct` hooks within the API Server to initialize the alerting engine. This ensures alerting threads are orchestrated alongside the API's primary lifecycle, starting only after the server successfully joins the Registry.

3. Leader Election & High Availability (HA)
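To make point 2 concrete, here is a rough plain-Java sketch of the lifecycle wiring; all class and method names are hypothetical, and in the real module `start()`/`close()` would be driven by Spring `@PostConstruct`/`@PreDestroy` hooks after the Registry join:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical sketch of an embedded alert engine's lifecycle inside the API Server.
class EmbeddedAlertBootstrap implements AutoCloseable {
    private final ExecutorService loop = Executors.newSingleThreadExecutor();
    private final AtomicBoolean running = new AtomicBoolean(false);

    // @PostConstruct in the Spring-managed version, invoked only after the
    // API server has successfully joined the Registry.
    public void start() {
        if (running.compareAndSet(false, true)) {
            loop.execute(this::pollOnce);
        }
    }

    // Placeholder for the embedded alert event loop: fetch PENDING alerts,
    // dispatch them through the alert plugins, then re-schedule itself.
    private void pollOnce() {
        if (!running.get()) {
            return;
        }
        // ... fetch and send alerts here ...
    }

    // @PreDestroy in the Spring-managed version: stop accepting new work and
    // give in-flight notifications a bounded window to drain.
    @Override
    public void close() {
        running.set(false);
        loop.shutdown();
        try {
            loop.awaitTermination(5, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    public boolean isRunning() {
        return running.get();
    }
}
```

The bounded `awaitTermination` window matters here: the API Server's graceful shutdown should not hang indefinitely on a slow notification channel.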
To prevent duplicate alert processing in horizontally scaled API deployments, I propose leveraging the existing `RegistryClient` (ZooKeeper/Etcd) to implement a Leader-Follower model. Only the "Leader" API instance will activate the `AlertEventLoop`, with standby nodes ready to take over upon leader failure.

4. Fault Tolerance & Data Integrity
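A minimal sketch of the leader gate described in point 3, with an `AtomicReference` standing in for the ephemeral leader key that `RegistryClient` would create in ZooKeeper/Etcd (class and key semantics are hypothetical):

```java
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical leader-follower gate: only the instance that wins the key
// activates its AlertEventLoop; standby nodes keep retrying tryAcquire().
class AlertLeaderElector {
    // Stand-in for an ephemeral registry key; null means the key is vacant.
    private static final AtomicReference<String> LEADER_KEY = new AtomicReference<>();

    private final String instanceId;

    AlertLeaderElector(String instanceId) {
        this.instanceId = instanceId;
    }

    // Returns true iff this API instance should run the alert event loop.
    boolean tryAcquire() {
        return LEADER_KEY.compareAndSet(null, instanceId)
                || instanceId.equals(LEADER_KEY.get());
    }

    // Called when the registry session expires or the instance shuts down;
    // a standby node's next tryAcquire() then succeeds.
    void release() {
        LEADER_KEY.compareAndSet(instanceId, null);
    }
}
```

In the real implementation the compare-and-set would be an ephemeral node creation, so a crashed leader's key disappears automatically when its registry session expires.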
Use an atomic, conditional claim (e.g., `UPDATE ... SET status = 'SENDING', handler_instance = 'ID' WHERE status = 'PENDING'`) to ensure thread-safe row acquisition. A recovery sweep will detect alerts stuck in the `SENDING` state due to unexpected instance crashes and reset them to `PENDING` for re-delivery.

5. Performance Isolation
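To illustrate the claim semantics from point 4, here is an in-memory simulation of the conditional `UPDATE`; the table shape and handler IDs are illustrative only:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Illustrative in-memory model of the alert table's status column.
class AlertClaimTable {
    private final ConcurrentMap<Integer, String> rows = new ConcurrentHashMap<>();

    void insertPending(int alertId) {
        rows.put(alertId, "PENDING");
    }

    // Mirrors the conditional UPDATE's "WHERE status = 'PENDING'" clause: the
    // replace succeeds for at most one handler, so a row is never claimed twice.
    boolean claim(int alertId, String handlerInstance) {
        return rows.replace(alertId, "PENDING", "SENDING:" + handlerInstance);
    }

    // Crash-recovery sweep: reset rows a dead instance left in SENDING back to
    // PENDING so a live handler can re-deliver them.
    int resetStuck(String crashedInstance) {
        int reset = 0;
        for (Map.Entry<Integer, String> e : rows.entrySet()) {
            if (e.getValue().equals("SENDING:" + crashedInstance)
                    && rows.replace(e.getKey(), e.getValue(), "PENDING")) {
                reset++;
            }
        }
        return reset;
    }
}
```

In SQL terms, a successful `claim` corresponds to the `UPDATE` reporting one affected row; zero affected rows means another handler already owns the alert.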
Configure a dedicated `ThreadPoolExecutor` for alerting tasks. This prevents long-running notification I/O (e.g., slow SMTP or Webhook responses) from starving the API's Netty/Tomcat worker threads, keeping the REST interface responsive.

6. SPI Management & Decommissioning
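A sketch of the dedicated pool from point 5; the pool sizes, queue capacity, and thread name are illustrative placeholders, not proposed defaults:

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

// Illustrative factory for a pool reserved for notification I/O, so slow
// SMTP/webhook calls never occupy the API server's Netty/Tomcat workers.
class AlertExecutorFactory {
    static ThreadPoolExecutor create() {
        return new ThreadPoolExecutor(
                4, 8,                            // core / max sender threads
                60, TimeUnit.SECONDS,            // reclaim idle threads above core
                new ArrayBlockingQueue<>(1000),  // bounded backlog of alert tasks
                runnable -> {
                    Thread t = new Thread(runnable, "alert-sender");
                    t.setDaemon(true);           // never block JVM shutdown
                    return t;
                },
                // back-pressure on the submitter instead of silently dropping
                // alerts when the backlog fills
                new ThreadPoolExecutor.CallerRunsPolicy());
    }
}
```

The bounded queue plus `CallerRunsPolicy` is one possible back-pressure choice; an unbounded queue would hide a stuck notification channel until memory becomes the symptom.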
Ensure the API Server remains compatible with the Alert SPI for dynamic plugin loading. This plan includes the complete removal of standalone `AlertServer.java` entry points, assembly descriptors, and redundant Docker/K8s service definitions to simplify the deployment footprint.

I would appreciate any feedback or concerns regarding this approach, particularly on the distributed coordination strategy, before I proceed further with implementation planning.
Best regards,
Shrihari Rajendrakumar Kulkarni