feat: sync alert jobs #3059
Conversation
lib/logflare/alerting.ex
Outdated
```elixir
        :ok | {:error, :not_enabled} | {:error, :below_min_cluster_size}
def run_alert(alert_id, :scheduled) when is_integer(alert_id) do
  # sync the alert job for the next run
  sync_alert_job(alert_id)
```
It would be slightly better to return the alert job (if present) as an `:ok` tuple from the sync job; that would allow us to avoid a second query to re-fetch the alert job.
Which second query do you mean? `run_alert/2` below seems to operate on `AlertQuery`, not `Quantum.Job`, and the job only knows about `alert_id` now (since we no longer need to put `%AlertQuery{}` in the job if we re-fetch the alert definition each time).
`sync_alert_job` would perform one DB query on the `scheduler_node`, then perform another DB query at `get_alert_query_by` on line 285, so that would result in two DB queries being performed.

An alternative is to fetch the `alert_query` first and then pass it to `sync_alert_job`, reversing the order and reducing the DB queries to one.
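The reordering described above might look something like the following sketch. This is hypothetical: it assumes `run_alert/1` accepts an `%AlertQuery{}` struct and that `sync_alert_job/1` can take the pre-fetched struct instead of an id.

```elixir
# Sketch only: fetch the AlertQuery once and reuse it, so only one DB query runs.
def run_alert(alert_id, :scheduled) when is_integer(alert_id) do
  case get_alert_query_by(id: alert_id) do
    nil ->
      # the alert was deleted; nothing to sync or run
      :ok

    %AlertQuery{} = alert_query ->
      # pass the already-fetched struct so sync_alert_job/1 needs no second query
      sync_alert_job(alert_query)
      run_alert(alert_query)
  end
end
```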
Do we actually need to call `sync_alert_job`, which "re-adds" the job from `run_query`, or is it enough to unschedule the job for alerts that no longer exist, i.e. 4519e14? Or maybe it can just be a no-op, with sync then done periodically (every 60 minutes), but that has another problem: #3059 (comment)

Re-adding the job from `run_query/1` feels off for cron jobs somehow, but in light of #3059 (comment) it might be the only way to reliably re-sync active jobs. One possible problem with that approach: when a rare alert gets modified to become less rare, the update won't take effect until the alert runs at least once, which might be surprising.
I think I'd like to go with no syncing in `run_alert/1`, since it already uses the up-to-date alert query thanks to fetching it by id on each run (and being a no-op if it doesn't exist), and instead try to make `sync_alert_jobs` a bit smarter (rather than a "full" delete followed by a "full" insert) to avoid the problem in #3059 (comment).

UPDATE: done.
```elixir
  Cluster.Utils.rpc_call(node, func)

nil ->
  raise "Alerting scheduler node not found"
```
Previously that function could silently be a no-op if the scheduler node wasn't found. I wonder if this raise is a good idea? It would make these cases more noticeable, if they ever happen.
We don't need to explicitly raise, since this is unexpected behaviour and the resulting `CaseClauseError` would be sufficient for us to pinpoint the issue. We should always have a global scheduler present.
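For illustration, dropping the explicit raise would lean on Elixir's built-in error for an unmatched `case` clause (a sketch; `scheduler_node/0` is a hypothetical stand-in for however the node is looked up):

```elixir
# Sketch only: with no nil clause, a missing scheduler node surfaces as a
# CaseClauseError whose stacktrace points at this case expression,
# which is enough to pinpoint the issue without a custom raise.
case scheduler_node() do
  node when node != nil ->
    Cluster.Utils.rpc_call(node, func)
end
```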
```elixir
],
alerts_scheduler_sync: [
  run_strategy: Quantum.RunStrategy.Local,
  schedule: "0 * * * *",
```
I wonder if it's possible for an alert's schedule to be rarer than the `alerts_scheduler_sync` schedule, i.e. less frequent than once every 60 minutes? That would probably mean those jobs would never run, since in the current `do_sync_alert_jobs` they would keep getting re-added (which in Quantum doesn't seem (?) to execute the job right away but rather reschedules it, effectively postponing it indefinitely).
Yes, there are jobs that run on a minutely basis, or every hour.
Hmm, that behaviour would not be good. Perhaps separate sync schedule jobs: syncing once a day for hourly schedules, and once a minute for jobs that run more often than once an hour.
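One possible shape for the split sync schedules described above (a sketch; the `alerts_scheduler_sync_daily`/`alerts_scheduler_sync_frequent` job names are made up, mirroring the existing `alerts_scheduler_sync` entry):

```elixir
# Sketch only: two sync jobs with different cadences.
alerts_scheduler_sync_daily: [
  run_strategy: Quantum.RunStrategy.Local,
  # once a day, for alerts scheduled hourly or rarer
  schedule: "0 0 * * *"
],
alerts_scheduler_sync_frequent: [
  run_strategy: Quantum.RunStrategy.Local,
  # once a minute, for alerts scheduled more often than hourly
  schedule: "* * * * *"
]
```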
Actually, I was wrong and misunderstood how Quantum works. The original "wipe and re-add" approach is safe for all but the jobs coinciding with the `alerts_scheduler_sync` schedule (`"0 * * * *"`), so I think it still makes sense to make `do_sync_alert_jobs` a bit smarter. Right now I'm going with something like this:
```elixir
defp do_sync_alert_jobs do
  wanted_jobs = init_alert_jobs()
  wanted_jobs_set = MapSet.new(wanted_jobs, & &1.name)
  current_jobs = AlertsScheduler.jobs()

  # Delete jobs that are no longer wanted
  Enum.each(current_jobs, fn {name, _job} ->
    if not MapSet.member?(wanted_jobs_set, name) do
      AlertsScheduler.delete_job(name)
    end
  end)

  # Upsert all wanted jobs
  Enum.each(wanted_jobs, &AlertsScheduler.add_job/1)
end
```

```elixir
def create_alert_job_struct(%AlertQuery{} = alert_query) do
  %AlertQuery{id: alert_query_id, cron: cron} = alert_query

  if is_nil(alert_query_id) do
    raise ArgumentError, "AlertQuery is missing id: #{inspect(alert_query)}"
  end
```
Suggested change (replacing the explicit nil check with a guard):

```elixir
def create_alert_job_struct(%AlertQuery{id: alert_query_id, cron: cron} = alert_query)
    when alert_query_id != nil do
```
It's fine for the match to fail, since it effectively results in the same error being raised. The nice error message is a nice-to-have but not really necessary, since this is not expected behaviour.
@ruslandoga can you do a follow-up PR for the changes 🙏 Changes look great, thanks for digging into the scheduler behaviour.
Follow-up PR: #3087
`alert_query_id` instead of `%AlertQuery{}` struct in the Quantum job.
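With only the id stored, building the Quantum job might look roughly like this (a sketch based on Quantum's public job-builder API; the `:"alert-#{...}"` naming convention is assumed):

```elixir
# Sketch only: the job's task closure captures just the integer id, and the
# current AlertQuery is re-fetched inside run_alert/2 on every tick.
AlertsScheduler.new_job()
|> Quantum.Job.set_name(:"alert-#{alert_query_id}")
|> Quantum.Job.set_schedule(Crontab.CronExpression.Parser.parse!(cron))
|> Quantum.Job.set_task(fn -> Logflare.Alerting.run_alert(alert_query_id, :scheduled) end)
```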