feat: sync alert jobs #3059
Conversation
lib/logflare/alerting.ex
Outdated
```elixir
        :ok | {:error, :not_enabled} | {:error, :below_min_cluster_size}
def run_alert(alert_id, :scheduled) when is_integer(alert_id) do
  # sync the alert job for the next run
  sync_alert_job(alert_id)
```
It would be slightly better to return the alert job (if present) as an `:ok` tuple from the sync job; that would allow us to avoid a second query to re-fetch the alert job.
Which second query do you mean? `run_alert/2` below seems to operate on `AlertQuery`, not `Quantum.Job`, and the job only knows about `alert_id` now (since we no longer need to put `%AlertQuery{}` in the job if we re-fetch the alert definition each time).
`sync_alert_job` would perform one DB query on the `scheduler_node`, then perform another DB query at `get_alert_query_by` on line 285, so that would result in two DB queries being performed.

An alternative is to fetch the `alert_query` first and then pass it to `sync_alert_job`, reversing the order and reducing the DB queries to one.
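The reordering described above might look something like the following sketch. This is hypothetical: it assumes `run_alert/1` accepts an `%AlertQuery{}` struct and that `sync_alert_job/1` can take the pre-fetched struct instead of an id.

```elixir
# Sketch only: fetch the AlertQuery once and reuse it, so only one DB query runs.
def run_alert(alert_id, :scheduled) when is_integer(alert_id) do
  case get_alert_query_by(id: alert_id) do
    nil ->
      # the alert was deleted; nothing to sync or run
      :ok

    %AlertQuery{} = alert_query ->
      # pass the already-fetched struct so sync_alert_job/1 needs no second query
      sync_alert_job(alert_query)
      run_alert(alert_query)
  end
end
```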
Do we actually need to call `sync_alert_job`, which "re-adds" the job from `run_query`, or is it enough to unschedule the job for alerts that no longer exist, i.e. 4519e14? Or maybe it can just be a no-op, with sync then done periodically (every 60 minutes), but that has another problem: #3059 (comment)

Re-adding the job from `run_query/1` feels off for cron jobs somehow, but in light of #3059 (comment) it might be the only way to reliably re-sync active jobs. One possible problem with that approach: when a rare alert gets modified to become less rare, the update won't take effect until the alert runs at least once, which might be surprising.
I think I'd like to go with no syncing in `run_alert/1`, since it already uses the up-to-date alert query thanks to fetching it by id on each run (and being a no-op if it doesn't exist), and instead try to make `sync_alert_jobs` a bit smarter (rather than a "full" delete followed by a "full" insert) to avoid the problem in #3059 (comment).

UPDATE: done.
```elixir
  Cluster.Utils.rpc_call(node, func)

nil ->
  raise "Alerting scheduler node not found"
```
Previously that function could silently be a no-op if the scheduler node wasn't found. I wonder if this raise is a good idea? It would make these cases more noticeable, if they ever happen.
We don't need to explicitly raise, since this is unexpected behaviour and the resulting `CaseClauseError` would be sufficient for us to pinpoint the issue. We should always have a global scheduler present.
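For illustration, dropping the explicit raise would lean on Elixir's built-in error for an unmatched `case` clause (a sketch; `scheduler_node/0` is a hypothetical stand-in for however the node is looked up):

```elixir
# Sketch only: with no nil clause, a missing scheduler node surfaces as a
# CaseClauseError whose stacktrace points at this case expression,
# which is enough to pinpoint the issue without a custom raise.
case scheduler_node() do
  node when node != nil ->
    Cluster.Utils.rpc_call(node, func)
end
```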
```elixir
],
alerts_scheduler_sync: [
  run_strategy: Quantum.RunStrategy.Local,
  schedule: "0 * * * *",
```
I wonder if it's possible for an alert's schedule to be rarer than the `alerts_scheduler_sync` schedule, i.e. less frequent than once every 60 minutes? That would probably mean those jobs would never run, since in the current `do_sync_alert_jobs` they would keep getting re-added (which in Quantum doesn't seem (?) to execute the job right away but rather reschedules it, effectively postponing it indefinitely).
Yes, there are jobs that run on a minutely basis, or every hour.
Hmm, that behaviour would not be good. Perhaps separate sync schedule jobs: syncing once a day for hourly schedules, and once a minute for jobs that run more often than once an hour.
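One possible shape for the split sync schedules described above (a sketch; the `alerts_scheduler_sync_daily`/`alerts_scheduler_sync_frequent` job names are made up, mirroring the existing `alerts_scheduler_sync` entry):

```elixir
# Sketch only: two sync jobs with different cadences.
alerts_scheduler_sync_daily: [
  run_strategy: Quantum.RunStrategy.Local,
  # once a day, for alerts scheduled hourly or rarer
  schedule: "0 0 * * *"
],
alerts_scheduler_sync_frequent: [
  run_strategy: Quantum.RunStrategy.Local,
  # once a minute, for alerts scheduled more often than hourly
  schedule: "* * * * *"
]
```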
Actually, I was wrong and misunderstood how Quantum works. The original "wipe and re-add" approach is safe for all but the jobs coinciding with the `alerts_scheduler_sync` schedule (`"0 * * * *"`), so I think it still makes sense to make `do_sync_alert_jobs` a bit smarter. Right now I'm going with something like this:
```elixir
defp do_sync_alert_jobs do
  wanted_jobs = init_alert_jobs()
  wanted_jobs_set = MapSet.new(wanted_jobs, & &1.name)
  current_jobs = AlertsScheduler.jobs()

  # Delete jobs that are no longer wanted
  Enum.each(current_jobs, fn {name, _job} ->
    if not MapSet.member?(wanted_jobs_set, name) do
      AlertsScheduler.delete_job(name)
    end
  end)

  # Upsert all wanted jobs
  Enum.each(wanted_jobs, &AlertsScheduler.add_job/1)
end
```

```elixir
def create_alert_job_struct(%AlertQuery{} = alert_query) do
  %AlertQuery{id: alert_query_id, cron: cron} = alert_query

  if is_nil(alert_query_id) do
    raise ArgumentError, "AlertQuery is missing id: #{inspect(alert_query)}"
  end
```
Suggested change (replacing the explicit nil check with a guard):

```elixir
def create_alert_job_struct(%AlertQuery{id: alert_query_id, cron: cron} = alert_query)
    when alert_query_id != nil do
```
It's fine for the match to fail, since it effectively results in the same error being raised. The nice error message is a nice-to-have but not really necessary, since this is not expected behaviour.
@ruslandoga can you do a follow-up PR for the changes 🙏 Changes look great, thanks for digging into the scheduler behaviour.
Follow-up PR: #3087
`alert_query_id` instead of `%AlertQuery{}` struct in the Quantum job.
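With only the id stored, building the Quantum job might look roughly like this (a sketch based on Quantum's public job-builder API; the `:"alert-#{...}"` naming convention is assumed):

```elixir
# Sketch only: the job's task closure captures just the integer id, and the
# current AlertQuery is re-fetched inside run_alert/2 on every tick.
AlertsScheduler.new_job()
|> Quantum.Job.set_name(:"alert-#{alert_query_id}")
|> Quantum.Job.set_schedule(Crontab.CronExpression.Parser.parse!(cron))
|> Quantum.Job.set_task(fn -> Logflare.Alerting.run_alert(alert_query_id, :scheduled) end)
```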