Add new behavior to avoid races on config reload #4705
base: main
Conversation
Signed-off-by: Ethan Hunter <[email protected]>

Force-pushed fb35209 to abc1e0a
```go
// first, start the inhibitor so the inhibition cache can populate
// wait for this to load alerts before starting the dispatcher so
// we don't accidentally notify for an alert that will be inhibited
go inhibitor.Run()
```
I am wondering whether this is better, or whether we should instead load everything outside of a goroutine and then call Run?
Ah yeah, that actually would simplify things a little. The reason I did it this way was to allow this all to happen concurrently (and without waiting) if we wanted. However, in practice it's all sequential.
Maybe I could change this so that both the Inhibitor and Dispatcher have Load() methods? I think Run should still call Load(), but maybe we can make this work such that Load is a no-op after being called once.
> Maybe I could change this so that both the Inhibitor and Dispatcher have Load() methods? I think Run should still call Load(), but maybe we can make this work such that Load is a no-op after being called once.
I tried this, and it ended up pretty awkward. It might work, but not without a bigger refactoring. Basically, it becomes necessary to move the iterator into a struct field that's protected by the struct's global mutex. This is very awkward because we'd have to lock and unlock the mutex around every read from the iterator's channel.
The only easy fix would be splitting the `SlurpAndSubscribe` method into two, but that's not desirable: the entire benefit comes from being able to atomically separate alerts into "before the call" and "after the call" groups.
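As a minimal illustration of the "Load is a no-op after being called once" idea, a `sync.Once` makes this cheap to sketch. All names here are hypothetical, not the PR's actual types:

```go
package main

import (
	"fmt"
	"sync"
)

// loader sketches a component whose Load is safe to call both by the
// caller and by Run, and which becomes a no-op after the first call.
type loader struct {
	loadOnce sync.Once
	loaded   []string // stands in for the initial batch of alerts
}

// Load ingests the initial batch exactly once; later calls do nothing.
func (l *loader) Load(batch []string) {
	l.loadOnce.Do(func() {
		l.loaded = append(l.loaded, batch...)
	})
}

// Run would normally start the subscription loop; here it just calls
// Load to show that "caller loads first, then Run" doesn't double-ingest.
func (l *loader) Run(batch []string) {
	l.Load(batch)
	// ... consume the iterator here ...
}

func main() {
	l := &loader{}
	l.Load([]string{"a", "b"}) // caller loads synchronously
	l.Run([]string{"a", "b"})  // Run's internal Load is a no-op
	fmt.Println(len(l.loaded)) // 2, not 4
}
```

The awkward part the comment above describes is not shown here: in the real code the iterator returned by the provider would have to live in a struct field guarded by the mutex, which is what makes this refactoring unattractive.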
```go
// next, start the dispatcher and wait for it to load before swapping the disp pointer.
// This ensures that the API doesn't see the new dispatcher before it finishes populating
// the aggrGroups
go newDisp.Run()
```
Same here...
```go
// Implementation of SlurpAndSubscribe is optional - providers may choose to
// return an empty list for the first return value and the result of Subscribe
// for the second return value.
SlurpAndSubscribe(name string) ([]*types.Alert, AlertIterator)
```
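To make the intended contract concrete, here is a minimal sketch of a mem-style implementation of this shape, with `types.Alert` simplified to `string` and `AlertIterator` to a channel. `memAlerts` and its methods are hypothetical, not the PR's actual `mem.Alerts` code:

```go
package main

import (
	"fmt"
	"sync"
)

// memAlerts is a toy in-memory provider.
type memAlerts struct {
	mu     sync.Mutex
	alerts []string
	subs   []chan string
}

// SlurpAndSubscribe returns everything currently in memory as one batch,
// plus an iterator that yields only alerts stored after this call. Taking
// the snapshot and registering the subscription under a single lock
// acquisition is what makes the "before the call" / "after the call"
// split atomic.
func (m *memAlerts) SlurpAndSubscribe() ([]string, <-chan string) {
	m.mu.Lock()
	defer m.mu.Unlock()
	snapshot := append([]string(nil), m.alerts...)
	ch := make(chan string, 16)
	m.subs = append(m.subs, ch)
	return snapshot, ch
}

// Put stores an alert and fans it out to all subscribers.
func (m *memAlerts) Put(a string) {
	m.mu.Lock()
	defer m.mu.Unlock()
	m.alerts = append(m.alerts, a)
	for _, ch := range m.subs {
		ch <- a
	}
}

func main() {
	m := &memAlerts{}
	m.Put("existing")
	batch, it := m.SlurpAndSubscribe()
	m.Put("new")
	fmt.Println(batch, <-it) // [existing] new
}
```

A provider that skips the optional snapshot would simply return `nil` for the first value, which degrades to today's `Subscribe` behavior.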
Ok, the name is funny, but maybe `SnapshotAndSubscribe`?
Alternative names: `SubscribeWithHistory()` or `SubscribeWithReplay()`.
Also, why not modify `Subscribe()`? On startup it has no alerts, so we can always send whatever is already in memory.
Sure, happy to change the name. I think the origin of this name was "slurp up all the existing alerts" 😆

> Also why not modify Subscribe()? On startup it has no alerts, so we can always send if anything is in memory.

I wanted to avoid every possible consumer needing to make a code change after this update. However, we already made a breaking change to `Subscribe` in the last few weeks, so maybe we can sneak this one in too.
I know there are a few projects out there which link alertmanager and depend on these interfaces...
When the config is reloaded, alertmanager re-creates the `inhibit.Inhibitor` and the `dispatch.Dispatcher`. This is necessary because both the `Inhibitor` and the `Dispatcher` have internal state which depends on the config.

In theory, the rebuild should be fast because the `provider.Alerts` still contains all the active alerts in memory. In practice, it causes some problems. Since the `Inhibitor` and `Dispatcher` both ingest alerts one at a time on separate goroutines, their state is built concurrently.

The `Inhibitor` works by building an internal cache of inhibiting alerts. If an active alert is missing from the `Inhibitor`'s cache, it does not cause inhibitions. This means that an inhibitor with a partially built cache can erroneously return `false` from `Inhibitor.Mutes`.

The `Dispatcher` is responsible for building `aggrGroup`s, which in turn flush alerts into the notification pipeline. The notification pipeline calls `Inhibitor.Mutes` to avoid notifying for inhibited alerts.

This all comes together to cause a problem: if the `Dispatcher` builds an `aggrGroup` that contains inhibited alerts before the `Inhibitor` has processed those alerts, an incorrect notification may fire. In the worst case, if the `Dispatcher` builds any `aggrGroup` before the `Inhibitor` has completed building its cache, we could see incorrect notifications. Essentially, config reloads cause a race condition between the `Dispatcher` and `Inhibitor` internal caches.

#4704 largely solves this problem for alertmanager restarts by causing the `Dispatcher` to delay sending alerts after startup. However, that isn't desirable during config reloads, for a number of reasons. Most importantly, config reloads need to be applied to the entire alertmanager cluster all at once, so any artificial delay holds back all notifications from the cluster. In practice, we've seen this as a spike of notifications for inhibited alerts right after a config reload.

Another related problem is the API. If an API function calls `Dispatcher.Groups` right after a config reload, it might see fewer groups than the in-memory state of alerts would actually create. This is because the `disp` pointer that the API uses is swapped as soon as the new `Dispatcher` is constructed. In practice, we've seen this as the `/alerts/groups` endpoint returning nothing right after a config reload.

This PR adds new mechanisms to avoid all these race conditions. Since the `provider.Alerts` isn't reconstructed, we just need the `Dispatcher` to wait for the `Inhibitor` to process all the alerts which are already in the `provider.Alerts`. Unfortunately, there's no interface to do that. This PR:

- Adds a `provider.Alerts.SlurpAndSubscribe` method which allows implementations of `provider.Alerts` to return a batch of alerts to the caller immediately, rather than one at a time through a `provider.AlertIterator`. The implementation in `mem.Alerts` is very simple: just return everything that's currently in memory as a batch and then construct the iterator as normal.
- Adds a way for the `Inhibitor` and `Dispatcher` to indicate whether they're done loading. This just uses a `sync.WaitGroup` internal to each struct. Both are updated to call `SlurpAndSubscribe` and to signal that loading is finished once they've processed the initial batch.
- Sequences the reloads of the `Dispatcher` and `Inhibitor` during a config reload: reload the `Inhibitor` first, and then reload the `Dispatcher` when that's finished.
- Delays swapping the `disp` pointer to the newly reconstructed `Dispatcher` until it's done loading. This prevents the API from seeing the wrong number of groups when it calls `Dispatcher.Groups`.
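The reload sequencing described above can be sketched roughly as follows. This is a hypothetical illustration, not the PR's actual code: `component`, `newComponent`, and `WaitForReady` are invented names standing in for the `Inhibitor`/`Dispatcher` and whatever the internal `sync.WaitGroup` exposes:

```go
package main

import (
	"fmt"
	"sync"
)

// component stands in for the Inhibitor / Dispatcher: Run processes an
// initial batch, then flags readiness via an internal WaitGroup.
type component struct {
	name  string
	ready sync.WaitGroup
}

func newComponent(name string) *component {
	c := &component{name: name}
	c.ready.Add(1)
	return c
}

func (c *component) Run(batch []string) {
	_ = batch      // process the initial batch from SlurpAndSubscribe ...
	c.ready.Done() // loading finished; the subscription loop would follow
}

// WaitForReady blocks until Run has processed the initial batch.
func (c *component) WaitForReady() { c.ready.Wait() }

func main() {
	batch := []string{"a", "b"}

	// 1. start the new inhibitor and wait for its cache to populate
	inhibitor := newComponent("inhibitor")
	go inhibitor.Run(batch)
	inhibitor.WaitForReady()

	// 2. only then start the new dispatcher, and wait for its aggrGroups
	newDisp := newComponent("dispatcher")
	go newDisp.Run(batch)
	newDisp.WaitForReady()

	// 3. only now swap the pointer the API reads, so Dispatcher.Groups
	//    never observes a half-built dispatcher
	disp := newDisp
	fmt.Println(disp.name)
}
```

The key property is that each wait happens before the next component starts, so the inhibition cache is complete before any `aggrGroup` can flush, and the API pointer swap is the last step.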