
Conversation

@Spaceman1701 (Contributor)
When the config is reloaded, alertmanager re-creates the inhibit.Inhibitor and the dispatch.Dispatcher. This is necessary because both the Inhibitor and Dispatcher have internal state which depends on the config.

In theory, the re-build should be fast because the provider.Alerts still contains all the active alerts in memory. In practice, it causes some problems. Since the Inhibitor and Dispatcher both ingest alerts one at a time on separate goroutines, their state is built concurrently.

The Inhibitor works by building an internal cache of inhibiting alerts. If an active alert is missing from the Inhibitor's cache, it does not cause inhibitions. This means that an inhibitor with a partially built cache can erroneously return false from Inhibitor.Mutes.

The Dispatcher is responsible for building aggrGroups which in turn flush alerts into the notification pipeline. The notification pipeline calls Inhibitor.Mutes to prevent notifying for inhibited alerts.

This all comes together to cause a problem: if the Dispatcher builds an aggrGroup containing inhibited alerts before the Inhibitor has processed those alerts, an incorrect notification may fire. In the worst case, if the Dispatcher builds any aggrGroup before the Inhibitor has finished building its cache, we could see incorrect notifications. Essentially, config reloads cause a race condition between the Dispatcher's and Inhibitor's internal caches.

#4704 largely solves this problem for alertmanager restarts by having the Dispatcher delay sending alerts after startup. However, that approach isn't desirable during config reloads, for a number of reasons. Most importantly, config reloads need to be applied to the entire alertmanager cluster at once, so any artificial delay would hold back every notification from the cluster. In practice, we've seen this problem as a spike of notifications for inhibited alerts right after a config reload.

Another related problem is the API. If an API handler calls Dispatcher.Groups right after a config reload, it might see fewer groups than the in-memory state of alerts would actually create. This is because the disp pointer the API uses is swapped as soon as the new Dispatcher is constructed. In practice, we've seen this as the /alerts/groups endpoint returning nothing right after a config reload.

This PR adds new mechanisms to avoid all these race conditions. Since the provider.Alerts isn't reconstructed, we just need the Dispatcher to wait for the Inhibitor to process all the alerts which are already in the provider.Alerts. Unfortunately, there's no interface to do that. This PR adds

  1. A new provider.Alerts.SlurpAndSubscribe method, which allows implementations of provider.Alerts to return a batch of alerts to the caller immediately, rather than one at a time through a provider.AlertIterator. The implementation in mem.Alerts is very simple: return everything that's currently in memory as a batch, then construct the iterator as normal.
  2. Mechanisms in both the Inhibitor and Dispatcher to indicate whether they're done loading, using a sync.WaitGroup internal to each struct. Both are updated to call SlurpAndSubscribe and to signal that loading is finished once they've processed the initial batch.
  3. Re-order the logic for reconstructing the Dispatcher and Inhibitor during a config reload.
    1. Reload the Inhibitor first, and then reload the Dispatcher when that's finished
    2. Don't swap the disp pointer to the newly reconstructed Dispatcher until it's done loading. This prevents the API from seeing the wrong number of groups when it calls Dispatcher.Groups.
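The core of point 1 can be sketched in a few lines of Go. This is a toy model, not the PR's code: Alert, memAlerts, and the channel-based subscription stand in for types.Alert, mem.Alerts, and provider.AlertIterator. The point is that the batch and the subscription are created under a single lock, so every alert is atomically classified as either "before the call" or "after the call".

```go
package main

import (
	"fmt"
	"sync"
)

// Alert stands in for types.Alert.
type Alert struct{ Name string }

// memAlerts is a toy stand-in for mem.Alerts.
type memAlerts struct {
	mu     sync.Mutex
	alerts []*Alert
	subs   []chan *Alert
}

// Put stores an alert and fans it out to existing subscribers.
func (m *memAlerts) Put(a *Alert) {
	m.mu.Lock()
	defer m.mu.Unlock()
	m.alerts = append(m.alerts, a)
	for _, s := range m.subs {
		s <- a
	}
}

// SlurpAndSubscribe returns everything currently in memory as one batch,
// plus a subscription for alerts that arrive afterwards. Doing both under
// one lock is what makes the before/after split atomic.
func (m *memAlerts) SlurpAndSubscribe() ([]*Alert, <-chan *Alert) {
	m.mu.Lock()
	defer m.mu.Unlock()
	batch := append([]*Alert(nil), m.alerts...)
	ch := make(chan *Alert, 16)
	m.subs = append(m.subs, ch)
	return batch, ch
}

func main() {
	p := &memAlerts{}
	p.Put(&Alert{Name: "pre-existing"})

	batch, updates := p.SlurpAndSubscribe()
	p.Put(&Alert{Name: "late"}) // arrives after the snapshot

	fmt.Println(len(batch), batch[0].Name, (<-updates).Name) // prints "1 pre-existing late"
}
```

A consumer that processes the batch and then signals a sync.WaitGroup has a well-defined point at which its state covers every alert that existed before the call.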

@Spaceman1701 force-pushed the feature/inhibitor-race-fix branch from fb35209 to abc1e0a on November 6, 2025 at 17:56.
Signed-off-by: Ethan Hunter <[email protected]>
// first, start the inhibitor so the inhibition cache can populate
// wait for this to load alerts before starting the dispatcher so
// we don't accidentally notify for an alert that will be inhibited
go inhibitor.Run()
Collaborator

I am wondering whether this is better, or whether we should instead load everything outside a goroutine and then call Run?

Contributor Author

Ah yeah, that actually would simplify things a little. The reason I did it this way was to allow this all to happen concurrently (and without waiting) if we wanted. However, in practice it's all sequential.

Maybe I could change this so that both the Inhibitor and Dispatcher have Load() methods? I think Run should still call Load(), but maybe we can make this work such that Load is a no-op after being called once.
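The "Load is a no-op after being called once" idea floated here could lean on sync.Once. A hedged sketch, not the PR's actual code (inhibitor, Load, and source are all illustrative names):

```go
package main

import (
	"fmt"
	"sync"
)

// inhibitor is an illustrative stand-in; the idea is that Load populates
// internal state exactly once, and Run calls Load before its event loop.
type inhibitor struct {
	once   sync.Once
	cache  []string
	source func() []string // stands in for draining the initial alert batch
}

// Load is a no-op after its first call, so the reload path and Run can
// both call it safely.
func (i *inhibitor) Load() {
	i.once.Do(func() {
		i.cache = i.source()
	})
}

// Run would normally loop over the subscription; here it just loads.
func (i *inhibitor) Run() {
	i.Load()
}

func main() {
	calls := 0
	inh := &inhibitor{source: func() []string {
		calls++
		return []string{"inhibiting-alert"}
	}}
	inh.Load() // explicit load during a config reload
	inh.Run()  // Load inside Run is now a no-op
	fmt.Println("cache built", calls, "time(s), size", len(inh.cache))
}
```

As the follow-up comment notes, the awkward part in the real code is that the iterator returned alongside the batch would have to live in a mutex-protected struct field, which this sketch glosses over.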

Contributor Author

Maybe I could change this so that both the Inhibitor and Dispatcher have Load() methods? I think Run should still call Load(), but maybe we can make this work such that Load is a no-op after being called once.

I tried this, and it ended up pretty awkward. It might work, but not without a bigger refactoring. Basically, it becomes necessary to move the iterator into a struct field that's protected by the struct's global mutex. This is very awkward because we'd have to lock and unlock the mutex around reads of the iterator's channel.

The only easy fix would be splitting the SlurpAndSubscribe method in two, but that's not desirable because the entire benefit comes from being able to atomically separate alerts into "before the call" and "after the call" groups.

// next, start the dispatcher and wait for it to load before swapping the disp pointer.
// This ensures that the API doesn't see the new dispatcher before it finishes populating
// the aggrGroups
go newDisp.Run()
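The reload sequence these two fragments describe can be modeled end to end. A toy sketch, not the PR's code: component and its WaitGroup stand in for the Inhibitor/Dispatcher "done loading" signal the PR adds.

```go
package main

import (
	"fmt"
	"sync"
)

// component models an Inhibitor or Dispatcher with a "done loading"
// signal, as the PR adds via an internal sync.WaitGroup.
type component struct {
	name   string
	loaded sync.WaitGroup
}

func newComponent(name string) *component {
	c := &component{name: name}
	c.loaded.Add(1)
	return c
}

// Run processes the initial batch, signals that loading is done, and
// would then keep consuming the subscription.
func (c *component) Run(events chan<- string) {
	events <- c.name + " loaded"
	c.loaded.Done()
}

func main() {
	events := make(chan string, 3)

	inhibitor := newComponent("inhibitor")
	go inhibitor.Run(events)
	inhibitor.loaded.Wait() // inhibition cache now covers in-memory alerts

	dispatcher := newComponent("dispatcher")
	go dispatcher.Run(events)
	dispatcher.loaded.Wait() // aggrGroups now reflect in-memory alerts

	events <- "disp pointer swapped" // only now does the API see the new Dispatcher
	close(events)
	for e := range events {
		fmt.Println(e)
	}
}
```

The Wait calls are what serialize the reload: the dispatcher cannot start loading before the inhibition cache is complete, and the pointer swap cannot happen before the aggrGroups exist.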
Collaborator

Same here...

// Implementation of SlurpAndSubscribe is optional - providers may choose to
// return an empty list for the first return value and the result of Subscribe
// for the second return value.
SlurpAndSubscribe(name string) ([]*types.Alert, AlertIterator)
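The opt-out described in that doc comment could look like the following sketch. remoteAlerts and its feed are hypothetical; real providers implement the full provider.Alerts interface with types.Alert and AlertIterator.

```go
package main

import "fmt"

// Alert stands in for types.Alert.
type Alert struct{ Name string }

// remoteAlerts is a hypothetical provider with no slurp-able in-memory set.
type remoteAlerts struct{ feed chan *Alert }

// Subscribe returns the live feed of alerts.
func (p *remoteAlerts) Subscribe(name string) <-chan *Alert { return p.feed }

// SlurpAndSubscribe opts out of slurping: an empty batch plus the normal
// subscription satisfies the contract.
func (p *remoteAlerts) SlurpAndSubscribe(name string) ([]*Alert, <-chan *Alert) {
	return nil, p.Subscribe(name)
}

func main() {
	p := &remoteAlerts{feed: make(chan *Alert, 1)}
	batch, it := p.SlurpAndSubscribe("dispatcher")
	p.feed <- &Alert{Name: "late"}
	fmt.Println(len(batch), (<-it).Name) // prints "0 late"
}
```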
Collaborator

Ok the name is funny, but maybe SnapshotAndSubscribe?

Collaborator

Alternative names SubscribeWithHistory() or SubscribeWithReplay().
Also why not modify Subscribe()? On startup it has no alerts, so we can always send if anything is in memory.

Contributor Author

Sure, happy to change the name. I think the origin of this name was "slurp up all the existing alerts" 😆

Also why not modify Subscribe()? On startup it has no alerts, so we can always send if anything is in memory.

I wanted to avoid every possible consumer needing to make a code change after this update. However, we already made a breaking change to Subscribe in the last few weeks, so maybe we can sneak this in there too.

I know there are a few projects out there which link against alertmanager and depend on these interfaces...

