
Conversation

@Spaceman1701 (Contributor)
When the config is reloaded, alertmanager re-creates the inhibit.Inhibitor and the dispatch.Dispatcher. This is necessary because both the Inhibitor and Dispatcher have internal state which depends on the config.

In theory, the re-build should be fast because the provider.Alerts still contains all the active alerts in memory. In practice, it causes some problems. Since the Inhibitor and Dispatcher both ingest alerts one at a time on separate goroutines, their state is built concurrently.

The Inhibitor works by building an internal cache of inhibiting alerts. If an active alert is missing from the Inhibitor's cache, it does not cause inhibitions. This means that an inhibitor with a partially built cache can erroneously return false from Inhibitor.Mutes.

The Dispatcher is responsible for building aggrGroups which in turn flush alerts into the notification pipeline. The notification pipeline calls Inhibitor.Mutes to prevent notifying for inhibited alerts.

This all comes together to cause a problem: if the Dispatcher builds an aggrGroup containing inhibited alerts before the Inhibitor has processed those alerts, an incorrect notification may fire. In the worst case, if the Dispatcher builds any aggrGroup before the Inhibitor has finished building its cache, we could see incorrect notifications. Essentially, config reloads cause a race condition between the Dispatcher's and Inhibitor's internal caches.

#4704 largely solves this problem for alertmanager restarts by having the Dispatcher delay sending alerts after startup. However, that approach isn't desirable during config reloads, for a number of reasons. Most importantly, config reloads need to be applied to the entire alertmanager cluster at once, so any artificial delay would hold back every notification from the cluster. In practice, we've seen this problem as a spike of notifications for inhibited alerts right after a config reload.

Another related problem is the API. If an API handler calls Dispatcher.Groups right after a config reload, it might see fewer groups than the in-memory state of alerts would actually create. This is because the disp pointer the API uses is swapped as soon as the new Dispatcher is constructed. In practice, we've seen this as the /alerts/groups endpoint returning nothing right after a config reload.

This PR adds new mechanisms to avoid all these race conditions. Since the provider.Alerts isn't reconstructed, we just need the Dispatcher to wait for the Inhibitor to process all the alerts which are already in the provider.Alerts. Unfortunately, there's no interface to do that. This PR adds

  1. A new provider.Alerts.SlurpAndSubscribe method, which allows implementations of provider.Alerts to return a batch of alerts to the caller immediately, rather than one at a time through a provider.AlertIterator. The implementation in mem.Alerts is very simple: return everything that's currently in memory as a batch, then construct the iterator as normal.
  2. Mechanisms in both the Inhibitor and Dispatcher to indicate whether they're done loading, using a sync.WaitGroup internal to each struct. Both are updated to call SlurpAndSubscribe and to signal that loading is finished once they've processed the initial batch.
  3. Re-order the logic for reconstructing the Dispatcher and Inhibitor during a config reload.
    1. Reload the Inhibitor first, and then reload the Dispatcher when that's finished
    2. Don't swap the disp pointer to the newly reconstructed Dispatcher until it's done loading. This prevents the API from seeing the wrong number of groups when it calls Dispatcher.Groups.
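The core of point 1 can be sketched in a few lines of Go. This is a toy model, not the PR's code: Alert, memAlerts, and the channel-based subscription stand in for types.Alert, mem.Alerts, and provider.AlertIterator. The point is that the batch and the subscription are created under a single lock, so every alert is atomically classified as either "before the call" or "after the call".

```go
package main

import (
	"fmt"
	"sync"
)

// Alert stands in for types.Alert.
type Alert struct{ Name string }

// memAlerts is a toy stand-in for mem.Alerts.
type memAlerts struct {
	mu     sync.Mutex
	alerts []*Alert
	subs   []chan *Alert
}

// Put stores an alert and fans it out to existing subscribers.
func (m *memAlerts) Put(a *Alert) {
	m.mu.Lock()
	defer m.mu.Unlock()
	m.alerts = append(m.alerts, a)
	for _, s := range m.subs {
		s <- a
	}
}

// SlurpAndSubscribe returns everything currently in memory as one batch,
// plus a subscription for alerts that arrive afterwards. Doing both under
// one lock is what makes the before/after split atomic.
func (m *memAlerts) SlurpAndSubscribe() ([]*Alert, <-chan *Alert) {
	m.mu.Lock()
	defer m.mu.Unlock()
	batch := append([]*Alert(nil), m.alerts...)
	ch := make(chan *Alert, 16)
	m.subs = append(m.subs, ch)
	return batch, ch
}

func main() {
	p := &memAlerts{}
	p.Put(&Alert{Name: "pre-existing"})

	batch, updates := p.SlurpAndSubscribe()
	p.Put(&Alert{Name: "late"}) // arrives after the snapshot

	fmt.Println(len(batch), batch[0].Name, (<-updates).Name) // prints "1 pre-existing late"
}
```

A consumer that processes the batch and then signals a sync.WaitGroup has a well-defined point at which its state covers every alert that existed before the call.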

@Spaceman1701 force-pushed the feature/inhibitor-race-fix branch from fb35209 to abc1e0a on November 6, 2025 at 17:56.
Signed-off-by: Ethan Hunter <[email protected]>
// first, start the inhibitor so the inhibition cache can populate
// wait for this to load alerts before starting the dispatcher so
// we don't accidentally notify for an alert that will be inhibited
go inhibitor.Run()
Collaborator

I am wondering whether this is better, or whether we should instead load everything outside a goroutine and then call Run?

Contributor Author

Ah yeah, that actually would simplify things a little. The reason I did it this way was to allow this all to happen concurrently (and without waiting) if we wanted. However, in practice it's all sequential.

Maybe I could change this so that both the Inhibitor and Dispatcher have Load() methods? I think Run should still call Load(), but maybe we can make this work such that Load is a no-op after being called once.
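The "Load is a no-op after being called once" idea floated here could lean on sync.Once. A hedged sketch, not the PR's actual code (inhibitor, Load, and source are all illustrative names):

```go
package main

import (
	"fmt"
	"sync"
)

// inhibitor is an illustrative stand-in; the idea is that Load populates
// internal state exactly once, and Run calls Load before its event loop.
type inhibitor struct {
	once   sync.Once
	cache  []string
	source func() []string // stands in for draining the initial alert batch
}

// Load is a no-op after its first call, so the reload path and Run can
// both call it safely.
func (i *inhibitor) Load() {
	i.once.Do(func() {
		i.cache = i.source()
	})
}

// Run would normally loop over the subscription; here it just loads.
func (i *inhibitor) Run() {
	i.Load()
}

func main() {
	calls := 0
	inh := &inhibitor{source: func() []string {
		calls++
		return []string{"inhibiting-alert"}
	}}
	inh.Load() // explicit load during a config reload
	inh.Run()  // Load inside Run is now a no-op
	fmt.Println("cache built", calls, "time(s), size", len(inh.cache))
}
```

As the follow-up comment notes, the awkward part in the real code is that the iterator returned alongside the batch would have to live in a mutex-protected struct field, which this sketch glosses over.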

Contributor Author

Maybe I could change this so that both the Inhibitor and Dispatcher have Load() methods? I think Run should still call Load(), but maybe we can make this work such that Load is a no-op after being called once.

I tried this, and it ended up pretty awkward. It might work, but not without a bigger refactoring. Basically, it becomes necessary to move the iterator into a struct field that's protected by the struct's global mutex. This is very awkward because we'd have to lock and unlock the mutex around reads of the iterator's channel.

The only easy fix would be splitting the SlurpAndSubscribe method in two, but that's not desirable because the entire benefit comes from being able to atomically separate alerts into "before the call" and "after the call" groups.

// next, start the dispatcher and wait for it to load before swapping the disp pointer.
// This ensures that the API doesn't see the new dispatcher before it finishes populating
// the aggrGroups
go newDisp.Run()
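The reload sequence these two fragments describe can be modeled end to end. A toy sketch, not the PR's code: component and its WaitGroup stand in for the Inhibitor/Dispatcher "done loading" signal the PR adds.

```go
package main

import (
	"fmt"
	"sync"
)

// component models an Inhibitor or Dispatcher with a "done loading"
// signal, as the PR adds via an internal sync.WaitGroup.
type component struct {
	name   string
	loaded sync.WaitGroup
}

func newComponent(name string) *component {
	c := &component{name: name}
	c.loaded.Add(1)
	return c
}

// Run processes the initial batch, signals that loading is done, and
// would then keep consuming the subscription.
func (c *component) Run(events chan<- string) {
	events <- c.name + " loaded"
	c.loaded.Done()
}

func main() {
	events := make(chan string, 3)

	inhibitor := newComponent("inhibitor")
	go inhibitor.Run(events)
	inhibitor.loaded.Wait() // inhibition cache now covers in-memory alerts

	dispatcher := newComponent("dispatcher")
	go dispatcher.Run(events)
	dispatcher.loaded.Wait() // aggrGroups now reflect in-memory alerts

	events <- "disp pointer swapped" // only now does the API see the new Dispatcher
	close(events)
	for e := range events {
		fmt.Println(e)
	}
}
```

The Wait calls are what serialize the reload: the dispatcher cannot start loading before the inhibition cache is complete, and the pointer swap cannot happen before the aggrGroups exist.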
Collaborator

Same here...

// Implementation of SlurpAndSubscribe is optional - providers may choose to
// return an empty list for the first return value and the result of Subscribe
// for the second return value.
SlurpAndSubscribe(name string) ([]*types.Alert, AlertIterator)
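The opt-out described in that doc comment could look like the following sketch. remoteAlerts and its feed are hypothetical; real providers implement the full provider.Alerts interface with types.Alert and AlertIterator.

```go
package main

import "fmt"

// Alert stands in for types.Alert.
type Alert struct{ Name string }

// remoteAlerts is a hypothetical provider with no slurp-able in-memory set.
type remoteAlerts struct{ feed chan *Alert }

// Subscribe returns the live feed of alerts.
func (p *remoteAlerts) Subscribe(name string) <-chan *Alert { return p.feed }

// SlurpAndSubscribe opts out of slurping: an empty batch plus the normal
// subscription satisfies the contract.
func (p *remoteAlerts) SlurpAndSubscribe(name string) ([]*Alert, <-chan *Alert) {
	return nil, p.Subscribe(name)
}

func main() {
	p := &remoteAlerts{feed: make(chan *Alert, 1)}
	batch, it := p.SlurpAndSubscribe("dispatcher")
	p.feed <- &Alert{Name: "late"}
	fmt.Println(len(batch), (<-it).Name) // prints "0 late"
}
```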
Collaborator

Ok the name is funny, but maybe SnapshotAndSubscribe?

Collaborator

Alternative names SubscribeWithHistory() or SubscribeWithReplay().
Also why not modify Subscribe()? On startup it has no alerts, so we can always send if anything is in memory.

Contributor Author

Sure, happy to change the name. I think the origin of this name was "slurp up all the existing alerts" 😆

Also why not modify Subscribe()? On startup it has no alerts, so we can always send if anything is in memory.

I wanted to avoid every possible consumer needing to make a code change after this update. However, we already made a breaking change to Subscribe in the last few weeks, so maybe we can sneak this in there too.

I know there are a few projects out there which link against alertmanager and depend on these interfaces...

