
Conversation

@siavashs (Contributor) commented Nov 6, 2025

This change adds a new command-line flag, --dispatch.start-delay, which corresponds to the --rules.alert.resend-delay flag in Prometheus.
That flag controls the minimum amount of time that Prometheus waits before resending an alert to Alertmanager.

By adding this value to the start time of Alertmanager, we delay the aggregation groups' first flush until we are confident that all alerts have been resent by the Prometheus instances.

This should help avoid race conditions in inhibitions after a (re)start.

Other improvements:

  • remove hasFlushed flag from aggrGroup
  • remove mutex locking from aggrGroup
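For illustration, a minimal sketch of the timing described above; the variable names and the plain time.Sleep are assumptions for readability, not the PR's actual wiring:

```go
package main

import (
	"fmt"
	"time"
)

// Sketch only: hold back the dispatcher's first flush until
// processStart + startDelay, where startDelay mirrors Prometheus'
// --rules.alert.resend-delay, so all active alerts have been resent
// before aggregation groups flush and inhibitions are evaluated.
func main() {
	processStart := time.Now()
	startDelay := time.Minute // e.g. --dispatch.start-delay=1m, matching Prometheus' default resend delay

	dispatchAfter := processStart.Add(startDelay)
	if wait := time.Until(dispatchAfter); wait > 0 {
		fmt.Printf("delaying first flush by %s\n", wait.Round(time.Second))
		time.Sleep(wait)
	}
	// ... aggregation groups may flush from here on ...
}
```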

@siavashs siavashs changed the title feat(dispatch): honor group_wait on first flush & sync with Prometheus' --alerts.resend-delay feat(dispatch): honor group_wait on first flush & sync with Prometheus' --rules.alerts.resend-delay Nov 6, 2025
@Spaceman1701 (Contributor) left a comment


Would you mind splitting this into two PRs? One that adds the --alerts.resend-delay and one that adds the wait_on_startup config to the route?

I'm asking because I think the --alerts.resend-delay is something we should definitely merge, but I'm a little concerned about wait_on_startup.

From the description in the PR, it seems like these are both aimed at solving the same problem - the inhibitor and the dispatcher race on Alertmanager restart because Alertmanager has to wait for Prometheus to resend alerts. resend-delay seems to address this directly, while wait_on_startup seems more like a hack - there's no guarantee that group_wait is the right duration to wait after a restart. Additionally, group_wait is intended to express the user's logic, not handle the protocol between Alertmanager and Prometheus. I wouldn't want to give users competing concerns around what value to use for group_wait.

Is there any other use case you envision for wait_on_startup that I might be missing?

// alert is already over.
ag.mtx.Lock()
defer ag.mtx.Unlock()
if !ag.hasFlushed && alert.StartsAt.Add(ag.opts.GroupWait).Before(time.Now()) {
Contributor

Somewhat unrelated to this change, but I noticed it when reviewing the new code - I think there's a very minor logic bug here - if an alert's StartsAt is in the past, but not at least ag.opts.GroupWait in the past, I think we should check if the next flush is before or after it would be scheduled purely from the new alert. If it's after, we should reset the timer to that duration. I don't think we're keeping track of the next flush time outside of the timer, so that'd need to change too 🤔

E.g.

// Flush no later than the new alert's own deadline (StartsAt + GroupWait).
wantedFlush := time.Until(alert.StartsAt.Add(ag.opts.GroupWait))
if wantedFlush < time.Duration(0) {
	wantedFlush = time.Duration(0)
}
// durationToNextFlush() would need to be added; the group doesn't track this today.
actualFlush := ag.durationToNextFlush()
if wantedFlush < actualFlush {
	timer.Reset(wantedFlush)
}

I don't think we should change the behavior in this PR though. Perhaps as a follow up.

Contributor Author

Good catch, we can add it here or as a follow up.

@juliusv (Member) commented Nov 7, 2025

there's no guarantee that group_wait is the right duration to wait after a restart.

That's what I was thinking as well: some people may even have a group_wait: 1d for low-prio grouped alerts. Then you would never get any alerts if you restarted Alertmanager once a day, right?

@siavashs (Contributor Author) commented Nov 7, 2025

I'm dropping the WaitOnStartup option as we never used it internally, and based on the comments it can be tricky if a user uses a long group_wait value.

@siavashs siavashs changed the title feat(dispatch): honor group_wait on first flush & sync with Prometheus' --rules.alerts.resend-delay feat(dispatch): sync with Prometheus resend delay Nov 7, 2025
@siavashs (Contributor Author) commented Nov 7, 2025

We are now failing this test, which is vague; I remember debugging it before but not documenting it:

func TestReload(t *testing.T) {
	t.Parallel()

	// This integration test ensures that the first alert isn't notified twice
	// and repeat_interval applies after the AlertManager process has been
	// reloaded.
	conf := `
route:
  receiver: "default"
  group_by: []
  group_wait: 1s
  group_interval: 6s
  repeat_interval: 10m

receivers:
- name: "default"
  webhook_configs:
  - url: 'http://%s'
`

	at := NewAcceptanceTest(t, &AcceptanceOpts{
		Tolerance: 150 * time.Millisecond,
	})

	co := at.Collector("webhook")
	wh := NewWebhook(t, co)

	amc := at.AlertmanagerCluster(fmt.Sprintf(conf, wh.Address()), 1)

	amc.Push(At(1), Alert("alertname", "test1"))
	at.Do(At(3), amc.Reload)
	amc.Push(At(4), Alert("alertname", "test2"))

	co.Want(Between(2, 2.5), Alert("alertname", "test1").Active(1))
	// Timers are reset on reload regardless, so we count the 6 second group
	// interval from 3 onwards.
	co.Want(Between(9, 9.5),
		Alert("alertname", "test1").Active(1),
		Alert("alertname", "test2").Active(4),
	)

	at.Run()
	t.Log(co.Check())
}

@ultrotter (Contributor)

We are now failing this test, which is vague; I remember debugging it before but not documenting it: [TestReload quoted above]

Maybe checking the hasFlushed condition would help this test too? After all, that is exactly what should prevent notifying twice?

@siavashs (Contributor Author) commented Nov 7, 2025

Maybe checking the hasFlushed condition would help this test too? After all, that is exactly what should prevent notifying twice?

So the problem is not a duplicate notification but an earlier-than-expected notification:

        interval [2,2.5]
        ---
        - &{map[] 0001-01-01T00:00:00.000Z <nil> [] 0001-01-01T00:00:01.000Z <nil> <nil> { map[alertname:test1]}}[-9.223372036854776e+09:]
          [ ✓ ]
        interval [9,9.5]
        ---
        - &{map[] 0001-01-01T00:00:00.000Z <nil> [] 0001-01-01T00:00:01.000Z <nil> <nil> { map[alertname:test1]}}[-9.223372036854776e+09:]
        - &{map[] 0001-01-01T00:00:00.000Z <nil> [] 0001-01-01T00:00:04.000Z <nil> <nil> { map[alertname:test2]}}[-9.223372036854776e+09:]
          [ ✗ ]

        received:
        @ 2.00549375
        - &{map[] 0001-01-01T00:00:00.000Z <nil> [] 2025-11-07T15:47:48.705+01:00 <nil> <nil> { map[alertname:test1]}}[1.002707:]
        @ 4.009307375
        - &{map[] 0001-01-01T00:00:00.000Z <nil> [] 2025-11-07T15:47:48.705+01:00 <nil> <nil> { map[alertname:test1]}}[1.002707:]
        - &{map[] 0001-01-01T00:00:00.000Z <nil> [] 2025-11-07T15:47:51.706+01:00 <nil> <nil> { map[alertname:test2]}}[4.003107:]

@siavashs siavashs changed the title feat(dispatch): sync with Prometheus resend delay feat(dispatch): add start delay Nov 10, 2025
@siavashs (Contributor Author)

hasFlushed is dropped now, since the Dispatcher controls that with status checks after creating an AG.
This also means that AGs are now lock-free.

@siavashs siavashs force-pushed the feat/dispatch-wait branch 2 times, most recently from bcfbfbe to 960ed5e on November 10, 2025 18:00
@siavashs siavashs force-pushed the feat/dispatch-wait branch 3 times, most recently from 253e717 to 885824c on November 13, 2025 09:11
@siavashs (Contributor Author)

It seems I broke the acceptance tests again; I'll check and fix them.

This change adds a new cmd flag `--dispatch.start-delay` which
corresponds to the `--rules.alert.resend-delay` flag in Prometheus.
This flag controls the minimum amount of time that Prometheus waits
before resending an alert to Alertmanager.

By adding this value to the start time of Alertmanager, we delay
the aggregation groups' first flush until we are confident that all
alerts have been resent by the Prometheus instances.

This should help avoid race conditions in inhibitions after a (re)start.

Other improvements:
- remove hasFlushed flag from aggrGroup
- remove mutex locking from aggrGroup

Signed-off-by: Alexander Rickardsson <[email protected]>
Signed-off-by: Siavash Safi <[email protected]>
maxSilenceSizeBytes = kingpin.Flag("silences.max-silence-size-bytes", "Maximum silence size in bytes. If negative or zero, no limit is set.").Default("0").Int()
alertGCInterval = kingpin.Flag("alerts.gc-interval", "Interval between alert GC.").Default("30m").Duration()
dispatchMaintenanceInterval = kingpin.Flag("dispatch.maintenance-interval", "Interval between maintenance of aggregation groups in the dispatcher.").Default("30s").Duration()
DispatchStartDelay = kingpin.Flag("dispatch.start-delay", "Minimum amount of time to wait before dispatching alerts. This option should be synced with value of --rules.alert.resend-delay on Prometheus.").Default("0s").Duration()
Contributor

The default in Prometheus is 1m, so should the default in AM also be 1m?

@siavashs (Contributor Author) Nov 13, 2025

I initially set this to 1m, but it changes the default behaviour and lots of acceptance tests fail. We can avoid breaking the existing tests by setting it to 0 for all tests.
Adjusting the test timings is not practical, since we would be adding +1m to each test.

But I thought the same thing could happen to users unexpectedly if they don't pay attention to the changelog and this new command-line flag.

I'm open to setting the default to 1m to sync it with the Prometheus defaults.

Contributor

I see, whatever you choose, you will choose wrong :D

@ultrotter (Contributor) left a comment

LGTM, with a few minor comments! Thanks!

configLogger,
)
configCoordinator.Subscribe(func(conf *config.Config) error {
configCoordinator.Subscribe(func(conf *config.Config, initial bool) error {
Contributor

Would it make sense to instead pass a delay here, of type time.Duration? Then we could pass 0 on reload and *DispatchStartDelay on the initial load.

mutex sync.Mutex
config *Config
subscribers []func(*Config, bool) error
initialReload bool
Contributor

Here we could have an initialDelay and pass it into the function, then set it to 0 after the initial notification.
This would be set explicitly at New and not implied...

Contributor Author

In that case we can bypass this and just do the reset to zero after the first time dispatch runs in main.
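For illustration, a rough sketch of the initialDelay idea discussed in this thread; Coordinator, Subscribe, and notifySubscribers are stand-ins here, not the actual config package:

```go
package main

import (
	"fmt"
	"time"
)

// Sketch only: the coordinator is constructed with the start delay and hands
// it to subscribers once; reloads see a zero delay.
type Config struct{}

type Coordinator struct {
	subscribers  []func(*Config, time.Duration) error
	initialDelay time.Duration // set explicitly at construction, not implied
}

func NewCoordinator(startDelay time.Duration) *Coordinator {
	return &Coordinator{initialDelay: startDelay}
}

func (c *Coordinator) Subscribe(fn func(*Config, time.Duration) error) {
	c.subscribers = append(c.subscribers, fn)
}

func (c *Coordinator) notifySubscribers(conf *Config) error {
	delay := c.initialDelay
	c.initialDelay = 0 // only the first notification carries the start delay
	for _, fn := range c.subscribers {
		if err := fn(conf, delay); err != nil {
			return err
		}
	}
	return nil
}

func main() {
	c := NewCoordinator(time.Minute)
	c.Subscribe(func(_ *Config, d time.Duration) error {
		fmt.Println("dispatch start delay:", d)
		return nil
	})
	_ = c.notifySubscribers(&Config{}) // initial load: 1m0s
	_ = c.notifySubscribers(&Config{}) // reload: 0s
}
```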


mtx sync.RWMutex
hasFlushed bool
running atomic.Bool
Contributor

I know we use go.uber.org/atomic elsewhere, but does this buy us anything over sync/atomic Bool? https://pkg.go.dev/sync/atomic#pkg-types Should we use this and start a switch?

Contributor Author

There is currently a lint check which enforces this.
I guess we inherit this from Prometheus.
I need to check if this is still required.
cc @SuperQ

Member

FYI, there's an open effort in Prometheus to switch to the new stdlib atomic types (those only got added in Go 1.19):

Member

Yea, I'm not sure. This was introduced in Prometheus in 2020. prometheus/prometheus#7647

I guess we'll need to find out from the rest of the Prometheus devs if we still need this.

Member

We can remove the check, std lib is what we want as per prometheus/prometheus#14866

Member

Looks like we can migrate to the stdlib now: prometheus/prometheus#14866
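For reference, a small self-contained example of the stdlib sync/atomic.Bool (Go 1.19+) being discussed; the aggrGroup type here is only a stub for illustration:

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// Stub type: the real aggrGroup lives in the dispatch package.
type aggrGroup struct {
	running atomic.Bool // zero value is ready to use; no constructor needed
}

func main() {
	var ag aggrGroup

	// CompareAndSwap lets exactly one goroutine flip the flag.
	if ag.running.CompareAndSwap(false, true) {
		fmt.Println("started")
	}
	fmt.Println("running:", ag.running.Load())

	ag.running.Store(false)
	fmt.Println("running:", ag.running.Load())
}
```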

@Spaceman1701 (Contributor) left a comment

Overall, LGTM now. The state-to-string map is the only thing I'd really prefer we changed before merging.

DispatcherStateRunning
)

var state = map[int]string{
Contributor

It's probably more idiomatic to implement Stringer if we need this.

e.g.

type DispatcherState int

const (
	DispatcherStateUnknown DispatcherState = iota
	DispatcherStateWaitingToStart
	DispatcherStateRunning
)

func (d DispatcherState) String() string {
	// Example string values; reuse whatever the existing map contains.
	switch d {
	case DispatcherStateUnknown:
		return "unknown"
	case DispatcherStateWaitingToStart:
		return "waiting to start"
	case DispatcherStateRunning:
		return "running"
	default:
		return "unknown"
	}
}


// Run starts dispatching alerts incoming via the updates channel.
func (d *Dispatcher) Run() {
func (d *Dispatcher) Run(dispatchDelay time.Duration) {
Contributor

tiny nitpick, take it or leave it: I think it'd be slightly better if this took a startAt time.Time since that's a little closer to the meaning and it's a little more flexible for the caller, in my opinion.
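A minimal sketch of that suggestion, with Dispatcher stubbed out; the real Run would of course continue into the dispatch loop:

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// Stub type for illustration only, not the real dispatch.Dispatcher.
type Dispatcher struct {
	ctx context.Context
}

// Run takes an absolute start time; the caller derives it from the flag.
func (d *Dispatcher) Run(startAt time.Time) {
	if wait := time.Until(startAt); wait > 0 {
		select {
		case <-time.After(wait):
		case <-d.ctx.Done():
			return
		}
	}
	fmt.Println("dispatching from", time.Now())
	// ... dispatch loop would start here ...
}

func main() {
	d := &Dispatcher{ctx: context.Background()}
	startDelay := 2 * time.Second // would come from --dispatch.start-delay
	d.Run(time.Now().Add(startDelay))
}
```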
