pulse-go doesn't recover from dropped AMQP connection #7

It looks like when the AMQP connection drops, pulse-go only writes a message to the logs: https://github.com/taskcluster/pulse-go/blob/515edd00f1ba5b3ccc80cfec4606833e5c8d463c/pulse/pulse.go#L340

This resulted in two instances of cloudops-jenkins not responding to hg.m.o events, which halted the rollout of changes to the FirefoxCI tc cluster.

We were wondering if we could do either or both of the following:
- Add louder notifications: a Slack alert, an email, or something similar.
- Auto-recover, whether by killing pulse-go so it restarts, killing the container so it restarts, or reconnecting to AMQP. I'm not sure whether this should happen on the first failure, after `t` time, or after `n` failed attempts.

@petemoore any thoughts?
