Include max workflow age in stats, for alerting when workflows get "stuck" #381

@cspotcode

Description

We have monitoring configured so that our engineer-on-call is paged if our workflows fail to execute, either due to an acute failure -- a single workflow gets "stuck" -- or due to a performance degradation -- workflows are executing, but slowly, and are starting to back up in the queue.

Currently, this monitoring is based on the total # of pending workflows from GetStats. But this is an imperfect monitor: it can alert when the system is completely healthy, and it can miss the case where a single workflow is unhealthy.

A healthy system executing many workflows may always have a high number of in-flight workflows at any given time, even though each individual workflow is executing correctly. Conversely, the total # of pending workflows may be as low as 1, yet if that 1 workflow is stuck and failing to complete, we should be notified.

I think a better statistic might be "max workflow age," computed per workflow type. For example, if we have one type of workflow that should always finish within 30 minutes, and one is 40 minutes old, we want to be notified. For another type of workflow that finishes within 10 seconds, we want to be notified if any one is 1 minute old or older.
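To make the alerting rule concrete, here is a rough sketch of the check I have in mind. The type names, the `MAX_AGE_LIMITS` table, and `types_to_alert_on` are all made up for illustration; nothing like this exists in the library today, and the limits are just the two examples from the paragraph above.

```python
from datetime import timedelta

# Hypothetical per-type alert thresholds, taken from the two examples above.
MAX_AGE_LIMITS = {
    "thirty_minute_workflow": timedelta(minutes=30),
    "ten_second_workflow": timedelta(minutes=1),
}


def types_to_alert_on(max_age_by_type, limits=MAX_AGE_LIMITS):
    """Return the workflow types whose oldest pending instance has reached its limit.

    max_age_by_type maps a workflow type to the age (timedelta) of its oldest
    pending workflow. Types without a configured limit are ignored.
    """
    return sorted(
        wf_type
        for wf_type, age in max_age_by_type.items()
        if wf_type in limits and age >= limits[wf_type]
    )
```

With this rule, a 40 minute old instance of the 30 minute type triggers an alert no matter how many other workflows are pending, which is the case the pending count cannot catch.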

I'm not sure of the best way to implement this, but I think adding a new Max Age statistic can help. GetStats can return the max workflow age per queue, and we can intentionally put our 30 minute workflows in a different queue than our 10 second workflows.
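As a rough illustration of the statistic itself, the sketch below computes the max age per queue from a list of pending workflows. `PendingWorkflow`, its fields, and `max_age_by_queue` are hypothetical names I'm using for the example; I don't know how the backend actually stores pending instances, so treat this as a description of the desired output rather than a proposed implementation.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass
class PendingWorkflow:
    # Hypothetical record: the queue the workflow was placed in and when it was created.
    queue: str
    created_at: datetime


def max_age_by_queue(pending, now):
    """Map each queue name to the age of its oldest pending workflow."""
    oldest = {}
    for wf in pending:
        age = now - wf.created_at
        # Keep only the largest age seen so far for this queue.
        if wf.queue not in oldest or age > oldest[wf.queue]:
            oldest[wf.queue] = age
    return oldest
```

Queues with no pending workflows are simply absent from the result. In a real backend this would presumably be a single aggregate over the pending instances (the earliest creation time per queue) rather than a loop in application code.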
