Commit dd0ddb9

[Wf-Diagnostics] add a post on workflow diagnostics (#254)
* [Wf-Diagnostics] add a post on workflow diagnostics
* Update 2025-08-06-workflow-diagnostics.md
* Update authors.yml
* Update blog/2025-08-06-workflow-diagnostics.md

  Co-authored-by: Adhitya Mamallan <[email protected]>
* Update blog/2025-08-06-workflow-diagnostics.md

  Co-authored-by: Adhitya Mamallan <[email protected]>

---------

Co-authored-by: Adhitya Mamallan <[email protected]>
1 parent 5f6a311 commit dd0ddb9

2 files changed: +72 -0 lines changed
Lines changed: 62 additions & 0 deletions
@@ -0,0 +1,62 @@
---
title: "Workflow Diagnostics"

date: 2025-08-06
authors: sankari165
tags:
  - announcement
---

Cadence users, especially new ones, often struggle with failed or stuck workflows and cannot tell what went wrong. Workflow Diagnostics addresses this: it is a tool that runs on demand, checks a workflow, and reports diagnostics with actionable information, backed by clear runbooks that users can follow. The overarching goal is to help Cadence users understand what is wrong with their workflow.

<!-- truncate -->

## Introducing Workflow Diagnostics

Cadence Workflow Diagnostics fetches the workflow execution history and identifies the issues in the workflow, that is, the items that did not work as expected (for example, workflow timeouts). Next, for each issue identified, it provides potential root causes by listing the reasons that may have caused it (for example, the tasklist does not have pollers). Lastly, it provides ways to resolve the issue, since we want Cadence users to have actionable diagnostics. For example, timeouts can occur when the workflow runs on a tasklist without enough workers to start the activities, which points to adding workers as the fix.

## How does it work?

A user initiates Cadence Workflow Diagnostics on demand for a given workflow execution in a Cadence domain. The call goes to the cadence-frontend service, which in turn triggers a diagnostics workflow. That workflow runs in the cadence-worker service and performs the diagnostics based on the workflow execution history.

Code references:

1. The [invariant interface](https://github.com/cadence-workflow/cadence/tree/master/service/worker/diagnostics/invariant), where each invariant implementation checks for and root-causes one specific issue, such as timeouts or failures.

2. The [diagnostics workflow](https://github.com/cadence-workflow/cadence/blob/master/service/worker/diagnostics/workflow.go), which runs as a Cadence workflow with two activities: one identifies the issues using the invariant checks, and the other root-causes them. Some invariants may not have a root-cause implementation.

3. The [parent workflow](https://github.com/cadence-workflow/cadence/blob/master/service/worker/diagnostics/parent_workflow.go), which triggers diagnostics as a child workflow and then emits usage logs for observability.
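
To make the two-step flow above concrete, here is a minimal, self-contained sketch in Go. It is not the code from the Cadence repository: the `Invariant` interface, the `Event`, `Issue`, and `RootCause` types, the `pollerCount` parameter, and the `Diagnose` function are all simplified assumptions made for illustration. The real interface and types live at the links above and may look quite different.

```go
package main

import "fmt"

// Event is a heavily simplified stand-in for one workflow history event.
type Event struct {
	Type string
}

// Issue describes one thing that did not work as expected.
type Issue struct {
	InvariantType string
	Reason        string
}

// RootCause is a potential explanation for previously identified issues.
type RootCause struct {
	InvariantType string
	Cause         string
}

// Invariant is a hypothetical, simplified interface: each implementation
// checks for and root-causes one specific kind of problem.
type Invariant interface {
	Check(history []Event) []Issue
	RootCause(issues []Issue, pollerCount int) []RootCause
}

// timeoutInvariant flags workflow timeouts found in the history.
type timeoutInvariant struct{}

func (timeoutInvariant) Check(history []Event) []Issue {
	var issues []Issue
	for _, e := range history {
		if e.Type == "WorkflowExecutionTimedOut" {
			issues = append(issues, Issue{InvariantType: "Timeout", Reason: "workflow execution timed out"})
		}
	}
	return issues
}

// RootCause blames missing pollers when a timeout was found and the
// tasklist has none. pollerCount is an assumed input for this sketch.
func (timeoutInvariant) RootCause(issues []Issue, pollerCount int) []RootCause {
	for _, is := range issues {
		if is.InvariantType == "Timeout" && pollerCount == 0 {
			return []RootCause{{InvariantType: "Timeout", Cause: "tasklist has no pollers"}}
		}
	}
	return nil // an invariant may have no root cause to report
}

// Diagnose runs the two steps in order: first identify issues with every
// invariant, then ask every invariant for root causes.
func Diagnose(invariants []Invariant, history []Event, pollerCount int) ([]Issue, []RootCause) {
	var issues []Issue
	for _, inv := range invariants {
		issues = append(issues, inv.Check(history)...)
	}
	var causes []RootCause
	for _, inv := range invariants {
		causes = append(causes, inv.RootCause(issues, pollerCount)...)
	}
	return issues, causes
}

func main() {
	history := []Event{{Type: "WorkflowExecutionStarted"}, {Type: "WorkflowExecutionTimedOut"}}
	issues, causes := Diagnose([]Invariant{timeoutInvariant{}}, history, 0)
	fmt.Printf("issues: %+v\nroot causes: %+v\n", issues, causes)
}
```

Separating the check step from the root-cause step keeps the sketch close to the two activities described above, and returning `nil` from `RootCause` models an invariant that has no root-cause implementation.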

## How to use this feature?

1. Use the [Frontend API](https://github.com/cadence-workflow/cadence/blob/master/service/frontend/api/interface.go#L47) or the Cadence CLI to start the diagnostics workflow. The call returns the execution details of the diagnostics workflow.

```bash
cadence --do cadence-sample-domain workflow diag --wid w123 --rid 123
```

The above command starts diagnostics via a Cadence workflow and returns that workflow's details. Sample output:

```text
Workflow diagnosis started. Query the diagnostic workflow to get diagnostics report.
============Diagnostic Workflow details============
Domain: cadence-system, Workflow Id: diag123wid, Run Id: diag123rid
```

Use the workflow query command to fetch the results of the diagnostics:

```bash
cadence --do cadence-system workflow query --wid diag123wid --rid diag123rid --qt query-diagnostics-report
```

2. The Cadence web UI will have a diagnostics tab on the workflow execution page that displays the results of running diagnostics on the workflow. It lists the issues identified, their potential root causes, and links to runbooks.
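
If you script the CLI path, the two commands can be chained by reading the diagnostics workflow's details out of the first command's output. The Go sketch below does only the parsing and argument building. It assumes the output contains a line shaped exactly like the sample shown earlier (`Domain: ..., Workflow Id: ..., Run Id: ...`), and the `QueryArgs` helper is a hypothetical name, not part of the Cadence CLI. The real output format may change between versions.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// detailsLine matches the details line from the sample output.
// The exact format is an assumption based on that sample.
var detailsLine = regexp.MustCompile(`Domain: ([^,\s]+), Workflow Id: ([^,\s]+), Run Id: ([^,\s]+)`)

// QueryArgs extracts the domain, workflow ID, and run ID of the diagnostics
// workflow and returns the arguments for the follow-up query command.
func QueryArgs(diagOutput string) ([]string, error) {
	m := detailsLine.FindStringSubmatch(diagOutput)
	if m == nil {
		return nil, fmt.Errorf("diagnostic workflow details not found in output")
	}
	return []string{
		"--do", m[1],
		"workflow", "query",
		"--wid", m[2],
		"--rid", m[3],
		"--qt", "query-diagnostics-report",
	}, nil
}

func main() {
	out := "Workflow diagnosis started. Query the diagnostic workflow to get diagnostics report.\n" +
		"============Diagnostic Workflow details============\n" +
		"Domain: cadence-system, Workflow Id: diag123wid, Run Id: diag123rid\n"

	args, err := QueryArgs(out)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	// Prints the same query command as the one shown in the steps above.
	fmt.Println("cadence " + strings.Join(args, " "))
}
```

Returning the arguments as a slice rather than a single string makes them safe to pass to `os/exec` without shell quoting concerns.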

## How to add a new use-case to workflow diagnostics?

1. Define an implementation of the invariant interface. [link](https://github.com/cadence-workflow/cadence/tree/master/service/worker/diagnostics/invariant/failure)

2. Add it to the list of invariants provided on service startup. [link](https://github.com/cadence-workflow/cadence/blob/master/cmd/server/cadence/server.go#L265)

3. Update the diagnostics workflow so that it can construct the diagnostics result. [link](https://github.com/cadence-workflow/cadence/blob/master/service/worker/diagnostics/workflow.go#L201)

4. Provide a runbook for the issue and its root cause, and link it from the diagnostics result.
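
The four steps above can be sketched end to end in a few lines of Go. Everything here is illustrative: the `Invariant` interface is a simplified assumption, `activityFailureInvariant`, `registeredInvariants`, `runbooks`, and `BuildResult` are hypothetical names, and the runbook URL is a placeholder. The linked files show how the real server wires these pieces together.

```go
package main

import "fmt"

// Issue is a simplified diagnostics finding.
type Issue struct {
	Type   string
	Reason string
}

// Invariant is a simplified, assumed interface for this sketch.
type Invariant interface {
	Check(eventTypes []string) []Issue
}

// Step 1: a hypothetical new invariant that flags failed activities.
type activityFailureInvariant struct{}

func (activityFailureInvariant) Check(eventTypes []string) []Issue {
	var issues []Issue
	for _, t := range eventTypes {
		if t == "ActivityTaskFailed" {
			issues = append(issues, Issue{Type: "ActivityFailure", Reason: "an activity task failed"})
		}
	}
	return issues
}

// Step 2: the list of invariants handed to the service at startup.
func registeredInvariants() []Invariant {
	return []Invariant{activityFailureInvariant{}}
}

// Step 4: runbooks keyed by issue type. The URL is a placeholder.
var runbooks = map[string]string{
	"ActivityFailure": "https://example.com/runbooks/activity-failure",
}

// ResultEntry pairs an issue with its runbook link.
type ResultEntry struct {
	Issue   Issue
	Runbook string // empty when no runbook is registered for the issue type
}

// Step 3: construct the diagnostics result from all registered invariants.
func BuildResult(invariants []Invariant, eventTypes []string) []ResultEntry {
	var result []ResultEntry
	for _, inv := range invariants {
		for _, issue := range inv.Check(eventTypes) {
			result = append(result, ResultEntry{Issue: issue, Runbook: runbooks[issue.Type]})
		}
	}
	return result
}

func main() {
	events := []string{"ActivityTaskScheduled", "ActivityTaskStarted", "ActivityTaskFailed"}
	for _, entry := range BuildResult(registeredInvariants(), events) {
		fmt.Printf("%s: %s (runbook: %s)\n", entry.Issue.Type, entry.Issue.Reason, entry.Runbook)
	}
}
```

Keying runbooks by issue type keeps step 4 independent of the invariant code, so a runbook link can be added or changed without touching the check logic.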

blog/authors.yml

Lines changed: 10 additions & 0 deletions
@@ -28,6 +28,16 @@ jakobht:
     linkedin: https://www.linkedin.com/in/jakob-taankvist/
     github: jakobht
 
+sankari165:
+  name: Sankari Gopalakrishnan
+  title: Senior Software Engineer @ Uber
+  url: https://www.linkedin.com/in/sankari-gopalakrishnan165/
+  image_url: https://github.com/sankari165.png
+  page: true
+  socials:
+    linkedin: https://www.linkedin.com/in/sankari-gopalakrishnan165/
+    github: sankari165
+
 ibarrajo:
   name: Josué Alexander Ibarra
   title: Developer Advocate @ Uber
