cadence-workflow
diff --git a/blog/2025-08-06-workflow-diagnostics.md b/blog/2025-08-06-workflow-diagnostics.md
Lines changed: 62 additions & 0 deletions
diff --git a/blog/authors.yml b/blog/authors.yml
Lines changed: 10 additions & 0 deletions
diff --git a/docs/03-concepts/13-grafana-helm-setup.md b/docs/03-concepts/13-grafana-helm-setup.md
Lines changed: 141 additions & 0 deletions
diff --git a/docs/05-go-client/21-sleep.md b/docs/05-go-client/21-sleep.md
Lines changed: 47 additions & 0 deletions
@@ -0,0 +1,62 @@
+---
+title: "Workflow Diagnostics"
+
+date: 2025-08-06
+authors: sankari165
+tags:
+  - announcement
+---
+
+Cadence users, especially new ones, often struggle with failed or stuck workflows and cannot tell what went wrong. Workflow Diagnostics addresses this: it is a tool that runs on demand, inspects a workflow, and reports diagnostics with actionable information and links to clear runbooks that users can follow. The overarching goal is to help Cadence users understand what is wrong with their workflow.
+
+<!-- truncate -->
+
+## Introducing Workflow Diagnostics
+
+Cadence Workflow Diagnostics fetches the workflow execution history and identifies the issues in the workflow, that is, the items that did not work as expected (for example, workflow timeouts). Next, for each issue identified, it lists the potential root causes (for example, the tasklist has no pollers). Lastly, because we want Cadence users to have actionable diagnostics, it points to ways to resolve the issue. For example, timeouts can occur when the workflow runs on a tasklist without enough workers to start the activities, in which case adding workers to that tasklist is the likely remedy.
+
+## How it works
+
+A user initiates Cadence Workflow Diagnostics on demand for a given workflow execution in a Cadence domain. The call goes to the cadence-frontend service, which in turn triggers a diagnostics workflow that runs in the cadence-worker service and performs the diagnostics based on the workflow execution history.
+
+Code references:
+
+1. The [invariant interface](https://github.com/cadence-workflow/cadence/tree/master/service/worker/diagnostics/invariant), where each invariant implementation checks for and root-causes one specific issue, such as timeouts or failures.
+
+2. The [diagnostics workflow](https://github.com/cadence-workflow/cadence/blob/master/service/worker/diagnostics/workflow.go), which runs as a Cadence workflow with two activities: one identifies the issues using the invariant checks and the other root-causes them. Some invariants might not have a root-cause implementation.
+
+3. The [parent workflow](https://github.com/cadence-workflow/cadence/blob/master/service/worker/diagnostics/parent_workflow.go), which triggers diagnostics as a child workflow and then emits usage logs for observability.
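+
+As a rough illustration of the check-then-root-cause split described above, here is a minimal Go sketch. It is not the actual Cadence code: the shape of the `Invariant` interface, the `History`, `Issue`, and `RootCause` types, and the `timeoutInvariant` and `Diagnose` names are all hypothetical simplifications made up for this post. See the linked source for the real definitions.
+
+```go
+package main
+
+import (
+    "fmt"
+    "time"
+)
+
+// Issue is a hypothetical record of something that did not work as expected.
+type Issue struct {
+    Type   string
+    Reason string
+}
+
+// RootCause is a hypothetical record explaining why an issue happened.
+type RootCause struct {
+    IssueType string
+    Cause     string
+}
+
+// History is a hypothetical, heavily simplified stand-in for a workflow
+// execution history.
+type History struct {
+    Elapsed          time.Duration
+    ExecutionTimeout time.Duration
+    PollerCount      int
+}
+
+// Invariant is a hypothetical two-step contract: Check identifies issues,
+// RootCause explains the issues that Check found.
+type Invariant interface {
+    Check(h History) []Issue
+    RootCause(h History, issues []Issue) []RootCause
+}
+
+// timeoutInvariant is a hypothetical invariant that covers workflow timeouts.
+type timeoutInvariant struct{}
+
+// Check reports a timeout issue once the elapsed time reaches the timeout.
+func (timeoutInvariant) Check(h History) []Issue {
+    if h.Elapsed < h.ExecutionTimeout {
+        return nil
+    }
+    return []Issue{{Type: "WorkflowTimeout", Reason: "execution exceeded its timeout"}}
+}
+
+// RootCause attributes each issue to missing pollers when there are none.
+func (timeoutInvariant) RootCause(h History, issues []Issue) []RootCause {
+    var causes []RootCause
+    for _, is := range issues {
+        if h.PollerCount == 0 {
+            causes = append(causes, RootCause{IssueType: is.Type, Cause: "tasklist has no pollers"})
+        }
+    }
+    return causes
+}
+
+// Diagnose runs every invariant: first the check, then the root-cause step.
+func Diagnose(h History, invariants []Invariant) ([]Issue, []RootCause) {
+    var issues []Issue
+    var causes []RootCause
+    for _, inv := range invariants {
+        found := inv.Check(h)
+        issues = append(issues, found...)
+        causes = append(causes, inv.RootCause(h, found)...)
+    }
+    return issues, causes
+}
+
+func main() {
+    h := History{Elapsed: 2 * time.Minute, ExecutionTimeout: time.Minute, PollerCount: 0}
+    issues, causes := Diagnose(h, []Invariant{timeoutInvariant{}})
+    fmt.Println(issues, causes)
+}
+```
+
+Keeping root-causing as a separate step fits the note above that some invariants might not have a root-cause implementation: in this sketch, such an invariant would simply return no causes.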
+
+## How to use this feature
+
+1. Use the [Frontend API](https://github.com/cadence-workflow/cadence/blob/master/service/frontend/api/interface.go#L47) or the Cadence CLI to start the diagnostics workflow. The call returns the execution details of the diagnostics workflow.
+
+```bash
+cadence --do cadence-sample-domain workflow diag --wid w123 --rid 123
+```
+
+The command above starts the diagnostics as a Cadence workflow and returns its details. Sample output:
+
+```bash
+Workflow diagnosis started. Query the diagnostic workflow to get diagnostics report.
+============Diagnostic Workflow details============
+Domain: cadence-system, Workflow Id: diag123wid, Run Id: diag123rid
+```
+
+Use the workflow query command to fetch the diagnostics results:
+
+```bash
+cadence --do cadence-system workflow query --wid diag123wid --rid diag123rid --qt query-diagnostics-report
+```
+
+2. The Cadence web UI will have a diagnostics tab on the workflow execution page that displays the results of running diagnostics on the workflow. It lists the issues identified, their potential root causes, and links to the runbooks.
+
+## How to add a new use case to workflow diagnostics
+
+1. Define an implementation of the invariant interface. [link](https://github.com/cadence-workflow/cadence/tree/master/service/worker/diagnostics/invariant/failure)
+
+2. Add it to the list of invariants provided on service startup. [link](https://github.com/cadence-workflow/cadence/blob/master/cmd/server/cadence/server.go#L265)
+
+3. Update the diagnostics workflow so that it can construct the diagnostics result. [link](https://github.com/cadence-workflow/cadence/blob/master/service/worker/diagnostics/workflow.go#L201)
+
+4. Provide a runbook for the issues and root causes, and link it with the diagnostics result.
@@ -28,6 +28,16 @@ jakobht:
     linkedin: https://www.linkedin.com/in/jakob-taankvist/
     github: jakobht
 
+sankari165:
+  name: Sankari Gopalakrishnan
+  title: Senior Software Engineer @ Uber
+  url: https://www.linkedin.com/in/sankari-gopalakrishnan165/
+  image_url: https://github.com/sankari165.png
+  page: true
+  socials:
+    linkedin: https://www.linkedin.com/in/sankari-gopalakrishnan165/
+    github: sankari165
+
 ibarrajo:
   name: Josué Alexander Ibarra
   title: Developer Advocate @ Uber
 
@@ -0,0 +1,141 @@
+---
+layout: default
+title: Grafana Helm Setup
+permalink: /docs/concepts/grafana-helm-setup
+---
+
+# Grafana Helm Setup
+
+<details>
+<summary><h2>Introduction</h2></summary>
+
+This guide explains how to set up Grafana for monitoring Cadence workflows and services using Helm charts. Helm simplifies the deployment and management of Grafana in Kubernetes environments. Pre-configured dashboards for Cadence are available to visualize metrics effectively.
+
+</details>
+
+<details>
+<summary><h2>Prerequisites</h2></summary>
+
+Before proceeding, ensure the following:
+
+- A Kubernetes cluster is up and running.
+- Helm is installed on your system. Refer to the [Helm installation guide](https://helm.sh/docs/intro/install/).
+- You have access to the Cadence Helm charts repository.
+
+</details>
+
+<details>
+<summary><h2>Setup Steps</h2></summary>
+
+### Step 1: Add Cadence Helm Repository
+
+```bash
+helm repo add cadence-workflow https://cadenceworkflow.github.io/cadence-charts
+helm repo update
+```
+
+### Step 2: Deploy Prometheus Operator
+
+```bash
+helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
+helm install prometheus-operator prometheus-community/kube-prometheus-stack \
+  --namespace monitoring --create-namespace
+```
+
+### Step 3: Deploy Cadence with ServiceMonitor
+
+Create a `values.yaml` file to enable ServiceMonitor for automatic metrics scraping:
+
+```yaml
+# Enable metrics collection
+metrics:
+  enabled: true
+  port: 9090
+  portName: metrics
+
+  serviceMonitor:
+    enabled: true
+    # Replace with the namespace where Prometheus is deployed
+    namespace: "monitoring"
+    namespaceSelector:
+      # Ensure this matches Prometheus's namespace
+      matchNames:
+        - monitoring
+    scrapeInterval: 10s
+    additionalLabels:
+      # Ensure this matches Prometheus's Helm release name
+      release: prometheus-operator
+    annotations: {}
+    jobLabel: "app.kubernetes.io/name"
+    targetLabels:
+      - app.kubernetes.io/name
+    relabelings: []
+    metricRelabelings: []
+```
+
+Deploy Cadence:
+```bash
+helm install cadence cadence-workflow/cadence \
+  --namespace cadence --create-namespace \
+  --values values.yaml
+```
+
+**Note:** Update the `namespace`, `matchNames`, and `release` values to match your Prometheus deployment.
+
+### Step 4: Access Grafana
+
+Get Grafana admin password:
+```bash
+kubectl get secret --namespace monitoring prometheus-operator-grafana \
+  -o jsonpath="{.data.admin-password}" | base64 --decode
+```
+
+Access Grafana:
+```bash
+kubectl port-forward --namespace monitoring svc/prometheus-operator-grafana 3000:80
+```
+
+Open http://localhost:3000 and log in as `admin` with the password retrieved above.
+
+### Step 5: Import Cadence Dashboards
+
+1. **Download the Cadence Grafana Dashboard JSON:**
+```bash
+curl https://raw.githubusercontent.com/cadence-workflow/cadence/refs/heads/master/docker/grafana/provisioning/dashboards/cadence-server.json -o cadence-server.json
+```
+
+2. **Import in Grafana:** **Dashboards** → **Import** → Upload JSON files
+3. **Select Prometheus** as data source when prompted
+4. Repeat the same steps for the other dashboards.
+
+</details>
+
+<details>
+<summary><h2>Customization</h2></summary>
+
+The Grafana dashboards can be customized by editing the JSON files or modifying panels directly in Grafana. Additionally, Helm values can be overridden during installation to customize Grafana settings.
+
+### Example: Override Helm Values
+Create a `grafana-values.yaml` file to customize the Grafana instance that `kube-prometheus-stack` deployed in Step 2:
+```yaml
+grafana:
+  adminPassword: "your-password"
+  defaultDashboardsEnabled: true
+```
+
+Apply the custom values to the existing release:
+```bash
+helm upgrade prometheus-operator prometheus-community/kube-prometheus-stack \
+  --namespace monitoring -f grafana-values.yaml
+```
+
+</details>
+
+<details>
+<summary><h2>Additional Information</h2></summary>
+
+- [Cadence Helm Charts Repository](https://github.com/cadence-workflow/cadence-charts)
+- [Grafana Documentation](https://grafana.com/docs/)
+- [Helm Documentation](https://helm.sh/docs/)
+
+</details>
@@ -0,0 +1,47 @@
+---
+layout: default
+title: Sleep
+permalink: /docs/go-client/sleep
+---
+
+# Workflow Sleep
+
+The `workflow.Sleep` function allows a Cadence workflow to pause its execution for a specified duration. This is similar to `time.Sleep` in Go, but is safe and deterministic for use within Cadence workflows. The workflow will be paused and resumed by the Cadence service, and the sleep is durable—meaning the workflow can survive worker restarts or failures during the sleep period.
+
+## Example: Sleep for 30 Seconds
+
+Here is a minimal example of using `workflow.Sleep` in a Cadence workflow, as demonstrated in [cadence-samples PR #99](https://github.com/cadence-workflow/cadence-samples/pull/99):
+
+```go
+import (
+    "time"
+
+    "go.uber.org/cadence/workflow"
+    "go.uber.org/zap"
+)
+
+// SleepWorkflow pauses for 30 seconds using a durable, deterministic timer.
+func SleepWorkflow(ctx workflow.Context) error {
+    logger := workflow.GetLogger(ctx) // returns a *zap.Logger
+    logger.Info("Workflow started, going to sleep for 30 seconds...")
+    // workflow.Sleep returns an error if the context is canceled before the timer fires.
+    if err := workflow.Sleep(ctx, 30*time.Second); err != nil {
+        logger.Error("Sleep interrupted", zap.Error(err))
+        return err
+    }
+    logger.Info("Woke up after 30 seconds!")
+    return nil
+}
+```
+
+### Key Points
+- Use `workflow.Sleep(ctx, duration)` instead of `time.Sleep` inside workflow code.
+- The sleep is durable: the timer is tracked by the Cadence service, so if the worker crashes or restarts, the workflow still wakes up at the originally scheduled time.
+- The workflow is not consuming worker resources while sleeping; the state is persisted by Cadence.
+- You can use any duration supported by Go's `time.Duration`.
+
+### When to Use
+- Delaying workflow progress for a fixed period (e.g., retry with backoff, scheduled reminders, timeouts).
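+- Bounding how long a workflow waits for an external event. `workflow.Sleep` on its own only waits for time to pass, so this is typically done by racing a timer (`workflow.NewTimer`) against a signal channel in a selector.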
+
+### Limitations
+- Do not use `time.Sleep` in workflow code; always use `workflow.Sleep` for determinism and durability.
+- Very large numbers of simultaneous timers (sleeps) may impact cluster performance; consider jittering or batching if needed.
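+
+One possible way to jitter sleeps without breaking determinism is to derive the offset from a stable key, such as the workflow ID, rather than from a random source. The sketch below is an illustration under that assumption, not part of the Cadence client: `JitteredDelay` is a hypothetical helper and the `order-*` IDs are made up. It uses only the Go standard library, and its result would be passed to `workflow.Sleep` as the duration.
+
+```go
+package main
+
+import (
+    "fmt"
+    "hash/fnv"
+    "time"
+)
+
+// JitteredDelay (hypothetical helper) returns base plus a deterministic
+// offset in [0, maxJitter). The offset is derived from key, so the same key
+// always yields the same delay, which keeps workflow replay deterministic.
+func JitteredDelay(base, maxJitter time.Duration, key string) time.Duration {
+    if maxJitter <= 0 {
+        return base
+    }
+    h := fnv.New64a()
+    h.Write([]byte(key)) // Write on a hash.Hash never returns an error
+    offset := time.Duration(h.Sum64() % uint64(maxJitter))
+    return base + offset
+}
+
+func main() {
+    // Hypothetical workflow IDs; each gets its own stable delay.
+    for _, id := range []string{"order-1", "order-2", "order-3"} {
+        fmt.Println(id, JitteredDelay(30*time.Second, 10*time.Second, id))
+    }
+}
+```
+
+Because the delay depends only on its inputs, workflows started at the same moment with different IDs spread their timers over the jitter window, while each individual workflow computes the same duration again on replay.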
+
+For more details and advanced usage, see the [Cadence Go client documentation](https://pkg.go.dev/go.uber.org/cadence/workflow#Sleep).