
fix: use next_event_id column as source of truth when reading workflow execution from Cassandra#7738

Merged
fimanishi merged 6 commits into cadence-workflow:master from
fimanishi:use-denormalized-columns-as-source-of-truth-when-reading-execution-record
Feb 26, 2026

Conversation

@fimanishi (Member) commented Feb 24, 2026

What changed?
Read the denormalized next_event_id column (which duplicates the next_event_id value inside the execution blob) and set it on InternalWorkflowExecutionInfo when reading concrete execution data from Cassandra.

Why?
Cassandra stores values from the same row but different columns in different places on disk, rather than as a single, contiguous row block. It is therefore possible for the denormalized columns to get out of sync with the blob in the execution field. The denormalized next_event_id column is used as the conditional-write predicate when updating the execution record for concrete workflow executions.

By reading it and setting it on InternalWorkflowExecutionInfo, we can leverage checksum verification to detect differences between the denormalized next_event_id column and the next_event_id in the execution blob (which is used to calculate the checksum), identifying corrupt workflows faster and with more precision.

How did you test it?
go test ./common/persistence/nosql/nosqlplugin/cassandra -run Test_parseWorkflowExecutionInfo

Potential risks
We are changing how we read data from Cassandra; if we do it incorrectly, workflows could become corrupt or stuck.

We are also changing the argument to the parsing function: it now receives the whole result map instead of only the "execution" field. If this is wrong, it could cause parsing issues and affect workflows.

Release notes
Improve workflow corruption detection in Cassandra by reading the denormalized next_event_id column and checking it against the checksum during checksum verification.

Documentation Changes


Reviewer Validation

PR Description Quality (check these before reviewing code):

  • "What changed" provides a clear 1-2 line summary
  • Project Issue is linked
  • "Why" explains the full motivation with sufficient context
  • Testing is documented:
    • Unit test commands are included (with exact go test invocation)
    • Integration test setup/commands included (if integration tests were run)
    • Canary testing details included (if canary was mentioned)
  • Potential risks section is thoughtfully filled out (or legitimately N/A)
  • Release notes included if this completes a user-facing feature
  • Documentation needs are addressed (or noted if uncertain)

Signed-off-by: fimanishi <fimanishi@gmail.com>
@fimanishi fimanishi changed the title fix: read from denormalized execution columns ix: use denormalized columns as source of truth when reading workflow execution from Cassandra Feb 24, 2026
@fimanishi fimanishi changed the title ix: use denormalized columns as source of truth when reading workflow execution from Cassandra fix: use denormalized columns as source of truth when reading workflow execution from Cassandra Feb 24, 2026
@fimanishi fimanishi changed the title fix: use denormalized columns as source of truth when reading workflow execution from Cassandra fix: use next_event_id column as source of truth when reading workflow execution from Cassandra Feb 24, 2026
Member:
We should probably remove this, in case the next_event_id column is deleted:

var _emptyUUID = cql.UUID{}

func parseWorkflowExecutionInfo(result map[string]interface{}) *persistence.InternalWorkflowExecutionInfo {
executionBlob := result["execution"].(map[string]interface{})
Contributor:

This can panic if the execution value doesn't conform to this structure, e.g.:

panic: interface conversion: interface {} is main.BadStruct, not map[string]interface {}

goroutine 1 [running]:
main.main()
	/tmp/sandbox1356951504/prog.go:39 +0x3d7

I'd recommend returning an error from the mapper when we're unable to deserialize the data, unless there is a good reason not to (e.g. backwards-compatible behavior).

Member (Author):

You're right. We were not checking this before, but we should.

Comment on lines +196 to +198
if nextEventID, ok := result["next_event_id"].(int64); ok {
info.NextEventID = nextEventID
}
Contributor:

This looks like a duplicate way to do what the key/value map iteration above is doing - why do we need this too?

Member (Author):

This is read from the result, not from the execution blob. The data is duplicated in the database, but we use the next_event_id column for conditional writes. The key/value map iteration only explores the execution field data, not the whole execution record.

Member:

Currently this serves as a fallback; we should remove the logic that sets this value when iterating the map.

Member (Author):

Yeah, that makes sense. The only concern is that we use the same path for current executions (although we don't populate the execution blob there). Are you OK with removing it, or should we leave the fallback?

Member:

I think we should remove it

…_id, it is only used in concrete executions that have and read the denormalized next_event_id column

@gitar-bot commented Feb 25, 2026

Code Review: 👍 Approved with suggestions (3 resolved / 3 findings)

The PR correctly moves next_event_id reading to denormalized columns for the primary SelectWorkflowExecution path. The main remaining concern is that SelectAllWorkflowExecutions and SelectCurrentWorkflow don't select next_event_id, resulting in zero values — though SelectCurrentWorkflow doesn't use NextEventID so it's functionally safe. Previous inline findings cover these issues.

✅ 3 resolved
Bug: State and NextEventID silently zeroed in queries missing denormalized columns

📄 common/persistence/nosql/nosqlplugin/cassandra/workflow_cql.go:324 📄 common/persistence/nosql/nosqlplugin/cassandra/workflow_cql.go:349 📄 common/persistence/nosql/nosqlplugin/cassandra/workflow.go:100 📄 common/persistence/nosql/nosqlplugin/cassandra/workflow.go:361
The "state" and "next_event_id" cases were removed from the execution blob switch statement in parseWorkflowExecutionInfo, so these fields are now only populated from the denormalized columns on the outer result map. However, two query templates do not select next_event_id or workflow_state:

  1. templateGetCurrentExecutionQuery (line 324 of workflow_cql.go): selects only current_run_id, execution, workflow_last_write_version. Used by SelectCurrentWorkflow() which reads executionInfo.State (line 111 of workflow.go) — this will now always be 0 regardless of the actual workflow state. This is a critical data correctness regression since State is returned to callers and used for workflow lifecycle decisions.

  2. templateListWorkflowExecutionQuery (line 349 of workflow_cql.go): selects only run_id, execution, version_histories, version_histories_encoding. Used by SelectAllWorkflowExecutions(); State will also be 0 for all returned executions. This affects the reconciliation fetcher, which uses ExecutionInfo.State.

Because the safe type assertion (if value, ok := ...) silently returns zero when the column is absent, there will be no error or panic — just silently wrong data.

Fix options:

  • Option A (recommended): Add next_event_id, workflow_state to both templateGetCurrentExecutionQuery and templateListWorkflowExecutionQuery.
  • Option B: Keep parsing "state" and "next_event_id" from the blob as a fallback, then override with denormalized columns when present. This is more defensive and handles any query that doesn't select the denormalized columns.
Bug: Test "no denormalized columns" validates data loss as expected behavior

📄 common/persistence/nosql/nosqlplugin/cassandra/workflow_parsing_utils_test.go:262
The test case "no denormalized columns - nil values" (line 263) puts next_event_id: int64(10) and state: 2 in the execution blob, but asserts that the result has NextEventID: 0 and State: 0. This validates the scenario that causes the critical bug in SelectCurrentWorkflow and SelectAllWorkflowExecutions — when denormalized columns are absent, blob values are silently discarded.

If the fix is to keep blob parsing as a fallback (Option B from the critical finding), this test's expected values should change to NextEventID: 10 and State: 2. If the fix is to add the columns to all queries (Option A), this test case should still be updated since silently returning zero for known-present blob data is unexpected.

Quality: Test includes unused workflow_state key, misleading reader

📄 common/persistence/nosql/nosqlplugin/cassandra/workflow_parsing_utils_test.go:130
The test case "full execution blob" includes "workflow_state": 1 in the top-level args (line 130), but parseWorkflowExecutionInfo does not read workflow_state from the result map — it was intentionally removed in commit f0847c7. The expected State: 1 actually comes from the blob's "state": 1 (line 82).

This dead test data is misleading: a reader would assume workflow_state is being consumed and overriding the blob value, when in fact it's silently ignored. Removing it makes the test's intent clearer and avoids confusion about what the function actually reads.

Rules ❌ No requirements met

Repository Rules

GitHub Issue Linking Requirement: Add a reference to the related GitHub issue using 'Fixes #issue-number' or 'Closes #issue-number' format
PR Description Quality Standards: Add GitHub issue link to [What changed?] section, include [Affected stakeholders/services] section, and add explicit [Documentation Changes] assessment



@fimanishi fimanishi merged commit 18b584f into cadence-workflow:master Feb 26, 2026
42 checks passed