
fix: use next_event_id column as source of truth when reading workflow execution from Cassandra#7738

Merged
fimanishi merged 6 commits into cadence-workflow:master from
fimanishi:use-denormalized-columns-as-source-of-truth-when-reading-execution-record
Feb 26, 2026

Conversation

@fimanishi (Member) commented Feb 24, 2026

What changed?
Read the denormalized next_event_id column (which duplicates the next_event_id value inside the execution blob) and set it on InternalWorkflowExecutionInfo when reading concrete execution data from Cassandra.

Why?
Cassandra stores values from the same row but different columns in different places on disk, rather than as a single, contiguous row block. It is therefore possible for the denormalized columns to get out of sync with the blob in the execution field. The denormalized next_event_id column is used as the conditional-write predicate when updating the execution record for concrete workflow executions.

By reading it and setting it on InternalWorkflowExecutionInfo, we can leverage checksum verification to detect differences between the denormalized next_event_id column and the next_event_id in the execution blob (which is used to calculate the checksum), identifying corrupt workflows faster and with more precision.

How did you test it?
go test ./common/persistence/nosql/nosqlplugin/cassandra -run Test_parseWorkflowExecutionInfo

Potential risks
We are changing how we read data from Cassandra; if we do it incorrectly, workflows could become corrupt or stuck.

We are also changing the argument to the parsing function: it now receives the whole result map instead of only the "execution" field. If this is wrong, it could cause parsing issues and affect workflows.

Release notes
Improve workflow corruption detection in Cassandra by reading the denormalized next_event_id column and checking it against the checksum during checksum verification.

Documentation Changes


Reviewer Validation

PR Description Quality (check these before reviewing code):

  • "What changed" provides a clear 1-2 line summary
  • Project Issue is linked
  • "Why" explains the full motivation with sufficient context
  • Testing is documented:
    • Unit test commands are included (with exact go test invocation)
    • Integration test setup/commands included (if integration tests were run)
    • Canary testing details included (if canary was mentioned)
  • Potential risks section is thoughtfully filled out (or legitimately N/A)
  • Release notes included if this completes a user-facing feature
  • Documentation needs are addressed (or noted if uncertain)

Signed-off-by: fimanishi <fimanishi@gmail.com>
@fimanishi fimanishi changed the title fix: read from denormalized execution columns ix: use denormalized columns as source of truth when reading workflow execution from Cassandra Feb 24, 2026
@fimanishi fimanishi changed the title ix: use denormalized columns as source of truth when reading workflow execution from Cassandra fix: use denormalized columns as source of truth when reading workflow execution from Cassandra Feb 24, 2026
@fimanishi fimanishi changed the title fix: use denormalized columns as source of truth when reading workflow execution from Cassandra fix: use next_event_id column as source of truth when reading workflow execution from Cassandra Feb 24, 2026
Member:
We should probably remove this, in case the next_event_id column is deleted:

var _emptyUUID = cql.UUID{}

func parseWorkflowExecutionInfo(result map[string]interface{}) *persistence.InternalWorkflowExecutionInfo {
executionBlob := result["execution"].(map[string]interface{})
Contributor:

This can panic if the execution value doesn't conform to this structure, e.g.:

panic: interface conversion: interface {} is main.BadStruct, not map[string]interface {}

goroutine 1 [running]:
main.main()
	/tmp/sandbox1356951504/prog.go:39 +0x3d7

I'd recommend returning an error from the mapper when we're unable to deserialize the data, unless there is a good reason not to (e.g. backwards-compatible behavior).

Member (Author):

You're right. We were not checking this before, but we should.

Comment on lines +196 to +198
if nextEventID, ok := result["next_event_id"].(int64); ok {
info.NextEventID = nextEventID
}
Contributor:

This looks like a duplicate way to do what the key/value map iteration above is doing - why do we need this too?

Member (Author):

This is read from the result, not from the execution blob. The data is duplicated in the database, but we use the next_event_id column for conditional writes. The key/value map iteration only explores the execution field data, not the whole execution record.

Member:

Currently this serves as a fallback; we should remove the logic that sets this value when iterating the map.

Member (Author):

Yeah, that makes sense. The only concern is that we use the same path for current executions (although we don't populate the execution blob there). Are you OK with removing it, or should we leave the fallback?

Member:

I think we should remove it

…_id, it is only used in concrete executions that have and read the denormalized next_event_id column

@gitar-bot commented Feb 25, 2026

Code Review: 👍 Approved with suggestions (3 resolved / 3 findings)

The PR correctly moves next_event_id reading to denormalized columns for the primary SelectWorkflowExecution path. The main remaining concern is that SelectAllWorkflowExecutions and SelectCurrentWorkflow don't select next_event_id, resulting in zero values — though SelectCurrentWorkflow doesn't use NextEventID so it's functionally safe. Previous inline findings cover these issues.

✅ 3 resolved
Bug: State and NextEventID silently zeroed in queries missing denormalized columns

📄 common/persistence/nosql/nosqlplugin/cassandra/workflow_cql.go:324 📄 common/persistence/nosql/nosqlplugin/cassandra/workflow_cql.go:349 📄 common/persistence/nosql/nosqlplugin/cassandra/workflow.go:100 📄 common/persistence/nosql/nosqlplugin/cassandra/workflow.go:361
The "state" and "next_event_id" cases were removed from the execution blob switch statement in parseWorkflowExecutionInfo, so these fields are now only populated from the denormalized columns on the outer result map. However, two query templates do not select next_event_id or workflow_state:

  1. templateGetCurrentExecutionQuery (line 324 of workflow_cql.go): selects only current_run_id, execution, workflow_last_write_version. Used by SelectCurrentWorkflow() which reads executionInfo.State (line 111 of workflow.go) — this will now always be 0 regardless of the actual workflow state. This is a critical data correctness regression since State is returned to callers and used for workflow lifecycle decisions.

  2. templateListWorkflowExecutionQuery (line 349 of workflow_cql.go): selects only run_id, execution, version_histories, version_histories_encoding. Used by SelectAllWorkflowExecutions(); State will also be 0 for all returned executions. This affects the reconciliation fetcher, which uses ExecutionInfo.State.

Because the safe type assertion (if value, ok := ...) silently returns zero when the column is absent, there will be no error or panic — just silently wrong data.

Fix options:

  • Option A (recommended): Add next_event_id, workflow_state to both templateGetCurrentExecutionQuery and templateListWorkflowExecutionQuery.
  • Option B: Keep parsing "state" and "next_event_id" from the blob as a fallback, then override with denormalized columns when present. This is more defensive and handles any query that doesn't select the denormalized columns.
Bug: Test "no denormalized columns" validates data loss as expected behavior

📄 common/persistence/nosql/nosqlplugin/cassandra/workflow_parsing_utils_test.go:262
The test case "no denormalized columns - nil values" (line 263) puts next_event_id: int64(10) and state: 2 in the execution blob, but asserts that the result has NextEventID: 0 and State: 0. This validates the scenario that causes the critical bug in SelectCurrentWorkflow and SelectAllWorkflowExecutions — when denormalized columns are absent, blob values are silently discarded.

If the fix is to keep blob parsing as a fallback (Option B from the critical finding), this test's expected values should change to NextEventID: 10 and State: 2. If the fix is to add the columns to all queries (Option A), this test case should still be updated since silently returning zero for known-present blob data is unexpected.

Quality: Test includes unused workflow_state key, misleading reader

📄 common/persistence/nosql/nosqlplugin/cassandra/workflow_parsing_utils_test.go:130
The test case "full execution blob" includes "workflow_state": 1 in the top-level args (line 130), but parseWorkflowExecutionInfo does not read workflow_state from the result map — it was intentionally removed in commit f0847c7. The expected State: 1 actually comes from the blob's "state": 1 (line 82).

This dead test data is misleading: a reader would assume workflow_state is being consumed and overriding the blob value, when in fact it's silently ignored. Removing it makes the test's intent clearer and avoids confusion about what the function actually reads.

Rules ❌ No requirements met

Repository Rules

GitHub Issue Linking Requirement: Add a reference to the related GitHub issue using 'Fixes #issue-number' or 'Closes #issue-number' format
PR Description Quality Standards: Add GitHub issue link to [What changed?] section, include [Affected stakeholders/services] section, and add explicit [Documentation Changes] assessment



@fimanishi fimanishi merged commit 18b584f into cadence-workflow:master Feb 26, 2026
42 checks passed