fix: use next_event_id column as source of truth when reading workflow execution from Cassandra#7738
Conversation
Signed-off-by: fimanishi <fimanishi@gmail.com>
common/persistence/nosql/nosqlplugin/cassandra/workflow_parsing_utils_test.go
Show resolved
Hide resolved
…s are read Signed-off-by: fimanishi <fimanishi@gmail.com>
…w executions Signed-off-by: fimanishi <fimanishi@gmail.com>
common/persistence/nosql/nosqlplugin/cassandra/workflow_parsing_utils_test.go
Outdated
Show resolved
Hide resolved
Signed-off-by: fimanishi <fimanishi@gmail.com>
There was a problem hiding this comment.
we should probably remove this. In case that the next_event_id column is deleted
| var _emptyUUID = cql.UUID{} | ||
|
|
||
| func parseWorkflowExecutionInfo(result map[string]interface{}) *persistence.InternalWorkflowExecutionInfo { | ||
| executionBlob := result["execution"].(map[string]interface{}) |
There was a problem hiding this comment.
This can panic if the execution doesn't conform to this structure, e.g:
panic: interface conversion: interface {} is main.BadStruct, not map[string]interface {}
goroutine 1 [running]:
main.main()
/tmp/sandbox1356951504/prog.go:39 +0x3d7
I'd recommend returning an error from the mapper if we're unable to deserialize the data unless there is a good reason not to (e.g backwards compatible behavior).
There was a problem hiding this comment.
You're right. We were not checking this before but we should
| if nextEventID, ok := result["next_event_id"].(int64); ok { | ||
| info.NextEventID = nextEventID | ||
| } |
There was a problem hiding this comment.
This looks like a duplicate way to do what the key/value map iteration above is doing - why do we need this too?
There was a problem hiding this comment.
this is from the result not the execution. The data is duplicated in the db, but we use the next_event_id column for conditional writes. The key/value map iteration is only exploring the execution field data, not the whole execution record
There was a problem hiding this comment.
currently, this serves as a fallback, we should remove the logic setting this value when iterating the map
There was a problem hiding this comment.
Yeah, that makes sense. The only thing here is that we use the same path for current executions (although we don't populate the execution blob in there). Are you ok with removing it or is it ok to leave the fallback?
Signed-off-by: fimanishi <fimanishi@gmail.com>
…_id, it is only used in concrete executions that have and read the denormalized next_event_id column Signed-off-by: fimanishi <fimanishi@gmail.com>
Code Review 👍 Approved with suggestions 3 resolved / 3 findingsThe PR correctly moves ✅ 3 resolved✅ Bug:
|
| Auto-apply | Compact |
|
|
Was this helpful? React with 👍 / 👎 | Gitar
What changed?
Read from the denormalized columns next_event_id (that duplicate the data next_event_id from the execution field) and set to InternalWorkflowExecutionInfo when reading concrete execution data from Cassandra.
Why?
Cassandra stores values from the same row but different columns in different places on disk, rather than as a single, contiguous row block. It's possible that the denormalized columns get out of sync with the execution blob in the execution field. This denormalized column is used as conditional write when updating the execution record for concrete workflow executions.
By reading it and setting it on InternalWorkflowExecutionInfo we can leverage the checksum verification to detect differences between the denormalized next_event_id column and the next_event_id in the execution blob field (used to calculate the checksum) and identify corrupt workflows quicker and with more precision.
How did you test it?
go test ./common/persistence/nosql/nosqlplugin/cassandra -run Test_parseWorkflowExecutionInfo
Potential risks
We are changing how we read data from Cassandra and if we are doing incorrectly that could cause workflows to be corrupt/stuck.
We are also changing the argument to the parsing function and we had to modify/use the whole result instead of passing only the "execution". If this is wrong it could cause issues in parsing and affect workflows.
Release notes
Improve workflow corruption detection in Cassandra by reading from next_event_id denormalized column and checking against checksum in checksum verification.
Documentation Changes
Reviewer Validation
PR Description Quality (check these before reviewing code):
go testinvocation)