Skip to content
Merged
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
Original file line number Diff line number Diff line change
Expand Up @@ -1179,17 +1179,13 @@
.setWorkflowExecution(getExecutionId().getExecution())
.setDomain(getExecutionId().getDomain())
.setWorkflowType(startRequest.getWorkflowType());
ForkJoinPool.commonPool()
.execute(
() -> {
try {
parent.get().childWorkflowStarted(a);
} catch (EntityNotExistsError | WorkflowExecutionAlreadyCompletedError e) {
// Not a problem. Parent might just close by now.
} catch (BadRequestError | InternalServiceError e) {
log.error("Failure reporting child completion", e);
}
});
try {
parent.get().childWorkflowStarted(a);
Copy link
Member

@Groxx Groxx Oct 25, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

huh. any idea why it was async in the first place? since it seems to be creating decision tasks above, it seems like the start-event probably already exists, so this doesn't seem to risk deadlocking or something...

is the issue that multiple fork-join-pool executes are racing, and sometimes complete runs before this start runs?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are two tiers of locks for the test service:

  • There's a lock in TestWorkflowService, which controls access to the Map of MutableState for every workflow.
  • Every MutableState has a lock.

Typically we lock the TestWorkflowService, get the MutableState, unlock the TestWorkflowService, and then perform an action on the MutableState. The MutableState then manages its own lock internally.

Any time a MutableState performs an action on another workflow we use the ForkJoinPool to run the operation asynchronously without holding any locks for the original workflow. These calls use the TestWorkflowService, so they repeat the process above. This is particularly important for any of the methods that process Decisions, which are all called while the lock for the workflow is held. If two workflows attempted to cancel each other they could deadlock.

By running this block asynchronously we end up triggering a race between:

  • This block running and updating the parent workflow that the child workflow started.
  • The next decision task being scheduled , processed, and reported back to the TestService, which could complete the child workflow. This similarly prompts an async operation to report the status update to the parent workflow. We see the test failure any time this process beats the first one.
    • The next decision task gets pushed to a queue before we even release the lock on the MutableState.

This block is unique for two reasons:

  • Starting a workflow is the only time we hold both the TestWorkflowService lock and MutableState lock simultaneously, so we're still holding the TestWorkflowService lock.
  • We've already released the MutableState lock above.

Running this synchronously somewhat incidentally ends up fixing the race condition because we're still holding the lock on the TestWorkflowService. The next decision task can't retrieve the MutableState for the child workflow until we release the lock, so we've created a strict order of events.

Copy link
Member

@Groxx Groxx Oct 28, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense / fits about what I expected, yea. Thanks!
And it looks like the childWorkflowStarted acquires its own lock in update (mutable state lock?), so that sounds like a probably-correct fix.

I suppose a later step for all this would be to make these related test workflows share a single execution thread, since otherwise this seems prone to this kind of race for no benefit :\ though I'm gonna assume that's extremely hard without a complete rewrite.

} catch (EntityNotExistsError | WorkflowExecutionAlreadyCompletedError e) {

Check warning on line 1184 in src/main/java/com/uber/cadence/internal/testservice/TestWorkflowMutableStateImpl.java

View check run for this annotation

Codecov / codecov/patch

src/main/java/com/uber/cadence/internal/testservice/TestWorkflowMutableStateImpl.java#L1184

Added line #L1184 was not covered by tests
// Not a problem. Parent might just close by now.
} catch (BadRequestError | InternalServiceError e) {
log.error("Failure reporting child completion", e);

Check warning on line 1187 in src/main/java/com/uber/cadence/internal/testservice/TestWorkflowMutableStateImpl.java

View check run for this annotation

Codecov / codecov/patch

src/main/java/com/uber/cadence/internal/testservice/TestWorkflowMutableStateImpl.java#L1186-L1187

Added lines #L1186 - L1187 were not covered by tests
}
}
}

Expand Down