-
Notifications
You must be signed in to change notification settings - Fork 80
[runtime] Support Fine-grain Durable Execution #422
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
7224f9e to
ff5984c
Compare
|
@xintongsong Could you review this PR? |
| * <p>During recovery, the success or failure of the original call is determined by checking whether | ||
| * {@code exceptionPayload} is null. | ||
| */ | ||
| public class CallRecord { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd suggest to name this CallResult, which makes it explicit that this class represents a result of call execution.
| public byte[] getResultPayload() { | ||
| return resultPayload; | ||
| } | ||
|
|
||
| public void setResultPayload(byte[] resultPayload) { | ||
| this.resultPayload = resultPayload; | ||
| } | ||
|
|
||
| public byte[] getExceptionPayload() { | ||
| return exceptionPayload; | ||
| } | ||
|
|
||
| public void setExceptionPayload(byte[] exceptionPayload) { | ||
| this.exceptionPayload = exceptionPayload; | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we need to set the payloads after construction? Can these fields be final?
| """ | ||
|
|
||
| @abstractmethod | ||
| def execute( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd suggest the name durable_execute()
| The action that calls this API should be deterministic, meaning that it | ||
| will always make the execute_async call with the same arguments and in | ||
| the same order during job recovery. Otherwise, the behavior is undefined. | ||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd suggest to name this method durable_execute_async.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't see the necessity, but it's still possible that we support non-durable async execution in future, which simply execute in a separate thread/coroutine but the result should not be reused for replaying.
E.g., getting a token from remote, which might be already expired during the replay.
| Note: Local runner does not support durable execution, so recovery | ||
| is not available. | ||
| """ | ||
| return func(*args, **kwargs) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we are renaming this to durable_execute, we also need to print a warning log here. There's no need to fail the call, as we don't want users to change their codes when switching between local execution & remote execution.
| if (currentActionState != null && actionStatePersister != null) { | ||
| currentActionState.addCallRecord(callRecord); | ||
| actionStatePersister.run(); | ||
| LOG.debug( | ||
| "Recorded and persisted CallRecord at index {}: functionId={}, argsDigest={}", | ||
| currentCallIndex, | ||
| functionId, | ||
| argsDigest); | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why would currentActionState and actionStatePersister be null when this method is called? Shall we use assertion here?
| * @return array containing [isHit (boolean), resultPayload (byte[]), exceptionPayload | ||
| * (byte[])], or null if miss | ||
| */ | ||
| public Object[] tryGetCachedCallRecord(String functionId, String argsDigest) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd suggest the name matchNextOrClearSubsequentCallRecord. Otherwise, the clearing is super implicit.
| () -> { | ||
| try { | ||
| actionStateStore.put( | ||
| key, sequenceNumberKState.value(), action, event, actionState); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sequenceNumberKState might change when the callback is called.
| private int currentCallIndex = 0; | ||
|
|
||
| /** List of existing CallRecords loaded during recovery. */ | ||
| private List<CallRecord> recoveryCallRecords; | ||
|
|
||
| /** The current ActionState being built during action execution. */ | ||
| @Nullable private ActionState currentActionState; | ||
|
|
||
| /** Callback to persist ActionState after each code block completion. */ | ||
| @Nullable private Runnable actionStatePersister; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's kind of weird that these are maintained in PythonRunnerContextImpl.
- These are not only needed by python. We will also support this for Java.
- These are actually ActionTask-specific. I think we should handle this in
RunnerContextImpl.switchActionContext(). We might want to actually introduce anActionTaskContext.
Linked issue: #423
Purpose of change
CallRecordintoActionStateand persist/restore call records across durable execution recovery.executeAPI onrunner_context.pyand wire it through local/flink runner contexts.Tests
ActionState/CallRecordserde and store, runner context execute path, and durable execution flows.execute(basic, multiple calls, async) with ground-truth fixtures.API
executeto Pythonrunner_context(new public API surface); no breaking changes identified.Documentation
doc-neededdoc-not-neededdoc-included