[runtime] Support Fine-grain Durable Execution #422

Sxnan · 2026-01-09T01:44:55Z

Linked issue: #423

Purpose of change

Introduce CallRecord into ActionState and persist/restore call records across durable execution recovery.
Expose execute API on runner_context.py and wire it through local/flink runner contexts.
Extend Python/Java runtimes to emit and replay call records, aligning durable execution semantics and ensuring operator context restoration.

Tests

Added/updated unit tests for ActionState/CallRecord serde and store, runner context execute path, and durable execution flows.
Added e2e integration tests for runner context execute (basic, multiple calls, async) with ground-truth fixtures.

API

Added execute to Python runner_context (new public API surface); no breaking changes identified.

Documentation

doc-needed
doc-not-needed
doc-included

Sxnan · 2026-01-09T03:00:19Z

@xintongsong Could you review this PR?

xintongsong · 2026-01-09T03:14:20Z

runtime/src/main/java/org/apache/flink/agents/runtime/actionstate/CallRecord.java

+ * <p>During recovery, the success or failure of the original call is determined by checking whether
+ * {@code exceptionPayload} is null.
+ */
+public class CallRecord {


I'd suggest naming this CallResult, which makes it explicit that the class represents the result of a call execution.

xintongsong · 2026-01-09T03:58:38Z

runtime/src/main/java/org/apache/flink/agents/runtime/actionstate/CallRecord.java

+    public byte[] getResultPayload() {
+        return resultPayload;
+    }
+
+    public void setResultPayload(byte[] resultPayload) {
+        this.resultPayload = resultPayload;
+    }
+
+    public byte[] getExceptionPayload() {
+        return exceptionPayload;
+    }
+
+    public void setExceptionPayload(byte[] exceptionPayload) {
+        this.exceptionPayload = exceptionPayload;
+    }


Why do we need to set the payloads after construction? Can these fields be final?
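For illustration, here is a minimal Python sketch of what an immutable record could look like. It is not the Java class from this PR. The field names `function_id` and `args_digest` are assumptions borrowed from the debug log elsewhere in the change, and `is_success` is a hypothetical helper that mirrors the Javadoc rule that a call succeeded when the exception payload is null.

```python
from dataclasses import dataclass, FrozenInstanceError
from typing import Optional


@dataclass(frozen=True)
class CallRecord:
    """Hypothetical Python analogue of the Java CallRecord under review."""

    function_id: str  # assumed field, named after the log message in the PR
    args_digest: str  # assumed field, named after the log message in the PR
    result_payload: Optional[bytes] = None
    exception_payload: Optional[bytes] = None

    def is_success(self) -> bool:
        # The original call succeeded iff no exception payload was stored.
        return self.exception_payload is None


# Both payloads are fixed at construction time; there are no setters.
ok = CallRecord("fetch", "digest-1", result_payload=b"42")
failed = CallRecord("fetch", "digest-2", exception_payload=b"boom")
```

With `frozen=True`, assigning to `ok.result_payload` after construction raises `FrozenInstanceError`, which is roughly what `final` fields would enforce at compile time in Java.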

xintongsong · 2026-01-09T04:00:05Z

python/flink_agents/api/runner_context.py

        """

+    @abstractmethod
+    def execute(


I'd suggest the name durable_execute()

xintongsong · 2026-01-09T04:02:19Z

python/flink_agents/api/runner_context.py

+        The action that calls this API should be deterministic, meaning that it
+        will always make the execute_async call with the same arguments and in
+        the same order during job recovery. Otherwise, the behavior is undefined.
+


I'd suggest naming this method durable_execute_async.

I don't see the need right now, but we may still support non-durable async execution in the future, which would simply execute in a separate thread/coroutine without the result being reused for replay.

E.g., getting a token from a remote service, which might already have expired by the time of the replay.

xintongsong · 2026-01-09T05:59:06Z

python/flink_agents/runtime/local_runner.py

+        Note: Local runner does not support durable execution, so recovery
+        is not available.
+        """
+        return func(*args, **kwargs)


If we are renaming this to durable_execute, we also need to log a warning here. There's no need to fail the call, as we don't want users to have to change their code when switching between local and remote execution.
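A rough sketch of that behavior, written as a free function rather than the actual local runner method. The logger name and the message wording are placeholders, not taken from the PR.

```python
import logging
from typing import Any, Callable

# Placeholder logger name, chosen for this sketch only.
logger = logging.getLogger("local_runner_sketch")


def durable_execute(func: Callable[..., Any], *args: Any, **kwargs: Any) -> Any:
    """Run func directly and warn that nothing is recorded for recovery."""
    logger.warning(
        "durable_execute is not durable in the local runner; "
        "%s runs without recovery support.",
        getattr(func, "__name__", repr(func)),
    )
    # Same call path as before: no failure, so user code stays unchanged
    # between local and remote execution.
    return func(*args, **kwargs)
```

Calling `durable_execute(lambda a, b=1: a + b, 2, b=3)` logs one warning and returns the function's result as usual.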

xintongsong · 2026-01-09T06:18:49Z

...me/src/main/java/org/apache/flink/agents/runtime/python/context/PythonRunnerContextImpl.java

+        if (currentActionState != null && actionStatePersister != null) {
+            currentActionState.addCallRecord(callRecord);
+            actionStatePersister.run();
+            LOG.debug(
+                    "Recorded and persisted CallRecord at index {}: functionId={}, argsDigest={}",
+                    currentCallIndex,
+                    functionId,
+                    argsDigest);
+        }


Why would currentActionState and actionStatePersister be null when this method is called? Should we use an assertion here?

xintongsong · 2026-01-09T06:29:09Z

...me/src/main/java/org/apache/flink/agents/runtime/python/context/PythonRunnerContextImpl.java

+     * @return array containing [isHit (boolean), resultPayload (byte[]), exceptionPayload
+     *     (byte[])], or null if miss
+     */
+    public Object[] tryGetCachedCallRecord(String functionId, String argsDigest) {


I'd suggest the name matchNextOrClearSubsequentCallRecord. Otherwise, the clearing is super implicit.
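To make the implicit clearing concrete, here is a small Python sketch of the semantics the suggested name describes. It assumes the records after the cursor are dropped when the next record does not match the incoming call. `Record`, `ReplayCursor`, and their fields are hypothetical names for this sketch, not the PR's classes.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Record:
    function_id: str
    args_digest: str
    result: bytes


class ReplayCursor:
    """Walks recovered records in call order (hypothetical helper)."""

    def __init__(self, records: List[Record]) -> None:
        self.records = records
        self.index = 0

    def match_next_or_clear_subsequent(
        self, function_id: str, args_digest: str
    ) -> Optional[Record]:
        if self.index < len(self.records):
            candidate = self.records[self.index]
            if (candidate.function_id, candidate.args_digest) == (
                function_id,
                args_digest,
            ):
                # Hit: reuse the recorded outcome and advance the cursor.
                self.index += 1
                return candidate
            # Miss: this record and everything after it no longer apply,
            # so they are discarded (the side effect the name spells out).
            del self.records[self.index :]
        return None
```

A caller that gets `None` back would execute the function for real and append a fresh record, so later calls in the same action are recorded rather than replayed.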

xintongsong · 2026-01-09T06:55:21Z

runtime/src/main/java/org/apache/flink/agents/runtime/operator/ActionExecutionOperator.java

+                () -> {
+                    try {
+                        actionStateStore.put(
+                                key, sequenceNumberKState.value(), action, event, actionState);


sequenceNumberKState might have changed by the time the callback is called.
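The concern can be reproduced with a toy example. The sketch below contrasts a callback that reads the state when it runs with one that captures the value when it is created. `ValueState` is a stand-in written for this example, not Flink's state API, and capturing eagerly is only one possible fix; the review does not prescribe one.

```python
from typing import Callable, List


class ValueState:
    """Toy stand-in for a state handle whose value can change over time."""

    def __init__(self, value: int) -> None:
        self._value = value

    def value(self) -> int:
        return self._value

    def update(self, value: int) -> None:
        self._value = value


def make_lazy_persister(state: ValueState, sink: List[int]) -> Callable[[], None]:
    # Reads the state only when the callback runs, like the lambda in the diff.
    return lambda: sink.append(state.value())


def make_eager_persister(state: ValueState, sink: List[int]) -> Callable[[], None]:
    # Captures the value when the callback is created.
    captured = state.value()
    return lambda: sink.append(captured)


sequence_number = ValueState(7)
lazy_writes: List[int] = []
eager_writes: List[int] = []
lazy = make_lazy_persister(sequence_number, lazy_writes)
eager = make_eager_persister(sequence_number, eager_writes)

sequence_number.update(8)  # state moves on before the callbacks fire
lazy()
eager()
```

After this runs, `lazy_writes` holds the updated value 8 while `eager_writes` holds the original 7, which is the discrepancy the comment warns about.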

xintongsong · 2026-01-09T07:50:22Z

...me/src/main/java/org/apache/flink/agents/runtime/python/context/PythonRunnerContextImpl.java

+    private int currentCallIndex = 0;
+
+    /** List of existing CallRecords loaded during recovery. */
+    private List<CallRecord> recoveryCallRecords;
+
+    /** The current ActionState being built during action execution. */
+    @Nullable private ActionState currentActionState;
+
+    /** Callback to persist ActionState after each code block completion. */
+    @Nullable private Runnable actionStatePersister;


It's kind of weird that these are maintained in PythonRunnerContextImpl.

These are not only needed by Python. We will also support this for Java.

These are actually ActionTask-specific. I think we should handle this in RunnerContextImpl.switchActionContext(). We might want to actually introduce an ActionTaskContext.

[runtime] Introduce CallRecord to ActionState

6262db0

github-actions bot added the priority/major, fixVersion/0.2.0, and doc-needed labels Jan 9, 2026

Sxnan added 2 commits January 9, 2026 10:24

[api] Introduce execute to runner_context.py

ac77a70

[runtime] Implement CallRecord persistent and restore

ff5984c

Sxnan force-pushed the finegrain-durable branch from 7224f9e to ff5984c Compare January 9, 2026 02:24

Sxnan changed the title ~~[runtime] Introduce CallRecord to ActionState~~ [runtime] Support Fine-grain Durable Execution Jan 9, 2026

Sxnan marked this pull request as ready for review January 9, 2026 02:52

Sxnan requested a review from xintongsong January 9, 2026 02:59

xintongsong reviewed Jan 9, 2026
