docs: add language sdk specification by embano1 · Pull Request #369 · aws/aws-durable-execution-sdk-js

embano1 · 2025-12-06T15:00:54Z

Issue #, if available: n/a

Description of changes:
Provide a language SDK specification for developers to build their own SDKs and establish conformance testing. This is just a first start to iterate on the SDK and provide builders guidance given the large interest in additional SDKs (Go, Rust, Java, Swift, .NET). The file should then be extracted into its own repository to create conformance tests for officially supported SDKs.

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

embano1 · 2025-12-06T16:35:08Z

LANGUAGE_SDK_SPECIFICATION.md

+Invocation 1:
+  - Load state: []
+  - Start STEP(id="step1")
+  - Checkpoint: START step1


I guess we need more guidance here around:

AT-LEAST/MOST-ONCE

batching/optimizations

Yeah - that would be a good idea.

jriecken · 2025-12-08T19:16:44Z

LANGUAGE_SDK_SPECIFICATION.md

+- Be checkpointed and resumed
+- Maintain execution state across interruptions
+
+The two core durable operation primitives are:


There are 5 primitives (if you ignore the EXECUTION operation which is only used to complete the execution):

CALLBACK

CHAINED_INVOKE

CONTEXT

STEP

WAIT

jriecken · 2025-12-08T19:21:30Z

LANGUAGE_SDK_SPECIFICATION.md

+
+For correct replay behavior, **user code MUST be deterministic**:
+
+1. Non-durable code (code outside operations) MUST execute identically on each replay


One thing we should note is that this may require re-implementing/providing alternatives for certain language constructs that are inherently nondeterministic.

For example in Java unless you use a LinkedHashMap instead of a HashMap, the iteration order is not guaranteed to be the same on multiple creations of the same map, or in Go where map iteration order is purposefully randomized, etc.

By re-implementing, do you mean re-implementing in TypeScript/Python? Or do you mean we need to revise our decision in the Java implementation?
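A minimal sketch of the concern in TypeScript, assuming operation names are derived from the keys of a collection. The helper `stableOperationNames` and the `process:` prefix are hypothetical and only illustrate the idea: sorting the keys fixes the operation order no matter how the collection was built, which is the same kind of workaround as choosing `LinkedHashMap` in Java.

```typescript
// Hypothetical helper: derive a stable operation order from a keyed collection.
// Sorting the keys makes the order independent of how the collection was built.
function stableOperationNames(items: Record<string, unknown>, prefix: string): string[] {
  return Object.keys(items)
    .sort() // default sort compares UTF-16 code units, so it does not depend on locale
    .map((key) => prefix + ":" + key);
}

// The same entries, inserted in different orders on the first run and on replay.
const firstRun: Record<string, unknown> = { orders: 1, invoices: 2, refunds: 3 };
const replayRun: Record<string, unknown> = { refunds: 3, orders: 1, invoices: 2 };

const namesFirst = stableOperationNames(firstRun, "process");
const namesReplay = stableOperationNames(replayRun, "process");
```

Both calls yield the same sequence of names, so a replay would request the operations in the same order as the original run.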

jriecken · 2025-12-08T19:22:52Z

LANGUAGE_SDK_SPECIFICATION.md

+  "CheckpointToken": "string",
+  "InitialExecutionState": {
+    "Operations": [
+      /* Operation objects */


Should we link to the Lambda API docs sections where appropriate in this doc? E.g. https://docs.aws.amazon.com/lambda/latest/api/API_Operation.html

jriecken · 2025-12-08T19:23:34Z

LANGUAGE_SDK_SPECIFICATION.md

+
+The SDK CANNOT:
+
+- Prevent users from writing non-deterministic code


I mean if it is somehow able to, it should 😄 - the spec shouldn't prevent it from doing so

Maybe this should say "The SDK is not responsible for:"

jriecken · 2025-12-08T19:26:50Z

LANGUAGE_SDK_SPECIFICATION.md

+  "Error": {
+    "ErrorType": "string",
+    "ErrorMessage": "string",
+    "StackTrace": ["string"] // OPTIONAL


All the fields are actually optional

(There's also a 4th ErrorData field as well for additional machine-readable error data)
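As a sketch of what this comment implies for an SDK's types, the error object could be modeled with every field optional. The `string` type for `ErrorData` and the helper `toErrorObject` are assumptions made for illustration; the Lambda API reference is the authority on the actual shape.

```typescript
// Shape sketched from the review comment: every field is optional.
interface ErrorObject {
  ErrorType?: string;
  ErrorMessage?: string;
  StackTrace?: string[];
  ErrorData?: string; // assumption: machine-readable data carried as a serialized string
}

// Hypothetical helper: convert any thrown value into an ErrorObject.
function toErrorObject(err: unknown, data?: unknown): ErrorObject {
  const result: ErrorObject = {};
  if (err instanceof Error) {
    result.ErrorType = err.name;
    result.ErrorMessage = err.message;
    if (typeof err.stack === "string") {
      result.StackTrace = err.stack.split("\n").map((line) => line.trim());
    }
  } else if (err !== undefined && err !== null) {
    // Non-Error values (strings, numbers) only yield a message.
    result.ErrorMessage = String(err);
  }
  if (data !== undefined) {
    result.ErrorData = JSON.stringify(data);
  }
  return result;
}

const fromError = toErrorObject(new TypeError("bad input"), { field: "amount" });
const fromString = toErrorObject("plain failure");
const fromNothing = toErrorObject(undefined);
```

Because nothing is required, `fromNothing` is an empty object, and consumers of the error have to tolerate any field being absent.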

jriecken · 2025-12-09T05:25:52Z

LANGUAGE_SDK_SPECIFICATION.md

+
+- Maximum execution duration: 1 year
+- Maximum response payload: 6MB
+- Maximum history size: Limited by service quotas


Maximum number of durable operations (including retries)? The limit is not directly on history.

but we also have a history limit (100MB), added both

jriecken · 2025-12-09T05:27:28Z

LANGUAGE_SDK_SPECIFICATION.md

+Invocation 1:
+  - Load state: []
+  - Start STEP(id="step1")
+  - Checkpoint: START step1


Yeah - that would be a good idea.

jriecken · 2025-12-09T05:29:55Z

LANGUAGE_SDK_SPECIFICATION.md

+  - Load state: [step1: SUCCEEDED, step2: STARTED]
+  - Replay STEP(id="step1") - return cached "result1"
+  - Resume STEP(id="step2")
+  - Checkpoint: START step2 (same ID, continues)


You wouldn't checkpoint START again - it's already started. Depends on semantics but can either run it again then checkpoint success/failure/retry or decide to immediately checkpoint failure, or retry, etc.
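One way to read this comment is as a decision table over the loaded state. The sketch below is illustrative only: the type names, the `decideStep` function, and the mapping of at-least-once and at-most-once semantics onto the two branches are assumptions, and other statuses (FAILED, READY, and so on) are left out for brevity.

```typescript
type StepStatus = "STARTED" | "SUCCEEDED";
type Semantics = "AT_LEAST_ONCE" | "AT_MOST_ONCE";
type Decision =
  | "CHECKPOINT_START_THEN_RUN"
  | "RETURN_CACHED_RESULT"
  | "RUN_AGAIN_WITHOUT_START"
  | "CHECKPOINT_FAIL_OR_RETRY";

interface StepRecord {
  status: StepStatus;
  result?: string;
}

// Hypothetical decision function for a STEP encountered during an invocation.
function decideStep(record: StepRecord | undefined, semantics: Semantics): Decision {
  if (record === undefined) {
    return "CHECKPOINT_START_THEN_RUN"; // never seen before: START is checkpointed once
  }
  if (record.status === "SUCCEEDED") {
    return "RETURN_CACHED_RESULT"; // replay: skip the user code
  }
  // Status is STARTED: the step was interrupted. START is NOT checkpointed again.
  return semantics === "AT_LEAST_ONCE"
    ? "RUN_AGAIN_WITHOUT_START" // assumption: re-run, then checkpoint success/failure/retry
    : "CHECKPOINT_FAIL_OR_RETRY"; // assumption: do not re-run the user code
}

// State loaded in the second invocation of the example in the diff.
const loaded: Record<string, StepRecord> = {
  step1: { status: "SUCCEEDED", result: "result1" },
  step2: { status: "STARTED" },
};

const decisionStep1 = decideStep(loaded["step1"], "AT_LEAST_ONCE");
const decisionStep2 = decideStep(loaded["step2"], "AT_LEAST_ONCE");
const decisionStep2AtMostOnce = decideStep(loaded["step2"], "AT_MOST_ONCE");
const decisionStep3 = decideStep(loaded["step3"], "AT_LEAST_ONCE");
```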

jriecken · 2025-12-09T05:30:36Z

LANGUAGE_SDK_SPECIFICATION.md

+
+```
+[callback_promise, callback_id] = await context.create_callback("approval")
+await send_approval_email(callback_id)


probably want to put this in a context.step
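A minimal sketch of the suggestion, using a hypothetical in-memory `FakeContext` rather than the SDK's API (it is synchronous for brevity, and the callback ID format is invented). It shows why the wrapper matters: without the step, the email would be sent again on every replay.

```typescript
// Minimal in-memory stand-in for a durable context; NOT the real SDK API.
class FakeContext {
  constructor(private readonly recorded: Record<string, string>) {}

  // Runs fn once and records its result; on replay returns the recorded result.
  step(name: string, fn: () => string): string {
    if (Object.prototype.hasOwnProperty.call(this.recorded, name)) {
      return this.recorded[name];
    }
    const result = fn();
    this.recorded[name] = result;
    return result;
  }

  // Returns a callback ID that is stable across replays (format is an assumption).
  createCallback(name: string): string {
    return "cb-" + name;
  }
}

const sentEmails: string[] = [];

// Side effect we want to happen once per execution, not once per invocation.
function sendApprovalEmail(callbackId: string): string {
  sentEmails.push(callbackId);
  return "sent";
}

function approvalHandler(context: FakeContext): string {
  const callbackId = context.createCallback("approval");
  // Wrapped in a step so a replay skips the send and reuses the recorded result.
  context.step("send-approval-email", () => sendApprovalEmail(callbackId));
  return callbackId;
}

const recordedSteps: Record<string, string> = {};
approvalHandler(new FakeContext(recordedSteps)); // first invocation sends the email
approvalHandler(new FakeContext(recordedSteps)); // replay finds the recorded step
```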

jriecken · 2025-12-09T05:32:04Z

LANGUAGE_SDK_SPECIFICATION.md

+    │ START action
+    ▼
+┌─────────┐
+│ STARTED │◄──────┐


the arrow here should be coming from READY

embano1 · 2025-12-09T12:48:43Z

@jriecken thx for the detailed feedback. Incorporated (diff for commit: ea5d479)

embano1 · 2025-12-09T12:59:05Z

@jriecken shall I add a section on testing (in-memory local executor)?

embano1 · 2025-12-10T07:21:31Z

Just connected with @maschnetwork and I noticed that we currently don't have guidance in this SPEC how to handle concurrent durable operations when waits/suspension are involved (simple waits, durable invokes, callbacks, including timeouts). For example, you want to use context.parallel() with a step taking 5 seconds and a wait (1s) or having two concurrent waits (5s and 5s) where you expect to not wait 10s in total - how should an SDK implement those suspension decisions?

cc/ @ParidelPooya

smking · 2025-12-16T17:35:57Z

LANGUAGE_SDK_SPECIFICATION.md

+
+1. Non-durable code (code outside operations) MUST execute identically on each replay
+2. User code MUST NOT use non-deterministic values (e.g., `Date.now()`, `Math.random()`) outside durable operations
+3. User code MUST NOT perform side effects (e.g., API calls, database writes) outside durable operations


They can perform side-effects if they want as long as they don't use the results to affect operation order.
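The distinction can be sketched with two toy handlers. Everything here is hypothetical (`DurableHandler`, `operationOrder`, the injected `clock`); the point is only that a side effect is harmless when the sequence of durable operations stays the same, and harmful when a non-deterministic value decides that sequence.

```typescript
type DurableHandler = (startOp: (name: string) => void, clock: () => number) => void;

// Runs a handler and returns the order in which it requested durable operations.
function operationOrder(handler: DurableHandler, clock: () => number): string[] {
  const ops: string[] = [];
  handler((name) => {
    ops.push(name);
  }, clock);
  return ops;
}

function sameOrder(a: string[], b: string[]): boolean {
  return a.length === b.length && a.every((name, i) => name === b[i]);
}

const logLines: string[] = [];

// Side effect (logging a timestamp) that never influences which operations run.
const harmless: DurableHandler = (startOp, clock) => {
  logLines.push("started at " + clock());
  startOp("charge");
  startOp("notify");
};

// The non-deterministic value decides the operation order: replay can diverge.
const harmful: DurableHandler = (startOp, clock) => {
  if (clock() % 2 === 0) {
    startOp("charge");
    startOp("notify");
  } else {
    startOp("notify");
    startOp("charge");
  }
};

// Simulate an original run and a replay that observe different clock values.
const harmlessStable = sameOrder(operationOrder(harmless, () => 100), operationOrder(harmless, () => 101));
const harmfulStable = sameOrder(operationOrder(harmful, () => 100), operationOrder(harmful, () => 101));
```

Here `harmlessStable` is `true` even though the log lines differ between runs, while `harmfulStable` is `false`.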

smking · 2025-12-16T19:57:17Z

LANGUAGE_SDK_SPECIFICATION.md

+```
+[New] → START → STARTED → (time passes) → SUCCEEDED [Done]
+                        ↓
+                     CANCEL → CANCELLED [Done]
+```


flowchart LR
    New[Customer calls ctx.wait] --> START
    START --> |Started| Delay{Wait}
    Delay --> |Succeeded| Success[ctx.wait completes]
    Delay --> CANCEL
    CANCEL --> |Cancelled| Cancelled[ctx.wait completes]

smking · 2025-12-16T20:16:17Z

LANGUAGE_SDK_SPECIFICATION.md

+                        └→ (external failure) → FAILED [Done]
+                        └→ (timeout) → TIMED_OUT [Done]
+```
+


flowchart LR
    New[Customer calls ctx.createCallback] --> START
    START --> |Started| Delay{Wait}
    SUCCEED --> |Succeeded| Success[ctx.createCallback completes successfully]
    FAIL --> |Failed| Failure[ctx.createCallback completes with error]
    TIMEOUT --> |TimedOut| Failure
    Delay .-> SendDurableExecutionCallbackSuccess
    Delay .-> SendDurableExecutionCallbackFailure
    Delay .-> TIMEOUT
    SendDurableExecutionCallbackSuccess --> SUCCEED
    SendDurableExecutionCallbackFailure --> FAIL
    subgraph External System
        SendDurableExecutionCallbackSuccess
        SendDurableExecutionCallbackFailure
    end

smking · 2025-12-16T20:48:25Z

LANGUAGE_SDK_SPECIFICATION.md

+                        └→ (invoke timeout) → TIMED_OUT [Done]
+                        └→ (invoke stopped) → STOPPED [Done]
+```
+


flowchart LR
    New[Customer calls ctx.invoke] --> START
    START --> |Started| Delay{Wait}
    SUCCEED --> |Succeeded| Success[ctx.invoke completes successfully]
    FAIL --> |Failed| Failure[ctx.invoke completes with error]
    TIMEOUT --> |TimedOut| Failure
    STOP[StopDurableExecution] --> |Stopped| Failure
    Delay .-> External
    Delay .-> TIMEOUT
    subgraph External System
        External@{ shape: fork }
        External .-> Invoked[Invoked Function]
        External .-> STOP
    end
    Invoked .-> SUCCEED
    Invoked .-> FAIL

smking · 2025-12-16T20:52:36Z

LANGUAGE_SDK_SPECIFICATION.md

+                        ↓
+                        └→ FAIL → FAILED [Done]
+```
+


smking · 2025-12-16T21:28:41Z

LANGUAGE_SDK_SPECIFICATION.md

+### 11.2 Async Patterns
+
+The SDK MUST integrate with the language's asynchronous programming model:
+


What do you mean by 'integrate'? The Python SDK doesn't integrate with asyncio.

MUST integrate

this could be needlessly restrictive.

There can be different ways of implementing asynchronous or concurrent work even within a language, and some opinionated views over which is "better".

Other than that, it could be that someone wants to make a deliberately simplified synchronous or light-weight version of the SDK that eschews concurrency?

This looks like an implementation decision. I can even see more than one SDK for a language, each with a different approach. For example, I can imagine someone creating a Python SDK that works in an async way.
So I agree, this should not be part of the spec.

embano1 · 2025-12-20T08:21:54Z

Thank you @yaythomas @smking - incorporated changes. Please review the diff.

embano1 · 2025-12-20T08:25:17Z

@ParidelPooya can you please also do a final review so we can close this off?

embano1 · 2025-12-20T08:38:36Z

LANGUAGE_SDK_SPECIFICATION.md

+```mermaid
+flowchart LR
+    New[Customer calls operation] --> START
+    START --> |Started| SUCCEED
+    SUCCEED --> |Succeeded| Success[Completes successfully]
+    START --> |Started| FAIL
+    FAIL --> |Failed| Failure[Completes with error]
+```


@smking would the following be more accurate though?

flowchart LR
    STARTED[Execution STARTED] --> SUCCEED
    SUCCEED --> |Succeeded| Success[Completes successfully]
    STARTED --> FAIL
    FAIL --> |Failed| Failure[Completes with error]

Note: The EXECUTION operation is always in STARTED status when the handler begins. It does not have a START action.

Signed-off-by: Michael Gasch <15986659+embano1@users.noreply.github.com>

- Clarify side-effects rule: allowed if they don't affect operation order
- Change async integration from MUST to SHOULD
- Replace ASCII state diagrams with mermaid flowcharts in Section 4
- Remove duplicate Appendix C, renumber appendices
- Bump version to 1.2

Signed-off-by: Michael Gasch <15986659+embano1@users.noreply.github.com>

Signed-off-by: Michael Gasch <15986659+embano1@users.noreply.github.com>

ParidelPooya · 2026-01-06T21:19:28Z

LANGUAGE_SDK_SPECIFICATION.md

@@ -0,0 +1,1401 @@
+# AWS Lambda Durable Functions Language SDK Specification
+
+**Version:** 1.2  


What does this version mean? Is it for the spec? Do we have spec v1 and v1.1?

ParidelPooya · 2026-01-06T21:22:27Z

LANGUAGE_SDK_SPECIFICATION.md

+
+### 2.1 Durable Function
+
+A **durable function** is a Lambda function that enables developers to build resilient multi-step applications and AI workflows that can execute for extended periods while maintaining reliable progress despite interruptions. Durable functions provide primitives to checkpoint progress and suspend execution at defined points, enabling fault-tolerant and cost-effective long-running processes (up to one year).


Do we want to mention Lambda in the spec? That is part of our implementation and should not be part of the spec.

ParidelPooya · 2026-01-06T21:24:06Z

LANGUAGE_SDK_SPECIFICATION.md

+
+### 2.2 Durable Execution
+
+A **durable execution** is the end-to-end lifecycle of a durable function, using checkpoints to track progress, suspend execution, and recover from failures. When functions resume after suspension or interruptions, the system performs replay, automatically re-executing the event handler from the beginning while skipping completed checkpoints and continuing from the point of interruption. The lifecycle may include multiple sub-invocations (Lambda function invocations that occur when resuming after wait operations, retries, or infrastructure failures) to complete the execution.


Same feedback, would be better to remove Lambda

ParidelPooya · 2026-01-06T21:27:16Z

LANGUAGE_SDK_SPECIFICATION.md

+
+SDKs MUST implement a checkpoint-and-replay execution model:
+
+1. **Checkpoint**: During execution, the SDK periodically persists operation state to the Lambda durable execution service


It's better to use "as needed" instead of "periodically".

ParidelPooya · 2026-01-06T21:28:26Z

LANGUAGE_SDK_SPECIFICATION.md

+SDKs MUST implement a checkpoint-and-replay execution model:
+
+1. **Checkpoint**: During execution, the SDK periodically persists operation state to the Lambda durable execution service
+2. **Replay**: When a function resumes after interruption, it re-executes from the beginning but skips operations that have already completed by using their checkpointed results


So far we have used "replay" instead of "re-executes". If readers look at the current implementation, they will see the replay concept, not re-execution.

ParidelPooya · 2026-01-06T23:20:36Z

LANGUAGE_SDK_SPECIFICATION.md

+
+- `CheckpointDurableExecution`: Persist operation state
+- `GetDurableExecutionState`: Retrieve execution history
+- `SendDurableExecutionCallbackSuccess`: Complete a callback successfully


The TS and Python SDKs do not use SendDurableExecutionCallbackSuccess, SendDurableExecutionCallbackFailure, or SendDurableExecutionCallbackHeartbeat.

ParidelPooya · 2026-01-06T23:21:29Z

LANGUAGE_SDK_SPECIFICATION.md

+
+The durable execution system provides **at-least-once** semantics for executions:
+
+- Operations MAY be executed more than once due to retries, timeouts, or infrastructure failures


At-least-once is only for step operation

ParidelPooya · 2026-01-06T23:22:53Z

LANGUAGE_SDK_SPECIFICATION.md

+
+- The entire batch succeeds or fails atomically
+- If a batch fails, none of the operations in the batch are recorded
+- The SDK SHOULD checkpoint critical state transitions promptly rather than accumulating large batches


Not accurate: we don't change how we batch. Instead, we can wait for the checkpoint to return a result for critical operations.

ParidelPooya · 2026-01-06T23:23:25Z

LANGUAGE_SDK_SPECIFICATION.md

+- The entire batch succeeds or fails atomically
+- If a batch fails, none of the operations in the batch are recorded
+- The SDK SHOULD checkpoint critical state transitions promptly rather than accumulating large batches
+- For time-sensitive operations, the SDK MAY checkpoint immediately rather than batching


Not accurate: you can't disable batching if there are already items in the queue.
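The behavior described in these two comments can be sketched as a queue where critical operations wait for an acknowledgement instead of bypassing the batch. `CheckpointQueue`, `OperationUpdate`, and the explicit `flush()` are hypothetical and synchronous for brevity; they are not the SDK's implementation.

```typescript
interface OperationUpdate {
  id: string;
  action: string;
}

interface PendingUpdate {
  update: OperationUpdate;
  onDurable: (() => void) | undefined;
}

// Hypothetical queue: batching is never disabled, critical callers wait for the result.
class CheckpointQueue {
  private pending: PendingUpdate[] = [];
  readonly sentBatches: OperationUpdate[][] = [];

  // Queue an update. A critical caller passes onDurable and proceeds only after it fires.
  enqueue(update: OperationUpdate, onDurable?: () => void): void {
    this.pending.push({ update, onDurable });
  }

  // Send everything queued so far as ONE batch, then acknowledge the waiting callers.
  flush(): void {
    if (this.pending.length === 0) {
      return;
    }
    const batch = this.pending;
    this.pending = [];
    this.sentBatches.push(batch.map((p) => p.update)); // stands in for the checkpoint call
    batch.forEach((p) => {
      if (p.onDurable) {
        p.onDurable();
      }
    });
  }
}

const queue = new CheckpointQueue();
const acknowledged: string[] = [];

queue.enqueue({ id: "step1", action: "SUCCEED" }); // non-critical: no waiting
queue.enqueue({ id: "step2", action: "START" }, () => {
  acknowledged.push("step2"); // critical: continue only once the checkpoint returned
});
queue.flush();
```

The critical update still travels in the same batch as the update queued before it; the only difference is that its caller waits for the checkpoint to return.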

ParidelPooya · 2026-01-06T23:24:08Z

LANGUAGE_SDK_SPECIFICATION.md

+- For time-sensitive operations, the SDK MAY checkpoint immediately rather than batching
+
+### 16.4 Performance vs Durability Trade-offs
+


Based on my comments in previous section, I don't agree with this section.

zhongkechen · 2026-01-14T01:57:55Z

Why is this spec in JS SDK? I think it's better in a shared place.

zhongkechen · 2026-01-14T19:26:23Z

LANGUAGE_SDK_SPECIFICATION.md

+- **all**: Wait for all promises to complete successfully
+- **allSettled**: Wait for all promises to settle (success or failure)
+- **race**: Return the first promise to settle
+- **any**: Return the first promise to succeed


These names could be language specific. Some languages use different names with the same semantics.

zhongkechen · 2026-01-14T19:27:58Z

LANGUAGE_SDK_SPECIFICATION.md

+
+- `minSuccessful`: Minimum successful items required
+- `toleratedFailureCount`: Maximum failures allowed
+- `toleratedFailurePercentage`: Maximum failure percentage allowed


These names should be language specific. Some languages don't follow this naming convention
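Whatever a given SDK ends up calling these options, the behavior they describe is what a conformance test would check. The sketch below keeps the names from the excerpt and guesses at plausible semantics; the `evaluate` function, the `Outcome` values, and the rule that all items must succeed when `minSuccessful` is absent are assumptions, not taken from the spec.

```typescript
interface CompletionConfig {
  minSuccessful?: number;
  toleratedFailureCount?: number;
  toleratedFailurePercentage?: number;
}

type Outcome = "CONTINUE" | "SUCCEEDED" | "FAILED";

// Hypothetical evaluation of a concurrent batch after some items have settled.
function evaluate(config: CompletionConfig, total: number, succeeded: number, failed: number): Outcome {
  if (config.toleratedFailureCount !== undefined && failed > config.toleratedFailureCount) {
    return "FAILED";
  }
  if (
    config.toleratedFailurePercentage !== undefined &&
    total > 0 &&
    (failed / total) * 100 > config.toleratedFailurePercentage
  ) {
    return "FAILED";
  }
  // Assumption: without minSuccessful, every item has to succeed.
  const needed = config.minSuccessful !== undefined ? config.minSuccessful : total;
  if (succeeded >= needed) {
    return "SUCCEEDED";
  }
  if (succeeded + failed >= total) {
    return "FAILED"; // everything settled, but not enough successes
  }
  return "CONTINUE";
}

const enoughSuccesses = evaluate({ minSuccessful: 2 }, 5, 2, 1);
const tooManyFailures = evaluate({ toleratedFailureCount: 1 }, 5, 1, 2);
const atThreshold = evaluate({ toleratedFailurePercentage: 50 }, 4, 1, 2);
const allSettledOneFailed = evaluate({}, 3, 2, 1);
```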

embano1 commented Dec 6, 2025

jriecken reviewed Dec 9, 2025

embano1 force-pushed the sdk-spec branch from 17bc8fb to ea5d479 Compare December 9, 2025 12:46

embano1 changed the title ~~Add language SDK specification~~ docs: add language sdk specification Dec 9, 2025

embano1 closed this Dec 9, 2025

embano1 reopened this Dec 9, 2025

embano1 marked this pull request as ready for review December 9, 2025 12:49

smking reviewed Dec 16, 2025

embano1 force-pushed the sdk-spec branch from a8e9f6e to a59fe6c Compare December 20, 2025 08:22

embano1 commented Dec 20, 2025

embano1 added 6 commits December 24, 2025 11:57

docs: add language sdk specification

d3798be

Signed-off-by: Michael Gasch <15986659+embano1@users.noreply.github.com>

docs: improvements

373bab1

Signed-off-by: Michael Gasch <15986659+embano1@users.noreply.github.com>

docs: address jriecken feedback

7f4828a

Signed-off-by: Michael Gasch <15986659+embano1@users.noreply.github.com>

docs: address inconsistencies

1870213

Signed-off-by: Michael Gasch <15986659+embano1@users.noreply.github.com>

docs: clarify execution flow

2a9c09f

Signed-off-by: Michael Gasch <15986659+embano1@users.noreply.github.com>

embano1 force-pushed the sdk-spec branch from e35aea3 to 2a9c09f Compare December 24, 2025 10:58

embano1 mentioned this pull request Jan 6, 2026

Docs: create the LLM.txt file aws/aws-durable-execution-sdk-python#177

Closed

ParidelPooya reviewed Jan 6, 2026

zhongkechen reviewed Jan 14, 2026


