
docs: add language sdk specification #369

Open
embano1 wants to merge 6 commits into aws:main from embano1:sdk-spec

Member

embano1 commented Dec 6, 2025

Issue #, if available: n/a

Description of changes:
Provide a language SDK specification that developers can use to build their own SDKs and that establishes a basis for conformance testing. This is a first draft to iterate on and to give builders guidance, given the strong interest in additional SDKs (Go, Rust, Java, Swift, .NET). The file should later be extracted into its own repository to create conformance tests for officially supported SDKs.

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

Invocation 1:
- Load state: []
- Start STEP(id="step1")
- Checkpoint: START step1
Member Author

I guess we need more guidance here around:

  • AT-LEAST/MOST-ONCE
  • batching/optimizations

Member

Yeah - that would be a good idea.

- Be checkpointed and resumed
- Maintain execution state across interruptions

The two core durable operation primitives are:
Member

There are 5 primitives (if you ignore the EXECUTION operation which is only used to complete the execution):

  • CALLBACK
  • CHAINED_INVOKE
  • CONTEXT
  • STEP
  • WAIT


For correct replay behavior, **user code MUST be deterministic**:

1. Non-durable code (code outside operations) MUST execute identically on each replay
Member

One thing we should note is that this may require re-implementing/providing alternatives for certain language constructs that are inherently nondeterministic.

For example in Java unless you use a LinkedHashMap instead of a HashMap, the iteration order is not guaranteed to be the same on multiple creations of the same map, or in Go where map iteration order is purposefully randomized, etc.
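
The same concern shows up in Python: `dict` keeps insertion order, but iteration order over a `set` of strings can differ between processes because of hash randomization. A minimal sketch of normalizing the order before issuing durable operations follows; the `plan_steps` helper and the `deploy-<region>` step ID format are hypothetical and only illustrate the idea.

```python
def plan_steps(regions: set[str]) -> list[str]:
    """Return step IDs in a stable order regardless of set iteration order."""
    # sorted() fixes the order; iterating the raw set could differ between
    # the original run and a replay that happens in another process.
    return [f"deploy-{region}" for region in sorted(regions)]
```

With this, a replay issues `deploy-ap-south-1`, `deploy-eu-west-1`, `deploy-us-west-2` in the same order every time.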

Contributor

By re-implementing, do you mean re-implementing in TypeScript/Python? Or do you mean we need to revise our decision in the Java implementation?

"CheckpointToken": "string",
"InitialExecutionState": {
"Operations": [
/* Operation objects */
Member

Should we link to the Lambda API docs sections where appropriate in this doc? E.g. https://docs.aws.amazon.com/lambda/latest/api/API_Operation.html


The SDK CANNOT:

- Prevent users from writing non-deterministic code
Member

I mean if it is somehow able to, it should 😄 - the spec shouldn't prevent it from doing so

Maybe this should say "The SDK is not responsible for:"

"Error": {
"ErrorType": "string",
"ErrorMessage": "string",
"StackTrace": ["string"] // OPTIONAL
Member

All of these fields are actually optional.

(There is also a fourth field, ErrorData, for additional machine-readable error data.)
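
As an illustration of what "all fields optional" could look like on the SDK side, here is a small sketch. The wire names `ErrorType`, `ErrorMessage`, `StackTrace`, and `ErrorData` come from the thread; the `ErrorObject` class, its `to_wire` method, and the assumption that `ErrorData` is a string are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ErrorObject:
    # Every field is optional, per the review comment above.
    error_type: Optional[str] = None
    error_message: Optional[str] = None
    stack_trace: Optional[list[str]] = None
    error_data: Optional[str] = None  # assumed here to be a string payload

    def to_wire(self) -> dict:
        """Serialize to the wire shape, omitting fields that were not set."""
        fields = {
            "ErrorType": self.error_type,
            "ErrorMessage": self.error_message,
            "StackTrace": self.stack_trace,
            "ErrorData": self.error_data,
        }
        return {name: value for name, value in fields.items() if value is not None}
```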


- Maximum execution duration: 1 year
- Maximum response payload: 6MB
- Maximum history size: Limited by service quotas
Member

Maximum number of durable operations (including retries)? The limit is not directly on history.

Member Author

but we also have a history limit (100MB), added both

- Load state: [step1: SUCCEEDED, step2: STARTED]
- Replay STEP(id="step1") - return cached "result1"
- Resume STEP(id="step2")
- Checkpoint: START step2 (same ID, continues)
Member

You wouldn't checkpoint START again - it's already started. Depends on semantics but can either run it again then checkpoint success/failure/retry or decide to immediately checkpoint failure, or retry, etc.
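
A rough in-memory sketch of the replay behavior discussed here may help. It picks one of the options mentioned above (re-run an already STARTED step, then checkpoint the outcome) and never checkpoints START twice. `ReplayContext` and its fields are hypothetical stand-ins, not an SDK API; only the `START`/`SUCCEED` action names and the `STARTED`/`SUCCEEDED` statuses are taken from the thread.

```python
from typing import Any, Callable


class ReplayContext:
    """In-memory stand-in for an SDK context; not a real SDK API."""

    def __init__(self, state: dict[str, dict[str, Any]]):
        self.state = state  # loaded operation state, keyed by step ID
        self.checkpoints: list[tuple[str, str]] = []  # (action, step ID) log

    def step(self, step_id: str, fn: Callable[[], Any]) -> Any:
        op = self.state.get(step_id)
        if op and op["status"] == "SUCCEEDED":
            return op["result"]  # replay: return the cached result, skip fn
        if op is None:
            # Only a step with no prior state gets a START checkpoint.
            self.checkpoints.append(("START", step_id))
            self.state[step_id] = {"status": "STARTED"}
        # An already STARTED step is simply run again in this sketch.
        result = fn()
        self.checkpoints.append(("SUCCEED", step_id))
        self.state[step_id] = {"status": "SUCCEEDED", "result": result}
        return result
```

Loading `{"step1": SUCCEEDED, "step2": STARTED}` and calling both steps yields the cached `"result1"` for `step1` and a single `SUCCEED` checkpoint for `step2`.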


```
[callback_promise, callback_id] = await context.create_callback("approval")
await send_approval_email(callback_id)
Member

probably want to put this in a context.step

│ START action
┌─────────┐
│ STARTED │◄──────┐
Member

the arrow here should be coming from READY

Member Author

embano1 commented Dec 9, 2025

@jriecken thx for the detailed feedback. Incorporated (diff for commit: ea5d479)

embano1 changed the title from "Add language SDK specification" to "docs: add language sdk specification" Dec 9, 2025
embano1 closed this Dec 9, 2025
embano1 reopened this Dec 9, 2025
embano1 marked this pull request as ready for review December 9, 2025 12:49
Member Author

embano1 commented Dec 9, 2025

@jriecken shall I add a section on testing (in-memory local executor)?

Member Author

embano1 commented Dec 10, 2025

Just connected with @maschnetwork and noticed that this spec currently has no guidance on how to handle concurrent durable operations when waits/suspension are involved (simple waits, durable invokes, callbacks, including timeouts). For example, suppose you use context.parallel() with a step taking 5 seconds and a wait (1s), or with two concurrent waits (5s and 5s) that you expect to take less than 10s in total. How should an SDK implement those suspension decisions?

cc/ @ParidelPooya
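
To make the question concrete, here is one possible policy, sketched under assumptions: suspend only when no branch has runnable work, and resume at the earliest pending wait deadline so that overlapping waits do not add up. The thread does not settle on this; `next_wake_up` and its parameters are hypothetical.

```python
from typing import Optional


def next_wake_up(now: float, wait_deadlines: list[float], running_steps: int) -> Optional[float]:
    """Return the time to suspend until, or None to keep the invocation alive."""
    if running_steps > 0:
        return None  # a branch still has work to do; do not suspend yet
    pending = [deadline for deadline in wait_deadlines if deadline > now]
    if not pending:
        return None  # every wait has already elapsed; continue immediately
    return min(pending)  # waits overlap, so resume at the earliest deadline
```

Under this policy two concurrent 5s waits started at time 0 produce a single wake-up at time 5, and a 1s wait next to a running 5s step does not suspend the invocation at all.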


1. Non-durable code (code outside operations) MUST execute identically on each replay
2. User code MUST NOT use non-deterministic values (e.g., `Date.now()`, `Math.random()`) outside durable operations
3. User code MUST NOT perform side effects (e.g., API calls, database writes) outside durable operations
smking commented Dec 16, 2025

They can perform side-effects if they want as long as they don't use the results to affect operation order.
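
The distinction can be sketched with a toy recorder that captures the order in which a handler issues durable operations. Everything here (`run`, `safe_handler`, `unsafe_handler`, the step names) is hypothetical and only illustrates the rule as stated in the comment.

```python
import random


def run(handler) -> list[str]:
    """Run a handler and record the order of the operation IDs it issues."""
    ops: list[str] = []
    handler(ops.append)
    return ops


def safe_handler(step) -> None:
    print("starting")  # a side effect, but it does not steer control flow
    step("charge")
    step("ship")


def unsafe_handler(step) -> None:
    if random.random() < 0.5:  # a non-deterministic value decides the order
        step("charge")
        step("ship")
    else:
        step("ship")
        step("charge")
```

`safe_handler` produces the same operation sequence on every run, so replay lines up with the checkpointed history; `unsafe_handler` may not.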

Comment on lines +370 to +374
```
[New] → START → STARTED → (time passes) → SUCCEEDED [Done]
CANCEL → CANCELLED [Done]
```
smking commented Dec 16, 2025

flowchart LR
    New[Customer calls ctx.wait] --> START
    START --> |Started| Delay{Wait}
    Delay --> |Succeeded| Success[ctx.wait completes]
    Delay --> CANCEL
    CANCEL --> |Cancelled| Cancelled[ctx.wait completes]
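
The wait lifecycle in this flowchart can also be written as a small transition table, which is one way an SDK or a conformance test might validate status changes. This is only a sketch: `NEW` and `TIME_ELAPSED` are labels made up here for the not-yet-started state and the "time passes" edge, and `apply_wait_event` is a hypothetical helper.

```python
# Allowed (status, event) pairs for a WAIT operation and the resulting status.
# "NEW" and "TIME_ELAPSED" are labels invented for this sketch.
WAIT_TRANSITIONS = {
    ("NEW", "START"): "STARTED",
    ("STARTED", "TIME_ELAPSED"): "SUCCEEDED",
    ("STARTED", "CANCEL"): "CANCELLED",
}


def apply_wait_event(status: str, event: str) -> str:
    """Return the next status, or raise if the transition is not allowed."""
    try:
        return WAIT_TRANSITIONS[(status, event)]
    except KeyError:
        raise ValueError(f"illegal transition: {event} from {status}") from None
```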

└→ (external failure) → FAILED [Done]
└→ (timeout) → TIMED_OUT [Done]
```

smking commented Dec 16, 2025

flowchart LR
    New[Customer calls ctx.createCallback] --> START
    START --> |Started| Delay{Wait}
    SUCCEED --> |Succeeded| Success[ctx.createCallback completes successfully]
    FAIL --> |Failed| Failure[ctx.createCallback completes with error]
    TIMEOUT --> |TimedOut| Failure

    Delay .-> SendDurableExecutionCallbackSuccess
    Delay .-> SendDurableExecutionCallbackFailure
    Delay .-> TIMEOUT

    SendDurableExecutionCallbackSuccess --> SUCCEED
    SendDurableExecutionCallbackFailure --> FAIL

    subgraph External System
        SendDurableExecutionCallbackSuccess
        SendDurableExecutionCallbackFailure
    end


└→ (invoke timeout) → TIMED_OUT [Done]
└→ (invoke stopped) → STOPPED [Done]
```

flowchart LR
    New[Customer calls ctx.invoke] --> START
    START --> |Started| Delay{Wait}
    SUCCEED --> |Succeeded| Success[ctx.invoke completes successfully]
    FAIL --> |Failed| Failure[ctx.invoke completes with error]
    TIMEOUT --> |TimedOut| Failure
    STOP[StopDurableExecution] --> |Stopped| Failure

    Delay .-> External
    Delay .-> TIMEOUT

    subgraph External System
        External@{ shape: fork }
        External .-> Invoked[Invoked Function]
        External .-> STOP        
    end

    Invoked .-> SUCCEED
    Invoked .-> FAIL

└→ FAIL → FAILED [Done]
```

flowchart LR
    New[Customer calls operation] --> START
    START --> |Started| SUCCEED
    SUCCEED --> |Succeeded| Success[Completes successfully]
    START --> |Started| FAIL
    FAIL --> |Failed| Failure[Completes with error]

### 11.2 Async Patterns

The SDK MUST integrate with the language's asynchronous programming model:

What do you mean by 'integrate'? The Python SDK doesn't integrate with asyncio.

> MUST integrate

this could be needlessly restrictive.

There can be different ways of implementing asynchronous or concurrent work even within a language, and some opinionated views over which is "better".

Other than that, it could be that someone wants to make a deliberately simplified synchronous or light-weight version of the SDK that eschews concurrency?

Contributor

This looks like an implementation decision. I can even see more than one SDK for a language, each with a different approach. For example, I can imagine someone creating a Python SDK that works in an async way.
So I agree, this should not be part of the spec.

Member Author

embano1 commented Dec 20, 2025

Thank you @yaythomas @smking - incorporated changes. Please review the diff.

Member Author

embano1 commented Dec 20, 2025

@ParidelPooya can you please also do a final review so we can close this off?

Comment on lines +303 to +310
```mermaid
flowchart LR
New[Customer calls operation] --> START
START --> |Started| SUCCEED
SUCCEED --> |Succeeded| Success[Completes successfully]
START --> |Started| FAIL
FAIL --> |Failed| Failure[Completes with error]
```
Member Author

@smking would the following be more accurate though?

flowchart LR
    STARTED[Execution STARTED] --> SUCCEED
    SUCCEED --> |Succeeded| Success[Completes successfully]
    STARTED --> FAIL
    FAIL --> |Failed| Failure[Completes with error]

Note: The EXECUTION operation is always in STARTED status when the handler begins. It does not have a START action.

In this case yes, you're right. For the EXECUTION operation it's already started.

Slightly better:

flowchart LR
    STARTED[Execution starts] --> |Started| SUCCEED
    SUCCEED --> |Succeeded| Success[Completes successfully]
    STARTED --> |Started| FAIL
    FAIL --> |Failed| Failure[Completes with error]

Member Author

addressed

Signed-off-by: Michael Gasch <15986659+embano1@users.noreply.github.com>
- Clarify side-effects rule: allowed if they don't affect operation order
- Change async integration from MUST to SHOULD
- Replace ASCII state diagrams with mermaid flowcharts in Section 4
- Remove duplicate Appendix C, renumber appendices
- Bump version to 1.2

Signed-off-by: Michael Gasch <15986659+embano1@users.noreply.github.com>
@@ -0,0 +1,1401 @@
# AWS Lambda Durable Functions Language SDK Specification

**Version:** 1.2
Contributor

What does this version mean? Is it the version of the spec? Do we have spec v1 and v1.1?


### 2.1 Durable Function

A **durable function** is a Lambda function that enables developers to build resilient multi-step applications and AI workflows that can execute for extended periods while maintaining reliable progress despite interruptions. Durable functions provide primitives to checkpoint progress and suspend execution at defined points, enabling fault-tolerant and cost-effective long-running processes (up to one year).
Contributor

Do we want to mention Lambda in the spec? It is part of our implementation and should not be part of the spec.


### 2.2 Durable Execution

A **durable execution** is the end-to-end lifecycle of a durable function, using checkpoints to track progress, suspend execution, and recover from failures. When functions resume after suspension or interruptions, the system performs replay, automatically re-executing the event handler from the beginning while skipping completed checkpoints and continuing from the point of interruption. The lifecycle may include multiple sub-invocations (Lambda function invocations that occur when resuming after wait operations, retries, or infrastructure failures) to complete the execution.
Contributor

Same feedback: it would be better to remove Lambda.


SDKs MUST implement a checkpoint-and-replay execution model:

1. **Checkpoint**: During execution, the SDK periodically persists operation state to the Lambda durable execution service
Contributor

It's better to say "as needed" instead of "periodically".

SDKs MUST implement a checkpoint-and-replay execution model:

1. **Checkpoint**: During execution, the SDK periodically persists operation state to the Lambda durable execution service
2. **Replay**: When a function resumes after interruption, it re-executes from the beginning but skips operations that have already completed by using their checkpointed results
Contributor

So far we have used "replay" rather than "re-executes". If readers look at the current implementation, they will see the replay concept, not "re-executes".


- `CheckpointDurableExecution`: Persist operation state
- `GetDurableExecutionState`: Retrieve execution history
- `SendDurableExecutionCallbackSuccess`: Complete a callback successfully
Contributor

The TS and Python SDKs do not use SendDurableExecutionCallbackSuccess, SendDurableExecutionCallbackFailure, or SendDurableExecutionCallbackHeartbeat.


The durable execution system provides **at-least-once** semantics for executions:

- Operations MAY be executed more than once due to retries, timeouts, or infrastructure failures
Contributor

At-least-once applies only to the step operation.


- The entire batch succeeds or fails atomically
- If a batch fails, none of the operations in the batch are recorded
- The SDK SHOULD checkpoint critical state transitions promptly rather than accumulating large batches
Contributor

Not accurate: we don't change how we batch. Instead, we can wait for the checkpoint to return a result for critical operations.

- The entire batch succeeds or fails atomically
- If a batch fails, none of the operations in the batch are recorded
- The SDK SHOULD checkpoint critical state transitions promptly rather than accumulating large batches
- For time-sensitive operations, the SDK MAY checkpoint immediately rather than batching
Contributor

Not accurate: you can't disable batching if there is already something in the queue.
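
One reading of these two comments, sketched as a toy queue: a critical update never bypasses the batch, it joins whatever is already queued, and the caller blocks until the batch containing it has been sent. `CheckpointQueue`, `enqueue_and_wait`, and `send_batch` are hypothetical names; `send_batch` stands in for the actual checkpoint call, and this is not how any existing SDK is confirmed to work.

```python
class CheckpointQueue:
    """Toy batching queue; `send_batch` stands in for the checkpoint API call."""

    def __init__(self, send_batch):
        self._send_batch = send_batch
        self._pending: list[str] = []

    def enqueue(self, update: str) -> None:
        # Non-critical update: goes out with a later batch.
        self._pending.append(update)

    def enqueue_and_wait(self, update: str) -> None:
        # Critical update: it still joins the updates already queued, but the
        # caller does not continue until that batch has been sent.
        self._pending.append(update)
        self.flush()

    def flush(self) -> None:
        if self._pending:
            batch, self._pending = self._pending, []
            self._send_batch(batch)
```

Calling `enqueue("a")` followed by `enqueue_and_wait("b")` sends one batch `["a", "b"]`, which preserves ordering and keeps batching intact.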

- For time-sensitive operations, the SDK MAY checkpoint immediately rather than batching

### 16.4 Performance vs Durability Trade-offs

Contributor

Based on my comments in previous section, I don't agree with this section.

@zhongkechen

Why is this spec in JS SDK? I think it's better in a shared place.

- **all**: Wait for all promises to complete successfully
- **allSettled**: Wait for all promises to settle (success or failure)
- **race**: Return the first promise to settle
- **any**: Return the first promise to succeed
zhongkechen commented Jan 14, 2026

These names could be language specific. Some languages use different names with the same semantics.
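
Python's `asyncio` is an example of this: the same semantics exist under different names. The sketch below maps three of the four combinators onto standard-library calls; the wrapper names `all_`, `all_settled`, and `race` are hypothetical. An `any` equivalent would need a small loop over `asyncio.as_completed` and is left out here.

```python
import asyncio


async def all_(*aws):
    # Like "all": raises as soon as any awaitable raises.
    return await asyncio.gather(*aws)


async def all_settled(*aws):
    # Like "allSettled": exceptions are returned in place of results.
    return await asyncio.gather(*aws, return_exceptions=True)


async def race(*aws):
    # Like "race": the first awaitable to finish wins, success or failure.
    tasks = [asyncio.ensure_future(a) for a in aws]
    done, pending = await asyncio.wait(tasks, return_when=asyncio.FIRST_COMPLETED)
    for task in pending:
        task.cancel()  # the losers are no longer needed
    return next(iter(done)).result()
```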


- `minSuccessful`: Minimum successful items required
- `toleratedFailureCount`: Maximum failures allowed
- `toleratedFailurePercentage`: Maximum failure percentage allowed

These names should be language specific. Some languages don't follow this naming convention
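
For instance, a Python SDK would likely expose these options in snake_case. The sketch below shows what that could look like; the `CompletionConfig` class, both method names, and the exact threshold semantics (strictly more failures than tolerated counts as exceeded) are assumptions for illustration, not taken from the spec.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CompletionConfig:
    # snake_case counterparts of minSuccessful, toleratedFailureCount, and
    # toleratedFailurePercentage; the semantics below are assumed.
    min_successful: Optional[int] = None
    tolerated_failure_count: Optional[int] = None
    tolerated_failure_percentage: Optional[float] = None

    def failure_budget_exceeded(self, failed: int, total: int) -> bool:
        if self.tolerated_failure_count is not None and failed > self.tolerated_failure_count:
            return True
        if self.tolerated_failure_percentage is not None and total > 0:
            return failed / total * 100 > self.tolerated_failure_percentage
        return False

    def success_target_met(self, succeeded: int) -> bool:
        return self.min_successful is not None and succeeded >= self.min_successful
```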

6 participants