Build resilient, long-running AWS Lambda functions that automatically checkpoint progress and resume after failures. Durable functions can run for up to one year while you pay only for active compute time.
- Automatic Checkpointing – Progress is saved after each step; failures resume from the last checkpoint
- Cost-Effective Waits – Suspend execution for minutes, hours, or days without compute charges
- Configurable Retries – Built-in retry strategies with exponential backoff and jitter
- Replay Safety – Functions deterministically resume from checkpoints after interruptions
- Type Safety – Full generic type support for step results
Your durable function extends DurableHandler<I, O> and implements handleRequest(I input, DurableContext ctx). The DurableContext is your interface to durable operations:
ctx.step()– Execute code and checkpoint the resultctx.stepAsync()– Start a concurrent stepctx.wait()– Suspend execution without compute chargesctx.createCallback()– Wait for external events (approvals, webhooks)ctx.invoke()– Invoke another Lambda function and wait for the resultctx.invokeAsync()– Start a concurrent Lambda function invocationctx.runInChildContext()– Run an isolated child context with its own checkpoint logctx.runInChildContextAsync()– Start a concurrent child context
<dependency>
<groupId>software.amazon.lambda.durable</groupId>
<artifactId>aws-durable-execution-sdk-java</artifactId>
<version>VERSION</version>
</dependency>public class OrderProcessor extends DurableHandler<Order, OrderResult> {
@Override
protected OrderResult handleRequest(Order order, DurableContext ctx) {
// Step 1: Validate and reserve inventory
var reservation = ctx.step("reserve-inventory", Reservation.class,
() -> inventoryService.reserve(order.getItems()));
// Step 2: Process payment
var payment = ctx.step("process-payment", Payment.class,
() -> paymentService.charge(order.getPaymentMethod(), order.getTotal()));
// Wait for warehouse processing (no compute charges)
ctx.wait(Duration.ofHours(2));
// Step 3: Confirm shipment
var shipment = ctx.step("confirm-shipment", Shipment.class,
() -> shippingService.ship(reservation, order.getAddress()));
return new OrderResult(order.getId(), shipment.getTrackingNumber());
}
}Steps execute your code and checkpoint the result. On replay, results from completed checkpoints are returned without re-execution.
// Basic step (blocks until complete)
var result = ctx.step("fetch-user", User.class, () -> userService.getUser(userId));
// Step with custom configuration (retries, semantics, serialization)
var result = ctx.step("call-api", Response.class,
() -> externalApi.call(request),
StepConfig.builder()
.retryStrategy(...)
.semantics(...)
.build());See Step Configuration for retry strategies, delivery semantics, and per-step serialization options.
stepAsync() starts a step in the background and returns a DurableFuture<T>. This enables concurrent execution patterns.
// Start multiple operations concurrently
DurableFuture<User> userFuture = ctx.stepAsync("fetch-user", User.class,
() -> userService.getUser(userId));
DurableFuture<List<Order>> ordersFuture = ctx.stepAsync("fetch-orders",
new TypeToken<List<Order>>() {}, () -> orderService.getOrders(userId));
// Both operations run concurrently
// Block and get results when needed
User user = userFuture.get();
List<Order> orders = ordersFuture.get();Waits suspend the function and resume after the specified duration. You're not charged during suspension.
// Wait 30 minutes
ctx.wait(Duration.ofMinutes(30));
// Named wait (useful for debugging)
ctx.wait("cooling-off-period", Duration.ofDays(7));Callbacks suspend execution until an external system sends a result. Use this for human approvals, webhooks, or any event-driven workflow.
// Create a callback and get the ID to share with external systems
DurableCallbackFuture<String> callback = ctx.createCallback("approval", String.class);
// Send the callback ID to an external system within a step
ctx.step("send-notification", String.class, () -> {
notificationService.sendApprovalRequest(callback.callbackId(), requestDetails);
return "notification-sent";
});
// Suspend until the external system calls back with a result
String approvalResult = callback.get();The external system completes the callback by calling the Lambda Durable Functions API with the callback ID and result payload.
Configure timeouts and serialization to handle cases where callbacks are never completed or need custom deserialization:
var config = CallbackConfig.builder()
.timeout(Duration.ofHours(24)) // Max time to wait for callback
.heartbeatTimeout(Duration.ofHours(1)) // Max time between heartbeats
.serDes(new CustomSerDes()) // Custom serialization/deserialization
.build();
var callback = ctx.createCallback("approval", String.class, config);| Option | Description |
|---|---|
timeout() |
Maximum duration to wait for the callback to complete |
heartbeatTimeout() |
Maximum duration between heartbeat signals from the external system |
serDes() |
Custom SerDes for deserializing callback results (e.g., encryption, custom formats) |
| Exception | When Thrown |
|---|---|
CallbackTimeoutException |
Callback exceeded its timeout duration |
CallbackFailedException |
External system sent an error response |
try {
var result = callback.get();
} catch (CallbackTimeoutException e) {
// Callback timed out - implement fallback logic
} catch (CallbackFailedException e) {
// External system reported an error
}// Basic invoke
var result = ctx.invoke("invoke-function",
"function-name",
"\"payload\"",
Result.class,
InvokeConfig.builder()
.payloadSerDes(...) // payload serializer
.resultSerDes(...) // result deserializer
.timeout(Duration.of(...)) // wait timeout
.tenantId(...) // Lambda tenantId
.build()
);
Child contexts run an isolated stream of work with their own operation counter and checkpoint log. They support the full range of durable operations — step, wait, invoke, createCallback, and nested child contexts.
// Sync: blocks until the child context completes
var result = ctx.runInChildContext("validate-order", String.class, child -> {
var data = child.step("fetch", String.class, () -> fetchData());
child.wait(Duration.ofMinutes(5));
return child.step("validate", String.class, () -> validate(data));
});
// Async: returns a DurableFuture for concurrent execution
var futureA = ctx.runInChildContextAsync("branch-a", String.class, child -> {
return child.step("work-a", String.class, () -> doWorkA());
});
var futureB = ctx.runInChildContextAsync("branch-b", String.class, child -> {
return child.step("work-b", String.class, () -> doWorkB());
});
// Wait for all child contexts to complete
var results = DurableFuture.allOf(futureA, futureB);Configure step behavior with StepConfig:
ctx.step("my-step", Result.class, () -> doWork(),
StepConfig.builder()
.retryStrategy(...) // How to handle failures
.semantics(...) // At-least-once vs at-most-once
.serDes(...) // Custom serialization
.build());Configure how steps handle transient failures:
// No retry - fail immediately (default)
var noRetries = StepConfig.builder().retryStrategy(RetryStrategies.Presets.NO_RETRY).build()
// Exponential backoff with jitter
var customRetries = StepConfig.builder()
.retryStrategy(RetryStrategies.exponentialBackoff(
5, // max attempts
Duration.ofSeconds(2), // initial delay
Duration.ofSeconds(30), // max delay
2.0, // backoff multiplier
JitterStrategy.FULL)) // randomize delays
.build()Control how steps behave when interrupted mid-execution:
| Semantic | Behavior | Use Case |
|---|---|---|
AT_LEAST_ONCE_PER_RETRY (default) |
Re-executes step if interrupted before completion | Idempotent operations (database upserts, API calls with idempotency keys) |
AT_MOST_ONCE_PER_RETRY |
Never re-executes; throws StepInterruptedException if interrupted |
Non-idempotent operations (sending emails, charging payments) |
// Default: at-least-once per retry (step may re-run if interrupted)
var result = ctx.step("idempotent-update", Result.class,
() -> database.upsert(record));
// At-most-once per retry
var result = ctx.step("send-email", Result.class,
() -> emailService.send(notification),
StepConfig.builder()
.semantics(StepSemantics.AT_MOST_ONCE_PER_RETRY)
.build());Important: These semantics apply per retry attempt, not per overall execution:
- AT_LEAST_ONCE_PER_RETRY: The step executes at least once per retry. If the step succeeds but checkpointing fails (e.g., sandbox crash), the step re-executes on replay.
- AT_MOST_ONCE_PER_RETRY: A checkpoint is created before execution. If failure occurs after checkpoint but before completion, the step is skipped on replay and
StepInterruptedExceptionis thrown.
To achieve step-level at-most-once semantics, combine with a no-retry strategy:
// True at-most-once: step executes at most once, ever
var result = ctx.step("charge-payment", Result.class,
() -> paymentService.charge(amount),
StepConfig.builder()
.semantics(StepSemantics.AT_MOST_ONCE_PER_RETRY)
.retryStrategy(RetryStrategies.Presets.NO_RETRY)
.build());Without this, a step using AT_MOST_ONCE_PER_RETRY with retries enabled could still execute multiple times across different retry attempts.
When a step returns a parameterized type like List<User>, use TypeToken to preserve the type information:
var users = ctx.step("fetch-users", new TypeToken<List<User>>() {},
() -> userService.getAllUsers());
var orderMap = ctx.step("fetch-orders", new TypeToken<Map<String, Order>>() {},
() -> orderService.getOrdersByCustomer());This is needed for the SDK to deserialize a checkpointed result and get the exact type to reconstruct. See TypeToken and Type Erasure for technical details.
Customize SDK behavior by overriding createConfiguration() in your handler:
public class OrderProcessor extends DurableHandler<Order, OrderResult> {
@Override
protected DurableConfig createConfiguration() {
// Custom Lambda client with connection pooling
var lambdaClient = LambdaClient.builder()
.httpClient(ApacheHttpClient.builder()
.maxConnections(50)
.connectionTimeout(Duration.ofSeconds(30))
.build())
.build();
return DurableConfig.builder()
.withLambdaClient(lambdaClient)
.withSerDes(new MyCustomSerDes()) // Custom serialization
.withExecutorService(Executors.newFixedThreadPool(10)) // Custom thread pool
.withLoggerConfig(LoggerConfig.withReplayLogging()) // Enable replay logs
.build();
}
@Override
protected OrderResult handleRequest(Order order, DurableContext ctx) {
// Your handler logic
}
}| Option | Description | Default |
|---|---|---|
withLambdaClient() |
Custom AWS Lambda client | Auto-configured Lambda client |
withSerDes() |
Serializer for step results | Jackson with default settings |
withExecutorService() |
Thread pool for user-defined operations | Cached daemon thread pool |
withLoggerConfig() |
Logger behavior configuration | Suppress logs during replay |
The withExecutorService() option configures the thread pool used for running user-defined operations. Internal SDK coordination (checkpoint batching, polling) runs on an SDK-managed thread pool.
The SDK provides a DurableLogger via ctx.getLogger() that automatically includes execution metadata in log entries and suppresses duplicate logs during replay.
@Override
protected OrderResult handleRequest(Order order, DurableContext ctx) {
ctx.getLogger().info("Processing order: {}", order.getId());
var result = ctx.step("validate", String.class, () -> {
ctx.getLogger().debug("Validating order details");
return validate(order);
});
ctx.getLogger().info("Order processed successfully");
return new OrderResult(result);
}Logs include execution context via MDC (works with any SLF4J-compatible logging framework):
{
"timestamp": "2024-01-15T10:30:00.000Z",
"level": "INFO",
"message": "Processing order: ORD-123",
"durableExecutionArn": "arn:aws:lambda:us-east-1:123456789:function:order-processor:exec-abc123",
"requestId": "a1b2c3d4-5678-90ab-cdef-example12345",
"operationId": "1",
"operationName": "validate"
}By default, logs are suppressed during replay to avoid duplicates:
First Invocation:
[INFO] Processing order: ORD-123 ✓ Logged
[DEBUG] Validating order details ✓ Logged
Replay (after wait):
[INFO] Processing order: ORD-123 ✗ Suppressed (already logged)
[DEBUG] Validating order details ✗ Suppressed
[INFO] Continuing after wait ✓ Logged (new code path)
To log during replay (e.g., for debugging):
@Override
protected DurableConfig createConfiguration() {
return DurableConfig.builder()
.withLoggerConfig(LoggerConfig.withReplayLogging())
.build();
}The SDK throws specific exceptions to help you handle different failure scenarios:
DurableExecutionException - General durable exception
├── NonDeterministicExecutionException - Code changed between original execution and replay. Fix code to maintain determinism; don't change step order/names.
├── SerDesException - Serialization and deserialization exception.
└── DurableOperationException - General operation exception
├── StepException - General Step exception
│ ├── StepFailedException - Step exhausted all retry attempts.Catch to implement fallback logic or let execution fail.
│ └── StepInterruptedException - `AT_MOST_ONCE` step was interrupted before completion. Implement manual recovery (check if operation completed externally)
├── InvokeException - General chained invocation exception
│ ├── InvokeFailedException - Chained invocation failed. Handle the error or propagate failure.
│ ├── InvokeTimedoutException - Chained invocation timed out. Handle the error or propagate failure.
│ └── InvokeStoppedException - Chained invocation stopped. Handle the error or propagate failure.
├── CallbackException - General callback exception
│ ├── CallbackFailedException - External system sent an error response to the callback. Handle the error or propagate failure
│ └── CallbackTimeoutException - Callback exceeded its timeout duration. Handle the error or propagate the failure
└── ChildContextFailedException - Child context failed and the original exception could not be reconstructed
try {
var result = ctx.step("charge-payment", Payment.class,
() -> paymentService.charge(amount),
StepConfig.builder()
.semantics(StepSemantics.AT_MOST_ONCE_PER_RETRY)
.build());
} catch (StepInterruptedException e) {
// Step started but we don't know if it completed
// Check payment status externally before retrying
var status = paymentService.checkStatus(transactionId);
if (status.isPending()) {
throw e; // Let it fail - manual intervention needed
}
}The SDK includes testing utilities for both local development and cloud-based integration testing.
<dependency>
<groupId>software.amazon.lambda.durable</groupId>
<artifactId>aws-durable-execution-sdk-java-testing</artifactId>
<version>VERSION</version>
<scope>test</scope>
</dependency>@Test
void testOrderProcessing() {
var handler = new OrderProcessor();
var runner = LocalDurableTestRunner.create(Order.class, handler);
var result = runner.runUntilComplete(new Order("order-123", items));
assertEquals(ExecutionStatus.SUCCEEDED, result.getStatus());
assertNotNull(result.getResult(OrderResult.class).getTrackingNumber());
}You can also pass a lambda directly instead of a handler instance:
var runner = LocalDurableTestRunner.create(Order.class, (order, ctx) -> {
var result = ctx.step("process", String.class, () -> "done");
return new OrderResult(order.getId(), result);
});var result = runner.runUntilComplete(input);
// Verify specific step completed
var paymentOp = result.getOperation("process-payment");
assertNotNull(paymentOp);
assertEquals(OperationStatus.SUCCEEDED, paymentOp.getStatus());
// Get step result
var paymentResult = paymentOp.getStepResult(Payment.class);
assertNotNull(paymentResult.getTransactionId());
// Inspect all operations
List<TestOperation> succeeded = result.getSucceededOperations();
List<TestOperation> failed = result.getFailedOperations();By default, runUntilComplete() skips wait durations. For testing time-dependent logic, disable this:
var runner = LocalDurableTestRunner.create(Order.class, handler)
.withSkipTime(false); // Don't auto-advance time
var result = runner.run(input);
assertEquals(ExecutionStatus.PENDING, result.getStatus()); // Blocked on wait
runner.advanceTime(); // Manually advance past the wait
result = runner.run(input);
assertEquals(ExecutionStatus.SUCCEEDED, result.getStatus());Test against deployed Lambda functions:
var runner = CloudDurableTestRunner.create(
"arn:aws:lambda:us-east-1:123456789012:function:order-processor:$LATEST",
Order.class,
OrderResult.class);
var result = runner.run(new Order("order-123", items));
assertEquals(ExecutionStatus.SUCCEEDED, result.getStatus());See examples/README.md for complete instructions on local testing, deployment, invoking functions, and running cloud integration tests.
- AWS Lambda Durable Functions – Official AWS documentation
- Durable Execution SDK Guide – SDK usage guide
- Best Practices – Patterns and recommendations
See CONTRIBUTING for information about reporting security issues.
This project is licensed under the Apache-2.0 License. See LICENSE for details.