|
| 1 | +# RFC-001: Checkpoint/Resume for Dynamiq Workflows |
| 2 | + |
| 3 | +**Status:** Final Draft v7.0 |
| 4 | +**Created:** January 6, 2026 |
| 5 | +**Author:** AI Architecture Team |
| 6 | +**Target Review:** Architecture Review Board |
| 7 | + |
| 8 | +--- |
| 9 | + |
| 10 | +## Overview |
| 11 | + |
| 12 | +This RFC proposes adding checkpoint/resume capabilities to Dynamiq workflows. The full RFC is split into 10 sub-documents (Parts 0-9, listed under Document Structure) for detailed coverage. |
| 13 | + |
| 14 | +**📁 Full RFC Location:** [`docs/rfc-001-checkpoint-resume/`](./rfc-001-checkpoint-resume/) |
| 15 | + |
| 16 | +--- |
| 17 | + |
| 18 | +## Problem Statement |
| 19 | + |
| 20 | +Dynamiq workflows are currently **stateless**. When a workflow: |
| 21 | +- Fails mid-execution |
| 22 | +- Requires human input across sessions |
| 23 | +- Is interrupted by infrastructure issues |
| 24 | + |
| 25 | +...there is no way to resume from where it left off. |
| 26 | + |
| 27 | +**Impact:** |
| 28 | +- Wasted computation (re-executing completed LLM calls) |
| 29 | +- Poor UX for human-in-the-loop workflows |
| 30 | +- No fault tolerance for long-running agent tasks |
| 31 | + |
| 32 | +--- |
| 33 | + |
| 34 | +## Proposed Solution |
| 35 | + |
| 36 | +Add an **opt-in checkpoint/resume capability** that: |
| 37 | + |
| 38 | +1. **Persists workflow state** after each node execution |
| 39 | +2. **Supports resumption** from any checkpoint |
| 40 | +3. **Integrates with existing runtime** (WebSocket, SSE, HITL) |
| 41 | +4. **Handles complex nodes** (Agent loops, Map parallelism, orchestrators) |
| 42 | +5. **Maintains 100% backward compatibility** |
| 43 | + |
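The first two points (persist state after each node, resume from any checkpoint) can be illustrated with a minimal, dependency-free sketch. Everything here is a hypothetical stand-in rather than the proposed Dynamiq API: `Checkpoint`, `run_flow`, and the list used as a checkpoint store are assumptions made for illustration, and a "node" is reduced to an id plus a function over a shared state dict.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical stand-in for a node: an id plus a function over the shared state.
Node = Tuple[str, Callable[[dict], dict]]


@dataclass
class Checkpoint:
    completed: List[str] = field(default_factory=list)      # ids of nodes that finished
    state: Dict[str, object] = field(default_factory=dict)  # merged outputs so far


def run_flow(
    nodes: List[Node],
    initial_state: dict,
    saved: List[Checkpoint],
    resume_from: Optional[Checkpoint] = None,
) -> dict:
    """Run nodes in order, appending a checkpoint to `saved` after each one."""
    completed = list(resume_from.completed) if resume_from else []
    state = dict(resume_from.state) if resume_from else dict(initial_state)
    for node_id, fn in nodes:
        if node_id in completed:
            continue  # already covered by the checkpoint, so do not re-execute
        state = {**state, **fn(state)}
        completed.append(node_id)
        # Persist only after the node succeeded, so a crash never records partial work.
        saved.append(Checkpoint(completed=list(completed), state=dict(state)))
    return state
```

Calling `run_flow(nodes, state, saved, resume_from=saved[-1])` after a failure skips every node listed in the checkpoint and continues with the first unfinished one.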
| 44 | +--- |
| 45 | + |
| 46 | +## Key Design Decisions |
| 47 | + |
| 48 | +| Decision | Source | Rationale | |
| 49 | +|----------|--------|-----------| |
| 50 | +| Checkpoint at node boundaries | LangGraph | Matches DAG execution model | |
| 51 | +| `PENDING_INPUT` status for HITL | CrewAI | Explicit handling of human feedback | |
| 52 | +| Clone-based resume | Metaflow | Skip completed nodes efficiently | |
| 53 | +| Protocol-based node support | All frameworks | Each node defines its checkpoint logic | |
| 54 | +| Pydantic models | Dynamiq pattern | Type safety, consistent serialization | |
| 55 | + |
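To make "protocol-based node support" concrete, here is one way such a protocol might look. This is a sketch under stated assumptions: the method names `to_checkpoint_state` and `from_checkpoint_state`, the `Checkpointable` protocol, and the `LoopingAgent` class are all hypothetical, and the example uses stdlib dataclasses to stay dependency-free, whereas the RFC proposes Pydantic models.

```python
import json
from dataclasses import dataclass, field
from typing import Any, Dict, List, Protocol, runtime_checkable


@runtime_checkable
class Checkpointable(Protocol):
    """Hypothetical protocol: each node decides what it needs to save and restore."""

    def to_checkpoint_state(self) -> Dict[str, Any]: ...

    def from_checkpoint_state(self, state: Dict[str, Any]) -> None: ...


@dataclass
class LoopingAgent:
    """Toy agent whose only resumable state is its loop counter and scratchpad."""

    loop_index: int = 0
    scratchpad: List[str] = field(default_factory=list)

    def step(self, thought: str) -> None:
        self.scratchpad.append(thought)
        self.loop_index += 1

    def to_checkpoint_state(self) -> Dict[str, Any]:
        # Return only JSON-serializable data so any backend can store it.
        return {"loop_index": self.loop_index, "scratchpad": list(self.scratchpad)}

    def from_checkpoint_state(self, state: Dict[str, Any]) -> None:
        self.loop_index = int(state["loop_index"])
        self.scratchpad = list(state["scratchpad"])
```

Because the protocol is structural, a node only needs to define the two methods; it does not have to inherit from a checkpoint base class, which keeps existing nodes untouched until they opt in.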
| 56 | +--- |
| 57 | + |
| 58 | +## Document Structure |
| 59 | + |
| 60 | +| Part | Document | Description | |
| 61 | +|------|----------|-------------| |
| 62 | +| **0** | [**Review Checklist**](./rfc-001-checkpoint-resume/00-REVIEW-CHECKLIST.md) | **Executive briefing for Architecture Board** | |
| 63 | +| 1 | [Executive Summary](./rfc-001-checkpoint-resume/01-EXECUTIVE-SUMMARY.md) | Problem, solution, stakeholders, success criteria | |
| 64 | +| 2 | [Industry Research](./rfc-001-checkpoint-resume/02-INDUSTRY-RESEARCH.md) | **12 frameworks analyzed**: LangGraph, Temporal, AutoGen, Prefect, Haystack, Bedrock, etc. | |
| 65 | +| 3 | [Runtime Integration](./rfc-001-checkpoint-resume/03-RUNTIME-INTEGRATION.md) | How checkpoints work with HITL, streaming, WebSockets | |
| 66 | +| 4 | [Node Analysis](./rfc-001-checkpoint-resume/04-NODE-ANALYSIS.md) | State requirements for every node type | |
| 67 | +| 5 | [Data Models](./rfc-001-checkpoint-resume/05-DATA-MODELS.md) | Pydantic models, protocols, serialization | |
| 68 | +| 6 | [Storage Backends](./rfc-001-checkpoint-resume/06-STORAGE-BACKENDS.md) | File, SQLite, Redis, PostgreSQL implementations | |
| 69 | +| 7 | [Flow Integration](./rfc-001-checkpoint-resume/07-FLOW-INTEGRATION.md) | Modifications to `Flow.run_sync()` | |
| 70 | +| 8 | [Testing & Migration](./rfc-001-checkpoint-resume/08-TESTING-MIGRATION.md) | Test strategy, backward compatibility, timeline | |
| 71 | +| 9 | [UI & Chat Integration](./rfc-001-checkpoint-resume/09-UI-CHAT-INTEGRATION.md) | Frontend, streaming, **alternative approaches** | |
| 72 | + |
| 73 | +--- |
| 74 | + |
| 75 | +## Alternative: Continue via New Message (Cursor Pattern) |
| 76 | + |
| 77 | +**Important:** For **90% of chat use cases**, explicit checkpoints aren't needed: |
| 78 | + |
| 79 | +| Scenario | Solution | New Endpoint? | |
| 80 | +|----------|----------|---------------| |
| 81 | +| Normal follow-up message | Memory (existing) | ❌ No | |
| 82 | +| HITL input (workflow running) | `run_input_events` (existing) | ❌ No | |
| 83 | +| HITL resume (workflow paused) | Extend existing `POST /threads/{id}/runs` | ❌ No | |
| 84 | +| Crash recovery | Checkpoint resume | ✅ Optional | |
| 85 | + |
| 86 | +**The Cursor/ChatGPT Pattern:** When you send a new message in a chat, the agent continues naturally because Memory provides context from previous runs. Checkpoints are only needed for edge cases like crash recovery and long pauses. |
| 87 | + |
| 88 | +See [Section 10 of UI & Chat Integration](./rfc-001-checkpoint-resume/09-UI-CHAT-INTEGRATION.md) for full details on avoiding new endpoints. |
| 89 | + |
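The scenario table above amounts to a small routing decision. The sketch below encodes it as a lookup; the `Scenario` enum and `continuation_path` function are hypothetical names introduced only for illustration, while the mechanism strings are taken from the table.

```python
from enum import Enum
from typing import Tuple


class Scenario(Enum):
    """Hypothetical labels for the four rows of the scenario table."""

    FOLLOW_UP = "normal follow-up message"
    HITL_RUNNING = "HITL input while the workflow is running"
    HITL_PAUSED = "HITL resume after the workflow paused"
    CRASH = "crash recovery"


# Each entry: (mechanism used, whether a new endpoint is involved).
_ROUTES = {
    Scenario.FOLLOW_UP: ("Memory (existing)", False),
    Scenario.HITL_RUNNING: ("run_input_events (existing)", False),
    Scenario.HITL_PAUSED: ("extend existing POST /threads/{id}/runs", False),
    Scenario.CRASH: ("checkpoint resume", True),  # the table marks this as optional
}


def continuation_path(scenario: Scenario) -> Tuple[str, bool]:
    """Return the mechanism for a scenario and whether it needs a new endpoint."""
    return _ROUTES[scenario]
```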
| 90 | +--- |
| 91 | + |
| 92 | +## Quick Start Example |
| 93 | + |
| 94 | +```python |
| 95 | +from dynamiq import Workflow |
| 96 | +from dynamiq.flows import Flow |
| 97 | +from dynamiq.flows.flow import CheckpointConfig |
| 98 | +from dynamiq.checkpoint.backends import FileCheckpointBackend |
| 99 | + |
| 100 | +# Enable checkpointing (opt-in). agent1 and agent2 are nodes constructed elsewhere. |
| 101 | +flow = Flow( |
| 102 | +    nodes=[agent1, agent2], |
| 103 | +    checkpoint_config=CheckpointConfig( |
| 104 | +        enabled=True, |
| 105 | +        backend=FileCheckpointBackend(".checkpoints"), |
| 106 | +    ), |
| 107 | +) |
| 108 | + |
| 109 | +workflow = Workflow(flow=flow) |
| 110 | + |
| 111 | +# Run with automatic checkpointing |
| 112 | +result = workflow.run(input_data={"query": "Research AI trends"}) |
| 113 | + |
| 114 | +# If the run above failed, resume from the most recent checkpoint |
| 115 | +checkpoint = workflow.get_latest_checkpoint() |
| 116 | +result = workflow.resume(checkpoint_id=checkpoint.id) |
| 117 | +``` |
| 118 | + |
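For readers wondering what a file-based backend involves, the following is a minimal sketch, not the proposed `FileCheckpointBackend`. The class name `JsonFileBackend`, its `save`/`latest` methods, and the one-directory-per-run layout are assumptions made for illustration. It writes each checkpoint to a temporary file and renames it into place so a crash mid-write never leaves a partial checkpoint.

```python
import json
import os
import tempfile
from pathlib import Path
from typing import Optional


class JsonFileBackend:
    """Hypothetical file backend: one directory per run, one JSON file per checkpoint."""

    def __init__(self, directory: str) -> None:
        self.directory = Path(directory)
        self.directory.mkdir(parents=True, exist_ok=True)

    def save(self, run_id: str, state: dict) -> str:
        run_dir = self.directory / run_id
        run_dir.mkdir(parents=True, exist_ok=True)
        # Zero-padded sequence numbers keep lexicographic order equal to save order.
        seq = len(list(run_dir.glob("*.json"))) + 1
        checkpoint_id = f"{run_id}/{seq:06d}"
        target = run_dir / f"{seq:06d}.json"
        tmp = target.with_suffix(".tmp")
        tmp.write_text(json.dumps({"id": checkpoint_id, "state": state}))
        os.replace(tmp, target)  # atomic rename: readers never see a half-written file
        return checkpoint_id

    def latest(self, run_id: str) -> Optional[dict]:
        files = sorted((self.directory / run_id).glob("*.json"))
        return json.loads(files[-1].read_text()) if files else None
```

This sketch assumes a single writer per run; concurrent writers would need locking or unique ids instead of a counted sequence.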
| 119 | +--- |
| 120 | + |
| 121 | +## HITL + Checkpointing: How They Work Together |
| 122 | + |
| 123 | +**Key Insight:** Checkpointing and HITL are **complementary**, not conflicting. |
| 124 | + |
| 125 | +``` |
| 126 | +┌─────────────────────────────────────────────────────────────┐ |
| 127 | +│ Current HITL Flow │ |
| 128 | +├─────────────────────────────────────────────────────────────┤ |
| 129 | +│ 1. Agent tool emits approval event │ |
| 130 | +│ 2. Event persisted to stream_chunks │ |
| 131 | +│ 3. Client receives via WebSocket/SSE │ |
| 132 | +│ 4. User provides input │ |
| 133 | +│ 5. Input saved to run_input_events │ |
| 134 | +│ 6. hitl_input_pump forwards to workflow │ |
| 135 | +│ 7. Workflow continues │ |
| 136 | +└─────────────────────────────────────────────────────────────┘ |
| 137 | + |
| 138 | +┌─────────────────────────────────────────────────────────────┐ |
| 139 | +│ With Checkpointing (Enhancement) │ |
| 140 | +├─────────────────────────────────────────────────────────────┤ |
| 141 | +│ • Checkpoint captures "waiting for input at tool X" │ |
| 142 | +│ • If runtime crashes, restore from checkpoint │ |
| 143 | +│ • Re-emit approval event via streaming │ |
| 144 | +│ • User provides input through SAME mechanism │ |
| 145 | +│ • Workflow continues │ |
| 146 | +└─────────────────────────────────────────────────────────────┘ |
| 147 | +``` |
| 148 | + |
| 149 | +**Scenario: User closes browser, returns later** |
| 150 | +- Without checkpointing: Workflow times out, fails ❌ |
| 151 | +- With checkpointing: Resume from checkpoint, re-emit prompt ✅ |
| 152 | + |
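The enhancement in the second box can be sketched as follows: after a restart, the runtime reads a checkpoint in the `PENDING_INPUT` status, re-emits the approval event, and then waits for the answer on the same input channel as before. The `PendingInputCheckpoint` class, the `resume_pending_input` function, the event shape, and the use of `queue.Queue` as the input channel are all assumptions made for illustration; the actual streaming and `run_input_events` plumbing is not shown.

```python
import queue
from dataclasses import dataclass
from typing import Callable


@dataclass
class PendingInputCheckpoint:
    """Hypothetical record of 'waiting for input at tool X'."""

    status: str      # expected to be "PENDING_INPUT"
    tool_name: str   # the tool that asked for approval
    prompt: str      # the question originally shown to the user


def resume_pending_input(
    checkpoint: PendingInputCheckpoint,
    emit: Callable[[dict], None],
    inputs: "queue.Queue[str]",
    timeout_s: float = 1.0,
) -> str:
    """Re-emit the approval event, then block until the user answers."""
    if checkpoint.status != "PENDING_INPUT":
        raise ValueError(f"cannot resume HITL from status {checkpoint.status!r}")
    # Re-emit so a client that reconnected after the crash sees the prompt again.
    emit({
        "type": "approval_request",
        "tool": checkpoint.tool_name,
        "prompt": checkpoint.prompt,
    })
    # The answer arrives through the same channel used before the crash.
    return inputs.get(timeout=timeout_s)
```

If no input arrives within `timeout_s`, `queue.Empty` is raised; a real runtime would more likely keep the checkpoint and wait for the next reconnect rather than fail.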
| 153 | +--- |
| 154 | + |
| 155 | +## Implementation Timeline |
| 156 | + |
| 157 | +| Phase | Week | Deliverables | |
| 158 | +|-------|------|--------------| |
| 159 | +| **Core** | 1 | Models, file backend, basic Flow integration | |
| 160 | +| **Nodes** | 2 | Agent, Orchestrators, Map checkpoint support | |
| 161 | +| **Backends** | 3 | SQLite, Redis, PostgreSQL implementations | |
| 162 | +| **Runtime** | 4 | API endpoints, database migrations | |
| 163 | +| **Release** | 4 | Documentation, tests, benchmarks | |
| 164 | + |
| 165 | +--- |
| 166 | + |
| 167 | +## Backward Compatibility |
| 168 | + |
| 169 | +| Guarantee | Status | |
| 170 | +|-----------|--------| |
| 171 | +| Existing code works unchanged | ✅ | |
| 172 | +| No required parameters added | ✅ | |
| 173 | +| No breaking API changes | ✅ | |
| 174 | +| Checkpointing is opt-in | ✅ | |
| 175 | + |
| 176 | +--- |
| 177 | + |
| 178 | +## Industry Research Summary |
| 179 | + |
| 180 | +We analyzed **12 frameworks** across AI agents, workflow orchestration, and enterprise platforms: |
| 181 | + |
| 182 | +### Primary Analysis (Code Review) |
| 183 | +| Framework | Checkpoint Model | HITL Support | Key Learning | |
| 184 | +|-----------|------------------|--------------|--------------| |
| 185 | +| **LangGraph** | Channel-based, super-steps | `interrupt()` function | Thread isolation via `thread_id` | |
| 186 | +| **CrewAI** | Decorator-based | `HumanFeedbackPending` exception | Separate pending_feedback table | |
| 187 | +| **Google ADK** | Session events | Event append | Multi-tenant isolation | |
| 188 | +| **Metaflow** | Clone-based resume | N/A | Selective re-execution | |
| 189 | +| **Letta** | Agent serialization | N/A | ID remapping for clean export | |
| 190 | +| **Manus AI** | Context engineering | N/A | TiDB for high write throughput | |
| 191 | + |
| 192 | +### Extended Analysis (Documentation Review) |
| 193 | +| Framework | Checkpoint Model | Key Learning | |
| 194 | +|-----------|------------------|--------------| |
| 195 | +| **Temporal** | Event-sourced replay | Industry gold standard for durable execution | |
| 196 | +| **Microsoft AutoGen** | Superstep boundaries | Automatic checkpointing, state isolation | |
| 197 | +| **Prefect** | `persist_result=True` | Per-task opt-in (validates our approach) | |
| 198 | +| **Haystack** | Breakpoints + snapshots | **Very similar to our design!** | |
| 199 | +| **Amazon Bedrock** | Session Management APIs | KMS encryption, enterprise patterns | |
| 200 | +| **Semantic Kernel** | Stateful steps | Process framework checkpointing | |
| 201 | + |
| 202 | +--- |
| 203 | + |
| 204 | +## Success Criteria |
| 205 | + |
| 206 | +| Metric | Target | |
| 207 | +|--------|--------| |
| 208 | +| Backward compatibility | 100% | |
| 209 | +| Resume accuracy | Completed nodes never re-executed | |
| 210 | +| Checkpoint overhead | < 100 ms per checkpoint (file backend) | |
| 211 | +| HITL resume reliability | 100% | |
| 212 | +| Test coverage | > 90% | |
| 213 | + |
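The checkpoint overhead target could be checked with a small timing harness along these lines. This is only a sketch: the function name `measure_checkpoint_write_ms` is hypothetical, it times a plain JSON write rather than the proposed backend, and it does not force data to disk with `fsync`, so real numbers may be higher.

```python
import json
import statistics
import tempfile
import time
from pathlib import Path


def measure_checkpoint_write_ms(state: dict, directory: Path, repeats: int = 20) -> float:
    """Return the median wall-clock time, in milliseconds, to write one JSON checkpoint."""
    samples = []
    for i in range(repeats):
        target = directory / f"checkpoint-{i}.json"
        start = time.perf_counter()
        target.write_text(json.dumps(state))
        samples.append((time.perf_counter() - start) * 1000.0)
    # The median is less sensitive than the mean to one-off filesystem stalls.
    return statistics.median(samples)
```

A benchmark would call this with a state payload of representative size and compare the result against the target in the table.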
| 214 | +--- |
| 215 | + |
| 216 | +## Review Checklist |
| 217 | + |
| 218 | +- [ ] Architecture Review Board approval |
| 219 | +- [ ] Security review (checkpoint data handling) |
| 220 | +- [ ] Performance benchmarks acceptable |
| 221 | +- [ ] Runtime team sign-off |
| 222 | +- [ ] Documentation complete |
| 223 | + |
| 224 | +--- |
| 225 | + |
| 226 | +## Next Steps |
| 227 | + |
| 228 | +1. **Review** the detailed sub-documents in [`docs/rfc-001-checkpoint-resume/`](./rfc-001-checkpoint-resume/) |
| 229 | +2. **Provide feedback** on specific sections |
| 230 | +3. **Approve** for implementation |
| 231 | +4. **Begin Phase 1** (Core infrastructure) |
| 232 | + |
| 233 | +--- |
| 234 | + |
| 235 | +*Document version: 7.0 | Last updated: January 6, 2026* |