Commit d912a41

vitalii-dynamiq authored and acoola committed
feat: add rfc for checkpoints
1 parent c10a2bb commit d912a41

12 files changed: +8685 -0 lines changed

docs/RFC-001-CHECKPOINT-RESUME.md

Lines changed: 235 additions & 0 deletions
@@ -0,0 +1,235 @@
1+
# RFC-001: Checkpoint/Resume for Dynamiq Workflows
2+
3+
**Status:** Final Draft v7.0
4+
**Created:** January 6, 2026
5+
**Author:** AI Architecture Team
6+
**Target Review:** Architecture Review Board
7+
8+
---
9+
10+
## Overview
11+
12+
This RFC proposes adding checkpoint/resume capabilities to Dynamiq workflows. The full RFC is split into 10 sub-documents (Parts 0 to 9, listed under Document Structure) for detailed coverage.
13+
14+
**📁 Full RFC Location:** [`docs/rfc-001-checkpoint-resume/`](./rfc-001-checkpoint-resume/)
15+
16+
---
17+
18+
## Problem Statement
19+
20+
Dynamiq workflows are currently **stateless**. When a workflow:
21+
- Fails mid-execution
22+
- Requires human input across sessions
23+
- Is interrupted by infrastructure issues
24+
25+
...there is no way to resume from where it left off.
26+
27+
**Impact:**
28+
- Wasted computation (re-executing completed LLM calls)
29+
- Poor UX for human-in-the-loop workflows
30+
- No fault tolerance for long-running agent tasks
31+
32+
---
33+
34+
## Proposed Solution
35+
36+
Add an **opt-in checkpoint/resume capability** that:
37+
38+
1. **Persists workflow state** after each node execution
39+
2. **Supports resumption** from any checkpoint
40+
3. **Integrates with existing runtime** (WebSocket, SSE, HITL)
41+
4. **Handles complex nodes** (Agent loops, Map parallelism, orchestrators)
42+
5. **Maintains 100% backward compatibility**
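
The first two points above (persist after each node, resume from a checkpoint) can be illustrated with a minimal sketch. It is not the proposed Dynamiq implementation: the `Node` alias, `run_with_checkpoints`, and the dict-based `store` are hypothetical stand-ins chosen to keep the example self-contained.

```python
import json
from typing import Callable, Dict, List, Tuple

# Hypothetical stand-ins: a "node" is a (name, function) pair and the
# checkpoint store is a plain dict keyed by run id. These are not Dynamiq APIs.
Node = Tuple[str, Callable[[dict], dict]]


def run_with_checkpoints(
    nodes: List[Node], data: dict, store: Dict[str, str], run_id: str
) -> dict:
    """Run nodes in order, saving state after each one; skip completed nodes on resume."""
    if run_id in store:
        saved = json.loads(store[run_id])  # resume from the last checkpoint
    else:
        saved = {"completed": [], "data": data}
    state = saved["data"]
    for name, fn in nodes:
        if name in saved["completed"]:
            continue  # finished in an earlier attempt, do not re-execute
        state = fn(state)
        saved["completed"].append(name)
        saved["data"] = state
        store[run_id] = json.dumps(saved)  # checkpoint at the node boundary
    return state
```

Calling the function again with the same `run_id` after a failure re-enters the loop, skips every node already recorded as completed, and continues from the first unfinished node.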
43+
44+
---
45+
46+
## Key Design Decisions
47+
48+
| Decision | Source | Rationale |
49+
|----------|--------|-----------|
50+
| Checkpoint at node boundaries | LangGraph | Matches DAG execution model |
51+
| `PENDING_INPUT` status for HITL | CrewAI | Explicit handling of human feedback |
52+
| Clone-based resume | Metaflow | Skip completed nodes efficiently |
53+
| Protocol-based node support | All frameworks | Each node defines its checkpoint logic |
54+
| Pydantic models | Dynamiq pattern | Type safety, consistent serialization |
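
As a rough illustration of the "protocol-based node support" and `PENDING_INPUT` decisions in the table above, the sketch below shows one way a node could expose its own checkpoint logic. Only the `PENDING_INPUT` name comes from this RFC; `CheckpointStatus`, `Checkpointable`, `AgentLoopNode`, and the method names are assumptions. The RFC proposes Pydantic models; plain dataclasses are used here only to keep the example dependency-free.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Any, Dict, List, Protocol, runtime_checkable


class CheckpointStatus(str, Enum):
    # Only PENDING_INPUT is named in the RFC; the other members are assumed.
    ACTIVE = "active"
    PENDING_INPUT = "pending_input"  # paused, waiting for human feedback
    COMPLETED = "completed"


@runtime_checkable
class Checkpointable(Protocol):
    """Hypothetical protocol: each node decides what state it needs to persist."""

    def to_checkpoint_state(self) -> Dict[str, Any]: ...

    def from_checkpoint_state(self, state: Dict[str, Any]) -> None: ...


@dataclass
class AgentLoopNode:
    """Toy node with loop state that must survive a restart."""

    iteration: int = 0
    history: List[str] = field(default_factory=list)

    def to_checkpoint_state(self) -> Dict[str, Any]:
        return {"iteration": self.iteration, "history": list(self.history)}

    def from_checkpoint_state(self, state: Dict[str, Any]) -> None:
        self.iteration = state["iteration"]
        self.history = list(state["history"])
```

Because the protocol is structural, a node only needs the two methods to qualify; no shared base class is required.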
55+
56+
---
57+
58+
## Document Structure
59+
60+
| Part | Document | Description |
61+
|------|----------|-------------|
62+
| **0** | [**Review Checklist**](./rfc-001-checkpoint-resume/00-REVIEW-CHECKLIST.md) | **Executive briefing for Architecture Board** |
63+
| 1 | [Executive Summary](./rfc-001-checkpoint-resume/01-EXECUTIVE-SUMMARY.md) | Problem, solution, stakeholders, success criteria |
64+
| 2 | [Industry Research](./rfc-001-checkpoint-resume/02-INDUSTRY-RESEARCH.md) | **12 frameworks analyzed**: LangGraph, Temporal, AutoGen, Prefect, Haystack, Bedrock, etc. |
65+
| 3 | [Runtime Integration](./rfc-001-checkpoint-resume/03-RUNTIME-INTEGRATION.md) | How checkpoints work with HITL, streaming, WebSockets |
66+
| 4 | [Node Analysis](./rfc-001-checkpoint-resume/04-NODE-ANALYSIS.md) | State requirements for every node type |
67+
| 5 | [Data Models](./rfc-001-checkpoint-resume/05-DATA-MODELS.md) | Pydantic models, protocols, serialization |
68+
| 6 | [Storage Backends](./rfc-001-checkpoint-resume/06-STORAGE-BACKENDS.md) | File, SQLite, Redis, PostgreSQL implementations |
69+
| 7 | [Flow Integration](./rfc-001-checkpoint-resume/07-FLOW-INTEGRATION.md) | Modifications to `Flow.run_sync()` |
70+
| 8 | [Testing & Migration](./rfc-001-checkpoint-resume/08-TESTING-MIGRATION.md) | Test strategy, backward compatibility, timeline |
71+
| 9 | [UI & Chat Integration](./rfc-001-checkpoint-resume/09-UI-CHAT-INTEGRATION.md) | Frontend, streaming, **alternative approaches** |
72+
73+
---
74+
75+
## Alternative: Continue via New Message (Cursor Pattern)
76+
77+
**Important:** For **90% of chat use cases**, explicit checkpoints aren't needed:
78+
79+
| Scenario | Solution | New Endpoint? |
80+
|----------|----------|---------------|
81+
| Normal follow-up message | Memory (existing) | ❌ No |
82+
| HITL input (workflow running) | `run_input_events` (existing) | ❌ No |
83+
| HITL resume (workflow paused) | Extend existing `POST /threads/{id}/runs` | ❌ No |
84+
| Crash recovery | Checkpoint resume | ✅ Optional |
85+
86+
**The Cursor/ChatGPT Pattern:** When you send a new message in a chat, the agent continues naturally because Memory provides context from previous runs. Checkpoints are only needed for edge cases like crash recovery and long pauses.
87+
88+
See [Section 10 of UI & Chat Integration](./rfc-001-checkpoint-resume/09-UI-CHAT-INTEGRATION.md) for full details on avoiding new endpoints.
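
The scenario table in this section can be read as a small decision function. The sketch below is illustrative only: `Mechanism`, `choose_mechanism`, and the `run_state` strings are hypothetical names, not part of the runtime API.

```python
from enum import Enum


class Mechanism(Enum):
    MEMORY = "memory"
    RUN_INPUT_EVENTS = "run_input_events"
    EXTEND_RUNS_ENDPOINT = "POST /threads/{id}/runs"
    CHECKPOINT_RESUME = "checkpoint resume"


def choose_mechanism(run_state: str, has_pending_hitl: bool) -> Mechanism:
    """Map a scenario from the table to the mechanism that handles it.

    run_state is assumed to be one of "running", "paused", "crashed", "finished".
    """
    if run_state == "crashed":
        return Mechanism.CHECKPOINT_RESUME  # the only case needing checkpoints
    if has_pending_hitl and run_state == "running":
        return Mechanism.RUN_INPUT_EVENTS  # workflow is live, forward the input
    if has_pending_hitl and run_state == "paused":
        return Mechanism.EXTEND_RUNS_ENDPOINT  # reuse the existing runs endpoint
    return Mechanism.MEMORY  # normal follow-up message
```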
89+
90+
---
91+
92+
## Quick Start Example
93+
94+
```python
95+
from dynamiq import Workflow
96+
from dynamiq.flows import Flow
97+
from dynamiq.flows.flow import CheckpointConfig
98+
from dynamiq.checkpoint.backends import FileCheckpointBackend
99+
100+
# Enable checkpointing (opt-in)
101+
flow = Flow(
102+
    nodes=[agent1, agent2],  # agent1, agent2: nodes constructed elsewhere (not shown)
103+
    checkpoint_config=CheckpointConfig(
104+
        enabled=True,
105+
        backend=FileCheckpointBackend(".checkpoints"),
106+
    ),
107+
)
108+
109+
workflow = Workflow(flow=flow)
110+
111+
# Run with automatic checkpointing
112+
result = workflow.run(input_data={"query": "Research AI trends"})
113+
114+
# If a run fails partway, resume from the latest checkpoint instead of re-running
115+
checkpoint = workflow.get_latest_checkpoint()
116+
result = workflow.resume(checkpoint_id=checkpoint.id)
117+
```
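
For intuition about what a file-based backend might do internally, here is a minimal sketch. `SimpleFileBackend` and its `save`/`load` methods are hypothetical and are not the proposed `FileCheckpointBackend` interface; the JSON-file-per-checkpoint layout is also an assumption.

```python
import json
import os
import tempfile
from pathlib import Path
from typing import Optional


class SimpleFileBackend:
    """Hypothetical stand-in that stores one JSON file per checkpoint."""

    def __init__(self, directory: str) -> None:
        self.directory = Path(directory)
        self.directory.mkdir(parents=True, exist_ok=True)

    def save(self, checkpoint_id: str, state: dict) -> None:
        # Write to a temporary file, then rename it into place, so a reader
        # never observes a half-written checkpoint file.
        fd, tmp_path = tempfile.mkstemp(dir=self.directory, suffix=".tmp")
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            json.dump(state, handle)
        os.replace(tmp_path, self.directory / f"{checkpoint_id}.json")

    def load(self, checkpoint_id: str) -> Optional[dict]:
        path = self.directory / f"{checkpoint_id}.json"
        if not path.exists():
            return None
        return json.loads(path.read_text(encoding="utf-8"))
```

The write-then-rename step matters for crash recovery: if the process dies mid-write, the previous checkpoint file is still intact.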
118+
119+
---
120+
121+
## HITL + Checkpointing: How They Work Together
122+
123+
**Key Insight:** Checkpointing and HITL are **complementary**, not conflicting.
124+
125+
```
126+
┌─────────────────────────────────────────────────────────────┐
127+
│ Current HITL Flow │
128+
├─────────────────────────────────────────────────────────────┤
129+
│ 1. Agent tool emits approval event │
130+
│ 2. Event persisted to stream_chunks │
131+
│ 3. Client receives via WebSocket/SSE │
132+
│ 4. User provides input │
133+
│ 5. Input saved to run_input_events │
134+
│ 6. hitl_input_pump forwards to workflow │
135+
│ 7. Workflow continues │
136+
└─────────────────────────────────────────────────────────────┘
137+
138+
┌─────────────────────────────────────────────────────────────┐
139+
│ With Checkpointing (Enhancement) │
140+
├─────────────────────────────────────────────────────────────┤
141+
│ • Checkpoint captures "waiting for input at tool X" │
142+
│ • If runtime crashes, restore from checkpoint │
143+
│ • Re-emit approval event via streaming │
144+
│ • User provides input through SAME mechanism │
145+
│ • Workflow continues │
146+
└─────────────────────────────────────────────────────────────┘
147+
```
148+
149+
**Scenario: User closes browser, returns later**
150+
- Without checkpointing: Workflow times out, fails ❌
151+
- With checkpointing: Resume from checkpoint, re-emit prompt ✅
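
The "restore, then re-emit the approval event" step can be sketched as follows. All names here (`PendingInput`, `Checkpoint`, `restore_after_crash`, the event dictionary shape) are assumptions made for illustration; the actual event schema is defined by the runtime, not by this example.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class PendingInput:
    tool_name: str  # the tool that was waiting for approval
    prompt: str     # the question originally shown to the user


@dataclass
class Checkpoint:
    status: str  # "pending_input" while waiting on a human
    pending: Optional[PendingInput] = None


def restore_after_crash(checkpoint: Checkpoint, emit: Callable[[dict], None]) -> bool:
    """Re-emit the approval prompt if the run was waiting on a human.

    Returns True when a prompt was re-emitted, False otherwise.
    """
    if checkpoint.status == "pending_input" and checkpoint.pending is not None:
        emit(
            {
                "type": "approval_request",
                "tool": checkpoint.pending.tool_name,
                "prompt": checkpoint.pending.prompt,
            }
        )
        return True
    return False
```

After the prompt is re-emitted, the user's answer would travel through the same input path as before the crash, so the client needs no special resume logic.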
152+
153+
---
154+
155+
## Implementation Timeline
156+
157+
| Phase | Week | Deliverables |
158+
|-------|------|--------------|
159+
| **Core** | 1 | Models, file backend, basic Flow integration |
160+
| **Nodes** | 2 | Agent, Orchestrators, Map checkpoint support |
161+
| **Backends** | 3 | SQLite, Redis, PostgreSQL implementations |
162+
| **Runtime** | 4 | API endpoints, database migrations |
163+
| **Release** | 4 | Documentation, tests, benchmarks |
164+
165+
---
166+
167+
## Backward Compatibility
168+
169+
| Guarantee | Status |
170+
|-----------|--------|
171+
| Existing code works unchanged | ✅ |
172+
| No required parameters added | ✅ |
173+
| No breaking API changes | ✅ |
174+
| Checkpointing is opt-in | ✅ |
175+
176+
---
177+
178+
## Industry Research Summary
179+
180+
We analyzed **12 frameworks** across AI agents, workflow orchestration, and enterprise platforms:
181+
182+
### Primary Analysis (Code Review)
183+
| Framework | Checkpoint Model | HITL Support | Key Learning |
184+
|-----------|------------------|--------------|--------------|
185+
| **LangGraph** | Channel-based, super-steps | `interrupt()` function | Thread isolation via `thread_id` |
186+
| **CrewAI** | Decorator-based | `HumanFeedbackPending` exception | Separate pending_feedback table |
187+
| **Google ADK** | Session events | Event append | Multi-tenant isolation |
188+
| **Metaflow** | Clone-based resume | N/A | Selective re-execution |
189+
| **Letta** | Agent serialization | N/A | ID remapping for clean export |
190+
| **Manus AI** | Context engineering | N/A | TiDB for high write throughput |
191+
192+
### Extended Analysis (Documentation Review)
193+
| Framework | Checkpoint Model | Key Learning |
194+
|-----------|------------------|--------------|
195+
| **Temporal** | Event-sourced replay | Industry gold standard for durable execution |
196+
| **Microsoft AutoGen** | Superstep boundaries | Automatic checkpointing, state isolation |
197+
| **Prefect** | `persist_result=True` | Per-task opt-in (validates our approach) |
198+
| **Haystack** | Breakpoints + snapshots | **Very similar to our design!** |
199+
| **Amazon Bedrock** | Session Management APIs | KMS encryption, enterprise patterns |
200+
| **Semantic Kernel** | Stateful steps | Process framework checkpointing |
201+
202+
---
203+
204+
## Success Criteria
205+
206+
| Metric | Target |
207+
|--------|--------|
208+
| Backward compatibility | 100% |
209+
| Resume accuracy | Completed nodes never re-executed |
210+
| Checkpoint overhead | < 100ms (file backend) |
211+
| HITL resume reliability | 100% |
212+
| Test coverage | > 90% |
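
One way the checkpoint overhead target could be checked is a small timing harness like the one below. It assumes "overhead" means the time to serialize and write a single checkpoint as JSON, which this summary does not define; `median_write_ms` is a hypothetical helper.

```python
import json
import statistics
import tempfile
import time
from pathlib import Path


def median_write_ms(state: dict, runs: int = 20) -> float:
    """Return the median time in milliseconds to serialize and write `state`."""
    samples = []
    with tempfile.TemporaryDirectory() as directory:
        path = Path(directory) / "checkpoint.json"
        for _ in range(runs):
            start = time.perf_counter()
            path.write_text(json.dumps(state), encoding="utf-8")
            samples.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(samples)
```

Results depend heavily on state size and disk, so a benchmark for the review checklist would need representative workflow states rather than a toy payload.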
213+
214+
---
215+
216+
## Review Checklist
217+
218+
- [ ] Architecture Review Board approval
219+
- [ ] Security review (checkpoint data handling)
220+
- [ ] Performance benchmarks acceptable
221+
- [ ] Runtime team sign-off
222+
- [ ] Documentation complete
223+
224+
---
225+
226+
## Next Steps
227+
228+
1. **Review** the detailed sub-documents in [`docs/rfc-001-checkpoint-resume/`](./rfc-001-checkpoint-resume/)
229+
2. **Provide feedback** on specific sections
230+
3. **Approve** for implementation
231+
4. **Begin Phase 1** (Core infrastructure)
232+
233+
---
234+
235+
*Document version: 5.0 | Last updated: January 6, 2026*
