-
Notifications
You must be signed in to change notification settings - Fork 167
Propose low-level snapshot api #497
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Changes from 1 commit
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,135 @@ | ||
| # Design Doc: Low-Level Snapshot API | ||
|
|
||
| **Status**: Proposed | ||
|
|
||
| **Date**: 2026-01-28 | ||
|
|
||
| **Issue**: https://github.com/strands-agents/sdk-python/issues/1138 | ||
|
|
||
| ## Context | ||
|
|
||
| Developers need a way to preserve and restore the exact state of an agent at a specific point in time. The existing SessionManagement doesn't address this: | ||
|
|
||
| - SessionManager works in the background, incrementally recording messages rather than full state. This means it's not possible to restore to arbitrary points in time. | ||
| - After a message is saved, there is no way to modify it and have it recorded in session-management, preventing more advance context-management strategies while being able to pause & restore agents. | ||
| - There is no way to proactively trigger session-management (e.g., after modifying `agent.messages` or `agent.state` directly) | ||
|
|
||
| ## Decision | ||
|
|
||
| Add a low-level, explicit snapshot API as an alternative to automatic session-management. This enables preserving the exact state of an agent at a specific point and restoring it later — useful for evaluation frameworks, custom session management, and checkpoint/restore workflows. | ||
|
|
||
| ### API Changes | ||
|
|
||
| ```python | ||
| class Snapshot: | ||
|
||
| type: str # the type of data stored (e.g., "agent") | ||
|
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. i'd want to see a timestamp of snapshot so we can go back in time.
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. +1, Probably it can be in the sate / metadata. I notice that otel trace / span properties are good to have in many cases. I wonder if some of them can be added into state or metadata.
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Adding SHA is also crucial here ^^ |
||
| state: dict[str, Any] # opaque; do not modify — format subject to change | ||
| metadata: dict # user-provided data to be stored with the snapshot | ||
|
|
||
| class Agent: | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Do we want specific agent methods, or Having an interface like below, would make implementation a lot easier. And it would allow us to extend to other types (multi-agent, etc) This is similar to @JackYPCOnline 's idea on |
||
| def save_snapshot(self, metadata: dict | None = None) -> Snapshot: | ||
| """Capture the current agent state as a snapshot.""" | ||
| ... | ||
|
|
||
| def load_snapshot(self, snapshot: Snapshot) -> None: | ||
| """Restore agent state from a snapshot.""" | ||
| ... | ||
|
Comment on lines
+43
to
+49
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Great idea! I would consider moving these methods to their own class and inject it in the agent class
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I am picky on naming, what it actually does is |
||
| ``` | ||
|
|
||
| ### Behavior | ||
|
|
||
| Snapshots capture **agent state** (data), not **runtime behavior** (code): | ||
|
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I agree with this, it is not easy to capture and persist code, and I dont think strands should try to do this. However, we should explore how one would restore an agent from a snapshot, and load lets say tools back into the agent after persisting it. I would like to see an example devex of what this looks like.
Member
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I view the tool state as a feature that we'd be adding to the agent to make "enabled" tools into a state on the agent. So, if we had that I imagine it would be something like: agent = Agent(tools=[tool1, tool2, tool3, tool4], enabled_tools=["tool1"])Where only agent.enabled_tools = ["tool1", "tool3"]and for restoring an agent with specific tools, it would be the same as agent2 = Agent(tools=[tool3, tool4])
agent2.load_snapshot(snapshot)and the snapshot would be restoring the
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. +1 the term "snapshot" makes me think of a disk snapshot - literally everything. I would like to see this incorporate tools etc in the future. |
||
|
|
||
| - **Agent State** — Data persisted as part of session-management: conversation messages, context, and other JSON-serializable data. This is what snapshots save and restore. | ||
| - **Runtime Behavior** — Configuration that defines how the agent operates: model, tools, ConversationManager, etc. These are *not* included in snapshots and must be set separately when creating or restoring an agent. | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Do we allow these components to expose "snapshot-able data"? e.g. I am a conv manager developer, I want my data to be restored with snapshots What's the recommendation? Keeping that data in agent state?
Member
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
Yeah; the recommendation is agent state
It should be Agent State (AgentState directly; or if we're missing something, an equivalent thereof). The idea that I'm trying to get across in this section is "Snapshots do not represent anything other than what already exists in agent state/session-management, it just provides a more direct api to control it".
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Are we saying that we don't believe these things should be part of a snapshot or are we just saying that we are not trying to expand the scoep by limiting to the current capabilities of Session Management. For example I could see the following being important
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. So does that mean a configuration like |
||
|
|
||
| The intent is that anything stored or restored by session-management would be stored in a snapshot — so this proposal is *not* documenting or changing what is persisted, but rather providing an explicit way to do what session-management does automatically. | ||
|
|
||
| ### Contract | ||
|
|
||
| - **`metadata`** — Caller-owned. Strands does not read, modify, or manage this field. Use it to store checkpoint labels, timestamps, or any application-specific data. | ||
|
||
| - **`type` and `state`** — Strands-owned. These fields are managed internally and should be treated as opaque. The format of `state` is subject to change; do not modify or depend on its structure. | ||
| - **Serialization** — Strands guarantees that `type` and `state` will only contain JSON-serializable values. | ||
|
|
||
| ### Future Concerns | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Good to have: We can enable a traces for snapshot actions |
||
|
|
||
| - Snapshotting for MultiAgent constructs: This proposal would | ||
|
||
| - Providing a storage API for snapshot CRUD operations (save to disk, database, etc.) | ||
| - Providing APIs to customize serialization formats | ||
|
|
||
| ## Developer Experience | ||
|
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Do we want to allow users to snapshot in hooks on life cycle events? |
||
|
|
||
| ### Evaluations via Rewind and Replay | ||
|
|
||
| ```python | ||
| agent = Agent(tools=[tool1, tool2]) | ||
| snapshot = agent.save_snapshot() | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. QQ: what happens if an user calls |
||
|
|
||
| result1 = agent("What is the weather?") | ||
|
|
||
| agent2 = Agent(tools=[tool3, tool4]) | ||
| agent2.load_snapshot(snapshot) | ||
|
Comment on lines
+85
to
+86
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. We should consider allowing for passing in the snapshot in the Agent init (as well)
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. if this is the intended flow where we are creating a new instance, did you consider pros/cons of acting on the constructor?
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Would we want to give customers the option to override specific data in a snapshot? So keep most things the same but try tweaking one value to see how the agent behaves. |
||
| result2 = agent2("What is the weather?") | ||
| # Compare result1 and result2 | ||
|
||
| ``` | ||
|
|
||
| ### Advanced Context Management | ||
|
|
||
| ```python | ||
| agent = Agent(conversation_manager=CompactingConversationManager()) | ||
| snapshot = agent.save_snapshot(metadata={"checkpoint": "before_long_task"}) | ||
|
|
||
| # ... later ... | ||
| later_agent = Agent(conversation_manager=CompactingConversationManager()) | ||
| later_agent.load_snapshot(snapshot) | ||
| ``` | ||
|
|
||
| ### Persisting Snapshots | ||
|
|
||
| ```python | ||
| import json | ||
| from dataclasses import asdict | ||
|
|
||
| agent = Agent(tools=[tool1, tool2]) | ||
| agent("Remember that my favorite color is orange.") | ||
|
|
||
| # Save to file | ||
| snapshot = agent.save_snapshot(metadata={"user_id": "123"}) | ||
| with open("checkpoint.json", "w") as f: | ||
|
||
| json.dump(asdict(snapshot), f) | ||
|
|
||
| # Later, restore from file | ||
| with open("checkpoint.json", "r") as f: | ||
| data = json.load(f) | ||
| snapshot = Snapshot(**data) | ||
|
|
||
| agent = Agent(tools=[tool1, tool2]) | ||
| agent.load_snapshot(snapshot) | ||
| agent("What is my favorite color?") # "Your favorite color is orange." | ||
| ``` | ||
|
|
||
| ### Edge cases | ||
|
|
||
| Restoring runtime behavior (e.g., tools) is explicitly not supported: | ||
|
|
||
| ```python | ||
| agent1 = Agent(tools=[tool1, tool2]) | ||
| snapshot = agent1.save_snapshot() | ||
| agent_no = Agent(snapshot) # tools are NOT restored | ||
| ``` | ||
|
|
||
| ## Consequences | ||
|
|
||
| **Easier:** | ||
| - Building evaluation frameworks with rewind/replay capabilities | ||
|
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. seems like timestamps are missing for rewind ^^ |
||
| - Implementing custom session management strategies | ||
| - Creating checkpoints during long-running agent tasks | ||
| - Cloning agents (load the same snapshot into multiple agent instances) | ||
| - Resetting agents to a known state (we do this manually for Graphs) | ||
|
|
||
| **More difficult:** | ||
| - N/A — this is an additive API | ||
|
|
||
| ## Willingness to Implement | ||
|
|
||
| Yes | ||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we have a use case for wanting to modify a past message?