feat(kernel_checkpoint): POC for kernel-state snapshot/restore#105
feat(kernel_checkpoint): POC for kernel-state snapshot/restore#105jamesbhobbs wants to merge 1 commit into
Conversation
Today the dominant cold-start cost for projects with a heavy init notebook
is re-running init on every fresh container boot. This POC ships the data
plane for snapshotting the kernel's post-init globals to a pluggable store
and restoring them on the next cold boot, skipping the init re-run.
# What's in the box
deepnote_toolkit/kernel_checkpoint/
__init__.py — public surface + invariant docstring
store.py — SnapshotStore protocol + LocalDiskSnapshotStore impl.
S3-backed prod impl will be a sibling later.
key.py — compute_checkpoint_key(project_id, init_source_hash,
environment_id). Human-readable so debugging stale-
restore incidents is one `cat` away.
checkpoint.py — save_checkpoint / try_restore_checkpoint. Best-effort:
unpicklable values are logged-and-skipped; modules and
dunders / ipython artifacts are skipped by rule.
__main__.py — CLI for end-to-end testing without touching prod code.
# Correctness invariant
A restore is correct iff the restored namespace produces the same downstream
execution behaviour as freshly re-running init in the current environment.
The checkpoint key includes init source hash + environment id so source/env
changes always invalidate the cache.
# Verified end-to-end
$ python -m deepnote_toolkit.kernel_checkpoint --root /tmp/poc-snap \
save "proj:alpha:init:abc:env:py311" /tmp/poc-init.py
[checkpoint] saved 4 globals (505 bytes)
$ python -m deepnote_toolkit.kernel_checkpoint --root /tmp/poc-snap \
restore "proj:alpha:init:abc:env:py311"
[checkpoint] restored 4 globals (505 bytes)
$ python -m deepnote_toolkit.kernel_checkpoint --root /tmp/poc-snap \
restore "proj:alpha:init:DIFFERENT:env:py311"
[checkpoint] no snapshot found
# Out of scope for this PR (productionisation TODO)
- S3SnapshotStore — swap LocalDiskSnapshotStore in prod
- Wiring into runtime/executor.py init lifecycle
- Webapp signaling of (init_source_hash, environment_id)
- Filesystem state in the cache key (init reading a file that has
since changed would currently produce a stale restore)
- Size cap and S3 multipart upload story for large dataframes
21/21 unit tests pass including pandas DataFrame round-trip.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
📝 WalkthroughWalkthroughThis PR adds a kernel checkpoint system to snapshot and restore Python globals post-kernel initialization. The feature reduces cold-start latency by persisting the initialized namespace. Storage uses a protocol abstraction with local-disk implementation via atomic temp-file writes. Checkpoints are keyed by project ID, initialization source hash, and environment ID. Save logic applies skip rules (modules, dunders, IPython artifacts) and records unpicklable values without aborting. Restore loads snapshots via dill. A CLI provides save/restore commands. Comprehensive tests validate roundtrip behavior, skip rules, key composition, and storage semantics. Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes 🚥 Pre-merge checks | ✅ 4 | ❌ 2❌ Failed checks (2 warnings)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. Comment |
|
📦 Python package built successfully!
|
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #105 +/- ##
==========================================
- Coverage 74.32% 73.75% -0.57%
==========================================
Files 94 100 +6
Lines 5534 5826 +292
Branches 824 860 +36
==========================================
+ Hits 4113 4297 +184
- Misses 1155 1253 +98
- Partials 266 276 +10
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. |
There was a problem hiding this comment.
Actionable comments posted: 5
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@deepnote_toolkit/kernel_checkpoint/__main__.py`:
- Line 35: The CLI currently calls the private method store._path_for(...) in
deepnote_toolkit/kernel_checkpoint/__main__.py to build the message; instead
expose a public API (either make store.write return the final path or add a
public method like store.path_for/get_path) and update the CLI to use that
public result instead of _path_for; ensure the change returns the same path used
when writing so the print line (which references report.saved_names and
args.key) can display the path without touching private store internals.
- Line 65: The function signature for main uses the union syntax "list[str] |
None" which is inconsistent with the module's preferred use of typing.Optional;
update the main function signature to use Optional[list[str]] instead, add "from
typing import Optional" to the imports if missing, and ensure any other type
hints in this module follow the same Optional[...] convention for consistency
with store.py and checkpoint.py.
In `@deepnote_toolkit/kernel_checkpoint/checkpoint.py`:
- Around line 123-127: The code currently calls dill.loads(payload) on untrusted
data read via store.read(key); instead add integrity/authentication verification
before deserialization: expect an authenticated envelope (e.g., AEAD ciphertext
or a signed payload with signature field) and verify it using your project
crypto utilities (or add HMAC/verify using a configured secret/public key)
before calling dill.loads; if verification fails, return None or raise and do
not call dill.loads. Update checkpoint.restore logic around payload, dill.loads,
and RestoreReport to parse/verify the envelope, decrypt/unwrap or reject invalid
payloads, and only then set restored and construct RestoreReport.
In `@deepnote_toolkit/kernel_checkpoint/store.py`:
- Around line 43-46: The _path_for method currently maps keys to filenames with
simple replace, which is non-injective and allows path-traversal; replace it by
producing a deterministic, filesystem-safe filename derived from a strong hash
of the key (e.g., sha256 hex) and optionally a short sanitized/truncated
human-readable suffix, ensure the filename contains no path separators or "..",
join with self._root, then resolve and assert the resolved path is inside
self._root (use Path.resolve() and Path.is_relative_to or compare parents)
before returning to prevent escapes; update references to _path_for accordingly.
In `@tests/unit/test_kernel_checkpoint.py`:
- Around line 184-189: The tests call compute_checkpoint_key with empty inputs
but currently use broad pytest.raises(ValueError) which can catch unrelated
errors; update each assertion to use pytest.raises(ValueError, match="...") with
a specific expected message fragment (e.g., "project_id", "init_source_hash",
and "environment_id") so each raise asserts the correct validation path in
compute_checkpoint_key.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: 0ce517a1-1400-413b-a2f7-499e959fbbdf
📒 Files selected for processing (6)
deepnote_toolkit/kernel_checkpoint/__init__.pydeepnote_toolkit/kernel_checkpoint/__main__.pydeepnote_toolkit/kernel_checkpoint/checkpoint.pydeepnote_toolkit/kernel_checkpoint/key.pydeepnote_toolkit/kernel_checkpoint/store.pytests/unit/test_kernel_checkpoint.py
| exec(compile(source, args.python_file, "exec"), namespace) | ||
| store = LocalDiskSnapshotStore(args.root) | ||
| report = save_checkpoint(namespace, store, args.key) | ||
| print(f"[checkpoint] saved {len(report.saved_names)} globals to {store._path_for(args.key)}") |
There was a problem hiding this comment.
🛠️ Refactor suggestion | 🟠 Major | ⚡ Quick win
Avoid calling a private store method from CLI.
Line 35 reaches into store._path_for(...). Expose a public helper (or return path info from write) so CLI does not depend on private internals.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@deepnote_toolkit/kernel_checkpoint/__main__.py` at line 35, The CLI currently
calls the private method store._path_for(...) in
deepnote_toolkit/kernel_checkpoint/__main__.py to build the message; instead
expose a public API (either make store.write return the final path or add a
public method like store.path_for/get_path) and update the CLI to use that
public result instead of _path_for; ensure the change returns the same path used
when writing so the print line (which references report.saved_names and
args.key) can display the path without touching private store internals.
| return 0 | ||
|
|
||
|
|
||
| def main(argv: list[str] | None = None) -> int: |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
set -euo pipefail
repo_root="."
file="deepnote_toolkit/kernel_checkpoint/__main__.py"
echo "=== File exists? ==="
ls -l "$file" || true
echo "=== Showing around line 55-75 with numbers ==="
nl -ba "$file" | sed -n '50,85p'Repository: deepnote/deepnote-toolkit
Length of output: 295
🏁 Script executed:
#!/bin/bash
set -euo pipefail
file="deepnote_toolkit/kernel_checkpoint/__main__.py"
echo "=== File exists? ==="
ls -l "$file" || true
echo "=== Showing around line 55-75 with numbers ==="
nl -ba "$file" | sed -n '50,85p'
echo "=== Search for 'Optional' usage in that file ==="
rg -n "Optional\[" "$file" || true
echo "=== Search for other PEP604 unions ' | None' in that file ==="
rg -n "\|\s*None" "$file" || trueRepository: deepnote/deepnote-toolkit
Length of output: 294
🏁 Script executed:
#!/bin/bash
set -euo pipefail
file="deepnote_toolkit/kernel_checkpoint/__main__.py"
echo "=== File exists? ==="
ls -l "$file" || true
echo "=== Showing around line 55-75 with numbers ==="
nl -ba "$file" | sed -n '50,85p'
echo "=== Imports in that file (top 40 lines) ==="
nl -ba "$file" | sed -n '1,45p'
echo "=== Search for 'Optional[' usage in that file ==="
rg -n "Optional\[" "$file" || true
echo "=== Search for PEP604 unions '| None' in that file ==="
rg -n "\|\s*None" "$file" || trueRepository: deepnote/deepnote-toolkit
Length of output: 294
🏁 Script executed:
#!/bin/bash
set -euo pipefail
file="deepnote_toolkit/kernel_checkpoint/__main__.py"
echo "=== File exists? ==="
ls -l "$file" || true
echo "=== Showing around line 55-75 with numbers ==="
nl -ba "$file" | sed -n '55,75p'
echo "=== Imports in that file (top 60 lines) ==="
nl -ba "$file" | sed -n '1,60p'
echo "=== Search for typing.Optional usage in repo (small scope) ==="
rg -n "from typing import Optional|Optional\[" deepnote_toolkit | head -n 20
echo "=== Search for PEP604 '| None' usage in same module area (deepnote_toolkit/kernel_checkpoint) ==="
rg -n "\|\s*None" deepnote_toolkit/kernel_checkpoint | head -n 20Repository: deepnote/deepnote-toolkit
Length of output: 294
🏁 Script executed:
#!/bin/bash
set -euo pipefail
file="deepnote_toolkit/kernel_checkpoint/__main__.py"
echo "=== Line 1-120 with numbers (awk) ==="
awk 'NR>=1 && NR<=120 {printf "%d\t%s\n", NR, $0}' "$file"
echo "=== Exact main() signature line containing 'def main' ==="
rg -n "def main\(" "$file" || true
echo "=== Search for 'Optional[' usage in that file ==="
rg -n "Optional\[" "$file" || true
echo "=== Search for PEP604 '| None' usage in that file ==="
rg -n "\|\s*None" "$file" || true
echo "=== Imports in that file ==="
rg -n "^(from typing import|import typing|from __future__ import|from typing_extensions import)" "$file" || trueRepository: deepnote/deepnote-toolkit
Length of output: 3797
🏁 Script executed:
#!/bin/bash
set -euo pipefail
rg -n "Optional\[" deepnote_toolkit | head -n 20
echo "----"
rg -n "\|\s*None" deepnote_toolkit | head -n 20
echo "----"
echo "Counts:"
python3 - <<'PY'
import re, subprocess, sys
def count(pattern):
p=subprocess.run(["bash","-lc",f"rg -n {pattern!r} deepnote_toolkit | wc -l"],capture_output=True,text=True)
return int(p.stdout.strip() or "0")
# basic counts
opt=count("Optional\\[")
union=count("\\|\\s*None")
print("Optional[ count:",opt)
print("| None count:",union)
PYRepository: deepnote/deepnote-toolkit
Length of output: 2441
Use Optional[list[str]] for the nullable argv param (or align the whole module)
deepnote_toolkit/kernel_checkpoint/__main__.py:65 uses list[str] | None; this kernel_checkpoint package already uses PEP 604 (... | None) in store.py/checkpoint.py, so either switch to Optional[...] consistently there (incl. this function) or allow | None as the repo convention.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@deepnote_toolkit/kernel_checkpoint/__main__.py` at line 65, The function
signature for main uses the union syntax "list[str] | None" which is
inconsistent with the module's preferred use of typing.Optional; update the main
function signature to use Optional[list[str]] instead, add "from typing import
Optional" to the imports if missing, and ensure any other type hints in this
module follow the same Optional[...] convention for consistency with store.py
and checkpoint.py.
| payload = store.read(key) | ||
| if payload is None: | ||
| return None | ||
| restored: dict[str, Any] = dill.loads(payload) | ||
| return restored, RestoreReport(restored_names=list(restored.keys()), bytes_read=len(payload)) |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Verify whether checkpoint payload integrity/authenticity is implemented anywhere.
rg -n -C3 'dill\.loads|dill\.dumps|hmac|signature|sign|verify|sha256|blake2|cryptography|fernet' deepnote_toolkitRepository: deepnote/deepnote-toolkit
Length of output: 15277
Authenticate checkpoint payloads before dill.loads — deepnote_toolkit/kernel_checkpoint/checkpoint.py (lines 123-127) restores state via dill.loads(payload) directly from store.read(key). If the stored snapshot can be modified, this enables arbitrary code execution. Add integrity/authentication (e.g., signed payloads or an AEAD envelope) and fail closed before deserialization.
🧰 Tools
🪛 Ruff (0.15.14)
[error] 126-126: pickle and modules that wrap it can be unsafe when used to deserialize untrusted data, possible security issue
(S301)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@deepnote_toolkit/kernel_checkpoint/checkpoint.py` around lines 123 - 127, The
code currently calls dill.loads(payload) on untrusted data read via
store.read(key); instead add integrity/authentication verification before
deserialization: expect an authenticated envelope (e.g., AEAD ciphertext or a
signed payload with signature field) and verify it using your project crypto
utilities (or add HMAC/verify using a configured secret/public key) before
calling dill.loads; if verification fails, return None or raise and do not call
dill.loads. Update checkpoint.restore logic around payload, dill.loads, and
RestoreReport to parse/verify the envelope, decrypt/unwrap or reject invalid
payloads, and only then set restored and construct RestoreReport.
| def _path_for(self, key: str) -> Path: | ||
| # Forward slashes and colons would create unwanted directories; flatten. | ||
| safe = key.replace("/", "__").replace(":", "_") | ||
| return self._root / safe |
There was a problem hiding this comment.
Unsafe key-to-path mapping can collide and traverse.
Line 45’s replacement scheme is non-injective ("a/b" and "a__b" collide) and still allows ".." as a path component. That can overwrite the wrong snapshot or write outside root.
Suggested fix
+import hashlib
+
def _path_for(self, key: str) -> Path:
- # Forward slashes and colons would create unwanted directories; flatten.
- safe = key.replace("/", "__").replace(":", "_")
- return self._root / safe
+ digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
+ return self._root / f"{digest}.snapshot"📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| def _path_for(self, key: str) -> Path: | |
| # Forward slashes and colons would create unwanted directories; flatten. | |
| safe = key.replace("/", "__").replace(":", "_") | |
| return self._root / safe | |
| def _path_for(self, key: str) -> Path: | |
| digest = hashlib.sha256(key.encode("utf-8")).hexdigest() | |
| return self._root / f"{digest}.snapshot" |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@deepnote_toolkit/kernel_checkpoint/store.py` around lines 43 - 46, The
_path_for method currently maps keys to filenames with simple replace, which is
non-injective and allows path-traversal; replace it by producing a
deterministic, filesystem-safe filename derived from a strong hash of the key
(e.g., sha256 hex) and optionally a short sanitized/truncated human-readable
suffix, ensure the filename contains no path separators or "..", join with
self._root, then resolve and assert the resolved path is inside self._root (use
Path.resolve() and Path.is_relative_to or compare parents) before returning to
prevent escapes; update references to _path_for accordingly.
| with pytest.raises(ValueError): | ||
| compute_checkpoint_key(project_id="", init_source_hash="h", environment_id="e") | ||
| with pytest.raises(ValueError): | ||
| compute_checkpoint_key(project_id="p", init_source_hash="", environment_id="e") | ||
| with pytest.raises(ValueError): | ||
| compute_checkpoint_key(project_id="p", init_source_hash="h", environment_id="") |
There was a problem hiding this comment.
Tighten ValueError assertions with match=.
These checks are broad; they can pass for unrelated ValueErrors. Assert the expected message fragment.
Suggested fix
- with pytest.raises(ValueError):
+ with pytest.raises(ValueError, match="project_id must be a non-empty string"):
compute_checkpoint_key(project_id="", init_source_hash="h", environment_id="e")
- with pytest.raises(ValueError):
+ with pytest.raises(ValueError, match="init_source_hash must be a non-empty string"):
compute_checkpoint_key(project_id="p", init_source_hash="", environment_id="e")
- with pytest.raises(ValueError):
+ with pytest.raises(ValueError, match="environment_id must be a non-empty string"):
compute_checkpoint_key(project_id="p", init_source_hash="h", environment_id="")🧰 Tools
🪛 Ruff (0.15.14)
[warning] 184-184: pytest.raises(ValueError) is too broad, set the match parameter or use a more specific exception
(PT011)
[warning] 186-186: pytest.raises(ValueError) is too broad, set the match parameter or use a more specific exception
(PT011)
[warning] 188-188: pytest.raises(ValueError) is too broad, set the match parameter or use a more specific exception
(PT011)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@tests/unit/test_kernel_checkpoint.py` around lines 184 - 189, The tests call
compute_checkpoint_key with empty inputs but currently use broad
pytest.raises(ValueError) which can catch unrelated errors; update each
assertion to use pytest.raises(ValueError, match="...") with a specific expected
message fragment (e.g., "project_id", "init_source_hash", and "environment_id")
so each raise asserts the correct validation path in compute_checkpoint_key.
|
🚀 Review App Deployment Started
|
Why
The dominant cold-start cost for projects with a heavy init notebook is re-running init on every fresh container boot. For data-loading / client-setup init, this is tens of seconds to minutes per cold start.
This POC ships the data plane for snapshotting the kernel's post-init globals to a pluggable store and restoring them on the next cold boot, skipping the init re-run entirely. Architecture is production-shaped so the same module ships unchanged into prod once we add an S3 backend.
What's in the box
Correctness invariant
A restore is correct iff the restored namespace produces the same downstream execution behaviour as freshly re-running init in the current environment.
The checkpoint key includes init source hash + environment id so source/env changes always invalidate the cache.
Verified end-to-end
Test plan
runtime/executor.pyinit lifecycle + webapp signaling of init_source_hash + env_id + per-workspace feature flagOut of scope (productionisation TODO)
/workdirectory hash in the checkpoint key🤖 Generated with Claude Code
Summary by CodeRabbit
Release Notes
New Features
Tests