@edonadei edonadei commented Nov 2, 2025

Motivation

This PR implements incremental model re-hashing to solve a critical performance problem when signing large ML models. Currently, when a user makes a small change to a large model (e.g., updating README.md in a 500GB model), the entire model must be re-hashed before re-signing, which can take hours. This makes it impractical to update documentation or configuration files in large models.

This PR adds a Python API that reuses digests from previous signatures for unchanged files, only re-hashing files that were added or modified. For a 500GB model with a 1KB documentation update, only the 1KB file is re-hashed instead of all 500GB (roughly a 5×10^8-fold reduction in bytes hashed), which reduces re-hashing time from hours to seconds.

Changes

  1. Manifest.from_signature() - Extracts manifest from existing signature files without cryptographic verification, enabling digest reuse
  2. IncrementalSerializer - Core implementation that compares current model state against existing manifest and only re-hashes changed files
  3. Config.use_incremental_serialization() - Integrates incremental serializer into hashing API
  4. sign_incremental() - High-level convenience API that combines extraction + incremental hashing + signing

Design Decision

Following the discussion in #160, this implementation uses a user-driven approach where changed files are specified via the files_to_hash parameter (e.g., from git diff). This was chosen over automatic change detection because:

  • File metadata (mtime) is unreliable across systems and git operations
  • Users know what changed via their workflow (git, manual tracking, etc.)
  • Keeps implementation simple and reliable
  • Works with any file tracking system
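
The mtime point can be checked with a short standard-library sketch. It is not part of this PR and the function names are illustrative. It bumps a file's modification time without touching its bytes, then compares the timestamp and the SHA-256 digest before and after.

```python
import hashlib
import os
import tempfile
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the hex SHA-256 digest of a file's contents."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def mtime_changes_without_content_change() -> tuple[bool, bool]:
    """Return (mtime_changed, digest_changed) after re-stamping a file."""
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "weights.bin"
        path.write_bytes(b"x" * 1024)

        mtime_before = path.stat().st_mtime_ns
        digest_before = sha256_of(path)

        # Simulate what a checkout or copy can do: new timestamp, same bytes.
        bumped = mtime_before + 10 * 10**9  # ten seconds later
        os.utime(path, ns=(bumped, bumped))

        mtime_changed = path.stat().st_mtime_ns != mtime_before
        digest_changed = sha256_of(path) != digest_before
        return mtime_changed, digest_changed
```

Calling mtime_changes_without_content_change() should report that the timestamp moved while the digest stayed the same, which is why an mtime-based detector would re-hash files that have not actually changed.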

Test Coverage:

  • tests/manifest_test.py: 5 tests for manifest extraction (valid/invalid signatures, error cases)
  • tests/_serialization/incremental_test.py: 7 tests for incremental serializer (new files, modified files, deleted files, mixed scenarios, empty manifests)

Future Work: Based on maintainer feedback, these could be added in follow-up PRs:

  • CLI support (--incremental flag)
  • Documentation in README
  • Git helper utility (get_changed_files_from_git())
  • Integration tests with large models
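
As a rough idea of what the git helper could look like, here is a hypothetical sketch. Only the name get_changed_files_from_git() comes from the list above; the signature, the rev parameter, and the parse_changed_paths() helper are assumptions. It also assumes the model directory is the root of the git repository, so the paths git prints are relative to it.

```python
import subprocess
from pathlib import Path


def parse_changed_paths(raw: bytes, model_root: Path) -> list[Path]:
    """Turn NUL-separated `git diff --name-only -z` output into paths."""
    names = [name for name in raw.decode("utf-8").split("\0") if name]
    return [model_root / name for name in names]


def get_changed_files_from_git(
    model_root: Path, rev: str = "HEAD"
) -> list[Path]:
    """Hypothetical helper: list tracked files that differ from `rev`.

    Assumes `model_root` is the root of the git repository.
    """
    # -z separates names with NUL bytes, so names with spaces or newlines
    # are not split or quoted.
    raw = subprocess.check_output(
        ["git", "diff", "--name-only", "-z", rev], cwd=model_root
    )
    return parse_changed_paths(raw, model_root)
```

git diff does not list untracked files, but according to the algorithm described in this PR, files missing from the old manifest are hashed as new files anyway, so only modified tracked files need to come from a helper like this.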

Questions for Maintainers:

  • Is the files_to_hash parameter approach acceptable, or would you prefer a different change detection mechanism?
  • Should CLI support and documentation be in this PR or separate follow-ups?
  • Would you like me to add a CHANGELOG entry now or wait until the approach is approved?

Testing this PR

# Create and sign an initial model
# (assumes an elliptic-curve key pair already exists as test.key / test.pub)
from model_signing import signing
from pathlib import Path

model_dir = Path("test_model")
model_dir.mkdir(exist_ok=True)
(model_dir / "weights.bin").write_bytes(b"x" * 1000000)  # 1MB file
(model_dir / "README.md").write_text("Version 1")

# Initial sign
signing.Config().use_elliptic_key_signer(
    private_key=Path("test.key")
).sign(model_dir, "model.sig.v1")

# Modify only README
(model_dir / "README.md").write_text("Version 2 - updated docs")

# Sign incrementally - only re-hashes README.md, reuses weights.bin digest
signing.Config().use_elliptic_key_signer(
    private_key=Path("test.key")
).sign_incremental(
    model_dir,
    old_signature_path="model.sig.v1",
    new_signature_path="model.sig.v2",
    files_to_hash=[model_dir / "README.md"]
)

# Verify the new signature works
from model_signing import verifying
verifying.Config().use_elliptic_key_verifier(
    public_key=Path("test.pub")
).verify(model_dir, "model.sig.v2")

Usage Example:

import subprocess
from pathlib import Path
from model_signing import signing

model_path = Path("huge-model")

# Get changed files from git (assumes model_path is the root of the git
# repository, so the paths git prints are relative to model_path)
changed = subprocess.check_output(
    ['git', 'diff', '--name-only', 'HEAD'], cwd=model_path
).decode().splitlines()
files_to_hash = [model_path / f for f in changed if f]

# Sign incrementally
signing.sign_incremental(
    model_path=model_path,
    old_signature_path="model.sig.old",
    new_signature_path="model.sig.new",
    files_to_hash=files_to_hash
)

Emrick Donadei added 4 commits November 1, 2025 23:58
…tures

This method enables reading a manifest from a signature file without
performing cryptographic verification. This is the foundation for
incremental re-hashing, where we need to know what files were
previously signed to determine which files need re-hashing.

The method:
- Reads and parses Sigstore bundle JSON format
- Extracts the DSSE envelope payload
- Decodes base64-encoded payload
- Validates manifest integrity (root digest matches resources)
- Returns a Manifest object
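
The first three steps can be sketched with the standard library alone. This is not the code from this commit: the function names are illustrative, the "dsseEnvelope" and "payload" keys are assumed from the JSON encoding of Sigstore bundles, and the integrity check and the conversion to a Manifest object are left out because they depend on library internals.

```python
import base64
import json
from pathlib import Path
from typing import Any


def payload_from_bundle(bundle: dict[str, Any]) -> Any:
    """Decode the DSSE payload of a parsed bundle, without verifying it."""
    envelope = bundle.get("dsseEnvelope")  # assumed key name
    if envelope is None:
        raise ValueError("bundle has no DSSE envelope")
    # The payload is stored as base64-encoded JSON.
    raw = base64.b64decode(envelope["payload"])
    return json.loads(raw)


def read_unverified_payload(signature_path: Path) -> Any:
    """Read a signature file and return its decoded payload."""
    bundle = json.loads(signature_path.read_text(encoding="utf-8"))
    return payload_from_bundle(bundle)
```

Because nothing here checks the signature, the result should only be treated as a hint about which digests can be reused, not as proof that the old manifest is authentic.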

Includes comprehensive tests covering:
- Valid manifest extraction
- Rejection of inconsistent manifests
- Error handling for missing files, invalid JSON, and missing envelopes

Related to issue sigstore#160 - API for incremental model re-hashing

Signed-off-by: Emrick Donadei <[email protected]>

Implements the core incremental hashing logic that compares the current
model state against an existing manifest and only re-hashes changed files.

Key features:
- Reuses digests for unchanged files from previous manifest
- Hashes new files not in the previous signature
- Handles modified files via files_to_hash parameter
- Handles file deletions automatically (omits them from new manifest)
- Uses same parallel hashing as standard file serializer

The algorithm:
1. Scan current model directory for all files
2. Build set of files to rehash from files_to_hash parameter
3. For each current file:
   - If not in old manifest: hash it (new file)
   - If in files_to_hash list: hash it (modified file)
   - Otherwise: reuse digest from old manifest (unchanged)
4. Deleted files are automatically excluded (not on disk)
5. Return manifest with mix of reused and new digests
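
The steps above can be sketched as a small standalone function. This is a simplified illustration, not the IncrementalSerializer itself: it assumes a manifest is a plain dict from relative path to hex digest, hashes with SHA-256 sequentially, and uses made-up names (incremental_digests, demo).

```python
import hashlib
import tempfile
from pathlib import Path
from typing import Iterable


def hash_file(path: Path) -> str:
    """Return the hex SHA-256 digest of a file's contents."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def incremental_digests(
    model_path: Path,
    old_digests: dict[str, str],
    files_to_hash: Iterable[Path] = (),
) -> tuple[dict[str, str], list[str]]:
    """Return (new digests, names of the files that were actually hashed)."""
    # Step 2: names the caller wants re-hashed, relative to the model root.
    rehash = {
        Path(p).relative_to(model_path).as_posix() for p in files_to_hash
    }

    new_digests: dict[str, str] = {}
    hashed: list[str] = []
    # Steps 1 and 3: walk the files that currently exist on disk.
    for path in sorted(model_path.rglob("*")):
        if not path.is_file():
            continue
        name = path.relative_to(model_path).as_posix()
        if name not in old_digests or name in rehash:
            new_digests[name] = hash_file(path)  # new or modified file
            hashed.append(name)
        else:
            new_digests[name] = old_digests[name]  # unchanged: reuse
    # Step 4: deleted files were never visited, so they are simply absent.
    return new_digests, hashed


def demo() -> list[str]:
    """Modify, add, and delete files, then list what got re-hashed."""
    with tempfile.TemporaryDirectory() as tmp:
        model = Path(tmp)
        (model / "weights.bin").write_bytes(b"x" * 1024)
        (model / "README.md").write_text("Version 1")
        (model / "old.txt").write_text("to be deleted")
        old, _ = incremental_digests(model, {})  # first pass hashes all

        (model / "README.md").write_text("Version 2")  # modified
        (model / "NOTES.md").write_text("new file")  # added
        (model / "old.txt").unlink()  # deleted

        new, hashed = incremental_digests(model, old, [model / "README.md"])
        assert new["weights.bin"] == old["weights.bin"]
        assert "old.txt" not in new
        return hashed
```

In this sketch, a file that was modified but not listed in files_to_hash keeps its old digest, so the correctness of the result depends on the caller passing a complete list of modified files.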

Usage for incremental signing (e.g., 500GB model, 1KB README changed):
  # Get changed files from git
  changed = subprocess.check_output(['git', 'diff', '--name-only', 'HEAD'])
  files_to_hash = [model_path / f for f in changed.decode().splitlines() if f]

  # Only re-hash the changed file(s)
  serializer.serialize(model_path, files_to_hash=files_to_hash)

This provides significant performance improvements - only re-hashing
the changed 1KB instead of all 500GB.

Includes comprehensive tests covering:
- No changes: all digests reused
- New file added: only new file hashed
- Modified file: only modified file re-hashed
- File deleted (auto): removed from manifest
- File deleted (in files_to_hash): safely ignored
- Mixed changes: all scenarios working together

Related to issue sigstore#160 - API for incremental model re-hashing

Signed-off-by: Emrick Donadei <[email protected]>

Integrates the IncrementalSerializer into the high-level hashing API,
making it accessible through the Config class.

Usage:
  # Extract manifest from previous signature
  old_manifest = Manifest.from_signature(Path("model.sig.old"))

  # Configure incremental hashing
  config = hashing.Config().use_incremental_serialization(
      old_manifest,
      hashing_algorithm="sha256"
  )

  # Get changed files and hash them
  changed_files = [model_path / "README.md"]
  new_manifest = config.hash(model_path, files_to_hash=changed_files)

This method follows the same pattern as use_file_serialization() and
use_shard_serialization(), providing a consistent API for users.

The configuration:
- Accepts an existing manifest to compare against
- Supports all the same hashing algorithms (SHA256, BLAKE2, BLAKE3)
- Supports the same parameters (chunk_size, max_workers, etc.)
- Returns Self for method chaining
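
To illustrate what parameters such as chunk_size and max_workers typically control, here is a minimal sketch of chunked, parallel file hashing. It is an assumption about the general technique, not the library's implementation: hash_file_chunked() and hash_files() are made-up names, and it uses hashlib from the standard library, which provides SHA-256 and BLAKE2 but not BLAKE3.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
from typing import Iterable, Optional


def hash_file_chunked(
    path: Path, algorithm: str = "sha256", chunk_size: int = 1 << 20
) -> str:
    """Hash one file in fixed-size chunks so memory use stays bounded."""
    digest = hashlib.new(algorithm)
    with path.open("rb") as f:
        while chunk := f.read(chunk_size):
            digest.update(chunk)
    return digest.hexdigest()


def hash_files(
    paths: Iterable[Path],
    algorithm: str = "sha256",
    chunk_size: int = 1 << 20,
    max_workers: Optional[int] = None,
) -> dict[Path, str]:
    """Hash several files concurrently and map each path to its digest."""
    paths = list(paths)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        digests = pool.map(
            lambda p: hash_file_chunked(p, algorithm, chunk_size), paths
        )
        return dict(zip(paths, digests))
```

The chunk size only affects memory use and I/O granularity, not the resulting digest, so digests computed with different chunk sizes stay comparable across runs.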

Related to issue sigstore#160 - API for incremental model re-hashing

Signed-off-by: Emrick Donadei <[email protected]>

Provides high-level convenience functions for incremental model signing
that combine all the pieces: manifest extraction, incremental hashing,
and signing.

Two levels of API:

1. Simple function API:
   sign_incremental(
       model_path="huge-model/",
       old_signature_path="model.sig.old",
       new_signature_path="model.sig.new",
       files_to_hash=["huge-model/README.md"]
   )

2. Configurable class API:
   Config().use_elliptic_key_signer(private_key="key").sign_incremental(
       model_path="huge-model/",
       old_signature_path="model.sig.old",
       new_signature_path="model.sig.new",
       files_to_hash=["huge-model/README.md"]
   )

Both APIs:
- Extract manifest from old signature automatically
- Configure incremental hashing
- Hash only changed/new files
- Sign the new manifest
- Write the new signature

Also added set_allow_symlinks() method to IncrementalSerializer to
maintain compatibility with the hashing Config class, which calls this
method before serialization.

This makes it trivial for users to incrementally sign large models
where only a few files changed, avoiding hours of re-hashing.

Related to issue sigstore#160 - API for incremental model re-hashing

Signed-off-by: Emrick Donadei <[email protected]>

@edonadei edonadei marked this pull request as ready for review November 3, 2025 00:27
@edonadei edonadei requested review from a team as code owners November 3, 2025 00:27
Author

edonadei commented Nov 3, 2025

@mihaimaruseac could you take a look at this? I tried to follow up on the last discussions in #160 (from 2024, so fairly old) and implement a solution. It's a bit long and probably imperfect, but I'm open to feedback. There's a "Questions for Maintainers" section to start the discussion.

@mihaimaruseac
Collaborator

Amazing! Will take a look this week (JupyterCon)

- Fix SIM118: Use 'key in dict' instead of 'key in dict.keys()'
- Fix E501: Break long lines to stay under 80 characters
- Fix F401: Remove unused pytest import from incremental_test.py
- Fix F401: Remove unused json import from manifest_test.py

All critical lint errors resolved.

Signed-off-by: Emrick Donadei <[email protected]>

Auto-format code with ruff to match the project's formatting standards:
- Adjust line breaking for long expressions
- Format function call arguments consistently
- Apply consistent parentheses placement

No functional changes, only formatting.

Signed-off-by: Emrick Donadei <[email protected]>