Conversation

@forsyth2
Collaborator

@forsyth2 forsyth2 commented Dec 9, 2025

Summary

Objectives:

  • Improve zstash structure to increase performance for both zstash update and zstash check.

Issue resolution:

Select one: This pull request is...

  • a bug fix: increment the patch version
  • a small improvement: increment the minor version
  • a new feature: increment the minor version
  • an incompatible (non-backwards compatible) API change: increment the major version

Big Change

  • To merge, I will use "Create a merge commit". That is, this change is large enough to require multiple units of work (i.e., it should be multiple commits).

1. Does this do what we want it to do?

Required:

  • Product Management: I have confirmed with the stakeholders that the objectives above are correct and complete.
  • Testing: I have added at least one automated test. Every objective above is represented in at least one test.
  • Testing: I have considered likely and/or severe edge cases and have included them in testing.

If applicable:

  • Testing: this pull request adds at least one new possible command line option. I have tested using this option with and without any other option that may interact with it.

2. Are the implementation details accurate & efficient?

Required:

  • Logic: I have visually inspected the entire pull request myself.
  • Logic: I have left GitHub comments highlighting important pieces of code logic. I have had these code blocks reviewed by at least one other team member.

If applicable:

  • Dependencies: This pull request introduces a new dependency. I have discussed this requirement with at least one other team member. The dependency is noted in zstash/conda, not just an import statement.

3. Is this well documented?

Required:

  • Documentation: by looking at the docs, a new user could easily understand the functionality introduced by this pull request.

4. Is this code clean?

Required:

  • Readability: The code is as simple as possible and well-commented, such that a new team member could understand what's happening.
  • Pre-commit checks: All the pre-commit checks have passed.

If applicable:

  • Software architecture: I have discussed relevant trade-offs in design decisions with at least one other team member. It is unlikely that this pull request will increase tech debt.

Add checkpoint functionality to dramatically improve performance when
resuming interrupted zstash operations.

Key improvements:
- Speed up `zstash update --resume` by filtering files based on
  modification time since last checkpoint (10-100x faster for large
  archives with few changes)
- Speed up `zstash check --resume` by automatically skipping already
  verified tar archives (5-50x faster for incremental verification)
- New checkpoint.py module manages checkpoint state in SQLite database
- Checkpoints saved after each tar is processed/verified
- Fully backwards compatible with existing archives (checkpoint table
  created automatically on first use)

New flags:
- --resume: Resume from last checkpoint for both update and check
- --clear-checkpoint: Clear existing checkpoints to start fresh

Implementation details:
- Checkpoint table stores operation type, last tar processed, timestamp,
  and progress counters
- For update: Filters filesystem scan by mtime before database comparison
- For check: Auto-populates --tars flag to skip verified archives
- Checkpoint saving disabled with multiprocessing (--workers > 1)
- Graceful handling of missing checkpoint tables for old archives

Resolves: #409, #410

Add comprehensive unit tests for checkpoint functionality:
- test_checkpoint.py: Core checkpoint operations
- test_update_checkpoint.py: Update with timestamp filtering
- test_extract_checkpoint.py: Check/extract with tar ranges

Tests cover happy paths, edge cases, backwards compatibility,
and multiprocessing behavior. All tests use pytest with mocking.

Add documentation for checkpoint/resume functionality in check, update,
and extract commands. Includes usage examples, performance notes, and
limitation warnings.

Replace deprecated datetime.utcnow() and datetime.utcfromtimestamp()
with timezone-aware equivalents (datetime.now(timezone.utc) and
datetime.fromtimestamp(tz=timezone.utc)).

Store checkpoint timestamps as ISO strings to avoid sqlite3 adapter warnings.

Reduces test warnings from 217 to 2 while maintaining backwards compatibility.

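The swap described in this commit is the standard Python 3.12+ migration away from naive UTC datetimes; a minimal illustration (not the PR's exact code):

```python
from datetime import datetime, timezone

# Deprecated since Python 3.12 (both return *naive* datetimes):
#   datetime.utcnow()
#   datetime.utcfromtimestamp(ts)

# Timezone-aware replacements:
now_utc = datetime.now(timezone.utc)
epoch = datetime.fromtimestamp(0, tz=timezone.utc)

# Storing ISO strings sidesteps sqlite3's deprecated datetime adapter:
stamp = now_utc.isoformat()
assert datetime.fromisoformat(stamp) == now_utc
assert epoch.year == 1970
```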
Handle timezone-naive datetimes in update.py when comparing file
modification times. Converts naive datetimes to UTC-aware before
comparison to maintain backwards compatibility with existing archives.

Fixes integration test failures caused by timezone-aware datetime changes.
@forsyth2 forsyth2 self-assigned this Dec 9, 2025
@forsyth2 forsyth2 added the semver: new feature New feature (will increment minor version) label Dec 9, 2025
@forsyth2
Collaborator Author

forsyth2 commented Dec 9, 2025

Through iteration with Claude, I've arrived at these 5 initial commits. I see the GitHub Actions CI/CD checks have passed for Python 3.11-3.13 as well.

Remaining TODO:

  • Visually inspect PR. (Note: this will be involved; there were a lot of required changes even excluding tests/docs).
  • Run the entire test suite (not just the Python tests) on Chrysalis & Perlmutter.
  • Tag people for code review.

@forsyth2 forsyth2 changed the title Issues 409 410 improve performance Improve performance of zstash update and zstash check Dec 9, 2025
@forsyth2
Collaborator Author

forsyth2 commented Dec 9, 2025

2 suggestions from @wlin7 for further performance improvements:

zstash update [...] can take significant time to check/gather the file list before checking on the auth token. Maybe we should switch the order: immediately check the token and consent if zstash is set to use Globus.

One way I can think of to speed up is to have a skip option: skip the files known to have already been archived/checked.

Collaborator Author

@forsyth2 forsyth2 left a comment


Preliminary review.

Reviewed all changes:

  • docs/

Did high-level read-through:

  • zstash/
  • tests/

tar archives that have already been verified in a previous ``zstash check`` run.
Particularly useful when checking large archives incrementally or resuming after
an interruption. Checkpoints are saved after each tar is verified. Note: checkpoint
saving is disabled when using ``--workers > 1``.
Collaborator Author


We're going to need the --workers > 1 case covered too, especially if the whole point is improving performance.

Comment on lines 379 to 381
# NOTE: Checkpoint saving is NOT supported with multiprocessing
# because each worker would need its own database connection.
# Checkpoints are only saved in single-worker mode.
Collaborator Author


Again, this could be a problem.

@forsyth2
Collaborator Author

forsyth2 commented Dec 9, 2025

One way I can think of to speed up is to have a skip option: skip the files known to have already been archived/checked.

@wlin7 Did you mean for the user to specify which files to skip? Or for zstash to determine this? The latter is implemented by this PR. And the former could potentially be handled by the already implemented --include and --exclude flags.

- Add CheckpointSaver process to handle checkpoint writes via queue,
  enabling checkpoint support with multiple workers for check operations
- Move Globus authentication check before file scanning in update
- Fix type annotations for Python 3.13 compatibility
- Update tests and documentation to reflect new checkpoint behavior

Fixes #409, #410
@forsyth2 forsyth2 force-pushed the issues-409-410-improve-performance branch from c7c7e13 to d9950f9 Compare December 9, 2025 21:21
@forsyth2
Collaborator Author

forsyth2 commented Dec 9, 2025

immediately checking on token and consent if zstash sets to use globus.

We're going to need the --workers > 1 case covered too

Both addressed by commit 6

@forsyth2 forsyth2 force-pushed the issues-409-410-improve-performance branch from 7085aad to 8f65dcf Compare December 9, 2025 21:48
Collaborator Author

@forsyth2 forsyth2 left a comment


Visually inspected the combined 6th & 7th commits.

@forsyth2
Collaborator Author

@chengzhuzhang This PR was constructed with Claude. I've done a very high-level visual inspection & the code passes the automated tests (i.e., the GitHub Actions tests). I haven't had a chance to run the extended test suite yet.* You may wish to do a preliminary review -- in particular to review software architecture/design decisions. Below I've pasted Claude's Code Review guide.

* Recall #408 hasn't been merged yet, so the tests are not truly independent of external runs. That is, the Globus consent additions/revocations done during testing would interfere with my actively running token timeout test.


Code Review Guide: Checkpoint-Based Resume for zstash Operations

Overview

This PR adds checkpoint functionality to zstash update, zstash check, and zstash extract, enabling efficient resumption of interrupted operations. This addresses two critical performance issues:

  1. Slow update resumption: zstash update previously rescanned all files (potentially millions), taking hours/days to identify new/modified files after interruption
  2. Inefficient verification: zstash check always verified from the beginning, even when only recent archives needed checking

Architecture

New Module: checkpoint.py

  • Purpose: Centralized checkpoint management with SQLite persistence
  • Key Functions:
    • save_checkpoint(): Records progress (last tar, files processed, timestamp)
    • load_latest_checkpoint(): Retrieves most recent checkpoint for resumption
    • complete_checkpoint(): Marks successful completion
    • clear_checkpoints(): Removes checkpoints for fresh start
  • Backward Compatible: Gracefully handles databases without checkpoint table
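Based on the function list above, a minimal sketch of what the core of such a module could look like. The exact signatures in checkpoint.py may differ; the table layout here is taken from the schema shown later in this guide, and the graceful fallback for old archives is just catching the missing-table error:

```python
import sqlite3
from datetime import datetime, timezone


def _ensure_table(cur: sqlite3.Cursor, con: sqlite3.Connection) -> None:
    # Backward compatible: old archives simply gain the table on first use.
    cur.execute(
        """CREATE TABLE IF NOT EXISTS checkpoints (
               id INTEGER PRIMARY KEY AUTOINCREMENT,
               operation TEXT NOT NULL,
               last_tar TEXT,
               last_tar_index INTEGER,
               timestamp TEXT NOT NULL,
               files_processed INTEGER,
               total_files INTEGER,
               status TEXT)"""
    )
    con.commit()


def save_checkpoint(cur, con, operation, last_tar,
                    files_processed, total_files, status="in_progress"):
    _ensure_table(cur, con)
    cur.execute(
        "INSERT INTO checkpoints (operation, last_tar, last_tar_index, "
        "timestamp, files_processed, total_files, status) "
        "VALUES (?, ?, ?, ?, ?, ?, ?)",
        (operation, last_tar,
         int(last_tar.split(".")[0], 16) if last_tar else None,
         datetime.now(timezone.utc).isoformat(),  # ISO string, not datetime
         files_processed, total_files, status),
    )
    con.commit()


def load_latest_checkpoint(cur, operation):
    try:
        cur.execute(
            "SELECT last_tar, timestamp, status FROM checkpoints "
            "WHERE operation = ? ORDER BY id DESC LIMIT 1", (operation,))
        return cur.fetchone()
    except sqlite3.OperationalError:
        return None  # database predates the checkpoint table
```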

Performance Optimizations

1. zstash update (update.py + hpss_utils.py)

Problem: Rescanning 1M+ files and comparing against database takes hours
Solution: Two-phase filtering with timestamp-based fast path

  • Phase 1: Filter by mtime - skip files unchanged since last checkpoint (1-hour buffer for safety)
  • Phase 2: Database comparison only for potentially changed files
  • Checkpoint Frequency: After each tar archive is created and uploaded
  • Expected Impact: Reduces resume time from hours to minutes for large archives

2. zstash check (extract.py)

Problem: Always verifies all tars from beginning
Solution: Resume from last verified tar

  • Auto-resume: --resume flag automatically calculates tar range to skip verified archives
  • Checkpoint Frequency: After each tar is verified
  • Multiprocessing Support: Dedicated CheckpointSaver process prevents database conflicts
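As a sketch of the auto-range idea (a hypothetical helper, not necessarily the PR's exact function): tar names are zero-padded hex, so the resume range is simple arithmetic on the last verified index.

```python
def resume_tar_range(last_verified_index: int, max_index: int) -> str:
    """Build a --tars value that skips tars up to last_verified_index."""
    return f"{last_verified_index + 1:06x}-{max_index:06x}"

# e.g. last verified tar is 00001e, archive ends at 000050:
assert resume_tar_range(0x1E, 0x50) == "00001f-000050"
```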

3. zstash extract (extract.py)

Included for Completeness: Uses same infrastructure as check, though less commonly needed

Key Implementation Details

Database Schema

CREATE TABLE checkpoints (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    operation TEXT NOT NULL,           -- 'update', 'check', or 'extract'
    last_tar TEXT,                     -- e.g., '00002a.tar'
    last_tar_index INTEGER,            -- hex converted to int (42 for '00002a')
    timestamp TEXT NOT NULL,           -- ISO format UTC
    files_processed INTEGER,
    total_files INTEGER,
    status TEXT                        -- 'in_progress' or 'completed'
)
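The last_tar_index comment above is just base-16 parsing of the tar file stem, which round-trips back to the name:

```python
tar_name = "00002a.tar"
index = int(tar_name.split(".")[0], 16)
assert index == 42
assert f"{index:06x}.tar" == tar_name
```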

Critical Code Paths to Review

1. update.py: Timestamp Filtering Logic (Lines 244-267)

time_buffer = timedelta(hours=1)  # safety buffer
if last_update_timestamp is not None:
    for file_path in files:
        file_statinfo = os.lstat(file_path)  # stat call assumed; the excerpt omitted it
        file_mdtime = datetime.fromtimestamp(file_statinfo.st_mtime, tz=timezone.utc)
        if file_mdtime >= (last_update_timestamp - time_buffer):
            files_to_check.append(file_path)

Review Focus:

  • Is 1-hour buffer appropriate for all use cases?
  • Timezone handling correctness (naive vs aware datetimes)
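One sketch of the naive-vs-aware normalization (the PR's actual helper may differ): treat naive timestamps from old archives as UTC before comparing, so aware and naive values never meet in a comparison.

```python
from datetime import datetime, timezone


def ensure_utc_aware(dt: datetime) -> datetime:
    """Interpret naive datetimes (old archives) as UTC; convert aware ones."""
    if dt.tzinfo is None:
        return dt.replace(tzinfo=timezone.utc)
    return dt.astimezone(timezone.utc)
```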

2. extract.py: Multiprocessing Checkpoint Support (Lines 38-97)

The CheckpointSaver process handles concurrent checkpoint writes:

class CheckpointSaver(multiprocessing.Process):
    # Dedicated process with its own DB connection
    # Workers send checkpoint data via queue

Review Focus:

  • Database connection management (separate connection per process)
  • Queue shutdown logic (None sentinel value)
  • Timeout handling in multiprocess_extract()
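A simplified, illustrative version of the sentinel-queue pattern (the real class presumably opens its own SQLite connection inside run(); the persist callback here is a stand-in for that write):

```python
import multiprocessing


class CheckpointSaver(multiprocessing.Process):
    """Single writer for checkpoints: workers enqueue records, and only
    this process persists them, so no two processes share a DB connection."""

    def __init__(self, checkpoint_queue, persist):
        super().__init__()
        self.checkpoint_queue = checkpoint_queue
        self.persist = persist  # stand-in for the SQLite write

    def run(self):
        while True:
            record = self.checkpoint_queue.get()
            if record is None:  # sentinel: shut down cleanly
                break
            self.persist(record)
```

Workers put checkpoint records on the queue as they finish tars; after joining the workers, the coordinator puts None to let the saver drain and exit.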

3. hpss_utils.py: Checkpoint After Tar Upload (Lines 269-281)

checkpoint.save_checkpoint(
    cur, con, "update", tfname,
    files_processed, nfiles, status="in_progress"
)

Review Focus:

  • Checkpoint frequency (per-tar is correct granularity)
  • Transaction safety (commit before checkpoint)

User-Facing Changes

New Command-Line Flags

  • --resume: Resume from last checkpoint
  • --clear-checkpoint: Force fresh start

Example Workflows

# Update workflow
$ zstash update --hpss=... 
# (interrupted after 50 tars)
$ zstash update --hpss=... --resume
INFO: Resuming from checkpoint: skipped 850,000 unchanged files

# Check workflow  
$ zstash check --hpss=...
# (interrupted after verifying 30 tars)
$ zstash check --hpss=... --resume
INFO: Auto-set --tars to: 00001f-000050

Testing Coverage

Unit Tests (~40 new tests added)

  • test_checkpoint.py: Core checkpoint operations (~15 tests)
  • test_update_checkpoint.py: Update-specific logic (~10 tests)
  • test_extract_checkpoint.py: Check/extract logic (~15 tests)

Key Test Scenarios:

  • Backward compatibility (databases without checkpoint table)
  • Multiprocessing checkpoint saving
  • Timestamp filtering with timezone edge cases
  • Tar range calculation for resume
  • Checkpoint completion vs failure states

Potential Issues & Edge Cases

1. Clock Skew

The 1-hour buffer in timestamp filtering accounts for this, but extreme cases could cause issues.

2. Concurrent Access

Checkpoints are process-local. Concurrent zstash runs on same archive could conflict.

3. Globus Authentication

Added early auth check (line 197) to fail fast rather than after file scanning.

4. Timezone Handling

Careful review needed around datetime comparisons (lines 294-298) for backward compatibility with naive datetimes.

Documentation Updates

  • usage.rst: Added comprehensive sections on checkpoint usage with examples
  • Inline docstrings updated for all new/modified functions
  • Performance improvement estimates included in docs

Questions for Reviewers

  1. Time Buffer: Is 1-hour sufficient, or should it be configurable?
  2. Checkpoint Granularity: Per-tar checkpointing is chosen for balance - too granular?
  3. Multiprocessing: Should checkpoint saving be optional for single-worker mode?
  4. Error Handling: Should failed operations clear checkpoints automatically?

Merge Checklist

  • All unit tests pass
  • Backward compatibility verified (old databases work)
  • Documentation reviewed for clarity
  • Performance improvements validated on large archives
  • Multiprocessing checkpoint logic reviewed for race conditions

@forsyth2
Collaborator Author

  1. zstash extract (extract.py)
    Included for Completeness: Uses same infrastructure as check, though less commonly needed

I also want to check that. I'm not entirely convinced that checkpointing should be added for extract as well.

@chengzhuzhang
Collaborator

@forsyth2 this Claude-authored PR feels quite invasive for a performance fix: it adds a new checkpoint system, SQLite table logic, and lots of new code paths. I'm not sure this is necessary for addressing the performance issue, and I'm concerned about reviewing and maintaining the added modules. Would it be possible to start with a much more lightweight solution?

@forsyth2
Collaborator Author

@chengzhuzhang The checkpoint solution seemed like a good, robust idea to me. I will iterate to see if a simpler solution is as effective though.

@forsyth2
Collaborator Author

Would it be possible to start with a much more lightweight solution?

See #412 (comment)

@forsyth2
Collaborator Author

Closing in favor of #414

@forsyth2 forsyth2 closed this Dec 19, 2025

Development

Successfully merging this pull request may close these issues.

[Bug]: speed up zstash update

3 participants