-
Notifications
You must be signed in to change notification settings - Fork 10
Improve performance of zstash update and zstash check
#411
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Add checkpoint functionality to dramatically improve performance when resuming interrupted zstash operations. Key improvements: - Speed up `zstash update --resume` by filtering files based on modification time since last checkpoint (10-100x faster for large archives with few changes) - Speed up `zstash check --resume` by automatically skipping already verified tar archives (5-50x faster for incremental verification) - New checkpoint.py module manages checkpoint state in SQLite database - Checkpoints saved after each tar is processed/verified - Fully backwards compatible with existing archives (checkpoint table created automatically on first use) New flags: - --resume: Resume from last checkpoint for both update and check - --clear-checkpoint: Clear existing checkpoints to start fresh Implementation details: - Checkpoint table stores operation type, last tar processed, timestamp, and progress counters - For update: Filters filesystem scan by mtime before database comparison - For check: Auto-populates --tars flag to skip verified archives - Checkpoint saving disabled with multiprocessing (--workers > 1) - Graceful handling of missing checkpoint tables for old archives Resolves: #409, #410
Add comprehensive unit tests for checkpoint functionality: - test_checkpoint.py: Core checkpoint operations - test_update_checkpoint.py: Update with timestamp filtering - test_extract_checkpoint.py: Check/extract with tar ranges Tests cover happy paths, edge cases, backwards compatibility, and multiprocessing behavior. All tests use pytest with mocking.
Add documentation for checkpoint/resume functionality in check, update, and extract commands. Includes usage examples, performance notes, and limitation warnings.
Replace deprecated datetime.utcnow() and datetime.utcfromtimestamp() with timezone-aware equivalents (datetime.now(timezone.utc) and datetime.fromtimestamp(tz=timezone.utc)). Store checkpoint timestamps as ISO strings to avoid sqlite3 adapter warnings. Reduces test warnings from 217 to 2 while maintaining backwards compatibility.
Handle timezone-naive datetimes in update.py when comparing file modification times. Converts naive datetimes to UTC-aware before comparison to maintain backwards compatibility with existing archives. Fixes integration test failures caused by timezone-aware datetime changes.
|
Through iteration with Claude, I've arrived at these 5 initial commits. I see the GitHub Actions CI/CD have passed for Python 3.11-13 as well. Remaining TODO:
|
zstash update and zstash check
|
2 suggestions from @wlin7 for further performance improvements:
|
forsyth2
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Preliminary review.
Reviewed all changes:
docs/
Did high-level read-through:
zstash/tests/
docs/source/usage.rst
Outdated
| tar archives that have already been verified in a previous ``zstash check`` run. | ||
| Particularly useful when checking large archives incrementally or resuming after | ||
| an interruption. Checkpoints are saved after each tar is verified. Note: checkpoint | ||
| saving is disabled when using ``--workers > 1``. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We're going to need the --workers > 1 case covered too, especially if the whole point is improving performance.
zstash/extract.py
Outdated
| # NOTE: Checkpoint saving is NOT supported with multiprocessing | ||
| # because each worker would need its own database connection. | ||
| # Checkpoints are only saved in single-worker mode. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Again, this could be a problem.
@wlin7 Did you mean for the user to specify which files to skip? Or for |
- Add CheckpointSaver process to handle checkpoint writes via queue, enabling checkpoint support with multiple workers for check operations - Move Globus authentication check before file scanning in update - Fix type annotations for Python 3.13 compatibility - Update tests and documentation to reflect new checkpoint behavior Fixes #409, #410
c7c7e13 to
d9950f9
Compare
Both addressed by commit 6 |
7085aad to
8f65dcf
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Visually inspected the combined 6th & 7th commits
|
@chengzhuzhang This PR was constructed with Claude. I've done a very high-level visual inspection & the code passes the automated tests (i.e., the GitHub Actions tests). I haven't had a chance to run the extended test suite yet.* You may wish to do a preliminary review -- in particular to review software architecture/design decisions. Below I've pasted Claude's Code Review guide. * Recall #408 hasn't been merged yet, so the tests are not truly independent of external runs. That is, the Globus consent additions/revocations done during testing would interfere with my actively running token timeout test. Code Review Guide: Checkpoint-Based Resume for zstash OperationsOverviewThis PR adds checkpoint functionality to
ArchitectureNew Module:
|
I also want to check that. I'm not entirely convinced the checkpointing is added for |
|
@forsyth2 this Claude authored PR feels quite invasive for a performance fix, it adds new checkpoint system, SQLite table logic, and lots of new code paths. I'm not sure this is necessary for addressing this performance issue, and concerned about reviewing and maintaining these added modules. Would it be possible to start with a much light weighted solution? |
|
@chengzhuzhang The checkpoint solution seemed like a good, robust idea to me. I will iterate to see if a simpler solution is as effective though. |
See #412 (comment) |
|
Closing in favor of #414 |
Summary
Objectives:
zstashstructure to increase performance for bothzstash updateANDzstash check.Issue resolution:
zstashworked, albeit slowly, this fix is more correctly characterized as a new feature -- i.e., significantly improved performance)Select one: This pull request is...
Big Change
1. Does this do what we want it to do?
Required:
If applicable:
2. Are the implementation details accurate & efficient?
Required:
If applicable:
zstash/conda, not just animportstatement.3. Is this well documented?
Required:
4. Is this code clean?
Required:
If applicable: