-
Notifications
You must be signed in to change notification settings - Fork 10
Delete transferred files #405
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
2252767 to
552daae
Compare
|
Visually compared diffs of #383 and this PR; they appear to match up correctly. |
96dda0f to
7ead900
Compare
|
Merged #404 (the test that fails on |
|
Indeed, we now get: Remaining TODO:
|
Unfortunately, the |
Fix & Chrysalis testscd ~/ez/zstash
git status
# On branch improve-globus-refresh
# Handle uncommitted changes
git add -A
git commit -m "Testing" --no-verify
git checkout issue-374-tar-deletion-rebased20251124
lcrc_conda
conda activate pr405-tar-deletion-20251203
git log
# Good, matches the 2 commits of https://github.com/E3SM-Project/zstash/pull/405/commits
git diff 01ffe34230c357145258544ed26d0b64327323ed 06255950a408a2fabb1606ebbbc718ed06c49697 | cat
# Review diff
# Make fixes and then:
pre-commit run --all-files
python -m pip install .
# Main test
cd tests/integration/bash_tests/run_from_any
./test_globus_tar_deletion.bash 20251208_pr405 /home/ac.forsyth2/ez/zstash /home/ac.forsyth2/zstash_tests LCRC_IMPROV_DTN_ENDPOINT yesThat gives: Run full test suite: cd ~/ez/zstash
pytest tests/unit/test_*.py
# 1 passed in 1.61s
python -m unittest tests/integration/python_tests/group_by_command/test_*.py
# Ran 69 tests in 61.546s
# OK (skipped=32)
python -m unittest tests/integration/python_tests/group_by_workflow/test_*.py
# Ran 4 tests in 3.378s
# OK
cd tests/integration/bash_tests/run_from_any/
./globus_auth.bash pr405_full_test_try2 chrysalis /home/ac.forsyth2/ez/zstash /home/ac.forsyth2/zstash_tests /global/homes/f/forsyth/zstash_tests /home/f/forsyth/zstash_tests /compyfs/fors729/zstash_tests
# All globus_auth tests completed successfully.
# Good, the fix worked!
cd ~/ez/zstash
cd tests/integration/bash_tests/run_from_chrysalis/
# Revoke consents: https://auth.globus.org/v2/web/consents > Globus Endpoint Performance Monitoring > rescind all
rm ~/.zstash_globus_tokens.json
mkdir zstash_demo; echo 'file0 stuff' > zstash_demo/file0.txt
# NERSC_PERLMUTTER_ENDPOINT=6bdc7956-fc0f-4ad2-989c-7aa5ee643a79
zstash create --hpss=globus://6bdc7956-fc0f-4ad2-989c-7aa5ee643a79//global/homes/f/forsyth/zstash/tests/test_database_corruption_setup23 zstash_demo
# Paste an auth code here, but NOT during the database_corruption test.
rm -rf zstash_demo/
time ./database_corruption.bash pr405_test_db
# Success count: 25
# Fail count: 0
# Review:
# real 8m0.382s
time ./symlinks.sh
# ./symlinks.sh: line 18: cd: /home/ac.forsyth2/ez/zstash/tests/utils/test_symlinks: No such file or directory
mkdir -p /home/ac.forsyth2/ez/zstash/tests/utils/test_symlinks
time ./symlinks.sh
# real 0m8.316s
# Good, no errors
cd ~/ez/zstash
git status
rm -rf tests/integration/bash_tests/run_from_chrysalis/workdir/
pre-commit run --all-files
git add -A
git commit -m "Fix gtc handling for get"
git push upstream issue-374-tar-deletion-rebased20251124Perlmutter testscd ~/ez/zstash
git status
# On branch test_unified_1.12.0rc4_perlmutter
# nothing to commit, working tree clean
git fetch upstream issue-374-tar-deletion-rebased20251124
git checkout -b issue-374-tar-deletion-rebased20251124 upstream/issue-374-tar-deletion-rebased20251124
nersc_conda # Activate conda.
rm -rf build
conda clean --all --y
conda env create -f conda/dev.yml -n pr405-tar-deletion-20251208
conda activate pr405-tar-deletion-20251208
pre-commit run --all-files
python -m pip install .
cd ~/ez/zstash
pytest tests/unit/test_*.py
# 1 passed in 0.21s
python -m unittest tests/integration/python_tests/group_by_command/test_*.pyThat gives 3 errors. (Note that errors newly occur because the HPSS tests only work on Perlmutter). python -m unittest tests/integration/python_tests/group_by_workflow/test_*.py
# Ran 4 tests in 2.650s
# OK
# Skipping tests/integration/bash_tests/run_from_any/ tests
# `./globus_auth.bash` is cumbersome to run, but passed on Chrysalis
# `./test_globus_tar_deletion.bash` is set up to run a transfer from an endpoint to itself,
# which has sometimes caused run into problems on Perlmutter. This test passed on Chrysalis.
cd tests/integration/bash_tests/run_from_perlmutter/
time ./follow_symlinks.sh
# real 0m32.711s
# Good, no errors
time ./test_update_non_empty_hpss.bash
# real 0m9.237s
# Good, no errors
time ./test_ls_globus.bash # Had to paste an auth-code
# real 0m43.061s
# Good, no errorsThere a 3 HPSS test failures to account for. All other tests appear to be working ok now. Once all tests are passing:
|
|
Claude's review guide: ExpandCode Review Guide: Fix Tar File Deletion After Globus TransferOverviewThis diff addresses a bug where tar files were not being deleted after successful Globus transfers when Key Changes1. New Transfer Tracking Module (
|
forsyth2
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Self-review using Claude's code-review guide. Currently (as of commit 4), the entire test suite is passing.
Self-review
Verifications
Verify: Should is_index check be redundant with proper keep flag usage?
Yes, index.db is treated differently than the tar files. keep applies to the tar files. The index.db is always kept.
Potential issue: In non-blocking mode, status might still be "ACTIVE" or "UNKNOWN" here
Check: Is deletion properly deferred to globus_finalize()?
Yes, the test added in #404 appears to confirm this.
Question: Why delete both curr_transfers and prev_transfers? Is this double-deletion safe?
Verify the state machine: curr_transfers → prev_transfers → deletion
I believe this is fine. By this point in the code, we've passed the # Wait for any submitted transfers to complete block and are just trying to clear the transfer queues.
Verify: Does this maintain the same behavior?
I believe so; no symlink test was broken.
Tar files deleted after successful Globus transfer (without --keep)
Tar files preserved with --keep flag
index.db never deleted
Non-blocking mode: files deleted after finalization
Yes, in test added in #404.
Direct HPSS transfers still work correctly
Yes, no HPSS test was broken.
Very long transfers with EXHAUSTED_TIMEOUT_RETRIES
This is more the domain of #407 (which was recently determined to already be resolved on main).
--follow-symlinks with broken symlink behavior unchanged
Yes, no symlink test broke.
--dry-run doesn't trigger deletions
Verified by visual inspection of --dry-run code path.
Database corruption handling unchanged
Yes, that test didn't break.
In hpss_transfer(), status might be "UNKNOWN" or "ACTIVE" when deletion logic runs. > Files might not get queued for deletion properly.
Check: Does globus_finalize() compensate for this?
Yes. By this point in the code, we've passed the # Wait for any submitted transfers to complete block and are just trying to clear the transfer queues.
In globus_finalize(), if there's pending transfer_data, it submits a new transfer. But globus_transfer() already submitted data.
Verify: Is this intentional for batching? Or could it duplicate transfers?
This is explicitly in a # Check if there's any pending transfer data that hasn't been submitted yet block.
When does curr become prev?
Why delete both in finalize?
What happens on consecutive calls?
curr is the list of tars being transferred via Globus right now. prev was the previous list. prev is the list of tars that are now ok to delete because they've been successfully transferred. They are both deleted in finalize because we have completed globus_wait by then and all queues should be cleared.
If globus_finalize() calls sys.exit(1) (line 236), are files left behind?
Check: Should there be cleanup in error paths?
We probably should not be deleting in the case of error.
Lines commented with # Don't track index.db for deletion - verify this is consistently applied everywhere index.db is transferred.
Yes, otherwise the tests would not be passing.
Why is index.db excluded from htc tracking in some calls but the parameter is commented out?
This is referring to:
hpss_put(
hpss,
get_db_filename(cache),
cache,
keep=args.keep,
is_index=True,
gtc=gtc,
# htc=htc, # Don't track index.db for deletion
)It's commented out there to make specific note that we don't need to track index.db, only the tars.
In non-blocking mode, how do you ensure files are deleted if status is still "ACTIVE" when hpss_transfer() returns?
That's the point of the wait logic.
Should globus_wait() be called for ALL submitted transfers, not just the most recent?
No, that would render --non-blocking useless. The point is to wait on blocking transfers and at the very end.
What's the expected behavior if a user Ctrl-C's during finalization?
I imagine we'd want to keep the files since we're not sure they're transferred yet then.
Not explicitly tested/verified
Blocking mode: files deleted immediately after each transfer
The test from #404 does test that the files are deleted, but I suppose it's not explicitly testing it's done after each transfer.
Multiple tar files in single run handled correctly
The test from #404 is only testing one tar file.
Transfer fails: files should NOT be deleted
Mixed successful/failed transfers
Globus endpoint activation failures
How do we force a transfer failure?
File already deleted (verify graceful handling)
We'd have to concurrently delete files while the test was transferring data.
Possible improvements
Magic values: "SUCCEEDED", "ACTIVE" should be constants
I can make an enum for this.
Consider: Replace List with deque for transfer collections (performance)
That seems reasonable. (There probably aren't that many tars batched together for each transfer, but there's not really a downside to switching to deque).
Possible concerns:
Thread safety: Is this code thread-safe now? (Likely wasn't before either)
forsyth2
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Adding explanatory notes
| logger.debug(f"{ts_utc()}:Calling globus_activate(hpss)") | ||
| globus_activate(hpss) | ||
| logger.debug(f"{ts_utc()}:Calling globus_activate()") | ||
| gtc = globus_activate(hpss) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Using a new object to hold state data, rather than using global variables.
| # Create and set up the database | ||
| logger.debug(f"{ts_utc()}: Calling create_database()") | ||
| failures: List[str] = create_database(cache, args) | ||
| htc: HPSSTransferCollection = HPSSTransferCollection() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same idea of using objects to store data. Don't confuse with GlobusTransferCollection().
| keep=args.keep, | ||
| is_index=True, | ||
| gtc=gtc, | ||
| # htc=htc, # Don't track index.db for deletion |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We don't need to track index.db because we never delete it.
| if args.follow_symlinks: | ||
| raise Exception("Archive creation failed due to broken symlink.") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is just to make the code easier to read and avoid an identical function call being written out twice (once in each part of the if/else block).
| sys.exit(1) | ||
|
|
||
|
|
||
| def file_exists(name: str) -> bool: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Moved to zstash/globus_utils.py
| # ACTIVE, SUCCEEDED, FAILED, INACTIVE | ||
| self.task_status: Optional[str] = None |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Possible improvement: could replace with an enum
| self.prev_transfers: List[str] = [] # Can remove | ||
| self.curr_transfers: List[str] = [] # Still using! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Possible improvement: use dequeue rather than List
| logger.debug(f"{ts_utc()}: HPSSTransferCollection initialized") | ||
|
|
||
|
|
||
| def delete_transferred_files(htc: HPSSTransferCollection): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Used in Globus case: we can delete the previous files and rotate the current transfer list to be the "previous" list.
| logger.debug(f"{ts_utc()}: prev_transfers has been set to {htc.prev_transfers}") | ||
|
|
||
|
|
||
| def delete_current_files(htc: HPSSTransferCollection): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Used in HPSS case: we can delete the current files right away because we don't have to wait long for a transfer.
| if args.follow_symlinks: | ||
| raise Exception("Archive update failed due to broken symlink.") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same update to the follow_symlinks exception wrapping as in create.py
|
Converted back to draft. Upon further self-review, need to resolve a few things:
|
|
Closing in favor of #416 |
Summary
Objectives:
--non-blockingis setIssue resolution:
--non-blockingcompletes #383 (That one had many conflicts when I rebased onupstream/main, so I decided to open a distinct PR to compare the two).Select one: This pull request is...
Big Change
1. Does this do what we want it to do?
Required:
If applicable:
2. Are the implementation details accurate & efficient?
Required:
If applicable:
zstash/conda, not just animportstatement.3. Is this well documented?
Required:
zstashbehaves as should be expected based on the docs.4. Is this code clean?
Required:
If applicable: