
Refactor to resolve tars not deleting when --non-blocking is set#416

Merged
forsyth2 merged 33 commits into main from issue-374-refactor-tar-deletion on Apr 6, 2026

Conversation


@forsyth2 forsyth2 commented Jan 3, 2026

Summary

Objectives:

  • Delete tars eventually, even if --non-blocking is set.
  • Refactor as necessary to 1) complete the above objective and 2) make logical flow easy to understand.

Issue resolution:

Select one: This pull request is...

  • a bug fix: increment the patch version
  • a small improvement: increment the minor version
  • a new feature: increment the minor version
  • an incompatible (non-backwards compatible) API change: increment the major version

Big Change

  • To merge, I will use "Create a merge commit". That is, this change is large enough to require multiple units of work (i.e., it should be multiple commits).

1. Does this do what we want it to do?

Required:

  • Product Management: I have confirmed with the stakeholders that the objectives above are correct and complete.
  • Testing: I have added at least one automated test. Every objective above is represented in at least one test.
    • Tests were added to check for progressive deletion
  • Testing: I have considered likely and/or severe edge cases and have included them in testing.

2. Are the implementation details accurate & efficient?

Required:

  • Logic: I have visually inspected the entire pull request myself.
  • Logic: I have left GitHub comments highlighting important pieces of code logic. I have had these code blocks reviewed by at least one other team member.

3. Is this well documented?

Required:

  • Documentation: by looking at the docs, a new user could easily understand the functionality introduced by this pull request.
    • As a bug fix, this is only correcting expected behavior.

4. Is this code clean?

Required:

  • Readability: The code is as simple as possible and well-commented, such that a new team member could understand what's happening.
    • The refactoring significantly increases readability.
  • Pre-commit checks: All the pre-commits checks have passed.

If applicable:

  • Software architecture: I have discussed relevant trade-offs in design decisions with at least one other team member. It is unlikely that this pull request will increase tech debt.

@forsyth2 forsyth2 self-assigned this Jan 3, 2026
@forsyth2 forsyth2 added semver: bug Bug fix (will increment patch version) Globus Globus labels Jan 3, 2026
@forsyth2 forsyth2 mentioned this pull request Jan 3, 2026
16 tasks

forsyth2 commented Jan 3, 2026

Action items:


forsyth2 commented Jan 6, 2026

All tests are passing now. Self-review guide from Claude:

Self-Review Guide for Progressive Tar File Deletion Fix

Overview

This diff fixes the issue where tar files weren't being deleted after successful Globus transfers when --keep is False. It introduces a TransferManager class to track transfers and delete files progressively.

Key Changes to Review

1. New Transfer Tracking System (transfer_tracking.py)

  • TransferManager class: Does it correctly maintain state across multiple transfers?
  • TransferBatch class: Are file paths being tracked correctly for each batch?
  • delete_successfully_transferred_files():
    • Does it properly check Globus task status before deletion?
    • Does it handle both Globus and HPSS transfers correctly?
    • Are files only deleted once (empty file_paths list after deletion)?

2. Global Variable Elimination

Check that these global variables are properly replaced:

  • remote_endpoint → globus_config.remote_endpoint
  • local_endpoint → globus_config.local_endpoint
  • transfer_client → globus_config.transfer_client
  • transfer_data → batch.transfer_data
  • task_id → batch.task_id
  • archive_directory_listing → globus_config.archive_directory_listing
  • global_variable_tarfiles_pushed → transfer_manager.cumulative_tarfiles_pushed
  • prev_transfers, curr_transfers → removed (logic now in TransferManager)
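The mapping above suggests the shape of the replacement objects. A minimal sketch of what they might look like as plain data holders (field names are taken from the mapping; defaults, types, and the dataclass layout are assumptions, not the PR's actual code):

```python
from dataclasses import dataclass, field
from typing import Any, List, Optional


@dataclass
class GlobusConfig:
    # Former module-level globals, now instance attributes.
    remote_endpoint: Optional[str] = None
    local_endpoint: Optional[str] = None
    transfer_client: Optional[Any] = None  # a globus_sdk.TransferClient
    archive_directory_listing: Optional[Any] = None


@dataclass
class TransferBatch:
    # Former globals transfer_data / task_id now live on each batch.
    transfer_data: Optional[Any] = None  # a globus_sdk.TransferData
    task_id: Optional[str] = None
    task_status: str = "UNKNOWN"
    file_paths: List[str] = field(default_factory=list)


@dataclass
class TransferManager:
    globus_config: Optional[GlobusConfig] = None
    batches: List[TransferBatch] = field(default_factory=list)
    cumulative_tarfiles_pushed: int = 0
```

With this shape, every function that previously mutated module globals instead receives a `TransferManager` (or reads `manager.globus_config`), which is what makes the state flow reviewable.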

3. TransferManager Threading

  • Is a single TransferManager instance created in create() and passed through all functions?
  • Is the same instance used in update()?
  • Does hpss_get() create its own instance (acceptable since it's a separate operation)?

4. Batch Creation and Management (hpss.py::hpss_transfer())

  • Are batches created at the right time (before adding files)?
  • Are files added to the current batch correctly?
  • Is index.db excluded from deletion tracking (is_index check)?
  • Does the code handle the --keep flag correctly (never track files when keep=True)?

5. Transfer Submission (globus.py::globus_transfer())

  • After submitting a transfer, is the batch updated with task_id and task_status?
  • Is transfer_data set to None after submission?
  • Does the function handle the case where a previous transfer is still ACTIVE?
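The post-submission bookkeeping in this checklist can be sketched as a small helper (illustrative only; the function name is invented, and the `task_doc` keys follow the Globus SDK's submit_transfer response, which includes "task_id" and a "code" field):

```python
def record_submission(batch, task_doc):
    """After submit_transfer succeeds, record the task on the batch and
    drop the TransferData so the same batch is never submitted twice."""
    batch.task_id = task_doc["task_id"]
    batch.task_status = task_doc.get("code", "UNKNOWN")
    batch.transfer_data = None  # submitted; must not be reused
```

Setting `transfer_data` to None after submission is the invariant item 5 asks the reviewer to verify.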

6. Deletion Trigger Points

Review where delete_successfully_transferred_files() is called:

  • After each transfer (hpss.py::hpss_transfer()): Only when keep=False
  • At finalization (globus.py::globus_finalize()): After all transfers complete
  • Are these the right trigger points for both blocking and non-blocking modes?

7. Blocking vs Non-Blocking Behavior

Blocking mode:

  • Does globus_transfer() wait for completion via globus_block_wait()?
  • Are files deleted immediately after each successful transfer?
  • Does globus_finalize() still work correctly (redundant waits are harmless)?

Non-blocking mode:

  • Does globus_transfer() skip the blocking wait?
  • Does delete_successfully_transferred_files() check task status non-blockingly?
  • Are files deleted when status is checked and found to be SUCCEEDED?
  • Does globus_finalize() wait for all transfers before final cleanup?

8. Finalization Logic (globus.py::globus_finalize())

  • Does it handle pending transfer_data that hasn't been submitted?
  • Does it wait for the most recent transfer to complete?
  • Does it wait for the last task (if different from most recent)?
  • Does it avoid waiting twice on the same task_id?
  • Does it call delete_successfully_transferred_files() at the end?

9. Error Handling

  • Are Optional types used correctly for TransferManager, GlobusConfig, etc.?
  • Are None checks in place before accessing attributes?
  • Does hpss_get() create necessary objects when they don't exist?

10. Test Coverage (test_globus_tar_deletion.bash)

  • New test function: test_globus_progressive_deletion()
    • Creates ~2GB of test files to trigger multiple tars
    • Verifies multiple tar archives are created
    • Checks for deletion events during the run
    • Verifies no tar files remain in source
    • Verifies all tar files exist in destination
  • Are both blocking and non-blocking progressive deletion tests run?
  • Does the test correctly differentiate expected deletion behavior between modes?

11. Logging and Debugging

  • Are there sufficient debug logs to track batch creation and file tracking?
  • Are deletion events logged clearly?
  • Is the -v flag added to test commands for verbose output?

12. Edge Cases

  • What happens if a transfer fails? (Files won't be deleted - correct)
  • What happens if --keep is True? (Files never tracked for deletion - correct)
  • What happens with index.db? (Never deleted - correct via is_index flag)
  • What happens with multiple tar files in one run? (Progressive deletion should work)
  • What happens if the last transfer is still pending at finalization? (Should wait)

13. Code Cleanup

  • Are all commented-out debug statements removed or uncommented appropriately?
  • Is the old global variable cleanup code removed from hpss_transfer()?
  • Are all function signatures updated with transfer_manager parameter?

14. Backward Compatibility

  • Does the change affect HPSS (non-Globus) transfers? (Should still work)
  • Does it work with --keep flag? (Yes, files never tracked)
  • Does it work in both blocking and non-blocking modes? (Yes, different deletion timing)

Specific Potential Issues to Check

Critical Path Review

  1. File Addition Flow:

    hpss_transfer() → creates batch → adds file to batch.file_paths → 
    globus_transfer() → submits → sets batch.task_id
    
  2. Deletion Flow (Blocking):

    globus_transfer() → globus_block_wait() → returns SUCCEEDED →
    hpss_transfer() → delete_successfully_transferred_files() → checks status → deletes
    
  3. Deletion Flow (Non-blocking):

    globus_transfer() → returns early (not SUCCEEDED) →
    later: delete_successfully_transferred_files() → checks status → if SUCCEEDED, deletes
    

Potential Race Conditions

  • Can a batch be submitted while another is being checked? (Should be sequential)
  • Can files be deleted while still being transferred? (No, status check prevents this)

Memory Leaks

  • Are transfer_data objects properly cleaned up after submission?
  • Are batch file_paths lists cleared after deletion?

Testing Checklist

  • Run basic blocking keep test
  • Run basic blocking non-keep test
  • Run basic non-blocking keep test
  • Run basic non-blocking non-keep test
  • Run progressive deletion blocking test
  • Run progressive deletion non-blocking test
  • Verify no tar files remain in source (non-keep cases)
  • Verify tar files exist in destination
  • Check logs for deletion events


forsyth2 commented Jan 6, 2026

Follow-up:

Performance Review for Progressive Tar File Deletion

Performance Concerns to Review

1. Status Check Overhead

Current Implementation:

def delete_successfully_transferred_files(self):
    for batch in self.batches:
        if batch.is_globus and batch.task_id and (batch.task_status != "SUCCEEDED"):
            if self.globus_config and self.globus_config.transfer_client:
                task = self.globus_config.transfer_client.get_task(batch.task_id)
                batch.task_status = task["status"]

Issues:

  • O(n) status checks: Iterates through ALL batches every time delete_successfully_transferred_files() is called
  • Redundant API calls: Checks status of already-succeeded batches (though the != "SUCCEEDED" check prevents the API call)
  • Multiple calls per file: Called after EVERY hpss_put() in non-blocking mode

Impact:

  • For a run with 100 tar files, this could mean 100+ iterations through the batch list
  • Each call to Globus API adds latency (~100-500ms per call)

Potential Optimizations:

  • Only check batches that haven't been processed yet (already done via if not batch.file_paths: continue)
  • Add index tracking: self.last_checked_batch_index to avoid re-checking old batches
  • Batch status checks: collect multiple task_ids and check them together (if Globus SDK supports it)
  • Rate limit checks: only check every N seconds or every N file additions

2. Batch List Growth

Current Implementation:

self.batches: List[TransferBatch] = []
# Grows unbounded throughout the run

Issues:

  • Memory growth: For runs with 1000+ tar files, this list grows to 1000+ items
  • Iteration overhead: Each delete_successfully_transferred_files() iterates the entire list

Potential Optimizations:

  • Clear processed batches: Remove batches from list once files are deleted
  • Use a deque with max length
  • Separate "pending" and "completed" lists

3. File Deletion Performance

Current Implementation:

def delete_files(self):
    for src_path in self.file_paths:
        if os.path.exists(src_path):
            os.remove(src_path)

Issues:

  • Sequential deletion: Deletes files one at a time
  • Redundant existence checks: os.path.exists() adds a syscall per file

Impact:

  • For tar files (large files), deletion is I/O bound, so sequential is probably fine
  • But the os.path.exists() check is wasteful if we already know the file exists

Potential Optimizations:

  • Remove the os.path.exists() check and handle exceptions instead
  • Consider bulk deletion if filesystem supports it
  • Log warning instead of failing if file doesn't exist

4. Non-Blocking Mode Efficiency

Current Behavior:

# After EVERY hpss_put() call:
if not keep:
    transfer_manager.delete_successfully_transferred_files()

Issues:

  • Excessive checking: In non-blocking mode, checks status after every single tar file is queued
  • Wasted API calls: Most checks will return "ACTIVE" or "PENDING"
  • No batching benefit: Defeats the purpose of batching transfers

Example:

  • Add tar 1 → check status (PENDING)
  • Add tar 2 → check status (ACTIVE)
  • Add tar 3 → check status (ACTIVE)
  • Add tar 4 → check status (ACTIVE)
  • Add tar 5 → check status (SUCCEEDED for tar 1, ACTIVE for 2-4)

This means 5 status checks when only 1-2 would be needed.

Potential Optimizations:

  • Throttle checks: Only check every N tar files or every M seconds
  • Check only recent batches: Don't iterate through all old batches every time
  • Progressive threshold: Only check if X tar files have accumulated

5. Globus Transfer Batching

Current Implementation:

# Creates new batch if last one was submitted
if not transfer_manager.batches or transfer_manager.batches[-1].task_id:
    new_batch = TransferBatch()

Questions:

  • Batch size: How many files are in each TransferData before submission?
  • Submission trigger: When is a batch actually submitted to Globus?
  • Optimal batch size: Is there a maximum batch size for Globus transfers?

Looking at the code:
The batch submission happens in globus_transfer(), but it's called after EVERY hpss_put(). This means:

  • One file per transfer?: Each tar file might be its own transfer task
  • No actual batching?: The batch tracking is for deletion, not for combining transfers

Potential Optimizations:

  • Accumulate multiple tar files before submitting to Globus
  • Submit every N files or when total size reaches threshold
  • Use Globus's native batching capabilities more effectively

6. Test Performance Impact

New Progressive Deletion Tests:

dd if=/dev/zero of=zstash_demo/file1.dat bs=1M count=700  # 700 MB
dd if=/dev/zero of=zstash_demo/file2.dat bs=1M count=700  # 700 MB  
dd if=/dev/zero of=zstash_demo/file3.dat bs=1M count=700  # 700 MB

Issues:

  • 2.1 GB of test data: Takes significant time to create and transfer
  • CI/CD impact: Will these tests timeout in automated testing?
  • Disk space: Requires sufficient space in test environment

Recommendations:

  • Document expected test duration
  • Consider making these optional or only running on specific platforms
  • Add timeout configuration
  • Clean up test files afterwards

7. Finalization Performance

Current Implementation:

def globus_finalize(transfer_manager: TransferManager, non_blocking: bool = False):
    # Submit any pending transfer_data
    # Wait for most recent transfer
    # Wait for last task (if different)
    # Delete successfully transferred files

Issues:

  • Double wait: Potentially waits for same task_id twice (though skip_last_wait mitigates)
  • Blocking at end: Even in non-blocking mode, finalization blocks on all transfers

Questions:

  • Is the finalization wait necessary? (Probably yes, to ensure index.db transfer completes)
  • Can we return earlier in non-blocking mode? (No, because index.db must complete)

Recommended Performance Improvements

Priority 1: High Impact, Low Effort

  1. Optimize batch iteration:
def delete_successfully_transferred_files(self):
    # Only check batches that haven't been processed yet
    batches_to_check = [b for b in self.batches if b.file_paths]  # Has files to delete
    
    for batch in batches_to_check:
        # ... existing logic
  1. Remove redundant os.path.exists():
def delete_files(self):
    for src_path in self.file_paths:
        try:
            os.remove(src_path)
        except FileNotFoundError:
            logger.warning(f"File already deleted: {src_path}")
  1. Throttle status checks in non-blocking mode:
# In hpss.py::hpss_transfer()
if not keep:
    # Only check every 5 files or if this is the last file
    if (transfer_manager.cumulative_tarfiles_pushed % 5 == 0) or is_last_file:
        transfer_manager.delete_successfully_transferred_files()

Priority 2: Medium Impact, Medium Effort

  1. Track last checked batch:
class TransferManager:
    def __init__(self):
        self.batches: List[TransferBatch] = []
        self.last_deletion_check_index: int = 0  # New field
    
    def delete_successfully_transferred_files(self):
        # Only check batches from last_deletion_check_index forward
        for i in range(self.last_deletion_check_index, len(self.batches)):
            batch = self.batches[i]
            # ... check and delete logic
            if batch.file_paths == []:  # Processed
                self.last_deletion_check_index = i + 1
  1. Clear old batches to prevent memory growth:
def delete_successfully_transferred_files(self):
    # ... existing logic
    
    # Remove fully processed batches
    self.batches = [b for b in self.batches if b.file_paths or not b.task_id]

Priority 3: Lower Priority / More Investigation Needed

  1. Consider time-based throttling:
class TransferManager:
    def __init__(self):
        self.last_status_check_time: float = 0
    
    def delete_successfully_transferred_files(self):
        now = time.time()
        if now - self.last_status_check_time < 30:  # Don't check more than every 30s
            return
        self.last_status_check_time = now
        # ... existing logic
  1. Investigate actual Globus batching:
    • Review how TransferData accumulates files
    • Ensure multiple tar files are combined into single transfer tasks when possible
    • This might already be working correctly; needs verification

Performance Testing Checklist

  • Profile a run with 100+ tar files in non-blocking mode
  • Count number of Globus API calls vs number of tar files
  • Measure memory usage growth over long runs
  • Time the delete_successfully_transferred_files() function
  • Check if status checks are the bottleneck or file I/O is
  • Test with different batch sizes and throttling parameters
  • Verify the progressive deletion tests don't exceed reasonable timeouts

Questions to Answer

  1. What's the typical number of tar files in a real zstash run?

    • 10s? 100s? 1000s?
    • Determines urgency of optimizations
  2. What's the acceptable overhead?

    • If transfers take hours, a few extra seconds of status checks is negligible
    • If transfers take minutes, overhead becomes significant
  3. Is the current implementation already good enough?

    • The batch.file_paths check prevents processing old batches
    • The != "SUCCEEDED" check prevents redundant API calls
    • Maybe performance is already acceptable?
  4. Do we need metrics?

    • Add counters for number of API calls
    • Track time spent in deletion checks
    • Log performance statistics at end of run
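If metrics do turn out to be needed, a lightweight counter could answer questions 1-3 empirically. A sketch (hypothetical helper; `DeletionMetrics` and `timed_check` are invented names, not part of zstash):

```python
import time


class DeletionMetrics:
    """Counts status-check API calls and time spent in deletion passes,
    intended to be logged once at the end of a run."""

    def __init__(self):
        self.api_calls = 0
        self.seconds_in_checks = 0.0

    def timed_check(self, fn, *args):
        """Run one status check, recording count and wall-clock cost."""
        start = time.monotonic()
        try:
            return fn(*args)
        finally:
            self.api_calls += 1
            self.seconds_in_checks += time.monotonic() - start

    def summary(self):
        return (f"deletion checks: {self.api_calls} API calls, "
                f"{self.seconds_in_checks:.2f}s total")
```

Wrapping each `get_task` call in `timed_check` would show directly whether status checks are negligible relative to transfer time.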

@forsyth2 forsyth2 force-pushed the issue-374-refactor-tar-deletion branch from 064029c to ed3248d Compare March 4, 2026 21:37

forsyth2 commented Mar 4, 2026

I've rebased off the latest main and added a fix for a pre-commit issue introduced while resolving conflicts with main.

I ran the tar deletion test and got:

==========================================
TEST RESULTS
==========================================
✓ blocking_non-keep PASSED
✓ non-blocking_non-keep PASSED
✓ blocking_keep PASSED
✓ non-blocking_keep PASSED
✓ blocking_progressive_deletion PASSED
✓ non-blocking_progressive_deletion PASSED
==========================================
TEST SUMMARY
==========================================
Total tests: 6
Passed: 6
Failed: 0
==========================================
All globus tar deletion tests completed successfully.

The AI-generated reviews above are excessively comprehensive, so I will tag Copilot to review this PR and see what it deems relevant.


Copilot AI left a comment


Pull request overview

Refactors Globus transfer handling to ensure local tar files are deleted after successful transfers even when --non-blocking is used, addressing issue #374.

Changes:

  • Introduces a TransferManager / TransferBatch model to track submitted transfers and associated local files for later deletion.
  • Threads the transfer manager through create/update → add_files → hpss_put → globus_transfer, and performs deletion based on task status checks.
  • Expands the Globus tar deletion integration test coverage (including a new “progressive deletion” scenario).

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 5 comments.

Summary per file:

  • zstash/update.py: Wires TransferManager through the update flow and into finalization.
  • zstash/transfer_tracking.py: New module defining transfer tracking data structures and deletion logic.
  • zstash/hpss_utils.py: Passes TransferManager into hpss_put from the tar creation loop.
  • zstash/hpss.py: Refactors deletion logic to rely on TransferManager rather than prior global lists.
  • zstash/globus.py: Refactors Globus state handling to use TransferManager/GlobusConfig instead of module globals.
  • zstash/create.py: Wires TransferManager through the create flow and into finalization.
  • tests/integration/bash_tests/run_from_any/test_globus_tar_deletion.bash: Adds verbose logging and a new progressive-deletion integration test scenario.
Comments suppressed due to low confidence (1)

zstash/hpss.py:15

  • The refactor leaves prev_transfers/curr_transfers as unused globals. Since deletion is now handled via TransferManager, keeping these around is confusing and may trigger linting warnings. Consider removing them if they are no longer part of the design.
prev_transfers: List[str] = list()
curr_transfers: List[str] = list()


zstash/globus.py Outdated
Comment on lines +124 to +128
  if mrt.task_status == "ACTIVE":
      logger.info(
-         f"{ts_utc()}: Previous task_id {task_id} Still Active. Returning ACTIVE."
+         f"{ts_utc()}: Previous task_id {mrt.task_id} Still Active. Returning ACTIVE."
      )
      return "ACTIVE"
  elif prev_task_status == "SUCCEEDED":
      # Don't return early - continue to submit the new transfer

Copilot AI Mar 4, 2026


The ACTIVE-task handling no longer returns early; it now proceeds to submit a new transfer while the previous one is still active. This changes behavior from “single in-flight task” to potentially many concurrent tasks, which can hit Globus service limits and complicate operational debugging. If concurrency is not intended, consider restoring the early return/queueing behavior or adding an explicit cap.

forsyth2 (Collaborator Author) replied:

This is the expected behavior of --non-blocking, not an issue.

Comment on lines 209 to +213
  if transfer_type == "put":
      return task_status

  if transfer_type == "get" and task_id:
-     globus_wait(task_id)
+     globus_wait(transfer_manager.globus_config.transfer_client, task_id)

Copilot AI Mar 4, 2026


globus_transfer() returns task_status, but that variable is initialized to "UNKNOWN" and never updated from the actual task result (blocking mode only updates new_mrt.task_status). As written, the function will always return "UNKNOWN" for both put and get. Consider returning new_mrt.task_status (or setting task_status from the wait result) so callers can rely on the return value.

zstash/create.py Outdated
if args.follow_symlinks:
raise Exception("Archive creation failed due to broken symlink.")
else:
raise e

Copilot AI Mar 4, 2026


Catching FileNotFoundError as e and then using raise e drops the original traceback context. Use a bare raise to re-raise the original exception (or wrap it with raise ... from e if you need a different exception type).

Suggested change:
-     raise e
+     raise

zstash/update.py Outdated
if args.follow_symlinks:
raise Exception("Archive update failed due to broken symlink.")
else:
raise e

Copilot AI Mar 4, 2026


Catching FileNotFoundError as e and then using raise e drops the original traceback context. Use a bare raise to re-raise the original exception (or raise ... from e if you need to add context).

Suggested change:
-     raise e
+     raise

zstash/globus.py Outdated
Comment on lines +357 to +375
# In non-blocking mode, this ensures index.db and any accumulated tar files complete
# In blocking mode, this is redundant but harmless
skip_last_wait: bool = False
if transfer and transfer.task_id:
if transfer.task_id == last_task_id:
skip_last_wait = (
True # No reason to call globus_wait twice on the same task_id
)
logger.info(
f"{ts_utc()}: Waiting for transfer task_id={transfer.task_id} to complete"
)
globus_wait(
transfer_manager.globus_config.transfer_client, transfer.task_id
)
if last_task_id and (not skip_last_wait):
logger.info(
f"{ts_utc()}: Waiting for last transfer task_id={last_task_id} to complete"
)
globus_wait(transfer_manager.globus_config.transfer_client, last_task_id)

Copilot AI Mar 4, 2026


non_blocking is passed into globus_finalize(), but the finalization path still waits for task completion (globus_wait) regardless. This makes --non-blocking effectively block at the end and can dramatically increase runtime for large transfers. Consider skipping the wait logic when non_blocking=True (or rename/remove the flag if always-wait is intended).

Suggested change (original code, followed by the proposed replacement):
# In non-blocking mode, this ensures index.db and any accumulated tar files complete
# In blocking mode, this is redundant but harmless
skip_last_wait: bool = False
if transfer and transfer.task_id:
if transfer.task_id == last_task_id:
skip_last_wait = (
True # No reason to call globus_wait twice on the same task_id
)
logger.info(
f"{ts_utc()}: Waiting for transfer task_id={transfer.task_id} to complete"
)
globus_wait(
transfer_manager.globus_config.transfer_client, transfer.task_id
)
if last_task_id and (not skip_last_wait):
logger.info(
f"{ts_utc()}: Waiting for last transfer task_id={last_task_id} to complete"
)
globus_wait(transfer_manager.globus_config.transfer_client, last_task_id)
# In non-blocking mode, skip waiting here so the caller can return immediately.
# In blocking mode, this ensures index.db and any accumulated tar files complete.
if non_blocking:
logger.info(
f"{ts_utc()}: Non-blocking mode: skipping wait for Globus "
f"transfer completion during finalize"
)
else:
skip_last_wait: bool = False
if transfer and transfer.task_id:
if transfer.task_id == last_task_id:
# No reason to call globus_wait twice on the same task_id
skip_last_wait = True
logger.info(
f"{ts_utc()}: Waiting for transfer task_id={transfer.task_id} "
f"to complete"
)
globus_wait(
transfer_manager.globus_config.transfer_client, transfer.task_id
)
if last_task_id and (not skip_last_wait):
logger.info(
f"{ts_utc()}: Waiting for last transfer "
f"task_id={last_task_id} to complete"
)
globus_wait(
transfer_manager.globus_config.transfer_client, last_task_id
)

forsyth2 (Collaborator Author) replied:

They should be waiting at the end. globus_finalize is called last by create and update -- Globus transfers should complete before zstash returns.


forsyth2 commented Mar 5, 2026

Addressed relevant code review comments. Confirmed all 6 tar deletion tests still pass. Remaining action items:

  • High-level review of the self-review guide and performance review guide above for any issues that are particularly important to address.
  • Visually review code changes myself
  • Run entire test suite
  • Code review


forsyth2 commented Mar 5, 2026

This is a very-high level pseudo-code algorithm of what is happening with the tars:

create.py > create()
  create.py > create_database()
    hpss_utils.py > add_files()
      For each tar:
        hpss.py > hpss_put()
          hpss.py > hpss_transfer()
            If no batches, create new batch
            Add file to batch
            globus.py > globus_transfer()
              globus_utils.py > submit_transfer_with_checks()
              If blocking (not non_blocking):
                globus.py > globus_block_wait()
            transfer_tracking.py > TransferManager.delete_successfully_transferred_files()
              Reset batches to be only batches with associated files
              For any remaining batches that now have the SUCCESS status:
                Delete the associated files
        Add tar into the database


forsyth2 commented Mar 6, 2026

Confirmed all tests pass on Chrysalis and Perlmutter

@forsyth2 forsyth2 marked this pull request as ready for review March 6, 2026 00:28

forsyth2 commented Mar 6, 2026

@TonyB9000 If you have the time, this could use at least a high-level code review. I also plan to go over it briefly at Monday's technical discussion.

The issue being resolved is #374, which notified us that --non-blocking without --keep wasn't deleting successfully transferred tars. This of course can cause excessive disk space usage.

Here, I've refactored the code to remove global variables and clarify the logic flow (i.e., using TransferManager). I've outlined the very basic pseudocode above.

The test was already added in #404 (which confirms on the main branch that of the 4 possible combinations of blocking & keep, only the non-blocking & non-keep combination behaves incorrectly). This PR adds testing to confirm tars are deleted along the way rather than at the very end.

@chengzhuzhang @golaz ^Just for your awareness. I will discuss more at our next meeting.


TonyB9000 commented Mar 6, 2026

@forsyth2 First a general comment: We never interrupt a tar-file in the middle of tar-file formation. So for clarity, I would employ a fully separate "make_tarfile()" that returns when a size-threshold is reached, and THEN submit that tar-file to a process whose return behavior depends upon "BLOCKING", "transfer success" etc. It bothers me that "add_files()" does FAR MORE than just "add_files". Properly, it could be called "conduct_all_processing()".

The "pseudocode" you supply (really, a call-sequence outline) is thus a bit obscure as well.

My translation (up to the point of my understanding):

We are creating a remote archive:

	Call create()
		Calls create_database()
			Calls add_files()

Now, the innocent looking “add_files()” is actually a very complex and extensive routine.
It performs as follows:

	Given a body of files that can easily span many tar-files and transfers, it adds files to a
        “new or current” tar file until a size-threshold would be breached (or no files remain to be added).
	The completed tar-file is then submitted to

		hpss_put()
			calls hpss_transfer()
				adds tar-file to “batch”
				(conditionally) submits “batch” to globus_transfer()

	Now, here is the conditional part:

	If NON-BLOCKING:  THEN
		If globus transfer is BUSY with a pre-existing batch transfer, we can “store up” tarfiles for transfer
                in a new batch and return immediately to produce more tar-files.
		If globus transfer is NOT BUSY, this tar file is added to the existing batch and that batch
                is submitted for transfer.  (Again, we return immediately to produce more tar-files.)

	If BLOCKING: THEN
		We KNOW (supposedly) that globus is not busy, or else we’d be unable to create and submit a tar-file. 
                So, we “batch” exactly ONE tarfile, and submit it to transfer.
                THEN we call "globus_block_wait()"
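The conditional described above can be sketched as follows. This is an illustrative standalone sketch, not the actual zstash code: `TransferBatch`, `handle_completed_tar`, and `transfer_is_busy` are made-up names, and `transfer_is_busy` stands in for checking the pre-existing batch's task status.

```python
# Illustrative sketch of the blocking/non-blocking branch described above.
# None of these names are the actual zstash API.
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class TransferBatch:
    file_paths: List[str] = field(default_factory=list)
    task_id: Optional[str] = None  # set once the batch is submitted


def handle_completed_tar(
    tarfile_path: str,
    batches: List[TransferBatch],
    non_blocking: bool,
    transfer_is_busy: bool,
) -> str:
    """Decide what to do with a freshly completed tar file."""
    if not batches or batches[-1].task_id:
        # No batch yet, or the last one was already submitted: start a new one.
        batches.append(TransferBatch())
    batches[-1].file_paths.append(tarfile_path)

    if non_blocking and transfer_is_busy:
        # Store up tar files in the pending batch; return to make more tars.
        return "stored"
    # Otherwise submit the pending batch now.
    batches[-1].task_id = "submitted-task"
    # In blocking mode, the caller would now block-wait on this task,
    # so each batch holds exactly one tar file.
    return "submitted" if non_blocking else "submitted-and-waiting"
```

Note that in blocking mode every call submits its batch, so the next call always starts a fresh batch, which matches "we 'batch' exactly ONE tarfile" above.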

Here is what LivChat says about "globus_block_wait()":

Specifically, TransferClient.task_wait(task_id, timeout=..., polling_interval=...):

    Returns True if the task reached a terminal state within the timeout (SUCCEEDED, FAILED, CANCELED).
    Returns False if the timeout expires and the task is still active.

So it is “transfer finished (success or failure)”

Also:

Also, you only break on SUCCEEDED. If the task fails quickly, you will keep looping until you exhaust retries, and you will end up returning EXHAUSTED_TIMEOUT_RETRIES even though the task actually finished as FAILED or CANCELED.

The AI suggests the following improvement:

done = transfer_client.task_wait(task_id, timeout=wait_timeout, polling_interval=polling_interval)
if done:
    curr_task = transfer_client.get_task(task_id)
    return curr_task["status"]
# else, timed out and still running, continue loop until you exhaust retries.
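Expanded into a self-contained sketch: the stub client below stands in for `globus_sdk.TransferClient` (whose `task_wait(task_id, timeout=..., polling_interval=...)` and `get_task(task_id)` methods are real), so the fixed retry loop can be exercised without a live transfer. `block_wait` and `StubTransferClient` are illustrative names.

```python
# Sketch of the suggested fix: stop polling as soon as the task reaches ANY
# terminal state (SUCCEEDED, FAILED, CANCELED), not just SUCCEEDED.
class StubTransferClient:
    """Stand-in for globus_sdk.TransferClient: reports the task as still
    running for a few polls, then as terminal with final_status."""

    def __init__(self, final_status: str, polls_until_done: int):
        self.final_status = final_status
        self.polls_remaining = polls_until_done

    def task_wait(self, task_id, timeout=10, polling_interval=1):
        # The real task_wait returns True once the task is terminal,
        # else False after the timeout expires.
        self.polls_remaining -= 1
        return self.polls_remaining <= 0

    def get_task(self, task_id):
        return {"status": self.final_status}


def block_wait(client, task_id, max_retries=5):
    for _ in range(max_retries):
        if client.task_wait(task_id, timeout=10, polling_interval=1):
            # Terminal: return the actual status instead of looping
            # until retries are exhausted.
            return client.get_task(task_id)["status"]
    return "EXHAUSTED_TIMEOUT_RETRIES"
```

With this shape, a task that fails quickly returns "FAILED" immediately rather than burning through all the retries.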

I have yet to understand, thoroughly, the functionality of TransferManager.delete_successfully_transferred_files().

It appears to be called only upon RETURN from globus_transfer. The code manipulates "batches" - but do we ever have more than 3 batches (one subject to transfer, and one storing up new tar-files, and one whose transfer is completed - successful or otherwise)? If BLOCKING, this would only be 2, as we never "store up" tar-files.

@forsyth2
Collaborator Author

forsyth2 commented Mar 6, 2026

Thanks for the thorough review @TonyB9000. I'm working my way through your suggestions. You're right that zstash could use a lot of refactoring for code readability. I'm trying to pick low-hanging fruit there.

In the process of reviewing logic flows, I noticed what appears to be an error in the code that has persisted for several years: see https://github.com/E3SM-Project/zstash/pull/10/changes#r2898357541

@forsyth2
Collaborator Author

I think maybe the best thing for clarity would be to pull out the create-if-doesn't-exist part to before calling the function.

Oh wait, I see now the creation very much relies on variables defined in that function. I guess we're just going to have to go with a complicated function name.

@forsyth2 forsyth2 force-pushed the issue-374-refactor-tar-deletion branch 2 times, most recently from 17286e1 to 67c880a Compare March 30, 2026 20:02
@forsyth2 forsyth2 force-pushed the issue-374-refactor-tar-deletion branch from 67c880a to f757094 Compare March 30, 2026 20:04
@forsyth2
Collaborator Author

@TonyB9000 Ok how's this new commit look?

@TonyB9000
Collaborator

I'll do another pull so I can see everything in context.

Long ago, all of the remote-transfer operations occurred under a function called "add_files". Now, it is all embedded under a function called "construct_tars()". It works, but just feels weird. It is like embedding a process that creates humans, all under a function called "adjust_atoms()".

@forsyth2
Collaborator Author

forsyth2 commented Apr 3, 2026

@TonyB9000 Responding to your email from yesterday, 4/2:

At the top of hpss_transfer(), we ensure that the transfer_manager has at least one (deletion) batch, (and we add the tarfile to it if it is ok to eventually delete it.

hpss.hpss_transfer():

    if not transfer_manager.batches or transfer_manager.batches[-1].task_id:
        # Either no batches exist, or the last batch was already submitted
        new_batch = TransferBatch()
        new_batch.is_globus = scheme == "globus"
        transfer_manager.batches.append(new_batch)
        logger.debug(
            f"{ts_utc()}: Created new TransferBatch, total batches: {len(transfer_manager.batches)}"
        )

mrb = latest_deletion_batch   (from the transfer_manager, since it MUST exist)

globus.globus_transfer():

mrb: Optional[TransferBatch] = transfer_manager.get_most_recent_batch()

Correct, it must exist.

update_cumulative_tarfiles_pushed(transfer_manager, transfer_data)

globus.globus_transfer():

update_cumulative_tarfiles_pushed(transfer_manager, transfer_data)

task = submit_transfer_with_checks()

globus.globus_transfer():

        task = submit_transfer_with_checks(
            transfer_manager.globus_config.transfer_client, transfer_data
        )

Did we obtain a new “transfer_manager”? At the top, we KNEW that our transfer_manager had a “mrb”, but here we test if it has “batches”.

        # Update the current batch with the task info
        # The batch was already created in hpss_transfer with files added to it
        # We just need to mark it as submitted
        if transfer_manager.batches:
            # Update these two fields of the most recent batch
            # (which is still available in this function as `mrb`).
            transfer_manager.batches[-1].task_id = task_id
            transfer_manager.batches[-1].task_status = TaskStatus.SUBMITTED

No, there is no new transfer manager. We do in fact know we have batches -- hence the else block:

            # This block should be impossible to reach.
            # By now, we've ensured that `get_most_recent_batch()` returns a batch,
            # and we haven't removed any batches since then,
            # so there should always be at least one batch in `batches`.
            error_str = "transfer_manager has no batches"
            logger.error(error_str)
            raise RuntimeError(error_str)

Now, you wonder, why bother with this check then? We do so because this is the first time we reference transfer_manager.batches in globus_transfer(), and that parameter has type List[TransferBatch], which could be the empty list; the check just confirms that batches[-1] does in fact exist. (I can't remember if the type checker mypy cares about this, though.) It's really just an acknowledgement that lists in general can be empty (but this one shouldn't be, since we did check the batches earlier!)

I am having trouble telling when transfer_managers come and go.

TransferManager is constructed exactly once per zstash call. Unlike the batch list, which is used only to track files to delete, the TransferManager is always needed.

git grep -n "TransferManager("
# zstash/create.py:57:    transfer_manager: TransferManager = TransferManager()
# zstash/hpss.py:26:        transfer_manager = TransferManager()
# zstash/hpss.py:184:        transfer_manager = TransferManager()
# zstash/update.py:25:    transfer_manager = TransferManager()

create and update construct their TransferManager object themselves because they use it in several places besides the hpss.hpss_get() and hpss.hpss_put() calls.

hpss.hpss_get():

    if not transfer_manager:
        transfer_manager = TransferManager()
git grep -n "hpss_get("
# zstash/extract.py:178:            hpss_get(hpss, get_db_filename(cache), cache)
# zstash/extract.py:554:                        hpss_get(hpss, tfname, cache)
# zstash/hpss.py:173:def hpss_get(
# zstash/ls.py:105:                hpss_get(hpss, get_db_filename(cache), cache)
# zstash/update.py:164:            hpss_get(hpss, get_db_filename(cache), cache, transfer_manager)

So, hpss.hpss_get() is used by extract, ls, and update. Of those, only update passes in a transfer_manager. So, the other cases require construction of the TransferManager.

hpss.hpss_transfer():

    if not transfer_manager:
        transfer_manager = TransferManager()
git grep -n "hpss_transfer("
# zstash/hpss.py:15:def hpss_transfer(
# zstash/hpss.py:161:    hpss_transfer(
# zstash/hpss.py:187:    hpss_transfer(

hpss.hpss_transfer() is called in 2 spots. One is hpss.hpss_get() which is discussed above -- it will always pass in a non-None value for transfer_manager. The other is hpss.hpss_put():

git grep -A 8 -n "hpss_put("
# zstash/create.py:97:    logger.debug(f"{ts_utc()}: calling hpss_put() for {get_db_filename(cache)}")
# zstash/create.py:98:    hpss_put(
# zstash/create.py-99-        hpss,
# zstash/create.py-100-        get_db_filename(cache),
# zstash/create.py-101-        cache,
# zstash/create.py-102-        keep=args.keep,
# zstash/create.py-103-        is_index=True,
# zstash/create.py-104-        transfer_manager=transfer_manager,
# zstash/create.py-105-    )
# zstash/create.py-106-
# --
# zstash/hpss.py:149:def hpss_put(
# zstash/hpss.py-150-    hpss: str,
# zstash/hpss.py-151-    file_path: str,
# zstash/hpss.py-152-    cache: str,
# zstash/hpss.py-153-    keep: bool = True,
# zstash/hpss.py-154-    non_blocking: bool = False,
# zstash/hpss.py-155-    is_index=False,
# zstash/hpss.py-156-    transfer_manager: Optional[TransferManager] = None,
# zstash/hpss.py-157-):
# --
# zstash/hpss_utils.py:139:        hpss_put(
# zstash/hpss_utils.py-140-            hpss,
# zstash/hpss_utils.py-141-            os.path.join(cache, self.tfname),
# zstash/hpss_utils.py-142-            cache,
# zstash/hpss_utils.py-143-            keep,
# zstash/hpss_utils.py-144-            non_blocking,
# zstash/hpss_utils.py-145-            is_index=False,
# zstash/hpss_utils.py-146-            transfer_manager=transfer_manager,
# zstash/hpss_utils.py-147-        )
# --
# zstash/update.py:39:    hpss_put(
# zstash/update.py-40-        hpss,
# zstash/update.py-41-        get_db_filename(cache),
# zstash/update.py-42-        cache,
# zstash/update.py-43-        keep=args.keep,
# zstash/update.py-44-        is_index=True,
# zstash/update.py-45-        transfer_manager=transfer_manager,
# zstash/update.py-46-    )
# zstash/update.py-47-

So, hpss.hpss_put() is used by create and update, both of which pass in a transfer_manager. The remaining occurrence is the call by hpss_utils.TarWrapper.process_tars(), which is ultimately used, again, by both create and update.

You can see that unlike hpss.hpss_get(), hpss.hpss_put() is always passed a non-None transfer_manager, so really we could probably just change the function signature from transfer_manager: Optional[TransferManager] = None to transfer_manager: TransferManager.
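One wrinkle worth noting: in Python, a required parameter cannot follow defaulted ones unless it is made keyword-only (the bare `*`). Since the call sites in the grep output above already pass `transfer_manager=` by keyword, that restriction is harmless. A hypothetical sketch of the tightened signature (the `TransferManager` stub and return value are just for illustration):

```python
# Hypothetical sketch of the tightened hpss_put signature.
class TransferManager:  # stand-in for the real class
    pass


def hpss_put(
    hpss: str,
    file_path: str,
    cache: str,
    keep: bool = True,
    non_blocking: bool = False,
    is_index: bool = False,
    *,  # everything after this must be passed by keyword
    transfer_manager: TransferManager,  # required, no longer Optional
) -> TransferManager:
    # Real body omitted; returning the manager just for demonstration.
    return transfer_manager
```

Forgetting to pass `transfer_manager` now fails loudly with a TypeError at the call site, instead of silently constructing a fresh manager inside the function.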

some flowcharts to consider

Flowchart 1

Call hierarchy & some pseudo-code:

create.create()
  globus.globus_activate()
  create.create_database() # Put the files
    utils.create_tars_table()
    utils.get_files_to_archive_with_stats()
    hpss_utils.construct_tars()
      while we have files to archive:
        hpss_utils.TarWrapper() # Opens a new tar
        while we have files to archive AND adding a new file won't send us over the size limit:
          hpss_utils.TarWrapper.process_file() # "Stuff New Tarfile to Limit"
            hpss_utils.add_file_to_tar_archive()
        hpss_utils.TarWrapper.process_tar()
          Close the tar
          hpss_utils.hpss_put() # "Submit Tarfile for Transfer", see Flowchart 2
          Add the tar itself to the tars table
          Add the files included in this tar to the files table
    "commit/close database"
  hpss.hpss_put() # Put the index.db
  globus.globus_finalize()

Comments on Flowchart 1:

  • create_tars is really the function construct_tars.
  • I would replace "more files?" with "any files not yet added to a tar?"

Overall, Flowchart 1 looks good.
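The size-threshold loop at the heart of Flowchart 1 can be sketched as a standalone illustration (names like `construct_tars_sketch` are made up; the real code interleaves database writes and hpss_put() calls):

```python
# Standalone illustration of Flowchart 1's inner loop: pack files into
# successive tar archives, closing each tar before the next file would
# push it over the size limit.
import os
import tarfile
from typing import List


def construct_tars_sketch(file_paths: List[str], out_dir: str, size_limit: int) -> List[str]:
    tar_paths: List[str] = []
    remaining = list(file_paths)
    while remaining:  # any files not yet added to a tar?
        tar_path = os.path.join(out_dir, f"{len(tar_paths):06d}.tar")
        current_size = 0
        with tarfile.open(tar_path, "w") as tar:
            # "Stuff New Tarfile to Limit"
            while remaining and current_size + os.path.getsize(remaining[0]) <= size_limit:
                path = remaining.pop(0)
                tar.add(path, arcname=os.path.basename(path))
                current_size += os.path.getsize(path)
            if current_size == 0 and remaining:
                # Single file larger than the limit: archive it alone.
                path = remaining.pop(0)
                tar.add(path, arcname=os.path.basename(path))
        tar_paths.append(tar_path)
        # Here the real code would submit tar_path via hpss_put().
    return tar_paths
```

This also shows why the tar is never interrupted mid-formation: the threshold is checked before adding each file, and the tar is closed before submission.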

Flowchart 2a (left side)

Call hierarchy & some pseudo-code:

hpss_utils.hpss_put() # "Submit Tarfile for Transfer" from Flowchart 1
  hpss_utils.hpss_transfer()
    "ensure_transfer_manager" (if not transfer_manager)
    "ensure_transfer_manager_deletion_batch" (if not transfer_manager.batches or transfer_manager.batches[-1].task_id)
    if (not keep) and (not is_index): # We don't need to track files for deletion if we're keeping them; we never delete the index
      transfer_manager.batches[-1].file_paths.append(file_path) # "Add tarpath to latest deletion_batch"
    if scheme == "globus":
      globus.globus_transfer() # See Flowchart 2b
    if not keep:
      transfer_manager.delete_successfully_transferred_files()

Overall, Flowchart 2a looks good.
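On the earlier open question about TransferManager.delete_successfully_transferred_files(): one plausible reading, sketched with illustrative names (the actual implementation may differ), is that it walks the batches and deletes only the files in batches whose transfer reached SUCCEEDED, leaving pending or failed batches for a later pass:

```python
# Hypothetical sketch of delete_successfully_transferred_files();
# all names are illustrative, not the actual zstash code.
import os
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional


class TaskStatus(Enum):
    SUBMITTED = "SUBMITTED"
    ACTIVE = "ACTIVE"
    SUCCEEDED = "SUCCEEDED"
    FAILED = "FAILED"


@dataclass
class TransferBatch:
    file_paths: List[str] = field(default_factory=list)
    task_id: Optional[str] = None
    task_status: Optional[TaskStatus] = None


def delete_successfully_transferred_files(batches: List[TransferBatch]) -> List[str]:
    deleted: List[str] = []
    for batch in batches:
        if batch.task_status is not TaskStatus.SUCCEEDED:
            continue  # still pending, active, or failed: keep the tar files
        for path in batch.file_paths:
            if os.path.exists(path):
                os.remove(path)
            deleted.append(path)
        batch.file_paths = []  # nothing left to delete in this batch
    return deleted
```

Under this reading, the batch count stays small: at most one batch being filled, one in flight, and completed ones that have already been emptied.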

Flowchart 2b (right side)

Call hierarchy & some pseudo-code:

globus.globus_transfer() # Picking up from Flowchart 2a
  "ensure_globus_activate" (if (not transfer_manager.globus_config) or (not transfer_manager.globus_config.transfer_client):)
  mrb: Optional[TransferBatch] = transfer_manager.get_most_recent_batch()
  "ensure endpoint IDs set" (if transfer_manager.globus_config.local_endpoint:, if transfer_manager.globus_config.remote_endpoint:)
  "ensure transfer_data" (transfer_data = mrb.transfer_data or transfer_data = create_TransferData())
  add_file_to_TransferData # Adding TAR file, that is
  if mrb.task_id:
    if mrb.task_status == TaskStatus.ACTIVE:
      if non_blocking:
        return TaskStatus.ACTIVE # Go back to Flowchart 2a now
      else ERROR, actually raise Exception
    elif mrb.task_status == TaskStatus.SUCCEEDED:
      Proceed
    else ERROR, do not actually raise Exception (change to logger.warning?)
  globus.update_cumulative_tarfiles_pushed()
  globus_utils.submit_transfer_with_checks() # As far as Flowchart 2b currently goes
  if mrb.task_id:
    if not non_blocking: # I.e., IS blocking
      globus.globus_block_wait()
  if transfer_type == "get" and task_id:
    globus.globus_wait()

Comments on Flowchart 2b:

  • I would have the "globus?" arrow point to the start box. Right now, it looks like globus_transfer() is called completely independently, which is not the case.

Otherwise, Flowchart 2b looks good.

Action items for me

  • really we could probably just change the function call signature from using transfer_manager: Optional[TransferManager] = None to using transfer_manager: TransferManager.
  • (change to logger.warning?)

@forsyth2
Collaborator Author

forsyth2 commented Apr 3, 2026

The new commit addresses those 2 action items. I haven't rerun the tests, but none of these changes should affect functionality. In particular, mypy would catch the type errors.

@TonyB9000
Collaborator

@forsyth2

"I would replace "more files?" with "any files not yet added to a tar?"

Try writing that in that tiny little diamond :)

"I would have the "globus?" arrow point to the start box"

I was being creative there. By pointing to the entire "box" for "globus_transfer", it should imply invoking at "start", as there is nowhere else to start.

I am trying to figure out how to fit the rest of the logic on that diagram - I'm only barely at the point where "block-wait" would apply.

@forsyth2
Collaborator Author

forsyth2 commented Apr 3, 2026

I am trying to figure out how to fit the rest of the logic on that diagram

Thanks @TonyB9000. For our purposes, I'd say the diagrams are most useful if they facilitate the code review. If we think we understand the logic well enough as-is and the tests are passing (which they are, last I checked), then we can probably go ahead and merge the PR. (And do future refactor work in #435).

@forsyth2
Collaborator Author

forsyth2 commented Apr 3, 2026

I also have performance testing set up in #427, so we can compare performance. I would hope nothing we've changed here would impact performance, but it would be good to check. I'll try to work on performance profiling today, but I'm not sure we have enough data points from main branch for it to be totally useful.

I'm not sure when the release candidate deadline has been moved to, but I imagine it will be early next week.

@forsyth2 forsyth2 mentioned this pull request Apr 3, 2026
@forsyth2
Collaborator Author

forsyth2 commented Apr 4, 2026

I would hope nothing we've changed here would impact performance, but it would be good to check. I'll try to work on performance profiling today, but I'm not sure we have enough data points from main branch for it to be totally useful.

It appears this PR has actually largely improved performance for Globus, but slightly degraded it for no-hpss/local and hpss. Additionally, with a limited number of runs, it's unclear to me how much of the Globus variation is inherent to Globus and not because of zstash code.

Performance charts made using code from #427 comparing this branch to main:
[image: performance charts, this branch vs. main]

Performance charts made using code from #427 for this branch:
[image: performance charts, this branch only]

Performance profiling setup steps
cd ~/ez/zstash
git status
# On branch add-performance-profiling
# nothing to commit, working tree clean

# Let's make a new branch with these commits AND the tar deletion commits
git checkout -b profile-refactored-tar-deletion
git log --oneline
# Good, has the correct commits
git fetch upstream issue-374-refactor-tar-deletion
git rebase upstream/issue-374-refactor-tar-deletion
git log --oneline
# Good, has commits from both
nersc_conda
rm -rf build
conda clean --all --y
conda env create -f conda/dev.yml -n zstash-profile-tar-deletion-20260403
conda activate zstash-profile-tar-deletion-20260403
pre-commit run --all-files
python -m pip install .
cd tests/performance
emacs generate_performance_data.bash # Edit parameters
git diff # Check diff
./generate_performance_data.bash
# ~2 hours to run, note there is the manual step to paste an auth code

# [SUCCESS] All tests completed. Results saved to: /pscratch/sd/f/forsyth/zstash_performance/performance_pr416_20260403/results.csv
# [INFO] Now edit IO paths and run: python visualize_performance.py

emacs visualize_performance.py # Edit parameters
git diff # Check diff
pre-commit run --all-files
git add -A
python visualize_performance.py
# Figure 1 (overview) saved to: /global/cfs/cdirs/e3sm/www/forsyth/zstash_performance/performance_20260403_pr416_vs_pr427.png
#   Accessible at: https://portal.nersc.gov/cfs/e3sm/forsyth/zstash_performance/performance_20260403_pr416_vs_pr427.png
# Figure 2 (baseline comparison) saved to: /global/cfs/cdirs/e3sm/www/forsyth/zstash_performance/performance_20260403_pr416_vs_pr427_vs_baseline.png
#   Accessible at: https://portal.nersc.gov/cfs/e3sm/forsyth/zstash_performance/performance_20260403_pr416_vs_pr427_vs_baseline.png

@TonyB9000
Collaborator

@forsyth2 Nice work!

Fascinating result. I could be wrong (not sure how the OS, or Python, optimizes the stack) but reducing the call-depth may have a real impact as well.

@forsyth2
Collaborator Author

forsyth2 commented Apr 6, 2026

with a limited number of runs, it's unclear to me how much of the Globus variation is inherent to Globus

I might try to run this again today or tomorrow to get another data point at least.

I could be wrong (not sure how the OS, or Python, optimizes the stack) but reducing the call-depth may have a real impact as well.

ChatGPT says

Yes, in many cases that is true. Reducing call depth can improve performance, but the size of the effect depends a lot on what “call depth” means in the specific code. In Python, each function call is fairly expensive compared with many lower-level languages. A deeper call chain usually means:

  • more Python frame objects created and managed
  • more argument passing
  • more local-scope setup and teardown
  • less opportunity for the interpreter to stay in a tight simple loop

So if a hot path goes through many small helper functions, flattening that path can produce a measurable speedup. A few important distinctions:

In Python

  • The overhead is mostly from the Python interpreter and frame handling, not from the OS “optimizing the stack.”
  • Python does not do tail-call optimization.
  • Recursive or deeply layered code is often noticeably slower than equivalent iterative or less-indirect code.

What the OS does

  • The OS is usually not the main player here.
  • It provides the thread stack and memory protection, but it is not dynamically “optimizing” your function-call stack in a way that usually explains application-level performance differences.
  • For Python especially, interpreter overhead dominates.

Well, that's a bit unfortunate. I find having a bunch of helper functions to be much cleaner/easier to maintain.

@forsyth2
Collaborator Author

forsyth2 commented Apr 6, 2026

Giving it the pseudo-code above, it says:

Concrete rule of thumb
For this codebase, “reduce call depth” is worth considering only in the inner file-processing loop. Everywhere else, I would treat it as low priority unless profiling says otherwise.

What to benchmark
Profile or microbenchmark just the hot per-file path:

  • current process_file -> add_file_to_tar_archive
  • flattened single function version
  • same number of files, same file sizes
  • especially with many small files

If the flattened version barely moves runtime, then the real cost is elsewhere. If it helps, the gain will probably come from Python call overhead in that inner loop, not OS stack behavior.

@forsyth2
Collaborator Author

forsyth2 commented Apr 6, 2026

My plan is to 1) run the performance check a second time on the same exact code to get more performance data, 2) time-permitting, try flattening the call structure in inner file-processing loop and re-running performance checks. Alternatively, the second item could be done as part of #435 instead.

@TonyB9000
Collaborator

@forsyth2

I might try to run this again today or tomorrow to get another data point at least.

I think that would be valuable in general, get a sense of how much "natural variation" exists when running the same operations.

Well, that's a bit unfortunate. I find having a bunch of helper functions to be much cleaner/easier to maintain.

I agree. Pulling "common repeat" stuff into natural subroutines is easier to maintain, but incurs performance burden.

Eliminating "hpss_get()" and "hpss_put()", in favor of "hpss_transfer(..., 'get', ...)" and "hpss_transfer(..., 'put', ...)", would eliminate a stack call, but not really change maintainability. How much it might improve performance is an open question.

@TonyB9000
Collaborator

TonyB9000 commented Apr 6, 2026

@forsyth2 Curious thought. One could have a "code flattener" that takes "well-subroutined" code and rewrites it with subroutines eliminated and code repeated everywhere - "bloating" the code in volume, but just for the purpose of compiling. You would always maintain the code with subroutines intact. Why is this not the practice, if Python calls are so expensive? (Not possible with recursion, of course.)

@forsyth2
Collaborator Author

forsyth2 commented Apr 6, 2026

Hmm ChatGPT has a lot to say on that topic:


Because in Python, that kind of source-level “flattening” usually does not buy enough to justify the cost, and often does not attack the real source of overhead.

Automatic flattening of arbitrary Python is too semantically messy and usually too low-yield, but selective inlining in profiled hotspots or using a JIT/native-extension path is absolutely a real optimization strategy.

Full AI response

Because in Python, that kind of source-level “flattening” usually does not buy enough to justify the cost, and often does not attack the real source of overhead.

A few reasons.

First, manual or automatic inlining is much more valuable in compiled languages than in Python. In C/C++, when a compiler inlines a function, it can do much more than just remove the call:

  • constant propagation
  • dead-code elimination
  • register allocation across the old call boundary
  • loop optimizations
  • better memory access scheduling

In Python, eliminating a function call removes some interpreter/frame overhead, but the body is still executed by the Python interpreter one bytecode at a time. You do not suddenly unlock the same class of deep optimizations a native compiler would.

So the payoff is smaller.

Second, Python code is highly dynamic, which makes general-purpose source flattening awkward and fragile. A function call in Python is not always just “jump to fixed code”:

  • methods can be overridden
  • functions can be monkey-patched
  • decorators can change behavior
  • globals and closures can be rebound
  • descriptors and metaclasses can affect lookup
  • introspection, tracing, debugging, and profiling may depend on call boundaries

A flattener that preserves exact Python semantics in all those cases is hard to make. A flattener that ignores them would silently change behavior.

Third, flattening duplicates bytecode and usually hurts maintainability and sometimes runtime characteristics:

  • code gets larger
  • instruction locality can worsen
  • debugging gets worse
  • tracebacks become less meaningful
  • profiling becomes harder to interpret
  • generated code can become enormous if helpers are used in many places

So you trade a modest reduction in call overhead for bigger, uglier, harder-to-debug code.

Fourth, the better optimization boundary in Python is often not “inline more Python,” but “do less Python.” The common winning moves are:

  • move hot loops into C/C++/Rust extensions
  • use vectorized NumPy/Pandas operations
  • batch operations
  • reduce object churn
  • reduce attribute lookups and repeated dynamic dispatch
  • reduce I/O round-trips
  • use JIT-based runtimes where applicable

Those usually dominate any gain from flattening pure Python call structure.

Fifth, Python already has better ways to preserve abstraction without paying as much at runtime, depending on the case:

  • keep helpers outside the hottest loop
  • combine tiny wrapper functions in hot paths
  • cache repeated lookups into locals
  • use iterative tight loops instead of chains of tiny methods
  • switch the hotspot to Cython, Numba, mypyc, pybind11, etc.

That is a more targeted version of what your colleague is imagining.
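As a concrete instance of "cache repeated lookups into locals" from the list above, hoisting a bound-method lookup out of a hot loop avoids re-resolving it on every iteration (function names here are illustrative):

```python
# Illustration of caching a repeated lookup into a local variable.
def collect_slow(n):
    out = []
    for i in range(n):
        out.append(i * i)  # re-resolves the out.append attribute each iteration
    return out


def collect_fast(n):
    out = []
    append = out.append  # attribute lookup done once, bound to a local
    for i in range(n):
        append(i * i)
    return out
```

Both return identical results; the second version simply does less interpreter work per iteration, which only matters in genuinely hot loops.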

There is also a deeper point: a “code flattener” is basically trying to be a compiler optimization pass. But CPython is not a traditional optimizing compiler pipeline. It does not take your whole program, prove properties about it, inline aggressively, and emit optimized native code. Since Python remains dynamic at runtime, source rewriting has less leverage.

So the answer to “why is this not the practice?” is roughly:

  • it is hard to do safely for Python semantics
  • it often gives only modest gains
  • it makes code size/debugging worse
  • bigger wins usually come from reducing interpreter work entirely, not rearranging Python source

That said, the idea is not crazy. It does exist in neighboring forms:

  • Cython/mypyc: compile restricted or typed Python-like code to C/native extensions
  • Numba: JIT-compile certain numerical functions
  • PyPy: JIT can effectively inline and optimize hot paths dynamically
  • hand-tuned hot-loop refactors: people absolutely do flatten tiny helpers in hotspots after profiling

So the real-world practice is not “flatten all Python before compiling,” but rather:

profile, identify hotspots, and either simplify the hot Python path or move it to a runtime/compiler that can optimize it properly.

For your colleague’s idea, the closest practical statement is:

Automatic flattening of arbitrary Python is too semantically messy and usually too low-yield, but selective inlining in profiled hotspots or using a JIT/native-extension path is absolutely a real optimization strategy.

In your tar/transfer workflow, I would only consider this for the tiny per-file inner-loop helpers, not for the orchestration or I/O-heavy layers.

@TonyB9000
Collaborator

@forsyth2 Very nice summary. I'd like to know how much overhead (if any) is incurred by the "hpss_put()" type call-stack lengthening conducted only to "rename" a function (almost purely a "pass-through").

Experiment: Call simple function "X" 100 times. Then rewrite the code to call function "A" 100 times, where A() calls B(), which calls C(), which calls D() ... which calls X(). Any parameters are simply passed through. Measure the performance burden as call depth increases.
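That experiment can be sketched with the standard-library timeit module (all function names here are illustrative):

```python
# Sketch of the call-depth experiment: time a direct call to x() versus
# calls routed through a chain of pass-through wrappers A -> B -> ... -> X.
import timeit


def x(a, b):
    return a + b


def make_chain(depth):
    """Wrap x() in `depth` pass-through functions that just forward args."""
    fn = x
    for _ in range(depth):
        fn = (lambda inner: (lambda a, b: inner(a, b)))(fn)
    return fn


def bench(fn, calls=100, repeats=1000):
    """Total seconds to make `calls` calls, repeated `repeats` times."""
    return timeit.timeit(lambda: [fn(1, 2) for _ in range(calls)], number=repeats)
```

For example, `bench(make_chain(10))` should come out noticeably higher than `bench(x)` on CPython, since each extra frame adds interpreter overhead; exact ratios vary by machine and Python version.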

@forsyth2
Collaborator Author

forsyth2 commented Apr 6, 2026

I might try to run this again today or tomorrow to get another data point at least.

Ok, here's another data point.

Figure 1

Performance profile running on the exact same code as the first run:

[image: performance profile, second run]

Figure 2

Comparing those results to main (+ the performance profiling code itself, from #427):

[image: performance comparison vs. main]

It seems like again we come to the same conclusion: Globus runs are much faster and the no-HPSS and HPSS runs are a little slower. Considering that (1) I believe most users use zstash with Globus, and (2) the Globus runs are consistently the longest-running, I think it makes sense to prioritize the speedup of the Globus runs (i.e., go ahead and merge this PR).

Figure 3

Comparing those results instead to the results from the first run:

image

It appears the Globus runs are not too different in runtime. Only two runs really stand out as inconsistent between the two performance profiles:

  • zstash create using Globus on a directory with many small files: major decrease in runtime on second profiling.
  • zstash update using HPSS on a directory with a few large files: major increase in runtime on second profiling.

Setup

Setup steps
# 2026-04-06

```bash
# Running on Bebop since Chrysalis is down.

cd ~/ez/zstash
git status
# On branch profile-refactored-tar-deletion
git diff --staged | cat
# diff --git a/tests/performance/generate_performance_data.bash b/tests/performance/generate_performance_data.bash
# index 83bc913..c313bf1 100755
# --- a/tests/performance/generate_performance_data.bash
# +++ b/tests/performance/generate_performance_data.bash
# @@ -15,7 +15,7 @@ set -e
#  # Run from Perlmutter, so that we can do both
#  # a direct transfer to HPSS & a Globus transfer to Chrysalis
#  work_dir=/pscratch/sd/f/forsyth/zstash_performance/
# -unique_id=performance_20260402
# +unique_id=performance_pr416_20260403
 
#  dir_to_copy_from=/global/cfs/cdirs/e3sm/forsyth/E3SMv2/v2.LR.historical_0201/
#  subdir0=build/
# diff --git a/tests/performance/visualize_performance.py b/tests/performance/visualize_performance.py
# index 79e5f7f..32266e6 100644
# --- a/tests/performance/visualize_performance.py
# +++ b/tests/performance/visualize_performance.py
# @@ -56,17 +56,19 @@ import pandas as pd
 
#  # The results to show in Fig. 1
#  RESULTS_CSV: str = (
# -    "/pscratch/sd/f/forsyth/zstash_performance/performance_20260402/results.csv"
# +    "/pscratch/sd/f/forsyth/zstash_performance/performance_pr416_20260403/results.csv"
#  )
 
#  # The results to compare against in Fig. 2.
#  # Set to None to skip Fig. 2.
# -BASELINE_RESULTS_CSV: Optional[str] = None
# +BASELINE_RESULTS_CSV: Optional[str] = (
# +    "/pscratch/sd/f/forsyth/zstash_performance/performance_20260402/results.csv"
# +)
 
#  # Output path for the saved figures.
#  # Set to None to display interactively instead of saving.
#  OUTPUT_PATH: Optional[str] = (
# -    "/global/cfs/cdirs/e3sm/www/forsyth/zstash_performance/performance__20260402_pr427.png"
# +    "/global/cfs/cdirs/e3sm/www/forsyth/zstash_performance/performance_20260403_pr416_vs_pr427.png"
#  )
 
#  # ---------------------------------------------------------------------------
nersc_conda
conda activate zstash-pr427-performance-profile-20260402
git add -A
pre-commit run --all-files
git commit -m "Profiling 20260403"

# Edit tests/performance/generate_performance_data.bash
# unique_id=performance_pr416_20260406
python -m pip install .
cd tests/performance
git diff # Check diff
./generate_performance_data.bash
# ~2-2.5 hours to run, note there is the manual step to paste an auth code (about 5-10 minutes into run time)

# [SUCCESS] All tests completed. Results saved to: /pscratch/sd/f/forsyth/zstash_performance/performance_pr416_20260406/results.csv
# [INFO] Now edit IO paths and run: python visualize_performance.py

# First, we'll compare against `main`.
# Edit tests/performance/visualize_performance.py
# RESULTS_CSV => /pscratch/sd/f/forsyth/zstash_performance/performance_pr416_20260406/results.csv
# OUTPUT_PATH => /global/cfs/cdirs/e3sm/www/forsyth/zstash_performance/performance_20260406_pr416_vs_pr427.png
git diff # Check diff
pre-commit run --all-files
git add -A
python visualize_performance.py
# Figure 1 (overview) saved to: /global/cfs/cdirs/e3sm/www/forsyth/zstash_performance/performance_20260406_pr416_vs_pr427.png
#   Accessible at: https://portal.nersc.gov/cfs/e3sm/forsyth/zstash_performance/performance_20260406_pr416_vs_pr427.png
# Figure 2 (baseline comparison) saved to: /global/cfs/cdirs/e3sm/www/forsyth/zstash_performance/performance_20260406_pr416_vs_pr427_vs_baseline.png
#   Accessible at: https://portal.nersc.gov/cfs/e3sm/forsyth/zstash_performance/performance_20260406_pr416_vs_pr427_vs_baseline.png

# Second, we'll compare against the performance profile of the same exact code.
# Edit tests/performance/visualize_performance.py
# BASELINE_RESULTS_CSV => /pscratch/sd/f/forsyth/zstash_performance/performance_pr416_20260403/results.csv
# OUTPUT_PATH => /global/cfs/cdirs/e3sm/www/forsyth/zstash_performance/performance_20260406_pr416_vs_20260403_pr416.png
git diff # Check diff
pre-commit run --all-files
git add -A
python visualize_performance.py
# Figure 1 (overview) saved to: /global/cfs/cdirs/e3sm/www/forsyth/zstash_performance/performance_20260406_pr416_vs_20260403_pr416.png
#   Accessible at: https://portal.nersc.gov/cfs/e3sm/forsyth/zstash_performance/performance_20260406_pr416_vs_20260403_pr416.png
# Figure 2 (baseline comparison) saved to: /global/cfs/cdirs/e3sm/www/forsyth/zstash_performance/performance_20260406_pr416_vs_20260403_pr416_vs_baseline.png
#   Accessible at: https://portal.nersc.gov/cfs/e3sm/forsyth/zstash_performance/performance_20260406_pr416_vs_20260403_pr416_vs_baseline.png

```

</details>
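The session above runs `generate_performance_data.bash`, which times each zstash configuration and writes a `results.csv` for the visualization step. A minimal sketch of that kind of timing harness is below; the placeholder commands, test names, and CSV column names (`test_name`, `elapsed_seconds`) are illustrative assumptions, not zstash's actual harness or schema.

```python
# Hypothetical sketch of a timing harness in the spirit of
# generate_performance_data.bash: run each configuration, record elapsed
# wall-clock seconds, and append the results to a CSV.
# The commands and CSV layout are illustrative assumptions only.
import csv
import subprocess
import time

# Placeholder commands standing in for real zstash create/update invocations.
CONFIGS = [
    ("globus_small_files", ["python", "-c", "pass"]),
    ("hpss_large_files", ["python", "-c", "pass"]),
]


def run_and_time(configs, csv_path):
    """Run each (name, command) pair, timing it, and write a results CSV."""
    rows = []
    for name, cmd in configs:
        start = time.monotonic()
        subprocess.run(cmd, check=True)
        rows.append(
            {
                "test_name": name,
                "elapsed_seconds": round(time.monotonic() - start, 3),
            }
        )
    with open(csv_path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["test_name", "elapsed_seconds"])
        writer.writeheader()
        writer.writerows(rows)
    return rows


if __name__ == "__main__":
    rows = run_and_time(CONFIGS, "results.csv")
    print([r["test_name"] for r in rows])
```

The real script also handles the Globus auth-code step noted above, which is why a full run takes hours rather than seconds.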

@forsyth2

forsyth2 commented Apr 6, 2026

It seems we again come to the same conclusion: the Globus runs are much faster, and the no-HPSS and HPSS runs are a little slower. Considering that (1) I believe most users use zstash with Globus, and (2) the Globus runs are consistently the longest-running, I think it makes sense to prioritize speeding up the Globus runs (i.e., go ahead and merge this PR).

@TonyB9000 Do you have any comments on the plots above? I think this PR should be good to merge. (And we can do more code cleanup in #435 of course).
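The Fig. 2 baseline comparisons discussed above boil down to per-test percent changes between two results CSVs. A minimal sketch of that comparison is below; the column names (`test_name`, `elapsed_seconds`) and the inline sample data are hypothetical, not zstash's actual output format.

```python
# Hypothetical sketch: compare a profiling results CSV against a baseline,
# mirroring the Fig. 2 "baseline comparison" step described above.
# Column names and sample values are assumptions, not zstash's actual schema.
import csv
import io

CURRENT_CSV = """test_name,elapsed_seconds
globus_small_files,120.0
hpss_large_files,300.0
"""

BASELINE_CSV = """test_name,elapsed_seconds
globus_small_files,200.0
hpss_large_files,280.0
"""


def load(text):
    """Parse a results CSV into {test_name: elapsed_seconds}."""
    return {
        row["test_name"]: float(row["elapsed_seconds"])
        for row in csv.DictReader(io.StringIO(text))
    }


def percent_change(current, baseline):
    """Percent change per test; positive = slower than baseline, negative = faster."""
    return {
        name: 100.0 * (current[name] - baseline[name]) / baseline[name]
        for name in current
        if name in baseline
    }


changes = percent_change(load(CURRENT_CSV), load(BASELINE_CSV))
for name, pct in sorted(changes.items()):
    print(f"{name}: {pct:+.1f}%")
# → globus_small_files: -40.0%
# → hpss_large_files: +7.1%
```

A table of these percent changes makes the "Globus much faster, HPSS a little slower" tradeoff easy to eyeball without reading the full figures.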

@forsyth2

forsyth2 commented Apr 6, 2026

  • I believe I ran the most important tests on the second-to-last commit.
  • The last commit, as noted here, shouldn't impact functionality.
  • I believe the changes made since the last complete run of the test suite are unlikely to affect the tests that weren't re-run.
  • I've now done two performance profiling runs: profile 1 & profile 2
  • After merging this, and before making rc1, I will run the full test suite on the main branch.

So, I will merge this now.

@forsyth2 forsyth2 merged commit 33f379d into main Apr 6, 2026
5 checks passed
@forsyth2 forsyth2 deleted the issue-374-refactor-tar-deletion branch April 6, 2026 22:03
@forsyth2 forsyth2 mentioned this pull request Apr 6, 2026

Labels

Globus · semver: bug (bug fix; will increment patch version)


Development

Successfully merging this pull request may close these issues.

[Bug]: tar files are not deleted after successful globus transfer

4 participants