
Conversation


@forsyth2 forsyth2 commented Jan 3, 2026

Summary

Objectives:

  • Delete tars eventually, even if --non-blocking is set

Issue resolution:

Select one: This pull request is...

  • a bug fix: increment the patch version
  • a small improvement: increment the minor version
  • a new feature: increment the minor version
  • an incompatible (non-backwards compatible) API change: increment the major version

Big Change

  • To merge, I will use "Create a merge commit". That is, this change is large enough to require multiple units of work (i.e., it should be multiple commits).

1. Does this do what we want it to do?

Required:

  • Product Management: I have confirmed with the stakeholders that the objectives above are correct and complete.
  • Testing: I have added at least one automated test. Every objective above is represented in at least one test.
  • Testing: I have considered likely and/or severe edge cases and have included them in testing.

If applicable:

  • Testing: this pull request adds at least one new possible command line option. I have tested using this option with and without any other option that may interact with it.

2. Are the implementation details accurate & efficient?

Required:

  • Logic: I have visually inspected the entire pull request myself.
  • Logic: I have left GitHub comments highlighting important pieces of code logic. I have had these code blocks reviewed by at least one other team member.

If applicable:

  • Dependencies: This pull request introduces a new dependency. I have discussed this requirement with at least one other team member. The dependency is noted in zstash/conda, not just an import statement.

3. Is this well documented?

Required:

  • Documentation: by looking at the docs, a new user could easily understand the functionality introduced by this pull request.

4. Is this code clean?

Required:

  • Readability: The code is as simple as possible and well-commented, such that a new team member could understand what's happening.
  • Pre-commit checks: All the pre-commit checks have passed.

If applicable:

  • Software architecture: I have discussed relevant trade-offs in design decisions with at least one other team member. It is unlikely that this pull request will increase tech debt.

@forsyth2 forsyth2 self-assigned this Jan 3, 2026
@forsyth2 forsyth2 added semver: bug Bug fix (will increment patch version) Globus Globus labels Jan 3, 2026
@forsyth2 forsyth2 mentioned this pull request Jan 3, 2026

forsyth2 commented Jan 3, 2026

Action items:


forsyth2 commented Jan 6, 2026

All tests are passing now. Self-review guide from Claude:

Self-Review Guide for Progressive Tar File Deletion Fix

Overview

This diff fixes the issue where tar files weren't being deleted after successful Globus transfers when --keep is False. It introduces a TransferManager class to track transfers and delete files progressively.

Key Changes to Review

1. New Transfer Tracking System (transfer_tracking.py)

  • TransferManager class: Does it correctly maintain state across multiple transfers?
  • TransferBatch class: Are file paths being tracked correctly for each batch?
  • delete_successfully_transferred_files():
    • Does it properly check Globus task status before deletion?
    • Does it handle both Globus and HPSS transfers correctly?
    • Are files only deleted once (empty file_paths list after deletion)?
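As a reference point while reviewing, here is a minimal sketch of what the two classes are expected to look like. The class and field names come from the diff; the exact field set and the current_batch() helper are assumptions for illustration:

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class TransferBatch:
    # Files queued in this batch; emptied once they have been deleted.
    file_paths: List[str] = field(default_factory=list)
    task_id: Optional[str] = None      # set after submission to Globus
    task_status: Optional[str] = None  # e.g. "ACTIVE", "SUCCEEDED"
    is_globus: bool = True


@dataclass
class TransferManager:
    batches: List[TransferBatch] = field(default_factory=list)
    cumulative_tarfiles_pushed: int = 0

    def current_batch(self) -> TransferBatch:
        # Start a new batch if none exists or the last one was already submitted.
        if not self.batches or self.batches[-1].task_id:
            self.batches.append(TransferBatch())
        return self.batches[-1]
```

The key invariant to check in the real code is the same one this sketch encodes: a batch stops accepting files once it has a task_id.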

2. Global Variable Elimination

Check that these global variables are properly replaced:

  • remote_endpoint → globus_config.remote_endpoint
  • local_endpoint → globus_config.local_endpoint
  • transfer_client → globus_config.transfer_client
  • transfer_data → batch.transfer_data
  • task_id → batch.task_id
  • archive_directory_listing → globus_config.archive_directory_listing
  • global_variable_tarfiles_pushed → transfer_manager.cumulative_tarfiles_pushed
  • prev_transfers, curr_transfers → removed (logic now in TransferManager)
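For reference, the object replacing the module-level globals might look roughly like this; the attribute list mirrors the mapping above, but the exact shape is an assumption to be checked against the diff:

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class GlobusConfig:
    # Formerly module-level globals in globus.py
    remote_endpoint: Optional[str] = None
    local_endpoint: Optional[str] = None
    transfer_client: Optional[Any] = None  # a globus_sdk.TransferClient at runtime
    archive_directory_listing: Optional[Any] = None
```

Grouping these into one object makes the None-check requirements in section 9 explicit: every field starts unset and must be populated before use.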

3. TransferManager Threading

  • Is a single TransferManager instance created in create() and passed through all functions?
  • Is the same instance used in update()?
  • Does hpss_get() create its own instance (acceptable since it's a separate operation)?

4. Batch Creation and Management (hpss.py::hpss_transfer())

  • Are batches created at the right time (before adding files)?
  • Are files added to the current batch correctly?
  • Is index.db excluded from deletion tracking (is_index check)?
  • Does the code handle the --keep flag correctly (never track files when keep=True)?

5. Transfer Submission (globus.py::globus_transfer())

  • After submitting a transfer, is the batch updated with task_id and task_status?
  • Is transfer_data set to None after submission?
  • Does the function handle the case where a previous transfer is still ACTIVE?

6. Deletion Trigger Points

Review where delete_successfully_transferred_files() is called:

  • After each transfer (hpss.py::hpss_transfer()): Only when keep=False
  • At finalization (globus.py::globus_finalize()): After all transfers complete
  • Are these the right trigger points for both blocking and non-blocking modes?

7. Blocking vs Non-Blocking Behavior

Blocking mode:

  • Does globus_transfer() wait for completion via globus_block_wait()?
  • Are files deleted immediately after each successful transfer?
  • Does globus_finalize() still work correctly (redundant waits are harmless)?

Non-blocking mode:

  • Does globus_transfer() skip the blocking wait?
  • Does delete_successfully_transferred_files() check task status non-blockingly?
  • Are files deleted when status is checked and found to be SUCCEEDED?
  • Does globus_finalize() wait for all transfers before final cleanup?

8. Finalization Logic (globus.py::globus_finalize())

  • Does it handle pending transfer_data that hasn't been submitted?
  • Does it wait for the most recent transfer to complete?
  • Does it wait for the last task (if different from most recent)?
  • Does it avoid waiting twice on the same task_id?
  • Does it call delete_successfully_transferred_files() at the end?

9. Error Handling

  • Are Optional types used correctly for TransferManager, GlobusConfig, etc.?
  • Are None checks in place before accessing attributes?
  • Does hpss_get() create necessary objects when they don't exist?

10. Test Coverage (test_globus_tar_deletion.bash)

  • New test function: test_globus_progressive_deletion()
    • Creates ~2GB of test files to trigger multiple tars
    • Verifies multiple tar archives are created
    • Checks for deletion events during the run
    • Verifies no tar files remain in source
    • Verifies all tar files exist in destination
  • Are both blocking and non-blocking progressive deletion tests run?
  • Does the test correctly differentiate expected deletion behavior between modes?

11. Logging and Debugging

  • Are there sufficient debug logs to track batch creation and file tracking?
  • Are deletion events logged clearly?
  • Is the -v flag added to test commands for verbose output?

12. Edge Cases

  • What happens if a transfer fails? (Files won't be deleted - correct)
  • What happens if --keep is True? (Files never tracked for deletion - correct)
  • What happens with index.db? (Never deleted - correct via is_index flag)
  • What happens with multiple tar files in one run? (Progressive deletion should work)
  • What happens if the last transfer is still pending at finalization? (Should wait)

13. Code Cleanup

  • Are all commented-out debug statements removed or uncommented appropriately?
  • Is the old global variable cleanup code removed from hpss_transfer()?
  • Are all function signatures updated with transfer_manager parameter?

14. Backward Compatibility

  • Does the change affect HPSS (non-Globus) transfers? (Should still work)
  • Does it work with --keep flag? (Yes, files never tracked)
  • Does it work in both blocking and non-blocking modes? (Yes, different deletion timing)

Specific Potential Issues to Check

Critical Path Review

  1. File Addition Flow:

    hpss_transfer() → creates batch → adds file to batch.file_paths → 
    globus_transfer() → submits → sets batch.task_id
    
  2. Deletion Flow (Blocking):

    globus_transfer() → globus_block_wait() → returns SUCCEEDED →
    hpss_transfer() → delete_successfully_transferred_files() → checks status → deletes
    
  3. Deletion Flow (Non-blocking):

    globus_transfer() → returns early (not SUCCEEDED) →
    later: delete_successfully_transferred_files() → checks status → if SUCCEEDED, deletes
    

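The status-check-then-delete step shared by flows 2 and 3 can be exercised in isolation with a stub client. Everything below is a stand-in for review purposes (the stub mimics only the one transfer_client call the deletion path uses); it is not the real zstash code:

```python
import os


class StubTransferClient:
    """Minimal stand-in for globus_sdk.TransferClient.get_task."""

    def __init__(self, statuses):
        self.statuses = statuses  # task_id -> status string

    def get_task(self, task_id):
        return {"status": self.statuses[task_id]}


def delete_if_succeeded(client, task_id, file_paths):
    # Mirrors the guard in delete_successfully_transferred_files():
    # delete only when the task has actually SUCCEEDED, then clear the
    # list so the batch is never processed twice.
    if client.get_task(task_id)["status"] != "SUCCEEDED":
        return False
    for path in file_paths:
        if os.path.exists(path):
            os.remove(path)
    file_paths.clear()
    return True
```

This makes the non-blocking contract testable without Globus: an ACTIVE task leaves files in place, a SUCCEEDED task removes them exactly once.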
Potential Race Conditions

  • Can a batch be submitted while another is being checked? (Should be sequential)
  • Can files be deleted while still being transferred? (No, status check prevents this)

Memory Leaks

  • Are transfer_data objects properly cleaned up after submission?
  • Are batch file_paths lists cleared after deletion?

Testing Checklist

  • Run basic blocking keep test
  • Run basic blocking non-keep test
  • Run basic non-blocking keep test
  • Run basic non-blocking non-keep test
  • Run progressive deletion blocking test
  • Run progressive deletion non-blocking test
  • Verify no tar files remain in source (non-keep cases)
  • Verify tar files exist in destination
  • Check logs for deletion events


forsyth2 commented Jan 6, 2026

Follow-up:

Performance Review for Progressive Tar File Deletion

Performance Concerns to Review

1. Status Check Overhead

Current Implementation:

def delete_successfully_transferred_files(self):
    for batch in self.batches:
        if batch.is_globus and batch.task_id and (batch.task_status != "SUCCEEDED"):
            if self.globus_config and self.globus_config.transfer_client:
                task = self.globus_config.transfer_client.get_task(batch.task_id)
                batch.task_status = task["status"]

Issues:

  • O(n) status checks: Iterates through ALL batches every time delete_successfully_transferred_files() is called
  • Redundant iteration: already-succeeded batches are still visited on every call, although the != "SUCCEEDED" guard does prevent a repeat API call for them
  • Multiple calls per file: Called after EVERY hpss_put() in non-blocking mode

Impact:

  • For a run with 100 tar files, this could mean 100+ iterations through the batch list
  • Each call to Globus API adds latency (~100-500ms per call)

Potential Optimizations:

  • Only check batches that haven't been processed yet (already done via if not batch.file_paths: continue)
  • Add index tracking: self.last_checked_batch_index to avoid re-checking old batches
  • Batch status checks: collect multiple task_ids and check them together (if Globus SDK supports it)
  • Rate limit checks: only check every N seconds or every N file additions

2. Batch List Growth

Current Implementation:

self.batches: List[TransferBatch] = []
# Grows unbounded throughout the run

Issues:

  • Memory growth: For runs with 1000+ tar files, this list grows to 1000+ items
  • Iteration overhead: Each delete_successfully_transferred_files() iterates the entire list

Potential Optimizations:

  • Clear processed batches: Remove batches from list once files are deleted
  • Use a deque with max length
  • Separate "pending" and "completed" lists
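The first and third options can be combined: keep only in-flight batches in a deque and pop them as they succeed. This is a sketch under the assumption that batches succeed roughly in submission order; the class and names are illustrative, not from the diff:

```python
from collections import deque


class BatchTracker:
    """Keep only in-flight batches queued, so sweeps never revisit old ones."""

    def __init__(self):
        self.pending = deque()  # batches (lists of file paths) awaiting SUCCEEDED

    def sweep(self, status_of):
        # Pop batches from the front as they succeed and stop at the first
        # batch that is still in flight; the deque stays small regardless of
        # how many tar files the run produces.
        deleted = []
        while self.pending and status_of(self.pending[0]) == "SUCCEEDED":
            deleted.extend(self.pending.popleft())
        return deleted
```

If out-of-order completion is possible, this under-deletes (a stuck batch blocks the ones behind it), which may still be acceptable since finalization sweeps everything at the end.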

3. File Deletion Performance

Current Implementation:

def delete_files(self):
    for src_path in self.file_paths:
        if os.path.exists(src_path):
            os.remove(src_path)

Issues:

  • Sequential deletion: Deletes files one at a time
  • Redundant existence checks: os.path.exists() adds a syscall per file

Impact:

  • For tar files (large files), deletion is I/O bound, so sequential is probably fine
  • But the os.path.exists() check is wasteful if we already know the file exists

Potential Optimizations:

  • Remove the os.path.exists() check and handle exceptions instead
  • Consider bulk deletion if filesystem supports it
  • Log warning instead of failing if file doesn't exist

4. Non-Blocking Mode Efficiency

Current Behavior:

# After EVERY hpss_put() call:
if not keep:
    transfer_manager.delete_successfully_transferred_files()

Issues:

  • Excessive checking: In non-blocking mode, checks status after every single tar file is queued
  • Wasted API calls: Most checks will return "ACTIVE" or "PENDING"
  • No batching benefit: Defeats the purpose of batching transfers

Example:

  • Add tar 1 → check status (PENDING)
  • Add tar 2 → check status (ACTIVE)
  • Add tar 3 → check status (ACTIVE)
  • Add tar 4 → check status (ACTIVE)
  • Add tar 5 → check status (SUCCEEDED for tar 1, ACTIVE for 2-4)

This means 5 status checks when only 1-2 would be needed.

Potential Optimizations:

  • Throttle checks: Only check every N tar files or every M seconds
  • Check only recent batches: Don't iterate through all old batches every time
  • Progressive threshold: Only check if X tar files have accumulated

5. Globus Transfer Batching

Current Implementation:

# Creates new batch if last one was submitted
if not transfer_manager.batches or transfer_manager.batches[-1].task_id:
    new_batch = TransferBatch()

Questions:

  • Batch size: How many files are in each TransferData before submission?
  • Submission trigger: When is a batch actually submitted to Globus?
  • Optimal batch size: Is there a maximum batch size for Globus transfers?

Looking at the code:
The batch submission happens in globus_transfer(), but it's called after EVERY hpss_put(). This means:

  • One file per transfer?: Each tar file might be its own transfer task
  • No actual batching?: The batch tracking is for deletion, not for combining transfers

Potential Optimizations:

  • Accumulate multiple tar files before submitting to Globus
  • Submit every N files or when total size reaches threshold
  • Use Globus's native batching capabilities more effectively
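The "accumulate before submitting" idea could be sketched as follows. The thresholds, the submit hook, and the class itself are illustrative assumptions, not zstash code; the real submission happens inside globus_transfer():

```python
class BatchAccumulator:
    """Accumulate tar paths and submit them as one transfer when full."""

    def __init__(self, submit, max_files=10, max_bytes=100 * 2**30):
        self.submit = submit            # callable taking a list of paths
        self.max_files = max_files      # submit after this many files...
        self.max_bytes = max_bytes      # ...or once this much data accumulates
        self.paths = []
        self.total_bytes = 0

    def add(self, path, size_bytes):
        self.paths.append(path)
        self.total_bytes += size_bytes
        if len(self.paths) >= self.max_files or self.total_bytes >= self.max_bytes:
            self.flush()

    def flush(self):
        # Submit whatever has accumulated and reset for the next batch.
        if self.paths:
            self.submit(self.paths)
            self.paths = []
            self.total_bytes = 0
```

A flush() call at finalization would still be required so a partially filled batch is never stranded.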

6. Test Performance Impact

New Progressive Deletion Tests:

dd if=/dev/zero of=zstash_demo/file1.dat bs=1M count=700  # 700 MB
dd if=/dev/zero of=zstash_demo/file2.dat bs=1M count=700  # 700 MB  
dd if=/dev/zero of=zstash_demo/file3.dat bs=1M count=700  # 700 MB

Issues:

  • 2.1 GB of test data: Takes significant time to create and transfer
  • CI/CD impact: Will these tests timeout in automated testing?
  • Disk space: Requires sufficient space in test environment

Recommendations:

  • Document expected test duration
  • Consider making these optional or only running on specific platforms
  • Add timeout configuration
  • Clean up test files afterwards

7. Finalization Performance

Current Implementation:

def globus_finalize(transfer_manager: TransferManager, non_blocking: bool = False):
    # Submit any pending transfer_data
    # Wait for most recent transfer
    # Wait for last task (if different)
    # Delete successfully transferred files

Issues:

  • Double wait: Potentially waits for same task_id twice (though skip_last_wait mitigates)
  • Blocking at end: Even in non-blocking mode, finalization blocks on all transfers

Questions:

  • Is the finalization wait necessary? (Probably yes, to ensure index.db transfer completes)
  • Can we return earlier in non-blocking mode? (No, because index.db must complete)

Recommended Performance Improvements

Priority 1: High Impact, Low Effort

  1. Optimize batch iteration:
def delete_successfully_transferred_files(self):
    # Only check batches that haven't been processed yet
    batches_to_check = [b for b in self.batches if b.file_paths]  # Has files to delete
    
    for batch in batches_to_check:
        # ... existing logic
  2. Remove redundant os.path.exists():
def delete_files(self):
    for src_path in self.file_paths:
        try:
            os.remove(src_path)
        except FileNotFoundError:
            logger.warning(f"File already deleted: {src_path}")
  3. Throttle status checks in non-blocking mode:
# In hpss.py::hpss_transfer()
if not keep:
    # Only check every 5 files or if this is the last file
    if (transfer_manager.cumulative_tarfiles_pushed % 5 == 0) or is_last_file:
        transfer_manager.delete_successfully_transferred_files()

Priority 2: Medium Impact, Medium Effort

  1. Track last checked batch:
class TransferManager:
    def __init__(self):
        self.batches: List[TransferBatch] = []
        self.last_deletion_check_index: int = 0  # New field
    
    def delete_successfully_transferred_files(self):
        # Only check batches from last_deletion_check_index forward
        for i in range(self.last_deletion_check_index, len(self.batches)):
            batch = self.batches[i]
            # ... check and delete logic
            if batch.file_paths == []:  # Processed
                self.last_deletion_check_index = i + 1
  2. Clear old batches to prevent memory growth:
def delete_successfully_transferred_files(self):
    # ... existing logic
    
    # Remove fully processed batches
    self.batches = [b for b in self.batches if b.file_paths or not b.task_id]

Priority 3: Lower Priority / More Investigation Needed

  1. Consider time-based throttling:
class TransferManager:
    def __init__(self):
        self.last_status_check_time: float = 0
    
    def delete_successfully_transferred_files(self):
        now = time.time()
        if now - self.last_status_check_time < 30:  # Don't check more than every 30s
            return
        self.last_status_check_time = now
        # ... existing logic
  2. Investigate actual Globus batching:
    • Review how TransferData accumulates files
    • Ensure multiple tar files are combined into single transfer tasks when possible
    • This might already be working correctly; needs verification

Performance Testing Checklist

  • Profile a run with 100+ tar files in non-blocking mode
  • Count number of Globus API calls vs number of tar files
  • Measure memory usage growth over long runs
  • Time the delete_successfully_transferred_files() function
  • Check if status checks are the bottleneck or file I/O is
  • Test with different batch sizes and throttling parameters
  • Verify the progressive deletion tests don't exceed reasonable timeouts

Questions to Answer

  1. What's the typical number of tar files in a real zstash run?

    • 10s? 100s? 1000s?
    • Determines urgency of optimizations
  2. What's the acceptable overhead?

    • If transfers take hours, a few extra seconds of status checks is negligible
    • If transfers take minutes, overhead becomes significant
  3. Is the current implementation already good enough?

    • The batch.file_paths check prevents processing old batches
    • The != "SUCCEEDED" check prevents redundant API calls
    • Maybe performance is already acceptable?
  4. Do we need metrics?

    • Add counters for number of API calls
    • Track time spent in deletion checks
    • Log performance statistics at end of run
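If metrics are wanted, a lightweight counter wrapper would be enough to answer the first two questions. This is a sketch; the class name and where it would be wired in are assumptions:

```python
import time


class DeletionMetrics:
    """Count status-check API calls and time spent in deletion sweeps."""

    def __init__(self):
        self.api_calls = 0
        self.deletions = 0
        self.check_seconds = 0.0

    def timed_check(self, fn, *args):
        # Wrap a status-check call, counting it and accumulating elapsed time.
        start = time.monotonic()
        try:
            return fn(*args)
        finally:
            self.api_calls += 1
            self.check_seconds += time.monotonic() - start

    def summary(self):
        return (f"{self.api_calls} status checks, {self.deletions} deletions, "
                f"{self.check_seconds:.2f}s spent checking")
```

Logging summary() at the end of a run would show directly whether the API-call count scales with the tar-file count, which is the core question above.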


Development

Successfully merging this pull request may close these issues.

[Bug]: tar files are not deleted after successful globus transfer
