
Refactor to resolve tars not deleting when --non-blocking is set#416

Merged
forsyth2 merged 33 commits into main from issue-374-refactor-tar-deletion on Apr 6, 2026

Conversation


@forsyth2 forsyth2 commented Jan 3, 2026

Summary

Objectives:

  • Delete tars eventually, even if --non-blocking is set.
  • Refactor as necessary to 1) complete the above objective and 2) make logical flow easy to understand.

Issue resolution:

Select one: This pull request is...

  • a bug fix: increment the patch version
  • a small improvement: increment the minor version
  • a new feature: increment the minor version
  • an incompatible (non-backwards compatible) API change: increment the major version

Big Change

  • To merge, I will use "Create a merge commit". That is, this change is large enough to require multiple units of work (i.e., it should be multiple commits).

1. Does this do what we want it to do?

Required:

  • Product Management: I have confirmed with the stakeholders that the objectives above are correct and complete.
  • Testing: I have added at least one automated test. Every objective above is represented in at least one test.
    • Tests were added to check for progressive deletion
  • Testing: I have considered likely and/or severe edge cases and have included them in testing.

2. Are the implementation details accurate & efficient?

Required:

  • Logic: I have visually inspected the entire pull request myself.
  • Logic: I have left GitHub comments highlighting important pieces of code logic. I have had these code blocks reviewed by at least one other team member.

3. Is this well documented?

Required:

  • Documentation: by looking at the docs, a new user could easily understand the functionality introduced by this pull request.
    • As a bug fix, this is only correcting expected behavior.

4. Is this code clean?

Required:

  • Readability: The code is as simple as possible and well-commented, such that a new team member could understand what's happening.
    • The refactoring significantly increases readability.
  • Pre-commit checks: All the pre-commits checks have passed.

If applicable:

  • Software architecture: I have discussed relevant trade-offs in design decisions with at least one other team member. It is unlikely that this pull request will increase tech debt.

@forsyth2 forsyth2 self-assigned this Jan 3, 2026
@forsyth2 forsyth2 added semver: bug Bug fix (will increment patch version) Globus Globus labels Jan 3, 2026
@forsyth2 forsyth2 mentioned this pull request Jan 3, 2026
16 tasks

forsyth2 commented Jan 3, 2026

Action items:


forsyth2 commented Jan 6, 2026

All tests are passing now. Self-review guide from Claude:

Self-Review Guide for Progressive Tar File Deletion Fix

Overview

This diff fixes the issue where tar files weren't being deleted after successful Globus transfers when --keep is False. It introduces a TransferManager class to track transfers and delete files progressively.

Key Changes to Review

1. New Transfer Tracking System (transfer_tracking.py)

  • TransferManager class: Does it correctly maintain state across multiple transfers?
  • TransferBatch class: Are file paths being tracked correctly for each batch?
  • delete_successfully_transferred_files():
    • Does it properly check Globus task status before deletion?
    • Does it handle both Globus and HPSS transfers correctly?
    • Are files only deleted once (empty file_paths list after deletion)?

2. Global Variable Elimination

Check that these global variables are properly replaced:

  • remote_endpoint → globus_config.remote_endpoint
  • local_endpoint → globus_config.local_endpoint
  • transfer_client → globus_config.transfer_client
  • transfer_data → batch.transfer_data
  • task_id → batch.task_id
  • archive_directory_listing → globus_config.archive_directory_listing
  • global_variable_tarfiles_pushed → transfer_manager.cumulative_tarfiles_pushed
  • prev_transfers, curr_transfers → removed (logic now in TransferManager)
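The mapping above suggests the shape of the replacement objects. A minimal sketch of what they might look like as plain data holders (field names are taken from the mapping; defaults, types, and the dataclass layout are assumptions, not the PR's actual code):

```python
from dataclasses import dataclass, field
from typing import Any, List, Optional


@dataclass
class GlobusConfig:
    # Former module-level globals, now instance attributes.
    remote_endpoint: Optional[str] = None
    local_endpoint: Optional[str] = None
    transfer_client: Optional[Any] = None  # a globus_sdk.TransferClient
    archive_directory_listing: Optional[Any] = None


@dataclass
class TransferBatch:
    # Former globals transfer_data / task_id now live on each batch.
    transfer_data: Optional[Any] = None  # a globus_sdk.TransferData
    task_id: Optional[str] = None
    task_status: str = "UNKNOWN"
    file_paths: List[str] = field(default_factory=list)


@dataclass
class TransferManager:
    globus_config: Optional[GlobusConfig] = None
    batches: List[TransferBatch] = field(default_factory=list)
    cumulative_tarfiles_pushed: int = 0
```

With this shape, every function that previously mutated module globals instead receives a `TransferManager` (or reads `manager.globus_config`), which is what makes the state flow reviewable.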

3. TransferManager Threading

  • Is a single TransferManager instance created in create() and passed through all functions?
  • Is the same instance used in update()?
  • Does hpss_get() create its own instance (acceptable since it's a separate operation)?

4. Batch Creation and Management (hpss.py::hpss_transfer())

  • Are batches created at the right time (before adding files)?
  • Are files added to the current batch correctly?
  • Is index.db excluded from deletion tracking (is_index check)?
  • Does the code handle the --keep flag correctly (never track files when keep=True)?

5. Transfer Submission (globus.py::globus_transfer())

  • After submitting a transfer, is the batch updated with task_id and task_status?
  • Is transfer_data set to None after submission?
  • Does the function handle the case where a previous transfer is still ACTIVE?
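The post-submission bookkeeping in this checklist can be sketched as a small helper (illustrative only; the function name is invented, and the `task_doc` keys follow the Globus SDK's submit_transfer response, which includes "task_id" and a "code" field):

```python
def record_submission(batch, task_doc):
    """After submit_transfer succeeds, record the task on the batch and
    drop the TransferData so the same batch is never submitted twice."""
    batch.task_id = task_doc["task_id"]
    batch.task_status = task_doc.get("code", "UNKNOWN")
    batch.transfer_data = None  # submitted; must not be reused
```

Setting `transfer_data` to None after submission is the invariant item 5 asks the reviewer to verify.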

6. Deletion Trigger Points

Review where delete_successfully_transferred_files() is called:

  • After each transfer (hpss.py::hpss_transfer()): Only when keep=False
  • At finalization (globus.py::globus_finalize()): After all transfers complete
  • Are these the right trigger points for both blocking and non-blocking modes?

7. Blocking vs Non-Blocking Behavior

Blocking mode:

  • Does globus_transfer() wait for completion via globus_block_wait()?
  • Are files deleted immediately after each successful transfer?
  • Does globus_finalize() still work correctly (redundant waits are harmless)?

Non-blocking mode:

  • Does globus_transfer() skip the blocking wait?
  • Does delete_successfully_transferred_files() check task status non-blockingly?
  • Are files deleted when status is checked and found to be SUCCEEDED?
  • Does globus_finalize() wait for all transfers before final cleanup?

8. Finalization Logic (globus.py::globus_finalize())

  • Does it handle pending transfer_data that hasn't been submitted?
  • Does it wait for the most recent transfer to complete?
  • Does it wait for the last task (if different from most recent)?
  • Does it avoid waiting twice on the same task_id?
  • Does it call delete_successfully_transferred_files() at the end?

9. Error Handling

  • Are Optional types used correctly for TransferManager, GlobusConfig, etc.?
  • Are None checks in place before accessing attributes?
  • Does hpss_get() create necessary objects when they don't exist?

10. Test Coverage (test_globus_tar_deletion.bash)

  • New test function: test_globus_progressive_deletion()
    • Creates ~2GB of test files to trigger multiple tars
    • Verifies multiple tar archives are created
    • Checks for deletion events during the run
    • Verifies no tar files remain in source
    • Verifies all tar files exist in destination
  • Are both blocking and non-blocking progressive deletion tests run?
  • Does the test correctly differentiate expected deletion behavior between modes?

11. Logging and Debugging

  • Are there sufficient debug logs to track batch creation and file tracking?
  • Are deletion events logged clearly?
  • Is the -v flag added to test commands for verbose output?

12. Edge Cases

  • What happens if a transfer fails? (Files won't be deleted - correct)
  • What happens if --keep is True? (Files never tracked for deletion - correct)
  • What happens with index.db? (Never deleted - correct via is_index flag)
  • What happens with multiple tar files in one run? (Progressive deletion should work)
  • What happens if the last transfer is still pending at finalization? (Should wait)

13. Code Cleanup

  • Are all commented-out debug statements removed or uncommented appropriately?
  • Is the old global variable cleanup code removed from hpss_transfer()?
  • Are all function signatures updated with transfer_manager parameter?

14. Backward Compatibility

  • Does the change affect HPSS (non-Globus) transfers? (Should still work)
  • Does it work with --keep flag? (Yes, files never tracked)
  • Does it work in both blocking and non-blocking modes? (Yes, different deletion timing)

Specific Potential Issues to Check

Critical Path Review

  1. File Addition Flow:

    hpss_transfer() → creates batch → adds file to batch.file_paths → 
    globus_transfer() → submits → sets batch.task_id
    
  2. Deletion Flow (Blocking):

    globus_transfer() → globus_block_wait() → returns SUCCEEDED →
    hpss_transfer() → delete_successfully_transferred_files() → checks status → deletes
    
  3. Deletion Flow (Non-blocking):

    globus_transfer() → returns early (not SUCCEEDED) →
    later: delete_successfully_transferred_files() → checks status → if SUCCEEDED, deletes
    

Potential Race Conditions

  • Can a batch be submitted while another is being checked? (Should be sequential)
  • Can files be deleted while still being transferred? (No, status check prevents this)

Memory Leaks

  • Are transfer_data objects properly cleaned up after submission?
  • Are batch file_paths lists cleared after deletion?

Testing Checklist

  • Run basic blocking keep test
  • Run basic blocking non-keep test
  • Run basic non-blocking keep test
  • Run basic non-blocking non-keep test
  • Run progressive deletion blocking test
  • Run progressive deletion non-blocking test
  • Verify no tar files remain in source (non-keep cases)
  • Verify tar files exist in destination
  • Check logs for deletion events


forsyth2 commented Jan 6, 2026

Follow-up:

Performance Review for Progressive Tar File Deletion

Performance Concerns to Review

1. Status Check Overhead

Current Implementation:

def delete_successfully_transferred_files(self):
    for batch in self.batches:
        if batch.is_globus and batch.task_id and (batch.task_status != "SUCCEEDED"):
            if self.globus_config and self.globus_config.transfer_client:
                task = self.globus_config.transfer_client.get_task(batch.task_id)
                batch.task_status = task["status"]

Issues:

  • O(n) status checks: Iterates through ALL batches every time delete_successfully_transferred_files() is called
  • Redundant API calls: Checks status of already-succeeded batches (though the != "SUCCEEDED" check prevents the API call)
  • Multiple calls per file: Called after EVERY hpss_put() in non-blocking mode

Impact:

  • For a run with 100 tar files, this could mean 100+ iterations through the batch list
  • Each call to Globus API adds latency (~100-500ms per call)

Potential Optimizations:

  • Only check batches that haven't been processed yet (already done via if not batch.file_paths: continue)
  • Add index tracking: self.last_checked_batch_index to avoid re-checking old batches
  • Batch status checks: collect multiple task_ids and check them together (if Globus SDK supports it)
  • Rate limit checks: only check every N seconds or every N file additions

2. Batch List Growth

Current Implementation:

self.batches: List[TransferBatch] = []
# Grows unbounded throughout the run

Issues:

  • Memory growth: For runs with 1000+ tar files, this list grows to 1000+ items
  • Iteration overhead: Each delete_successfully_transferred_files() iterates the entire list

Potential Optimizations:

  • Clear processed batches: Remove batches from list once files are deleted
  • Use a deque with max length
  • Separate "pending" and "completed" lists

3. File Deletion Performance

Current Implementation:

def delete_files(self):
    for src_path in self.file_paths:
        if os.path.exists(src_path):
            os.remove(src_path)

Issues:

  • Sequential deletion: Deletes files one at a time
  • Redundant existence checks: os.path.exists() adds a syscall per file

Impact:

  • For tar files (large files), deletion is I/O bound, so sequential is probably fine
  • But the os.path.exists() check is wasteful if we already know the file exists

Potential Optimizations:

  • Remove the os.path.exists() check and handle exceptions instead
  • Consider bulk deletion if filesystem supports it
  • Log warning instead of failing if file doesn't exist

4. Non-Blocking Mode Efficiency

Current Behavior:

# After EVERY hpss_put() call:
if not keep:
    transfer_manager.delete_successfully_transferred_files()

Issues:

  • Excessive checking: In non-blocking mode, checks status after every single tar file is queued
  • Wasted API calls: Most checks will return "ACTIVE" or "PENDING"
  • No batching benefit: Defeats the purpose of batching transfers

Example:

  • Add tar 1 → check status (PENDING)
  • Add tar 2 → check status (ACTIVE)
  • Add tar 3 → check status (ACTIVE)
  • Add tar 4 → check status (ACTIVE)
  • Add tar 5 → check status (SUCCEEDED for tar 1, ACTIVE for 2-4)

This means 5 status checks when only 1-2 would be needed.

Potential Optimizations:

  • Throttle checks: Only check every N tar files or every M seconds
  • Check only recent batches: Don't iterate through all old batches every time
  • Progressive threshold: Only check if X tar files have accumulated

5. Globus Transfer Batching

Current Implementation:

# Creates new batch if last one was submitted
if not transfer_manager.batches or transfer_manager.batches[-1].task_id:
    new_batch = TransferBatch()

Questions:

  • Batch size: How many files are in each TransferData before submission?
  • Submission trigger: When is a batch actually submitted to Globus?
  • Optimal batch size: Is there a maximum batch size for Globus transfers?

Looking at the code:
The batch submission happens in globus_transfer(), but it's called after EVERY hpss_put(). This means:

  • One file per transfer?: Each tar file might be its own transfer task
  • No actual batching?: The batch tracking is for deletion, not for combining transfers

Potential Optimizations:

  • Accumulate multiple tar files before submitting to Globus
  • Submit every N files or when total size reaches threshold
  • Use Globus's native batching capabilities more effectively

6. Test Performance Impact

New Progressive Deletion Tests:

dd if=/dev/zero of=zstash_demo/file1.dat bs=1M count=700  # 700 MB
dd if=/dev/zero of=zstash_demo/file2.dat bs=1M count=700  # 700 MB  
dd if=/dev/zero of=zstash_demo/file3.dat bs=1M count=700  # 700 MB

Issues:

  • 2.1 GB of test data: Takes significant time to create and transfer
  • CI/CD impact: Will these tests timeout in automated testing?
  • Disk space: Requires sufficient space in test environment

Recommendations:

  • Document expected test duration
  • Consider making these optional or only running on specific platforms
  • Add timeout configuration
  • Clean up test files afterwards

7. Finalization Performance

Current Implementation:

def globus_finalize(transfer_manager: TransferManager, non_blocking: bool = False):
    # Submit any pending transfer_data
    # Wait for most recent transfer
    # Wait for last task (if different)
    # Delete successfully transferred files

Issues:

  • Double wait: Potentially waits for same task_id twice (though skip_last_wait mitigates)
  • Blocking at end: Even in non-blocking mode, finalization blocks on all transfers

Questions:

  • Is the finalization wait necessary? (Probably yes, to ensure index.db transfer completes)
  • Can we return earlier in non-blocking mode? (No, because index.db must complete)

Recommended Performance Improvements

Priority 1: High Impact, Low Effort

  1. Optimize batch iteration:
def delete_successfully_transferred_files(self):
    # Only check batches that haven't been processed yet
    batches_to_check = [b for b in self.batches if b.file_paths]  # Has files to delete
    
    for batch in batches_to_check:
        # ... existing logic
  1. Remove redundant os.path.exists():
def delete_files(self):
    for src_path in self.file_paths:
        try:
            os.remove(src_path)
        except FileNotFoundError:
            logger.warning(f"File already deleted: {src_path}")
  1. Throttle status checks in non-blocking mode:
# In hpss.py::hpss_transfer()
if not keep:
    # Only check every 5 files or if this is the last file
    if (transfer_manager.cumulative_tarfiles_pushed % 5 == 0) or is_last_file:
        transfer_manager.delete_successfully_transferred_files()

Priority 2: Medium Impact, Medium Effort

  1. Track last checked batch:
class TransferManager:
    def __init__(self):
        self.batches: List[TransferBatch] = []
        self.last_deletion_check_index: int = 0  # New field
    
    def delete_successfully_transferred_files(self):
        # Only check batches from last_deletion_check_index forward
        for i in range(self.last_deletion_check_index, len(self.batches)):
            batch = self.batches[i]
            # ... check and delete logic
            if batch.file_paths == []:  # Processed
                self.last_deletion_check_index = i + 1
  1. Clear old batches to prevent memory growth:
def delete_successfully_transferred_files(self):
    # ... existing logic
    
    # Remove fully processed batches
    self.batches = [b for b in self.batches if b.file_paths or not b.task_id]

Priority 3: Lower Priority / More Investigation Needed

  1. Consider time-based throttling:
class TransferManager:
    def __init__(self):
        self.last_status_check_time: float = 0
    
    def delete_successfully_transferred_files(self):
        now = time.time()
        if now - self.last_status_check_time < 30:  # Don't check more than every 30s
            return
        self.last_status_check_time = now
        # ... existing logic
  1. Investigate actual Globus batching:
    • Review how TransferData accumulates files
    • Ensure multiple tar files are combined into single transfer tasks when possible
    • This might already be working correctly; needs verification

Performance Testing Checklist

  • Profile a run with 100+ tar files in non-blocking mode
  • Count number of Globus API calls vs number of tar files
  • Measure memory usage growth over long runs
  • Time the delete_successfully_transferred_files() function
  • Check if status checks are the bottleneck or file I/O is
  • Test with different batch sizes and throttling parameters
  • Verify the progressive deletion tests don't exceed reasonable timeouts

Questions to Answer

  1. What's the typical number of tar files in a real zstash run?

    • 10s? 100s? 1000s?
    • Determines urgency of optimizations
  2. What's the acceptable overhead?

    • If transfers take hours, a few extra seconds of status checks is negligible
    • If transfers take minutes, overhead becomes significant
  3. Is the current implementation already good enough?

    • The batch.file_paths check prevents processing old batches
    • The != "SUCCEEDED" check prevents redundant API calls
    • Maybe performance is already acceptable?
  4. Do we need metrics?

    • Add counters for number of API calls
    • Track time spent in deletion checks
    • Log performance statistics at end of run
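If metrics do turn out to be needed, a lightweight counter could answer questions 1-3 empirically. A sketch (hypothetical helper; `DeletionMetrics` and `timed_check` are invented names, not part of zstash):

```python
import time


class DeletionMetrics:
    """Counts status-check API calls and time spent in deletion passes,
    intended to be logged once at the end of a run."""

    def __init__(self):
        self.api_calls = 0
        self.seconds_in_checks = 0.0

    def timed_check(self, fn, *args):
        """Run one status check, recording count and wall-clock cost."""
        start = time.monotonic()
        try:
            return fn(*args)
        finally:
            self.api_calls += 1
            self.seconds_in_checks += time.monotonic() - start

    def summary(self):
        return (f"deletion checks: {self.api_calls} API calls, "
                f"{self.seconds_in_checks:.2f}s total")
```

Wrapping each `get_task` call in `timed_check` would show directly whether status checks are negligible relative to transfer time.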

@forsyth2 forsyth2 force-pushed the issue-374-refactor-tar-deletion branch from 064029c to ed3248d Compare March 4, 2026 21:37

forsyth2 commented Mar 4, 2026

I've rebased off the latest main and added a fix for a pre-commit issue introduced while resolving conflicts with main.

I ran the tar deletion test and got:

==========================================
TEST RESULTS
==========================================
✓ blocking_non-keep PASSED
✓ non-blocking_non-keep PASSED
✓ blocking_keep PASSED
✓ non-blocking_keep PASSED
✓ blocking_progressive_deletion PASSED
✓ non-blocking_progressive_deletion PASSED
==========================================
TEST SUMMARY
==========================================
Total tests: 6
Passed: 6
Failed: 0
==========================================
All globus tar deletion tests completed successfully.

The AI-generated reviews above are excessively comprehensive, so I will tag Copilot to review this PR and see what it deems relevant.


Copilot AI left a comment


Pull request overview

Refactors Globus transfer handling to ensure local tar files are deleted after successful transfers even when --non-blocking is used, addressing issue #374.

Changes:

  • Introduces a TransferManager / TransferBatch model to track submitted transfers and associated local files for later deletion.
  • Threads the transfer manager through create/update → add_files → hpss_put → globus_transfer, and performs deletion based on task status checks.
  • Expands the Globus tar deletion integration test coverage (including a new “progressive deletion” scenario).

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 5 comments.

Summary per file:

  • zstash/update.py: Wires TransferManager through the update flow and into finalization.
  • zstash/transfer_tracking.py: New module defining transfer tracking data structures and deletion logic.
  • zstash/hpss_utils.py: Passes TransferManager into hpss_put from the tar creation loop.
  • zstash/hpss.py: Refactors deletion logic to rely on TransferManager rather than prior global lists.
  • zstash/globus.py: Refactors Globus state handling to use TransferManager/GlobusConfig instead of module globals.
  • zstash/create.py: Wires TransferManager through the create flow and into finalization.
  • tests/integration/bash_tests/run_from_any/test_globus_tar_deletion.bash: Adds verbose logging and a new progressive-deletion integration test scenario.
Comments suppressed due to low confidence (1)

zstash/hpss.py:15

  • The refactor leaves prev_transfers/curr_transfers as unused globals. Since deletion is now handled via TransferManager, keeping these around is confusing and may trigger linting warnings. Consider removing them if they are no longer part of the design.
prev_transfers: List[str] = list()
curr_transfers: List[str] = list()


zstash/globus.py Outdated
Comment on lines +124 to +128
  if mrt.task_status == "ACTIVE":
      logger.info(
-         f"{ts_utc()}: Previous task_id {task_id} Still Active. Returning ACTIVE."
+         f"{ts_utc()}: Previous task_id {mrt.task_id} Still Active. Returning ACTIVE."
      )
      return "ACTIVE"
  elif prev_task_status == "SUCCEEDED":
      # Don't return early - continue to submit the new transfer

Copilot AI Mar 4, 2026


The ACTIVE-task handling no longer returns early; it now proceeds to submit a new transfer while the previous one is still active. This changes behavior from “single in-flight task” to potentially many concurrent tasks, which can hit Globus service limits and complicate operational debugging. If concurrency is not intended, consider restoring the early return/queueing behavior or adding an explicit cap.

forsyth2 (Collaborator Author) replied:

This is the expected behavior of --non-blocking, not an issue.

Comment on lines 209 to +213
  if transfer_type == "put":
      return task_status

  if transfer_type == "get" and task_id:
-     globus_wait(task_id)
+     globus_wait(transfer_manager.globus_config.transfer_client, task_id)

Copilot AI Mar 4, 2026


globus_transfer() returns task_status, but that variable is initialized to "UNKNOWN" and never updated from the actual task result (blocking mode only updates new_mrt.task_status). As written, the function will always return "UNKNOWN" for both put and get. Consider returning new_mrt.task_status (or setting task_status from the wait result) so callers can rely on the return value.

zstash/create.py Outdated
if args.follow_symlinks:
raise Exception("Archive creation failed due to broken symlink.")
else:
raise e

Copilot AI Mar 4, 2026


Catching FileNotFoundError as e and then using raise e drops the original traceback context. Use a bare raise to re-raise the original exception (or wrap it with raise ... from e if you need a different exception type).

Suggested change:
-     raise e
+     raise

zstash/update.py Outdated
if args.follow_symlinks:
raise Exception("Archive update failed due to broken symlink.")
else:
raise e

Copilot AI Mar 4, 2026


Catching FileNotFoundError as e and then using raise e drops the original traceback context. Use a bare raise to re-raise the original exception (or raise ... from e if you need to add context).

Suggested change:
-     raise e
+     raise

zstash/globus.py Outdated
Comment on lines +357 to +375
# In non-blocking mode, this ensures index.db and any accumulated tar files complete
# In blocking mode, this is redundant but harmless
skip_last_wait: bool = False
if transfer and transfer.task_id:
if transfer.task_id == last_task_id:
skip_last_wait = (
True # No reason to call globus_wait twice on the same task_id
)
logger.info(
f"{ts_utc()}: Waiting for transfer task_id={transfer.task_id} to complete"
)
globus_wait(
transfer_manager.globus_config.transfer_client, transfer.task_id
)
if last_task_id and (not skip_last_wait):
logger.info(
f"{ts_utc()}: Waiting for last transfer task_id={last_task_id} to complete"
)
globus_wait(transfer_manager.globus_config.transfer_client, last_task_id)

Copilot AI Mar 4, 2026


non_blocking is passed into globus_finalize(), but the finalization path still waits for task completion (globus_wait) regardless. This makes --non-blocking effectively block at the end and can dramatically increase runtime for large transfers. Consider skipping the wait logic when non_blocking=True (or rename/remove the flag if always-wait is intended).

Suggested change (original code, followed by the proposed replacement):
# In non-blocking mode, this ensures index.db and any accumulated tar files complete
# In blocking mode, this is redundant but harmless
skip_last_wait: bool = False
if transfer and transfer.task_id:
if transfer.task_id == last_task_id:
skip_last_wait = (
True # No reason to call globus_wait twice on the same task_id
)
logger.info(
f"{ts_utc()}: Waiting for transfer task_id={transfer.task_id} to complete"
)
globus_wait(
transfer_manager.globus_config.transfer_client, transfer.task_id
)
if last_task_id and (not skip_last_wait):
logger.info(
f"{ts_utc()}: Waiting for last transfer task_id={last_task_id} to complete"
)
globus_wait(transfer_manager.globus_config.transfer_client, last_task_id)
# In non-blocking mode, skip waiting here so the caller can return immediately.
# In blocking mode, this ensures index.db and any accumulated tar files complete.
if non_blocking:
logger.info(
f"{ts_utc()}: Non-blocking mode: skipping wait for Globus "
f"transfer completion during finalize"
)
else:
skip_last_wait: bool = False
if transfer and transfer.task_id:
if transfer.task_id == last_task_id:
# No reason to call globus_wait twice on the same task_id
skip_last_wait = True
logger.info(
f"{ts_utc()}: Waiting for transfer task_id={transfer.task_id} "
f"to complete"
)
globus_wait(
transfer_manager.globus_config.transfer_client, transfer.task_id
)
if last_task_id and (not skip_last_wait):
logger.info(
f"{ts_utc()}: Waiting for last transfer "
f"task_id={last_task_id} to complete"
)
globus_wait(
transfer_manager.globus_config.transfer_client, last_task_id
)

forsyth2 (Collaborator Author) replied:

They should be waiting at the end. globus_finalize is called last by create and update -- Globus transfers should complete before zstash returns.


forsyth2 commented Mar 5, 2026

Addressed relevant code review comments. Confirmed all 6 tar deletion tests still pass. Remaining action items:

  • High-level review of the self-review guide and performance review guide above for any issues that are particularly important to address.
  • Visually review code changes myself
  • Run entire test suite
  • Code review


forsyth2 commented Mar 5, 2026

This is a very-high level pseudo-code algorithm of what is happening with the tars:

create.py > create()
  create.py > create_database()
    hpss_utils.py > add_files()
      For each tar:
        hpss.py > hpss_put()
          hpss.py > hpss_transfer()
            If no batches, create new batch
            Add file to batch
            globus.py > globus_transfer()
              globus_utils.py > submit_transfer_with_checks()
              If blocking (not non_blocking):
                globus.py > globus_block_wait()
            transfer_tracking.py > TransferManager.delete_successfully_transferred_files()
              Reset batches to be only batches with associated files
              For any remaining batches that now have the SUCCESS status:
                Delete the associated files
        Add tar into the database


forsyth2 commented Mar 6, 2026

Confirmed all tests pass on Chrysalis and Perlmutter

@forsyth2 forsyth2 marked this pull request as ready for review March 6, 2026 00:28

forsyth2 commented Mar 6, 2026

@TonyB9000 If you have the time, this could use at least a high-level code review. I also plan to go over it briefly at Monday's technical discussion.

The issue being resolved is #374, which notified us that --non-blocking without --keep wasn't deleting successfully transferred tars. This of course can cause excessive disk space usage.

Here, I've refactored the code to remove global variables and clarify the logic flow (i.e., using TransferManager). I've outlined the very basic pseudocode above.

The test was already added in #404 (which confirms on the main branch that of the 4 possible combinations of blocking & keep, only the non-blocking & non-keep combination behaves incorrectly). This PR adds testing to confirm tars are deleted along the way rather than at the very end.

@chengzhuzhang @golaz ^Just for your awareness. I will discuss more at our next meeting.


TonyB9000 commented Mar 6, 2026

@forsyth2 First a general comment: We never interrupt a tar-file in the middle of tar-file formation. So for clarity, I would employ a fully separate "make_tarfile()" that returns when a size-threshold is reached, and THEN submit that tar-file to a process whose return behavior depends upon "BLOCKING", "transfer success" etc. It bothers me that "add_files()" does FAR MORE than just "add_files". Properly, it could be called "conduct_all_processing()".

The "pseudocode" you supply (really, a call-sequence outline) is thus a bit obscure as well.

My translation (up to the point of my understanding):

We are creating a remote archive:

	Call create()
		Calls create_database()
			Calls add_files()

Now, the innocent looking “add_files()” is actually a very complex and extensive routine.
It performs as follows:

	Given a body of files that can easily span many tar-files and transfers, it adds files to a
        “new or current” tar file until a size-threshold would be breached (or no files remain to be added).
	The completed tar-file is then submitted to

		hpss_put()
			calls hpss_transfer()
				adds tar-file to “batch”
				(conditionally) submits “batch” to globus_transfer()

	Now, here is the conditional part:

	If NON-BLOCKING:  THEN
		If globus transfer is BUSY with a pre-existing batch transfer, we can “store up” tarfiles for transfer
                in a new batch and return immediately to produce more tar-files.
		If globus transfer is NOT BUSY, this tar file is added to the existing batch and that batch
                is submitted for transfer.  (Again, we return immediately to produce more tar-files.)

	If BLOCKING: THEN
		We KNOW (supposedly) that globus is not busy, or else we’d be unable to create and submit a tar-file. 
                So, we “batch” exactly ONE tarfile, and submit it to transfer.
                THEN we call "globus_block_wait()"
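The conditional described above can be sketched as follows. This is an illustrative standalone sketch, not the actual zstash code: `TransferBatch`, `handle_completed_tar`, and `transfer_is_busy` are made-up names, and `transfer_is_busy` stands in for checking the pre-existing batch's task status.

```python
# Illustrative sketch of the blocking/non-blocking branch described above.
# None of these names are the actual zstash API.
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class TransferBatch:
    file_paths: List[str] = field(default_factory=list)
    task_id: Optional[str] = None  # set once the batch is submitted


def handle_completed_tar(
    tarfile_path: str,
    batches: List[TransferBatch],
    non_blocking: bool,
    transfer_is_busy: bool,
) -> str:
    """Decide what to do with a freshly completed tar file."""
    if not batches or batches[-1].task_id:
        # No batch yet, or the last one was already submitted: start a new one.
        batches.append(TransferBatch())
    batches[-1].file_paths.append(tarfile_path)

    if non_blocking and transfer_is_busy:
        # Store up tar files in the pending batch; return to make more tars.
        return "stored"
    # Otherwise submit the pending batch now.
    batches[-1].task_id = "submitted-task"
    # In blocking mode, the caller would now block-wait on this task,
    # so each batch holds exactly one tar file.
    return "submitted" if non_blocking else "submitted-and-waiting"
```

Note that in blocking mode every call submits its batch, so the next call always starts a fresh batch, which matches "we 'batch' exactly ONE tarfile" above.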

Here is what LivChat says about "globus_block_wait()":

Specifically, TransferClient.task_wait(task_id, timeout=..., polling_interval=...):

    Returns True if the task reached a terminal state within the timeout (SUCCEEDED, FAILED, CANCELED).
    Returns False if the timeout expires and the task is still active.

So it is “transfer finished (success or failure)”

Also:

Also, you only break on SUCCEEDED. If the task fails quickly, you will keep looping until you exhaust retries, and you will end up returning EXHAUSTED_TIMEOUT_RETRIES even though the task actually finished as FAILED or CANCELED.

The AI suggests the following improvement:

done = transfer_client.task_wait(task_id, timeout=wait_timeout, polling_interval=polling_interval)
if done:
    curr_task = transfer_client.get_task(task_id)
    return curr_task["status"]
# else, timed out and still running, continue loop until you exhaust retries.
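Expanded into a self-contained sketch: the stub client below stands in for `globus_sdk.TransferClient` (whose `task_wait(task_id, timeout=..., polling_interval=...)` and `get_task(task_id)` methods are real), so the fixed retry loop can be exercised without a live transfer. `block_wait` and `StubTransferClient` are illustrative names.

```python
# Sketch of the suggested fix: stop polling as soon as the task reaches ANY
# terminal state (SUCCEEDED, FAILED, CANCELED), not just SUCCEEDED.
class StubTransferClient:
    """Stand-in for globus_sdk.TransferClient: reports the task as still
    running for a few polls, then as terminal with final_status."""

    def __init__(self, final_status: str, polls_until_done: int):
        self.final_status = final_status
        self.polls_remaining = polls_until_done

    def task_wait(self, task_id, timeout=10, polling_interval=1):
        # The real task_wait returns True once the task is terminal,
        # else False after the timeout expires.
        self.polls_remaining -= 1
        return self.polls_remaining <= 0

    def get_task(self, task_id):
        return {"status": self.final_status}


def block_wait(client, task_id, max_retries=5):
    for _ in range(max_retries):
        if client.task_wait(task_id, timeout=10, polling_interval=1):
            # Terminal: return the actual status instead of looping
            # until retries are exhausted.
            return client.get_task(task_id)["status"]
    return "EXHAUSTED_TIMEOUT_RETRIES"
```

With this shape, a task that fails quickly returns "FAILED" immediately rather than burning through all the retries.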

I have yet to understand, thoroughly, the functionality of TransferManager.delete_successfully_transferred_files().

It appears to be called only upon RETURN from globus_transfer. The code manipulates "batches" - but do we ever have more than 3 batches (one subject to transfer, and one storing up new tar-files, and one whose transfer is completed - successful or otherwise)? If BLOCKING, this would only be 2, as we never "store up" tar-files.

@forsyth2
Collaborator Author

forsyth2 commented Mar 6, 2026

Thanks for the thorough review @TonyB9000. I'm working my way through your suggestions. You're right that zstash could use a lot of refactoring for code readability. I'm trying to pick low-hanging fruit there.

In the process of reviewing logic flows, I noticed what appears to be an error in the code that has persisted for several years: see https://github.com/E3SM-Project/zstash/pull/10/changes#r2898357541

@forsyth2
Collaborator Author

I think maybe the best thing for clarity would be to pull out the create-if-doesn't-exist part to before calling the function.

Oh wait, I see now the creation very much relies on variables defined in that function. I guess we're just going to have to go with a complicated function name.

@forsyth2 forsyth2 force-pushed the issue-374-refactor-tar-deletion branch 2 times, most recently from 17286e1 to 67c880a Compare March 30, 2026 20:02
@forsyth2 forsyth2 force-pushed the issue-374-refactor-tar-deletion branch from 67c880a to f757094 Compare March 30, 2026 20:04
@forsyth2
Collaborator Author

@TonyB9000 Ok how's this new commit look?

@TonyB9000
Collaborator

I'll do another pull so I can see everything in context.

Long ago, all of the remote-transfer operations occurred under a function called "add_files". Now, it is all embedded under a function called "construct_tars()". It works, but just feels weird. It is like embedding a process that creates humans, all under a function called "adjust_atoms()".

@forsyth2
Collaborator Author

forsyth2 commented Apr 3, 2026

@TonyB9000 Responding to your email from yesterday, 4/2:

At the top of hpss_transfer(), we ensure that the transfer_manager has at least one (deletion) batch, (and we add the tarfile to it if it is ok to eventually delete it.

hpss.hpss_transfer():

    if not transfer_manager.batches or transfer_manager.batches[-1].task_id:
        # Either no batches exist, or the last batch was already submitted
        new_batch = TransferBatch()
        new_batch.is_globus = scheme == "globus"
        transfer_manager.batches.append(new_batch)
        logger.debug(
            f"{ts_utc()}: Created new TransferBatch, total batches: {len(transfer_manager.batches)}"
        )

mrb = latest_deletion_batch   (from the transfer_manager, since it MUST exist)

globus.globus_transfer():

mrb: Optional[TransferBatch] = transfer_manager.get_most_recent_batch()

Correct, it must exist.

update_cumulative_tarfiles_pushed(transfer_manager, transfer_data)

globus.globus_transfer():

update_cumulative_tarfiles_pushed(transfer_manager, transfer_data)

task = submit_transfer_with_checks()

globus.globus_transfer():

        task = submit_transfer_with_checks(
            transfer_manager.globus_config.transfer_client, transfer_data
        )

Did we obtain a new “transfer_manager”? At the top, we KNEW that our transfer_manager had a “mrb”, but here we test if it has “batches”.

        # Update the current batch with the task info
        # The batch was already created in hpss_transfer with files added to it
        # We just need to mark it as submitted
        if transfer_manager.batches:
            # Update these two fields of the most recent batch
            # (which is still available in this function as `mrb`).
            transfer_manager.batches[-1].task_id = task_id
            transfer_manager.batches[-1].task_status = TaskStatus.SUBMITTED

No, there is no new transfer manager. We do in fact know we have batches -- hence the else block:

            # This block should be impossible to reach.
            # By now, we've ensured that `get_most_recent_batch()` returns a batch,
            # and we haven't removed any batches since then,
            # so there should always be at least one batch in `batches`.
            error_str = "transfer_manager has no batches"
            logger.error(error_str)
            raise RuntimeError(error_str)

Now, you wonder, why bother with this check then? We do so because this is the first time we reference transfer_manager.batches in globus_transfer(), and that parameter has type List[TransferBatch], which could be the empty list; the check just confirms that batches[-1] does in fact exist. (I can't remember if the type checker mypy cares about this, though.) It's really just an acknowledgement that lists in general can be empty (but this one shouldn't be, since we did check the batches earlier!)

I am having trouble telling when transfer_managers come and go.

TransferManager is constructed exactly once per zstash call. Unlike the batch list, which is used only to track files to delete, the TransferManager is always needed.

git grep -n "TransferManager("
# zstash/create.py:57:    transfer_manager: TransferManager = TransferManager()
# zstash/hpss.py:26:        transfer_manager = TransferManager()
# zstash/hpss.py:184:        transfer_manager = TransferManager()
# zstash/update.py:25:    transfer_manager = TransferManager()

create and update construct their TransferManager object themselves because they use it in several places besides the hpss.hpss_get() and hpss.hpss_put() calls.

hpss.hpss_get():

    if not transfer_manager:
        transfer_manager = TransferManager()
git grep -n "hpss_get("
# zstash/extract.py:178:            hpss_get(hpss, get_db_filename(cache), cache)
# zstash/extract.py:554:                        hpss_get(hpss, tfname, cache)
# zstash/hpss.py:173:def hpss_get(
# zstash/ls.py:105:                hpss_get(hpss, get_db_filename(cache), cache)
# zstash/update.py:164:            hpss_get(hpss, get_db_filename(cache), cache, transfer_manager)

So, hpss.hpss_get() is used by extract, ls, and update. Of those, only update passes in a transfer_manager. So, the other cases require construction of the TransferManager.

hpss.hpss_transfer():

    if not transfer_manager:
        transfer_manager = TransferManager()
git grep -n "hpss_transfer("
# zstash/hpss.py:15:def hpss_transfer(
# zstash/hpss.py:161:    hpss_transfer(
# zstash/hpss.py:187:    hpss_transfer(

hpss.hpss_transfer() is called in 2 spots. One is hpss.hpss_get() which is discussed above -- it will always pass in a non-None value for transfer_manager. The other is hpss.hpss_put():

git grep -A 8 -n "hpss_put("
# zstash/create.py:97:    logger.debug(f"{ts_utc()}: calling hpss_put() for {get_db_filename(cache)}")
# zstash/create.py:98:    hpss_put(
# zstash/create.py-99-        hpss,
# zstash/create.py-100-        get_db_filename(cache),
# zstash/create.py-101-        cache,
# zstash/create.py-102-        keep=args.keep,
# zstash/create.py-103-        is_index=True,
# zstash/create.py-104-        transfer_manager=transfer_manager,
# zstash/create.py-105-    )
# zstash/create.py-106-
# --
# zstash/hpss.py:149:def hpss_put(
# zstash/hpss.py-150-    hpss: str,
# zstash/hpss.py-151-    file_path: str,
# zstash/hpss.py-152-    cache: str,
# zstash/hpss.py-153-    keep: bool = True,
# zstash/hpss.py-154-    non_blocking: bool = False,
# zstash/hpss.py-155-    is_index=False,
# zstash/hpss.py-156-    transfer_manager: Optional[TransferManager] = None,
# zstash/hpss.py-157-):
# --
# zstash/hpss_utils.py:139:        hpss_put(
# zstash/hpss_utils.py-140-            hpss,
# zstash/hpss_utils.py-141-            os.path.join(cache, self.tfname),
# zstash/hpss_utils.py-142-            cache,
# zstash/hpss_utils.py-143-            keep,
# zstash/hpss_utils.py-144-            non_blocking,
# zstash/hpss_utils.py-145-            is_index=False,
# zstash/hpss_utils.py-146-            transfer_manager=transfer_manager,
# zstash/hpss_utils.py-147-        )
# --
# zstash/update.py:39:    hpss_put(
# zstash/update.py-40-        hpss,
# zstash/update.py-41-        get_db_filename(cache),
# zstash/update.py-42-        cache,
# zstash/update.py-43-        keep=args.keep,
# zstash/update.py-44-        is_index=True,
# zstash/update.py-45-        transfer_manager=transfer_manager,
# zstash/update.py-46-    )
# zstash/update.py-47-

So, hpss.hpss_put() is used by create and update, both of which pass in a transfer_manager. The remaining occurrence is the call by hpss_utils.TarWrapper.process_tars(), which is ultimately used, again, by both create and update.

You can see that unlike hpss.hpss_get(), hpss.hpss_put() is always passed a non-None transfer_manager, so really we could probably just change the function signature from transfer_manager: Optional[TransferManager] = None to transfer_manager: TransferManager.
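One wrinkle worth noting: in Python, a required parameter cannot follow defaulted ones unless it is made keyword-only (the bare `*`). Since the call sites in the grep output above already pass `transfer_manager=` by keyword, that restriction is harmless. A hypothetical sketch of the tightened signature (the `TransferManager` stub and return value are just for illustration):

```python
# Hypothetical sketch of the tightened hpss_put signature.
class TransferManager:  # stand-in for the real class
    pass


def hpss_put(
    hpss: str,
    file_path: str,
    cache: str,
    keep: bool = True,
    non_blocking: bool = False,
    is_index: bool = False,
    *,  # everything after this must be passed by keyword
    transfer_manager: TransferManager,  # required, no longer Optional
) -> TransferManager:
    # Real body omitted; returning the manager just for demonstration.
    return transfer_manager
```

Forgetting to pass `transfer_manager` now fails loudly with a TypeError at the call site, instead of silently constructing a fresh manager inside the function.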

some flowcharts to consider

Flowchart 1

Call hierarchy & some pseudo-code:

create.create()
  globus.globus_activate()
  create.create_database() # Put the files
    utils.create_tars_table()
    utils.get_files_to_archive_with_stats()
    hpss_utils.construct_tars()
      while we have files to archive:
        hpss_utils.TarWrapper() # Opens a new tar
        while we have files to archive AND adding a new file won't send us over the size limit:
          hpss_utils.TarWrapper.process_file() # "Stuff New Tarfile to Limit"
            hpss_utils.add_file_to_tar_archive()
        hpss_utils.TarWrapper.process_tar()
          Close the tar
          hpss_utils.hpss_put() # "Submit Tarfile for Transfer", see Flowchart 2
          Add the tar itself to the tars table
          Add the files included in this tar to the files table
    "commit/close database"
  hpss.hpss_put() # Put the index.db
  globus.globus_finalize()

Comments on Flowchart 1:

  • create_tars is really the function construct_tars.
  • I would replace "more files?" with "any files not yet added to a tar?"

Overall, Flowchart 1 looks good.
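The size-threshold loop at the heart of Flowchart 1 can be sketched as a standalone illustration (names like `construct_tars_sketch` are made up; the real code interleaves database writes and hpss_put() calls):

```python
# Standalone illustration of Flowchart 1's inner loop: pack files into
# successive tar archives, closing each tar before the next file would
# push it over the size limit.
import os
import tarfile
from typing import List


def construct_tars_sketch(file_paths: List[str], out_dir: str, size_limit: int) -> List[str]:
    tar_paths: List[str] = []
    remaining = list(file_paths)
    while remaining:  # any files not yet added to a tar?
        tar_path = os.path.join(out_dir, f"{len(tar_paths):06d}.tar")
        current_size = 0
        with tarfile.open(tar_path, "w") as tar:
            # "Stuff New Tarfile to Limit"
            while remaining and current_size + os.path.getsize(remaining[0]) <= size_limit:
                path = remaining.pop(0)
                tar.add(path, arcname=os.path.basename(path))
                current_size += os.path.getsize(path)
            if current_size == 0 and remaining:
                # Single file larger than the limit: archive it alone.
                path = remaining.pop(0)
                tar.add(path, arcname=os.path.basename(path))
        tar_paths.append(tar_path)
        # Here the real code would submit tar_path via hpss_put().
    return tar_paths
```

This also shows why the tar is never interrupted mid-formation: the threshold is checked before adding each file, and the tar is closed before submission.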

Flowchart 2a (left side)

Call hierarchy & some pseudo-code:

hpss_utils.hpss_put() # "Submit Tarfile for Transfer" from Flowchart 1
  hpss_utils.hpss_transfer()
    "ensure_transfer_manager" (if not transfer_manager)
    "ensure_transfer_manager_deletion_batch" (if not transfer_manager.batches or transfer_manager.batches[-1].task_id)
    if (not keep) and (not is_index): # We don't need to track files for deletion if we're keeping them; we never delete the index
      transfer_manager.batches[-1].file_paths.append(file_path) # "Add tarpath to latest deletion_batch"
    if scheme == "globus":
      globus.globus_transfer() # See Flowchart 2b
    if not keep:
      transfer_manager.delete_successfully_transferred_files()

Overall, Flowchart 2a looks good.
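On the earlier open question about TransferManager.delete_successfully_transferred_files(): one plausible reading, sketched with illustrative names (the actual implementation may differ), is that it walks the batches and deletes only the files in batches whose transfer reached SUCCEEDED, leaving pending or failed batches for a later pass:

```python
# Hypothetical sketch of delete_successfully_transferred_files();
# all names are illustrative, not the actual zstash code.
import os
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional


class TaskStatus(Enum):
    SUBMITTED = "SUBMITTED"
    ACTIVE = "ACTIVE"
    SUCCEEDED = "SUCCEEDED"
    FAILED = "FAILED"


@dataclass
class TransferBatch:
    file_paths: List[str] = field(default_factory=list)
    task_id: Optional[str] = None
    task_status: Optional[TaskStatus] = None


def delete_successfully_transferred_files(batches: List[TransferBatch]) -> List[str]:
    deleted: List[str] = []
    for batch in batches:
        if batch.task_status is not TaskStatus.SUCCEEDED:
            continue  # still pending, active, or failed: keep the tar files
        for path in batch.file_paths:
            if os.path.exists(path):
                os.remove(path)
            deleted.append(path)
        batch.file_paths = []  # nothing left to delete in this batch
    return deleted
```

Under this reading, the batch count stays small: at most one batch being filled, one in flight, and completed ones that have already been emptied.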

Flowchart 2b (right side)

Call hierarchy & some pseudo-code:

globus.globus_transfer() # Picking up from Flowchart 2a
  "ensure_globus_activate" (if (not transfer_manager.globus_config) or (not transfer_manager.globus_config.transfer_client):)
  mrb: Optional[TransferBatch] = transfer_manager.get_most_recent_batch()
  "ensure endpoint IDs set" (if transfer_manager.globus_config.local_endpoint:, if transfer_manager.globus_config.remote_endpoint:)
  "ensure transfer_data" (transfer_data = mrb.transfer_data or transfer_data = create_TransferData())
  add_file_to_TransferData # Adding TAR file, that is
  if mrb.task_id:
    if mrb.task_status == TaskStatus.ACTIVE:
      if non_blocking:
        return TaskStatus.ACTIVE # Go back to Flowchart 2a now
      else ERROR, actually raise Exception
    elif mrb.task_status == TaskStatus.SUCCEEDED:
      Proceed
    else ERROR, do not actually raise Exception (change to logger.warning?)
  globus.update_cumulative_tarfiles_pushed()
  globus_utils.submit_transfer_with_checks() # As far as Flowchart 2b currently goes
  if mrb.task_id:
    if not non_blocking: # I.e., IS blocking
      globus.globus_block_wait()
  if transfer_type == "get" and task_id:
    globus.globus_wait()

Comments on Flowchart 2b:

  • I would have the "globus?" arrow point to the start box. Right now, it looks like globus_transfer() is called completely independently, which is not the case.

Otherwise, Flowchart 2b looks good.

Action items for me

  • really we could probably just change the function call signature from using transfer_manager: Optional[TransferManager] = None to using transfer_manager: TransferManager.
  • (change to logger.warning?)

@forsyth2
Collaborator Author

forsyth2 commented Apr 3, 2026

The new commit addresses those 2 action items. I haven't rerun the tests, but none of these changes should affect functionality. In particular, mypy would catch the type errors.

@TonyB9000
Collaborator

@forsyth2

"I would replace "more files?" with "any files not yet added to a tar?"

Try writing that in that tiny little diamond :)

"I would have the "globus?" arrow point to the start box"

I was being creative there. By pointing to the entire "box" for "globus_transfer", it should imply invoking at "start", as there is nowhere else to start.

I am trying to figure out how to fit the rest of the logic on that diagram - I'm only barely at the point where "block-wait" would apply.

@forsyth2
Collaborator Author

forsyth2 commented Apr 3, 2026

I am trying to figure out how to fit the rest of the logic on that diagram

Thanks @TonyB9000. For our purposes, I'd say the diagrams are most useful if they facilitate the code review. If we think we understand the logic well enough as-is and the tests are passing (which they are, last I checked), then we can probably go ahead and merge the PR. (And do future refactor work in #435).

@forsyth2
Collaborator Author

forsyth2 commented Apr 3, 2026

I also have performance testing set up in #427, so we can compare performance. I would hope nothing we've changed here would impact performance, but it would be good to check. I'll try to work on performance profiling today, but I'm not sure we have enough data points from main branch for it to be totally useful.

I'm not sure when the release candidate deadline has been moved to, but I imagine it will be early next week.

@forsyth2 forsyth2 mentioned this pull request Apr 3, 2026
@forsyth2
Collaborator Author

forsyth2 commented Apr 4, 2026

I would hope nothing we've changed here would impact performance, but it would be good to check. I'll try to work on performance profiling today, but I'm not sure we have enough data points from main branch for it to be totally useful.

It appears this PR has actually largely improved performance for Globus, but slightly degraded it for no-hpss/local and hpss. Additionally, with a limited number of runs, it's unclear to me how much of the Globus variation is inherent to Globus and not because of zstash code.

Performance charts made using code from #427 comparing this branch to main:
[image: performance charts, this branch vs. main]

Performance charts made using code from #427 for this branch:
[image: performance charts, this branch only]

Performance profiling setup steps
cd ~/ez/zstash
git status
# On branch add-performance-profiling
# nothing to commit, working tree clean

# Let's make a new branch with these commits AND the tar deletion commits
git checkout -b profile-refactored-tar-deletion
git log --oneline
# Good, has the correct commits
git fetch upstream issue-374-refactor-tar-deletion
git rebase upstream/issue-374-refactor-tar-deletion
git log --oneline
# Good, has commits from both
nersc_conda
rm -rf build
conda clean --all --y
conda env create -f conda/dev.yml -n zstash-profile-tar-deletion-20260403
conda activate zstash-profile-tar-deletion-20260403
pre-commit run --all-files
python -m pip install .
cd tests/performance
emacs generate_performance_data.bash # Edit parameters
git diff # Check diff
./generate_performance_data.bash
# ~2 hours to run, note there is the manual step to paste an auth code

# [SUCCESS] All tests completed. Results saved to: /pscratch/sd/f/forsyth/zstash_performance/performance_pr416_20260403/results.csv
# [INFO] Now edit IO paths and run: python visualize_performance.py

emacs visualize_performance.py # Edit parameters
git diff # Check diff
pre-commit run --all-files
git add -A
python visualize_performance.py
# Figure 1 (overview) saved to: /global/cfs/cdirs/e3sm/www/forsyth/zstash_performance/performance_20260403_pr416_vs_pr427.png
#   Accessible at: https://portal.nersc.gov/cfs/e3sm/forsyth/zstash_performance/performance_20260403_pr416_vs_pr427.png
# Figure 2 (baseline comparison) saved to: /global/cfs/cdirs/e3sm/www/forsyth/zstash_performance/performance_20260403_pr416_vs_pr427_vs_baseline.png
#   Accessible at: https://portal.nersc.gov/cfs/e3sm/forsyth/zstash_performance/performance_20260403_pr416_vs_pr427_vs_baseline.png

@TonyB9000
Collaborator

@forsyth2 Nice work!

Fascinating result. I could be wrong (not sure how the OS, or Python, optimizes the stack) but reducing the call-depth may have a real impact as well.

@forsyth2
Collaborator Author

forsyth2 commented Apr 6, 2026

with a limited number of runs, it's unclear to me how much of the Globus variation is inherent to Globus

I might try to run this again today or tomorrow to get another data point at least.

I could be wrong (not sure how the OS, or Python, optimizes the stack) but reducing the call-depth may have a real impact as well.

ChatGPT says

Yes, in many cases that is true. Reducing call depth can improve performance, but the size of the effect depends a lot on what “call depth” means in the specific code. In Python, each function call is fairly expensive compared with many lower-level languages. A deeper call chain usually means:

  • more Python frame objects created and managed
  • more argument passing
  • more local-scope setup and teardown
  • less opportunity for the interpreter to stay in a tight simple loop

So if a hot path goes through many small helper functions, flattening that path can produce a measurable speedup. A few important distinctions:

In Python

  • The overhead is mostly from the Python interpreter and frame handling, not from the OS “optimizing the stack.”
  • Python does not do tail-call optimization.
  • Recursive or deeply layered code is often noticeably slower than equivalent iterative or less-indirect code.

What the OS does

  • The OS is usually not the main player here.
  • It provides the thread stack and memory protection, but it is not dynamically “optimizing” your function-call stack in a way that usually explains application-level performance differences.
  • For Python especially, interpreter overhead dominates.

Well, that's a bit unfortunate. I find having a bunch of helper functions to be much cleaner/easier to maintain.

@forsyth2
Collaborator Author

forsyth2 commented Apr 6, 2026

Giving it the pseudo-code above, it says:

Concrete rule of thumb
For this codebase, “reduce call depth” is worth considering only in the inner file-processing loop. Everywhere else, I would treat it as low priority unless profiling says otherwise.

What to benchmark
Profile or microbenchmark just the hot per-file path:

  • current process_file -> add_file_to_tar_archive
  • flattened single function version
  • same number of files, same file sizes
  • especially with many small files

If the flattened version barely moves runtime, then the real cost is elsewhere. If it helps, the gain will probably come from Python call overhead in that inner loop, not OS stack behavior.

@forsyth2
Collaborator Author

forsyth2 commented Apr 6, 2026

My plan is to 1) run the performance check a second time on the same exact code to get more performance data, 2) time-permitting, try flattening the call structure in inner file-processing loop and re-running performance checks. Alternatively, the second item could be done as part of #435 instead.

@TonyB9000
Collaborator

@forsyth2

I might try to run this again today or tomorrow to get another data point at least.

I think that would be valuable in general, get a sense of how much "natural variation" exists when running the same operations.

Well, that's a bit unfortunate. I find having a bunch of helper functions to be much cleaner/easier to maintain.

I agree. Pulling "common repeat" stuff into natural subroutines is easier to maintain, but incurs performance burden.

Eliminating "hpss_get()" and "hpss_put()", in favor of "hpss_transfer(..., 'get', ...)" and "hpss_transfer(..., 'put', ...)", would eliminate a stack call, but not really change maintainability. How much it might improve performance is an open question.

@TonyB9000
Collaborator

TonyB9000 commented Apr 6, 2026

@forsyth2 Curious thought. One could have a "code flattener" that takes "well-subroutined" code and rewrites it with subroutines eliminated and code repeated everywhere - "bloating" the code in volume, but just for the purpose of compiling. You would always maintain the code with subroutines intact. Why is this not the practice, if Python calls are so expensive? (Not possible with recursion, of course.)

@forsyth2
Collaborator Author

forsyth2 commented Apr 6, 2026

Hmm ChatGPT has a lot to say on that topic:


Because in Python, that kind of source-level “flattening” usually does not buy enough to justify the cost, and often does not attack the real source of overhead.

Automatic flattening of arbitrary Python is too semantically messy and usually too low-yield, but selective inlining in profiled hotspots or using a JIT/native-extension path is absolutely a real optimization strategy.

Full AI response

Because in Python, that kind of source-level “flattening” usually does not buy enough to justify the cost, and often does not attack the real source of overhead.

A few reasons.

First, manual or automatic inlining is much more valuable in compiled languages than in Python. In C/C++, when a compiler inlines a function, it can do much more than just remove the call:

  • constant propagation
  • dead-code elimination
  • register allocation across the old call boundary
  • loop optimizations
  • better memory access scheduling

In Python, eliminating a function call removes some interpreter/frame overhead, but the body is still executed by the Python interpreter one bytecode at a time. You do not suddenly unlock the same class of deep optimizations a native compiler would.

So the payoff is smaller.

Second, Python code is highly dynamic, which makes general-purpose source flattening awkward and fragile. A function call in Python is not always just “jump to fixed code”:

  • methods can be overridden
  • functions can be monkey-patched
  • decorators can change behavior
  • globals and closures can be rebound
  • descriptors and metaclasses can affect lookup
  • introspection, tracing, debugging, and profiling may depend on call boundaries

A flattener that preserves exact Python semantics in all those cases is hard to make. A flattener that ignores them would silently change behavior.

Third, flattening duplicates bytecode and usually hurts maintainability and sometimes runtime characteristics:

  • code gets larger
  • instruction locality can worsen
  • debugging gets worse
  • tracebacks become less meaningful
  • profiling becomes harder to interpret
  • generated code can become enormous if helpers are used in many places

So you trade a modest reduction in call overhead for bigger, uglier, harder-to-debug code.

Fourth, the better optimization boundary in Python is often not “inline more Python,” but “do less Python.” The common winning moves are:

  • move hot loops into C/C++/Rust extensions
  • use vectorized NumPy/Pandas operations
  • batch operations
  • reduce object churn
  • reduce attribute lookups and repeated dynamic dispatch
  • reduce I/O round-trips
  • use JIT-based runtimes where applicable

Those usually dominate any gain from flattening pure Python call structure.

Fifth, Python already has better ways to preserve abstraction without paying as much at runtime, depending on the case:

  • keep helpers outside the hottest loop
  • combine tiny wrapper functions in hot paths
  • cache repeated lookups into locals
  • use iterative tight loops instead of chains of tiny methods
  • switch the hotspot to Cython, Numba, mypyc, pybind11, etc.

That is a more targeted version of what your colleague is imagining.
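As a concrete instance of "cache repeated lookups into locals" from the list above, hoisting a bound-method lookup out of a hot loop avoids re-resolving it on every iteration (function names here are illustrative):

```python
# Illustration of caching a repeated lookup into a local variable.
def collect_slow(n):
    out = []
    for i in range(n):
        out.append(i * i)  # re-resolves the out.append attribute each iteration
    return out


def collect_fast(n):
    out = []
    append = out.append  # attribute lookup done once, bound to a local
    for i in range(n):
        append(i * i)
    return out
```

Both return identical results; the second version simply does less interpreter work per iteration, which only matters in genuinely hot loops.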

There is also a deeper point: a “code flattener” is basically trying to be a compiler optimization pass. But CPython is not a traditional optimizing compiler pipeline. It does not take your whole program, prove properties about it, inline aggressively, and emit optimized native code. Since Python remains dynamic at runtime, source rewriting has less leverage.

So the answer to “why is this not the practice?” is roughly:

  • it is hard to do safely for Python semantics
  • it often gives only modest gains
  • it makes code size/debugging worse
  • bigger wins usually come from reducing interpreter work entirely, not rearranging Python source

That said, the idea is not crazy. It does exist in neighboring forms:

  • Cython/mypyc: compile restricted or typed Python-like code to C/native extensions
  • Numba: JIT-compile certain numerical functions
  • PyPy: JIT can effectively inline and optimize hot paths dynamically
  • hand-tuned hot-loop refactors: people absolutely do flatten tiny helpers in hotspots after profiling

So the real-world practice is not “flatten all Python before compiling,” but rather:

profile, identify hotspots, and either simplify the hot Python path or move it to a runtime/compiler that can optimize it properly.

For your colleague’s idea, the closest practical statement is:

Automatic flattening of arbitrary Python is too semantically messy and usually too low-yield, but selective inlining in profiled hotspots or using a JIT/native-extension path is absolutely a real optimization strategy.

In your tar/transfer workflow, I would only consider this for the tiny per-file inner-loop helpers, not for the orchestration or I/O-heavy layers.

@TonyB9000
Collaborator

@forsyth2 Very nice summary. I'd like to know how much overhead (if any) is incurred by the "hpss_put()" type call-stack lengthening conducted only to "rename" a function (almost purely a "pass-through").

Experiment: Call simple function "X" 100 times. Then rewrite the code to call function "A" 100 times, where A() calls B(), which calls C(), which calls D() ... which calls X(). Any parameters are simply passed through. Measure the performance burden as call depth increases.
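That experiment can be sketched with the standard-library timeit module (all function names here are illustrative):

```python
# Sketch of the call-depth experiment: time a direct call to x() versus
# calls routed through a chain of pass-through wrappers A -> B -> ... -> X.
import timeit


def x(a, b):
    return a + b


def make_chain(depth):
    """Wrap x() in `depth` pass-through functions that just forward args."""
    fn = x
    for _ in range(depth):
        fn = (lambda inner: (lambda a, b: inner(a, b)))(fn)
    return fn


def bench(fn, calls=100, repeats=1000):
    """Total seconds to make `calls` calls, repeated `repeats` times."""
    return timeit.timeit(lambda: [fn(1, 2) for _ in range(calls)], number=repeats)
```

For example, `bench(make_chain(10))` should come out noticeably higher than `bench(x)` on CPython, since each extra frame adds interpreter overhead; exact ratios vary by machine and Python version.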

@forsyth2
Collaborator Author

forsyth2 commented Apr 6, 2026

I might try to run this again today or tomorrow to get another data point at least.

Ok, here's another data point.

Figure 1

Performance profile running on the exact same code as the first run:

[image: performance profile, second run]

Figure 2

Comparing those results to main (+ the performance profiling code itself, from #427):

[image: performance comparison vs. main]

It seems like again we come to the same conclusion: Globus runs are much faster and the no-HPSS and HPSS runs are a little slower. Considering that (1) I believe most users use zstash with Globus, and (2) the Globus runs are consistently the longest-running, I think it makes sense to prioritize the speedup of the Globus runs (i.e., go ahead and merge this PR).

Figure 3

Comparing those results instead to the results from the first run:

image

It appears the Globus runs are not too different in runtime. Only two runs really stand out as inconsistent between the two performance profiles:

  • zstash create using Globus on a directory with many small files: major decrease in runtime on second profiling.
  • zstash update using HPSS on a directory with a few large files: major increase in runtime on second profiling.

Setup

Setup steps
# 2026-04-06

```bash
# Running on Bebop since Chrysalis is down.

cd ~/ez/zstash
git status
# On branch profile-refactored-tar-deletion
git diff --staged | cat
# diff --git a/tests/performance/generate_performance_data.bash b/tests/performance/generate_performance_data.bash
# index 83bc913..c313bf1 100755
# --- a/tests/performance/generate_performance_data.bash
# +++ b/tests/performance/generate_performance_data.bash
# @@ -15,7 +15,7 @@ set -e
#  # Run from Perlmutter, so that we can do both
#  # a direct transfer to HPSS & a Globus transfer to Chrysalis
#  work_dir=/pscratch/sd/f/forsyth/zstash_performance/
# -unique_id=performance_20260402
# +unique_id=performance_pr416_20260403
 
#  dir_to_copy_from=/global/cfs/cdirs/e3sm/forsyth/E3SMv2/v2.LR.historical_0201/
#  subdir0=build/
# diff --git a/tests/performance/visualize_performance.py b/tests/performance/visualize_performance.py
# index 79e5f7f..32266e6 100644
# --- a/tests/performance/visualize_performance.py
# +++ b/tests/performance/visualize_performance.py
# @@ -56,17 +56,19 @@ import pandas as pd
 
#  # The results to show in Fig. 1
#  RESULTS_CSV: str = (
# -    "/pscratch/sd/f/forsyth/zstash_performance/performance_20260402/results.csv"
# +    "/pscratch/sd/f/forsyth/zstash_performance/performance_pr416_20260403/results.csv"
#  )
 
#  # The results to compare against in Fig. 2.
#  # Set to None to skip Fig. 2.
# -BASELINE_RESULTS_CSV: Optional[str] = None
# +BASELINE_RESULTS_CSV: Optional[str] = (
# +    "/pscratch/sd/f/forsyth/zstash_performance/performance_20260402/results.csv"
# +)
 
#  # Output path for the saved figures.
#  # Set to None to display interactively instead of saving.
#  OUTPUT_PATH: Optional[str] = (
# -    "/global/cfs/cdirs/e3sm/www/forsyth/zstash_performance/performance__20260402_pr427.png"
# +    "/global/cfs/cdirs/e3sm/www/forsyth/zstash_performance/performance_20260403_pr416_vs_pr427.png"
#  )
 
#  # ---------------------------------------------------------------------------
nersc_conda
conda activate zstash-pr427-performance-profile-20260402
git add -A
pre-commit run --all-files
git commit -m "Profiling 20260403"

# Edit tests/performance/generate_performance_data.bash
# unique_id=performance_pr416_20260406
python -m pip install .
cd tests/performance
git diff # Check diff
./generate_performance_data.bash
# ~2-2.5 hours to run, note there is the manual step to paste an auth code (about 5-10 minutes into run time)

# [SUCCESS] All tests completed. Results saved to: /pscratch/sd/f/forsyth/zstash_performance/performance_pr416_20260406/results.csv
# [INFO] Now edit IO paths and run: python visualize_performance.py

# First, we'll compare against `main`.
# Edit tests/performance/visualize_performance.py
# RESULTS_CSV => /pscratch/sd/f/forsyth/zstash_performance/performance_pr416_20260406/results.csv
# OUTPUT_PATH => /global/cfs/cdirs/e3sm/www/forsyth/zstash_performance/performance_20260406_pr416_vs_pr427.png
git diff # Check diff
pre-commit run --all-files
git add -A
python visualize_performance.py
# Figure 1 (overview) saved to: /global/cfs/cdirs/e3sm/www/forsyth/zstash_performance/performance_20260406_pr416_vs_pr427.png
#   Accessible at: https://portal.nersc.gov/cfs/e3sm/forsyth/zstash_performance/performance_20260406_pr416_vs_pr427.png
# Figure 2 (baseline comparison) saved to: /global/cfs/cdirs/e3sm/www/forsyth/zstash_performance/performance_20260406_pr416_vs_pr427_vs_baseline.png
#   Accessible at: https://portal.nersc.gov/cfs/e3sm/forsyth/zstash_performance/performance_20260406_pr416_vs_pr427_vs_baseline.png

# Second, we'll compare against the performance profile of the same exact code.
# Edit tests/performance/visualize_performance.py
# BASELINE_RESULTS_CSV => /pscratch/sd/f/forsyth/zstash_performance/performance_pr416_20260403/results.csv
# OUTPUT_PATH => /global/cfs/cdirs/e3sm/www/forsyth/zstash_performance/performance_20260406_pr416_vs_20260403_pr416.png
git diff # Check diff
pre-commit run --all-files
git add -A
python visualize_performance.py
# Figure 1 (overview) saved to: /global/cfs/cdirs/e3sm/www/forsyth/zstash_performance/performance_20260406_pr416_vs_20260403_pr416.png
#   Accessible at: https://portal.nersc.gov/cfs/e3sm/forsyth/zstash_performance/performance_20260406_pr416_vs_20260403_pr416.png
# Figure 2 (baseline comparison) saved to: /global/cfs/cdirs/e3sm/www/forsyth/zstash_performance/performance_20260406_pr416_vs_20260403_pr416_vs_baseline.png
#   Accessible at: https://portal.nersc.gov/cfs/e3sm/forsyth/zstash_performance/performance_20260406_pr416_vs_20260403_pr416_vs_baseline.png

```

</details>
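The session above runs `generate_performance_data.bash`, which times each zstash configuration and writes a `results.csv` for the visualization step. A minimal sketch of that kind of timing harness is below; the placeholder commands, test names, and CSV column names (`test_name`, `elapsed_seconds`) are illustrative assumptions, not zstash's actual harness or schema.

```python
# Hypothetical sketch of a timing harness in the spirit of
# generate_performance_data.bash: run each configuration, record elapsed
# wall-clock seconds, and append the results to a CSV.
# The commands and CSV layout are illustrative assumptions only.
import csv
import subprocess
import time

# Placeholder commands standing in for real zstash create/update invocations.
CONFIGS = [
    ("globus_small_files", ["python", "-c", "pass"]),
    ("hpss_large_files", ["python", "-c", "pass"]),
]


def run_and_time(configs, csv_path):
    """Run each (name, command) pair, timing it, and write a results CSV."""
    rows = []
    for name, cmd in configs:
        start = time.monotonic()
        subprocess.run(cmd, check=True)
        rows.append(
            {
                "test_name": name,
                "elapsed_seconds": round(time.monotonic() - start, 3),
            }
        )
    with open(csv_path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["test_name", "elapsed_seconds"])
        writer.writeheader()
        writer.writerows(rows)
    return rows


if __name__ == "__main__":
    rows = run_and_time(CONFIGS, "results.csv")
    print([r["test_name"] for r in rows])
```

The real script also handles the Globus auth-code step noted above, which is why a full run takes hours rather than seconds.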

@forsyth2

forsyth2 commented Apr 6, 2026

It seems we again come to the same conclusion: the Globus runs are much faster, and the no-HPSS and HPSS runs are a little slower. Considering that (1) I believe most users use zstash with Globus, and (2) the Globus runs are consistently the longest-running, I think it makes sense to prioritize speeding up the Globus runs (i.e., go ahead and merge this PR).

@TonyB9000 Do you have any comments on the plots above? I think this PR should be good to merge. (And we can do more code cleanup in #435 of course).
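The Fig. 2 baseline comparisons discussed above boil down to per-test percent changes between two results CSVs. A minimal sketch of that comparison is below; the column names (`test_name`, `elapsed_seconds`) and the inline sample data are hypothetical, not zstash's actual output format.

```python
# Hypothetical sketch: compare a profiling results CSV against a baseline,
# mirroring the Fig. 2 "baseline comparison" step described above.
# Column names and sample values are assumptions, not zstash's actual schema.
import csv
import io

CURRENT_CSV = """test_name,elapsed_seconds
globus_small_files,120.0
hpss_large_files,300.0
"""

BASELINE_CSV = """test_name,elapsed_seconds
globus_small_files,200.0
hpss_large_files,280.0
"""


def load(text):
    """Parse a results CSV into {test_name: elapsed_seconds}."""
    return {
        row["test_name"]: float(row["elapsed_seconds"])
        for row in csv.DictReader(io.StringIO(text))
    }


def percent_change(current, baseline):
    """Percent change per test; positive = slower than baseline, negative = faster."""
    return {
        name: 100.0 * (current[name] - baseline[name]) / baseline[name]
        for name in current
        if name in baseline
    }


changes = percent_change(load(CURRENT_CSV), load(BASELINE_CSV))
for name, pct in sorted(changes.items()):
    print(f"{name}: {pct:+.1f}%")
# → globus_small_files: -40.0%
# → hpss_large_files: +7.1%
```

A table of these percent changes makes the "Globus much faster, HPSS a little slower" tradeoff easy to eyeball without reading the full figures.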

@forsyth2

forsyth2 commented Apr 6, 2026

  • I believe I ran the most important tests on the second-to-last commit.
  • The last commit, as noted here, shouldn't impact functionality.
  • I believe the changes made since the last complete run of the test suite are unlikely to affect the tests that weren't re-run.
  • I've now done two performance profiling runs: profile 1 & profile 2
  • After merging this, and before making rc1, I will run the full test suite on the main branch.

So, I will merge this now.

@forsyth2 forsyth2 merged commit 33f379d into main Apr 6, 2026
5 checks passed
@forsyth2 forsyth2 deleted the issue-374-refactor-tar-deletion branch April 6, 2026 22:03
@forsyth2 forsyth2 mentioned this pull request Apr 6, 2026

Labels

Globus · semver: bug (bug fix; will increment patch version)


Development

Successfully merging this pull request may close these issues.

[Bug]: tar files are not deleted after successful globus transfer

4 participants