
Conversation

@forsyth2
Collaborator

Summary

Objectives:

  • Resolve token timeout problem.

Issue resolution:

From #339:

Note that this transfer is limited to 48 hours due to Globus token expiration. Given the relatively slow transfer speeds between chrysalis and NERSC HPSS (~100 MB/s), 48 hours is often insufficient for a large simulation.
(6) Using Globus web interface, manually transfer zstash files that were not transferred due to token expiration.

This PR removes the need for that manual step (6): a single zstash call should be able to run past the 48-hour token lifetime.

Select one: This pull request is...

  • a bug fix: increment the patch version
  • a small improvement: increment the minor version
  • a new feature: increment the minor version
  • an incompatible (non-backwards compatible) API change: increment the major version

Small Change

  • To merge, I will use "Squash and merge". That is, this change should be a single commit.
  • Logic: I have visually inspected the entire pull request myself.
  • Pre-commit checks: All the pre-commit checks have passed.

@forsyth2 forsyth2 self-assigned this Nov 25, 2025
@forsyth2 forsyth2 added the "semver: bug" (Bug fix, will increment patch version) and "Globus" labels Nov 25, 2025
@forsyth2
Collaborator Author

forsyth2 commented Nov 25, 2025

Remaining TODO

  • Review tests that Claude wrote.
  • Better organize tests into existing test subdirectory structure.
  • Do a real test of a transfer taking 48+ hours.

@forsyth2 forsyth2 force-pushed the improve-globus-refresh branch 3 times, most recently from 0e31048 to 735fc3d on November 26, 2025 00:16
@forsyth2
Collaborator Author

forsyth2 commented Dec 2, 2025

@chengzhuzhang Just a status update: I used Claude to get some prototypes set up for the 4 components of the Globus integration improvements. I've ranked them by decreasing order of importance, as I understand it, here:

  1. Fix token timeout (i.e., remove step 6 from the cumbersome steps listed in #339, "zstash Globus functionality has become overly cumbersome") -- this PR
  2. Delete tar files when --non-blocking is set -- #405 ("Delete transferred files") is the actual fix; #404 ("Add test for tar deletion") is a test that should fail on current main but pass with the fix.
  3. Better handle the token file (i.e., store multiple tokens, allow the user to specify a token file) -- #408 ("[Feature]: Better handle the Globus token file")
  4. Support Globus 4.0 -- #406 ("Support Globus 4.0")

I anticipate many merge/rebase conflicts as each of these PRs goes in. Therefore, my plan is to merge them in the above order, ensuring we get the most important pieces merged first. With that in mind, I've begun testing this PR as the highest-priority one, using a large transfer that should hopefully take longer than 48 hours.

Initial testing setup
cd ~/ez/zstash
git status
# branch issue-398-token-file
# nothing to commit, working tree clean
git checkout improve-globus-refresh
conda env list # Get name of environment to reuse
conda activate zstash_globus_refresh
pre-commit run --all-files # Optional; just makes sure the files are looking clean.
python -m pip install .

According to #339:

Note that this transfer is limited to 48 hours due to Globus token expiration. Given the relatively slow transfer speeds between chrysalis and NERSC HPSS (~100 MB/s), 48 hours is often insufficient for a large simulation.

That means we need to transfer at least 100 MB/sec * 60 sec/min * 60 min/hour * 48 hours = 17,280,000 MB = 17,280 GB = 17.28 TB

In #391 (reply in thread),
Tony ran the following zstash check command interfacing with the NERSC HPSS endpoint:

zstash check command: zstash check -v --keep --cache archives --hpss=globus://9cd89cfd-6d04-11e5-ba46-22000b92c6ec//home/g/golaz/E3SMv3.LR/v3.LR.piControl

which ran 6 TB (out of 70) over 14 hours, for a transfer rate of 6/14=0.43 TB/hour. 0.43 TB/hour * 48 hours = 20.64 TB.

So, in order to trigger the test condition, we need to transfer upwards of 17.28-20.64 TB.
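For reference, this threshold can be reproduced with a few lines of Python (the rates are the figures quoted above, not values measured by this snippet):

# Minimum data volume needed for a 48+ hour transfer, from the quoted rates.
HOURS_NEEDED = 48

nominal_rate_mb_s = 100                                      # ~100 MB/s, per #339
nominal_tb = nominal_rate_mb_s * 3600 * HOURS_NEEDED / 1e6   # MB -> TB
print(f"Nominal-rate estimate: {nominal_tb:.2f} TB")         # 17.28 TB

observed_rate_tb_h = round(6 / 14, 2)                        # ~0.43 TB/hour, per #391
print(f"Observed-rate estimate: {observed_rate_tb_h * HOURS_NEEDED:.2f} TB")  # 20.64 TB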

Let's check if I even have space to store/transfer that much data:

Chrysalis:

lcrc-quota
# 38 GB available on /home/ac.forsyth2/
/usr/lpp/mmfs/bin/mmlsquota -u ac.forsyth2 --block-size T fs2
# 300 - 70 = 230 TB available

Perlmutter:

showquota --hpss
# 40 - 15.17 GiB = 24.83 GiB available on home
# 20 TiB - 125.92 GiB = 19.87408 TiB available on pscratch
# 2 PiB - 1019.11 TiB = 0.98089 PiB available on HPSS = 980.89 TiB

Now, let's see if I have any datasets of the required size. The most likely match would be the dataset used for zppy's integration tests:

Chrysalis:

cat ~/ez/zppy/tests/integration/utils.py
        # "user_input_v2": "/lcrc/group/e3sm/ac.forsyth2/",
        # "user_input_v3": "/lcrc/group/e3sm2/ac.wlin/",
ls /lcrc/group/e3sm/ac.forsyth2/E3SMv2/v2.LR.historical_0201
du -sh /lcrc/group/e3sm/ac.forsyth2/E3SMv2/v2.LR.historical_0201
# 24T	/lcrc/group/e3sm/ac.forsyth2/E3SMv2/v2.LR.historical_0201
ls /lcrc/group/e3sm2/ac.wlin/E3SMv3/v3.LR.historical_0051
du -sh /lcrc/group/e3sm2/ac.wlin/E3SMv3/v3.LR.historical_0051
# 21T	/lcrc/group/e3sm2/ac.wlin/E3SMv3/v3.LR.historical_0051

Either of these is, in theory, large enough to surpass 48 hours of transfer time. Let's try the larger one. I only have 19 TiB = 20.89 TB available on pscratch, so we'll have to transfer the data to NERSC HPSS, where I have enough space.

Perlmutter:

hsi
ls
pwd
# pwd0: /home/f/forsyth
mkdir zstash_48_hour_run_test20251201
exit

Let's try the following tests:

  • zstash create --non-blocking. We'll start with this one.
  • zstash create (i.e., blocking)
  • zstash check

Chrysalis:

cd /lcrc/group/e3sm/ac.forsyth2/
mkdir zstash_48_hour_run_test20251201
cd zstash_48_hour_run_test20251201
mkdir cache

# Start fresh
# Go to https://app.globus.org/file-manager?two_pane=true > For "Collection", choose: LCRC Improv DTN, NERSC HPSS
rm -rf ~/.zstash.ini
rm -rf ~/.zstash_globus_tokens.json
# https://auth.globus.org/v2/web/consents > Globus Endpoint Performance Monitoring > rescind all

# Start a screen session, 
# so the transfer will continue even if the connection is interrupted:
screen 
screen -ls
# There is a screen on:
#         2719818.pts-7.chrlogin2 (Attached)
# 1 Socket in /run/screen/S-ac.forsyth2
pwd
# Good, /lcrc/group/e3sm/ac.forsyth2/zstash_48_hour_run_test20251201

# Re-activate the conda environment:
lcrc_conda
conda activate zstash_globus_refresh
# NERSC  HPSS endpoint: 9cd89cfd-6d04-11e5-ba46-22000b92c6ec
time zstash create --non-blocking --hpss=globus://9cd89cfd-6d04-11e5-ba46-22000b92c6ec//home/f/forsyth/zstash_48_hour_run_test20251201_try3 --cache=/lcrc/group/e3sm/ac.forsyth2/zstash_48_hour_run_test20251201/cache /lcrc/group/e3sm/ac.forsyth2/E3SMv2/v2.LR.historical_0201 2>&1 | tee pr407_create_non_blocking.log
# CTRL A D

screen -ls
# There is a screen on:
# 	2719818.pts-7.chrlogin2	(Detached)
# 1 Socket in /run/screen/S-ac.forsyth2.

# Mon 2025-12-01 17:34
# Check back Wed 2025-12-03 17:34

# Initial checkin
screen -R # 18:03
# Good, still running after ~30 minutes
# CTRL A D

@forsyth2
Collaborator Author

forsyth2 commented Dec 2, 2025

# Tue 2025-12-02 14:11 checkin
hostname
# chrlogin2.lcrc.anl.gov
screen -ls
# There is a screen on:
# 	2719818.pts-7.chrlogin2	(Detached)
# 1 Socket in /run/screen/S-ac.forsyth2.
screen -R
# Good, still running
# CTRL A D

Ran into OSError: [Errno 28] No space left on device while testing E3SM-Project/zppy#757

lcrc-quota
# ----------------------------------------------------------------------------------------
# Home                          Current Usage   Space Avail    Quota Limit    Grace Time
# ----------------------------------------------------------------------------------------
# ac.forsyth2                        61 GB          38 GB         100 GB               
# ----------------------------------------------------------------------------------------
# Project                       Current Usage   Space Avail    Quota Limit    Grace Time
# ----------------------------------------------------------------------------------------
/usr/lpp/mmfs/bin/mmlsquota -u ac.forsyth2 --block-size T fs2
# /usr/lpp/mmfs/bin/mmlsquota -u ac.forsyth2 --block-size T fs2
#                          Block Limits                                               |     File Limits
# Filesystem Fileset    type             TB      quota      limit   in_doubt    grace |    files   quota    limit in_doubt    grace  Remarks
# fs2        root       USR              84        300        300          1     none |  5986737       0        0      148     none lcrcstg.lcrc.anl.gov

Neither of these indicates any problems with disk space though...
We started with 70 TB used on /lcrc/group/e3sm/ac.forsyth2/, so we've added 14 TB so far.

Turns out the whole project is out of disk space on /lcrc/group/e3sm.

screen -R gives:

  File "/gpfs/fs1/home/ac.forsyth2/miniforge3/envs/zstash_globus_refresh/lib/python3.13/tarfile.py", line 2287, in addfile
    self.fileobj.write(buf)
    ~~~~~~~~~~~~~~~~~~^^^^^
  File "/gpfs/fs1/home/ac.forsyth2/miniforge3/envs/zstash_globus_refresh/lib/python3.13/site-packages/zstash/hpss_utils.py", line 35, in write
    self.f.write(s)
    ~~~~~~~~~~~~^^^
OSError: [Errno 28] No space left on device
ERROR: Archiving zstash/00002d.tar
Traceback (most recent call last):
  File "/gpfs/fs1/home/ac.forsyth2/miniforge3/envs/zstash_globus_refresh/bin/zstash", line 7, in <module>
    sys.exit(main())
             ~~~~^^
  File "/gpfs/fs1/home/ac.forsyth2/miniforge3/envs/zstash_globus_refresh/lib/python3.13/site-packages/zstash/main.py", line 65, in main
    create()
    ~~~~~~^^
  File "/gpfs/fs1/home/ac.forsyth2/miniforge3/envs/zstash_globus_refresh/lib/python3.13/site-packages/zstash/create.py", line 91, in create
    failures: List[str] = create_database(cache, args)
                          ~~~~~~~~~~~~~~~^^^^^^^^^^^^^
  File "/gpfs/fs1/home/ac.forsyth2/miniforge3/envs/zstash_globus_refresh/lib/python3.13/site-packages/zstash/create.py", line 287, in create_database
    failures = add_files(
        cur,
    ...<10 lines>...
        force_database_corruption=args.for_developers_force_database_corruption,
    )
  File "/gpfs/fs1/home/ac.forsyth2/miniforge3/envs/zstash_globus_refresh/lib/python3.13/site-packages/zstash/hpss_utils.py", line 143, in add_files  
    tar.close()
    ~~~~~~~~~^^
  File "/gpfs/fs1/home/ac.forsyth2/miniforge3/envs/zstash_globus_refresh/lib/python3.13/tarfile.py", line 2042, in close
    self.fileobj.write(NUL * (BLOCKSIZE * 2))
    ~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^
  File "/gpfs/fs1/home/ac.forsyth2/miniforge3/envs/zstash_globus_refresh/lib/python3.13/site-packages/zstash/hpss_utils.py", line 35, in write
    self.f.write(s)
    ~~~~~~~~~~~~^^^
OSError: [Errno 28] No space left on device
Exception ignored in: <_io.BufferedWriter name='/lcrc/group/e3sm/ac.forsyth2/zstash_48_hour_run_test20251201/cache/000035.tar'>
OSError: [Errno 28] No space left on device

real    1285m24.695s
user    774m24.926s
sys     90m6.002s

So, that means after 1285/60=21.42 hours and transferring ~14 out of the planned 24 TB, we've hit a disk space limit.

@forsyth2
Collaborator Author

forsyth2 commented Dec 2, 2025

@chengzhuzhang @TonyB9000 I'm trying to test this PR by running a 48+ hour transfer. I made careful note of my own space allowances, but it appears testing still hit a project-wide cap. Is it possible to test a 48+ hour transfer on less data? (I made calculations above re: how much data would need to be transferred). I guess this is where having this feature would be useful:

Delete tar files when --non-blocking is set

Perhaps it would be better to test/merge that PR first?

@chengzhuzhang
Collaborator

@forsyth2 When testing this, we should try not to use LCRC to receive data, since the disk storage is running low.
Another possibility for testing this: the v3.HR group needs to archive a large amount of data to NERSC HPSS. If you can set up a zstash env that can be easily activated, we can ask @jonbob, @wlin7, or @xuezhengllnl to help test this zstash update.

@forsyth2
Collaborator Author

forsyth2 commented Dec 2, 2025

Ok, sure, I can write instructions on how to test this, and we can see if any of those runs still hit the token problem.

@forsyth2
Collaborator Author

forsyth2 commented Dec 3, 2025

@chengzhuzhang If the project is close to full utilization of /lcrc/group/e3sm/, then how will it help if anyone else runs the test? Won't they still hit the project-wide cap? I suppose they might have higher space allowances on Perlmutter to do a data transfer from there...

@forsyth2
Collaborator Author

forsyth2 commented Dec 3, 2025

@chengzhuzhang I'm thinking it would make more sense to resolve "Delete tar files when --non-blocking is set" (#405) first. Then, tars won't stack up in the cache, taking up space.

@chengzhuzhang
Collaborator

@chengzhuzhang If the project is close to full utilization of /lcrc/group/e3sm/, then how will it help if anyone else runs the test? Won't they still hit the project-wide cap? I suppose they might have higher space allowances on Perlmutter to do a data transfer from there...

There is also another space, e3sm2, as well as a scratch space that folks can leverage. My point is that it would be ideal to test this in a real use case while avoiding the need to occupy additional disk space with duplicate testing data.

@TonyB9000
Collaborator

@chengzhuzhang @forsyth2 Right now, e3sm is choking:

DIR = /lcrc/group/e3sm2:
              Space           Inodes
    Total:     3056 TB      50003968
     Used:     2913 TB      20520074
     Free:      142 TB      29483894

DIR = /lcrc/group/e3sm:
              Space           Inodes
    Total:     3056 TB     157224960
     Used:     3040 TB     151716938
     Free:       15 TB       5508022

Soon, e3sm2 will be tight as well, as I (a) generate more v3 LE CMIP6 data and (b) fetch the NERSC v3 LE ocean data. Good thing I can work right now using Wuyin's local atmos data.

@forsyth2
Collaborator Author

forsyth2 commented Dec 3, 2025

Hi @jonbob, @wlin7, @xuezhengllnl would any of you be able to test this pull request (PR)?

What does this PR do?

In #339, Chris Golaz noted that the current zstash-Globus integration involves the cumbersome step of:

this transfer is limited to 48 hours due to Globus token expiration. Given the relatively slow transfer speeds between chrysalis and NERSC HPSS (~100 MB/s), 48 hours is often insufficient for a large simulation.
(6) Using Globus web interface, manually transfer zstash files that were not transferred due to token expiration.

This PR is meant to resolve that issue. That is, a zstash call should be able to surpass 48 hours in runtime.

What needs to be tested?

Despite working with Claude to write some mock tests, we still need to test the real thing. That is, we need to do a 48+ hour transfer.

Using "transfer speeds between chrysalis and NERSC HPSS (~100 MB/s)" as a baseline, a proper test would have to transfer upwards of 100 MB/s * 60 s/m * 60 m/h * 48 h = 17,280,000 MB = 17,280 GB = 17.28 TB.

We can also refer to #391 (reply in thread), where Tony ran the following zstash check command interfacing with the NERSC HPSS endpoint:

zstash check command: zstash check -v --keep --cache archives --hpss=globus://9cd89cfd-6d04-11e5-ba46-22000b92c6ec//home/g/golaz/E3SMv3.LR/v3.LR.piControl

That transferred 6 TB (out of 70) over 14 hours, giving a transfer rate of 6/14=0.43 TB/hour. Now, 0.43 TB/hour * 48 hours = 20.64 TB.

So, in order to trigger the test condition, we need to transfer upwards of 17.28-20.64 TB.

Why can't I test it?

Space limitations

As of 12/02 afternoon:

Chrysalis:

lcrc-quota
# 38 GB available on /home/ac.forsyth2/
/usr/lpp/mmfs/bin/mmlsquota -u ac.forsyth2 --block-size T fs2
# 300 - 71 = 229 TB available on /lcrc/group/e3sm/ac.forsyth2/

Perlmutter:

showquota --hpss
# 40 - 15.17 GiB = 24.83 GiB available on home (/global/homes/f/forsyth)
# 20 TiB - 125.92 GiB = 19.87408 TiB available on pscratch (/pscratch/sd/f/forsyth)
# 2 PiB - 1022.60 TiB = 0.9774 PiB available on HPSS = 977.40 TiB

I don't have enough space for 20+ TB on LCRC's /home/ac.forsyth2/ or on NERSC's /global/homes/f/forsyth or /pscratch/sd/f/forsyth. That only leaves the option of transferring from LCRC's /lcrc/group/e3sm/ac.forsyth2/ to NERSC HPSS's /home/f/forsyth.

Unfortunately, the LCRC space is already close to project-wide full utilization, so this last option still won't work.

How you can test this PR

Set up git

# If you don't already have a zstash repo, get it with:
git clone git@github.com:E3SM-Project/zstash.git

cd zstash # Enter zstash repo
git status
# Make sure the output includes: nothing to commit, working tree clean
# If not, you'll need to clean up the git workspace first.
git remote -v
# You should see:
# origin	git@github.com:E3SM-Project/zstash.git (fetch)
# origin	git@github.com:E3SM-Project/zstash.git (push)
git fetch origin improve-globus-refresh
git checkout -b improve-globus-refresh origin/improve-globus-refresh # Checkout this PR's branch

Set up conda

# Activate conda
# I have a function in ~/.bashrc to do this. You probably have a different method.

# Set up conda environment
rm -rf build
conda clean --all --y
conda env create -f conda/dev.yml -n test-zstash-improve-globus-refresh
conda activate test-zstash-improve-globus-refresh
pre-commit run --all-files # Optional, but ensures the files are formatted & linted.
python -m pip install . # Install the code of this branch into the newly created environment

Commands to test

I'm interested in testing the following cases. Perhaps not all of them, but at least a few. Are any of these commands ones you'd need to run soon anyway?

Data locations

You can transfer data from anywhere to anywhere, as long as you're transferring the necessary minimum amount of data (~21 TB), using Globus.

The cache, not the directory being archived, is what will consume disk space. Since /lcrc/group/e3sm/ has space limitations, consider using /lcrc/group/e3sm2/ or /gpfs/fs0/globalscratch.

Endpoints

Here are the relevant Globus endpoints:

LCRC_IMPROV_DTN_ENDPOINT=15288284-7006-4041-ba1a-6b52501e49f1
NERSC_PERLMUTTER_ENDPOINT=6bdc7956-fc0f-4ad2-989c-7aa5ee643a79
NERSC_HPSS_ENDPOINT=9cd89cfd-6d04-11e5-ba46-22000b92c6ec
PIC_COMPY_DTN_ENDPOINT=68fbd2fa-83d7-11e9-8e63-029d279f7e24
GLOBUS_TUTORIAL_COLLECTION_1_ENDPOINT=6c54cade-bde5-45c1-bdea-f4bd71dba2cc

Running a test

It may be best to start fresh with everything for Globus, but in theory this is not fully necessary:

# Activate Globus endpoints
# Go to https://app.globus.org/file-manager?two_pane=true > For "Collection", 
# choose the appropriate src endpoint and dst endpoint. 
# Authenticate if needed.

# Start with fresh token files
rm -rf ~/.zstash.ini
rm -rf ~/.zstash_globus_tokens.json

# Start with no Globus consents
# https://auth.globus.org/v2/web/consents > Globus Endpoint Performance Monitoring > rescind all

Now, you can use the following code block as an example to run:

# Set up a screen because this command will likely not finish before you lose SSH connection.
screen 
screen -ls # See what login node this screen is attached to

# Screen doesn't seem to keep conda environment information, so we need to redo that part:

# Activate conda
# I have a function in ~/.bashrc to do this. You probably have a different method.

conda activate test-zstash-improve-globus-refresh

# Example command: create, non-blocking, non-keep (but with bug of issue 374)
# Replace: dst_endpoint, path_to_dst_dir, path_to_cache, dir_you_want_to_archive
time zstash create --non-blocking --hpss=globus://dst_endpoint//path_to_dst_dir --cache=path_to_cache/cache dir_you_want_to_archive 2>&1 | tee pr407_create_non_blocking.log

# Wait until you get the auth code prompt.
# Once you've entered that, you should be able to exit the screen.

# Exit the screen with:
# CTRL A D

screen -R # Resume the screen

@forsyth2
Collaborator Author

forsyth2 commented Dec 3, 2025

@chengzhuzhang Given @TonyB9000's comment:

/gpfs/fs0/globalscratch is definitely the place to conduct volume testing

DIR = /gpfs/fs0/globalscratch:
              Space           Inodes
    Total:     9077 TB    8897695744
     Used:     5215 TB     970452531
     Free:     3862 TB    7927243213

I think it makes the most sense for me to test this myself using /gpfs/fs0/globalscratch for the cache. I think that's a better idea than trying to have others do production runs as the initial testing of this PR (as in the above comment).

It does appear I have write-access there:

cd /gpfs/fs0/globalscratch
mkdir ac.forsyth2

@chengzhuzhang
Collaborator

/gpfs/fs0/globalscratch is definitely the place to conduct volume testing

I think it makes the most sense for me to test this myself using /gpfs/fs0/globalscratch for the cache. I think that's a better idea than trying to have others do production runs as the initial testing of this PR (as in the above comment).

I agree!

@forsyth2
Collaborator Author

forsyth2 commented Dec 3, 2025

Sounds good! @jonbob, @wlin7, @xuezhengllnl -- in this case, sorry for the early tagging; please ignore for now. We may ask for production testing in the future once we've done initial testing.

@forsyth2
Collaborator Author

forsyth2 commented Dec 4, 2025

Test 2 setup
cd ~/ez/zstash
git status
# On branch issue-374-tar-deletion-rebased20251124
# nothing to commit, working tree clean
git checkout improve-globus-refresh
git fetch upstream main
git rebase upstream/main
git log
# Commit history looks right
lcrc_conda
rm -rf build
conda clean --all --y
conda env create -f conda/dev.yml -n test-pr407-token-timeout-20251203
conda activate test-pr407-token-timeout-20251203
pre-commit run --all-files # Optional, but ensures the files are formatted & linted.
python -m pip install . # Install the code of this branch into the newly created environment

# Activate Globus endpoints
# Go to https://app.globus.org/file-manager?two_pane=true > For "Collection", choose LCRC Improv DTN
# Didn't need to re-authenticate.

# Start with fresh token files
rm -rf ~/.zstash.ini
rm -rf ~/.zstash_globus_tokens.json

# Start with no Globus consents
# https://auth.globus.org/v2/web/consents > Globus Endpoint Performance Monitoring > rescind all

# Set up a screen because this command will likely not finish before you lose SSH connection.
screen 
screen -ls # See what login node this screen is attached to
# There is a screen on:
#         4129470.pts-16.chrlogin1        (Attached)
# 1 Socket in /run/screen/S-ac.forsyth2.

# Screen doesn't seem to keep conda environment information, so we need to redo that part:
lcrc_conda
conda activate test-pr407-token-timeout-20251203

mkdir -p /gpfs/fs0/globalscratch/ac.forsyth2/pr407_token_timeout_20251203_cache
mkdir -p /gpfs/fs0/globalscratch/ac.forsyth2/pr407_token_timeout_20251203_dst_dir
mkdir -p /gpfs/fs0/globalscratch/ac.forsyth2/pr407_token_timeout_20251203_run_dir
cd /gpfs/fs0/globalscratch/ac.forsyth2/pr407_token_timeout_20251203_run_dir

# Test 1: create, non-blocking, non-keep (but with the bug of issue 374)
# LCRC_IMPROV_DTN_ENDPOINT=15288284-7006-4041-ba1a-6b52501e49f1
time zstash create --non-blocking --hpss=globus://15288284-7006-4041-ba1a-6b52501e49f1///gpfs/fs0/globalscratch/ac.forsyth2/pr407_token_timeout_20251203_dst_dir --cache=/gpfs/fs0/globalscratch/ac.forsyth2/pr407_token_timeout_20251203_cache /lcrc/group/e3sm/ac.forsyth2/E3SMv2/v2.LR.historical_0201 2>&1 | tee pr407_20251203_create_non_blocking.log

# Enter auth code

# Exit the screen with:
# CTRL A D

screen -ls
# There is a screen on:
# 	4129470.pts-16.chrlogin1	(Detached)
# 1 Socket in /run/screen/S-ac.forsyth2.

# RUNNING as of Wednesday 2025-12-03 17:57 (48 hours => Fri 17:57)

# Initial checkin
screen -R # Good, still going
# CTRL A D

# 18:21
ls /gpfs/fs0/globalscratch/ac.forsyth2/pr407_token_timeout_20251203_cache
# 000000.tar  000001.tar  index.db

@chengzhuzhang I had mentioned previously we could try running Chrysalis-to-Chrysalis, but it appears the transfer speed is higher, which means we'd need a larger dataset. I think a transfer that would have taken 48+ hours Chrysalis-NERSC would only take 15 hours for Chrysalis-Chrysalis. By my rough calculations, instead of 24 TB I'd need to transfer as much as 76.8 TB of data. And I don't have anything that large to transfer.

Calculation details

To avoid a long du run, we can do some back-of-the-envelope math:

When I ran ls on the previous cache, it had 5 x 10 tars in the output = 50 tars.
When I deleted that cache, it freed up 13 TB of data.

13 TB / 50 tars = 0.26 TB per tar

We had run this previously:

> du -sh /lcrc/group/e3sm/ac.forsyth2/E3SMv2/v2.LR.historical_0201
24T	/lcrc/group/e3sm/ac.forsyth2/E3SMv2/v2.LR.historical_0201

To transfer the entire thing would take:

24 TB/0.26 TB/tar = 92 tars

We transferred 2 tars in 20 minutes as part of this latest Chrysalis-to-Chrysalis transfer.
That's roughly 10 minutes per tar. (Although it's unclear if 000001.tar is finished).

10 min/tar * 92 tar = 920 min * 1 hr/60 min = 15 hours

So, how much data would need to be transferred then to hit the 48-hour limit on a Chrysalis-Chrysalis transfer?
48 hours/15 hours=3.2, so we'd need 3.2x data => 24 TB x 3.2=76.8 TB of data.
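The same back-of-the-envelope math as a short Python sketch (every input is one of the rough observations above, not a new measurement):

# Rough scaling for a Chrysalis-to-Chrysalis transfer, using the observations above.
tb_per_tar = 13 / 50                              # ~13 TB freed across ~50 tars -> 0.26 TB/tar
tars_needed = round(24 / tb_per_tar)              # 24 TB dataset -> ~92 tars
hours_for_dataset = round(tars_needed * 10 / 60)  # ~10 min/tar -> ~15 hours
data_needed_tb = 24 * 48 / hours_for_dataset      # scale up to reach the 48-hour mark
print(tars_needed, hours_for_dataset, round(data_needed_tb, 1))  # 92 15 76.8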

Notes:

  1. The third tar, 000002.tar, has appeared under ls /gpfs/fs0/globalscratch/ac.forsyth2/pr407_token_timeout_20251203_cache as of 12/03 18:36, 40 minutes after the initial launch. That implies the rate might be closer to 20 min/tar, but even then that only increases the transfer time to 30 hours, which still isn't enough to cross the 48-hour threshold.
  2. I cancelled the run at 47 min of runtime.
  3. ls /gpfs/fs0/globalscratch/ac.forsyth2/pr407_token_timeout_20251203_dst_dir returns nothing, which implies data isn't even being transferred to the dst_dir

Now, as of yesterday (12/02) afternoon:

> showquota --hpss
2 PiB - 1022.60 TiB = 0.9774 PiB available on HPSS = 977.40 TiB

So, I certainly have room on NERSC HPSS to do the Chrysalis-NERSC Transfer of 24 TB to hit the 48-hour timeout.

But my question is does NERSC HPSS have another project-wide cap we're close to hitting?

If not, I'm going to test by transferring 24 TB from Chrysalis to NERSC HPSS, using /gpfs/fs0/globalscratch/ as my cache on Chrysalis.

@forsyth2
Collaborator Author

forsyth2 commented Dec 4, 2025

And I don't have anything that large to transfer.

I could, however, transfer data from another user's directory, as long as I make my own cache on /gpfs/fs0/globalscratch/.

@chengzhuzhang
Collaborator

@forsyth2 It would be nice to add a summary of the code changes in the PR description; for example:

What the PR does to prevent token timeouts:

  • Proactively refreshes Globus tokens before long operations, and periodically during long waits, by calling endpoint_autoactivate on both endpoints.
  • Detects when stored access tokens are near expiration when loading tokens and logs a warning.
  • Adds tests that assert the refresh calls happen (and that expiration detection logs a warning).
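For readers unfamiliar with that pattern, here is a minimal sketch of it using globus_sdk. This is not the PR's actual code; the client ID, the token-file layout, and the warning threshold are placeholders.

import time
import globus_sdk

CLIENT_ID = "YOUR-NATIVE-APP-CLIENT-ID"  # placeholder, not zstash's real client ID

def build_transfer_client(tokens: dict) -> globus_sdk.TransferClient:
    """Build a TransferClient whose access token is refreshed automatically from a refresh token."""
    transfer_tokens = tokens["transfer.api.globus.org"]  # assumed token-file layout
    # Warn if the stored access token is already near expiration.
    if transfer_tokens["expires_at_seconds"] - time.time() < 600:
        print("WARNING: stored access token is near expiration; it will be refreshed on use")
    auth_client = globus_sdk.NativeAppAuthClient(CLIENT_ID)
    authorizer = globus_sdk.RefreshTokenAuthorizer(
        transfer_tokens["refresh_token"],
        auth_client,
        access_token=transfer_tokens["access_token"],
        expires_at=transfer_tokens["expires_at_seconds"],
    )
    return globus_sdk.TransferClient(authorizer=authorizer)

def keep_endpoints_active(tc: globus_sdk.TransferClient, src_endpoint: str, dst_endpoint: str) -> None:
    """Proactively (re)activate both endpoints before a long operation or during a long wait."""
    for endpoint_id in (src_endpoint, dst_endpoint):
        tc.endpoint_autoactivate(endpoint_id)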

@forsyth2
Collaborator Author

forsyth2 commented Dec 4, 2025

it would be nice to add summary of code change in PR description

@chengzhuzhang Ok, I can do that going forward on this PR and others, but it makes more sense to add that once the PR is actually in code review. Until that point, the implementation could change quite a bit.

I think a transfer that would have taken 48+ hours Chrysalis-NERSC would only take 15 hours for Chrysalis-Chrysalis.

I had been wanting to test the code as-is, but it occurs to me it might be possible to put a 2-day sleep/pause in the code for the sole purpose of testing. Then we don't actually have to transfer much data at all. I'm looking into the best way to do that.
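The general idea would be something like the following (a sketch only, not the final code; the real change ends up behind a DEBUG_LONG_TRANSFER flag in globus.py, as used in the test later in this thread):

import logging
import time

logger = logging.getLogger(__name__)

# Debug-only switch: sleep long enough inside the transfer path for the access
# token to expire, so token-refresh handling can be exercised without moving
# terabytes of data.
DEBUG_LONG_TRANSFER: bool = False  # set to True only when testing token expiration handling

def maybe_simulate_long_transfer() -> None:
    if DEBUG_LONG_TRANSFER:
        logger.info("TESTING: Sleeping for 49 hours to let access token expire")
        time.sleep(49 * 60 * 60)
        logger.info(
            "TESTING: Woke up after 49 hours. Access token expired; "
            "RefreshTokenAuthorizer should automatically refresh on next API call."
        )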

@forsyth2 forsyth2 force-pushed the improve-globus-refresh branch from 735fc3d to b0b9aae on December 4, 2025 20:02
@forsyth2 forsyth2 force-pushed the improve-globus-refresh branch from b0b9aae to 4d1dae4 on December 4, 2025 20:24
@forsyth2 forsyth2 force-pushed the improve-globus-refresh branch from 4d1dae4 to fc0d2be on December 4, 2025 20:26
@forsyth2
Collaborator Author

forsyth2 commented Dec 4, 2025

Ok, I think I've found a way to mock a long transfer without interfering with any actual code, meaning we can run a 48+ hour test without needing terabytes of data. We will see in a few days.

Test 2025-12-04 Try 2 setup
cd ~/ez/zstash
git status
# On branch improve-globus-refresh
# nothing to commit, working tree clean
git log
# Commit history looks right
lcrc_conda
rm -rf build
conda clean --all --y
conda env create -f conda/dev.yml -n test-pr407-token-timeout-20251204-try2
conda activate test-pr407-token-timeout-20251204-try2

# Now, edit the debug flag in globus.py:
# DEBUG_LONG_TRANSFER: bool = True

pre-commit run --all-files # Optional, but ensures the files are formatted & linted.
python -m pip install . # Install the code of this branch into the newly created environment

# Activate Globus endpoints
# Go to https://app.globus.org/file-manager?two_pane=true > For "Collection", 
# choose LCRC Improv DTN, NERSC HPSS
# Didn't need to re-authenticate.

# Start with fresh token files
rm -rf ~/.zstash.ini
rm -rf ~/.zstash_globus_tokens.json

# Start with no Globus consents
# https://auth.globus.org/v2/web/consents > Globus Endpoint Performance Monitoring > rescind all

mkdir -p /home/ac.forsyth2/zstash_tests/pr407_token_timeout_2025104_try2/cache
mkdir -p /home/ac.forsyth2/zstash_tests/pr407_token_timeout_2025104_try2/src_dir
mkdir -p /home/ac.forsyth2/zstash_tests/pr407_token_timeout_2025104_try2/run_dir
echo "File contents" > /home/ac.forsyth2/zstash_tests/pr407_token_timeout_2025104_try2/src_dir/file0.txt

# On NERSC Perlmutter:
# hsi 
# cd zstash_tests
# mkdir pr407_token_timeout_2025104_try2
# cd pr407_token_timeout_2025104_try2
# mkdir dst_dir
# exit

# Set up a screen because this command will not finish before you lose SSH connection.
screen 
screen -ls # See what login node this screen is attached to
# There is a screen on:
#         558378.pts-37.chrlogin1 (Attached)
# 1 Socket in /run/screen/S-ac.forsyth2.

# Screen doesn't seem to keep conda environment information, so we need to redo that part:
lcrc_conda
conda activate test-pr407-token-timeout-20251204-try2

cd /home/ac.forsyth2/zstash_tests/pr407_token_timeout_2025104_try2/run_dir

# Test 1: create, non-blocking, non-keep (but with the bug of issue 374)
# LCRC_IMPROV_DTN_ENDPOINT=15288284-7006-4041-ba1a-6b52501e49f1
# NERSC_HPSS_ENDPOINT=9cd89cfd-6d04-11e5-ba46-22000b92c6ec
time zstash create --non-blocking --hpss=globus://9cd89cfd-6d04-11e5-ba46-22000b92c6ec/home/f/forsyth/zstash_tests/pr407_token_timeout_2025104_try2/dst_dir --cache=/home/ac.forsyth2/zstash_tests/pr407_token_timeout_2025104_try2/cache /home/ac.forsyth2/zstash_tests/pr407_token_timeout_2025104_try2/src_dir 2>&1 | tee pr407_20251204_try2_create_non_blocking.log

# Authenticate to Argonne, NERSC
# Enter auth code on the command line

# INFO: 20251204_210052_849631: TESTING (non-blocking): Sleeping for 49 hours to let access token expire

# CTRL A D

screen -ls
# There is a screen on:
# 	558378.pts-37.chrlogin1	(Detached)
# 1 Socket in /run/screen/S-ac.forsyth2.

From Claude:

What to look for in the logs:
Success indicators:

  • "Woke up after 49 hours" message
  • Transfer submitted successfully
  • No authentication/token errors
  • Command completes normally

Failure indicators:

  • Authentication errors after waking up
  • "NoCredException" or similar token errors
  • Command crashes after the 49-hour sleep

@TonyB9000
Collaborator

@forsyth2 Nice work Ryan! I look forward to the results.

@forsyth2
Collaborator Author

forsyth2 commented Dec 8, 2025

Reviewing results
cd ~/ez/zstash
git status
# On branch improve-globus-refresh
git diff
# -DEBUG_LONG_TRANSFER: bool = False  # Set to true if testing token expiration handling
# +DEBUG_LONG_TRANSFER: bool = True  # Set to true if testing token expiration handling
git log
# Good, matches the 5 commits at https://github.com/E3SM-Project/zstash/pull/407/commits
lcrc_conda
conda activate test-pr407-token-timeout-20251204-try2

# https://app.globus.org/activity
# No failed tasks listed

# Good!

cd /home/ac.forsyth2/zstash_tests/pr407_token_timeout_2025104_try2/run_dir
tail -n 4 pr407_20251204_try2_create_non_blocking.log
# INFO: Transferring file to HPSS: /home/ac.forsyth2/zstash_tests/pr407_token_timeout_2025104_try2/cache/index.db
# INFO: 20251206_220055_820659: DIVING: hpss calls globus_transfer(name=index.db)
# INFO: 20251206_220055_820686: Entered globus_transfer() for name = index.db
# INFO: 20251206_220055_873391: TESTING (non-blocking): Sleeping for 49 hours to let access token expire

# Not promising...

screen -ls
# No Sockets found in /run/screen/S-ac.forsyth2.
hostname
# chrlogin1.lcrc.anl.gov

# Previously, we had:
# There is a screen on:
# 	558378.pts-37.chrlogin1	(Detached)
# 1 Socket in /run/screen/S-ac.forsyth2.

# So, it was in fact on chrlogin1. 
# That implies the LCRC maintenance today terminated my screen session...

# But does that actually matter?

emacs pr407_20251204_try2_create_non_blocking.log
# INFO: 20251204_210052_849631: TESTING (non-blocking): Sleeping for 49 hours to let access token expire
# INFO: 20251206_220052_869770: TESTING (non-blocking): Woke up after 49 hours. Access token expired, RefreshTokenAuthorizer should automatically refresh on next API call.

# Good!!

# We got past the 48-hour window, which is what we really cared about in testing.
# The sleep we saw initially was the *second* time sleeping.

@chengzhuzhang (also @TonyB9000 and @golaz may be interested) Ok, some good news. It looks like the latest commit (fc0d2be) did actually allow us to mock a large transfer (by using sleep).

cd /home/ac.forsyth2/zstash_tests/pr407_token_timeout_2025104_try2/run_dir
emacs pr407_20251204_try2_create_non_blocking.log
# INFO: 20251204_210052_849631: TESTING (non-blocking): Sleeping for 49 hours to let access token expire
# INFO: 20251206_220052_869770: TESTING (non-blocking): Woke up after 49 hours. Access token expired, RefreshTokenAuthorizer should automatically refresh on next API call.

That is, the command was able to continue on even after a 48+ hour wait.

@chengzhuzhang I have 3 potential next steps in mind:

  • Test on a blocking case.
  • Have users test in real use cases, as laid out in the earlier comment.
  • You mentioned a Globus point-of-contact could review this PR. (I would write up some explanatory notes to facilitate this).

Is there a particular step we should do next? Because this test takes so long to run (by definition) and #408 hasn't been merged yet, it is difficult to do other zstash work while a Globus consent is granted for this long run (that is, we don't want to revoke that to run on a different endpoint, etc.). It would be wise to be strategic about the order of these steps -- i.e., do we want the Globus point-of-contact to look at this before we spend any more time testing or do we want to test thoroughly before checking in with Globus support?

Lastly, my latest view of the zstash task prioritization is the following. Please let me know if this is in line with your view.

  1. Fix token timeout (i.e., remove step 6 from the cumbersome steps listed in #339, "zstash Globus functionality has become overly cumbersome") -- this PR
  2. Delete tar files when --non-blocking is set -- #405 ("Delete transferred files") is the actual fix; the relevant test, #404 ("Add test for tar deletion"), has already been merged.
  3. Speed up zstash update -- #409 ("[Bug]: speed up zstash update")
  4. Speed up zstash check for updated runs -- #410 ("[Bug]: speed up zstash check for updated runs")
  5. Better handle the token file (i.e., store multiple tokens, allow the user to specify a token file) -- #408 ("[Feature]: Better handle the Globus token file")
  6. Support Python 3.14 -- #402 ("Add Python 3.14 support"). This PR has turned out to be non-trivial.
  7. Support Globus 4.0 -- #406 ("Support Globus 4.0"). I think it's possible this could be put off for a while. I don't think we're going to be forced to use it in the Spring release.

@chengzhuzhang
Collaborator

@forsyth2 This is promising regarding the token refresh. I may have missed this detail, but is there a way to verify that the token is actually refreshed?
I'm not familiar with this issue: "Delete tar files when --non-blocking is set" (#405).
What is the expected behavior for --non-blocking vs --blocking? I think @golaz can advise on this better.

@forsyth2
Collaborator Author

forsyth2 commented Dec 9, 2025

@chengzhuzhang

is there a way to verify if the token is actually refreshed?

I'm not aware of / haven't looked into a direct verification. https://auth.globus.org/v2/web/consents > Globus Endpoint Performance Monitoring will show you consents, but those are different from auth tokens, in my understanding.

Experimentally, this simulated large transfer slept for 49 hours and then started another 49-hour sleep to do a second transfer. That implies to me that it was able to continue transferring. But really, the most thorough test would be truly transferring a large enough amount of data (to confirm there isn't a bug in the mocking/debugging code introduced in fc0d2be).
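One indirect way to observe a refresh (an idea on my part, not something zstash currently does, as far as I know) would be to attach an on_refresh callback to the RefreshTokenAuthorizer; the Globus SDK invokes it only when it actually fetches a new access token:

import globus_sdk

def log_refresh(token_response: globus_sdk.OAuthTokenResponse) -> None:
    # Invoked by the SDK only when the access token is actually refreshed.
    new_tokens = token_response.by_resource_server["transfer.api.globus.org"]
    print(f"Access token refreshed; new expiry (epoch seconds): {new_tokens['expires_at_seconds']}")

def make_logging_authorizer(
    refresh_token: str, auth_client: globus_sdk.NativeAppAuthClient
) -> globus_sdk.RefreshTokenAuthorizer:
    """Wrap a stored refresh token in an authorizer that logs every actual refresh."""
    return globus_sdk.RefreshTokenAuthorizer(refresh_token, auth_client, on_refresh=log_refresh)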

Full output log after initial auth prompt
cd /home/ac.forsyth2/zstash_tests/pr407_token_timeout_2025104_try2/run_dir
cat pr407_20251204_try2_create_non_blocking.log

gives:

INFO: Gathering list of files to archive
INFO: 20251204_210052_771018: Creating new tar archive 000000.tar
INFO: Archiving file0.txt
INFO: 20251204_210052_796080: (add_files): Completed archive file 000000.tar
INFO: Contents of the cache prior to `hpss_put`: ['index.db', '000000.tar']
INFO: 20251204_210052_796309: DIVING: (add_files): Calling hpss_put to dispatch archive file 000000.tar [keep, non_blocking] = [False, True]
INFO: 20251204_210052_796354: in hpss_transfer, prev_transfers is starting as []
INFO: Transferring file to HPSS: /home/ac.forsyth2/zstash_tests/pr407_token_timeout_2025104_try2/cache/000000.tar
INFO: 20251204_210052_796433: DIVING: hpss calls globus_transfer(name=000000.tar)
INFO: 20251204_210052_796461: Entered globus_transfer() for name = 000000.tar
INFO: 20251204_210052_849631: TESTING (non-blocking): Sleeping for 49 hours to let access token expire
INFO: 20251206_220052_869770: TESTING (non-blocking): Woke up after 49 hours. Access token expired, RefreshTokenAuthorizer should automatically refresh on next API call.
INFO: 20251206_220053_468767: TransferData: accumulated items:
   (routine)  PUSHING (#1) STORED source item: /gpfs/fs1/home/ac.forsyth2/zstash_tests/pr407_token_timeout_2025104_try2/cache/000000.tar
INFO: 20251206_220053_468900: DIVING: Submit Transfer for dst_dir 000000
INFO: 20251206_220055_666665: SURFACE Submit Transfer returned new task_id = 08c42b94-d2ef-11f0-874f-0221bf474485 for label dst_dir 000000
INFO: 20251206_220055_666762: NO BLOCKING (task_wait) for task_id 08c42b94-d2ef-11f0-874f-0221bf474485
INFO: 20251206_220055_666823: SURFACE hpss globus_transfer(name=000000.tar) returns UNKNOWN
INFO: 20251206_220055_669160: SURFACE (add_files): Called hpss_put to dispatch archive file 000000.tar
INFO: tar name=000000.tar, tar size=10240, tar md5=fd6125c9d4b99e08b66fc3e6f5e3ac1e
INFO: Adding 000000.tar to the database.
INFO: 20251206_220055_820561: in hpss_transfer, prev_transfers is starting as []
INFO: Transferring file to HPSS: /home/ac.forsyth2/zstash_tests/pr407_token_timeout_2025104_try2/cache/index.db
INFO: 20251206_220055_820659: DIVING: hpss calls globus_transfer(name=index.db)
INFO: 20251206_220055_820686: Entered globus_transfer() for name = index.db
INFO: 20251206_220055_873391: TESTING (non-blocking): Sleeping for 49 hours to let access token expire

Notably:

INFO: 20251204_210052_849631: TESTING (non-blocking): Sleeping for 49 hours to let access token expire
INFO: 20251206_220052_869770: TESTING (non-blocking): Woke up after 49 hours. Access token expired, RefreshTokenAuthorizer should automatically refresh on next API call.

and

INFO: Transferring file to HPSS: /home/ac.forsyth2/zstash_tests/pr407_token_timeout_2025104_try2/cache/index.db
INFO: 20251206_220055_820659: DIVING: hpss calls globus_transfer(name=index.db)
INFO: 20251206_220055_820686: Entered globus_transfer() for name = index.db
INFO: 20251206_220055_873391: TESTING (non-blocking): Sleeping for 49 hours to let access token expire

I'm not familiar with this issue: Delete tar files when --non-blocking is set
What are the expected behavior for --non-blocking vs --blocking?

The problem is that tars aren't being deleted when --non-blocking is on but --keep is off. That means tars pile up on disk even when not required, which is frustrating when disk space is extremely limited. (If --keep is off, tars should get deleted in both --non-blocking and blocking modes, as tested in #404.) @tangq reported this issue in #374. I believe @TonyB9000 has encountered this as well.
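For concreteness, the expected semantics could be sketched roughly like this (an illustration only, not the actual zstash implementation):

import os

def finalize_tar(tar_path: str, keep: bool, transfer_succeeded: bool) -> None:
    """--keep is the only thing that should prevent deletion; blocking vs
    --non-blocking should not change the outcome once a tar's transfer completes."""
    if keep:
        return  # user explicitly asked to keep the local tars
    if transfer_succeeded:
        os.remove(tar_path)  # expected in BOTH blocking and non-blocking modes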

@forsyth2
Collaborator Author

forsyth2 commented Dec 9, 2025

@chengzhuzhang -- @rljacob relayed an observation from @NickolausDS that the issue may actually be resolved on main already.

It appears that #380 & #397 may have actually resolved this issue via the get_transfer_client_with_auth function.

Both of those PRs were included in the zstash v1.5.0 release, which in turn was included in the E3SM Unified v1.12.0 environment. That means, in theory, the latest Unified environment doesn't have the token timeout issue.

We can therefore test with that environment. The only file systems I have the required space (24 TB) on are LCRC's /gpfs/fs0/globalscratch/ and NERSC HPSS, so I will transfer from the former to the latter. If this transfers the data without timing out, then we can confirm that main doesn't have the issue anymore.

I could try to add the mock transfer code of fc0d2be rather than transferring 24 TB, but then 1) we wouldn't be able to use the Unified environment itself and 2) there'd be a chance the mock code isn't actually implemented correctly. I think just transferring 24 TB is the most straightforward path.

I'm starting this test Monday 12/08 evening, so 48 hours will take us to Wednesday 12/10 evening.

Testing steps

Since we're just using the Unified environment, we don't even need to set up a branch or dev environment first.

Previously, we had calculated:

# On Chrysalis
ls /lcrc/group/e3sm/ac.forsyth2/E3SMv2/v2.LR.historical_0201
du -sh /lcrc/group/e3sm/ac.forsyth2/E3SMv2/v2.LR.historical_0201
# 24T	/lcrc/group/e3sm/ac.forsyth2/E3SMv2/v2.LR.historical_0201

Now, let's begin the test setup:

# On Perlmutter
hsi
ls
pwd
# pwd0: /home/f/forsyth
mkdir zstash_test_token_timeout_main20251208
exit
# On Chrysalis
# We need to use the extra scratch space to have sufficient disk space.
cd /gpfs/fs0/globalscratch/ac.forsyth2/
mkdir zstash_test_token_timeout_main20251208
cd zstash_test_token_timeout_main20251208
mkdir cache

# Start fresh
# Go to https://app.globus.org/file-manager?two_pane=true > For "Collection", choose: LCRC Improv DTN, NERSC HPSS
rm -rf ~/.zstash.ini
rm -rf ~/.zstash_globus_tokens.json
# https://auth.globus.org/v2/web/consents > Globus Endpoint Performance Monitoring > rescind all

# Start a screen session, 
# so the transfer will continue even if the connection is interrupted:
screen 
screen -ls
# There is a screen on:
#         129907.pts-19.chrlogin1 (Attached)
# 1 Socket in /run/screen/S-ac.forsyth2.
pwd
# Good, /gpfs/fs0/globalscratch/ac.forsyth2/zstash_test_token_timeout_main20251208

# Activate the conda environment
source /lcrc/soft/climate/e3sm-unified/load_latest_e3sm_unified_chrysalis.sh
# Good, e3sm_unified_1.12.0_login

# NERSC HPSS endpoint: 9cd89cfd-6d04-11e5-ba46-22000b92c6ec
time zstash create --non-blocking --hpss=globus://9cd89cfd-6d04-11e5-ba46-22000b92c6ec//home/f/forsyth/zstash_test_token_timeout_main20251208 --cache=/gpfs/fs0/globalscratch/ac.forsyth2/zstash_test_token_timeout_main20251208/cache /lcrc/group/e3sm/ac.forsyth2/E3SMv2/v2.LR.historical_0201 2>&1 | tee zstash_test_token_timeout_main20251208.log

# Auth code prompt
# Use LCRC/NERSC identities
# Label the consent: zstash_test_token_timeout_main20251208
# Paste auth code to command line.

# Get out of screen without terminating it:
# CTRL A D

screen -ls
# There is a screen on:
# 	129907.pts-19.chrlogin1	(Detached)
# 1 Socket in /run/screen/S-ac.forsyth2.

# To check back in:
screen -R 
# Check if still running
# CTRL A D

@forsyth2
Collaborator Author

Results

Checking in 2025-12-09 end-of-day (~24 hours in)

Chrysalis:

# Can't seem to get on to chrlogin1
cd /gpfs/fs0/globalscratch/ac.forsyth2/zstash_test_token_timeout_main20251208
tail zstash_test_token_timeout_main20251208.log
# No error in tail

# https://app.globus.org/activity
# There is an active transfer.

Checking in 2025-12-10 14:45

Chrysalis:

screen -ls
# There is a screen on:
# 	129907.pts-19.chrlogin1	(Detached)
# 1 Socket in /run/screen/S-ac.forsyth2.

# To check back in:
screen -R 
# real    1627m3.482s
# user    1330m21.451s
# sys     147m26.516s
exit

1627 min * 1 hr/60 min = 27.12 hours

Recall we ran:

time zstash create --non-blocking --hpss=globus://9cd89cfd-6d04-11e5-ba46-22000b92c6ec//home/f/forsyth/zstash_test_token_timeout_main20251208 --cache=/gpfs/fs0/globalscratch/ac.forsyth2/zstash_test_token_timeout_main20251208/cache /lcrc/group/e3sm/ac.forsyth2/E3SMv2/v2.LR.historical_0201 2>&1 | tee zstash_test_token_timeout_main20251208.log

So we had:

src dir: Chrysalis /lcrc/group/e3sm/ac.forsyth2/E3SMv2/v2.LR.historical_0201
cache: Chrysalis /gpfs/fs0/globalscratch/ac.forsyth2/zstash_test_token_timeout_main20251208/cache

dst dir: NERSC HPSS /home/f/forsyth/zstash_test_token_timeout_main20251208

Chrysalis:

# src dir
# Recall previously we had done:
du -sh /lcrc/group/e3sm/ac.forsyth2/E3SMv2/v2.LR.historical_0201
# 24T	/lcrc/group/e3sm/ac.forsyth2/E3SMv2/v2.LR.historical_0201
# At a transfer rate of 0.43 TB/hour, as we had assumed,
# that would have taken 55.18 hours to transfer.
# But the command finished after 27 hours.

# cache
cd /gpfs/fs0/globalscratch/ac.forsyth2/zstash_test_token_timeout_main20251208/cache
ls # 000000.tar - 00005e.tar = tars 0-94 (95 tars)

Perlmutter:

# dst dir
hsi
pwd
# pwd0: /home/f/forsyth
cd zstash_test_token_timeout_main20251208
ls
# Empty!
# Nothing was transferred!
exit

It unfortunately appears LCRC Improv DTN doesn't have access to /gpfs/fs0/globalscratch/ac.forsyth2/. Indeed:

  1. https://app.globus.org/file-manager
  2. Select the "LCRC Improv DTN" endpoint
  3. Try to navigate to /gpfs/fs0/globalscratch/ac.forsyth2/
  4. Get "You do not have permission to list the contents of /gpfs/fs0/globalscratch/ac.forsyth2/."

@chengzhuzhang @rljacob Options to proceed with testing if zstash in the latest E3SM Unified is free of the token timeout problem:

  1. Is there some other endpoint with access to /gpfs/fs0/globalscratch/ac.forsyth2/ that I could use?
  2. First, merge the tar file deletion fix (#405, "Delete transferred files"). Then, it may be possible to use /lcrc/group/e3sm/ac.forsyth2/, which LCRC Improv DTN can access, without exhausting disk space.
  3. Wait for someone who was going to run a long zstash transfer anyway to try it out.
  4. Is Wuyin's experience sufficient proof that the 48-hour token timeout is gone? @wlin7 How long was your run? More than 48 hours? "My most recent zstash run was just a few hours after E3SM Unified 1.12.0 was deployed. Connecting to Globus from chrysalis no more requiring authentication. The connection from Globus to HPSS still needs to be reestablished from Globus web interface. That probably has a 10-day expiration window."

@TonyB9000
Collaborator

@forsyth2 It may be possible for the "LCRC Improv DTN" management to provide access to /gpfs/fs0/globalscratch/ac.forsyth2/. That would be the most useful path forward.

It will be weeks (likely January) before I conduct significant transfers (namely the v3 LE ssp245 native data), and as it stands, /lcrc/group/e3sm2 is already tight if pulling from NERSC were to be the test:

DIR = /lcrc/group/e3sm2:
              Space           Inodes
    Total:     3056 TB      50003968
     Used:     2920 TB      20660042
     Free:      135 TB      29343926

DIR = /lcrc/group/e3sm:
              Space           Inodes
    Total:     3056 TB     157224960
     Used:     2996 TB     151289444
     Free:       60 TB       5935516

DIR = /gpfs/fs0/globalscratch:
              Space           Inodes
    Total:     9077 TB    8906792960
     Used:     5106 TB     970143584
     Free:     3971 TB    7936649376

@rljacob
Member

rljacob commented Dec 11, 2025

@forsyth2 use /lcrc/group/e3sm2

@forsyth2
Collaborator Author

Use /lcrc/group/e3sm2

Thanks @rljacob!

@chengzhuzhang You mentioned here that:

[@wlin7] tested and confirmed recently that token refreshing works with the zstash release that includes the fix above.

where "the fix above" appears to imply the changes of #380 & #397 (i.e., the zstash found in the latest Unified release).

Does this mean there is no longer a need for me to 1) test this myself or 2) merge this PR?

@chengzhuzhang
Collaborator

We will have an EZ meeting this afternoon on zstash. @wlin7 will join us. Let's clarify with him at the meeting.

@forsyth2
Collaborator Author

forsyth2 commented Dec 15, 2025

@wlin7 has tested multiple authentications using one token (i.e., confirming the "toy problem" is no longer needed, which was the purpose of #380). He will be starting a long-running test that takes 3+ days to confirm a single long run can exceed the original 48-hour token timeout. Thanks @wlin7!

@forsyth2
Collaborator Author

@wlin7 has noted:

The test of a long zstash/globus transfer is completed. Other than a hiccup due to networking issue, all went well without additional authentication. (loading existing auth token and end point activations). The full process took nearly 120 hours.

@chengzhuzhang @golaz This is good evidence that E3SM Unified v1.12.0 (which includes zstash v1.5.0) is no longer affected by the 48-hour token timeout problem (step 6 in #339). That means this PR can probably be closed, unless we want to try to retain/update the Claude-written tests. In any case, I think we can close #339 as completed, since #380 & #397 apparently resolved not only steps 2-4 but also step 6 in that issue.

@forsyth2
Collaborator Author

@chengzhuzhang & @wlin7 confirmed we can close this PR because it is fixed on Unified.
