Potential fix to token timeout #407
Conversation
Remaining TODO
Force-pushed 0e31048 to 735fc3d
|
@chengzhuzhang Just a status update: I used Claude to get some prototypes set up for the 4 components of the Globus integration improvements. I've ranked them in decreasing order of importance, as I understand it, here:
I anticipate many merge/rebase conflicts as each of these PRs goes in. Therefore, my plan is to merge them in the above order, ensuring we get the most important pieces merged first. With that in mind, I've begun testing this PR as the highest-priority one, using a large transfer that should hopefully take longer than 48 hours.
Initial testing setup
cd ~/ez/zstash
git status
# branch issue-398-token-file
# nothing to commit, working tree clean
git checkout improve-globus-refresh
conda env list # Get name of environment to reuse
conda activate zstash_globus_refresh
pre-commit run --all-files # Optional; just makes sure the files are looking clean.
python -m pip install .
According to #339:
That means we need to transfer at least 100 MB/sec * 60 sec/min * 60 min/hour * 48 hours = 17,280,000 MB = 17,280 GB = 17.28 TB. In #391 (reply in thread), the following zstash check command was run:
zstash check -v --keep --cache archives --hpss=globus://9cd89cfd-6d04-11e5-ba46-22000b92c6ec//home/g/golaz/E3SMv3.LR/v3.LR.piControl
which ran 6 TB (out of 70) over 14 hours, for a transfer rate of 6/14=0.43 TB/hour. 0.43 TB/hour * 48 hours = 20.64 TB. So, in order to trigger the test condition, we need to transfer upwards of 17.28-20.64 TB. Let's check if I even have space to store/transfer that much data:
Chrysalis:
lcrc-quota
# 38 GB available on /home/ac.forsyth2/
/usr/lpp/mmfs/bin/mmlsquota -u ac.forsyth2 --block-size T fs2
# 300-70 = 210 TB available
Perlmutter:
showquota --hpss
# 40-15.17 = 24.83 GiB available on home
# 20 TiB - 125.92 GiB = 19.87408 TiB available on pscratch
# 2 PiB - 1019.11 TiB = 0.98089 PiB available on HPSS = 980.89 TiB
Now, let's see if I have any datasets of the required size. The most likely match would be the dataset used for zppy's integration tests:
Chrysalis:
cat ~/ez/zppy/tests/integration/utils.py
# "user_input_v2": "/lcrc/group/e3sm/ac.forsyth2/",
# "user_input_v3": "/lcrc/group/e3sm2/ac.wlin/",
ls /lcrc/group/e3sm/ac.forsyth2/E3SMv2/v2.LR.historical_0201
du -sh /lcrc/group/e3sm/ac.forsyth2/E3SMv2/v2.LR.historical_0201
# 24T /lcrc/group/e3sm/ac.forsyth2/E3SMv2/v2.LR.historical_0201
ls /lcrc/group/e3sm2/ac.wlin/E3SMv3/v3.LR.historical_0051
du -sh /lcrc/group/e3sm2/ac.wlin/E3SMv3/v3.LR.historical_0051
# 21T /lcrc/group/e3sm2/ac.wlin/E3SMv3/v3.LR.historical_0051
Either of these is in theory large enough to surpass 48 hours of transfer time. Let's try the larger one. I only have 19 TiB = 20.89 TB available on Perlmutter:
hsi
ls
pwd
# pwd0: /home/f/forsyth
mkdir zstash_48_hour_run_test20251201
exit
Let's try the following tests:
Chrysalis:
cd /lcrc/group/e3sm/ac.forsyth2/
mkdir zstash_48_hour_run_test20251201
cd zstash_48_hour_run_test20251201
mkdir cache
# Start fresh
# Go to https://app.globus.org/file-manager?two_pane=true > For "Collection", choose: LCRC Improv DTN, NERSC HPSS
rm -rf ~/.zstash.ini
rm -rf ~/.zstash_globus_tokens.json
# https://auth.globus.org/v2/web/consents > Globus Endpoint Performance Monitoring > rescind all
# Start a screen session,
# so the transfer will continue even if the connection is interrupted:
screen
screen -ls
# There is a screen on:
# 2719818.pts-7.chrlogin2 (Attached)
# 1 Socket in /run/screen/S-ac.forsyth2
pwd
# Good, /lcrc/group/e3sm/ac.forsyth2/zstash_48_hour_run_test20251201
# Re-activate the conda environment:
lcrc_conda
conda activate zstash_globus_refresh
# NERSC HPSS endpoint: 9cd89cfd-6d04-11e5-ba46-22000b92c6ec
time zstash create --non-blocking --hpss=globus://9cd89cfd-6d04-11e5-ba46-22000b92c6ec//home/f/forsyth/zstash_48_hour_run_test20251201_try3 --cache=/lcrc/group/e3sm/ac.forsyth2/zstash_48_hour_run_test20251201/cache /lcrc/group/e3sm/ac.forsyth2/E3SMv2/v2.LR.historical_0201 2>&1 | tee pr407_create_non_blocking.log
# CTRL A D
screen -ls
# There is a screen on:
# 2719818.pts-7.chrlogin2 (Detached)
# 1 Socket in /run/screen/S-ac.forsyth2.
# Mon 2025-12-01 17:34
# Check back Wed 2025-12-03 17:34
# Initial checkin
screen -R # 18:03
# Good, still running after ~30 minutes
# CTRL A D |
# Tue 2025-12-02 14:11 checkin
hostname
# chrlogin2.lcrc.anl.gov
screen -ls
# There is a screen on:
# 2719818.pts-7.chrlogin2 (Detached)
# 1 Socket in /run/screen/S-ac.forsyth2.
screen -R
# Good, still running
# CTRL A D
Ran into:
lcrc-quota
# ----------------------------------------------------------------------------------------
# Home Current Usage Space Avail Quota Limit Grace Time
# ----------------------------------------------------------------------------------------
# ac.forsyth2 61 GB 38 GB 100 GB
# ----------------------------------------------------------------------------------------
# Project Current Usage Space Avail Quota Limit Grace Time
# ----------------------------------------------------------------------------------------
/usr/lpp/mmfs/bin/mmlsquota -u ac.forsyth2 --block-size T fs2
# /usr/lpp/mmfs/bin/mmlsquota -u ac.forsyth2 --block-size T fs2
# Block Limits | File Limits
# Filesystem Fileset type TB quota limit in_doubt grace | files quota limit in_doubt grace Remarks
# fs2 root USR 84 300 300 1 none | 5986737 0 0 148 none lcrcstg.lcrc.anl.gov
Neither of these indicates any problems with disk space though... Turns out the whole project is out of disk space on
So, that means after 1285/60=21.42 hours and transferring ~14 out of the planned 24 TB, we've hit a disk space limit. |
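For reference, the sizing arithmetic above can be written out in a few lines. This is just a back-of-the-envelope sketch; the 100 MB/s and 0.43 TB/hour rates are the estimates quoted earlier in this thread, not guarantees:
```python
# Back-of-the-envelope sizing for a 48-hour transfer test (decimal units).
SECONDS_PER_HOUR = 3600
HOURS = 48

# Baseline 1: ~100 MB/s Chrysalis -> NERSC HPSS (from #339)
mb_moved = 100 * SECONDS_PER_HOUR * HOURS       # 17,280,000 MB
tb_needed_fast = mb_moved / 1_000_000           # 17.28 TB

# Baseline 2: 6 TB in 14 hours (from #391), i.e. ~0.43 TB/hour
tb_per_hour = 6 / 14
tb_needed_slow = round(tb_per_hour * HOURS, 2)  # 20.57 TB (rounded to 20.64 above)

print(f"Need roughly {tb_needed_fast}-{tb_needed_slow} TB to exceed 48 hours")
```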
|
@chengzhuzhang @TonyB9000 I'm trying to test this PR by running a 48+ hour transfer. I made careful note of my own space allowances, but it appears testing still hit a project-wide cap. Is it possible to test a 48+ hour transfer on less data? (I made calculations above re: how much data would need to be transferred). I guess this is where having this feature would be useful:
Perhaps it would be better to test/merge that PR first? |
|
@forsyth2 when testing this, we should try not to use lcrc to receive data since the disk storage is running low. |
|
Ok, sure I can write instructions on how to test this, and we can see if any of them still run into the token problem. |
|
@chengzhuzhang If the project is close to full utilization of |
|
@chengzhuzhang I'm thinking it would make more sense to resolve "Delete tar files when |
There is also another space, e3sm2, as well as a scratch space that folks can leverage. My point is that it would be ideal to test this in a real use case while avoiding the need to occupy additional disk space with duplicate testing data. |
|
@chengzhuzhang @forsyth2 Right now, e3sm is choking: Soon, e3sm2 will be tight as well, as I (a) generate more v3 LE CMIP6 data and (b) fetch NERSC v3 LE ocean data. Good thing I can work right now using Wuyin's local atmos data. |
|
Hi @jonbob, @wlin7, @xuezhengllnl, would any of you be able to test this pull request (PR)?
What does this PR do?
In #339, Chris Golaz noted that the current zstash-Globus integration involves the cumbersome step of:
This PR is meant to resolve that issue. That is, a
What needs to be tested?
Despite working with Claude to write some mock tests, we still need to test the real thing. That is, we need to do a 48+ hour transfer. Using "transfer speeds between chrysalis and NERSC HPSS (~100 MB/s)" as a baseline, a proper test would have to transfer upwards of 100 MB/s * 60 s/m * 60 m/h * 48 h = 17,280,000 MB = 17,280 GB = 17.28 TB. We can also refer to #391 (reply in thread), where Tony ran the following zstash check command:
zstash check -v --keep --cache archives --hpss=globus://9cd89cfd-6d04-11e5-ba46-22000b92c6ec//home/g/golaz/E3SMv3.LR/v3.LR.piControl
That transferred 6 TB (out of 70) over 14 hours, giving a transfer rate of 6/14=0.43 TB/hour. Now, 0.43 TB/hour * 48 hours = 20.64 TB. So, in order to trigger the test condition, we need to transfer upwards of 17.28-20.64 TB.
Why can't I test it?
Space limitations
As of 12/02 afternoon:
Chrysalis:
lcrc-quota
# 38 GB available on /home/ac.forsyth2/
/usr/lpp/mmfs/bin/mmlsquota -u ac.forsyth2 --block-size T fs2
# 300-71 = 209 TB available on /lcrc/group/e3sm/ac.forsyth2/
Perlmutter:
showquota --hpss
# 40-15.17 = 24.83 GiB available on home (/global/homes/f/forsyth)
# 20 TiB - 125.92 GiB = 19.87408 TiB available on pscratch (/pscratch/sd/f/forsyth)
# 2 PiB - 1022.60 TiB = 0.9774 PiB available on HPSS = 977.40 TiB
I don't have enough space for 20+ TB on LCRC's
Unfortunately, the LCRC space is already close to project-wide full utilization, so this last option still won't work.
How you can test this PR
Set up
|
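For context on the approach this PR takes, as I understand it: the Globus refresh tokens get persisted to ~/.zstash_globus_tokens.json so that a long-running transfer does not require re-consent once the access token expires. A rough sketch of that pattern (illustrative only, not the PR's actual code; the helper names are mine):
```python
import json
import os

# Illustrative only -- not the PR's actual implementation.
# TOKEN_FILE matches the path removed in the "start fresh" steps in this thread.
TOKEN_FILE = os.path.expanduser("~/.zstash_globus_tokens.json")

def save_tokens(token_response):
    # token_response is the object returned by globus_sdk's OAuth flow;
    # by_resource_server maps each service (e.g. transfer.api.globus.org)
    # to its access token, refresh token, and expiration time.
    with open(TOKEN_FILE, "w") as f:
        json.dump(token_response.by_resource_server, f)

def load_transfer_tokens():
    with open(TOKEN_FILE) as f:
        tokens = json.load(f)
    return tokens["transfer.api.globus.org"]
```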
|
@chengzhuzhang Given @TonyB9000's comment that /gpfs/fs0/globalscratch is definitely the place to conduct volume testing (DIR = /gpfs/fs0/globalscratch), it does appear I have write access there:
cd /gpfs/fs0/globalscratch
|
I agree! |
|
Sounds good! @jonbob, @wlin7, @xuezhengllnl -- in this case, sorry for the early tagging; please ignore for now. We may ask for production testing in the future once we've done initial testing. |
Test 2 setup
cd ~/ez/zstash
git status
# On branch issue-374-tar-deletion-rebased20251124
# nothing to commit, working tree clean
git checkout improve-globus-refresh
git fetch upstream main
git rebase upstream/main
git log
# Commit history looks right
lcrc_conda
rm -rf build
conda clean --all --y
conda env create -f conda/dev.yml -n test-pr407-token-timeout-20251203
conda activate test-pr407-token-timeout-20251203
pre-commit run --all-files # Optional, but ensures the files are formatted & linted.
python -m pip install . # Install the code of this branch into the newly created environment
# Activate Globus endpoints
# Go to https://app.globus.org/file-manager?two_pane=true > For "Collection", choose LCRC Improv DTN
# Didn't need to re-authenticate.
# Start with fresh token files
rm -rf ~/.zstash.ini
rm -rf ~/.zstash_globus_tokens.json
# Start with no Globus consents
# https://auth.globus.org/v2/web/consents > Globus Endpoint Performance Monitoring > rescind all
# Set up a screen because this command will likely not finish before you lose SSH connection.
screen
screen -ls # See what login node this screen is attached to
# There is a screen on:
# 4129470.pts-16.chrlogin1 (Attached)
# 1 Socket in /run/screen/S-ac.forsyth2.
# Screen doesn't seem to keep conda environment information, so we need to redo that part:
lcrc_conda
conda activate test-pr407-token-timeout-20251203
mkdir -p /gpfs/fs0/globalscratch/ac.forsyth2/pr407_token_timeout_20251203_cache
mkdir -p /gpfs/fs0/globalscratch/ac.forsyth2/pr407_token_timeout_20251203_dst_dir
mkdir -p /gpfs/fs0/globalscratch/ac.forsyth2/pr407_token_timeout_20251203_run_dir
cd /gpfs/fs0/globalscratch/ac.forsyth2/pr407_token_timeout_20251203_run_dir
# Test 1: create, non-blocking, non-keep (but with the bug of issue 374)
# LCRC_IMPROV_DTN_ENDPOINT=15288284-7006-4041-ba1a-6b52501e49f1
time zstash create --non-blocking --hpss=globus://15288284-7006-4041-ba1a-6b52501e49f1///gpfs/fs0/globalscratch/ac.forsyth2/pr407_token_timeout_20251203_dst_dir --cache=/gpfs/fs0/globalscratch/ac.forsyth2/pr407_token_timeout_20251203_cache /lcrc/group/e3sm/ac.forsyth2/E3SMv2/v2.LR.historical_0201 2>&1 | tee pr407_20251203_create_non_blocking.log
# Enter auth code
# Exit the screen with:
# CTRL A D
screen -ls
# There is a screen on:
# 4129470.pts-16.chrlogin1 (Detached)
# 1 Socket in /run/screen/S-ac.forsyth2.
# RUNNING as of Wednesday 2025-12-03 17:57 (48 hours => Fri 17:57)
# Initial checkin
screen -R # Good, still going
# CTRL A D
# 18:21
ls /gpfs/fs0/globalscratch/ac.forsyth2/pr407_token_timeout_20251203_cache
# 000000.tar 000001.tar index.db
@chengzhuzhang I had mentioned previously we could try running Chrysalis-to-Chrysalis, but it appears the transfer speed is higher, which means we'd need a larger dataset. I think a transfer that would have taken 48+ hours Chrysalis-to-NERSC would only take 15 hours Chrysalis-to-Chrysalis. By my rough calculations, instead of 24 TB I'd need to transfer as much as 76.8 TB of data. And I don't have anything that large to transfer.
Calculation details
To avoid a long
When I ran
We had run this previously:
To transfer the entire thing would take:
We transferred 2 tars in 20 minutes as part of this latest Chrysalis-to-Chrysalis transfer. So, how much data would need to be transferred to hit the 48-hour limit on a Chrysalis-to-Chrysalis transfer?
Notes:
Now, as of yesterday (12/02) afternoon:
So, I certainly have room on NERSC HPSS to do the Chrysalis-NERSC transfer of 24 TB to hit the 48-hour timeout. But my question is: does NERSC HPSS have another project-wide cap we're close to hitting? If not, I'm going to test by transferring 24 TB from Chrysalis to NERSC HPSS, using |
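The scaling behind the 76.8 TB estimate above, written out explicitly (a sketch; the 15-hour Chrysalis-to-Chrysalis figure is itself a rough estimate):
```python
# If 24 TB takes ~48 hours Chrysalis -> NERSC but only ~15 hours
# Chrysalis -> Chrysalis, scale the volume so it still exceeds 48 hours.
tb_chrysalis_nersc = 24
hours_nersc = 48
hours_chrysalis = 15

tb_needed = tb_chrysalis_nersc * (hours_nersc / hours_chrysalis)
print(tb_needed)  # 76.8 TB
```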
I could however transfer data from another user's directory as long as I make my own cache on |
|
@forsyth2 It would be nice to add a summary of the code changes to the PR description. Example as follows: what the PR does to prevent token timeouts: |
@chengzhuzhang Ok I can do that going forward on this PR and others, but it makes more sense to add that in once the PR is actually in code review. Until that point, the implementation could change quite a bit.
I had been wanting to test the code as-is, but it occurs to me it might be possible to put a 2-day sleep/pause in the code for the sole purpose of testing. Then we don't actually have to transfer much data at all. I'm looking into the best way to do that. |
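A sketch of what such a test-only pause could look like. The DEBUG_LONG_TRANSFER flag that shows up later in this thread presumably works roughly along these lines, but the helper name and its placement in globus.py here are my own illustration:
```python
import logging
import time

logger = logging.getLogger(__name__)

# Module-level flag; flip to True only when testing token expiration.
DEBUG_LONG_TRANSFER: bool = False

def maybe_sleep_past_token_expiry(hours: int = 49) -> None:
    # Hypothetical helper: pause long enough for the ~48-hour access
    # token to expire, so the refresh path gets exercised without
    # actually moving terabytes of data.
    if DEBUG_LONG_TRANSFER:
        logger.info("TESTING: Sleeping for %d hours to let access token expire", hours)
        time.sleep(hours * 3600)
        logger.info("TESTING: Woke up after %d hours", hours)
```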
Force-pushed 735fc3d to b0b9aae, b0b9aae to 4d1dae4, and 4d1dae4 to fc0d2be
|
Ok, I think I've found a way to mock a long transfer without interfering with any actual code. That means we can run a 48+ hour test without needing terabytes of data. We will see in a few days.
Test 2025-12-04 Try 2 setup
cd ~/ez/zstash
git status
# On branch improve-globus-refresh
# nothing to commit, working tree clean
git log
# Commit history looks right
lcrc_conda
rm -rf build
conda clean --all --y
conda env create -f conda/dev.yml -n test-pr407-token-timeout-20251204-try2
conda activate test-pr407-token-timeout-20251204-try2
# Now, edit the debug flag in globus.py:
# DEBUG_LONG_TRANSFER: bool = True
pre-commit run --all-files # Optional, but ensures the files are formatted & linted.
python -m pip install . # Install the code of this branch into the newly created environment
# Activate Globus endpoints
# Go to https://app.globus.org/file-manager?two_pane=true > For "Collection",
# choose LCRC Improv DTN, NERSC HPSS
# Didn't need to re-authenticate.
# Start with fresh token files
rm -rf ~/.zstash.ini
rm -rf ~/.zstash_globus_tokens.json
# Start with no Globus consents
# https://auth.globus.org/v2/web/consents > Globus Endpoint Performance Monitoring > rescind all
mkdir -p /home/ac.forsyth2/zstash_tests/pr407_token_timeout_2025104_try2/cache
mkdir -p /home/ac.forsyth2/zstash_tests/pr407_token_timeout_2025104_try2/src_dir
mkdir -p /home/ac.forsyth2/zstash_tests/pr407_token_timeout_2025104_try2/run_dir
echo "File contents" > /home/ac.forsyth2/zstash_tests/pr407_token_timeout_2025104_try2/src_dir/file0.txt
# On NERSC Perlmutter:
# hsi
# cd zstash_tests
# mkdir pr407_token_timeout_2025104_try2
# cd pr407_token_timeout_2025104_try2
# mkdir dst_dir
# exit
# Set up a screen because this command will not finish before you lose SSH connection.
screen
screen -ls # See what login node this screen is attached to
# There is a screen on:
# 558378.pts-37.chrlogin1 (Attached)
# 1 Socket in /run/screen/S-ac.forsyth2.
# Screen doesn't seem to keep conda environment information, so we need to redo that part:
lcrc_conda
conda activate test-pr407-token-timeout-20251204-try2
cd /home/ac.forsyth2/zstash_tests/pr407_token_timeout_2025104_try2/run_dir
# Test 1: create, non-blocking, non-keep (but with the bug of issue 374)
# LCRC_IMPROV_DTN_ENDPOINT=15288284-7006-4041-ba1a-6b52501e49f1
# NERSC_HPSS_ENDPOINT=9cd89cfd-6d04-11e5-ba46-22000b92c6ec
time zstash create --non-blocking --hpss=globus://9cd89cfd-6d04-11e5-ba46-22000b92c6ec/home/f/forsyth/zstash_tests/pr407_token_timeout_2025104_try2/dst_dir --cache=/home/ac.forsyth2/zstash_tests/pr407_token_timeout_2025104_try2/cache /home/ac.forsyth2/zstash_tests/pr407_token_timeout_2025104_try2/src_dir 2>&1 | tee pr407_20251204_try2_create_non_blocking.log
# Authenticate to Argonne, NERSC
# Enter auth code on the command line
# INFO: 20251204_210052_849631: TESTING (non-blocking): Sleeping for 49 hours to let access token expire
# CTRL A D
screen -ls
# There is a screen on:
# 558378.pts-37.chrlogin1 (Detached)
# 1 Socket in /run/screen/S-ac.forsyth2.
From Claude: What to look for in the logs:
Failure indicators:
|
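For scanning the run log for those markers afterwards, something like the following works (the success strings match the INFO lines shown elsewhere in this thread; the failure string is an assumption and should be adjusted to the real log wording):
```python
# Quick scan of the zstash log for token-test markers.
markers = (
    "Sleeping for 49 hours",
    "Woke up after 49 hours",
    "Globus transfer failed",  # assumed failure wording; adjust as needed
)

with open("pr407_20251204_try2_create_non_blocking.log") as f:
    for line in f:
        if any(m in line for m in markers):
            print(line.rstrip())
```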
|
@forsyth2 Nice work Ryan! I look forward to the results. |
Reviewing results
cd ~/ez/zstash
git status
# On branch improve-globus-refresh
git diff
# -DEBUG_LONG_TRANSFER: bool = False # Set to true if testing token expiration handling
# +DEBUG_LONG_TRANSFER: bool = True # Set to true if testing token expiration handling
git log
# Good, matches the 5 commits of https://github.com/E3SM-Project/zstash/pull/407/commits
lcrc_conda
conda activate test-pr407-token-timeout-20251204-try2
# https://app.globus.org/activity
# No failed tasks listed
# Good!
cd /home/ac.forsyth2/zstash_tests/pr407_token_timeout_2025104_try2/run_dir
tail -n 4 pr407_20251204_try2_create_non_blocking.log
# INFO: Transferring file to HPSS: /home/ac.forsyth2/zstash_tests/pr407_token_timeout_2025104_try2/cache/index.db
# INFO: 20251206_220055_820659: DIVING: hpss calls globus_transfer(name=index.db)
# INFO: 20251206_220055_820686: Entered globus_transfer() for name = index.db
# INFO: 20251206_220055_873391: TESTING (non-blocking): Sleeping for 49 hours to let access token expire
# Not promising...
screen -ls
# No Sockets found in /run/screen/S-ac.forsyth2.
hostname
# chrlogin1.lcrc.anl.gov
# Previously, we had:
# There is a screen on:
# 558378.pts-37.chrlogin1 (Detached)
# 1 Socket in /run/screen/S-ac.forsyth2.
# So, it was in fact on chrlogin1.
# That implies the LCRC maintenance today terminated my screen session...
# But does that actually matter?
emacs pr407_20251204_try2_create_non_blocking.log
# INFO: 20251204_210052_849631: TESTING (non-blocking): Sleeping for 49 hours to let access token expire
# INFO: 20251206_220052_869770: TESTING (non-blocking): Woke up after 49 hours. Access token expired, RefreshTokenAuthorizer should automatically refresh on next API call.
# Good!!
# We got past the 48-hour window, which is what we really cared about in testing.
# The sleep we saw initially was the *second* time sleeping.
@chengzhuzhang (also @TonyB9000 and @golaz may be interested) Ok, some good news. It looks like the latest commit (fc0d2be) did actually allow us to mock a large transfer (by using
cd /home/ac.forsyth2/zstash_tests/pr407_token_timeout_2025104_try2/run_dir
emacs pr407_20251204_try2_create_non_blocking.log
# INFO: 20251204_210052_849631: TESTING (non-blocking): Sleeping for 49 hours to let access token expire
# INFO: 20251206_220052_869770: TESTING (non-blocking): Woke up after 49 hours. Access token expired, RefreshTokenAuthorizer should automatically refresh on next API call.
That is, the command was able to continue on even after a 48+ hour wait. @chengzhuzhang I have 3 potential next steps in mind:
Is there a particular step we should do next? Because this test takes so long to run (by definition) and #408 hasn't been merged yet, it is difficult to do other
Lastly, my latest view of the
|
|
@forsyth2 this is promising regarding token refresh. I may have missed this detail, but is there a way to verify if the token is actually refreshed? |
I'm not aware of / haven't looked into a direct verification (one option is sketched below). Experimentally, this simulated large transfer slept for 49 hours and then started another 49-hour sleep to do a second transfer. That implies to me that it was able to continue transferring. But really, the most thorough test would be truly transferring a large enough amount of data (to confirm there isn't a bug in the mocking/debugging code introduced in fc0d2be).
Full output log after initial auth prompt
gives:
Notably:
and
The problem is that tars aren't being deleted when |
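On the verification question above: with globus_sdk, one way to get direct evidence of a refresh is to pass an on_refresh callback when building the RefreshTokenAuthorizer and log when it fires. A minimal sketch, assuming globus_sdk 3.x and that the tokens have already been obtained via the native-app flow (this is not the PR's exact code; CLIENT_ID and the token values are placeholders):
```python
import logging
import globus_sdk

logger = logging.getLogger(__name__)

CLIENT_ID = "..."      # placeholder: zstash's native-app client ID
refresh_token = "..."  # placeholder: loaded from ~/.zstash_globus_tokens.json
access_token = "..."
expires_at = 0         # epoch seconds; 0 forces a refresh on first use

auth_client = globus_sdk.NativeAppAuthClient(CLIENT_ID)

def log_refresh(token_response):
    # Called whenever the authorizer trades the refresh token for a new
    # access token -- direct evidence that a refresh actually happened.
    logger.info("Globus access token was refreshed")

authorizer = globus_sdk.RefreshTokenAuthorizer(
    refresh_token,
    auth_client,
    access_token=access_token,
    expires_at=expires_at,
    on_refresh=log_refresh,
)
tc = globus_sdk.TransferClient(authorizer=authorizer)
# Any API call made after the access token expires (e.g. polling a task
# with tc.get_task(task_id)) should trigger log_refresh first.
```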
|
@chengzhuzhang -- @rljacob relayed an observation from @NickolausDS that the issue may actually be resolved on
It appears that #380 & #397 may have actually resolved this issue via the
Both of those PRs were included in the
We can therefore test with that environment. The only file systems I have the required space (24 TB) on are LCRC's
I could try to add the mock transfer code of fc0d2be rather than transferring 24 TB, but then 1) we wouldn't be able to use the Unified environment itself and 2) there'd be a chance the mock code isn't actually implemented correctly. I think just transferring 24 TB is the most straightforward path. I'm starting this test Monday 12/08 evening, so 48 hours will take us to Wednesday 12/10 evening.
Testing steps
Since we're just using the Unified environment, we don't even need to set up a branch or dev environment first. Previously, we had calculated:
# On Chrysalis
ls /lcrc/group/e3sm/ac.forsyth2/E3SMv2/v2.LR.historical_0201
du -sh /lcrc/group/e3sm/ac.forsyth2/E3SMv2/v2.LR.historical_0201
# 24T /lcrc/group/e3sm/ac.forsyth2/E3SMv2/v2.LR.historical_0201
Now, let's begin the test setup:
# On Perlmutter
hsi
ls
pwd
# pwd0: /home/f/forsyth
mkdir zstash_test_token_timeout_main20251208
exit
# On Chrysalis
# We need to use the extra scratch space to have sufficient disk space.
cd /gpfs/fs0/globalscratch/ac.forsyth2/
mkdir zstash_test_token_timeout_main20251208
cd zstash_test_token_timeout_main20251208
mkdir cache
# Start fresh
# Go to https://app.globus.org/file-manager?two_pane=true > For "Collection", choose: LCRC Improv DTN, NERSC HPSS
rm -rf ~/.zstash.ini
rm -rf ~/.zstash_globus_tokens.json
# https://auth.globus.org/v2/web/consents > Globus Endpoint Performance Monitoring > rescind all
# Start a screen session,
# so the transfer will continue even if the connection is interrupted:
screen
screen -ls
# There is a screen on:
# 129907.pts-19.chrlogin1 (Attached)
# 1 Socket in /run/screen/S-ac.forsyth2.
pwd
# Good, /gpfs/fs0/globalscratch/ac.forsyth2/zstash_test_token_timeout_main20251208
# Activate the conda environment
source /lcrc/soft/climate/e3sm-unified/load_latest_e3sm_unified_chrysalis.sh
# Good, e3sm_unified_1.12.0_login
# NERSC HPSS endpoint: 9cd89cfd-6d04-11e5-ba46-22000b92c6ec
time zstash create --non-blocking --hpss=globus://9cd89cfd-6d04-11e5-ba46-22000b92c6ec//home/f/forsyth/zstash_test_token_timeout_main20251208 --cache=/gpfs/fs0/globalscratch/ac.forsyth2/zstash_test_token_timeout_main20251208/cache /lcrc/group/e3sm/ac.forsyth2/E3SMv2/v2.LR.historical_0201 2>&1 | tee zstash_test_token_timeout_main20251208.log
# Auth code prompt
# Use LCRC/NERSC identities
# Label the consent: zstash_test_token_timeout_main20251208
# Paste auth code to command line.
# Get out of screen without terminating it:
# CTRL A D
screen -ls
# There is a screen on:
# 129907.pts-19.chrlogin1 (Detached)
# 1 Socket in /run/screen/S-ac.forsyth2.
# To check back in:
screen -R
# Check if still running
# CTRL A D |
Results
Checking in 2025-12-09 end-of-day (~24 hours in)
Chrysalis:
# Can't seem to get on to chrlogin1
cd /gpfs/fs0/globalscratch/ac.forsyth2/zstash_test_token_timeout_main20251208
tail zstash_test_token_timeout_main20251208.log
# No error in tail
# https://app.globus.org/activity
# There is an active transfer.
Checking in 2025-12-10 14:45
Chrysalis:
screen -ls
# There is a screen on:
# 129907.pts-19.chrlogin1 (Detached)
# 1 Socket in /run/screen/S-ac.forsyth2.
# To check back in:
screen -R
# real 1627m3.482s
# user 1330m21.451s
# sys 147m26.516s
exit
Recall we ran:
time zstash create --non-blocking --hpss=globus://9cd89cfd-6d04-11e5-ba46-22000b92c6ec//home/f/forsyth/zstash_test_token_timeout_main20251208 --cache=/gpfs/fs0/globalscratch/ac.forsyth2/zstash_test_token_timeout_main20251208/cache /lcrc/group/e3sm/ac.forsyth2/E3SMv2/v2.LR.historical_0201 2>&1 | tee zstash_test_token_timeout_main20251208.log
So we had:
Chrysalis:
# src dir
# Recall previously we had done:
du -sh /lcrc/group/e3sm/ac.forsyth2/E3SMv2/v2.LR.historical_0201
# 24T /lcrc/group/e3sm/ac.forsyth2/E3SMv2/v2.LR.historical_0201
# At a transfer rate of 0.43 TB/hour, as we had assumed,
# that would have taken 55.18 hours to transfer.
# But the command finished after 27 hours.
# cache
cd /gpfs/fs0/globalscratch/ac.forsyth2/zstash_test_token_timeout_main20251208/cache
ls # 000000.tar - 00005e.tar = tars 0-94 (95 tars)
Perlmutter:
# dst dir
hsi
pwd
# pwd0: /home/f/forsyth
cd zstash_test_token_timeout_main20251208
ls
# Empty!
# Nothing was transferred!
exit
It unfortunately appears LCRC Improv DTN doesn't have access to
@chengzhuzhang @rljacob Options to proceed with testing if
|
|
@forsyth2 It may be possible for the "LCRC Improv DTN" management to provide access to /gpfs/fs0/globalscratch/ac.forsyth2/. That would be the most useful path forward. It will be weeks (likely January) before I conduct significant transfers (namely the v3 LE ssp245 native data), and as it stands, /lcrc/group/e3sm2 is already tight, if pulling from NERSC would be a test: |
|
@forsyth2 use /lcrc/group/e3sm2 |
Thanks @rljacob! @chengzhuzhang You mentioned here that:
where "the fix above" appears to imply the changes of #380 & #397 (i.e., the Does this mean there is no longer a need for me to 1) test this myself or 2) merge this PR? |
|
We will have an EZ meeting this afternoon on zstash. @wlin7 will join us. Let's clarify with him at the meeting. |
|
@wlin7 has noted:
@chengzhuzhang @golaz This is good evidence that E3SM Unified |
|
@chengzhuzhang & @wlin7 confirmed we can close this PR because it is fixed on Unified. |
Summary
Objectives:
Issue resolution:
Select one: This pull request is...
Small Change