nisarqa does not exit cleanly on certain filesystems (NFS, NISAR On-Demand) #152

@nemo794

Description

When nisarqa uses local filesystems for the scratch directory (such as on a local computer, or in mission operations), it exits successfully.

However, when the scratch directory is on a Network File System (NFS) mount or on the NISAR On-Demand system, nisarqa successfully generates all outputs but then fails to exit cleanly: the shutil.rmtree() call that cleans up the scratch directory raises an OSError, and the PGE then reports "QA SAS program failed with exit code 1".

In these cases, the error appears similar to this:

Starting Quality Assurance for input file: output/rslc.h5
Input file validation complete.
`qa_reports` processing complete.
Absolute Radiometric Calibration CalTool complete.
Noise Equivalent Backscatter CalTool complete.
Point Target Analyzer CalTool complete.
QA SAS complete. For details, warnings, and errors see output log file.
Traceback (most recent call last):
  File "/opt/conda/bin/nisarqa", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/opt/conda/lib/python3.12/site-packages/nisarqa/__main__.py", line 219, in main
    run()
  File "/opt/conda/lib/python3.12/site-packages/nisarqa/__main__.py", line 190, in run
    with nisarqa.create_unique_subdirectory(
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.12/contextlib.py", line 144, in __exit__
    next(self.gen)
  File "/opt/conda/lib/python3.12/site-packages/nisarqa/utils/utils.py", line 761, in create_unique_subdirectory
    shutil.rmtree(path)
  File "/opt/conda/lib/python3.12/shutil.py", line 759, in rmtree
    _rmtree_safe_fd(stack, onexc)
  File "/opt/conda/lib/python3.12/shutil.py", line 703, in _rmtree_safe_fd
    onexc(func, path, err)
  File "/opt/conda/lib/python3.12/shutil.py", line 662, in _rmtree_safe_fd
    os.rmdir(name, dir_fd=dirfd)
OSError: [Errno 39] Directory not empty: PosixPath('/nobackupnfs1/nisar/mcayanan/track_frame_002_117_056_partial_individual_L_05_SV_00_NA_23_state-config/scratch/qa-rslc-20260115152651-tg_byu2_')
2026-01-15T15:27:21.093509Z, Error, PGE::rslc_l_pge.py, run_sas_program, 303004, /opt/pge/l1_l2/sas_runner.py:79, "QA SAS program failed with exit code 1"
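
The traceback shows the exception surfacing from the context manager's exit step, after the body of the `with` block has already finished, which is why all outputs exist even though the exit code is non-zero. A minimal sketch of that ordering follows. It assumes a hypothetical `scratch_dir` stand-in (not nisarqa's actual `create_unique_subdirectory`) and simulates the leftover entry on a local disk by calling `os.rmdir` on a directory that still contains one file:

```python
import contextlib
import errno
import os
import shutil
import tempfile


@contextlib.contextmanager
def scratch_dir(parent):
    """Hypothetical stand-in: create a scratch directory, remove it on exit."""
    path = tempfile.mkdtemp(prefix="qa-", dir=parent)
    try:
        yield path
    finally:
        # Simulates the last step of rmtree on NFS: the directory still
        # holds a leftover entry, so os.rmdir raises "Directory not empty".
        os.rmdir(path)


def run(parent):
    """Record what happened, in order, inside and after the with block."""
    events = []
    try:
        with scratch_dir(parent) as path:
            # Stand-in for the hidden .nfs* file left behind by the NFS client.
            with open(os.path.join(path, ".nfs123"), "w") as f:
                f.write("leftover")
            events.append("work complete")
    except OSError as exc:
        # Cleanup failed only after the body had already finished.
        events.append(exc.errno)
    return events
```

Calling `run()` on any writable parent directory records the completed work first and the `ENOTEMPTY` error second, mirroring the log above where "QA SAS complete" precedes the traceback.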

As described by M Cayaman, here is what's happening:

The reason [for this] specific error is due to a well-known behavior in Network File Systems (NFS) called "NFS Silly Rename."

It is essentially a "race condition" between your software trying to delete a folder and the network storage system confirming that the files inside are actually gone.

Here is the step-by-step breakdown of why it failed:

  1. The Trigger
    The nisarqa Python script uses a context manager (the with statement) to handle its scratch space.

    • Start: It creates a unique folder: .../scratch/qa-rslc-....
    • Work: It writes temporary files into that folder and reads them back.
    • Finish: As soon as the job finishes, the script instantly runs shutil.rmtree() to delete the folder and everything inside it.
  2. The Conflict (The "Race")
    On a local disk (like the one on your laptop or the NVMe drive on AWS), file deletion is instant.
    On NFS (/nobackupnfs1), it works differently:

    • When shutil.rmtree() deletes a file, if any process (even the kernel cache) still has a "handle" open on that file for even a microsecond, NFS cannot delete it immediately.
    • Instead, NFS renames the file to a hidden temporary name like .nfs0000001234. This preserves the data until the handle is fully closed. This is the "Silly Rename."
  3. The Crash
    The cleanup script runs effectively instantly:

    • Step A: Delete file temp_data.dat.
      • NFS Response: "Wait, a handle is lingering. I'll rename it to .nfs123 for a millisecond while I clean up."
    • Step B: Delete directory qa-rslc-....
      • System Check: "Is the directory empty?"
      • Reality: No. It contains the hidden file .nfs123.
    • Result: The command fails with OSError: [Errno 39] Directory not empty.
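
One possible way to tolerate this race, offered as a sketch rather than as the project's fix, is to retry the removal a few times and downgrade a persistent failure to a warning. The names `rmtree_with_retry` and `scratch_subdirectory` are hypothetical, and the attempt count and delay are arbitrary assumptions that would need tuning for a given filesystem:

```python
import contextlib
import errno
import os
import shutil
import tempfile
import time
import warnings
from pathlib import Path

# Errors that can plausibly be transient while .nfs* files are disappearing.
_RETRYABLE = {errno.ENOTEMPTY, errno.EBUSY, errno.ENOENT}


def rmtree_with_retry(path, attempts=5, delay=0.2):
    """Try to remove `path`; return True once it no longer exists.

    Retries when rmtree reports a non-empty or busy directory, or when an
    entry vanishes mid-walk. Any other OSError propagates unchanged.
    """
    for attempt in range(attempts):
        try:
            shutil.rmtree(path)
        except OSError as exc:
            if exc.errno not in _RETRYABLE:
                raise
        if not os.path.exists(path):
            return True
        if attempt < attempts - 1:
            time.sleep(delay)  # give the NFS client time to drop the handle
    return False


@contextlib.contextmanager
def scratch_subdirectory(parent, prefix="qa-"):
    """Hypothetical scratch-directory context manager with tolerant cleanup."""
    path = Path(tempfile.mkdtemp(prefix=prefix, dir=parent))
    try:
        yield path
    finally:
        if not rmtree_with_retry(path):
            # Outputs are already written; a leftover scratch directory is
            # reported instead of failing the whole run.
            warnings.warn(f"Could not remove scratch directory: {path}")
```

Retrying only helps when the lingering handle is released shortly after the first attempt. If the process itself still holds a file in the scratch directory open, that file has to be closed before cleanup for any retry to succeed.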
