.. _s3_io:

Loading From and Saving To S3 Buckets
=====================================

For cloud computing, it is natural to want to access data storage based on URIs.
At the present time, by far the most widely used platform for this is
`Amazon S3 "buckets" <https://aws.amazon.com/s3/>`_.

It is common to treat an S3 bucket like a "disk", storing files as individual S3
objects. S3 access URLs can also contain a nested
`'prefix string' <https://docs.aws.amazon.com/AmazonS3/latest/userguide/using-prefixes.html>`_
structure, which naturally mirrors sub-directories in a file-system.

While it would be possible for Iris to support S3 access directly, as it does the
"OPeNDAP" protocol for netCDF data, this approach has some serious limitations: most
notably, each supported file format would have to be separately extended to support S3
URLs in place of file paths for loading and saving.

Instead, we have found that it is most practical to perform this access using a
virtual file system approach. One drawback, however, is that this is best controlled
*outside* the Python code -- see details below.


TL;DR
-----
Install s3-fuse and use its ``s3fs`` command to create a file-system mount which maps
to an S3 bucket. S3 objects can then be accessed as regular files (read and write).

Fsspec, s3fs, fuse and s3-fuse
------------------------------
This approach depends on a set of related code solutions, as follows:

`fsspec <https://github.com/fsspec/filesystem_spec/blob/master/README.md>`_
is a general framework for implementing Python-file-like access to alternative storage
resources.

`s3fs <https://github.com/fsspec/s3fs>`_
is a package based on fsspec, which enables Python to "open" S3 data objects as Python
file-like objects for reading and writing.

`fuse <https://github.com/libfuse/libfuse>`_
is an interface library that enables a data resource to be "mounted" as a Linux
filesystem, with user (not root) privilege.

`s3-fuse <https://github.com/s3fs-fuse/s3fs-fuse/blob/master/README.md>`_
is a utility based on s3fs and fuse, which provides a POSIX-compatible "mount" so that
an S3 bucket can be accessed as a regular Unix file system.


Practical usage
---------------
Of the above, the only thing you actually need to know about is **s3-fuse**.

There is an initial one-time setup, plus actions to take before launching Python and
after it exits, each time you want to access S3 from Python.

Prior requirements
^^^^^^^^^^^^^^^^^^

Install "s3-fuse"
~~~~~~~~~~~~~~~~~
The most reliable method is to install it into your Linux O.S. See the
`installation instructions <https://github.com/s3fs-fuse/s3fs-fuse/blob/master/README.md#installation>`_.
This presumes that you perform a system installation with ``apt``, ``yum`` or similar.

If you do not have the necessary 'sudo' or root access permissions, we have found it
sufficient to install **into your Python environment only**, using conda.
Though not officially suggested, this appears to work on the Unix systems where we
have tried it.

For example:

.. code-block:: bash

   $ conda install s3-fuse

(Or better, put it into a reusable 'spec file' along with all your other requirements,
and then use ``$ conda create --file ...``.)

.. note::

   It is **not** possible to install s3fs-fuse into a Python environment with ``pip``,
   as it is not a Python package.


Create an empty mount directory
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
You need an empty directory in your existing filesystem tree, onto which you will map
your S3 bucket -- e.g.

.. code-block:: bash

   $ mkdir /home/self.me/s3_root/testbucket_mountpoint


Set up AWS credentials
~~~~~~~~~~~~~~~~~~~~~~
Provide S3 access credentials in an AWS credentials file, as described
`here in the s3-fuse documentation <https://github.com/s3fs-fuse/s3fs-fuse/blob/master/README.md#examples>`_.

There is a general introduction to AWS credentials
`here in the AWS documentation <https://docs.aws.amazon.com/cli/latest/userguide/getting-started-quickstart.html>`_
which should explain what you need here.
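
For illustration, one credential form which s3-fuse accepts (as described in its
README) is a ``${HOME}/.passwd-s3fs`` file, holding a single ``KEY_ID:SECRET_KEY``
line, which must be readable only by you. A minimal sketch -- the key values shown
here are placeholders, to be replaced with your own:

.. code-block:: bash

   # Store the AWS key pair where s3fs expects it (placeholder values shown).
   PASSWD_FILE="${HOME}/.passwd-s3fs"
   echo "MY_ACCESS_KEY_ID:MY_SECRET_ACCESS_KEY" > "${PASSWD_FILE}"
   # s3fs will refuse to use the file unless only the owner can read it.
   chmod 600 "${PASSWD_FILE}"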


Before use (before each Python invocation)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Activate your Python environment, which then gives access to the **s3-fuse** Linux
command ``s3fs``.

Map your S3 bucket onto the chosen empty directory -- e.g.

.. code-block:: bash

   $ s3fs my-test-bucket /home/self.me/s3_root/testbucket_mountpoint

.. note::

   You can now freely list and access the contents of your bucket at this path
   -- including updating or writing files.

.. note::

   This performs a Unix file-system "mount" operation, which temporarily
   modifies your system. This change is not part of the current environment, and is
   not limited to the scope of the current process.

   If you reboot, the mount will disappear. If you log out and log in again, there
   can be problems: ideally, you should avoid this by always "unmounting" (see below).

.. note::

   The command for mounting an s3-fuse filesystem is ``s3fs``. This should not be
   confused with the similarly-named ``s3fs`` Python package.

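
Before relying on the mount, it can be worth confirming that it actually succeeded:
the standard ``mount`` command lists all active mounts. A small sketch -- the helper
name and the mount path here are illustrative only:

.. code-block:: bash

   # Return success if the given path appears in the system mount table.
   # (The helper name "is_mounted" is an arbitrary illustrative choice.)
   is_mounted() {
       mount | grep -q "$1"
   }

   if is_mounted /home/self.me/s3_root/testbucket_mountpoint; then
       echo "bucket is mounted"
   else
       echo "mount not found -- check the s3fs command output"
   fi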

Within Python code
^^^^^^^^^^^^^^^^^^
You can now access objects at the remote S3 URL via the mount point on your local file
system that you just created with ``s3fs``, e.g.

.. code-block:: python

   >>> path = "/home/self.me/s3_root/testbucket_mountpoint/sub_dir/a_file.nc"
   >>> cubes = iris.load(path)


After use (after Python exit)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
When you have finished accessing the S3 objects in the mounted virtual filesystem, it
is a good idea to **unmount** it. Before doing this, make sure that all file handles
to the objects have been closed and that no terminals are open in that directory.

.. code-block:: bash

   $ umount /home/self.me/s3_root/testbucket_mountpoint

.. note::

   The ``umount`` command is a standard Unix command. It may not always succeed, in
   which case some kind of retry may be needed -- see the detail notes below.

   The mount created will not survive a system reboot, nor does it function correctly
   if the user logs out and logs in again.

   Presumably, problems could occur if repeated operation were to create a very large
   number of mounts, so unmounting after use does seem advisable.


Some Pros and Cons of this approach
-----------------------------------

PROs
^^^^

* **s3fs** supports random access to "parts" of a file, allowing efficient handling of
  datasets larger than memory, without requiring the data to be explicitly sharded
  in storage.

* **s3-fuse** is transparent to file access within Python, including Iris load+save or
  other files accessed via a Python ``open``: the S3 data appears to be files in a
  regular file-system.

* the file-system virtualisation approach works for all file formats, since the
  mapping occurs in the O.S. rather than in Iris or Python.

* "mounting" avoids the need for the Python code to dynamically connect to /
  disconnect from an S3 bucket.

* the "unmount problem" (see below) is managed at the level of the operating system,
  where it occurs, instead of trying to allow for it in Python code. This means it
  could be managed differently in different operating systems, if needed.

* it also works with many other cloud object-storage platforms, though with extra
  required dependencies in some cases.
  See the s3fs-fuse `Non-Amazon S3`_ docs page for details.

CONs
^^^^

* it only works on Unix-like O.S.

* it requires the "fuse" kernel module to be supported in your O.S.
  This is usually installed by default, but may not always be.
  See `'fuse' kernel module <https://www.kernel.org/doc/html/next/filesystems/fuse.html>`_
  for more detail.

* the file-system virtualisation may not be perfect: some file-system operations
  might not behave as expected, e.g. with regard to file permissions or system
  information.

* it requires user actions *outside* the Python code.

* the user must manage the mount/umount context.

* some similar cloud object-storage platforms are *not* supported.
  See the s3fs-fuse `Non-Amazon S3`_ docs page for details of those which are.


Background Notes and Details
----------------------------

* The file-like objects provided by **fsspec** replicate nearly *all* the behaviours
  of a regular Python file.

  However, this is still hard to integrate with regular file access, since you
  cannot create one from a regular Python ``open`` call -- still less when opening a
  file with an underlying file-format library such as netCDF4 or HDF5
  (since these are usually implemented in other languages such as C).
  Nor can you interrogate file paths or system metadata, e.g. permissions.

  So, the key benefit offered by **s3-fuse** is that all functions are mapped
  onto regular O.S. file-system calls -- so the file-format library never needs to
  know that the data is not a "real" file.

* It would be possible, instead, to copy data into an *actual* file on disk, but the
  s3-fuse approach avoids the need for copying, and thus, in a cloud environment,
  also the cost and maintenance of a "local disk".

  s3fs also allows the software to access only the *required* parts of a file,
  without copying the whole content. This is obviously essential for efficient use of
  large datasets, e.g. when larger than available memory.

* It is also possible to use **s3-fuse** to establish the mounts *from within
  Python*. However, we considered integrating this into Iris and rejected it, because
  of unavoidable problems: namely, the "umount problem" (see below).
  For details, see https://github.com/SciTools/iris/pull/6731

* "Unmounting" must be done via a shell ``umount`` command, and there is no easy way
  to guarantee that this succeeds, since it can often get a "target is busy" error.

  This "umount problem" is a known problem in Unix generally: see
  `here <https://stackoverflow.com/questions/tagged/linux%20umount>`_.

  It can only be resolved by a delay and retry.
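
  The delay-and-retry can be wrapped in a small shell helper, for example. This is
  only a sketch: the function name, retry count and delay are arbitrary choices, not
  a standard utility.

  .. code-block:: bash

     # Try to unmount, retrying a few times in case the target is still busy.
     # (Function name, retry count and delay are illustrative choices only.)
     retry_umount() {
         local target="$1"
         local attempt
         for attempt in 1 2 3 4 5; do
             umount "${target}" && return 0
             echo "umount failed (attempt ${attempt}) -- retrying ..."
             sleep 1
         done
         return 1
     }

     retry_umount /home/self.me/s3_root/testbucket_mountpoint ||
         echo "could not unmount -- is something still using the mount point?"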


.. _Non-Amazon S3: https://github.com/s3fs-fuse/s3fs-fuse/wiki/Non-Amazon-S3