
Commit e41a0e5

Merge branch 'main' into pandas3.0-fixes
2 parents adb36af + 207d28d

File tree

4 files changed: +265 -1 lines changed


.github/workflows/ci-linkchecks.yml

Lines changed: 1 addition & 1 deletion
@@ -23,7 +23,7 @@ jobs:
 
       - name: Link Checker
         id: lychee
-        uses: lycheeverse/lychee-action@a8c4c7cb88f0c7386610c35eb25108e448569cb0
+        uses: lycheeverse/lychee-action@8646ba30535128ac92d33dfc9133794bfdd9b411
         with:
           token: ${{secrets.GITHUB_TOKEN}}
           fail: false

docs/src/further_topics/index.rst

Lines changed: 1 addition & 0 deletions
@@ -17,6 +17,7 @@ Extra information on specific technical issues.
    missing_data_handling
    dataless_cubes
    netcdf_io
+   s3_io
    dask_best_practices/index
    ugrid/index
    which_regridder_to_use

docs/src/further_topics/s3_io.rst

Lines changed: 260 additions & 0 deletions
@@ -0,0 +1,260 @@
.. _s3_io:


Loading From and Saving To S3 Buckets
=====================================

For cloud computing, it is natural to want to access data storage based on URIs.
At present, by far the most widely used platform for this is
`Amazon S3 "buckets" <https://aws.amazon.com/s3/>`_.

It is common to treat an S3 bucket like a "disk", storing files as individual S3
objects. S3 access URLs can also contain a nested
`'prefix string' <https://docs.aws.amazon.com/AmazonS3/latest/userguide/using-prefixes.html>`_
structure, which naturally mirrors sub-directories in a file-system.

While it would be possible for Iris to support S3 access directly, as it does the
"OpenDAP" protocol for netCDF data, this approach has some serious limitations: most
notably, each supported file format would have to be separately extended to support S3
URLs in place of file paths for loading and saving.

Instead, we have found that it is most practical to perform this access using a virtual
file system approach. One drawback, however, is that this is best controlled *outside*
the Python code -- see details below.

TL;DR
-----
Install s3-fuse and use its ``s3fs`` command to create a file-system mount which maps
to an S3 bucket. S3 objects can then be accessed as regular files (read and write).

Fsspec, s3fs, fuse and s3-fuse
------------------------------
This approach depends on a set of related code solutions, as follows:

`fsspec <https://github.com/fsspec/filesystem_spec/blob/master/README.md>`_
is a general framework for implementing Python-file-like access to alternative storage
resources.

`s3fs <https://github.com/fsspec/s3fs>`_
is a package based on fsspec, which enables Python to "open" S3 data objects as Python
file-like objects for reading and writing.

`fuse <https://github.com/libfuse/libfuse>`_
is an interface library that enables a data resource to be "mounted" as a Linux
filesystem, with user (not root) privilege.

`s3-fuse <https://github.com/s3fs-fuse/s3fs-fuse/blob/master/README.md>`_
is a utility based on s3fs and fuse, which provides a POSIX-compatible "mount" so that
an S3 bucket can be accessed as a regular Unix file system.

Practical usage
---------------
Of the above, the only thing you actually need to know about is **s3-fuse**.

There is an initial one-time setup, plus actions to take in advance of launching
Python, and after exit, each time you want to access S3 from Python.

Prior requirements
^^^^^^^^^^^^^^^^^^

Install "s3-fuse"
~~~~~~~~~~~~~~~~~
The most reliable method is to install it into your Linux O.S. See the
`installation instructions <https://github.com/s3fs-fuse/s3fs-fuse/blob/master/README.md#installation>`_.
This presumes that you perform a system installation with ``apt``, ``yum`` or similar.

If you do not have the necessary 'sudo' or root access permissions, we have found that
it is sufficient to install only **into your Python environment**, using conda.
Though not the suggested method, this appears to work on Unix systems where we have
tried it.

For this, you can use conda -- e.g.

.. code-block:: bash

    $ conda install s3-fuse

(Or better, put it into a reusable 'spec file', with all other requirements, and then
use ``$ conda create --file ...``.)

.. note::

    It is **not** possible to install s3fs-fuse into a Python environment with ``pip``,
    as it is not a Python package.

Create an empty mount directory
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
You need an empty directory in your existing filesystem tree, which you will map your
S3 bucket **onto** -- e.g.

.. code-block:: bash

    $ mkdir /home/self.me/s3_root/testbucket_mountpoint

Setup AWS credentials
~~~~~~~~~~~~~~~~~~~~~
Provide S3 access credentials in an AWS credentials file, as described
`here in the s3-fuse documentation <https://github.com/s3fs-fuse/s3fs-fuse/blob/master/README.md#examples>`_.

There is a general introduction to AWS credentials
`here in the AWS documentation <https://docs.aws.amazon.com/cli/latest/userguide/getting-started-quickstart.html>`_,
which should explain what you need here.
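As a concrete sketch, the linked s3-fuse documentation describes a dedicated
``${HOME}/.passwd-s3fs`` file holding an ``ACCESS_KEY_ID:SECRET_ACCESS_KEY`` pair,
which must be private to the owner. The key values below are AWS's published example
placeholders, not real credentials:

```shell
# Store the bucket credentials in s3fs's own password file.
# The key values here are placeholders -- substitute your own.
echo "AKIAIOSFODNN7EXAMPLE:wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY" > "${HOME}/.passwd-s3fs"
# s3fs refuses to use the file unless it is readable by the owner only.
chmod 600 "${HOME}/.passwd-s3fs"
```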

Before use (before each Python invocation)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Activate your Python environment, which then gives access to the **s3-fuse** Linux
command ``s3fs``.

Map your S3 bucket "into" the chosen empty directory -- e.g.

.. code-block:: bash

    $ s3fs my-test-bucket /home/self.me/s3_root/testbucket_mountpoint

.. note::

    You can now freely list and access the contents of your bucket at this path
    -- including updating or writing files.

.. note::

    This performs a Unix file-system "mount" operation, which temporarily
    modifies your system. This change is not part of the current environment, and is
    not limited to the scope of the current process.

    If you reboot, the mount will disappear. If you log out and log in again, there can
    be problems: ideally, you should avoid this by always "unmounting" (see below).

.. note::

    The command for mounting an s3-fuse filesystem is ``s3fs`` -- this should not be
    confused with the similarly named s3fs Python package.

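Before launching Python, it can be worth confirming that the mount actually took
effect. A minimal sketch (the mount path is the illustrative one used above, and this
check is our suggestion, not part of s3-fuse itself) inspects ``/proc/mounts``:

```shell
#!/bin/sh
# Report whether the chosen directory currently appears as a mount point.
# MOUNT_DIR is the illustrative path used elsewhere on this page.
MOUNT_DIR="/home/self.me/s3_root/testbucket_mountpoint"

if grep -qs " $MOUNT_DIR " /proc/mounts; then
    echo "mounted: $MOUNT_DIR"
else
    echo "NOT mounted: $MOUNT_DIR -- run 's3fs' first" >&2
fi
```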
Within Python code
^^^^^^^^^^^^^^^^^^
You can now access objects at the remote S3 URL via the mount point on your local file
system that you just created with ``s3fs``, e.g.

.. code-block:: python

    >>> import iris
    >>> path = "/home/self.me/s3_root/testbucket_mountpoint/sub_dir/a_file.nc"
    >>> cubes = iris.load(path)
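Since a forgotten mount step simply produces a "file not found" error, a small
defensive check can give a clearer message. This sketch (the helper name and path are
illustrative, not Iris API) uses only the standard library:

```python
import os.path


def checked_path(mount_point, relative_path):
    """Join a path under the mount, failing early if nothing is mounted there."""
    if not os.path.ismount(mount_point):
        raise RuntimeError(
            f"{mount_point} is not a mounted filesystem -- did you run 's3fs'?"
        )
    return os.path.join(mount_point, relative_path)


# e.g. cubes = iris.load(checked_path(mount_dir, "sub_dir/a_file.nc"))
```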

After use (after Python exit)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
When you have finished accessing the S3 objects in the mounted virtual filesystem, it
is a good idea to **unmount** it. Before doing this, make sure that all file handles to
the objects have been closed, and that there are no terminals open in that directory.

.. code-block:: bash

    $ umount /home/self.me/s3_root/testbucket_mountpoint

.. note::

    The ``umount`` command is a standard Unix command. It may not always succeed, in
    which case some kind of retry may be needed -- see the detail notes below.

    The mount created will not survive a system reboot, nor does it function correctly
    if the user logs out and logs in again.

    Presumably, problems could occur if repeated operation were to create a very large
    number of mounts, so unmounting after use does seem advisable.

Some Pros and Cons of this approach
-----------------------------------

PROs
^^^^

* **s3fs** supports random access to "parts" of a file, allowing efficient handling of
  datasets larger than memory, without requiring the data to be explicitly sharded
  in storage.

* **s3-fuse** is transparent to file access within Python, including Iris load+save or
  other files accessed via a Python ``open``: the S3 data appears to be files in a
  regular file-system.

* the file-system virtualisation approach works for all file formats, since the
  mapping occurs in the O.S. rather than in Iris or Python.

* "mounting" avoids the need for the Python code to dynamically connect to /
  disconnect from an S3 bucket.

* the "unmount problem" (see below) is managed at the level of the operating system,
  where it occurs, instead of trying to allow for it in Python code. This means it
  could be managed differently in different operating systems, if needed.

* it also works with many other cloud object-storage platforms, though with extra
  required dependencies in some cases.
  See the s3fs-fuse `Non-Amazon S3`_ docs page for details.

CONs
^^^^

* only works on Unix-like O.S.

* requires the "fuse" kernel module to be supported in your O.S.
  This is usually installed by default, but may not always be.
  See `'fuse' kernel module <https://www.kernel.org/doc/html/next/filesystems/fuse.html>`_
  for more detail.

* the file-system virtualisation may not be perfect: some file-system operations
  might not behave as expected, e.g. with regard to file permissions or system
  information.

* it requires user actions *outside* the Python code.

* the user must manage the mount/umount context.

* some similar cloud object-storage platforms are *not* supported.
  See the s3fs-fuse `Non-Amazon S3`_ docs page for details of those which are.

Background Notes and Details
----------------------------

* The file-like objects provided by **fsspec** replicate nearly *all* the behaviours
  of a regular Python file.

  However, this is still hard to integrate with regular file access, since you
  cannot create one from a regular Python ``open`` call -- still less
  when opening a file with an underlying file-format such as netCDF4 or HDF5
  (since these are usually implemented in other languages, such as C).
  Nor can you interrogate file paths or system metadata, e.g. permissions.

  So, the key benefit offered by **s3-fuse** is that all functions are mapped
  onto regular O.S. file-system calls -- so the file-format never needs to
  know that the data is not a "real" file.

* It would be possible, instead, to copy data into an *actual* file on disk, but the
  s3-fuse approach avoids the need for copying, and thus, in a cloud environment, also
  the cost and maintenance of a "local disk".

  s3fs also allows the software to access only *required* parts of a file, without
  copying the whole content. This is obviously essential for efficient use of large
  datasets, e.g. when larger than available memory.

* It is also possible to use **s3-fuse** to establish the mounts *from within Python*.
  However, we considered integrating this into Iris and rejected it, because of
  unavoidable problems: namely, the "umount problem" (see below).
  For details, see: https://github.com/SciTools/iris/pull/6731

* "Unmounting" must be done via a shell ``umount`` command, and there is no easy way to
  guarantee that this succeeds, since it can often get a "target is busy" error.

  This "umount problem" is a known problem in Unix generally: see
  `here <https://stackoverflow.com/questions/tagged/linux%20umount>`_.

  It can only be resolved by a delay + retry.
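Such a delay-and-retry wrapper can be sketched in a few lines of shell (the function
name, attempt count and delay are illustrative choices, not part of any tool):

```shell
#!/bin/sh
# Retry a command until it succeeds, sleeping between attempts.
# Usage: retry <attempts> <delay-seconds> <command> [args...]
retry() {
    attempts=$1
    delay=$2
    shift 2
    n=1
    until "$@"; do
        if [ "$n" -ge "$attempts" ]; then
            return 1
        fi
        n=$((n + 1))
        sleep "$delay"
    done
}

# e.g. retry 5 2 umount /home/self.me/s3_root/testbucket_mountpoint
```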

.. _Non-Amazon S3: https://github.com/s3fs-fuse/s3fs-fuse/wiki/Non-Amazon-S3

docs/src/whatsnew/latest.rst

Lines changed: 3 additions & 0 deletions
@@ -96,6 +96,9 @@ This document explains the changes made to Iris for this release
 #. :user:`bjlittle` added the ``:user:`` `extlinks`_ ``github`` user convenience.
    (:pull:`6931`)
 
+#. `@pp-mo`_ added a page on how to access datafiles in S3 buckets.
+   (:issue:`6374`, :pull:`6951`)
+
 
 💼 Internal
 ===========
