Commit 460f14c

Add docs page on how to use S3 data.
1 parent 70ad614 commit 460f14c

3 files changed: +235 -0 lines changed
docs/src/further_topics/index.rst

Lines changed: 1 addition & 0 deletions

@@ -17,6 +17,7 @@ Extra information on specific technical issues.
    missing_data_handling
    dataless_cubes
    netcdf_io
+   s3_io
    dask_best_practices/index
    ugrid/index
    which_regridder_to_use

docs/src/further_topics/s3_io.rst

Lines changed: 231 additions & 0 deletions

@@ -0,0 +1,231 @@
.. _s3_io:


Loading From and Saving To S3 Buckets
=====================================
For cloud computing, it is natural to want to access data storage based on URIs.
At present, by far the most widely used platform for this is
`Amazon S3 "buckets" <https://aws.amazon.com/s3/>`_.

It is common to treat an S3 bucket like a "disk", storing files as individual S3
objects.  S3 access URLs can also contain a nested
`'prefix string' <https://docs.aws.amazon.com/AmazonS3/latest/userguide/using-prefixes.html>`_
structure, which naturally mirrors sub-directories in a file system.

While it would be possible for Iris to support S3 access directly, as it does the
"OpenDAP" protocol for netCDF data, this approach has some serious limitations: most
notably, each supported file format would have to be separately extended to support S3
URLs in place of file paths for loading and saving.

Instead, we have found that it is most practical to perform this access using a virtual
file-system approach.  However, one drawback is that this is best controlled *outside*
the Python code -- see details below.

TL;DR
-----
Install s3-fuse and use its ``s3fs`` command to create a file-system mount which maps
to an S3 bucket.  S3 objects can then be accessed as regular files (read and write).

fsspec, s3fs, fuse and s3-fuse
------------------------------
This approach depends on a set of related code solutions, as follows:

`fsspec <https://github.com/fsspec/filesystem_spec/blob/master/README.md>`_
is a general framework for implementing Python-file-like access to alternative storage
resources.

`s3fs <https://github.com/fsspec/s3fs>`_
is a package based on fsspec, which enables Python to "open" S3 data objects as Python
file-like objects for reading and writing.

`fuse <https://github.com/libfuse/libfuse>`_
is an interface library that enables a data resource to be "mounted" as a Linux
filesystem, with user (not root) privileges.

`s3-fuse <https://github.com/s3fs-fuse/s3fs-fuse/blob/master/README.md>`_
is a utility based on s3fs and fuse, which provides a POSIX-compatible "mount" so that
an S3 bucket can be accessed as a regular Unix file system.
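As a rough illustration of the "file-like object" idea that fsspec generalises,
the standard library's ``io.BytesIO`` already behaves like an open binary file;
fsspec extends this same read/seek interface to remote stores such as S3.  A
small sketch, using only the standard library (not fsspec itself):

.. code-block:: python

    import io

    # io.BytesIO supports the same read/seek/close interface as an open
    # binary file; fsspec provides objects like this backed by remote storage.
    f = io.BytesIO(b"hello, bucket")
    f.seek(7)
    data = f.read()  # the bytes from offset 7 onwards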

Practical usage
---------------
Of the above, the only thing you actually need to know about is **s3-fuse**.

There is an initial one-time setup, and then actions to take each time you want to
access S3 from Python: some in advance of launching Python, and some after exit.

Prior requirements
^^^^^^^^^^^^^^^^^^

Install "s3-fuse"
~~~~~~~~~~~~~~~~~
The official
`installation instructions <https://github.com/s3fs-fuse/s3fs-fuse/blob/master/README.md#installation>`_
assume that you will perform a system installation with ``apt``, ``yum`` or similar.

However, since you may well not have adequate 'sudo' or root access permissions
for this, it is simpler to install it only into your Python environment instead.
Though not officially suggested, this appears to work on the Unix systems where we
have tried it.

So, you can use conda or pip -- e.g.

.. code-block:: bash

    $ pip install s3-fuse

or

.. code-block:: bash

    $ conda install s3-fuse

(Or better, put it into a reusable 'spec file', along with all your other
requirements, and then use ``$ conda create --file ...``.)
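For example, such a spec file might look like the following minimal sketch (the
package names simply follow the commands above, and the file name is arbitrary):

.. code-block:: text

    # s3-work-spec.txt
    iris
    s3-fuse

which you would then use as ``$ conda create --name s3-work --file s3-work-spec.txt``.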

Create an empty mount directory
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
You need an empty directory in your existing filesystem tree, onto which you will map
your S3 bucket -- e.g.

.. code-block:: bash

    $ mkdir /home/self.me/s3_root/testbucket_mountpoint

Which file system this directory belongs to is presumably irrelevant, and should not
affect performance.

Setup AWS credentials
~~~~~~~~~~~~~~~~~~~~~
Provide S3 access credentials in an AWS credentials file, as described in
`the s3fs-fuse examples <https://github.com/s3fs-fuse/s3fs-fuse/blob/master/README.md#examples>`_.
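Alternatively, as described in the linked README, s3fs can read keys from its own
password file.  A minimal sketch (the key values here are placeholders for your
real credentials):

.. code-block:: bash

    # Write the keys as "ACCESS_KEY_ID:SECRET_ACCESS_KEY" on one line.
    echo "ACCESS_KEY_ID:SECRET_ACCESS_KEY" > "${HOME}/.passwd-s3fs"
    # s3fs rejects credential files readable by other users, hence the chmod.
    chmod 600 "${HOME}/.passwd-s3fs"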


Before use (before each Python invocation)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Activate your Python environment, which should then give you access to the s3-fuse
Linux command (note: somewhat confusingly, this is called "s3fs").

Map your S3 bucket "into" the chosen empty directory -- e.g.

.. code-block:: bash

    $ s3fs my-test-bucket /home/self.me/s3_root/testbucket_mountpoint

.. note::

    You can now freely list and access the contents of your bucket at this path
    -- including updating or writing files.

.. note::

    This performs a Unix file-system "mount" operation, which temporarily
    modifies your system.  This change is not part of the current environment, and is
    not limited to the scope of the current process.

    If you reboot, the mount will disappear.  If you log out and log in again, there
    can be problems: ideally, you should avoid this by always "unmounting" (see below).
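Since the mount outlives the process that created it, it can be useful to check for
it from a wrapper script before launching Python.  A sketch using the standard
util-linux ``mountpoint`` command (the ``is_mounted`` helper is illustrative, and
the path is the example used above):

.. code-block:: bash

    # Succeeds if the given path is currently a file-system mount point.
    is_mounted() {
        mountpoint -q "$1"
    }

    if is_mounted /home/self.me/s3_root/testbucket_mountpoint; then
        echo "bucket already mounted"
    else
        echo "bucket not mounted -- run s3fs first"
    fi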


Within Python code
^^^^^^^^^^^^^^^^^^
Files stored as S3 objects "under" the S3 URL appear as files under the mapped
file-system path, and can be accessed accordingly -- e.g.

.. code-block:: python

    >>> path = "/home/self.me/s3_root/testbucket_mountpoint/sub_dir/a_file.nc"
    >>> cubes = iris.load(path)
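Since the mounted bucket is just a directory, ordinary path tools work on it too.
For instance, a small stdlib-only sketch (the ``list_nc_files`` helper is
hypothetical, not part of Iris) to find all netCDF files under the mount before
loading:

.. code-block:: python

    from pathlib import Path

    def list_nc_files(mount_dir):
        """Return all netCDF files found anywhere under the mount directory."""
        return sorted(Path(mount_dir).glob("**/*.nc"))

    # e.g. iris.load(list_nc_files("/home/self.me/s3_root/testbucket_mountpoint"))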


After use (after Python exit)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
At some point, you should 'forget' the mounted S3 filesystem by "unmounting" it -- e.g.

.. code-block:: bash

    $ umount /home/self.me/s3_root/testbucket_mountpoint

.. note::

    "umount" is a standard Unix command.  It may not always succeed, in which case
    some kind of retry may be needed -- see the detail notes below.

    The mount created will not survive a system reboot, nor does it function correctly
    if the user logs out and logs in again.

    Problems can presumably occur if repeated actions create a very large number of
    mounts, so unmounting after use does seem advisable.
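Where the ``umount`` does hit a transient "target is busy" error, a simple retry
loop is often enough.  A shell sketch (the ``retry`` helper is hypothetical, not
a documented tool):

.. code-block:: bash

    # Run a command up to N times, pausing between attempts.
    retry() {
        tries="$1"; shift
        i=1
        while [ "$i" -le "$tries" ]; do
            "$@" && return 0
            i=$((i + 1))
            sleep 1
        done
        return 1
    }

    # e.g. retry 5 umount /home/self.me/s3_root/testbucket_mountpoint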


Some Pros and Cons of this approach
-----------------------------------

PROs
^^^^

* s3fs supports random access to "parts" of a file, allowing efficient handling of
  datasets larger than memory, without requiring the data to be explicitly sharded
  in storage.

* s3-fuse is transparent to file access within Python, including Iris load and save
  and other Python 'open' files: the S3 data appears to be files in a regular
  file system.

* the file-system virtualisation approach works for all file formats, since the
  mapping occurs in the O.S. rather than in Iris or Python.

* "mounting" avoids the need for the Python instance to dynamically connect to and
  disconnect from the S3 bucket.

* the "unmount problem" (see below) is managed at the level of the O.S., where it
  occurs, instead of trying to allow for it in Python code.  This means it can be
  managed differently in different operating systems, if needed.

CONs
^^^^

* this solution is specific to S3 storage.

* the virtualisation is possibly not perfect, if some file-system operations do not
  behave as expected, e.g. with regard to file permissions or system information.

* it requires user actions *outside* the Python code.

* the user must manage the mount/umount context.


Background Notes and Details
----------------------------

* The file-like objects provided by **fsspec** replicate nearly *all* the behaviours
  of a regular Python file.

  However, this is still hard to integrate with regular file access, since you
  cannot create one from a regular Python "open" call -- still less when opening a
  file with an underlying file format such as netCDF4 or HDF5 (since these are
  usually implemented in other languages such as C).

  So, the key benefit offered by **s3-fuse** is that all the functions are mapped
  onto regular O.S. file-system calls -- so the file format never needs to know
  that the data is not a "real" file.

* It would be possible, instead, to copy data into an *actual* file on disk, but the
  s3-fuse approach avoids the need for copying, and thus, in a cloud environment,
  also the cost and maintenance of a "local disk".

  s3fs also allows the software to access only the *required* parts of a file,
  without copying the whole content.  This is obviously essential for efficient use
  of large datasets, e.g. when larger than available memory.

* It is also possible to use "s3-fuse" to establish the mounts *from within Python*.
  However, we considered integrating this into Iris and rejected it because of
  unavoidable problems: namely, the "umount problem" (see below).
  For details, see: https://github.com/SciTools/iris/pull/6731

* "Unmounting" must be done via a shell ``umount`` command, and there is no easy way
  to guarantee that this succeeds, since it can often get a "target is busy" error.
  This "umount problem" is a known problem in Unix generally: see
  `these questions <https://stackoverflow.com/questions/tagged/linux%20umount>`_.

docs/src/whatsnew/latest.rst

Lines changed: 3 additions & 0 deletions

@@ -91,6 +91,9 @@ This document explains the changes made to Iris for this release
 #. :user:`bjlittle` added the ``:user:`` `extlinks`_ ``github`` user convenience.
    (:pull:`6931`)

+#. `@ppmo`_ added a page on how to access datafiles in S3 buckets.
+   (:issue:`6374`, :pull:`6951`)

 💼 Internal
 ===========