.. _s3_io:

Loading From and Saving To S3 Buckets
=====================================

For cloud computing, it is natural to want to access data storage based on URIs.
At the present time, by far the most widely used platform for this is
`Amazon S3 "buckets" <https://aws.amazon.com/s3/>`_.

It is common to treat an S3 bucket like a "disk", storing files as individual S3
objects. S3 access URLs can also contain a nested
`'prefix string' <https://docs.aws.amazon.com/AmazonS3/latest/userguide/using-prefixes.html>`_
structure, which naturally mirrors sub-directories in a file-system.

While it would be possible for Iris to support S3 access directly, as it does the
"OPeNDAP" protocol for netCDF data, this approach has some serious limitations: most
notably, each supported file format would have to be separately extended to support S3
URLs in place of file paths for loading and saving.

Instead, we have found that it is most practical to perform this access using a
virtual file system approach. One drawback, however, is that this is best controlled
*outside* the Python code -- see details below.


TL;DR
-----
Install s3-fuse and use its ``s3fs`` command to create a file-system mount which maps
to an S3 bucket. S3 objects can then be accessed as regular files (read and write).

Fsspec, s3fs, fuse and s3-fuse
------------------------------
This approach depends on a set of related code solutions, as follows:

`fsspec <https://github.com/fsspec/filesystem_spec/blob/master/README.md>`_
is a general framework for implementing Python-file-like access to alternative storage
resources.

`s3fs <https://github.com/fsspec/s3fs>`_
is a package based on fsspec, which enables Python to "open" S3 data objects as Python
file-like objects for reading and writing.

`fuse <https://github.com/libfuse/libfuse>`_
is an interface library that enables a data resource to be "mounted" as a Linux
filesystem, with user (not root) privilege.

`s3-fuse <https://github.com/s3fs-fuse/s3fs-fuse/blob/master/README.md>`_
is a utility based on s3fs and fuse, which provides a POSIX-compatible "mount" so that
an S3 bucket can be accessed as a regular Unix file system.


Practical usage
---------------
Of the above, the only thing you actually need to know about is **s3-fuse**.

There is an initial one-time setup, plus actions to take before launching Python and
after it exits, each time you want to access S3 from Python.

Prior requirements
^^^^^^^^^^^^^^^^^^

Install "s3-fuse"
~~~~~~~~~~~~~~~~~
The most reliable method is to install it into your Linux O.S. See the
`installation instructions <https://github.com/s3fs-fuse/s3fs-fuse/blob/master/README.md#installation>`_.
This presumes that you perform a system installation with ``apt``, ``yum`` or similar.

If you do not have the necessary 'sudo' or root access permissions, we have found it
sufficient to install **into your Python environment only**, using conda.
Though not officially suggested, this appears to work on the Unix systems where we
have tried it.

For example:

.. code-block:: bash

   $ conda install s3-fuse

(Or better, put it into a reusable 'spec file' along with all your other requirements,
and then use ``$ conda create --file ...``.)

.. note::

   It is **not** possible to install s3fs-fuse into a Python environment with ``pip``,
   as it is not a Python package.


Create an empty mount directory
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
You need an empty directory in your existing filesystem tree, onto which you will map
your S3 bucket -- e.g.

.. code-block:: bash

   $ mkdir /home/self.me/s3_root/testbucket_mountpoint


Set up AWS credentials
~~~~~~~~~~~~~~~~~~~~~~
Provide S3 access credentials in an AWS credentials file, as described
`here in the s3-fuse documentation <https://github.com/s3fs-fuse/s3fs-fuse/blob/master/README.md#examples>`_.

There is a general introduction to AWS credentials
`here in the AWS documentation <https://docs.aws.amazon.com/cli/latest/userguide/getting-started-quickstart.html>`_
which should explain what you need here.
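
For illustration, one credential form which s3-fuse accepts (as described in its
README) is a ``${HOME}/.passwd-s3fs`` file, holding a single ``KEY_ID:SECRET_KEY``
line, which must be readable only by you. A minimal sketch -- the key values shown
here are placeholders, to be replaced with your own:

.. code-block:: bash

   # Store the AWS key pair where s3fs expects it (placeholder values shown).
   PASSWD_FILE="${HOME}/.passwd-s3fs"
   echo "MY_ACCESS_KEY_ID:MY_SECRET_ACCESS_KEY" > "${PASSWD_FILE}"
   # s3fs will refuse to use the file unless only the owner can read it.
   chmod 600 "${PASSWD_FILE}"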


Before use (before each Python invocation)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Activate your Python environment, which then gives access to the **s3-fuse** Linux
command ``s3fs``.

Map your S3 bucket onto the chosen empty directory -- e.g.

.. code-block:: bash

   $ s3fs my-test-bucket /home/self.me/s3_root/testbucket_mountpoint

.. note::

   You can now freely list and access the contents of your bucket at this path
   -- including updating or writing files.

.. note::

   This performs a Unix file-system "mount" operation, which temporarily
   modifies your system. This change is not part of the current environment, and is
   not limited to the scope of the current process.

   If you reboot, the mount will disappear. If you log out and log in again, there
   can be problems: ideally, you should avoid this by always "unmounting" (see below).

.. note::

   The command for mounting an s3-fuse filesystem is ``s3fs``. This should not be
   confused with the similarly-named ``s3fs`` Python package.

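
Before relying on the mount, it can be worth confirming that it actually succeeded:
the standard ``mount`` command lists all active mounts. A small sketch -- the helper
name and the mount path here are illustrative only:

.. code-block:: bash

   # Return success if the given path appears in the system mount table.
   # (The helper name "is_mounted" is an arbitrary illustrative choice.)
   is_mounted() {
       mount | grep -q "$1"
   }

   if is_mounted /home/self.me/s3_root/testbucket_mountpoint; then
       echo "bucket is mounted"
   else
       echo "mount not found -- check the s3fs command output"
   fi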

Within Python code
^^^^^^^^^^^^^^^^^^
You can now access objects at the remote S3 URL via the mount point on your local file
system that you just created with ``s3fs``, e.g.

.. code-block:: python

   >>> path = "/home/self.me/s3_root/testbucket_mountpoint/sub_dir/a_file.nc"
   >>> cubes = iris.load(path)


After use (after Python exit)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
When you have finished accessing the S3 objects in the mounted virtual filesystem, it
is a good idea to **unmount** it. Before doing this, make sure that all file handles
to the objects have been closed and that no terminals are open in that directory.

.. code-block:: bash

   $ umount /home/self.me/s3_root/testbucket_mountpoint

.. note::

   The ``umount`` command is a standard Unix command. It may not always succeed, in
   which case some kind of retry may be needed -- see the detail notes below.

   The mount created will not survive a system reboot, nor does it function correctly
   if the user logs out and logs in again.

   Presumably, problems could occur if repeated operation were to create a very large
   number of mounts, so unmounting after use does seem advisable.


Some Pros and Cons of this approach
-----------------------------------

PROs
^^^^

* **s3fs** supports random access to "parts" of a file, allowing efficient handling of
  datasets larger than memory, without requiring the data to be explicitly sharded
  in storage.

* **s3-fuse** is transparent to file access within Python, including Iris load+save or
  other files accessed via a Python ``open``: the S3 data appears to be files in a
  regular file-system.

* the file-system virtualisation approach works for all file formats, since the
  mapping occurs in the O.S. rather than in Iris or Python.

* "mounting" avoids the need for the Python code to dynamically connect to /
  disconnect from an S3 bucket.

* the "unmount problem" (see below) is managed at the level of the operating system,
  where it occurs, instead of trying to allow for it in Python code. This means it
  could be managed differently in different operating systems, if needed.

* it also works with many other cloud object-storage platforms, though with extra
  required dependencies in some cases.
  See the s3fs-fuse `Non-Amazon S3`_ docs page for details.

CONs
^^^^

* it only works on Unix-like O.S.

* it requires the "fuse" kernel module to be supported in your O.S.
  This is usually installed by default, but may not always be.
  See `'fuse' kernel module <https://www.kernel.org/doc/html/next/filesystems/fuse.html>`_
  for more detail.

* the file-system virtualisation may not be perfect: some file-system operations
  might not behave as expected, e.g. with regard to file permissions or system
  information.

* it requires user actions *outside* the Python code.

* the user must manage the mount/umount context.

* some similar cloud object-storage platforms are *not* supported.
  See the s3fs-fuse `Non-Amazon S3`_ docs page for details of those which are.


Background Notes and Details
----------------------------

* The file-like objects provided by **fsspec** replicate nearly *all* the behaviours
  of a regular Python file.

  However, this is still hard to integrate with regular file access, since you
  cannot create one from a regular Python ``open`` call -- still less when opening a
  file with an underlying file-format library such as netCDF4 or HDF5
  (since these are usually implemented in other languages such as C).
  Nor can you interrogate file paths or system metadata, e.g. permissions.

  So, the key benefit offered by **s3-fuse** is that all functions are mapped
  onto regular O.S. file-system calls -- so the file-format library never needs to
  know that the data is not a "real" file.

* It would be possible, instead, to copy data into an *actual* file on disk, but the
  s3-fuse approach avoids the need for copying, and thus, in a cloud environment,
  also the cost and maintenance of a "local disk".

  s3fs also allows the software to access only the *required* parts of a file,
  without copying the whole content. This is obviously essential for efficient use of
  large datasets, e.g. when larger than available memory.

* It is also possible to use **s3-fuse** to establish the mounts *from within
  Python*. However, we considered integrating this into Iris and rejected it, because
  of unavoidable problems: namely, the "umount problem" (see below).
  For details, see https://github.com/SciTools/iris/pull/6731

* "Unmounting" must be done via a shell ``umount`` command, and there is no easy way
  to guarantee that this succeeds, since it can often get a "target is busy" error.

  This "umount problem" is a known problem in Unix generally: see
  `here <https://stackoverflow.com/questions/tagged/linux%20umount>`_.

  It can only be resolved by a delay and retry.
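
  The delay-and-retry can be wrapped in a small shell helper, for example. This is
  only a sketch: the function name, retry count and delay are arbitrary choices, not
  a standard utility.

  .. code-block:: bash

     # Try to unmount, retrying a few times in case the target is still busy.
     # (Function name, retry count and delay are illustrative choices only.)
     retry_umount() {
         local target="$1"
         local attempt
         for attempt in 1 2 3 4 5; do
             umount "${target}" && return 0
             echo "umount failed (attempt ${attempt}) -- retrying ..."
             sleep 1
         done
         return 1
     }

     retry_umount /home/self.me/s3_root/testbucket_mountpoint ||
         echo "could not unmount -- is something still using the mount point?"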


.. _Non-Amazon S3: https://github.com/s3fs-fuse/s3fs-fuse/wiki/Non-Amazon-S3