.. _s3_io:

Loading From and Saving To S3 Buckets
=====================================

For cloud computing, it is natural to want to access data storage based on URIs.
At the present time, by far the most widely used platform for this is
`Amazon S3 "buckets" <https://aws.amazon.com/s3/>`_.

It is common to treat an S3 bucket like a "disk", storing files as individual S3
objects. S3 access URLs can also contain a nested
`'prefix string' <https://docs.aws.amazon.com/AmazonS3/latest/userguide/using-prefixes.html>`_
structure, which naturally mirrors sub-directories in a file-system.

While it would be possible for Iris to support S3 access directly, as it does the
"OPeNDAP" protocol for netCDF data, this approach has some serious limitations: most
notably, each supported file format would have to be separately extended to support S3
URLs in place of file paths for loading and saving.

Instead, we have found it most practical to perform this access using a virtual
file-system approach. However, one drawback is that this is best controlled *outside*
the Python code -- see details below.


TL;DR
-----
Install s3-fuse and use its ``s3fs`` command to create a file-system mount which maps
onto an S3 bucket. S3 objects can then be accessed as regular files (read and write).


Fsspec, S3-fs, fuse and s3-fuse
--------------------------------
This approach depends on a set of related code solutions, as follows:

`fsspec <https://github.com/fsspec/filesystem_spec/blob/master/README.md>`_
is a general framework for implementing Python-file-like access to alternative storage
resources.
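
For illustration, a minimal sketch of the fsspec file-like interface. The built-in
"memory" protocol is used here purely so that the example runs without any AWS
credentials; s3fs provides exactly this interface under the ``s3://`` protocol
(file path and contents are invented for the example):

.. code-block:: python

    # Sketch of the fsspec file-like interface. "memory://" is a stand-in
    # for "s3://" so the example needs no AWS credentials.
    import fsspec

    # Write to a (virtual) file through the fsspec interface ...
    with fsspec.open("memory://demo/a_file.txt", "wt") as f:
        f.write("hello")

    # ... and read it back, just like a regular Python file object.
    with fsspec.open("memory://demo/a_file.txt", "rt") as f:
        print(f.read())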

`s3fs <https://github.com/fsspec/s3fs>`_
is a package based on fsspec, which enables Python to "open" S3 data objects as Python
file-like objects for reading and writing.

`fuse <https://github.com/libfuse/libfuse>`_
is an interface library that enables a data resource to be "mounted" as a Linux
filesystem, with user (not root) privilege.

`s3-fuse <https://github.com/s3fs-fuse/s3fs-fuse/blob/master/README.md>`_
is a utility built on fuse, which provides a POSIX-compatible "mount" so that
an S3 bucket can be accessed as a regular Unix file system.


Practical usage
---------------
Of the above, the only tool you actually need to know about is **s3-fuse**.

There is an initial one-time setup; then, each time you want to access S3 from Python,
there are actions to take before launching Python and after it exits.

Prior requirements
^^^^^^^^^^^^^^^^^^

Install "s3-fuse"
~~~~~~~~~~~~~~~~~
The official
`installation instructions <https://github.com/s3fs-fuse/s3fs-fuse/blob/master/README.md#installation>`_
assume that you will perform a system installation with ``apt``, ``yum`` or similar.

However, since you may well not have adequate 'sudo' or root access permissions
for this, it is simpler to instead install it only into your Python environment.
Though not suggested by those instructions, this appears to work on Unix systems where
we have tried it.

So, you can use conda or pip -- e.g.

.. code-block:: bash

    $ pip install s3-fuse

or

.. code-block:: bash

    $ conda install s3-fuse

(Or better, put it into a reusable 'spec file', with all other requirements, and then
use ``$ conda create --file ...``.)
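
For illustration, such a spec file might look like the following (the file name and
package list here are hypothetical -- adjust to your own requirements):

.. code-block:: text

    # contents of "requirements.txt" (hypothetical): one package spec per line
    iris
    s3-fuse

which you would then install with, e.g., ``$ conda create --name my-env --file requirements.txt``.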

Create an empty mount directory
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
You need an empty directory in your existing filesystem tree that you will map your
S3 bucket **onto** -- e.g.

.. code-block:: bash

    $ mkdir /home/self.me/s3_root/testbucket_mountpoint

The file system which this directory belongs to should be irrelevant, and should not
affect performance.

Setup AWS credentials
~~~~~~~~~~~~~~~~~~~~~
Provide S3 access credentials in an AWS credentials file, as described in
`the s3fs-fuse examples <https://github.com/s3fs-fuse/s3fs-fuse/blob/master/README.md#examples>`_.
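
For example, s3fs-fuse can read credentials from a ``${HOME}/.passwd-s3fs`` file,
which must be private to you -- a sketch, with placeholder key values:

.. code-block:: bash

    # Store credentials in the s3fs password file; the values below are
    # placeholders -- substitute your real access key ID and secret key.
    echo "ACCESS_KEY_ID:SECRET_ACCESS_KEY" > "${HOME}/.passwd-s3fs"
    # s3fs refuses to use the file unless it is readable only by its owner.
    chmod 600 "${HOME}/.passwd-s3fs"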


Before use (before each Python invocation)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Activate your Python environment, which should then give you access to the s3-fuse Linux
command (note: somewhat confusingly, this is called "s3fs").

Map your S3 bucket "into" the chosen empty directory -- e.g.

.. code-block:: bash

    $ s3fs my-test-bucket /home/self.me/s3_root/testbucket_mountpoint

.. note::

    You can now freely list and access the contents of your bucket at this path
    -- including updating or writing files.

.. note::

    This performs a Unix file-system "mount" operation, which temporarily
    modifies your system. This change is not part of the current environment, and is not
    limited to the scope of the current process.

    If you reboot, the mount will disappear. If you log out and log in again, there can
    be problems: ideally, you should avoid this by always "unmounting" (see below).
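
To confirm that the mount is actually in place, you can look for the mountpoint in
``/proc/mounts`` -- a sketch, using the example mountpoint from above:

.. code-block:: bash

    # Report whether the example mountpoint currently appears in /proc/mounts.
    MOUNTPOINT=/home/self.me/s3_root/testbucket_mountpoint
    if grep -qs "$MOUNTPOINT" /proc/mounts; then
        echo "mounted"
    else
        echo "not mounted"
    fi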


Within Python code
^^^^^^^^^^^^^^^^^^
Access files stored as S3 objects "under" the S3 URL, appearing as files under the
mapped file-system path -- e.g.

.. code-block:: python

    >>> path = "/home/self.me/s3_root/testbucket_mountpoint/sub_dir/a_file.nc"
    >>> cubes = iris.load(path)
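
Because the mount behaves as an ordinary directory, any Python file access works
unchanged -- for example, writing with the builtin ``open``. A sketch (the helper
name and file name are our own invention; the directory argument would be your
mountpoint):

.. code-block:: python

    import os

    def write_note(mount_dir, text):
        # Plain builtin file I/O: this code neither knows nor cares whether
        # "mount_dir" is an s3-fuse mount or a local directory.
        path = os.path.join(mount_dir, "notes.txt")
        with open(path, "w") as f:
            f.write(text)
        return path

e.g. ``write_note("/home/self.me/s3_root/testbucket_mountpoint", "stored as an S3 object")``.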
|
|
After use (after Python exit)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
At some point, you should 'forget' the mounted S3 filesystem by "unmounting" it -- e.g.

.. code-block:: bash

    $ umount /home/self.me/s3_root/testbucket_mountpoint

.. note::

    The "umount" is a standard Unix command. It may not always succeed, in which case
    some kind of retry may be needed -- see the detail notes below.

    The mount created will not survive a system reboot, nor does it function correctly
    if the user logs out and logs in again.

    Problems could presumably also arise if repeated mount actions accumulate a very
    large number of stale mounts, so unmounting after use is advisable.


Some Pros and Cons of this approach
-----------------------------------

PROs
^^^^

* s3fs supports random access to "parts" of a file, allowing efficient handling of
  datasets larger than memory without requiring the data to be explicitly sharded
  in storage.

* s3-fuse is transparent to file access within Python, including Iris load+save or
  other Python 'open' files: the S3 data appears as files in a
  regular file-system.

* the file-system virtualisation approach works for all file formats, since the
  mapping occurs in the O.S. rather than in Iris or Python.

* "mounting" avoids the need for the Python instance to dynamically connect to /
  disconnect from the S3 bucket.

* the "unmount problem" (see below) is managed at the level of the O.S., where it
  occurs, instead of trying to allow for it in Python code. This means it can be
  managed differently in different operating systems, if needed.

CONs
^^^^

* this solution is specific to S3 storage.

* the virtualisation is possibly not perfect: some file-system operations may not
  behave as expected, e.g. with regard to file permissions or system information.

* it requires user actions *outside* the Python code.

* the user must manage the mount/unmount context.


Background Notes and Details
----------------------------

* The file-like objects provided by **fsspec** replicate nearly *all* the behaviours
  of a regular Python file.

  However, this is still hard to integrate with regular file access, since you
  cannot create one from a regular Python "open" call -- still less
  when opening a file with an underlying file-format library such as netCDF4 or HDF5
  (since these are usually implemented in other languages such as C).

  So, the key benefit offered by **s3-fuse** is that all the file operations are mapped
  onto regular O.S. file-system calls -- so the file-format library never needs to
  know that the data is not a "real" file.

* It would be possible, instead, to copy data into an *actual* file on disk, but the
  s3-fuse approach avoids the need for copying, and thus in a cloud environment also
  the cost and maintenance of a "local disk".

  s3fs also allows the software to access only the *required* parts of a file, without
  copying the whole content. This is obviously essential for efficient use of large
  datasets, e.g. when larger than available memory.

* It is also possible to use "s3-fuse" to establish the mounts *from within Python*.
  However, we have considered integrating this into Iris and rejected it because of
  unavoidable problems: namely, the "umount problem" (see below).
  For details, see https://github.com/SciTools/iris/pull/6731.

* "Unmounting" must be done via a shell ``umount`` command, and there is no easy way to
  guarantee that this succeeds, since it can often fail with a "target is busy" error.
  This "umount problem" is a known issue in Unix generally: see
  `these Stack Overflow questions <https://stackoverflow.com/questions/tagged/linux%20umount>`_.
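
  By way of illustration, a retry wrapper along these lines can paper over transient
  "target is busy" failures (a sketch only -- the function name and retry policy are
  our own invention, not part of any tool mentioned above):

  .. code-block:: bash

      # Try to unmount, retrying a few times in case the target is still busy.
      retry_umount() {
          local target="$1" tries="${2:-5}" i
          for ((i = 1; i <= tries; i++)); do
              umount "$target" 2>/dev/null && return 0
              sleep 1
          done
          echo "failed to unmount: $target" >&2
          return 1
      }

  e.g. ``retry_umount /home/self.me/s3_root/testbucket_mountpoint``.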