
GCSFS

A pythonic file-system interface to Google Cloud Storage.

Please file issues and feature requests on GitHub; pull requests are welcome.

This package depends on fsspec and inherits many useful behaviours from there, including integration with Dask and a dict-like key-value interface of the kind used by zarr.

Warning

Default Filesystem Implementation Change: gcsfs now uses ExtendedFileSystem as the default entry point for all bucket types, in order to support specialised storage buckets such as HNS (hierarchical namespace) buckets out of the box. While all operations on standard buckets still route to core.GCSFileSystem (the pre-existing implementation) under the hood, this represents a change in the default flow. If you experience any unexpected behavior due to this change, you can revert to the previous implementation by setting the environment variable GCSFS_EXPERIMENTAL_ZB_HNS_SUPPORT=false before importing gcsfs.

Installation

The GCSFS library can be installed using conda:

conda install -c conda-forge gcsfs

or pip:

pip install gcsfs

or by cloning the repository:

git clone https://github.com/fsspec/gcsfs/
cd gcsfs/
pip install .

Examples

Locate and read a file:

>>> import gcsfs
>>> fs = gcsfs.GCSFileSystem(project='my-google-project')
>>> fs.ls('my-bucket')
['my-file.txt']
>>> with fs.open('my-bucket/my-file.txt', 'rb') as f:
...     print(f.read())
b'Hello, world'

(see also :meth:`~gcsfs.core.GCSFileSystem.walk` and :meth:`~gcsfs.core.GCSFileSystem.glob`)

Read with delimited blocks:

>>> fs.read_block(path, offset=1000, length=10, delimiter=b'\n')
b'A whole line of text\n'

Write with blocked caching:

>>> with fs.open('mybucket/new-file', 'wb') as f:
...     f.write(2*2**20 * b'a')
...     f.write(2*2**20 * b'a') # data is flushed and file closed
>>> fs.du('mybucket/new-file')
{'mybucket/new-file': 4194304}

Because GCSFS faithfully copies the Python file interface, it can be used smoothly with other projects that consume the file interface, such as gzip or pandas.

>>> import gzip
>>> import pandas as pd
>>> with fs.open('mybucket/my-file.csv.gz', 'rb') as f:
...     g = gzip.GzipFile(fileobj=f)  # Decompress data with gzip
...     df = pd.read_csv(g)           # Read CSV file with pandas

Credentials

Several modes of authentication are supported:

  • if token=None (default), GCSFS will attempt to use your default gcloud credentials, then attempt to get credentials from the google metadata service, and finally fall back to anonymous access. This will work for most users without further action. Note that the default project may also be found, but it is often best to supply this anyway (it only affects bucket-level operations).

  • if token='cloud', we assume we are running within google (compute or container engine) and fetch the credentials automatically from the metadata service.

  • if token=dict(...), token=<filepath> or token=<raw_token_str>, you may supply a token generated by the gcloud utility. This can be

    • a python dictionary
    • a raw token string
    • the path to a file containing the JSON returned by logging in with the gcloud CLI tool (e.g., ~/.config/gcloud/application_default_credentials.json or ~/.config/gcloud/legacy_credentials/<YOUR GOOGLE USERNAME>/adc.json)
    • the path to a service account key
    • a google.auth.credentials.Credentials object

    Note that ~ will not be automatically expanded to the user home directory, and must be manually expanded with a utility like os.path.expanduser().

    Please note that credentials automatically refresh 5 minutes prior to their actual expiration to prevent edge-case errors. In scenarios where refreshing is not possible (e.g., when using raw tokens), the system will fail early and will not retry if the attributes required for refreshing are missing. By default, the raw token expiration time is retrieved from the backend. You can disable this by setting FETCH_RAW_TOKEN_EXPIRY=0. When this setting is enabled, the system assumes the token has no expiration date, effectively disabling the 5-minute preemptive refresh.

  • you can also generate tokens via OAuth2 in the browser using token='browser'; gcsfs then caches them in a special file, ~/.gcs_tokens, and they can subsequently be reused with token='cache'.

  • anonymous access only can be selected using token='anon', e.g. to access public resources such as 'anaconda-public-data'.
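As noted above, gcsfs does not expand ~ in token file paths, so resolve it yourself with the standard library before passing the path (the GCSFileSystem call is shown commented out for illustration):

```python
import os.path

# gcsfs will not expand "~" itself, so resolve it before passing the path
token_path = os.path.expanduser(
    "~/.config/gcloud/application_default_credentials.json"
)
print(token_path)  # absolute path under the user's home directory

# fs = gcsfs.GCSFileSystem(project='my-google-project', token=token_path)
```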

The acquired session tokens are not preserved when serializing the instances, so it is safe to pass them to worker processes on other machines if using in a distributed computation context. If credentials are given by a file path, however, then this file must exist on every machine.
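The 5-minute preemptive refresh described above amounts to a simple expiry-window check. A minimal sketch (illustrative only, not gcsfs's actual code; the helper name is hypothetical):

```python
from datetime import datetime, timedelta, timezone

REFRESH_WINDOW = timedelta(minutes=5)

def needs_refresh(expiry, now=None):
    """Return True if the token expires within the refresh window."""
    if expiry is None:  # no known expiry: never refresh preemptively
        return False
    now = now or datetime.now(timezone.utc)
    return expiry - now <= REFRESH_WINDOW

# A token expiring in 2 minutes falls inside the 5-minute window:
soon = datetime.now(timezone.utc) + timedelta(minutes=2)
print(needs_refresh(soon))  # True
```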

Integration

The libraries intake, pandas and dask accept URLs with the prefix "gcs://" and will use gcsfs to complete the IO operation in question. The IO functions take an argument storage_options, which is passed on to GCSFileSystem, for example:

df = pd.read_excel("gcs://bucket/path/file.xls",
                   storage_options={"token": "anon"})

This provides a way to pass any credentials or other arguments that gcsfs needs.
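Roughly speaking, fsspec strips the "gcs" protocol from such a URL and hands the bucket/key remainder to GCSFileSystem together with storage_options. The decomposition can be illustrated with the standard library alone (an illustration of the idea, not fsspec's actual parsing code):

```python
from urllib.parse import urlsplit

url = "gcs://bucket/path/file.xls"
parts = urlsplit(url)

# scheme selects the filesystem; the rest names the object
protocol, bucket, key = parts.scheme, parts.netloc, parts.path.lstrip("/")
print(protocol, bucket, key)  # gcs bucket path/file.xls
```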

Async

gcsfs is implemented using aiohttp and offers async functionality. A number of methods of GCSFileSystem are async, and for each of these there is also a synchronous version with the same name but without the leading "_".

If you wish to call gcsfs from async code, you should pass asynchronous=True, loop=loop to the constructor (the loop argument is only needed if you wish to use both async and sync methods). You must also explicitly await the client creation before making any GCS call.

import asyncio
from gcsfs import GCSFileSystem

async def run_program():
    gcs = GCSFileSystem(asynchronous=True)
    print(await gcs._ls(""))

asyncio.run(run_program())  # or call from your async code

Concurrent async operations are also used internally for bulk operations such as pipe/cat, get/put, cp/mv/rm. The async calls are hidden behind a synchronisation layer, so are designed to be called from normal code. If you are not using async-style programming, you do not need to know about how this works, but you might find the implementation interesting.

For every synchronous function there is an asynchronous one prefixed with "_", but the open operation does not support async operation. If you need to open a file asynchronously, it is better to download it asynchronously to a temporary location and work with it from there.
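The naming convention, and the idea of hiding async calls behind a synchronous surface, can be sketched in plain asyncio (illustrative only; gcsfs's real synchronisation layer runs a dedicated event loop in a background thread rather than calling asyncio.run):

```python
import asyncio

class Demo:
    async def _ls(self, path):
        # stand-in for an async GCS listing call
        await asyncio.sleep(0)
        return [f"{path}/a", f"{path}/b"]

    def ls(self, path):
        # synchronous twin: same name, without the leading underscore
        return asyncio.run(self._ls(path))

print(Demo().ls("bucket"))  # ['bucket/a', 'bucket/b']
```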

Proxy

gcsfs uses aiohttp for calls to the storage API, which by default ignores the HTTP_PROXY/HTTPS_PROXY environment variables. To read proxy settings from the environment, provide session_kwargs as follows:

fs = GCSFileSystem(project='my-google-project', session_kwargs={'trust_env': True})

For further reference, see aiohttp's proxy support documentation.

Targeting specific GCP endpoints

There are multiple ways to target non-default Google Cloud storage endpoints. Here they are, in order of precedence:

  • passing the endpoint_url parameter to the GCSFileSystem constructor.
  • setting the FSSPEC_GCS_ENDPOINT_URL environment variable to the desired endpoint URL.
  • setting the STORAGE_EMULATOR_HOST environment variable to the desired endpoint URL (usage is reserved for testing purposes).
  • setting the GOOGLE_CLOUD_UNIVERSE_DOMAIN environment variable to target alternative GCP universes. gcsfs will target the https://storage.{universe_domain} endpoint instead of the default https://storage.googleapis.com.
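The precedence above can be sketched as a small resolver (a hypothetical helper for illustration, not part of gcsfs's API):

```python
import os

def resolve_endpoint(endpoint_url=None, env=None):
    """Pick the storage endpoint following the precedence order above."""
    env = os.environ if env is None else env
    if endpoint_url:  # constructor argument wins
        return endpoint_url
    for var in ("FSSPEC_GCS_ENDPOINT_URL", "STORAGE_EMULATOR_HOST"):
        if env.get(var):
            return env[var]
    universe = env.get("GOOGLE_CLOUD_UNIVERSE_DOMAIN", "googleapis.com")
    return f"https://storage.{universe}"

print(resolve_endpoint(env={}))  # https://storage.googleapis.com
```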

Contents

.. toctree::
   :maxdepth: 2

   api
   developer
   hns_buckets
   retries
   rapid_storage_support
   fuse
   changelog
   code-of-conduct


Indices and tables

These docs pages collect anonymous tracking data using goatcounter, and the dashboard is available to the public: https://gcsfs.goatcounter.com/ .