.. _roadmap:

Development Roadmap
===================

Authors: Joe Hamman and Ryan Abernathey
Date: February 7, 2019

Background and scope
--------------------

Xbatcher is a small library for iterating xarray objects in batches. The
goal is to make it easy to feed xarray datasets to machine learning libraries
such as `Keras`_ or `PyTorch`_. For example, a simple machine learning
workflow may look something like this:

.. code-block:: python

    import xarray as xr
    import xbatcher as xb

    ds = xr.open_dataset(filename, chunks=chunks)     # open a dataset backed by dask
    ds_train = preprocess(ds)                         # perform some preprocessing
    bgen = xb.BatchGenerator(ds_train, {'time': 10})  # create a batch generator

    for batch in bgen:                        # iterate through the generator
        model.fit(batch['x'], batch['y'])     # fit a deep-learning model
        # or
        model.predict(batch['x'])             # make one batch of predictions

We are currently envisioning the project growing to support more complex
extract-transform-load components commonly found in machine learning workflows
that use multidimensional data. We note that many of the concepts in Xbatcher
have been developed through collaborations in the `Pangeo Project Machine
Learning Working Group <https://pangeo.io/meeting-notes.html>`_.

Batch generation
~~~~~~~~~~~~~~~~

At the core of Xbatcher is the ability to define a schema that describes how
to select batches from a larger dataset. Today, this schema is fairly simple
(e.g. ``{'time': 10}``), but it may evolve in the future. As we describe below,
additional utilities for shuffling, sampling, and caching may provide enhanced
batch generation functionality.

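To make the schema concrete, the sketch below shows, in plain Python and
independent of Xbatcher's actual implementation, how a ``{'time': 10}``-style
schema could drive selection: the schema maps a dimension name to a batch
length, and the generator yields contiguous slices along that dimension.

.. code-block:: python

    def generate_batches(data, schema):
        """Yield fixed-length slices of ``data`` along one named dimension.

        ``data`` maps dimension names to sequences; ``schema`` maps a single
        dimension name to the desired batch length (e.g. ``{'time': 10}``).
        """
        (dim, size), = schema.items()
        values = data[dim]
        for start in range(0, len(values) - size + 1, size):
            yield {dim: values[start:start + size]}

    # 25 time steps with a batch length of 10 yield two full batches.
    data = {'time': list(range(25))}
    batches = list(generate_batches(data, {'time': 10}))
    print(len(batches))  # 2
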
Shuffle and Sampling APIs
~~~~~~~~~~~~~~~~~~~~~~~~~

When training machine-learning models in batches, it is often necessary to
selectively or randomly sample from your training data. Xbatcher can help
facilitate seamless shuffling and sampling by providing APIs that operate on
batches and/or full datasets. This may require working with Xarray and Dask to
facilitate fast, distributed shuffles of Dask arrays.

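One possible building block, shown here purely as an illustrative assumption
rather than a committed API, is an epoch-level shuffle of batch indices: each
batch's contents stay contiguous, but the order in which batches are visited
is randomly permuted (and reproducible given a seed).

.. code-block:: python

    import random

    def shuffled_batch_order(n_batches, seed=None):
        """Return a randomly permuted visiting order for batch indices."""
        order = list(range(n_batches))
        random.Random(seed).shuffle(order)
        return order

    # The order covers every batch exactly once; a fixed seed makes the
    # permutation reproducible across epochs or workers.
    order = shuffled_batch_order(5, seed=0)
    print(sorted(order))  # [0, 1, 2, 3, 4]
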
Caching APIs
~~~~~~~~~~~~

A common pattern in ML is to perform the ETL tasks once before saving the
results to a local file system. This is an effective approach for speeding up
dataset loading during training but comes with numerous downsides (e.g. it
requires sufficient file space and breaks workflow continuity). We propose the
development of a pluggable cache mechanism in Xbatcher that would help address
these downsides while providing improved performance during model training and
inference. For example, this pluggable cache mechanism may allow choosing
between multiple cache types, such as an LRU in-memory cache, a Zarr filesystem
or S3 bucket, or a Redis database cache.

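The sketch below illustrates the pluggable-cache idea with a tiny in-memory
LRU cache; the class name and ``get``/``put`` interface are illustrative
assumptions, not a final design. A Zarr- or Redis-backed cache could implement
the same interface and be swapped in without changing the generator.

.. code-block:: python

    from collections import OrderedDict

    class LRUBatchCache:
        """Cache generated batches keyed by batch index, evicting the
        least recently used entry once ``maxsize`` is exceeded."""

        def __init__(self, maxsize=128):
            self.maxsize = maxsize
            self._store = OrderedDict()

        def get(self, key):
            if key not in self._store:
                return None
            self._store.move_to_end(key)  # mark as recently used
            return self._store[key]

        def put(self, key, batch):
            self._store[key] = batch
            self._store.move_to_end(key)
            if len(self._store) > self.maxsize:
                self._store.popitem(last=False)  # evict least recently used

    cache = LRUBatchCache(maxsize=2)
    cache.put(0, 'batch-0')
    cache.put(1, 'batch-1')
    cache.get(0)             # touch batch 0 so it is most recently used
    cache.put(2, 'batch-2')  # evicts batch 1, the least recently used
    print(cache.get(1))      # None
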
Integration with TensorFlow and PyTorch Dataset Loaders
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Deep-learning libraries like TensorFlow and PyTorch provide high-performance
dataset-generator APIs that facilitate the construction of flexible and
efficient input pipelines. In particular, they have been optimized to support
asynchronous data loading and training, transfer to and from GPUs, and batch
caching. Xbatcher will provide compatible dataset APIs that allow users to pass
Xarray datasets directly to deep-learning frameworks.

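As one example of what such compatibility entails, PyTorch's map-style
datasets only require ``__len__`` and ``__getitem__``. The adapter below is a
hypothetical sketch of that shape (``MapDataset`` is not an existing Xbatcher
class, and ``torch`` is deliberately not imported so the sketch stays
self-contained); a real adapter would index into a ``BatchGenerator`` instead
of a plain list.

.. code-block:: python

    class MapDataset:
        """Adapt an indexable sequence of batches to the map-style dataset
        protocol (``__len__`` plus ``__getitem__``) expected by
        ``torch.utils.data.Dataset``."""

        def __init__(self, batches):
            # In practice, ``batches`` would come from a BatchGenerator;
            # here it is any indexable sequence.
            self._batches = list(batches)

        def __len__(self):
            return len(self._batches)

        def __getitem__(self, idx):
            return self._batches[idx]

    # Wrap three pre-generated batches.
    ds = MapDataset([{'x': [0, 1]}, {'x': [2, 3]}, {'x': [4, 5]}])
    print(len(ds))     # 3
    print(ds[1]['x'])  # [2, 3]
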
Dependencies
------------

- Core: Xarray, Pandas, Dask, Scikit-learn, NumPy, SciPy
- Optional: Keras, PyTorch, TensorFlow, etc.

.. _Keras: https://keras.io/
.. _PyTorch: https://pytorch.org/