Commit 2b5d8a4

Author: Joseph Hamman (committed)

Merge branch 'main' of github.com:pangeo-data/xbatcher into accessor

2 parents: 95d9fe6 + 6b23a2c

File tree: 3 files changed (+98, −4 lines)

doc/api.rst

Lines changed: 8 additions & 0 deletions

@@ -0,0 +1,8 @@
+
+API reference
+-------------
+
+This page provides an auto-generated summary of Xbatcher's API.
+
+.. autoclass:: xbatcher.BatchGenerator
+    :members:

doc/index.rst

Lines changed: 5 additions & 4 deletions

@@ -39,8 +39,9 @@ and we want to create batches along the time dimension. We can do it like this
     # actually feed to machine learning library
     batch
 
-API
----
+.. toctree::
+   :maxdepth: 2
+   :caption: Contents:
 
-.. autoclass:: xbatcher.BatchGenerator
-   :members:
+   roadmap
+   api

doc/roadmap.rst

Lines changed: 85 additions & 0 deletions

@@ -0,0 +1,85 @@
+.. _roadmap:
+
+Development Roadmap
+===================
+
+Authors: Joe Hamman and Ryan Abernathey
+Date: February 7, 2019
+
+Background and scope
+--------------------
+
+Xbatcher is a small library for iterating xarray objects in batches. The
+goal is to make it easy to feed xarray datasets to machine learning libraries
+such as `Keras`_ or `PyTorch`_. For example, implementing a simple machine
+learning workflow may look something like this:
+
+.. code-block:: Python
+
+    import xarray as xr
+    import xbatcher as xb
+
+    da = xr.open_dataset(filename, chunks=chunks)  # open a dataset and use dask
+    da_train = preprocess(da)  # perform some preprocessing
+    bgen = xb.BatchGenerator(da_train, {'time': 10})  # create a generator
+
+    for batch in bgen:  # iterate through the generator
+        model.fit(batch['x'], batch['y'])  # fit a deep-learning model
+        # or
+        model.predict(batch['x'])  # make one batch of predictions
+
+We are currently envisioning the project growing to support more complex
+extract-transform-load components commonly found in machine learning workflows
+that use multidimensional data. We note that many of the concepts in Xbatcher
+have been developed through collaborations in the `Pangeo Project Machine
+Learning Working Group <https://pangeo.io/meeting-notes.html>`_.
+
+Batch generation
+~~~~~~~~~~~~~~~~
+
+At the core of Xbatcher is the ability to define a schema that describes a
+selection of a larger dataset. Today, this schema is fairly simple (e.g.
+`{'time': 10}`) but it may evolve in the future. As we describe below,
+additional utilities for shuffling, sampling, and caching may provide enhanced
+batch generation functionality.
+
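As the schema grows, one likely direction is pairing a window size with a window overlap. The sketch below assumes the `input_dims`/`input_overlap` keywords of `BatchGenerator` and uses a made-up toy dataset:

.. code-block:: Python

    import numpy as np
    import xarray as xr
    import xbatcher as xb

    # toy dataset with 1000 time steps (illustrative only)
    ds = xr.Dataset({'x': ('time', np.arange(1000))})

    # windows of 10 time steps that overlap by 2, so consecutive
    # batches share context at their edges
    bgen = xb.BatchGenerator(ds, input_dims={'time': 10},
                             input_overlap={'time': 2})

    for batch in bgen:
        assert batch.sizes['time'] == 10  # every batch spans 10 steps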
+Shuffle and Sampling APIs
+~~~~~~~~~~~~~~~~~~~~~~~~~
+
+When training machine-learning models in batches, it is often necessary to
+selectively or randomly sample from your training data. Xbatcher can help
+facilitate seamless shuffling and sampling by providing APIs that operate on
+batches and/or full datasets. This may require working with Xarray and Dask to
+facilitate fast, distributed shuffles of Dask arrays.
+
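No shuffle API exists yet; as a rough sketch of the idea, one could simply permute the order in which batches are yielded (the `shuffled` helper below is hypothetical):

.. code-block:: Python

    import numpy as np

    def shuffled(bgen, seed=None):
        # hypothetical helper: yield batches from a BatchGenerator
        # in random order; materializing the batch list trades memory
        # for simplicity and would not scale to large datasets
        rng = np.random.default_rng(seed)
        batches = list(bgen)
        for i in rng.permutation(len(batches)):
            yield batches[i]

A real implementation would more likely permute batch indices and slice lazily, which is where the Dask-aware shuffling mentioned above comes in.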
+Caching APIs
+~~~~~~~~~~~~
+
+A common pattern in ML is to perform the ETL tasks once before saving the
+results to a local file system. This is an effective approach for speeding up
+dataset loading during training but comes with numerous downsides (e.g. it
+requires sufficient file space and breaks workflow continuity). We propose the
+development of a pluggable cache mechanism in Xbatcher that would help address
+these downsides while providing improved performance during model training and
+inference. For example, this pluggable cache mechanism may allow choosing
+between multiple cache types, such as an LRU in-memory cache, a Zarr filesystem
+or S3 bucket, or a Redis database cache.
+
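To make the in-memory option concrete, a minimal LRU cache might look like the sketch below; the class name and the idea of keying batches by index are illustrative, not an existing Xbatcher interface:

.. code-block:: Python

    from collections import OrderedDict

    class LRUBatchCache:
        # hypothetical in-memory LRU cache mapping batch keys to
        # xarray objects; a Zarr store or Redis backend could expose
        # the same __contains__/__getitem__/__setitem__ interface
        def __init__(self, maxsize=128):
            self.maxsize = maxsize
            self._store = OrderedDict()

        def __contains__(self, key):
            return key in self._store

        def __getitem__(self, key):
            self._store.move_to_end(key)  # mark as recently used
            return self._store[key]

        def __setitem__(self, key, batch):
            self._store[key] = batch
            self._store.move_to_end(key)
            if len(self._store) > self.maxsize:
                self._store.popitem(last=False)  # evict oldest entry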
+Integration with TensorFlow and PyTorch Dataset Loaders
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Deep-learning libraries like TensorFlow and PyTorch provide high-performance
+dataset-generator APIs that facilitate the construction of flexible and
+efficient input pipelines. In particular, they have been optimized to support
+asynchronous data loading and training, transfer to and from GPUs, and batch
+caching. Xbatcher will provide compatible dataset APIs that allow users to pass
+Xarray datasets directly to deep-learning frameworks.
+
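One plausible shape for such an adapter, written against PyTorch's real `IterableDataset` API (the wrapper class itself is hypothetical):

.. code-block:: Python

    import torch
    from torch.utils.data import DataLoader, IterableDataset

    class XBatcherDataset(IterableDataset):
        # hypothetical adapter: wraps a BatchGenerator so PyTorch can
        # consume xarray batches as (x, y) tensor pairs
        def __init__(self, bgen, x_var, y_var):
            self.bgen = bgen
            self.x_var = x_var
            self.y_var = y_var

        def __iter__(self):
            for batch in self.bgen:
                yield (torch.as_tensor(batch[self.x_var].values),
                       torch.as_tensor(batch[self.y_var].values))

    # batches are already formed by xbatcher, so automatic batching
    # is disabled with batch_size=None:
    # loader = DataLoader(XBatcherDataset(bgen, 'x', 'y'), batch_size=None)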
+Dependencies
+------------
+
+- Core: Xarray, Pandas, Dask, Scikit-learn, NumPy, SciPy
+- Optional: Keras, PyTorch, TensorFlow, etc.
+
+.. _Keras: https://keras.io/
+.. _PyTorch: https://pytorch.org/
