+++
date = "2020-08-17T08:00:00+00:00"
author = "Ralf Gommers"
title = "Announcing the Consortium for Python Data API Standards"
tags = ["APIs", "standard", "consortium", "arrays", "dataframes", "community"]
categories = ["Consortium"]
description = "An initiative to develop API standards for n-dimensional arrays and dataframes"
draft = false
weight = 30
+++


Over the past few years, Python has exploded in popularity for data science,
machine learning, deep learning and numerical computing. New frameworks
pushing forward the state of the art in these fields are appearing every
year. One unintended consequence of all this activity and creativity has been
fragmentation in the fundamental building blocks - multidimensional array
(tensor) and dataframe libraries - that underpin the whole Python data
ecosystem. For example, arrays are fragmented between TensorFlow, PyTorch,
NumPy, CuPy, MXNet, Xarray, Dask, and others. Dataframes are fragmented
between Pandas, PySpark, cuDF, Vaex, Modin, Dask, Ibis, Apache Arrow, and
more. This fragmentation comes with significant costs, from whole libraries
being reimplemented for a different array or dataframe library to end users
having to re-learn APIs and best practices when they move from one framework
to another.



Today, we are announcing the Consortium for Python Data API Standards, which
aims to tackle this fragmentation by developing API standards for arrays
(a.k.a. tensors) and dataframes. We aim to grow this Consortium into an
organization where cross-project and cross-ecosystem alignment on APIs, data
exchange mechanisms and other such topics happens. These topics require
coordination and communication to a much larger extent than they require
technical innovation. We aim to facilitate the former, while leaving the
innovation itself to current and future individual libraries.

_Want to get started right away? Collect data on your own API usage with_
_[python-record-api](https://github.com/data-apis/python-record-api)._
_And for array or dataframe maintainers: we want to_
_hear what you think -- see the end of this post._


## Bootstrapping an organization and involving the wider community

The goal is ambitious, and there are obvious hurdles, such as answering
questions like "what is the impact of a technical decision on each library?".
There's a significant amount of engineering and technical writing to do to
create overviews and data that can be used to answer such questions. This
requires dedicated time, and hence funding, in addition to bringing
maintainers of libraries together to provide input and assess those overviews
and data. These factors motivated the following approach to bootstrapping the
Consortium:

1. Assemble a consortium of people from interested companies (who help fund
   the required engineering and technical writing time) and key community
   contributors.
2. Form a working group which meets weekly for an hour, sets the high-level
   goals, requirements, and user stories, and makes initial decisions.
3. Several engineers with enough bandwidth will do the heavy lifting on
   building the required tooling, and preparing the data and draft docs
   needed by the working group. Iterate as needed based on feedback from the
   working group.
4. Once drafts of an API standard have a concrete outline, request input from
   a broader group of array and dataframe library maintainers.
   _<span style="color:#a14a76">This is where we are today.</span>_
5. Once tooling or drafts of the API standards get mature enough for wider review,
   release them as Requests for Comments (RFCs) and have a public review
   process. Iterate again as needed.
6. Once there's enough buy-in for the RFCs, and it's clear projects are able
   and willing to adopt the proposed APIs, publish a 1.0 version of the
   standard.

Such a gradual RFC process is a bit of an experiment. Community projects
like NumPy and Pandas aren't used to this; however, it's similar to successful
models in other communities (e.g. the Open Geospatial Consortium, or C++
standardization), and we think the breadth of projects involved and the
complexity of the challenge make this the most promising approach, and the
one most likely to succeed. The approach will certainly evolve over time
though, based on experience and feedback from the many stakeholders.



There are other synergistic activities happening in the ecosystem that are
relevant to this Consortium and that individual members are contributing to,
such as the work on
[developing NumPy array protocols](https://github.com/numpy/archive/blob/master/other_meetings/2020-04-21-array-protocols_discussion_and_notes.md),
and the `__dataframe__`
[data interchange protocol design](https://github.com/wesm/dataframe-protocol).
The section in the `__array_module__` NEP on ["restricted subsets of NumPy's API"](https://numpy.org/neps/nep-0037-array-module.html#requesting-restricted-subsets-of-numpy-s-api)
gives an outline for how the API standard we're developing can be adopted.
The `__dataframe__` protocol attempts to solve a small, well-defined problem
that is a sub-problem of the one a full dataframe API standard would solve.
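
To make the `__array_module__` adoption path sketched in that NEP a bit more
concrete, here is a rough sketch of the NEP 37 pattern. The
`get_array_module` helper below is our own stand-in for the function the NEP
proposes (it is not part of any released NumPy), so treat this as an
illustration of the idea rather than working protocol code:

```python
import numpy as np


def get_array_module(*arrays):
    # Stand-in for NEP 37's proposed np.get_array_module: ask each argument
    # which module implements its API, falling back to plain NumPy.
    types = tuple(type(a) for a in arrays)
    for a in arrays:
        array_module = getattr(a, "__array_module__", None)
        if array_module is not None:
            module = array_module(types)
            if module is not NotImplemented:
                return module
    return np


def duck_mean(x):
    xp = get_array_module(x)  # NumPy, CuPy, ... depending on x's type
    return xp.mean(x, axis=0)


print(duck_mean(np.arange(6.0).reshape(2, 3)))  # [1.5 2.5 3.5]
```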


## An API standard - what do we mean by that?

When we start talking about an "API standard" or a formal specification, it's
important to explain both why we need it and what "API standard" means.
"Standard" is a loaded word, and "standardization" is a process that, when
done right, can have a large impact but may also evoke past experiences that
weren't always productive.

We can look at an API at multiple levels of detail:

1. Which functions, classes, class methods and other objects are in it.
2. What is the signature of each object.
3. What are the semantics of each object.

When talking about _compatibility_ between libraries, e.g. "the Pandas, Dask
and Vaex `groupby` methods are compatible", we imply that the respective
dataframe objects all have a `groupby` method with the same signature and the
same semantics, so they can be used interchangeably.
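
As a small illustration of what compatibility in this sense would buy: the
function below is written against Pandas; whether it also runs unchanged on
Dask or Vaex dataframes is precisely what compatible `groupby` APIs would
guarantee.

```python
import pandas as pd


def mean_by_group(df, key, column):
    # Relies only on a groupby method with Pandas' signature and semantics;
    # under a shared standard this would work for any conforming dataframe.
    return df.groupby(key)[column].mean()


df = pd.DataFrame({"city": ["A", "A", "B"], "temp": [20.0, 22.0, 17.0]})
print(mean_by_group(df, "city", "temp"))  # A -> 21.0, B -> 17.0
```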

Currently, array and dataframe libraries all have similar APIs, but with
enough differences that using them interchangeably isn't really possible.
Here is a concrete example for a relatively simple function, `mean`, for
arrays:

```python
numpy:      mean(a, axis=None, dtype=None, out=None, keepdims=<no value>)
dask.array: mean(a, axis=None, dtype=None, out=None, keepdims=<no value>)
cupy:       mean(a, axis=None, dtype=None, out=None, keepdims=False)
jax.numpy:  mean(a, axis=None, dtype=None, out=None, keepdims=False)
mxnet.np:   mean(a, axis=None, dtype=None, out=None, keepdims=False)
sparse:     s.mean(axis=None, keepdims=False, dtype=None, out=None)
torch:      mean(input, dim, keepdim=False, out=None)
tensorflow: reduce_mean(input_tensor, axis=None, keepdims=None, name=None,
                        reduction_indices=None, keep_dims=None)
```

We see that:

- All libraries provide this functionality in a function called `mean`,
  except (1) TensorFlow calls it `reduce_mean` and (2) PyData Sparse has a
  `mean` method instead of a function (it does work with `np.mean` though via
  [array protocols](https://numpy.org/devdocs/reference/arrays.classes.html#special-attributes-and-methods)).
- NumPy, Dask, CuPy, JAX and MXNet have compatible signatures (except the
  `<no value>` default for `keepdims`, which really means `False` but with
  different behavior for array subclasses).
- The semantics are harder to inspect, but will also have differences. For
  example, MXNet documents: _This function differs from the original `numpy.mean`
  in the following way(s): - only ndarray is accepted as valid input, python
  iterables or scalar is not supported - default data type for integer input is
  `float32`_.
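
In practice, consumer libraries paper over differences like these with small
per-library shims. The sketch below (simplified, and certainly not a robust
dispatch mechanism) shows what that looks like for `mean`:

```python
import numpy as np


def mean(x, axis=None, keepdims=False):
    # Dispatch on the module the array type comes from. A toy illustration
    # of the shims consumers write today, not production code.
    mod = type(x).__module__.split(".")[0]
    if mod == "torch":
        # PyTorch spells the keywords dim/keepdim, and omitting the axis
        # means "reduce over all dimensions".
        return x.mean() if axis is None else x.mean(dim=axis, keepdim=keepdims)
    if mod == "tensorflow":
        import tensorflow as tf
        return tf.reduce_mean(x, axis=axis, keepdims=keepdims)  # different name
    # NumPy and the NumPy-compatible libraries share this signature.
    return np.mean(x, axis=axis, keepdims=keepdims)


print(mean(np.ones((2, 3)), axis=0))  # [1. 1. 1.]
```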

An API standard will specify a function's presence, its signature, and its
semantics, e.g.:

- `mean(a, axis=None, dtype=None, out=None, keepdims=False)`
- Computes the arithmetic mean along the specified axis.
- Meaning and detailed behavior of each keyword is explained.
- Semantics, including for corner cases (e.g. `nan`, `inf`, empty arrays, and
  more) are given by a reference test suite.
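
To give a flavor of what such a reference test suite could contain, here is a
sketch of two entries for `mean`, parametrized over conforming libraries
(only NumPy is listed here). The structure and the exact semantics the real
suite pins down are still to be decided:

```python
import numpy as np
import pytest


@pytest.mark.parametrize("xp", [np])  # would list each conforming library
def test_mean_signature_and_result(xp):
    a = xp.asarray([[1.0, 2.0], [3.0, 4.0]])
    assert float(xp.mean(a)) == 2.5
    assert xp.mean(a, axis=0, keepdims=True).shape == (1, 2)


@pytest.mark.parametrize("xp", [np])
def test_mean_nan_propagates(xp):
    a = xp.asarray([1.0, float("nan")])
    assert xp.isnan(xp.mean(a))
```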

The semantics of a function are a large topic, so the _scope_ of what is
specified must be very clear. For example (this may be specified separately,
as it will be common between many functions):

- Only array input is in scope; functions may or may not accept lists, tuples
  or other objects.
- Dtypes covered are `int32`, `int64`, `float16`, `float32`, `float64`;
  extended precision floating point and complex dtypes, datetime and custom
  dtypes are out of scope.
- Default dtype may be either `float32` or `float64`; this is a consistent
  library-wide choice (rationale: for deep learning `float32` is the typical
  default, for general numerical computing `float64` is the default).
- Expected results when the input contains `nan` or `inf` are in scope;
  behavior may vary slightly (e.g. warnings are produced) depending on
  implementation details.
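
The last bullet is easy to illustrate with NumPy: the _result_ for `nan`,
`inf` or empty input would be specified, while whether a warning is emitted
stays an implementation detail.

```python
import warnings

import numpy as np

print(np.mean(np.array([1.0, np.inf])))  # inf
print(np.mean(np.array([2.0, np.nan])))  # nan, emitted silently
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    print(np.mean(np.array([])))         # nan, plus a RuntimeWarning in NumPy
print(caught[0].category.__name__)       # RuntimeWarning
```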

_Please note: all the above is meant to sketch what an "API standard" means; the concrete signatures, semantics and scope may and likely will change_.


## Approach to building up the API standards

The approach we're taking includes a combination of design discussions,
requirements engineering and data-driven decision making.

### Start from use cases and requirements

Something that's often missing when a library's API grows organically, with
many people adding features and solving their own issues over a span of
years, is _requirements engineering_: start with a well-defined scope and use
cases, derive requirements from those use cases, and then refer to those use
cases and requirements when making individual technical design decisions. We
aim to take such an approach, to end up with a consistent design and with
good, documented rationales for decisions.

We need to carefully define scope, including both goals and non-goals. For
example, while we aim to make array and dataframe libraries compatible so
that consumers of those data structures can support several of those
libraries, _runtime switching_ without testing or any changes in the
consuming library is not a goal. A concrete example: we aim to make it
possible for scikit-learn to consume CuPy arrays and JAX and PyTorch tensors
from pure Python code (as a first goal; C/C++/Cython is a harder nut to crack),
but we expect that to require some amount of changes and possibly
special-casing in scikit-learn - because specifying every last implementation
detail scikit-learn may be relying on isn't feasible.
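
What such array-agnostic library code could look like is sketched below. The
`get_namespace` helper is hypothetical: how a consumer recovers the module
implementing an array's API is itself one of the things that needs
standardizing, and the type-based lookup here is only a stub.

```python
import importlib

import numpy as np


def get_namespace(x):
    # Hypothetical helper: map the array's type to its implementing module.
    return importlib.import_module(type(x).__module__.split(".")[0])


def standardize(x):
    # Uses only functions a standard would cover, so the same source could
    # serve NumPy arrays, CuPy arrays, PyTorch tensors, and so on.
    xp = get_namespace(x)
    return (x - xp.mean(x, axis=0)) / xp.std(x, axis=0)


print(standardize(np.arange(12.0).reshape(4, 3)).mean(axis=0))  # ~[0. 0. 0.]
```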

### Be conservative in choices made

A standard only has value if it's adhered to widely. So a standard has to be
both easy and sensible/uncontroversial to adopt.
This implies that we should only attempt to standardize functionality with which
there is already wide experience, and that all libraries either already have
in some form or can implement with a reasonable amount of effort. Therefore,
there will be more consolidation than innovation - what is new is almost by
definition hard to standardize.

### A data-driven approach

Two of the main questions we may have when talking about any individual
function, method or object are:

1. What are the signatures and semantics of all of the current implementations?
2. Who uses the API, how often and in which manner?


To answer those questions we built two sets of tooling, for API comparisons
and for gathering telemetry data, which we are releasing today under an MIT
license (the license we'll use for all code and documents):

[array-api-comparison](https://github.com/data-apis/array-api-comparison)
takes the approach of parsing all public HTML docs from array libraries,
compiling overviews of the presence/absence of functionality and its
signatures, and rendering the result as HTML tables. Finding out what is
common or different is one `make` command away; e.g., the intersection of
functions present in all libraries can be obtained with `make view-intersection`.

A similar tool and dataset for dataframe libraries will follow.

[python-record-api](https://github.com/data-apis/python-record-api) takes a
tracing-based approach. It can log all calls made from one specified module
into another, either while running a module or while running pytest. It
determines not only which functions are called, but also which keywords are
used and the types of all input arguments. It stores the results of running
any code base, such as the test suite of a consumer library, as JSON. Initial
results for NumPy usage by Pandas, Matplotlib, scikit-learn, Xarray and
scikit-image are stored in the repository, and more results can be added
incrementally. The next thing it can do is take that data and synthesize an
API from it, based on actual usage. Such a generated API may need curation
and changes, but is a very useful data point when discussing what should and
should not be included in an API standard.

```python
def sum(
    a: object,
    axis: Union[None, int, Tuple[Union[int, None], ...]] = ...,
    out: Union[numpy.ndarray, numpy.float64] = ...,
    dtype: Union[type, None] = ...,
    keepdims: bool = ...,
):
    """
    usage.pandas: 38
    usage.skimage: 114
    usage.sklearn: 397
    usage.xarray: 75
    """
    ...
```

_Example of the usage statistics and synthesized API for `numpy.sum`._
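
For intuition about how such tracing works under the hood, here is a toy
sketch using `sys.setprofile`. It is emphatically not python-record-api's
actual implementation (which records signatures and argument types, not just
call counts); it only shows the underlying idea in miniature:

```python
import json
import sys
from collections import Counter

import numpy as np

calls = Counter()


def profiler(frame, event, arg):
    # Count every Python-level call whose code lives in the numpy package.
    if event == "call":
        module = frame.f_globals.get("__name__", "")
        if module.startswith("numpy"):
            calls[module + "." + frame.f_code.co_name] += 1


sys.setprofile(profiler)
np.mean([1.0, 2.0, 3.0])  # stand-in for a consumer library's test suite
np.sum(np.arange(10))
sys.setprofile(None)

print(json.dumps(calls.most_common(5), indent=2))
```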


## Who is involved?

Quansight Labs started this initiative to tackle the problem of
fragmentation of data structures. In discussions with potential sponsors and
community members, it evolved from a development-focused effort to the
current API standardization approach. Quansight Labs is a public benefit
division of Quansight, with a [mission](https://labs.quansight.org/about/) to
sustain and grow community-driven open source projects and ecosystems, with a
focus on the core of the PyData stack.

The founding sponsors are Intel, Microsoft, the D. E. Shaw group, Google
Research and Quansight. We also invited a number of key community
contributors, to ensure representation of stakeholder projects.

The basic principles we used for initial membership are:

- Consider all of the most popular array (tensor) and dataframe libraries.
- Invite at least one key contributor from each community-driven project.
- Engage with all company-driven projects on an equal basis: sketching the
  goals, and asking for participation and $50k in funding to support the
  required engineering and technical writing.
- For company-driven projects that were interested but not able to sponsor,
  we invited a key member of their array or dataframe library to join.

The details of how decision making is done and how new members are accepted are
outlined in the [Consortium governance repository](https://github.com/data-apis/governance),
and the [members and sponsors](https://github.com/data-apis/governance/blob/master/members_and_sponsors.md)
page gives an overview of current participants.
_The details of how the Consortium functions are likely to evolve over the_
_coming months - we're still at the start of this endeavour._


## Where we go from here

Here is an approximate timeline of what we hope to do over the next couple of months:

- today: announcement blog post; tooling and governance repositories made public
- next week: first public conference call
- Sep 1: publish a website for the Consortium at data-apis.org
- Sep 15: publish the array API RFC and start community review
- Nov 15: publish the dataframe API RFC and start community review

If you're an array (tensor) or dataframe library maintainer: **we'd like to hear from you!**
We have opened [an issue tracker](https://github.com/data-apis/consortium-feedback/)
for discussions. We'd love to hear any ideas, questions and concerns you may
have.

This is a very challenging problem, with lots of thorny questions to answer, like:

- how will projects adopt a standard and expose it to their users without significant backwards compatibility breaks?
- what does versioning and evolving the standard look like?
- what about extensions that are not included in the standard?

Those challenges are worth tackling though, because the benefits are potentially very large.
We're looking forward to what comes next!