Commit a092229

Add the announcement blog post
1 parent eb8f054 commit a092229

File tree: 5 files changed (+325 lines, -0 lines)

config.toml: 4 additions & 0 deletions

@@ -21,3 +21,7 @@ Paginate = 4
   name = "Blog"
   weight = -100
   url = "/blog/"
+
+# To render raw html tags within Markdown
+[markup.goldmark.renderer]
+unsafe= true

New blog post file: 321 additions & 0 deletions

+++
date = "2020-08-17T08:00:00+00:00"
author = "Ralf Gommers"
title = "Announcing the Consortium for Python Data API Standards"
tags = ["APIs", "standard", "consortium", "arrays", "dataframes", "community"]
categories = ["Consortium"]
description = "An initiative to develop API standards for n-dimensional arrays and dataframes"
draft = false
weight = 30
+++

Over the past few years, Python has exploded in popularity for data science, machine learning, deep learning and numerical computing. New frameworks pushing forward the state of the art in these fields are appearing every year. One unintended consequence of all this activity and creativity has been fragmentation in the fundamental building blocks - multidimensional array (tensor) and dataframe libraries - that underpin the whole Python data ecosystem. For example, arrays are fragmented between Tensorflow, PyTorch, NumPy, CuPy, MXNet, Xarray, Dask, and others. Dataframes are fragmented between Pandas, PySpark, cuDF, Vaex, Modin, Dask, Ibis, Apache Arrow, and more. This fragmentation comes with significant costs, from whole libraries being reimplemented for a different array or dataframe library to end users having to re-learn APIs and best practices when they move from one framework to another.

![Ecosystem growing up in silos](/images/ecosystem_fragmentation.png)

Today, we are announcing the Consortium for Python Data API Standards, which aims to tackle this fragmentation by developing API standards for arrays (a.k.a. tensors) and dataframes. We aim to grow this Consortium into an organization where cross-project and cross-ecosystem alignment on APIs, data exchange mechanisms and other such topics happens. These topics require coordination and communication to a much larger extent than they require technical innovation. We aim to facilitate the former, while leaving the innovating to current and future individual libraries.

_Want to get started right away? Collect data on your own API usage with [python-record-api](https://github.com/data-apis/python-record-api). And for array or dataframe maintainers: we want to hear what you think -- see the end of this post._

## Bootstrapping an organization and involving the wider community

The goal is ambitious, and there are obvious hurdles, such as answering questions like "what is the impact of a technical decision on each library?". There's a significant amount of engineering and technical writing to do to create overviews and data that can be used to answer such questions. This requires dedicated time, and hence funding, in addition to bringing maintainers of libraries together to provide input and assess those overviews and data. These factors motivated the following approach to bootstrapping the Consortium:

1. Assemble a consortium of people from interested companies (who help fund the required engineering and technical writing time) and key community contributors.
2. Form a working group which meets weekly for an hour, sets the high-level goals, requirements, and user stories, and makes initial decisions.
3. Have several engineers with enough bandwidth do the heavy lifting on building the required tooling and preparing the data and draft docs needed by the working group. Iterate as needed based on feedback from the working group.
4. Once drafts of an API standard have a concrete outline, request input from a broader group of array and dataframe library maintainers. _<span style="color:#a14a76">This is where we are today.</span>_
5. Once tooling or drafts of the API standards get mature enough for wider review, release them as a Request for Comments (RFC) and run a public review process. Iterate again as needed.
6. Once there's enough buy-in for the RFCs, and it's clear projects are able and willing to adopt the proposed APIs, publish a 1.0 version of the standard.

Such a gradual RFC process is a bit of an experiment. Community projects like NumPy and Pandas aren't used to this; however, it's similar to successful models in other communities (e.g. the Open Geospatial Consortium, or C++ standardization), and we think the breadth of projects involved and the complexity of the challenge make this the most promising approach, and the one most likely to succeed. The approach will certainly evolve over time though, based on experience and feedback from the many stakeholders.

![API standard RFC development and community review](/images/API_standard_RFC_review_diagram.png)

There are other, synergistic activities happening in the ecosystem that are relevant to this Consortium and that individual members are contributing to, such as the work on [developing NumPy array protocols](https://github.com/numpy/archive/blob/master/other_meetings/2020-04-21-array-protocols_discussion_and_notes.md) and the `__dataframe__` [data interchange protocol design](https://github.com/wesm/dataframe-protocol). The section in the `__array_module__` NEP on ["restricted subsets of NumPy's API"](https://numpy.org/neps/nep-0037-array-module.html#requesting-restricted-subsets-of-numpy-s-api) gives an outline for how the API standard we're developing can be adopted. The `__dataframe__` protocol attempts to solve a small, well-defined problem that is a sub-problem of the one a full dataframe API standard would solve.

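To make that "restricted subset" adoption idea a bit more concrete, here is a rough sketch of the dispatch pattern NEP 37 describes. Note that `get_array_module` and `__array_module__` are draft proposals, not released NumPy APIs, so the helper below is purely illustrative; with only NumPy available it simply falls back to `numpy` itself.

```python
import numpy as np

def get_array_module(*arrays, default=np):
    # Illustrative stand-in for the draft NEP 37 helper (not a released NumPy
    # API): ask each array which module implements its API, falling back to
    # NumPy when none of them provides the protocol.
    types = tuple(type(a) for a in arrays)
    for a in arrays:
        protocol = getattr(type(a), "__array_module__", None)
        if protocol is not None:
            module = protocol(a, types)
            if module is not NotImplemented:
                return module
    return default

def log_mean(x):
    # Library-agnostic consumer code: it only uses the (possibly restricted)
    # NumPy-like namespace returned above.
    xp = get_array_module(x)
    return xp.log(xp.mean(x))

print(log_mean(np.asarray([1.0, 2.0, 4.0])))  # falls back to plain NumPy here
```
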
## An API standard - what do we mean by that?

When we start talking about an "API standard" or a formal specification, it's important to start by explaining both why we need it and what "API standard" means. "Standard" is a loaded word, and "standardization" is a process that, when done right, can have a large impact but may also evoke past experiences that weren't always productive.

We can look at an API at multiple levels of detail:

1. Which functions, classes, class methods and other objects are in it.
2. What is the signature of each object.
3. What are the semantics of each object.

When talking about _compatibility_ between libraries, e.g. "the Pandas, Dask and Vaex `groupby` methods are compatible", we imply that the respective dataframe objects all have a `groupby` method with the same signature and the same semantics, so they can be used interchangeably.

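For illustration, here is a sketch of what that interchangeability would mean in practice. The column names are invented for the example, and today the exact call below matches pandas (Dask needs a trailing `.compute()`, and Vaex spells its groupby differently), which is precisely the kind of gap a standard would close.

```python
import pandas as pd

def mean_sales_per_region(df):
    # Written once against a pandas-style groupby; with standardized
    # signatures and semantics, any compliant dataframe object could be
    # passed in here unchanged.
    return df.groupby("region")["sales"].mean()

df = pd.DataFrame({"region": ["EU", "EU", "US"], "sales": [1.0, 3.0, 5.0]})
print(mean_sales_per_region(df))  # EU: 2.0, US: 5.0
```
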
Currently, array and dataframe libraries all have similar APIs, but with enough differences that using them interchangeably isn't really possible. Here is a concrete example for a relatively simple function, `mean`, for arrays:

```python
numpy:      mean(a, axis=None, dtype=None, out=None, keepdims=<no value>)
dask.array: mean(a, axis=None, dtype=None, out=None, keepdims=<no value>)
cupy:       mean(a, axis=None, dtype=None, out=None, keepdims=False)
jax.numpy:  mean(a, axis=None, dtype=None, out=None, keepdims=False)
mxnet.np:   mean(a, axis=None, dtype=None, out=None, keepdims=False)
sparse:     s.mean(axis=None, keepdims=False, dtype=None, out=None)
torch:      mean(input, dim, keepdim=False, out=None)
tensorflow: reduce_mean(input_tensor, axis=None, keepdims=None, name=None,
                        reduction_indices=None, keep_dims=None)
```

We see that:

- All libraries provide this functionality in a function called `mean`, except that (1) Tensorflow calls it `reduce_mean` and (2) PyData Sparse has a `mean` method instead of a function (it does work with `np.mean` though, via [array protocols](https://numpy.org/devdocs/reference/arrays.classes.html#special-attributes-and-methods)).
- NumPy, Dask, CuPy, JAX and MXNet have compatible signatures (except for the `<no value>` default for `keepdims`, which really means `False` but with different behavior for array subclasses).
- The semantics are harder to inspect, but will also have differences. For example, MXNet documents: _This function differs from the original `numpy.mean` in the following way(s): - only ndarray is accepted as valid input, python iterables or scalar is not supported - default data type for integer input is `float32`_.

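These signature differences are exactly what breaks code that tries to be array-library-agnostic. As a small illustration (using only the signatures listed above), the following helper works when `xp` is a namespace that follows the NumPy-style `mean` signature, but not when `xp` is `torch` (which spells the arguments `dim`/`keepdim`) or `tensorflow` (which names the function `reduce_mean`):

```python
import numpy as np

def standardize(x, xp=np):
    # Relies on the NumPy-style signature mean(a, axis=..., keepdims=...);
    # per the table above this also matches dask.array, cupy, jax.numpy and
    # mxnet.np, but not torch.mean or tf.reduce_mean.
    return x - xp.mean(x, axis=0, keepdims=True)

print(standardize(np.arange(6.0).reshape(2, 3)))
```
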
An API standard will specify each function's presence, its signature, and its semantics, e.g.:

- `mean(a, axis=None, dtype=None, out=None, keepdims=False)`
- Computes the arithmetic mean along the specified axis.
- The meaning and detailed behavior of each keyword is explained.
- Semantics, including for corner cases (e.g. `nan`, `inf`, empty arrays, and more), are given by a reference test suite (sketched below).

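To give a flavor of what such a reference test suite could contain, here is a hypothetical sketch of two checks for `mean`. The expected behavior encoded here simply mirrors what NumPy currently does; it is not something the Consortium has decided.

```python
import numpy as np

def test_mean_nan_propagation(xp=np):
    # Hypothetical corner-case check: a nan in the input propagates to the result.
    a = xp.asarray([1.0, float("nan"), 3.0])
    assert xp.isnan(xp.mean(a))

def test_mean_keepdims_shape(xp=np):
    # Hypothetical check of keepdims=True semantics: reduced axes keep length 1.
    a = xp.ones((2, 3))
    assert xp.mean(a, axis=0, keepdims=True).shape == (1, 3)

test_mean_nan_propagation()
test_mean_keepdims_shape()
```
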
The semantics of a function are a large topic, so the _scope_ of what is specified must be very clear. For example (this may be specified separately, as it will be common to many functions):

- Only array input is in scope; functions may or may not accept lists, tuples or other objects.
- Dtypes covered are `int32`, `int64`, `float16`, `float32`, `float64`; extended precision floating point and complex dtypes, datetime and custom dtypes are out of scope.
- The default dtype may be either `float32` or `float64`; this is a consistent library-wide choice (rationale: for deep learning `float32` is the typical default, for general numerical computing `float64` is the default).
- Expected results when the input contains `nan` or `inf` are in scope; behavior may vary slightly (e.g. whether warnings are produced) depending on implementation details.

_Please note: all the above is meant to sketch what an "API standard" means; the concrete signatures, semantics and scope may and likely will change._

## Approach to building up the API standards

The approach we're taking includes a combination of design discussions, requirements engineering and data-driven decision making.

### Start from use cases and requirements

Something that's often missing when a library's API grows organically, as many people add features and solve their own issues over a span of years, is _requirements engineering_: start with a well-defined scope and use cases, derive requirements from those use cases, and then refer to those use cases and requirements when making individual technical design decisions. We aim to take such an approach, to end up with a consistent design and with good, documented rationales for decisions.

We need to carefully define scope, including both goals and non-goals. For example, while we aim to make array and dataframe libraries compatible so that consumers of those data structures will be able to support multiple such libraries, _runtime switching_ without testing or any changes in the consuming library is not a goal. A concrete example: we aim to make it possible for scikit-learn to consume CuPy arrays and JAX and PyTorch tensors from pure Python code (as a first goal; C/C++/Cython is a harder nut to crack), but we expect that to require some amount of changes and possibly special-casing in scikit-learn, because specifying every last implementation detail scikit-learn may be relying on isn't feasible.

### Be conservative in choices made

A standard only has value if it's adhered to widely, so adopting it has to be both easy and sensible/uncontroversial. This implies that we should only attempt to standardize functionality with which there is already wide experience, and that all libraries either already have in some form or can implement with a reasonable amount of effort. Therefore, there will be more consolidation than innovation - what is new is almost by definition hard to standardize.

### A data-driven approach

Two of the main questions we may have when talking about any individual function, method or object are:

1. What are the signatures and semantics of all of the current implementations?
2. Who uses the API, how often and in which manner?

To answer those questions we built two sets of tooling for API comparisons and gathering telemetry data, which we are releasing today under an MIT license (the license we'll use for all code and documents):

[array-api-comparison](https://github.com/data-apis/array-api-comparison) takes the approach of parsing all public html docs from array libraries, compiling overviews of the presence/absence of functionality and its signatures, and rendering the result as html tables. Finding out what is common or different is one `make` command away; e.g., the intersection of functions present in all libraries can be obtained with `make view-intersection`:

![Array library API intersection](/images/array_API_comparison_output.png)

A similar tool and dataset for dataframe libraries will follow.

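As a toy illustration of the underlying idea (array-api-comparison itself works from the libraries' published html docs rather than from live imports), intersecting two public namespaces directly in Python looks like this:

```python
import math
import numpy as np

def public_names(module):
    # Collect the public (non-underscore) names a module exposes.
    return {name for name in dir(module) if not name.startswith("_")}

# Names spelled identically in both namespaces, e.g. cos, exp, sqrt, pi, ...
print(sorted(public_names(np) & public_names(math)))
```
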
[python-record-api](https://github.com/data-apis/python-record-api) takes a tracing-based approach. It can log all calls made from one specified module into another, either while running a module or while running pytest. It determines not only which functions are called, but also which keywords are used and the types of all input arguments. It stores the results of running any code base, such as the test suite of a consumer library, as JSON. Initial results for NumPy usage by Pandas, Matplotlib, scikit-learn, Xarray and scikit-image are stored in the repository, and more results can be added incrementally. The next thing it can do is take that data and synthesize an API from it, based on actual usage. Such a generated API may need curation and changes, but it is a very useful data point when discussing what should and should not be included in an API standard.

```python
def sum(
    a: object,
    axis: Union[None, int, Tuple[Union[int, None], ...]] = ...,
    out: Union[numpy.ndarray, numpy.float64] = ...,
    dtype: Union[type, None] = ...,
    keepdims: bool = ...,
):
    """
    usage.pandas: 38
    usage.skimage: 114
    usage.sklearn: 397
    usage.xarray: 75
    """
    ...
```

_Example of the usage statistics and synthesized API for `numpy.sum`._

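For a feel for how this kind of telemetry can be gathered, here is a minimal sketch of the tracing idea using only the standard library's profiling hook. This is not python-record-api's actual interface or implementation (that tool records signatures, keywords and argument types and writes JSON); it only counts which NumPy entry points a small workload touches.

```python
import sys
from collections import Counter

import numpy as np

calls = Counter()

def profiler(frame, event, arg):
    # "call" fires for Python-level functions; "c_call" for C functions,
    # where `arg` is the callable being invoked.
    if event == "call":
        calls[f"{frame.f_globals.get('__name__', '?')}.{frame.f_code.co_name}"] += 1
    elif event == "c_call":
        calls[f"{getattr(arg, '__module__', '?')}.{arg.__name__}"] += 1

def record(fn):
    sys.setprofile(profiler)
    try:
        fn()
    finally:
        sys.setprofile(None)

record(lambda: np.mean(np.ones((3, 3)), axis=0))
print({name: n for name, n in calls.items() if name.startswith("numpy")})
```
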
## Who is involved?

Quansight Labs started this initiative to tackle the problem of fragmentation of data structures. In discussions with potential sponsors and community members, it evolved from a development-focused effort to the current API standardization approach. Quansight Labs is a public benefit division of Quansight, with a [mission](https://labs.quansight.org/about/) to sustain and grow community-driven open source projects and ecosystems, with a focus on the core of the PyData stack.

The founding sponsors are Intel, Microsoft, the D. E. Shaw group, Google Research and Quansight. We also invited a number of key community contributors, to ensure representation of stakeholder projects.

The basic principles we used for initial membership are:

- Consider all of the most popular array (tensor) and dataframe libraries.
- Invite at least one key contributor from each community-driven project.
- Engage with all company-driven projects on an equal basis: sketching the goals, and asking for participation and $50k in funding in order to be able to support the required engineering and technical writing.
- For company-driven projects that were interested but not able to sponsor, we invited a key member of their array or dataframe library to join.

The details of how decision making is done and how new members are accepted are outlined in the [Consortium governance repository](https://github.com/data-apis/governance), and the [members and sponsors](https://github.com/data-apis/governance/blob/master/members_and_sponsors.md) page gives an overview of current participants. _The details of how the Consortium functions are likely to evolve over the next months - we're still at the start of this endeavour._

## Where we go from here

Here is an approximate timeline of what we hope to do over the next couple of months:

- today: announcement blog post, and tooling and governance repositories made public
- next week: first public conference call
- Sep 1: publish a website for the Consortium at data-apis.org
- Sep 15: publish the array API RFC and start community review
- Nov 15: publish the dataframe API RFC and start community review

If you're an array (tensor) or dataframe library maintainer: **we'd like to hear from you!** We have opened [an issue tracker](https://github.com/data-apis/consortium-feedback/) for discussions. We'd love to hear any ideas, questions and concerns you may have.

This is a very challenging problem, with lots of thorny questions to answer, like:

- how will projects adopt a standard and expose it to their users without significant backwards compatibility breaks?
- what does versioning and evolving the standard look like?
- what about extensions that are not included in the standard?

Those challenges are worth tackling though, because the benefits are potentially very large. We're looking forward to what comes next!

Plus three image files added in this commit (62 KB, 106 KB, 354 KB).
