Skip to content

Commit da61694

Browse files
MarsBarLeenoatamir
andauthored
introducing-ndindex-a-python-library-for-manipulating-indices-of-ndarrays (#530)
Co-authored-by: Noa Tamir <[email protected]>
1 parent 65a69e1 commit da61694

File tree

3 files changed

+330
-0
lines changed

3 files changed

+330
-0
lines changed
Lines changed: 328 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,328 @@
1+
---
2+
title: 'Introducing ndindex, a Python library for manipulating indices of ndarrays'
3+
published: April 4, 2020
4+
author: aaron-meurer
5+
description: 'ndindex is a new library that provides high level objects representing the various objects that can index NumPy arrays. These objects automatically canonicalize under the assumption of NumPy indexing semantics, and can be manipulated with a uniform API. All ndindex types have a .args that can be used to access the arguments used to create the object, and they are all hashable.'
6+
category: [PyData ecosystem]
7+
featuredImage:
8+
src: /posts/introducing-ndindex-a-python-library-for-manipulating-indices-of-ndarrays/blog_feature_var1.svg
9+
alt: 'An illustration of a brown and a dark brown hand coming towards each other to pass a business card with the logo of Quansight Labs.'
10+
hero:
11+
imageSrc: /posts/introducing-ndindex-a-python-library-for-manipulating-indices-of-ndarrays/blog_hero_org.svg
12+
imageAlt: 'An illustration of a white hand holding up a microphone, with some graphical elements highlighting the top of the microphone.'
13+
---
14+
15+
One of the most important features of NumPy arrays is their indexing
16+
semantics. By "indexing" I mean anything that happens inside square brackets,
17+
for example, `a[4::-1, 0, ..., [0, 1], np.newaxis]`. NumPy's index semantics
18+
are very expressive and powerful, and this is one of the reasons the library
19+
is so popular.
20+
21+
Index objects can be represented and manipulated directly. For example, the
22+
above index is `(slice(4, None, -1), 0, Ellipsis, [0, 1], None)`. If you are
23+
any author of a library that tries to replicate NumPy array semantics, you
24+
will have to work with these objects. However, they are often difficult to
25+
work with:
26+
27+
- The different types that are valid indices for NumPy arrays do not have a
28+
uniform API. Most of the types are also standard Python types, such as
29+
`tuple`, `list`, `int`, and `None`, which are usually unrelated to indexing.
30+
31+
- Those objects that are specific to indexes, such as `slice` and `Ellipsis`
32+
do not make any assumptions about their underlying semantics. For example,
33+
Python lets you create `slice(None, None, 0)` or `slice(0, 0.5)` even though
34+
`a[::0]` and `a[0:0.5]` would be always be an `IndexError` on a NumPy array.
35+
36+
- Some index objects, such as `slice`, `list`, and `ndarray` are not hashable.
37+
38+
- NumPy itself does not offer much in the way of helper functions to work with
39+
these objects.
40+
41+
These limitations may be annoying, but are easy enough to live with. The real
42+
challenge when working with indices comes when you try to manipulate them.
43+
Slices in particular are challenging to work with because the rich meaning of
44+
slice semantics. Writing formulas for even very simple things is a real
45+
challenge with slices. `slice(start, stop, step)` (corresponding to
46+
`a[start:stop:step]`) has fundamentally different meaning depending on whether
47+
`start`,`stop`, or `step` are negative, nonnegative, or `None`. As an example,
48+
take `a[4:-2:-2]`, where `a` is a one-dimensional array. This slices every
49+
other element from the third element to the second from the last. What will
50+
the shape of this sliced array be? The answer is `(0,)` if the original shape
51+
is less than 1 or greater than 5, and `(1,)` otherwise.
52+
53+
Code that manipulates slices will tend to have a lot of `if`/`else` chains for
54+
these different cases. And due to 0-based indexing, half-open semantics,
55+
wraparound behavior, clipping, and step logic, the formulas are often quite
56+
difficult to write down.
57+
58+
## ndindex
59+
60+
This is where ndindex comes in. ndindex is a new library that provides high
61+
level objects representing the various objects that can index NumPy arrays.
62+
These objects automatically canonicalize under the assumption of NumPy
63+
indexing semantics, and can be manipulated with a uniform API. All ndindex
64+
types have a `.args` that can be used to access the arguments used to create
65+
the object, and they are all hashable.
66+
67+
```py
68+
>>> from ndindex import Slice, Integer, Tuple
69+
>>> Slice(0, 3)
70+
Slice(0, 3, 1)
71+
>>> idx = Tuple(Slice(0, 10), Integer(0))
72+
>>> idx.args
73+
(Slice(0, 10, 1), Integer(0))
74+
>>> [i.args for i in idx.args]
75+
[(0, 10, 1), (0,)]
76+
```
77+
78+
The goal of ndindex is to give 100% correct semantics as defined by NumPy's
79+
ndarray. This means that ndindex will not make a transformation on an index
80+
object unless it is correct for all possible input array shapes. The only
81+
exception to this rule is that ndindex assumes that any given index will not
82+
raise IndexError (for instance, from an out of bounds integer index or from
83+
too few dimensions). For those operations where the array shape is known,
84+
there is a `reduce` method to reduce an index to a simpler index that is
85+
equivalent for the given shape.
86+
87+
## Features
88+
89+
ndindex is still a work in progress. The following things are currently
90+
implemented:
91+
92+
- `Slice`, `Integer`, and `Tuple`
93+
94+
- Constructing a class puts it into canonical form. For example
95+
96+
```python
97+
>>> from ndindex import Slice
98+
>>> Slice(None, 12)
99+
Slice(0, 12, 1)
100+
```
101+
102+
- Object arguments can be accessed with `idx.args`
103+
104+
```py
105+
>>> Slice(1, 3).args
106+
(1, 3, 1)
107+
```
108+
109+
- All ndindex objects are hashable and can be used as dictionary keys.
110+
111+
- A real index object can be accessed with `idx.raw`. Use this to use an
112+
ndindex to index an array.
113+
114+
```py
115+
>>> s = Slice(0, 2)
116+
>>> from numpy import arange
117+
>>> arange(4)[s.raw]
118+
array([0, 1])
119+
```
120+
121+
- `len()` computes the maximum length of an index over a given axis.
122+
123+
```py
124+
>>> len(Slice(2, 10, 3))
125+
3
126+
>>> len(arange(10)[2:10:3])
127+
3
128+
```
129+
130+
- `idx.reduce(shape)` reduces an index to an equivalent index over an array
131+
with the given shape.
132+
133+
```py
134+
>>> Slice(2, -1).reduce((10,))
135+
Slice(2, 9, 1)
136+
>>> arange(10)[2:-1]
137+
array([2, 3, 4, 5, 6, 7, 8])
138+
>>> arange(10)[2:9:1]
139+
array([2, 3, 4, 5, 6, 7, 8])
140+
```
141+
142+
The following things are not yet implemented, but are planned.
143+
144+
- `idx.newshape(shape)` returns the shape of `a[idx]`, assuming `a` has shape
145+
`shape`.
146+
147+
- `ellipsis`, `Newaxis`, `IntegerArray`, and `BooleanArray` types, so that all
148+
types of indexing are supported.
149+
150+
- `i1[i2]` will create a new ndindex `i3` (when possible) so that
151+
`a[i1][i2] == a[i3]`.
152+
153+
- `split(i0, [i1, i2, ...])` will return a list of indices `[j1, j2, ...]`
154+
such that `a[i0] = concat(a[i1][j1], a[i2][j2], ...)`
155+
156+
- `i1 + i2` will produce a single index so that `a[i1 + i2]` gives all the
157+
elements of `a[i1]` and `a[i2]`.
158+
159+
- Support [NEP 21 outer indexing and vectorized
160+
indexin](https://numpy.org/neps/nep-0021-advanced-indexing.html).
161+
162+
And more. If there is something you would like to see this library be able to
163+
do, please [open an issue](https://github.com/Quansight-Labs/ndindex/issues). Pull
164+
requests are welcome as well.
165+
166+
## Testing and correctness
167+
168+
The most important priority for a library like this is correctness. Index
169+
manipulations, and especially slice manipulations, are complicated to code
170+
correctly, and the code for them typically involves dozens of different
171+
branches for different cases and formulas that can be difficult to figure out.
172+
173+
In order to assure correctness, all operations are tested extensively against
174+
NumPy itself to ensure they give the same results. The basic idea is to take
175+
the pure Python `index` and the `ndindex(index).raw`, or in the case of a
176+
transformation, the before and after raw index, and index a `numpy.arange`
177+
with them (the input array itself doesn't matter, so long as its values are
178+
distinct). If they do not give the same output array, or do not both produce
179+
the same error (like an `IndexError`), the code is not correct. For example,
180+
the `reduce` method can be verified by checking that `a[idx.raw]` and
181+
`a[idx.reduce(a.shape).raw]` produce the same sub-arrays for all possible
182+
input arrays `a` and ndindex objects `idx`.
183+
184+
There are two primary types of tests that ndindex employs to verify this:
185+
186+
- **Exhaustive tests.** These test every possible value in some range. For
187+
example, `Slice` tests test all possible `start`, `stop`, and `step` values
188+
in the range [-10, 10], as well as `None`, on `numpy.arange(n)` for `n` in
189+
the range [0, 10]. This is the best type of test, because it checks every
190+
possible case. Unfortunately, it is often impossible to do full exhaustive
191+
testing due to combinatorial explosion.
192+
193+
For example, here is the exhaustive test for `Slice.reduce`:
194+
195+
```py
196+
def _iterslice(start_range=(-10, 10), stop_range=(-10, 10), step_range=(-10, 10)):
197+
for start in chain(range(*start_range), [None]):
198+
for stop in chain(range(*stop_range), [None]):
199+
for step in chain(range(*step_range), [None]):
200+
yield (start, stop, step)
201+
202+
def test_slice_reduce_exhaustive():
203+
for n in range(10):
204+
a = arange(n)
205+
for start, stop, step in _iterslice():
206+
try:
207+
s = Slice(start, stop, step)
208+
except ValueError:
209+
continue
210+
211+
check_same(a, s.raw, func=lambda x: x.reduce((n,)))
212+
213+
reduced = s.reduce((n,))
214+
assert reduced.start >= 0
215+
# We cannot require stop > 0 because if stop = None and step < 0, the
216+
# only equivalent stop that includes 0 is negative.
217+
assert reduced.stop != None
218+
assert len(reduced) == len(a[reduced.raw]), (s, n)
219+
```
220+
221+
`check_same` is a [helper
222+
function](https://github.com/Quansight-Labs/ndindex/blob/f8706a6fb6ffac879a0863cb93243f9bb14e6487/ndindex/tests/helpers.py#L60-L82)
223+
that ensures that two indices give either the exact same subarray or raise
224+
the exact same exception. The test checks all `a[start:stop:step]` where
225+
`a` is an array with shape from 0 to 10, and `start`, `stop`, and `step`
226+
range from -10 to 10 or `None`. We also test some basic invariants, such
227+
as that `Slice.reduce` always returns a slice with non-None arguments and
228+
that the start is nonnegative, and that the length of the slice is
229+
minimized for the given shape.
230+
231+
This test takes about 4 seconds to run, and is about at the limit of what is
232+
possible with exhaustive testing. Other objects, in particular `Tuple`, have
233+
so many possible combinations that a similar exhaustive test for them would
234+
take billions of years to complete.
235+
236+
- **Hypothesis tests.**
237+
[Hypothesis](https://hypothesis.readthedocs.io/en/latest/index.html) is a
238+
library that can intelligently check a combinatorial search space of inputs.
239+
This requires writing Hypothesis strategies that can generate all the
240+
relevant types of indices. All ndindex tests have Hypothesis tests, even if
241+
they are also tested exhaustively.
242+
243+
The Hypothesis test for the above test looks like this
244+
245+
```py
246+
from hypothesis import assume
247+
from hypothesis.strategies import integers, composite, none, one_of, lists
248+
249+
# hypothesis.strategies.tuples only generates tuples of a fixed size
250+
@composite
251+
def tuples(draw, elements, *, min_size=0, max_size=None, unique_by=None,
252+
unique=False):
253+
return tuple(draw(lists(elements, min_size=min_size, max_size=max_size,
254+
unique_by=unique_by, unique=unique)))
255+
256+
# Valid shapes for numpy arrays. Filter out shapes that would fill memory.
257+
shapes = tuples(integers(0, 10)).filter(lambda shape: prod([i for i in shape if i]) < 100000)
258+
259+
@composite
260+
def slices(draw, start=ints(), stop=ints(), step=ints()):
261+
return slice(
262+
draw(one_of(none(), start)),
263+
draw(one_of(none(), stop)),
264+
draw(one_of(none(), step)),
265+
)
266+
267+
@given(slices(), shapes)
268+
def test_slice_reduce_hypothesis(s, shape):
269+
a = arange(prod(shape)).reshape(shape)
270+
try:
271+
s = Slice(s)
272+
except ValueError:
273+
assume(False)
274+
275+
check_same(a, s.raw, func=lambda x: x.reduce(shape))
276+
277+
try:
278+
reduced = s.reduce(shape)
279+
except IndexError:
280+
# shape == ()
281+
return
282+
assert reduced.start >= 0
283+
# We cannot require stop > 0 because if stop = None and step < 0, the
284+
# only equivalent stop that includes 0 is negative.
285+
assert reduced.stop != None
286+
assert len(reduced) == len(a[reduced.raw]), (s, shape)
287+
```
288+
289+
In order to tell Hypothesis how to search the example space, we must define
290+
some functions to tell it how to draw example objects of a given type, in
291+
this case, slices and shape parameters for NumPy arrays. These strategies,
292+
as they are called, can be reused for multiple tests. Hypothesis then
293+
automatically and intelligently draws examples from the sample space to try
294+
to find one that fails the test. You can think of Hypothesis as a fuzzer, or
295+
as an "automated QA engineer". It tries to pick examples that are most
296+
likely to hit corner cases or different branch conditions.
297+
298+
Why bother with Hypothesis if the same thing is already tested exhaustively?
299+
The main reason is that Hypothesis is much better at producing human-readable
300+
failure examples. When an exhaustive test fails, the failure will always be
301+
from the first set of inputs in the loop that produces a failure. Hypothesis
302+
on the other hand attempts to "shrink" the failure input to smallest input
303+
that still fails. For example, a failing exhaustive slice test might give
304+
`Slice(-10, -9, -10)` as a the failing example, but Hypothesis would shrink it
305+
to `Slice(-2, -1, -1)`.
306+
307+
Another reason for the duplication is that Hypothesis can sometimes test a
308+
slightly expanded test space without any additional consequences. For example,
309+
the above Hypothesis tests all types of array shapes, whereas the exhaustive
310+
test tests only 1-dimensional shapes. This doesn't affect things because
311+
Hypothesis will always shrink large shapes to a 1-dimensional shape in the
312+
case of a failure, and it has the benefit of ensuring the code works correctly
313+
for larger shapes (it should always slice over the first index, or in the case
314+
of an empty shape raise `IndexError`).
315+
316+
## Try it out
317+
318+
You can install ndindex with pip or from conda-forge
319+
320+
conda install -c conda-forge ndindex
321+
322+
The documentation can be found [here](https://quansight-labs.github.io/ndindex/),
323+
and the development is on [GitHub](https://github.com/Quansight-Labs/ndindex).
324+
Please try the library out and
325+
[report](https://github.com/Quansight-Labs/ndindex/issues) any issues you have, or
326+
things you would like to see implemented. We are also looking for people who
327+
are interested in using the library and for people who are interested in
328+
contributing to it.

apps/labs/public/posts/introducing-ndindex-a-python-library-for-manipulating-indices-of-ndarrays/blog_feature_var1.svg

Lines changed: 1 addition & 0 deletions
Loading

apps/labs/public/posts/introducing-ndindex-a-python-library-for-manipulating-indices-of-ndarrays/blog_hero_org.svg

Lines changed: 1 addition & 0 deletions
Loading

0 commit comments

Comments
 (0)