|
| 1 | +--- |
| 2 | +title: 'Introducing ndindex, a Python library for manipulating indices of ndarrays' |
| 3 | +published: April 4, 2020 |
| 4 | +author: aaron-meurer |
| 5 | +description: 'ndindex is a new library that provides high level objects representing the various objects that can index NumPy arrays. These objects automatically canonicalize under the assumption of NumPy indexing semantics, and can be manipulated with a uniform API. All ndindex types have a .args that can be used to access the arguments used to create the object, and they are all hashable.' |
| 6 | +category: [PyData ecosystem] |
| 7 | +featuredImage: |
| 8 | + src: /posts/introducing-ndindex-a-python-library-for-manipulating-indices-of-ndarrays/blog_feature_var1.svg |
| 9 | + alt: 'An illustration of a brown and a dark brown hand coming towards each other to pass a business card with the logo of Quansight Labs.' |
| 10 | +hero: |
| 11 | + imageSrc: /posts/introducing-ndindex-a-python-library-for-manipulating-indices-of-ndarrays/blog_hero_org.svg |
| 12 | + imageAlt: 'An illustration of a white hand holding up a microphone, with some graphical elements highlighting the top of the microphone.' |
| 13 | +--- |
| 14 | + |
| 15 | +One of the most important features of NumPy arrays is their indexing |
| 16 | +semantics. By "indexing" I mean anything that happens inside square brackets, |
| 17 | +for example, `a[4::-1, 0, ..., [0, 1], np.newaxis]`. NumPy's index semantics |
| 18 | +are very expressive and powerful, and this is one of the reasons the library |
| 19 | +is so popular. |
| 20 | + |
| 21 | +Index objects can be represented and manipulated directly. For example, the |
| 22 | +above index is `(slice(4, None, -1), 0, Ellipsis, [0, 1], None)`. If you are |
| 23 | +any author of a library that tries to replicate NumPy array semantics, you |
| 24 | +will have to work with these objects. However, they are often difficult to |
| 25 | +work with: |
| 26 | + |
| 27 | +- The different types that are valid indices for NumPy arrays do not have a |
| 28 | + uniform API. Most of the types are also standard Python types, such as |
| 29 | + `tuple`, `list`, `int`, and `None`, which are usually unrelated to indexing. |
| 30 | + |
| 31 | +- Those objects that are specific to indexes, such as `slice` and `Ellipsis` |
| 32 | + do not make any assumptions about their underlying semantics. For example, |
| 33 | + Python lets you create `slice(None, None, 0)` or `slice(0, 0.5)` even though |
| 34 | + `a[::0]` and `a[0:0.5]` would be always be an `IndexError` on a NumPy array. |
| 35 | + |
| 36 | +- Some index objects, such as `slice`, `list`, and `ndarray` are not hashable. |
| 37 | + |
| 38 | +- NumPy itself does not offer much in the way of helper functions to work with |
| 39 | + these objects. |
| 40 | + |
| 41 | +These limitations may be annoying, but are easy enough to live with. The real |
| 42 | +challenge when working with indices comes when you try to manipulate them. |
| 43 | +Slices in particular are challenging to work with because the rich meaning of |
| 44 | +slice semantics. Writing formulas for even very simple things is a real |
| 45 | +challenge with slices. `slice(start, stop, step)` (corresponding to |
| 46 | +`a[start:stop:step]`) has fundamentally different meaning depending on whether |
| 47 | +`start`,`stop`, or `step` are negative, nonnegative, or `None`. As an example, |
| 48 | +take `a[4:-2:-2]`, where `a` is a one-dimensional array. This slices every |
| 49 | +other element from the third element to the second from the last. What will |
| 50 | +the shape of this sliced array be? The answer is `(0,)` if the original shape |
| 51 | +is less than 1 or greater than 5, and `(1,)` otherwise. |
| 52 | + |
| 53 | +Code that manipulates slices will tend to have a lot of `if`/`else` chains for |
| 54 | +these different cases. And due to 0-based indexing, half-open semantics, |
| 55 | +wraparound behavior, clipping, and step logic, the formulas are often quite |
| 56 | +difficult to write down. |
| 57 | + |
| 58 | +## ndindex |
| 59 | + |
| 60 | +This is where ndindex comes in. ndindex is a new library that provides high |
| 61 | +level objects representing the various objects that can index NumPy arrays. |
| 62 | +These objects automatically canonicalize under the assumption of NumPy |
| 63 | +indexing semantics, and can be manipulated with a uniform API. All ndindex |
| 64 | +types have a `.args` that can be used to access the arguments used to create |
| 65 | +the object, and they are all hashable. |
| 66 | + |
| 67 | +```py |
| 68 | +>>> from ndindex import Slice, Integer, Tuple |
| 69 | +>>> Slice(0, 3) |
| 70 | +Slice(0, 3, 1) |
| 71 | +>>> idx = Tuple(Slice(0, 10), Integer(0)) |
| 72 | +>>> idx.args |
| 73 | +(Slice(0, 10, 1), Integer(0)) |
| 74 | +>>> [i.args for i in idx.args] |
| 75 | +[(0, 10, 1), (0,)] |
| 76 | +``` |
| 77 | + |
| 78 | +The goal of ndindex is to give 100% correct semantics as defined by NumPy's |
| 79 | +ndarray. This means that ndindex will not make a transformation on an index |
| 80 | +object unless it is correct for all possible input array shapes. The only |
| 81 | +exception to this rule is that ndindex assumes that any given index will not |
| 82 | +raise IndexError (for instance, from an out of bounds integer index or from |
| 83 | +too few dimensions). For those operations where the array shape is known, |
| 84 | +there is a `reduce` method to reduce an index to a simpler index that is |
| 85 | +equivalent for the given shape. |
| 86 | + |
| 87 | +## Features |
| 88 | + |
| 89 | +ndindex is still a work in progress. The following things are currently |
| 90 | +implemented: |
| 91 | + |
| 92 | +- `Slice`, `Integer`, and `Tuple` |
| 93 | + |
| 94 | +- Constructing a class puts it into canonical form. For example |
| 95 | + |
| 96 | + ```python |
| 97 | + >>> from ndindex import Slice |
| 98 | + >>> Slice(None, 12) |
| 99 | + Slice(0, 12, 1) |
| 100 | + ``` |
| 101 | + |
| 102 | +- Object arguments can be accessed with `idx.args` |
| 103 | + |
| 104 | + ```py |
| 105 | + >>> Slice(1, 3).args |
| 106 | + (1, 3, 1) |
| 107 | + ``` |
| 108 | + |
| 109 | +- All ndindex objects are hashable and can be used as dictionary keys. |
| 110 | + |
| 111 | +- A real index object can be accessed with `idx.raw`. Use this to use an |
| 112 | + ndindex to index an array. |
| 113 | + |
| 114 | + ```py |
| 115 | + >>> s = Slice(0, 2) |
| 116 | + >>> from numpy import arange |
| 117 | + >>> arange(4)[s.raw] |
| 118 | + array([0, 1]) |
| 119 | + ``` |
| 120 | + |
| 121 | +- `len()` computes the maximum length of an index over a given axis. |
| 122 | + |
| 123 | + ```py |
| 124 | + >>> len(Slice(2, 10, 3)) |
| 125 | + 3 |
| 126 | + >>> len(arange(10)[2:10:3]) |
| 127 | + 3 |
| 128 | + ``` |
| 129 | + |
| 130 | +- `idx.reduce(shape)` reduces an index to an equivalent index over an array |
| 131 | + with the given shape. |
| 132 | + |
| 133 | + ```py |
| 134 | + >>> Slice(2, -1).reduce((10,)) |
| 135 | + Slice(2, 9, 1) |
| 136 | + >>> arange(10)[2:-1] |
| 137 | + array([2, 3, 4, 5, 6, 7, 8]) |
| 138 | + >>> arange(10)[2:9:1] |
| 139 | + array([2, 3, 4, 5, 6, 7, 8]) |
| 140 | + ``` |
| 141 | + |
| 142 | +The following things are not yet implemented, but are planned. |
| 143 | + |
| 144 | +- `idx.newshape(shape)` returns the shape of `a[idx]`, assuming `a` has shape |
| 145 | + `shape`. |
| 146 | + |
| 147 | +- `ellipsis`, `Newaxis`, `IntegerArray`, and `BooleanArray` types, so that all |
| 148 | + types of indexing are supported. |
| 149 | + |
| 150 | +- `i1[i2]` will create a new ndindex `i3` (when possible) so that |
| 151 | + `a[i1][i2] == a[i3]`. |
| 152 | + |
| 153 | +- `split(i0, [i1, i2, ...])` will return a list of indices `[j1, j2, ...]` |
| 154 | + such that `a[i0] = concat(a[i1][j1], a[i2][j2], ...)` |
| 155 | + |
| 156 | +- `i1 + i2` will produce a single index so that `a[i1 + i2]` gives all the |
| 157 | + elements of `a[i1]` and `a[i2]`. |
| 158 | + |
| 159 | +- Support [NEP 21 outer indexing and vectorized |
| 160 | + indexin](https://numpy.org/neps/nep-0021-advanced-indexing.html). |
| 161 | + |
| 162 | +And more. If there is something you would like to see this library be able to |
| 163 | +do, please [open an issue](https://github.com/Quansight-Labs/ndindex/issues). Pull |
| 164 | +requests are welcome as well. |
| 165 | + |
| 166 | +## Testing and correctness |
| 167 | + |
| 168 | +The most important priority for a library like this is correctness. Index |
| 169 | +manipulations, and especially slice manipulations, are complicated to code |
| 170 | +correctly, and the code for them typically involves dozens of different |
| 171 | +branches for different cases and formulas that can be difficult to figure out. |
| 172 | + |
| 173 | +In order to assure correctness, all operations are tested extensively against |
| 174 | +NumPy itself to ensure they give the same results. The basic idea is to take |
| 175 | +the pure Python `index` and the `ndindex(index).raw`, or in the case of a |
| 176 | +transformation, the before and after raw index, and index a `numpy.arange` |
| 177 | +with them (the input array itself doesn't matter, so long as its values are |
| 178 | +distinct). If they do not give the same output array, or do not both produce |
| 179 | +the same error (like an `IndexError`), the code is not correct. For example, |
| 180 | +the `reduce` method can be verified by checking that `a[idx.raw]` and |
| 181 | +`a[idx.reduce(a.shape).raw]` produce the same sub-arrays for all possible |
| 182 | +input arrays `a` and ndindex objects `idx`. |
| 183 | + |
| 184 | +There are two primary types of tests that ndindex employs to verify this: |
| 185 | + |
| 186 | +- **Exhaustive tests.** These test every possible value in some range. For |
| 187 | + example, `Slice` tests test all possible `start`, `stop`, and `step` values |
| 188 | + in the range [-10, 10], as well as `None`, on `numpy.arange(n)` for `n` in |
| 189 | + the range [0, 10]. This is the best type of test, because it checks every |
| 190 | + possible case. Unfortunately, it is often impossible to do full exhaustive |
| 191 | + testing due to combinatorial explosion. |
| 192 | + |
| 193 | + For example, here is the exhaustive test for `Slice.reduce`: |
| 194 | + |
| 195 | + ```py |
| 196 | + def _iterslice(start_range=(-10, 10), stop_range=(-10, 10), step_range=(-10, 10)): |
| 197 | + for start in chain(range(*start_range), [None]): |
| 198 | + for stop in chain(range(*stop_range), [None]): |
| 199 | + for step in chain(range(*step_range), [None]): |
| 200 | + yield (start, stop, step) |
| 201 | + |
| 202 | + def test_slice_reduce_exhaustive(): |
| 203 | + for n in range(10): |
| 204 | + a = arange(n) |
| 205 | + for start, stop, step in _iterslice(): |
| 206 | + try: |
| 207 | + s = Slice(start, stop, step) |
| 208 | + except ValueError: |
| 209 | + continue |
| 210 | + |
| 211 | + check_same(a, s.raw, func=lambda x: x.reduce((n,))) |
| 212 | + |
| 213 | + reduced = s.reduce((n,)) |
| 214 | + assert reduced.start >= 0 |
| 215 | + # We cannot require stop > 0 because if stop = None and step < 0, the |
| 216 | + # only equivalent stop that includes 0 is negative. |
| 217 | + assert reduced.stop != None |
| 218 | + assert len(reduced) == len(a[reduced.raw]), (s, n) |
| 219 | + ``` |
| 220 | + |
| 221 | + `check_same` is a [helper |
| 222 | + function](https://github.com/Quansight-Labs/ndindex/blob/f8706a6fb6ffac879a0863cb93243f9bb14e6487/ndindex/tests/helpers.py#L60-L82) |
| 223 | + that ensures that two indices give either the exact same subarray or raise |
| 224 | + the exact same exception. The test checks all `a[start:stop:step]` where |
| 225 | + `a` is an array with shape from 0 to 10, and `start`, `stop`, and `step` |
| 226 | + range from -10 to 10 or `None`. We also test some basic invariants, such |
| 227 | + as that `Slice.reduce` always returns a slice with non-None arguments and |
| 228 | + that the start is nonnegative, and that the length of the slice is |
| 229 | + minimized for the given shape. |
| 230 | + |
| 231 | + This test takes about 4 seconds to run, and is about at the limit of what is |
| 232 | + possible with exhaustive testing. Other objects, in particular `Tuple`, have |
| 233 | + so many possible combinations that a similar exhaustive test for them would |
| 234 | + take billions of years to complete. |
| 235 | + |
| 236 | +- **Hypothesis tests.** |
| 237 | + [Hypothesis](https://hypothesis.readthedocs.io/en/latest/index.html) is a |
| 238 | + library that can intelligently check a combinatorial search space of inputs. |
| 239 | + This requires writing Hypothesis strategies that can generate all the |
| 240 | + relevant types of indices. All ndindex tests have Hypothesis tests, even if |
| 241 | + they are also tested exhaustively. |
| 242 | + |
| 243 | + The Hypothesis test for the above test looks like this |
| 244 | + |
| 245 | + ```py |
| 246 | + from hypothesis import assume |
| 247 | + from hypothesis.strategies import integers, composite, none, one_of, lists |
| 248 | + |
| 249 | + # hypothesis.strategies.tuples only generates tuples of a fixed size |
| 250 | + @composite |
| 251 | + def tuples(draw, elements, *, min_size=0, max_size=None, unique_by=None, |
| 252 | + unique=False): |
| 253 | + return tuple(draw(lists(elements, min_size=min_size, max_size=max_size, |
| 254 | + unique_by=unique_by, unique=unique))) |
| 255 | + |
| 256 | + # Valid shapes for numpy arrays. Filter out shapes that would fill memory. |
| 257 | + shapes = tuples(integers(0, 10)).filter(lambda shape: prod([i for i in shape if i]) < 100000) |
| 258 | + |
| 259 | + @composite |
| 260 | + def slices(draw, start=ints(), stop=ints(), step=ints()): |
| 261 | + return slice( |
| 262 | + draw(one_of(none(), start)), |
| 263 | + draw(one_of(none(), stop)), |
| 264 | + draw(one_of(none(), step)), |
| 265 | + ) |
| 266 | + |
| 267 | + @given(slices(), shapes) |
| 268 | + def test_slice_reduce_hypothesis(s, shape): |
| 269 | + a = arange(prod(shape)).reshape(shape) |
| 270 | + try: |
| 271 | + s = Slice(s) |
| 272 | + except ValueError: |
| 273 | + assume(False) |
| 274 | + |
| 275 | + check_same(a, s.raw, func=lambda x: x.reduce(shape)) |
| 276 | + |
| 277 | + try: |
| 278 | + reduced = s.reduce(shape) |
| 279 | + except IndexError: |
| 280 | + # shape == () |
| 281 | + return |
| 282 | + assert reduced.start >= 0 |
| 283 | + # We cannot require stop > 0 because if stop = None and step < 0, the |
| 284 | + # only equivalent stop that includes 0 is negative. |
| 285 | + assert reduced.stop != None |
| 286 | + assert len(reduced) == len(a[reduced.raw]), (s, shape) |
| 287 | + ``` |
| 288 | + |
| 289 | + In order to tell Hypothesis how to search the example space, we must define |
| 290 | + some functions to tell it how to draw example objects of a given type, in |
| 291 | + this case, slices and shape parameters for NumPy arrays. These strategies, |
| 292 | + as they are called, can be reused for multiple tests. Hypothesis then |
| 293 | + automatically and intelligently draws examples from the sample space to try |
| 294 | + to find one that fails the test. You can think of Hypothesis as a fuzzer, or |
| 295 | + as an "automated QA engineer". It tries to pick examples that are most |
| 296 | + likely to hit corner cases or different branch conditions. |
| 297 | + |
| 298 | +Why bother with Hypothesis if the same thing is already tested exhaustively? |
| 299 | +The main reason is that Hypothesis is much better at producing human-readable |
| 300 | +failure examples. When an exhaustive test fails, the failure will always be |
| 301 | +from the first set of inputs in the loop that produces a failure. Hypothesis |
| 302 | +on the other hand attempts to "shrink" the failure input to smallest input |
| 303 | +that still fails. For example, a failing exhaustive slice test might give |
| 304 | +`Slice(-10, -9, -10)` as a the failing example, but Hypothesis would shrink it |
| 305 | +to `Slice(-2, -1, -1)`. |
| 306 | + |
| 307 | +Another reason for the duplication is that Hypothesis can sometimes test a |
| 308 | +slightly expanded test space without any additional consequences. For example, |
| 309 | +the above Hypothesis tests all types of array shapes, whereas the exhaustive |
| 310 | +test tests only 1-dimensional shapes. This doesn't affect things because |
| 311 | +Hypothesis will always shrink large shapes to a 1-dimensional shape in the |
| 312 | +case of a failure, and it has the benefit of ensuring the code works correctly |
| 313 | +for larger shapes (it should always slice over the first index, or in the case |
| 314 | +of an empty shape raise `IndexError`). |
| 315 | + |
| 316 | +## Try it out |
| 317 | + |
| 318 | +You can install ndindex with pip or from conda-forge |
| 319 | + |
| 320 | + conda install -c conda-forge ndindex |
| 321 | + |
| 322 | +The documentation can be found [here](https://quansight-labs.github.io/ndindex/), |
| 323 | +and the development is on [GitHub](https://github.com/Quansight-Labs/ndindex). |
| 324 | +Please try the library out and |
| 325 | +[report](https://github.com/Quansight-Labs/ndindex/issues) any issues you have, or |
| 326 | +things you would like to see implemented. We are also looking for people who |
| 327 | +are interested in using the library and for people who are interested in |
| 328 | +contributing to it. |
0 commit comments