| 1 | +--- |
| 2 | +title: 'Python Sorted Containers' |
| 3 | +tags: |
| 4 | + - Python |
| 5 | + - sorted |
| 6 | + - list |
| 7 | + - dictionary |
| 8 | + - set |
| 9 | +authors: |
| 10 | + - name: Grant Jenks |
| 11 | +   orcid: 0000-0000-0000-0000 |
| 12 | +   affiliation: 1 |
| 13 | +affiliations: |
| 14 | + - name: Lyman Spitzer, Jr. Fellow, Princeton University |
| 15 | +   index: 1 |
| 16 | +date: 23 February 2019 |
| 17 | +bibliography: paper.bib |
| 18 | +--- |
| 19 | + |
| 20 | +# Summary |
| 21 | + |
| 22 | +You’re probably more familiar with sorted collections than you realize. |
| 23 | + |
| 24 | +In the standard library we have heapq, bisect and queue.PriorityQueue but they |
| 25 | +don’t quite fill the gap. Behind the scenes, priority queue uses a heap |
| 26 | +implementation. Another common mistake is to think that collections.OrderedDict |
| 27 | +is a dictionary that maintains sort order but that’s not the case. |
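
A short illustration of that distinction, using only the standard library. The sample keys and numbers are made up for this sketch: `OrderedDict` remembers insertion order rather than sort order, and a heap only guarantees that the smallest item sits at index 0.

```python
import heapq
from collections import OrderedDict

# OrderedDict preserves insertion order, not sorted order.
od = OrderedDict()
for key in ["pear", "apple", "fig"]:
    od[key] = len(key)
insertion_order = list(od)
sorted_order = sorted(od)

# A heap keeps the minimum at index 0; the rest is only partially ordered.
heap = [5, 1, 4, 2, 3]
heapq.heapify(heap)
smallest = heap[0]
```

Iterating `od` gives the keys as they were inserted, so sorted iteration still needs an explicit `sorted` call on every pass.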
| 28 | + |
| 29 | +I don’t always import sorted types. But when I do, I expect them in the |
| 30 | +standard library. |
| 31 | + |
| 32 | +And here’s why. Java, C++ and .NET have them. Python has broken into the top |
| 33 | +five of the TIOBE index but feels a bit more like PHP or JavaScript in this |
| 34 | +regard. |
| 35 | + |
| 36 | +We also depend on external solutions: SQLite in-memory indexes, |
| 37 | +pandas.DataFrame indexes, and Redis sorted sets. If you’ve ever issued a “zadd” |
| 38 | +command to Redis then you used a sorted collection. |
| 39 | + |
| 40 | +So what should be the API of sorted collection types in Python? |
| 41 | + |
| 42 | +Well, a sorted list should be a MutableSequence. Pretty close to the “list” |
| 43 | +API. But there’s a sort order constraint that must be satisfied by “setitem” |
| 44 | +and “insert” methods. It should also support a “key” argument like the “sorted” |
| 45 | +builtin function. Given sorted order, “bisect_right” and “bisect_left” methods |
| 46 | +make sense. You could also imagine an “add” method and “discard” method for |
| 47 | +elements. Kind of like a multi-set in other languages. I’d also expect |
| 48 | +“getitem”, “contains”, “count”, etc. to be faster than linear time. |
| 49 | + |
| 50 | +A sorted dictionary should be a MutableMapping. Pretty close to the dictionary |
| 51 | +API. But iteration yields items in sorted order. It should also support efficient |
| 52 | +positional indexing, something like a SequenceView. |
| 53 | + |
| 54 | +A sorted set should be a MutableSet. Pretty close to the “set” API. Sorted set |
| 55 | +should also be a Sequence like the “tuple” API to support efficient positional |
| 56 | +indexing. |
| 57 | + |
| 58 | +The chorus and the refrain from core developers is: “Look to the PyPI.” Which |
| 59 | +is good advice. |
| 60 | + |
| 61 | +So let’s talk about your options with a bit of software archaeology. |
| 62 | + |
| 63 | +Blist is the genesis of our story but it wasn’t really designed for sorted |
| 64 | +collections. It’s written in C and the innovation here is the “blist” data |
| 65 | +type. That’s a B-tree based replacement for CPython’s built-in list. Sorted |
| 66 | +list, sorted dictionary, and sorted set were built on top of this “blist” data |
| 67 | +type and it became the incumbent to beat. Also noteworthy is that the API was |
| 68 | +rather well thought out. |
| 69 | + |
| 70 | +There were some quirks; for example: the “pop” method returns the first element |
| 71 | +rather than the last element in the sorted list. |
| 72 | + |
| 73 | +SortedCollection is not a package. You can’t install this with “pip”. It’s |
| 74 | +simply a Python recipe that Raymond Hettinger linked from the Python |
| 75 | +docs. Couple innovations here though: it’s simple, it’s written in pure-Python, |
| 76 | +and maintains a parallel list of keys. So we have efficient support for that |
| 77 | +key-function parameter. |
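
The parallel-keys idea can be sketched in a few lines. This is a minimal illustration and not the recipe itself: the class name `KeyedSortedList` and the sample words are hypothetical. The key function is called once per item, and binary search runs over the stored keys only.

```python
from bisect import bisect_left


class KeyedSortedList:
    """Hypothetical sketch: items kept sorted by key(item) via a parallel key list."""

    def __init__(self, iterable=(), key=lambda item: item):
        self._key = key
        pairs = sorted(((key(item), item) for item in iterable),
                       key=lambda pair: pair[0])
        self._keys = [k for k, _ in pairs]        # sorted keys
        self._items = [item for _, item in pairs]  # items in the same order

    def add(self, item):
        k = self._key(item)
        i = bisect_left(self._keys, k)  # binary search on keys only
        self._keys.insert(i, k)
        self._items.insert(i, item)

    def __iter__(self):
        return iter(self._items)


words = KeyedSortedList(["kiwi", "fig", "banana"], key=len)
words.add("pear")
result = list(words)
```

Both lists are plain Python lists, so each insertion still shifts elements in linear time.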
| 78 | + |
| 79 | +This is bintrees. Still alive and kicking today. A few innovations here: it’s |
| 80 | +written with Cython support to improve performance and has a few different tree |
| 81 | +“backends.” You can create a red-black or AVL-tree depending on your |
| 82 | +needs. There’s also some notion of accessing the nodes themselves and |
| 83 | +customizing the tree traversal to slice by value rather than by index. |
| 84 | + |
| 85 | +Banyan had a very short life but adds another couple innovations: it’s |
| 86 | +incredibly fast and achieves that through C++ template meta-programming. It |
| 87 | +also has a feature called tree-augmentation that will let you store metadata at |
| 88 | +tree nodes. You can use this for interval trees if you need those. |
| 89 | + |
| 90 | +Finally there’s skiplistcollections. Couple significant things here: it’s |
| 91 | +pure-Python but fast, even for large collections, and it uses a skip-list data |
| 92 | +type rather than a binary tree. |
| 93 | + |
| 94 | +Altogether, you go on PyPI and try to figure this out and it’s kind of like |
| 95 | +this. It’s a mess. PyPI has really got to work better than using Google with |
| 96 | +the site operator. |
| 97 | + |
| 98 | +Couple others worth calling out: rbtree is another fast C-based |
| 99 | +implementation. And there’s a few like treap, splay and scapegoat that are |
| 100 | +contributions and experiments by Dan Stromberg. He’s also done some interesting |
| 101 | +benchmarking of the various tree types. There’s no silver bullet when it comes |
| 102 | +to trees. |
| 103 | + |
| 104 | +I love Python because there’s one right way to do things. If I just want sorted |
| 105 | +types, what’s the right answer? |
| 106 | + |
| 107 | +I couldn’t find the right answer so I built it. The missing battery: Sorted |
| 108 | +Containers. |
| 109 | + |
| 110 | +Here it is. This is the project home page. Sorted Containers is a Python sorted |
| 111 | +collections library with sorted list, sorted dictionary, and sorted set |
| 112 | +implementations. It’s pure-Python but it’s as fast as C-extensions. It’s Python |
| 113 | +2 and Python 3 compatible. It’s fully-featured. And it’s extensively tested |
| 114 | +with 100% coverage and hours of stress. |
| 115 | + |
| 116 | +Performance is a feature. That means graphs. Lots of them. There are 189 |
| 117 | +performance graphs in total. Let’s look at a few of them together. |
| 118 | + |
| 119 | +Here’s the performance of adding a random value to a sorted list. I’m comparing |
| 120 | +Sorted Containers with other competing implementations. |
| 121 | + |
| 122 | +Notice the axes are log-log. So if performance differs by major tick marks then |
| 123 | +one is actually ten times faster than the other. |
| 124 | + |
| 125 | +We see here that Sorted Containers is in fact about ten times faster than blist |
| 126 | +when it comes to adding random values to a sorted list. Notice also Raymond’s |
| 127 | +recipe is just a list and that displays order n-squared runtime |
| 128 | +complexity. That’s why it curves upwards. |
| 129 | + |
| 130 | +Of all the sorted collections libraries, Sorted Containers is also fastest at |
| 131 | +initialization. We’ll look at why soon. |
| 132 | + |
| 133 | +Sorted Containers is not always fastest. But notice here the performance |
| 134 | +improves with scale. You can see it there in blue. It starts in the middle of |
| 135 | +the pack and has a lesser slope than competitors. |
| 136 | + |
| 137 | +In short, Sorted Containers is kind of like a B-tree implementation. That means |
| 138 | +you can configure the fan-out of nodes in the tree. We call that the load |
| 139 | +parameter and there are extensive performance graphs of three different load |
| 140 | +parameters. |
| 141 | + |
| 142 | +Here we see that a load factor of ten thousand is fastest for indexing a sorted |
| 143 | +list. |
| 144 | + |
| 145 | +Notice the axes now go up to ten million elements. I’ve actually scaled |
| 146 | +SortedList all the way to ten billion elements. It was a really incredible |
| 147 | +experiment. I had to rent the largest high-memory instance available from |
| 148 | +Google Compute Engine. That benchmark required about 128 gigabytes of memory |
| 149 | +and cost me about thirty dollars. |
| 150 | + |
| 151 | +This is the performance of deleting a key from a sorted dictionary. Now the |
| 152 | +smaller load-factor is fastest. The default load-factor is 1,000 and works well |
| 153 | +for most scenarios. It’s a very sane default. |
| 154 | + |
| 155 | +In addition to comparisons and load-factors, I also benchmark runtimes. Here’s |
| 156 | +CPython 2.7, CPython 3.5 and PyPy version 5. You can see where the |
| 157 | +just-in-time compiler, the jit-compiler, kicks in. That’ll make Sorted |
| 158 | +Containers another ten times faster. |
| 159 | + |
| 160 | +Finally, I made a survey in 2015 on GitHub as to how people were using sorted |
| 161 | +collections. I noticed patterns like priority queues, multi-sets, |
| 162 | +nearest-neighbor algorithms, etc. |
| 163 | + |
| 164 | +This is the priority queue workload which spends 40% of its time adding |
| 165 | +elements, 40% popping elements, 10% discarding elements, and has a couple other |
| 166 | +methods. |
| 167 | + |
| 168 | +Sorted Containers is two to ten times faster in all of these scenarios. |
| 169 | + |
| 170 | +We also have a lot of features. The API is nearly a drop-in replacement for the |
| 171 | +“blist” and “rbtree” modules. But the quirks have been fixed so the “pop” |
| 172 | +method returns the last element rather than the first. |
| 173 | + |
| 174 | +Sorted lists are sorted so you can bisect them. Looking up the index of an |
| 175 | +element is also very fast. |
| 176 | + |
| 177 | +Bintrees introduced methods for tree traversal. And I’ve boiled those down to a |
| 178 | +couple API methods. On line 3, we see “irange”. Irange iterates all keys from |
| 179 | +bob to eve in sorted order. |
| 180 | + |
| 181 | +Sorted dictionaries also have a sequence-like view called iloc. If you’re |
| 182 | +coming from pandas that should look familiar. Line 4 creates a list of the five |
| 183 | +largest keys in the dictionary. |
| 184 | + |
| 185 | +Similar to “irange” there is an “islice” method. Islice does positional index |
| 186 | +slicing. In line 5 we create an iterator over the indexes 10 through 49 |
| 187 | +inclusive. |
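
What range iteration and positional slicing compute can be approximated on a plain sorted list with `bisect` and `itertools`. The helper names `irange_sketch` and `islice_sketch` and the sample names are hypothetical; this shows the behaviour described above, not the library's implementation.

```python
from bisect import bisect_left, bisect_right
from itertools import islice


def irange_sketch(sorted_keys, minimum, maximum):
    """Iterate keys k with minimum <= k <= maximum, in sorted order."""
    start = bisect_left(sorted_keys, minimum)
    stop = bisect_right(sorted_keys, maximum)
    return iter(sorted_keys[start:stop])


def islice_sketch(sorted_keys, start, stop):
    """Iterate keys at positions start up to, but not including, stop."""
    return islice(sorted_keys, start, stop)


names = sorted(["alice", "bob", "carol", "dave", "eve", "frank"])
bob_to_eve = list(irange_sketch(names, "bob", "eve"))
largest_two = names[-2:]
positions_1_to_3 = list(islice_sketch(names, 1, 4))
```

In this sketch the stop position is exclusive, so covering indexes 10 through 49 inclusive would mean `start=10, stop=50`.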
| 188 | + |
| 189 | +One of the benefits of being pure-Python: it’s easy to hack on. Over the years, |
| 190 | +a few patterns have emerged and become recipes. All of these are available from |
| 191 | +PyPI with pip install sortedcollections. |
| 192 | + |
| 193 | +If all that didn’t convince you that Sorted Containers is great then listen to |
| 194 | +what other smart people say about it: |
| 195 | + |
| 196 | +Alex Martelli says: Good stuff! … I like the simple, effective implementation |
| 197 | +idea of splitting the sorted containers into smaller “fragments” to avoid the |
| 198 | +O(N) insertion costs. |
| 199 | + |
| 200 | +Jeff Knupp writes: That last part, “fast as C-extensions,” was difficult to |
| 201 | +believe. I would need some sort of performance comparison to be convinced this |
| 202 | +is true. The author includes this in the docs. It is. |
| 203 | + |
| 204 | +Kevin Samuel says: I’m quite amazed, not just by the code quality (it’s |
| 205 | +incredibly readable and has more comment than code, wow), but the actual amount |
| 206 | +of work you put at stuff that is not code: documentation, benchmarking, |
| 207 | +implementation explanations. Even the git log is clean and the unit tests run |
| 208 | +out of the box on Python 2 and 3. |
| 209 | + |
| 210 | +If you’re new to sorted collections, I hope I’ve piqued your interest. Think |
| 211 | +about the achievement here. Sorted Containers is pure-Python but as fast as |
| 212 | +C-implementations. Let’s look under the hood of Sorted Containers at what makes |
| 213 | +it so fast. |
| 214 | + |
| 215 | +It really comes down to bisect for the heavy lifting. Bisect is a module in the |
| 216 | +standard library that implements binary search on lists. There’s also a handy |
| 217 | +function called insort that does a binary search and insertion for us in one |
| 218 | +call. There’s no magic here, it’s just implemented in C and part of the |
| 219 | +standard library. |
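
For readers who have not used the module, here is a small example of the two calls mentioned above. The numbers are arbitrary.

```python
from bisect import bisect_left, insort

values = [2, 3, 7, 11]

# Binary search: index of the first element >= 7.
i = bisect_left(values, 7)
found = i < len(values) and values[i] == 7

# Binary search plus insertion in one call; the list stays sorted.
insort(values, 5)
```

The search is logarithmic, but `insort` on one long list still shifts elements, which is why the size of each list matters.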
| 220 | + |
| 221 | +Here’s the basic structure. It’s just a list of sublists. So there’s a member |
| 222 | +variable called “lists” that points to sublists. Each of those is maintained in |
| 223 | +sorted order. You’ll sometimes hear me refer to these as the top-level list and |
| 224 | +its sublists. |
| 225 | + |
| 226 | +There’s no need to wrap sublists in their own objects. They are just |
| 227 | +lists. Simple is fast and efficient. |
| 228 | + |
| 229 | +In addition to the list of sublists, there’s an index called the maxes |
| 230 | +index. That simply stores the maximum value in each sublist. Now lists in |
| 231 | +CPython are simply arrays of pointers so we’re not adding much overhead with |
| 232 | +this index. |
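
As a concrete picture of the two structures described so far, here is a sketch with made-up contents. The variable names `lists` and `maxes` follow the text; the values are assumptions chosen only for illustration.

```python
# Hypothetical contents: three sorted sublists whose ranges do not overlap.
lists = [
    [1, 2, 4],
    [6, 8, 9],
    [11, 14, 17],
]

# One entry per sublist: its largest (last) element.
maxes = [sublist[-1] for sublist in lists]
```

Because every sublist is sorted and the sublists are in order, `maxes` is itself a sorted list and can be bisected.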
| 233 | + |
| 234 | +Let’s walk through testing membership with contains. Let’s look for element 14. |
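
The membership walk can be sketched as two binary searches: one over the maxes index to choose a sublist, and one inside that sublist. The function name `contains` and the sample values are assumptions for this sketch, not the project's source.

```python
from bisect import bisect_left

lists = [[1, 2, 4], [6, 8, 9], [11, 14, 17]]  # illustrative values
maxes = [sublist[-1] for sublist in lists]


def contains(value):
    # Step 1: bisect the maxes index to pick the only sublist that could hold value.
    pos = bisect_left(maxes, value)
    if pos == len(maxes):
        return False  # value is larger than every stored element
    # Step 2: bisect that sublist and check for an exact match.
    # idx is always in range because value <= maxes[pos] == sublist[-1].
    sublist = lists[pos]
    idx = bisect_left(sublist, value)
    return sublist[idx] == value


has_14 = contains(14)
has_5 = contains(5)
has_20 = contains(20)
```

Looking for 14 selects the third sublist, because 14 is greater than 9 and at most 17, and then finds 14 at index 1 of that sublist.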
| 235 | + |
| 236 | +Let’s also walk through adding an element. Let’s add 5 to the sorted list. |
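
Adding follows the same shape: bisect the maxes index to choose a sublist, then `insort` into it. The sketch below uses hypothetical values and a hypothetical `add` function. It deliberately omits splitting a sublist that grows too long, which a real implementation would also need.

```python
from bisect import bisect_right, insort

lists = [[1, 2, 4], [6, 8, 9], [11, 14, 17]]  # illustrative values
maxes = [4, 9, 17]


def add(value):
    if not maxes:
        # Empty structure: start the first sublist.
        lists.append([value])
        maxes.append(value)
        return
    pos = bisect_right(maxes, value)
    if pos == len(maxes):
        # Larger than everything: append to the last sublist and update its max.
        pos -= 1
        lists[pos].append(value)
        maxes[pos] = value
    else:
        # Otherwise insert into the chosen sublist; its max is unchanged.
        insort(lists[pos], value)


add(5)
```

Here 5 is greater than the first max, 4, and less than the second, 9, so it lands at the front of the second sublist.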
| 237 | + |
| 238 | +Now numeric indexing is a little more complex. Numeric indexing uses a tree |
| 239 | +packed densely into another list. I haven’t seen this structure described in |
| 240 | +textbooks or research so I’d like to call it a “Jenks” index. But I’ll also |
| 241 | +refer to it as the positional index. |
| 242 | + |
| 243 | +Let’s build the positional index together. |
| 244 | + |
| 245 | +Remember the positional index is a tree stored in a list, kind of like a heap. |
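
One way to build such an index is to sum sublist lengths pairwise, row by row, and then store the rows root first. The sketch below is an assumption-laden simplification: the function name `build_index` is hypothetical, it pads the leaves with zeros up to a power of two, and the sublist lengths 4, 3, 6, 5 are chosen to match the node values 18, 7, 11 and 6 in the lookup walk that follows (the 4 and 3 split is a guess).

```python
def build_index(lengths):
    """Pack a tree of pairwise sums into one list: root first, leaves last."""
    # Pad the leaf row with zeros to a power of two (a simplification).
    size = 1
    while size < len(lengths):
        size *= 2
    row = list(lengths) + [0] * (size - len(lengths))

    # Each parent row holds the sums of adjacent pairs in the row below.
    rows = [row]
    while len(row) > 1:
        row = [row[i] + row[i + 1] for i in range(0, len(row), 2)]
        rows.append(row)

    # Flatten from the root down, as in a heap layout.
    index = [node for r in reversed(rows) for node in r]
    offset = size - 1  # position of the first leaf in the flattened list
    return index, offset


index, offset = build_index([4, 3, 6, 5])
```

With this layout the children of the node at position `p` sit at positions `2 * p + 1` and `2 * p + 2`, and the root holds the total number of elements.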
| 246 | + |
| 247 | +Let’s use this to look up index 8. Starting at the root, 18, compare index to |
| 248 | +the left-child node. 8 is greater than 7 so we subtract 7 from 8 and move to |
| 249 | +the right-child node. Now at node 11, compare index again to the |
| 250 | +left-child node. 1 is less than 6, so we simply move to the left-child node. We |
| 251 | +terminate at 6 because it’s a leaf node. Our final index is 1 and our final |
| 252 | +position is 5. We calculate the top-level list index as the position minus the |
| 253 | +offset, which is 3 here, the number of nodes above the leaves. So our final |
| 254 | +coordinates are index 2 in the top-level list and index 1 in the sublist. |
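
The walk above can be sketched as a short loop. The index contents are assumptions: the values 18, 7, 11 and 6 come from the walk, while the remaining leaves (4, 3 and 5) are filled in so the sums work out. The function name `locate` is hypothetical.

```python
index = [18, 7, 11, 4, 3, 6, 5]  # root, two inner nodes, four leaves
offset = 3                       # position of the first leaf in `index`


def locate(idx):
    """Map a position in the whole sorted list to (sublist number, index in sublist).

    Assumes 0 <= idx < index[0]; no bounds checking is done in this sketch.
    """
    pos = 0
    child = 1  # left child of the root
    while child < len(index):
        left = index[child]
        if idx < left:
            pos = child           # target lies in the left subtree
        else:
            idx -= left           # skip everything under the left child
            pos = child + 1       # move to the right child
        child = 2 * pos + 1
    return pos - offset, idx


coordinates = locate(8)
```

Each step halves the remaining tree, so the lookup takes a number of steps logarithmic in the number of sublists.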
| 255 | + |
| 256 | +That’s it. Three lists maintain the elements, the maxes index, and the |
| 257 | +positional index. We’ve used simple built-in types to construct complex |
| 258 | +behavior. |
| 259 | + |
| 260 | +# Acknowledgements |
| 261 | + |
| 262 | +Thank you to the open source community that has contributed bug reports, |
| 263 | +documentation improvements, and feature guidance in development of the project. |
| 264 | + |
| 265 | +Significant interface design credit is due to Daniel Stutzbach and the "blist" |
| 266 | +software project which this project originally copied. |
| 267 | + |
| 268 | +# References |