Commit 977c247

Add initial draft skeleton of paper for JOSS
1 parent e19dc4a commit 977c247

2 files changed

+275
-0
lines changed
docs/paper.bib

Lines changed: 7 additions & 0 deletions
@@ -0,0 +1,7 @@
1+
@manual{Python,
2+
title = {Python Programming Language},
3+
author = {Python Core Developers},
4+
organization = {Python Software Foundation},
5+
year = {1995},
6+
url = {https://www.python.org/},
7+
}

docs/paper.md

Lines changed: 268 additions & 0 deletions
@@ -0,0 +1,268 @@
1+
---
2+
title: 'Python Sorted Containers'
3+
tags:
4+
- Python
5+
- sorted
6+
- list
7+
- dictionary
8+
- set
9+
authors:
10+
- name: Grant Jenks
11+
orcid: 0000-0000-0000-0000
12+
affiliation: 1
13+
affiliations:
14+
- name: Lyman Spitzer, Jr. Fellow, Princeton University
15+
index: 1
16+
date: 23 February 2019
17+
bibliography: paper.bib
18+
---
19+
20+
# Summary
21+
22+
You’re probably more familiar with sorted collections than you realize.
23+
24+
In the standard library we have heapq, bisect and queue.PriorityQueue but they
25+
don’t quite fill the gap. Behind the scenes, priority queue uses a heap
26+
implementation. Another common mistake is to think that collections.OrderedDict
27+
is a dictionary that maintains sort order but that’s not the case.
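
To make the gap concrete, here is a minimal sketch using only the standard library. The sample values are made up for illustration. It shows that a heap only promises the smallest item at the front, and that OrderedDict keeps insertion order rather than sort order.

```python
import heapq
from collections import OrderedDict

values = [5, 1, 4, 2, 3]  # illustrative data, not from the paper

# A heap only guarantees that the smallest item sits at index 0;
# the underlying list is generally not fully sorted.
heap = list(values)
heapq.heapify(heap)
smallest = heap[0]
heap_is_sorted = heap == sorted(heap)

# OrderedDict remembers the order keys were inserted, not their sort order.
ordered = OrderedDict((value, None) for value in values)
ordered_keys = list(ordered)

print(smallest, heap_is_sorted, ordered_keys)
```

So neither type lets you iterate in sorted order or index by rank without sorting again.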
28+
29+
I don’t always import sorted types. But when I do, I expect them in the
30+
standard library.
31+
32+
And here’s why. Java, C++ and .NET have them. Python has broken into the top
33+
five of the TIOBE index but feels a bit more like PHP or JavaScript in this
34+
regard.
35+
36+
We also depend on external solutions: SQLite in-memory indexes,
37+
pandas.DataFrame indexes, and Redis sorted sets. If you’ve ever issued a “zadd”
38+
command to Redis then you used a sorted collection.
39+
40+
So what should be the API of sorted collection types in Python?
41+
42+
Well, a sorted list should be a MutableSequence. Pretty close to the “list”
43+
API. But there’s a sort order constraint that must be satisfied by “setitem”
44+
and “insert” methods. It should also support a “key” argument like the “sorted”
45+
builtin function. Given sorted order, “bisect_right” and “bisect_left” methods
46+
make sense. You could also imagine an “add” method and “discard” method for
47+
elements. Kind of like a multi-set in other languages. I’d also expect
48+
“getitem”, “contains”, “count”, etc. to be faster than linear time.
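
As a rough sketch of those multi-set style operations, the helpers below work on a plain sorted list with the bisect module. The function names mirror the methods mentioned above but are hypothetical helpers, not the library's API. Lookups here are logarithmic, though inserting into a flat list still shifts elements in linear time.

```python
from bisect import bisect_left, bisect_right, insort


def add(sorted_values, value):
    """Insert value while keeping the list sorted (duplicates allowed)."""
    insort(sorted_values, value)


def discard(sorted_values, value):
    """Remove one occurrence of value if present; otherwise do nothing."""
    pos = bisect_left(sorted_values, value)
    if pos < len(sorted_values) and sorted_values[pos] == value:
        del sorted_values[pos]


def count(sorted_values, value):
    """Count occurrences with two binary searches instead of a scan."""
    return bisect_right(sorted_values, value) - bisect_left(sorted_values, value)


def contains(sorted_values, value):
    """Membership test by binary search."""
    pos = bisect_left(sorted_values, value)
    return pos < len(sorted_values) and sorted_values[pos] == value


numbers = []
for n in [3, 1, 2, 3, 3]:
    add(numbers, n)
discard(numbers, 2)
discard(numbers, 99)  # missing values are ignored

print(numbers, count(numbers, 3), contains(numbers, 1))
```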
49+
50+
A sorted dictionary should be a MutableMapping. Pretty close to the dictionary
51+
API. But iteration yields items in sorted order. It should also support efficient
52+
positional indexing, something like a SequenceView.
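
Here is one way that could look, as a toy sketch rather than the library's implementation. The class name TinySortedDict and the key_at method are hypothetical. It pairs a regular dict with a sorted list of keys so that iteration is ordered and keys can be fetched by position.

```python
from bisect import bisect_left, insort
from collections.abc import MutableMapping


class TinySortedDict(MutableMapping):
    """Hypothetical sketch: a dict plus a sorted list of its keys."""

    def __init__(self):
        self._data = {}
        self._keys = []

    def __getitem__(self, key):
        return self._data[key]

    def __setitem__(self, key, value):
        if key not in self._data:
            insort(self._keys, key)  # keep the key list sorted
        self._data[key] = value

    def __delitem__(self, key):
        del self._data[key]  # raises KeyError if the key is missing
        del self._keys[bisect_left(self._keys, key)]

    def __iter__(self):
        return iter(self._keys)  # iteration follows sort order

    def __len__(self):
        return len(self._keys)

    def key_at(self, index):
        """Positional access to keys (hypothetical helper name)."""
        return self._keys[index]


ages = TinySortedDict()
ages["eve"] = 31
ages["bob"] = 27
ages["dan"] = 45
del ages["dan"]
ages["amy"] = 19

print(list(ages.items()), ages.key_at(-1))
```

Because the class derives from MutableMapping, methods such as items, get, and update come for free from the five methods defined above.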
53+
54+
A sorted set should be a MutableSet. Pretty close to the “set” API. A sorted set
55+
should also be a Sequence like the “tuple” API to support efficient positional
56+
indexing.
57+
58+
The chorus and the refrain from core developers is: “Look to the PyPI.” Which
59+
is good advice.
60+
61+
So let’s talk about your options with a bit of software archaeology.
62+
63+
Blist is the genesis of our story but it wasn’t really designed for sorted
64+
collections. It’s written in C and the innovation here is the “blist” data
65+
type. That’s a B-tree based replacement for CPython’s built-in list. Sorted
66+
list, sorted dictionary, and sorted set were built on top of this “blist” data
67+
type and it became the incumbent to beat. Also noteworthy is that the API was
68+
rather well thought out.
69+
70+
There were some quirks; for example: the “pop” method returns the first element
71+
rather than the last element in the sorted list.
72+
73+
SortedCollection is not a package. You can’t install this with “pip”. It’s
74+
simply a Python recipe that Raymond Hettinger linked from the Python
75+
docs. A couple of innovations here though: it’s simple, it’s written in pure-Python,
76+
and it maintains a parallel list of keys. So we have efficient support for that
77+
key-function parameter.
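
The parallel-keys idea can be sketched in a few lines. This is my own simplified illustration of the technique, not the recipe's actual code, and the class name is hypothetical. The key function runs once per item, and binary search happens on the stored keys.

```python
from bisect import bisect_right


class KeyedSortedList:
    """Hypothetical sketch: items kept sorted by key via a parallel key list."""

    def __init__(self, iterable=(), key=lambda item: item):
        self._key = key
        pairs = sorted(((key(item), item) for item in iterable),
                       key=lambda pair: pair[0])
        self._keys = [k for k, _ in pairs]    # sorted keys
        self._items = [v for _, v in pairs]   # items in the same order

    def insert(self, item):
        k = self._key(item)                   # key computed once per insert
        pos = bisect_right(self._keys, k)     # search the keys, not the items
        self._keys.insert(pos, k)
        self._items.insert(pos, item)

    def __iter__(self):
        return iter(self._items)


words = KeyedSortedList(["pear", "fig"], key=len)
words.insert("banana")
words.insert("kiwi")

print(list(words))
```

Using bisect_right places new items after existing items with an equal key, which keeps insertion stable.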
78+
79+
Next is bintrees. Still alive and kicking today. A few innovations here: it’s
80+
written with Cython support to improve performance and has a few different tree
81+
“backends.” You can create a red-black or AVL-tree depending on your
82+
needs. There’s also some notion of accessing the nodes themselves and
83+
customizing the tree traversal to slice by value rather than by index.
84+
85+
Banyan had a very short life but adds another couple innovations: it’s
86+
incredibly fast and achieves that through C++ template meta-programming. It
87+
also has a feature called tree-augmentation that will let you store metadata at
88+
tree nodes. You can use this for interval trees if you need those.
89+
90+
Finally, there’s skiplistcollections. A couple of significant things here: it’s
91+
pure-Python but fast, even for large collections, and it uses a skip-list data
92+
type rather than a binary tree.
93+
94+
Altogether, you go on PyPI and try to figure this out and it’s kind of like
95+
this. It’s a mess. PyPI has really got to work better than using Google with
96+
the site operator.
97+
98+
A couple of others are worth calling out: rbtree is another fast C-based
99+
implementation. And there are a few like treap, splay and scapegoat that are
100+
contributions and experiments by Dan Stromberg. He’s also done some interesting
101+
benchmarking of the various tree types. There’s no silver bullet when it comes
102+
to trees.
103+
104+
I love Python because there’s one right way to do things. If I just want sorted
105+
types, what’s the right answer?
106+
107+
I couldn’t find the right answer so I built it. The missing battery: Sorted
108+
Containers.
109+
110+
Here it is. This is the project home page. Sorted Containers is a Python sorted
111+
collections library with sorted list, sorted dictionary, and sorted set
112+
implementations. It’s pure-Python but it’s as fast as C-extensions. It’s Python
113+
2 and Python 3 compatible. It’s fully-featured. And it’s extensively tested
114+
with 100% coverage and hours of stress.
115+
116+
Performance is a feature. That means graphs. Lots of them. There are 189
117+
performance graphs in total. Let’s look at a few of them together.
118+
119+
Here’s the performance of adding a random value to a sorted list. I’m comparing
120+
Sorted Containers with other competing implementations.
121+
122+
Notice the axes are log-log. So if performance differs by major tick marks then
123+
one is actually ten times faster than the other.
124+
125+
We see here that Sorted Containers is in fact about ten times faster than blist
126+
when it comes to adding random values to a sorted list. Notice also Raymond’s
127+
recipe is just a list and that displays order n-squared runtime
128+
complexity. That’s why it curves upwards.
129+
130+
Of all the sorted collections libraries, Sorted Containers is also fastest at
131+
initialization. We’ll look at why soon.
132+
133+
Sorted Containers is not always fastest. But notice here the performance
134+
improves with scale. You can see it there in blue. It starts in the middle of
135+
the pack and has a lesser slope than competitors.
136+
137+
In short, Sorted Containers is kind of like a B-tree implementation. That means
138+
you can configure the fan-out of nodes in the tree. We call that the load
139+
parameter and there are extensive performance graphs of three different load
140+
parameters.
141+
142+
Here we see that a load factor of ten thousand is fastest for indexing a sorted
143+
list.
144+
145+
Notice the axes now go up to ten million elements. I’ve actually scaled
146+
SortedList all the way to ten billion elements. It was a really incredible
147+
experiment. I had to rent the largest high-memory instance available from
148+
Google Compute Engine. That benchmark required about 128 gigabytes of memory
149+
and cost me about thirty dollars.
150+
151+
This is the performance of deleting a key from a sorted dictionary. Now the
152+
smaller load-factor is fastest. The default load-factor is 1,000 and works well
153+
for most scenarios. It’s a very sane default.
154+
155+
In addition to comparisons and load-factors, I also benchmark runtimes. Here’s
156+
CPython 2.7, CPython 3.5 and PyPy version 5. You can see where the
157+
just-in-time compiler, the jit-compiler, kicks in. That’ll make Sorted
158+
Containers another ten times faster.
159+
160+
Finally, I made a survey in 2015 on GitHub as to how people were using sorted
161+
collections. I noticed patterns like priority queues, multi-sets,
162+
nearest-neighbor algorithms, etc.
163+
164+
This is the priority queue workload which spends 40% of its time adding
165+
elements, 40% popping elements, 10% discarding elements, and has a couple other
166+
methods.
167+
168+
Sorted Containers is two to ten times faster in all of these scenarios.
169+
170+
We also have a lot of features. The API is nearly a drop-in replacement for the
171+
“blist” and “rbtree” modules. But the quirks have been fixed so the “pop”
172+
method returns the last element rather than the first.
173+
174+
Sorted lists are sorted so you can bisect them. Looking up the index of an
175+
element is also very fast.
176+
177+
Bintrees introduced methods for tree traversal. And I’ve boiled those down to a
178+
couple API methods. On line 3, we see “irange”. Irange iterates all keys from
179+
bob to eve in sorted order.
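
Since the slide code is not reproduced here, the sketch below shows the idea on a plain sorted list of keys. The helper is a stand-in written for illustration, and I am assuming both endpoints are included. The names other than bob and eve are made up.

```python
from bisect import bisect_left, bisect_right


def irange(sorted_keys, minimum, maximum):
    """Yield keys with minimum <= key <= maximum, in sorted order."""
    start = bisect_left(sorted_keys, minimum)   # first key >= minimum
    stop = bisect_right(sorted_keys, maximum)   # one past the last key <= maximum
    for pos in range(start, stop):
        yield sorted_keys[pos]


names = ["alice", "bob", "carol", "dave", "eve", "frank"]  # already sorted
selected = list(irange(names, "bob", "eve"))

print(selected)
```

Two binary searches find the boundaries, so only the keys inside the range are ever visited.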
180+
181+
Sorted dictionaries also have a sequence-like view called iloc. If you’re
182+
coming from pandas that should look familiar. Line 4 creates a list of the five
183+
largest keys in the dictionary.
184+
185+
Similar to “irange” there is an “islice” method. Islice does positional index
186+
slicing. In line 5 we create an iterator over the indexes 10 through 49
187+
inclusive.
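
To pin down what “10 through 49 inclusive” means, here is a small stand-in over a flat sorted list. It is a hypothetical helper, and I am assuming a half-open range where the stop position 50 is excluded, as with ordinary Python slices.

```python
def islice(sorted_values, start, stop):
    """Lazily yield the values at positions start <= pos < stop."""
    for pos in range(start, min(stop, len(sorted_values))):
        yield sorted_values[pos]


values = list(range(0, 200, 2))          # 100 sorted even numbers (made-up data)
window = list(islice(values, 10, 50))    # positions 10 through 49

print(len(window), window[0], window[-1])
```

Unlike a slice, the generator does not copy the selected range up front.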
188+
189+
One of the benefits of being pure-Python: it’s easy to hack on. Over the years,
190+
a few patterns have emerged and become recipes. All of these are available from
191+
PyPI with pip install sortedcollections.
192+
193+
If all that didn’t convince you that Sorted Containers is great then listen to
194+
what other smart people say about it:
195+
196+
Alex Martelli says: Good stuff! … I like the simple, effective implementation
197+
idea of splitting the sorted containers into smaller “fragments” to avoid the
198+
O(N) insertion costs.
199+
200+
Jeff Knupp writes: That last part, “fast as C-extensions,” was difficult to
201+
believe. I would need some sort of performance comparison to be convinced this
202+
is true. The author includes this in the docs. It is.
203+
204+
Kevin Samuel says: I’m quite amazed, not just by the code quality (it’s
205+
incredibly readable and has more comment than code, wow), but the actual amount
206+
of work you put at stuff that is not code: documentation, benchmarking,
207+
implementation explanations. Even the git log is clean and the unit tests run
208+
out of the box on Python 2 and 3.
209+
210+
If you’re new to sorted collections, I hope I’ve piqued your interest. Think
211+
about the achievement here. Sorted Containers is pure-Python but as fast as
212+
C-implementations. Let’s look under the hood of Sorted Containers at what makes
213+
it so fast.
214+
215+
It really comes down to bisect for the heavy lifting. Bisect is a module in the
216+
standard library that implements binary search on lists. There’s also a handy
217+
method called insort that does a binary search and insertion for us in one
218+
call. There’s no magic here, it’s just implemented in C and part of the
219+
standard library.
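
For readers who haven’t used the module, here is a short example of the two standard-library calls mentioned above. The scores are made up.

```python
import bisect

scores = [10, 20, 30, 40]  # must already be sorted

# Binary search: position of the first element >= 30.
position = bisect.bisect_left(scores, 30)

# Binary search plus insertion in one call; the list stays sorted.
bisect.insort(scores, 25)

print(position, scores)
```

The search is logarithmic, but the insertion still shifts later elements, which is why one long list stops scaling.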
220+
221+
Here’s the basic structure. It’s just a list of sublists. So there’s a member
222+
variable called “lists” that points to sublists. Each of those is maintained in
223+
sorted order. You’ll sometimes hear me refer to these as the top-level list and
224+
its sublists.
225+
226+
There’s no need to wrap sublists in their own objects. They are just
227+
lists. Simple is fast and efficient.
228+
229+
In addition to the list of sublists, there’s an index called the maxes
230+
index. That simply stores the maximum value in each sublist. Now lists in
231+
CPython are simply arrays of pointers so we’re not adding much overhead with
232+
this index.
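
A picture of that layout in code might look like this. The element values are invented for illustration, and the variable names follow the description above rather than the library source.

```python
from itertools import chain

# Top-level list of sublists; each sublist is itself sorted (made-up values).
lists = [
    [1, 2, 4, 6],
    [7, 9, 10],
    [11, 12, 14, 15, 16, 17],
    [18, 20, 21, 22, 23],
]

# The maxes index holds the last (largest) value of each sublist.
maxes = [sublist[-1] for sublist in lists]

# Walking the sublists in order yields every element in sorted order.
flattened = list(chain.from_iterable(lists))

print(maxes)
```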
233+
234+
Let’s walk through testing membership with contains. Let’s look for element 14.
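
In code, the walk-through is two binary searches: one on the maxes index to pick a sublist, and one inside that sublist. This is a simplified sketch with made-up sublist values, not the library source.

```python
from bisect import bisect_left


def contains(lists, maxes, value):
    """Membership test on a list of sorted sublists plus a maxes index."""
    pos = bisect_left(maxes, value)      # first sublist whose max is >= value
    if pos == len(maxes):
        return False                     # value is larger than every element
    sublist = lists[pos]
    idx = bisect_left(sublist, value)    # idx is in range because value <= sublist[-1]
    return sublist[idx] == value


lists = [[1, 2, 4, 6], [7, 9, 10], [11, 12, 14, 15, 16, 17], [18, 20, 21, 22, 23]]
maxes = [sublist[-1] for sublist in lists]

print(contains(lists, maxes, 14), contains(lists, maxes, 13))
```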
235+
236+
Let’s also walk through adding an element. Let’s add 5 to the sorted list.
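
Adding follows the same path and then inserts. The sketch below also splits a sublist once it grows past twice the load. That threshold, the tiny load values, and the sample data are assumptions chosen to keep the demo small, and the positional index is left out entirely.

```python
from bisect import bisect_left, insort


def add(lists, maxes, value, load=4):
    """Add value to a list of sorted sublists, splitting oversized sublists."""
    if not maxes:
        lists.append([value])
        maxes.append(value)
        return
    pos = bisect_left(maxes, value)
    if pos == len(maxes):
        pos -= 1                          # larger than everything: use last sublist
    sublist = lists[pos]
    insort(sublist, value)
    maxes[pos] = sublist[-1]
    if len(sublist) > 2 * load:           # assumed split threshold
        half = len(sublist) // 2
        lists.insert(pos + 1, sublist[half:])
        del sublist[half:]
        maxes[pos] = sublist[-1]
        maxes.insert(pos + 1, lists[pos + 1][-1])


lists = [[1, 2, 4, 6], [7, 9, 10]]
maxes = [6, 10]

add(lists, maxes, 5)                      # lands in the first sublist
after_five = [list(sublist) for sublist in lists]

add(lists, maxes, 3, load=2)              # small load forces a split
print(lists, maxes)
```

Each insertion only shifts elements within one short sublist, which is what avoids the linear cost of a single long list.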
237+
238+
Now numeric indexing is a little more complex. Numeric indexing uses a tree
239+
packed densely into another list. I haven’t seen this structure described in
240+
textbooks or research so I’d like to call it a “Jenks” index. But I’ll also
241+
refer to it as the positional index.
242+
243+
Let’s build the positional index together.
244+
245+
Remember the positional index is a tree stored in a list, kind of like a heap.
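
One way to build such an index is to start from the sublist lengths and sum them pairwise up to the root. This is a sketch under assumptions: the lengths 4, 3, 6 and 5 are chosen to agree with the numbers in the lookup walk-through, and padding the leaves with zeros up to a power of two is my simplification.

```python
def build_index(lens):
    """Return (index, offset): a tree of counts packed into one list.

    The root is at position 0, the children of position p are at 2p+1 and
    2p+2, and the leaves (the sublist lengths) start at position offset.
    """
    size = 1
    while size < len(lens):
        size *= 2
    row = list(lens) + [0] * (size - len(lens))   # pad leaves with zeros
    rows = [row]
    while len(row) > 1:
        # Each parent stores the sum of its two children.
        row = [row[i] + row[i + 1] for i in range(0, len(row), 2)]
        rows.append(row)
    index = [count for level in reversed(rows) for count in level]
    return index, size - 1


index, offset = build_index([4, 3, 6, 5])
print(index, offset)
```

The root holds the total number of elements, and every inner node holds the number of elements beneath it.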
246+
247+
Let’s use this to look up index 8. Starting at the root, 18, compare index to
248+
the left-child node. 8 is greater than 7 so we subtract 7 from 8 and move to
249+
the right-child node. Again, now at node 11, compare index again to the
250+
left-child node. 1 is less than 6, so we simply move to the left-child node. We
251+
terminate at 6 because it’s a leaf node. Our final index is 1 and our final
252+
position is 5. We calculate the top-level list index as the position minus the
253+
offset. So our final coordinates are index 2 in the top-level list and index 1
254+
in the sublist.
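
The same descent can be written as a short loop. The tree below uses the node values from the walk-through (18, 7, 11 and 6). The remaining leaf values 4, 3 and 5 are my assumption, picked so the sums work out, and the function name is hypothetical.

```python
def locate(index, offset, idx):
    """Translate a flat position into (top-level list index, sublist index)."""
    pos = 0                               # start at the root
    while pos < offset:                   # positions before offset are inner nodes
        left = 2 * pos + 1
        if idx < index[left]:
            pos = left                    # target lies in the left subtree
        else:
            idx -= index[left]            # skip everything under the left child
            pos = left + 1                # move to the right child
    return pos - offset, idx


index = [18, 7, 11, 4, 3, 6, 5]           # root, two inner nodes, four leaves
offset = 3                                # leaves start at position 3

print(locate(index, offset, 8))
```

The loop runs once per tree level, so the cost grows with the logarithm of the number of sublists.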
255+
256+
That’s it. Three lists maintain the elements, the maxes index, and the
257+
positional index. We’ve used simple built-in types to construct complex
258+
behavior.
259+
260+
# Acknowledgements
261+
262+
Thank you to the open source community that has contributed bug reports,
263+
documentation improvements, and feature guidance in development of the project.
264+
265+
Significant interface design credit is due to Daniel Stutzbach and the "blist"
266+
software project, whose interface this project originally copied.
267+
268+
# References
