Commit 724f80b

JostMigenda authored and Robadob committed
add bookshelf analogy for hashing data structures; move technical explanation into a callout
1 parent 5329867 commit 724f80b

3 files changed (+36, -1 lines changed)

episodes/fig/bookshelf_dict.jpg

288 KB

episodes/fig/bookshelf_list.jpg

119 KB

episodes/optimisation-data-structures-algorithms.md

Lines changed: 36 additions & 1 deletion
@@ -151,8 +151,41 @@ Since Python 3.6, the items within a dictionary will iterate in the order that t
151151

152152
### Hashing Data Structures
153153

154-
<!-- simple explanation of how a hash-based data structure works -->
155154
Python's dictionaries are implemented as hashing data structures.
155+
Explaining how these work will get a bit technical, so let’s start with an analogy:
156+
157+
A Python list is like having a single long bookshelf. When you buy a new book (append a new element to the list), you place it at the far end of the shelf, right after all the previous books.
158+
159+
![A bookshelf corresponding to a Python list.](episodes/fig/bookshelf_list.jpg){alt="An image of a single long bookshelf, with a large number of books."}
160+
161+
A hashing data structure is more like a bookcase with several shelves, labelled by genre (sci-fi, romance, children’s books, non-fiction, …) and author surname. When you buy a new book by Jules Verne, you might place it on the shelf labelled “Sci-Fi, V–Z”.
162+
And if you keep adding more books, at some point you’ll move to a larger bookcase with more shelves (and thus more fine-grained sorting), to make sure you don’t have too many books on a single shelf.
163+
164+
![A bookshelf corresponding to a Python dictionary.](episodes/fig/bookshelf_dict.jpg){alt="An image of two bookcases, labelled “Sci-Fi” and “Romance”. Each bookcase contains shelves labelled in alphabetical order, with zero or few books on each shelf."}
165+
166+
Now, let's say a friend wanted to borrow the book "'—All You Zombies—'" by Robert Heinlein.
167+
If I had my books arranged on a single bookshelf (in a list), I might have to look through every book I own in order to find it.
168+
However, if I had a bookcase with several shelves (a hashing data structure), I would know immediately that I need to check the shelf “Sci-Fi, G–J”, so I’d be able to find it much more quickly!
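
The bookshelf comparison can be tried out directly in Python. This is a minimal sketch: the book titles and the variable names (`books_list`, `books_dict`, `target`) are invented for illustration, and the timings will differ between machines.

```python
from timeit import timeit

# Hypothetical collection: the titles are made up for this example.
books_list = [f"Book {i}" for i in range(100_000)]
books_dict = {title: i for i, title in enumerate(books_list)}

target = "Book 99999"  # the book at the far end of the single long shelf

# List: the target is compared against each element in turn.
position = books_list.index(target)

# Dictionary: the target is hashed to find the right "shelf" directly.
shelf = books_dict[target]

# Time repeated membership checks on both structures.
list_time = timeit(lambda: target in books_list, number=20)
dict_time = timeit(lambda: target in books_dict, number=20)
print(f"list: {list_time:.5f}s, dict: {dict_time:.5f}s")
```

On a typical machine the dictionary check should be far quicker, because it does not need to walk past the other 99,999 books.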
169+
170+
::::::::::::::::::::::::::::::::::::: instructor
171+
172+
The large bookcases in the second illustration, with many shelves almost empty, take up a lot more space than the single shelf in the first illustration.
173+
This may also be interpreted as the dictionary using more memory than a list.
174+
175+
In principle, this is correct. However:
176+
177+
* The actual difference is much less pronounced than in the illustration. (A list requires about 8 bytes to keep track of each item, while a dictionary requires about 30 bytes.)
178+
* In most cases, the size of the list/dictionary container itself is negligibly small compared to the size of the objects stored in the list or dictionary (e.g. 41 bytes for an empty string or 112 bytes for an empty NumPy array).
179+
180+
In practice, therefore, this trade-off between memory usage and speed is usually worth it.
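
If learners want to check the memory claim themselves, `sys.getsizeof()` gives a rough picture. This is only a sketch: the exact byte counts depend on the Python version and platform, and the names `as_list`, `as_dict` and `items` are chosen for this example.

```python
import sys

n = 1_000
items = [str(i) for i in range(n)]   # n distinct string objects
as_list = list(items)
as_dict = dict.fromkeys(items)       # same strings as keys, every value is None

# getsizeof() reports the container only, not the objects it refers to.
list_bytes = sys.getsizeof(as_list)
dict_bytes = sys.getsizeof(as_dict)
content_bytes = sum(sys.getsizeof(s) for s in items)

print(f"list container: {list_bytes / n:.1f} bytes per item")
print(f"dict container: {dict_bytes / n:.1f} bytes per item")
print(f"stored strings: {content_bytes / n:.1f} bytes per item")
```

The dictionary container is larger than the list container, but the strings themselves take up more room than the list's bookkeeping does.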
181+
182+
::::::::::::::::::::::::::::::::::::::::::::::::
183+
184+
185+
::::::::::::::::::::::::::::::::::::: callout
186+
187+
### Technical explanation
188+
156189
Within a hashing data structure, each inserted key is hashed to produce a (hopefully unique) integer hash.
157190
The dictionary is pre-allocated to a default size, and the key is assigned the index within the dictionary equivalent to the hash modulo the length of the dictionary.
158191
If that index doesn't already contain another key, the key (and any associated values) can be inserted.
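
The "hash modulo length" step can be illustrated with a few lines of Python. This is a simplified sketch rather than CPython's actual implementation; `table_size` and `slot_for()` are hypothetical names introduced here.

```python
table_size = 8  # assumed size of the pre-allocated table, for illustration only

def slot_for(key, size=table_size):
    """Return the index a key would start at: its hash modulo the table size."""
    return hash(key) % size

# Small non-negative integers hash to themselves in CPython,
# which makes these results reproducible.
print(slot_for(3))    # 3
print(slot_for(11))   # 3 as well: 11 % 8 == 3, so the keys 3 and 11 collide
print(slot_for(16))   # 0
```

String hashes are randomised per Python process by default, so `slot_for("some string")` can change between runs, but it always falls within `range(table_size)`.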
@@ -190,6 +223,8 @@ dict[MyKey("one", 2, 3.0)] = 12
190223
```
191224
The only limitation is that where two objects are equal they must have the same hash, hence all member variables which contribute to `__eq__()` should also contribute to `__hash__()` and vice versa (it's fine to have irrelevant or redundant internal members contribute to neither).
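
To see why this rule matters, here is a sketch of what can go wrong when it is broken. `InconsistentKey` is a hypothetical class written for this example: its `__eq__()` ignores `version`, but its `__hash__()` depends on it.

```python
class InconsistentKey:
    """Hypothetical key that breaks the rule: equal objects can hash differently."""

    def __init__(self, name, version):
        self.name = name
        self.version = version

    def __eq__(self, other):
        # Equality only looks at `name`...
        return isinstance(other, InconsistentKey) and self.name == other.name

    def __hash__(self):
        # ...but the hash is based on `version`, which __eq__ ignores.
        return hash(self.version)

first = InconsistentKey("report", 1)
second = InconsistentKey("report", 2)
table = {first: "stored"}

print(first == second)   # True: the two objects compare equal
print(second in table)   # False: they hash differently, so the lookup misses
```

Basing `__hash__()` on `name` instead would make equal objects hash identically and fix the lookup.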
192225

226+
:::::::::::::::::::::::::::::::::::::
227+
193228
## Sets
194229

195230
Sets are dictionaries without the values (both are declared using `{}`), a collection of unique keys equivalent to the mathematical set. *Modern CPython now uses a set implementation distinct from that of its dictionary; however, they still behave much the same in terms of performance characteristics.*
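
A short sketch of this behaviour, using invented author names as the set's contents:

```python
shelf = {"Verne", "Heinlein", "Verne"}   # the duplicate key is stored only once
print(len(shelf))             # 2
print("Heinlein" in shelf)    # True: membership checks use hashing, as with dict keys

empty_dict = {}               # bare braces create an empty dict, not a set
empty_set = set()             # an empty set needs the set() constructor
print(type(empty_dict).__name__, type(empty_set).__name__)  # dict set
```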
