Commit 0427672

Move hashing data structures to technical appendix.
1 parent 5ceb38c commit 0427672

File tree

4 files changed

+25
-20
lines changed

episodes/optimisation-data-structures-algorithms.md

Lines changed: 5 additions & 17 deletions
Original file line number | Diff line number | Diff line change
@@ -156,14 +156,13 @@ Since Python 3.6, the items within a dictionary will iterate in the order that t
156156

157157
### Hashing Data Structures
158158

159-
Python's dictionaries are implemented as hashing data structures.
160-
Explaining how these work will get a bit technical, so let's start with an analogy:
159+
Python's dictionaries are implemented as hashing data structures. We can understand how these work at a high level with an analogy:
161160

162161
A Python list is like having a single long bookshelf. When you buy a new book (append a new element to the list), you place it at the far end of the shelf, right after all the previous books.
163162

164163
![A bookshelf corresponding to a Python list.](episodes/fig/bookshelf_list.jpg){alt="An image of a single long bookshelf, with a large number of books."}
165164

166-
A hashing data structure is more like a bookcase with several shelves, labelled by genre (sci-fi, romance, children's books, non-fiction, …) and author surname. When you buy a new book by Jules Verne, you might place it on the shelf labelled "Sci-Fi, V–Z".
165+
A dictionary is more like a bookcase with several shelves, labelled by genre (sci-fi, romance, children's books, non-fiction, …) and author surname. When you buy a new book by Jules Verne, you might place it on the shelf labelled "Sci-Fi, V–Z".
167166
And if you keep adding more books, at some point you'll move to a larger bookcase with more shelves (and thus more fine-grained sorting), to make sure you don't have too many books on a single shelf.
168167

169168
![A bookshelf corresponding to a Python dictionary.](episodes/fig/bookshelf_dict.jpg){alt="An image of two bookcases, labelled "Sci-Fi" and "Romance". Each bookcase contains shelves labelled in alphabetical order, with zero or few books on each shelf."}
@@ -186,25 +185,14 @@ In practice, therefore, this trade-off between memory usage and speed is usually
186185

187186
::::::::::::::::::::::::::::::::::::::::::::::::
188187

188+
When a value is inserted into a dictionary, its key is hashed to decide on which "shelf" it should be stored. Most items will have a unique shelf, allowing them to be accessed directly. This is typically much faster for locating a specific item than searching a list.
189189

190-
::::::::::::::::::::::::::::::::::::: callout
191-
192-
### Technical explanation
193-
194-
Within a hashing data structure each inserted key is hashed to produce a (hopefully unique) integer key.
195-
The dictionary is pre-allocated to a default size, and the key is assigned the index within the dictionary equivalent to the hash modulo the length of the dictionary.
196-
If that index doesn't already contain another key, the key (and any associated values) can be inserted.
197-
When the index isn't free, a collision strategy is applied. CPython's [dictionary](https://github.com/python/cpython/blob/main/Objects/dictobject.c) and [set](https://github.com/python/cpython/blob/main/Objects/setobject.c) both use a form of open addressing whereby a hash is mutated and corresponding indices probed until a free one is located.
198-
When the hashing data structure exceeds a given load factor (e.g. 2/3 of indices have been assigned keys), the internal storage must grow. This process requires every item to be re-inserted which can be expensive, but reduces the average probes for a key to be found.
199-
200-
![An visual explanation of linear probing, CPython uses an advanced form of this.](episodes/fig/hash_linear_probing.png){alt="A diagram demonstrating how the keys (hashes) 37, 64, 14, 94, 67 are inserted into a hash table with 11 indices. This is followed by the insertion of 59, 80 and 39 which require linear probing to be inserted due to collisions."}
201-
202-
To retrieve or check for the existence of a key within a hashing data structure, the key is hashed again and a process equivalent to insertion is repeated. However, now the key at each index is checked for equality with the one provided. If any empty index is found before an equivalent key, then the key must not be present in the data structure.
203190

191+
::::::::::::::::::::::::::::::::::::: callout
204192

205193
### Keys
206194

207-
Keys will typically be a core Python type such as a number or string. However, multiple of these can be combined as a Tuple to form a compound key, or a custom class can be used if the methods `__hash__()` and `__eq__()` have been implemented.
195+
A dictionary's keys will typically be a core Python type such as a number or string. However, several of these can be combined in a tuple to form a compound key, or a custom class can be used if it implements the methods `__hash__()` and `__eq__()`.
208196

209197
You can implement `__hash__()` by utilising the ability for Python to hash tuples, avoiding the need to implement a bespoke hash function.
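
As a minimal sketch of the tuple-based approach described above, the class below delegates hashing to Python's built-in `hash()` on a tuple of its fields. The class name `GridCell` and its attributes are hypothetical, chosen only for illustration. The sketch assumes the fields used in the hash are not modified while an instance is in use as a key.

```python
class GridCell:
    """Hypothetical example class that can be used as a dictionary key."""

    def __init__(self, row, col):
        self.row = row
        self.col = col

    def __eq__(self, other):
        # Two cells are equal when both coordinates match
        if not isinstance(other, GridCell):
            return NotImplemented
        return (self.row, self.col) == (other.row, other.col)

    def __hash__(self):
        # Reuse Python's tuple hashing rather than writing a bespoke hash
        return hash((self.row, self.col))


visits = {GridCell(0, 1): 3}
# A separate but equal object locates the same entry
print(visits[GridCell(0, 1)])  # 3
```

Using the same tuple of fields in both `__eq__()` and `__hash__()` keeps the two methods consistent, so equal objects always hash to the same value.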
210198

learners/technical-appendix.md

Lines changed: 20 additions & 3 deletions
Original file line number | Diff line number | Diff line change
@@ -8,7 +8,7 @@ The topics covered here exceed the level of knowledge required to benefit from t
88

99
- [Viewing Python's ByteCode](#viewing-pythons-bytecode): What the Python code you write compiles to and executes as.
1010
- [Hardware Level Memory Accesses](#hardware-level-memory-accesses): A look at how memory accesses pass through a processor's caches.
11-
- []()
11+
- [Hashing Data-Structures](#hashing-data-structures): A deeper look at how data structures such as Dictionaries operate.
1212

1313
## Viewing Python's ByteCode
1414

@@ -143,7 +143,7 @@ Modern computers typically have a single processor (CPU), within this processor
143143
Data held in memory by running software exists in RAM; this memory is faster to access than hard drives (and solid-state drives).
144144
But the CPU has much smaller caches on-board, to make accessing the most recent variables even faster.
145145

146-
![An annotated photo of a computer's hardware.](learners/fig/annotated-motherboard.jpg){alt="An annotated photo of inside a desktop computer's case. The CPU, RAM, power supply, graphics cards (GPUs) and harddrive are labelled."}
146+
![An annotated photo of a computer's hardware.](learners/fig/annotated-motherboard.jpg){alt="An annotated photo of inside a desktop computer's case. The CPU, RAM, power supply, graphics cards (GPUs) and hard-drive are labelled."}
147147

148148
<!-- Read/operate on variable ram->cpu cache->registers->cpu -->
149149
When reading a variable, to perform an operation with it, the CPU will first look in its registers. These exist per core, they are the location that computation is actually performed. Accessing them is incredibly fast, but there only exists enough storage for around 32 variables (typical number, e.g. 4 bytes).
@@ -179,4 +179,21 @@ However all is not lost, packages such as `numpy` and `pandas` implemented in C/
179179
:::::::::::::::::::::::::::::::::::::::::::::
180180

181181

182-
##
182+
## Hashing Data-Structures
183+
184+
Within a hashing data structure (such as a Dictionary or Set) each inserted key is hashed to produce a (preferably unique) integer key, which serves as the basis for indexing. Dictionaries are initialized with a default size, and the hash value of a key, modulo the dictionary's length, determines its initial index. If this index is available, the key and its associated value are stored there. If the index is already occupied, a collision occurs, and a resolution strategy is applied to find an alternate index.
185+
186+
In CPython's [dictionary](https://github.com/python/cpython/blob/main/Objects/dictobject.c) and [set](https://github.com/python/cpython/blob/main/Objects/setobject.c) implementations, a technique called open addressing is employed. This approach modifies the hash and probes subsequent indices until an empty one is found.
187+
188+
When a dictionary or hash table in Python grows, the underlying storage is resized, which necessitates re-inserting every existing item into the new structure. This process can be computationally expensive but is essential for maintaining efficient average probe times when searching for keys.
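
One rough way to observe this resizing is to watch the reported memory footprint of a dictionary as items are added. The sketch below assumes CPython, where `sys.getsizeof()` reflects the dictionary's internal storage; the exact sizes and the points at which they change are implementation details and may differ between versions.

```python
import sys

d = {}
last_size = sys.getsizeof(d)
resize_points = []  # (number of items, reported size in bytes)

for i in range(1000):
    d[i] = None
    size = sys.getsizeof(d)
    if size != last_size:
        # The reported size only changes when the storage is reallocated
        resize_points.append((len(d), size))
        last_size = size

print(resize_points)
```

The storage changes size only a handful of times across the 1000 insertions, which is why the cost of re-inserting every item is paid rarely.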
189+
190+
![A visual explanation of linear probing; CPython uses an advanced form of this.](learners/fig/hash_linear_probing.png){alt="A diagram showing how keys (hashes) 37, 64, 14, 94, 67 are inserted into a hash table with 11 indices. The insertion of 59, 80, and 39 demonstrates linear probing to resolve collisions."}
191+
192+
To look up or verify the existence of a key in a hashing data structure, the key is re-hashed, and the process mirrors that of insertion. The corresponding index is probed to see if it contains the provided key. If the key at the index matches, the operation succeeds. If an empty index is reached before finding the key, it indicates that the key does not exist in the structure.
193+
194+
The diagram above shows a hash table holding 5 elements within a block of 11 slots:
195+
196+
1. We try to add element k=59. Based on its hash, the intended position is p=4. However, slot 4 is already occupied by the element k=37. This results in a collision.
197+
2. To resolve the collision, the linear probing mechanism is employed. The algorithm checks each subsequent slot in turn, starting from position p=4. The first available slot is found at position 5.
198+
3. The number of jumps (or steps) it took to find the available slot is represented by i=1 (since we moved from position 4 to 5).
199+
In this case, i=1 means that the algorithm had to probe one additional slot before finding an empty position at index 5.
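
The walkthrough above can be reproduced with a short sketch of plain linear probing. This is an illustration only, not CPython's implementation, which uses a more advanced probing scheme. The function names `insert` and `contains` are hypothetical, and the sketch assumes that each integer key serves as its own hash.

```python
EMPTY = None


def insert(table, key):
    """Insert key using linear probing; return (index, number of jumps)."""
    size = len(table)
    start = key % size  # intended position p
    for i in range(size):
        index = (start + i) % size
        if table[index] is EMPTY or table[index] == key:
            table[index] = key
            return index, i
    raise RuntimeError("table is full")


def contains(table, key):
    """Probe from the intended position until the key or an empty slot is found."""
    size = len(table)
    start = key % size
    for i in range(size):
        index = (start + i) % size
        if table[index] is EMPTY:
            return False  # an empty slot means the key was never inserted
        if table[index] == key:
            return True
    return False


table = [EMPTY] * 11
for key in (37, 64, 14, 94, 67):
    insert(table, key)

print(insert(table, 59))  # (5, 1)
print(insert(table, 80))  # (7, 4)
print(insert(table, 39))  # (8, 2)
```

Lookups follow the same probe sequence as insertion, so `contains(table, 59)` checks slot 4, finds 37, and then finds 59 in slot 5.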
