
Commit b1a88c0

differences for PR #3
1 parent 9e1ba0b commit b1a88c0

File tree

4 files changed: +20 -7 lines changed

4 files changed

+20
-7
lines changed

fig/python_lists.png

448 KB (binary file, no text diff shown)
md5sum.txt

Lines changed: 2 additions & 2 deletions
@@ -5,10 +5,10 @@
 "index.md" "8f0476c27469136028995d6b7c9d4240" "site/built/index.md" "2025-01-14"
 "links.md" "8184cf4149eafbf03ce8da8ff0778c14" "site/built/links.md" "2025-01-08"
 "episodes/optimisation-introduction.md" "4ca162f5e35aa54d9618423d84a200cd" "site/built/optimisation-introduction.md" "2025-01-29"
-"episodes/optimisation-data-structures-algorithms.md" "a7cdce11f55fde5a86e6ae49e4b95645" "site/built/optimisation-data-structures-algorithms.md" "2025-01-29"
+"episodes/optimisation-data-structures-algorithms.md" "599df8343a30b5c526ec68ee0f24159e" "site/built/optimisation-data-structures-algorithms.md" "2025-01-30"
 "episodes/optimisation-minimise-python.md" "a4ee08b0ba064aaf4271d8712b29af17" "site/built/optimisation-minimise-python.md" "2025-01-29"
 "episodes/optimisation-use-latest.md" "5948276773890e97b7898292fddbcb39" "site/built/optimisation-use-latest.md" "2025-01-08"
-"episodes/optimisation-memory.md" "dc08f479e4758bcaea243f11251c2464" "site/built/optimisation-memory.md" "2025-01-29"
+"episodes/optimisation-memory.md" "2e3f414bceba47f1a3f814880fbfa20f" "site/built/optimisation-memory.md" "2025-01-30"
 "episodes/optimisation-conclusion.md" "567478d44c721cbf1bc8a71297a54a56" "site/built/optimisation-conclusion.md" "2025-01-08"
 "episodes/long-break1.md" "dea66ed9de52386eebf67722b167a2a8" "site/built/long-break1.md" "2025-01-14"
 "episodes/profiling-introduction.md" "e9fe7f86f9704b3e3655b55c0097ed67" "site/built/profiling-introduction.md" "2025-01-23"

optimisation-data-structures-algorithms.md

Lines changed: 18 additions & 3 deletions
@@ -63,6 +63,9 @@ CPython for example uses [`newsize + (newsize >> 3) + 6`](https://github.com/pyt
 
 ![The relationship between the number of appends to an empty list, and the number of internal resizes in CPython.](episodes/fig/cpython_list_allocations.png){alt='A line graph displaying the relationship between the number of calls to append() and the number of internal resizes of a CPython list. It has a logarithmic relationship, at 1 million appends there have been 84 internal resizes.'}
 
+![Visual note on the resizing behaviour of Python lists.](episodes/fig/python_lists.png){alt='A small cheat sheet visualising how Python lists resize.'}
+
+
 This has two implications:
 
 * If you are creating large static lists, they will use up to 12.5% excess memory.
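
As a quick check on the over-allocation described above, a minimal sketch can count internal resizes by watching `sys.getsizeof()`, which reflects the list's allocated capacity, change as items are appended (the exact resize count may vary between CPython versions):

```python
import sys

# Count CPython list resizes: sys.getsizeof() reports the allocated
# capacity, so it only changes when the list internally resizes.
items = []
last_size = sys.getsizeof(items)
resizes = 0
for i in range(1_000_000):
    items.append(i)
    size = sys.getsizeof(items)
    if size != last_size:
        resizes += 1
        last_size = size
print(f"{len(items):,} appends triggered {resizes} internal resizes")
```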
@@ -155,7 +158,6 @@ Python's dictionaries are implemented using hashing as their underlying data str
 
 In CPython's [dictionary](https://github.com/python/cpython/blob/main/Objects/dictobject.c) and [set](https://github.com/python/cpython/blob/main/Objects/setobject.c) implementations, a technique called open addressing is employed. This approach modifies the hash and probes subsequent indices until an empty one is found.
 
-
 When a dictionary or hash table in Python grows, the underlying storage is resized, which necessitates re-inserting every existing item into the new structure. This process can be computationally expensive but is essential for maintaining efficient average probe times when searching for keys.
 ![A visual explanation of linear probing, CPython uses an advanced form of this.](episodes/fig/hash_linear_probing.png){alt="A diagram showing how keys (hashes) 37, 64, 14, 94, 67 are inserted into a hash table with 11 indices. The insertion of 59, 80, and 39 demonstrates linear probing to resolve collisions."}
 To look up or verify the existence of a key in a hashing data structure, the key is re-hashed, and the process mirrors that of insertion. The corresponding index is probed to see if it contains the provided key. If the key at the index matches, the operation succeeds. If an empty index is reached before finding the key, it indicates that the key does not exist in the structure.
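
To make the probing behaviour concrete, here is a minimal sketch of linear probing (CPython's real probe sequence is more advanced), using the same keys as the diagram; for these small integers `hash()` returns the integer itself:

```python
# Minimal linear-probing sketch: insert keys into a fixed-size table,
# stepping forward one slot at a time on collision.
TABLE_SIZE = 11
table = [None] * TABLE_SIZE

def insert(key):
    index = hash(key) % TABLE_SIZE
    i = 0  # number of probe steps taken to find a free slot
    while table[(index + i) % TABLE_SIZE] is not None:
        i += 1  # collision: step to the next slot
    table[(index + i) % TABLE_SIZE] = key
    return i

for key in (37, 64, 14, 94, 67, 59, 80, 39):
    print(f"key {key}: stored after i={insert(key)} probe step(s)")
```

Running this shows, for example, that key 59 lands with i=1, matching the worked example below.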
@@ -166,7 +168,6 @@ The above diagram shows a hash table of 5 elements within a block of 11 slots:
 3. The number of jumps (or steps) it took to find the available slot is represented by i=1 (since we moved from position 4 to 5).
 In this case, the number of jumps i=1 indicates that the algorithm had to probe one slot to find an empty position at index 5.
 
-
 ### Keys
 
 Keys will typically be a core Python type such as a number or string. However, multiple of these can be combined as a Tuple to form a compound key, or a custom class can be used if the methods `__hash__()` and `__eq__()` have been implemented.
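
For illustration, a hypothetical `GridCell` class sketches how implementing `__hash__()` and `__eq__()` makes a custom class usable as a key, by delegating to a tuple compound key:

```python
# Sketch: a custom class usable as a dictionary key because it
# implements both __hash__() and __eq__() over its fields.
class GridCell:
    def __init__(self, x, y):
        self.x = x
        self.y = y

    def __hash__(self):
        # Delegate to a tuple compound key built from the fields.
        return hash((self.x, self.y))

    def __eq__(self, other):
        return isinstance(other, GridCell) and (self.x, self.y) == (other.x, other.y)

temperatures = {GridCell(0, 0): 20.0, GridCell(0, 1): 21.5}
print(temperatures[GridCell(0, 1)])  # 21.5
```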
@@ -284,7 +285,7 @@ uniqueListSort: 2.67ms
 
 Independent of the performance to construct a unique set (as covered in the previous section), it's worth identifying the performance of searching the data structure to retrieve an item or check whether it exists.
 
-The performance of a hashing data structure is subject to the load factor and number of collisions. An item that hashes with no collision can be checked almost directly, whereas one with collisions will probe until it finds the correct item or an empty slot. In the worst possible case, whereby all inserted items have collided, this would mean checking every single item. In practice, hashing data structures are designed to minimise the chances of this happening and most items should be found or identified as missing with a single access.
+The performance of a hashing data structure is subject to the load factor and number of collisions. An item that hashes with no collision can be checked almost directly, whereas one with collisions will probe until it finds the correct item or an empty slot. In the worst possible case, whereby all inserted items have collided, this would mean checking every single item. In practice, hashing data structures are designed to minimise the chances of this happening and most items should be found or identified as missing with a single access, resulting in an average time complexity of O(1) (which is very good!).
 
 In contrast, if searching a list or array, the default approach is to start at the first item and check all subsequent items until the correct item has been found. If the correct item is not present, this will require the entire list to be checked. Therefore, the worst case is similar to that of the hashing data structure; however, it is guaranteed in cases where the item is missing. Similarly, on average we would expect an item to be found halfway through the list, meaning that an average search will require checking half of the items.
 
@@ -347,6 +348,20 @@ binary_search_list: 5.79ms
 
 These results are subject to change based on the number of items and the proportion of searched items that exist within the list. However, the pattern is likely to remain the same. Linear searches should be avoided!
 
+::::::::::::::::::::::::::::::::::::: callout
+
+Dictionaries are designed to handle insertions efficiently, with average-case O(1) time complexity per insertion; however, the occasional O(n) resize can make heavy insertion into very large dictionaries costly. In such cases it may be better to choose an alternative data structure, for example a list, NumPy array, or Pandas DataFrame. The table below summarises the typical performance characteristics and best uses of each data structure:
+
+| Data Structure   | Small-Size Insertion       | Large-Size Insertion                 | Search Performance      | Best For                                                              |
+|------------------|----------------------------|--------------------------------------|-------------------------|-----------------------------------------------------------------------|
+| Dictionary       | ✅ Average O(1)            | ⚠️ Occasional O(n) (due to resizing) | ✅ O(1) (hashing)       | Fast insertions and lookups, key-value storage, small to medium data  |
+| List             | ✅ Amortised O(1) (append) | ✅ Efficient (amortised O(1))        | ❌ O(n) (linear search) | Dynamic appends, ordered data storage, general-purpose use            |
+| Set              | ✅ Average O(1)            | ⚠️ Occasional O(n) (due to resizing) | ✅ O(1) (hashing)       | Membership testing, unique elements, small to medium datasets         |
+| NumPy Array      | ❌ (fixed size)            | ⚠️ Costly (O(n) when resizing)       | ❌ O(n) (linear search) | Numerical computations, fixed-size data, vectorised operations        |
+| Pandas DataFrame | ❌ (costly if adding rows) | ✅ Efficient (column-wise)           | ❌ O(n) (linear search) | Column-wise analytics, tabular data, large datasets                   |
+NumPy and Pandas, which we have not yet covered, are powerful libraries designed for handling large matrices and arrays. They are implemented in C to optimise performance, making them ideal for numerical computations and data analysis tasks.
+
+:::::::::::::::::::::::::::::::::::::::::::::
 
 ::::::::::::::::::::::::::::::::::::: keypoints
 
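As a sketch of the callout's advice (the array size is illustrative), a NumPy array can be allocated once and filled with a vectorised operation rather than grown incrementally:

```python
import numpy as np

# Allocate once and fill vectorised, rather than growing a
# container one append at a time.
n = 1_000_000
values = np.arange(n, dtype=np.float64) * 0.5  # single allocation
print(values[:3])  # [0.  0.5 1. ]
```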
optimisation-memory.md

Lines changed: 0 additions & 2 deletions
@@ -173,7 +173,6 @@ Within Python memory is not explicitly allocated and deallocated, instead it is
 The implementation of the [heat equation](https://en.wikipedia.org/wiki/Heat_equation) below reallocates `out_grid`, a large two-dimensional (512x512) list, each time `update()` is called to progress the model.
 
 ```python
-import time
 grid_shape = (512, 512)
 
 def update(grid, a_dt):
@@ -222,7 +221,6 @@
 If instead `out_grid` is double buffered, two buffers can be allocated outside the function and swapped after each call to `update()`.
 
 ```python
-import time
 grid_shape = (512, 512)
 
 def update(grid, a_dt, out_grid):
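
The swap itself is cheap. Here is a minimal sketch of the double-buffering pattern, with a simplified stand-in for the lesson's heat-equation `update()` body:

```python
# Double buffering: two grids are allocated once; update() writes into
# out_grid and the references are swapped, so no reallocation occurs.
grid_shape = (512, 512)

def update(grid, a_dt, out_grid):
    # Simplified stand-in for the heat-equation update.
    for i in range(grid_shape[0]):
        for j in range(grid_shape[1]):
            out_grid[i][j] = grid[i][j] + a_dt

grid = [[0.0] * grid_shape[1] for _ in range(grid_shape[0])]
out_grid = [[0.0] * grid_shape[1] for _ in range(grid_shape[0])]

for _ in range(10):
    update(grid, 0.1, out_grid)
    grid, out_grid = out_grid, grid  # swap references, not data
```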
