
Commit 039d2d8

Robadob and JostMigenda authored

Technical Appendix (#79)

* work in progress
* dis: Doesn't feel worth adding a cross-reference for this section.
* Understanding memory
* Move hashing data structures to technical appendix.
* Fix some dead links etc. on review.
* Apply review suggestions to episodes/optimisation-data-structures-algorithms.md (×3), learners/technical-appendix.md (×6), episodes/optimisation-latency.md and learners/acknowledgements.md (each: Co-authored-by: Jost Migenda <[email protected]>)

Co-authored-by: Jost Migenda <[email protected]>

1 parent 0c6e8b6 · commit 039d2d8

12 files changed: +226 −214 lines

config.yaml

Lines changed: 2 additions & 1 deletion

```diff
@@ -70,13 +70,14 @@ episodes:
 - long-break1.md
 - optimisation-numpy.md
 - optimisation-use-latest.md
-- optimisation-memory.md
+- optimisation-latency.md
 - optimisation-conclusion.md

 # Information for Learners
 learners:
 - setup.md
 - registration.md
+- technical-appendix.md
 - acknowledgements.md
 - ppp.md
 - reference.md
```

episodes/optimisation-conclusion.md

Lines changed: 2 additions & 3 deletions

```diff
@@ -54,10 +54,9 @@ Your feedback enables us to improve the course for future attendees!
 - Where feasible, the latest version of Python and packages should be used as they can include significant free improvements to the performance of your code.
 - There is a risk that updating Python or packages will not be possible due to version incompatibilities or will require breaking changes to your code.
 - Changes to packages may impact results output by your code, ensure you have a method of validation ready prior to attempting upgrades.
-- How the Computer Hardware Affects Performance
-  - Sequential accesses to memory (RAM or disk) will be faster than random or scattered accesses.
-  - This is not always natively possible in Python without the use of packages such as NumPy and Pandas
+- How Latency Affects Performance
   - One large file is preferable to many small files.
+  - Network requests can be parallelised to reduce the impact of fixed overheads.
   - Memory allocation is not free, avoiding destroying and recreating objects can improve performance.

 ::::::::::::::::::::::::::::::::::::::::::::::::
```

episodes/optimisation-data-structures-algorithms.md

Lines changed: 5 additions & 17 deletions

```diff
@@ -156,14 +156,13 @@ Since Python 3.6, the items within a dictionary will iterate in the order that t

 ### Hashing Data Structures

-Python's dictionaries are implemented as hashing data structures.
-Explaining how these work will get a bit technical, so let's start with an analogy:
+Python's dictionaries are implemented as hashing data structures; we can understand these at a high level with an analogy:

 A Python list is like having a single long bookshelf. When you buy a new book (append a new element to the list), you place it at the far end of the shelf, right after all the previous books.

 ![A bookshelf corresponding to a Python list.](episodes/fig/bookshelf_list.jpg){alt="An image of a single long bookshelf, with a large number of books."}

-A hashing data structure is more like a bookcase with several shelves, labelled by genre (sci-fi, romance, children's books, non-fiction,&nbsp;…) and author surname. When you buy a new book by Jules Verne, you might place it on the shelf labelled &quot;Sci-Fi, V–Z&quot;.
+A Python dictionary is more like a bookcase with several shelves, labelled by genre (sci-fi, romance, children's books, non-fiction,&nbsp;…) and author surname. When you buy a new book by Jules Verne, you might place it on the shelf labelled &quot;Sci-Fi, V–Z&quot;.
 And if you keep adding more books, at some point you'll move to a larger bookcase with more shelves (and thus more fine-grained sorting), to make sure you don't have too many books on a single shelf.

 ![A bookshelf corresponding to a Python dictionary.](episodes/fig/bookshelf_dict.jpg){alt="An image of two bookcases, labelled &quot;Sci-Fi&quot; and &quot;Romance&quot;. Each bookcase contains shelves labelled in alphabetical order, with zero or few books on each shelf."}
@@ -186,25 +185,14 @@ In practice, therefore, this trade-off between memory usage and speed is usually

 ::::::::::::::::::::::::::::::::::::::::::::::::
+When a value is inserted into a dictionary, its key is hashed to decide on which "shelf" it should be stored. Most items will have a unique shelf, allowing them to be accessed directly. This is typically much faster for locating a specific item than searching a list.

-::::::::::::::::::::::::::::::::::::: callout
-
-### Technical explanation
-
-Within a hashing data structure each inserted key is hashed to produce a (hopefully unique) integer key.
-The dictionary is pre-allocated to a default size, and the key is assigned the index within the dictionary equivalent to the hash modulo the length of the dictionary.
-If that index doesn't already contain another key, the key (and any associated values) can be inserted.
-When the index isn't free, a collision strategy is applied. CPython's [dictionary](https://github.com/python/cpython/blob/main/Objects/dictobject.c) and [set](https://github.com/python/cpython/blob/main/Objects/setobject.c) both use a form of open addressing whereby a hash is mutated and corresponding indices probed until a free one is located.
-When the hashing data structure exceeds a given load factor (e.g. 2/3 of indices have been assigned keys), the internal storage must grow. This process requires every item to be re-inserted which can be expensive, but reduces the average probes for a key to be found.
-
-![An visual explanation of linear probing, CPython uses an advanced form of this.](episodes/fig/hash_linear_probing.png){alt="A diagram demonstrating how the keys (hashes) 37, 64, 14, 94, 67 are inserted into a hash table with 11 indices. This is followed by the insertion of 59, 80 and 39 which require linear probing to be inserted due to collisions."}
-
-To retrieve or check for the existence of a key within a hashing data structure, the key is hashed again and a process equivalent to insertion is repeated. However, now the key at each index is checked for equality with the one provided. If any empty index is found before an equivalent key, then the key must not be present in the data structure.

+::::::::::::::::::::::::::::::::::::: callout

 ### Keys

-Keys will typically be a core Python type such as a number or string. However, multiple of these can be combined as a Tuple to form a compound key, or a custom class can be used if the methods `__hash__()` and `__eq__()` have been implemented.
+A dictionary's keys will typically be a core Python type such as a number or string. However, multiple of these can be combined as a tuple to form a compound key, or a custom class can be used if the methods `__hash__()` and `__eq__()` have been implemented.

 You can implement `__hash__()` by utilising the ability for Python to hash tuples, avoiding the need to implement a bespoke hash function.
```
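The open-addressing behaviour described in the callout this commit removes (and moves to the technical appendix) can be sketched as a toy Python class. This is a simplification: CPython's `dict` uses a perturbed probe sequence and resizes once a ~2/3 load factor is exceeded, whereas this sketch uses plain linear probing and a fixed size.

```python
class LinearProbingTable:
    """A toy hash table using linear probing.

    Illustrative only: it never grows, so it will loop forever if you
    insert more distinct keys than it has slots.
    """

    def __init__(self, capacity=11):
        self._slots = [None] * capacity  # each slot holds (key, value) or None

    def _probe(self, key):
        # Start at hash(key) modulo the table length, step forward on collision.
        index = hash(key) % len(self._slots)
        while True:
            slot = self._slots[index]
            if slot is None or slot[0] == key:
                return index  # a free slot, or the slot already holding this key
            index = (index + 1) % len(self._slots)

    def __setitem__(self, key, value):
        self._slots[self._probe(key)] = (key, value)

    def __getitem__(self, key):
        slot = self._slots[self._probe(key)]
        if slot is None:
            # An empty slot reached before an equal key means the key is absent.
            raise KeyError(key)
        return slot[1]

table = LinearProbingTable()
table["verne"] = "Sci-Fi, V-Z"
table["austen"] = "Romance, A-D"
print(table["verne"])  # Sci-Fi, V-Z
```

Retrieval repeats the same probe sequence as insertion, checking each visited key for equality, which mirrors the lookup process described in the removed text.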

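The retained "Keys" callout suggests implementing `__hash__()` by delegating to Python's tuple hashing. A minimal sketch, using a hypothetical `Book` class that is not part of the lesson:

```python
class Book:
    """A custom compound dictionary key; hashing delegates to a tuple."""

    def __init__(self, author, title):
        self.author = author
        self.title = title

    def __hash__(self):
        # Reuse built-in tuple hashing rather than writing a bespoke function.
        return hash((self.author, self.title))

    def __eq__(self, other):
        if not isinstance(other, Book):
            return NotImplemented
        return (self.author, self.title) == (other.author, other.title)

shelf = {Book("Jules Verne", "Around the World in Eighty Days"): "Sci-Fi, V-Z"}
print(shelf[Book("Jules Verne", "Around the World in Eighty Days")])  # Sci-Fi, V-Z
```

Equal objects must hash equally, which holds here because both methods derive from the same tuple of fields.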
Lines changed: 13 additions & 62 deletions

````diff
@@ -1,83 +1,33 @@
 ---
-title: "Understanding Memory"
+title: "Understanding Latency"
 teaching: 30
 exercises: 0
 ---

 :::::::::::::::::::::::::::::::::::::: questions

-- How does a CPU look for a variable it requires?
-- What impact do cache lines have on memory accesses?
 - Why is it faster to read/write a single 100 MB file, than 100 files of 1 MB each?
+- How many orders of magnitude slower are disk accesses than RAM?
+- What's the cost of creating a list?

 ::::::::::::::::::::::::::::::::::::::::::::::::

 ::::::::::::::::::::::::::::::::::::: objectives

-- Able to explain, at a high-level, how memory accesses occur during computation and how this impacts optimisation considerations.
 - Able to identify the relationship between different latencies relevant to software.
+- Demonstrate how to implement parallel network requests.
+- Justify the re-use of existing variables over creating new ones.

 ::::::::::::::::::::::::::::::::::::::::::::::::

-## Accessing Variables
-
-The storage and movement of data plays a large role in the performance of executing software.
-
-<!-- Brief summary of hardware -->
-Modern computer's typically have a single processor (CPU), within this processor there are multiple processing cores each capable of executing different code in parallel.
-
-Data held in memory by running software is exists in RAM, this memory is faster to access than hard drives (and solid-state drives).
-But the CPU has much smaller caches on-board, to make accessing the most recent variables even faster.
-
-![An annotated photo of a computer's hardware.](episodes/fig/annotated-motherboard.jpg){alt="An annotated photo of inside a desktop computer's case. The CPU, RAM, power supply, graphics cards (GPUs) and harddrive are labelled."}
-
-<!-- Read/operate on variable ram->cpu cache->registers->cpu -->
-When reading a variable, to perform an operation with it, the CPU will first look in its registers. These exist per core, they are the location that computation is actually performed. Accessing them is incredibly fast, but there only exists enough storage for around 32 variables (typical number, e.g. 4 bytes).
-As the register file is so small, most variables won't be found and the CPU's caches will be searched.
-It will first check the current processing core's L1 (Level 1) cache, this small cache (typically 64 KB per physical core) is the smallest and fastest to access cache on a CPU.
-If the variable is not found in the L1 cache, the L2 cache that is shared between multiple cores will be checked. This shared cache, is slower to access but larger than L1 (typically 1-3MB per core).
-This process then repeats for the L3 cache which may be shared among all cores of the CPU. This cache again has higher latency to access, but increased size (typically slightly larger than the total L2 cache size).
-If the variable has not been found in any of the CPU's cache, the CPU will look to the computer's RAM. This is an order of magnitude slower to access, with several orders of magnitude greater capacity (tens to hundreds of GB are now standard).
-
-Correspondingly, the earlier the CPU finds the variable the faster it will be to access.
-However, to fully understand the cache's it's necessary to explain what happens once a variable has been found.
-
-If a variable is not found in the caches, it must be fetched from RAM.
-The full 64 byte cache line containing the variable, will be copied first into the CPU's L3, then L2 and then L1.
-Most variables are only 4 or 8 bytes, so many neighbouring variables are also pulled into the caches.
-Similarly, adding new data to a cache evicts old data.
-This means that reading 16 integers contiguously stored in memory, should be faster than 16 scattered integers
-
-Therefore, to **optimally** access variables they should be stored contiguously in memory with related data and worked on whilst they remain in caches.
-If you add to a variable, perform large amount of unrelated processing, then add to the variable again it will likely have been evicted from caches and need to be reloaded from slower RAM again.
-
-<!-- Latency/Throughput typically inversely proportional to capacity -->
-It's not necessary to remember this full detail of how memory access work within a computer, but the context perhaps helps understand why memory locality is important.
-
-![An abstract diagram showing the path data takes from disk or RAM to be used for computation.](episodes/fig/hardware.png){alt='An abstract representation of a CPU, RAM and Disk, showing their internal caches and the pathways data can pass.'}
-
-::::::::::::::::::::::::::::::::::::: callout
-
-Python as a programming language, does not give you enough control to carefully pack your variables in this manner (every variable is an object, so it's stored as a pointer that redirects to the actual data stored elsewhere).
-
-However all is not lost, packages such as `numpy` and `pandas` implemented in C/C++ enable Python users to take advantage of efficient memory accesses (when they are used correctly).
-
-:::::::::::::::::::::::::::::::::::::::::::::
-
-<!-- TODO python code example
-```python
-
-```-->

 ## Accessing Disk

 <!-- Read data from a file it goes disk->disk cache->ram->cpu cache/s->cpu -->
-When accessing data on disk (or network), a very similar process is performed to that between CPU and RAM when accessing variables.
+When reading data from a file, it is first transferred from the disk to the disk cache and then to the RAM (the computer's main memory, where variables are stored).
+The latency to access files on disk is another order of magnitude higher than accessing normal variables.

-When reading data from a file, it transferred from the disk, to the disk cache, to the RAM.
-The latency to access files on disk is another order of magnitude higher than accessing RAM.
-
-As such, disk accesses similarly benefit from sequential accesses and reading larger blocks together rather than single variables.
+As such, disk accesses benefit from sequential accesses and reading larger blocks together rather than single variables.
 Python's `io` package is already buffered, so automatically handles this for you in the background.

 However before a file can be read, the file system on the disk must be polled to transform the file path to its address on disk to initiate the transfer (or throw an exception).
````
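The preference for one large file over many small files (each open paying the per-file filesystem lookup described above) can be sketched with the standard library. File names and sizes here are illustrative, and measured timings will vary with machine and filesystem cache state:

```python
import os
import tempfile
import time

# Illustrative sizes: 100 files of 10 kB each vs one 1 MB file.
payload = b"x" * 10_000

with tempfile.TemporaryDirectory() as tmp:
    # Write many small files and one large file with the same total content.
    for i in range(100):
        with open(os.path.join(tmp, f"part_{i}.bin"), "wb") as f:
            f.write(payload)
    with open(os.path.join(tmp, "large.bin"), "wb") as f:
        f.write(payload * 100)

    # Reading back: each small file pays the per-file open/lookup overhead.
    start = time.perf_counter()
    chunks = []
    for i in range(100):
        with open(os.path.join(tmp, f"part_{i}.bin"), "rb") as f:
            chunks.append(f.read())
    small = b"".join(chunks)
    t_small = time.perf_counter() - start

    # The single large file pays that overhead once.
    start = time.perf_counter()
    with open(os.path.join(tmp, "large.bin"), "rb") as f:
        large = f.read()
    t_large = time.perf_counter() - start

print(f"100 small files: {t_small:.4f}s, one large file: {t_large:.4f}s")
```

Either path yields the same bytes; only the number of open/lookup round trips differs.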
```diff
@@ -158,7 +108,7 @@ An even greater overhead would apply.

 ## Accessing the Network

-When transfering files over a network, similar effects apply. There is a fixed overhead for every file transfer (no matter how big the file), so downloading many small files will be slower than downloading a single large file of the same total size.
+When transferring files over a network, similar effects apply. There is a fixed overhead for every file transfer (no matter how big the file), so downloading many small files will be slower than downloading a single large file of the same total size.

 Because of this overhead, downloading many small files often does not use all the available bandwidth. It may be possible to speed things up by parallelising downloads.
```
```diff
@@ -227,7 +177,9 @@ Latency can have a big impact on the speed that a program executes, the below gr

 ![A graph demonstrating the wide variety of latencies a programmer may experience when accessing data.](episodes/fig/latency.png){alt="A horizontal bar chart displaying the relative latencies for L1/L2/L3 cache, RAM, SSD, HDD and a packet being sent from London to California and back. These latencies range from 1 nanosecond to 140 milliseconds and are displayed with a log scale."}

-The lower the latency typically the higher the effective bandwidth (L1 and L2 cache have 1 TB/s, RAM 100 GB/s, SSDs up to 32 GB/s, HDDs up to 150 MB/s), making large memory transactions even slower.
+L1/L2/L3 caches are where your most recently accessed variables are stored inside the CPU, whereas RAM is where most of your variables will be found.
+
+The lower the latency typically the higher the effective bandwidth (L1 and L2 cache have 1&nbsp;TB/s, RAM 100&nbsp;GB/s, SSDs up to 32 GB/s, HDDs up to 150&nbsp;MB/s), making large memory transactions even slower.

 ## Memory Allocation is not Free
```
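As context for the bandwidth figures quoted in the added line, a quick back-of-envelope calculation (decimal units, figures as quoted in the text) of how long a 1 GB transfer takes at each tier:

```python
# Bandwidth figures quoted in the text (decimal units).
bandwidths = {
    "L1/L2 cache": 1e12,  # 1 TB/s
    "RAM": 100e9,         # 100 GB/s
    "SSD": 32e9,          # up to 32 GB/s
    "HDD": 150e6,         # up to 150 MB/s
}
size = 1e9  # a 1 GB transfer

for name, bytes_per_second in bandwidths.items():
    print(f"{name}: {size / bytes_per_second * 1000:.2f} ms")
# L1/L2 cache: 1.00 ms
# RAM: 10.00 ms
# SSD: 31.25 ms
# HDD: 6666.67 ms
```

The roughly four orders of magnitude between cache and HDD mirror the latency gap shown in the graph.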
```diff
@@ -335,9 +287,8 @@ Line # Hits Time Per Hit % Time Line Contents

 ::::::::::::::::::::::::::::::::::::: keypoints

-- Sequential accesses to memory (RAM or disk) will be faster than random or scattered accesses.
-- This is not always natively possible in Python without the use of packages such as NumPy and Pandas
 - One large file is preferable to many small files.
+- Network requests can be parallelised to reduce the impact of fixed overheads.
 - Memory allocation is not free, avoiding destroying and recreating objects can improve performance.

 ::::::::::::::::::::::::::::::::::::::::::::::::
```
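The keypoint that memory allocation is not free can be illustrated with NumPy's `out=` argument, which writes into an existing buffer instead of allocating a fresh result array on every call. Array sizes and function names here are illustrative:

```python
import numpy as np

a = np.ones(1_000_000)
b = np.ones(1_000_000)
out = np.empty_like(a)  # allocate the result buffer once, up front

def add_new(a, b):
    return a + b             # allocates a brand-new array on every call

def add_reuse(a, b, out):
    np.add(a, b, out=out)    # reuses the existing buffer, no new allocation
    return out
```

Both produce identical results; in a tight loop the reusing version avoids repeatedly destroying and recreating a large object.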

episodes/optimisation-using-python.md

Lines changed: 0 additions & 128 deletions

````diff
@@ -150,134 +150,6 @@ operatorSearch: 28.43ms

 An easy approach to follow is that if two blocks of code do the same operation, the one that contains less Python is probably faster. This won't apply if you're using 3rd party packages written purely in Python though.

-::::::::::::::::::::::::::::::::::::: callout
-
-### Python bytecode
-
-You can use `dis` to view the bytecode generated by Python, the amount of byte-code more strongly correlates with how much code is being executed by the Python interpreter. However, this still does not account for whether functions called are implemented using Python or C.
-
-The pure Python search compiles to 82 lines of byte-code.
-
-```python
-import dis
-
-def manualSearch():
-    ls = generateInputs()
-    ct = 0
-    for i in range(0, int(N*M), M):
-        for j in range(0, len(ls)):
-            if ls[j] == i:
-                ct += 1
-                break
-
-dis.dis(manualSearch)
-```
-```output
- 11           0 LOAD_GLOBAL              0 (generateInputs)
-              2 CALL_FUNCTION            0
-              4 STORE_FAST               0 (ls)
-
- 12           6 LOAD_CONST               1 (0)
-              8 STORE_FAST               1 (ct)
-
- 13          10 LOAD_GLOBAL              1 (range)
-             12 LOAD_CONST               1 (0)
-             14 LOAD_GLOBAL              2 (int)
-             16 LOAD_GLOBAL              3 (N)
-             18 LOAD_GLOBAL              4 (M)
-             20 BINARY_MULTIPLY
-             22 CALL_FUNCTION            1
-             24 LOAD_GLOBAL              4 (M)
-             26 CALL_FUNCTION            3
-             28 GET_ITER
-        >>   30 FOR_ITER                24 (to 80)
-             32 STORE_FAST               2 (i)
-
- 14          34 LOAD_GLOBAL              1 (range)
-             36 LOAD_CONST               1 (0)
-             38 LOAD_GLOBAL              5 (len)
-             40 LOAD_FAST                0 (ls)
-             42 CALL_FUNCTION            1
-             44 CALL_FUNCTION            2
-             46 GET_ITER
-        >>   48 FOR_ITER                14 (to 78)
-             50 STORE_FAST               3 (j)
-
- 15          52 LOAD_FAST                0 (ls)
-             54 LOAD_FAST                3 (j)
-             56 BINARY_SUBSCR
-             58 LOAD_FAST                2 (i)
-             60 COMPARE_OP               2 (==)
-             62 POP_JUMP_IF_FALSE       38 (to 76)
-
- 16          64 LOAD_FAST                1 (ct)
-             66 LOAD_CONST               2 (1)
-             68 INPLACE_ADD
-             70 STORE_FAST               1 (ct)
-
- 17          72 POP_TOP
-             74 JUMP_FORWARD             1 (to 78)
-
- 15     >>   76 JUMP_ABSOLUTE           24 (to 48)
-        >>   78 JUMP_ABSOLUTE           15 (to 30)
-
- 13     >>   80 LOAD_CONST               0 (None)
-             82 RETURN_VALUE
-```
-
-Whereas the `in` variant only compiles to 54.
-
-```python
-import dis
-
-def operatorSearch():
-    ls = generateInputs()
-    ct = 0
-    for i in range(0, int(N*M), M):
-        if i in ls:
-            ct += 1
-
-dis.dis(operatorSearch)
-```
-```output
-  4           0 LOAD_GLOBAL              0 (generateInputs)
-              2 CALL_FUNCTION            0
-              4 STORE_FAST               0 (ls)
-
-  5           6 LOAD_CONST               1 (0)
-              8 STORE_FAST               1 (ct)
-
-  6          10 LOAD_GLOBAL              1 (range)
-             12 LOAD_CONST               1 (0)
-             14 LOAD_GLOBAL              2 (int)
-             16 LOAD_GLOBAL              3 (N)
-             18 LOAD_GLOBAL              4 (M)
-             20 BINARY_MULTIPLY
-             22 CALL_FUNCTION            1
-             24 LOAD_GLOBAL              4 (M)
-             26 CALL_FUNCTION            3
-             28 GET_ITER
-        >>   30 FOR_ITER                10 (to 52)
-             32 STORE_FAST               2 (i)
-
-  7          34 LOAD_FAST                2 (i)
-             36 LOAD_FAST                0 (ls)
-             38 CONTAINS_OP              0
-             40 POP_JUMP_IF_FALSE       25 (to 50)
-
-  8          42 LOAD_FAST                1 (ct)
-             44 LOAD_CONST               2 (1)
-             46 INPLACE_ADD
-             48 STORE_FAST               1 (ct)
-        >>   50 JUMP_ABSOLUTE           15 (to 30)
-
-  6     >>   52 LOAD_CONST               0 (None)
-             54 RETURN_VALUE
-```
-
-:::::::::::::::::::::::::::::::::::::::::::::

 ## Example: Parsing data from a text file
````
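The removed callout compares the bytecode lengths of the two search implementations. The same comparison can be reproduced programmatically with `dis.get_instructions`; exact instruction counts vary between Python versions, so only the relative ordering is checked here, and `manual_search`/`operator_search` are simplified stand-ins for the lesson's functions:

```python
import dis

def manual_search(ls, target):
    # Explicit Python-level loop and indexing.
    for j in range(len(ls)):
        if ls[j] == target:
            return True
    return False

def operator_search(ls, target):
    # The `in` operator pushes the search loop into the C implementation.
    return target in ls

manual_ops = len(list(dis.get_instructions(manual_search)))
operator_ops = len(list(dis.get_instructions(operator_search)))

# The explicit loop compiles to more bytecode, i.e. more work for the
# Python interpreter per call.
print(manual_ops, operator_ops)
```

This matches the callout's conclusion: less bytecode generally means less time spent inside the interpreter loop.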
