:::::::::::::::::::::::::::::::::::::: questions

- Why is it faster to read/write a single 100 MB file than 100 files of 1 MB each?
- How many orders of magnitude slower are disk accesses than RAM?
- What's the cost of creating a list?

::::::::::::::::::::::::::::::::::::::::::::::::

::::::::::::::::::::::::::::::::::::: objectives

- Able to identify the relationship between different latencies relevant to software.
- Demonstrate how to implement parallel network requests.
- Justify the re-use of existing variables over creating new ones.

::::::::::::::::::::::::::::::::::::::::::::::::
## Accessing Disk
## Accessing the Network

When transferring files over a network, similar effects apply. There is a fixed overhead for every file transfer (no matter how big the file is), so downloading many small files will be slower than downloading a single large file of the same total size.

Because of this overhead, downloading many small files often does not use all the available bandwidth. It may be possible to speed things up by parallelising downloads.
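
As a rough sketch of this idea, using only Python's standard library (the URL list below is a placeholder, and the ideal worker count depends on your connection), a thread pool can overlap several transfers so their fixed overheads are paid concurrently:

```python
from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlretrieve

# Placeholder URLs: substitute the real files you need to fetch
urls = [f"https://example.com/data/file_{i}.csv" for i in range(100)]

def download(url):
    filename = url.rsplit("/", 1)[-1]  # save under the file's own name
    urlretrieve(url, filename)
    return filename

# Each thread spends most of its time waiting on the network, so
# several transfers can be in flight at once
with ThreadPoolExecutor(max_workers=8) as executor:
    for filename in executor.map(download, urls):
        print(f"downloaded {filename}")
```

Because the workers are blocked on I/O rather than computation, threads are sufficient here; multiprocessing is not required.
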
{alt="A horizontal bar chart displaying the relative latencies for L1/L2/L3 cache, RAM, SSD, HDD and a packet being sent from London to California and back. These latencies range from 1 nanosecond to 140 milliseconds and are displayed with a log scale."}

Typically, the lower the latency, the higher the effective bandwidth (L1 and L2 cache have 1 TB/s, RAM 100 GB/s, SSDs up to 32 GB/s, HDDs up to 150 MB/s), making large memory transactions on the slower tiers even slower.
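
As a little illustrative arithmetic using the approximate figures quoted above, the time taken to move 1 GB at each tier makes the gap concrete:

```python
# Approximate bandwidths from above, in GB/s (HDD: 150 MB/s = 0.15 GB/s)
bandwidth_gb_per_s = {"L1/L2 cache": 1000, "RAM": 100, "SSD": 32, "HDD": 0.15}

for tier, rate in bandwidth_gb_per_s.items():
    print(f"{tier}: {1 / rate:.4f} s to move 1 GB")  # e.g. HDD ~6.7 s
```
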
## Memory Allocation is not Free
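
As a minimal sketch of this principle using NumPy (the array size is illustrative): writing results into an existing array via the `out` parameter reuses memory that has already been allocated, whereas a plain expression allocates a fresh temporary array for every intermediate result:

```python
import numpy as np

a = np.random.rand(1_000_000)

# Allocates one temporary array for a * 2, then another for the + 1
b = a * 2 + 1

# Reuses a single pre-allocated buffer for both steps instead
out = np.empty_like(a)
np.multiply(a, 2, out=out)  # out = a * 2, no new allocation
np.add(out, 1, out=out)     # out = out + 1, in place
```

Inside a loop that executes many times, the second form avoids churning through freshly allocated (and soon discarded) arrays on every iteration.
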
::::::::::::::::::::::::::::::::::::: keypoints

- One large file is preferable to many small files.
- Network requests can be parallelised to reduce the impact of fixed overheads.
- Memory allocation is not free; avoiding destroying and recreating objects can improve performance.

::::::::::::::::::::::::::::::::::::::::::::::::

- [Viewing Python's ByteCode](#viewing-pythons-bytecode): What the Python code you write compiles to and executes as.
- [Hardware Level Memory Accesses](#hardware-level-memory-accesses): A look at how memory accesses pass through a processor's caches.

## Viewing Python's ByteCode

```python
dis.dis(operatorSearch)
```

```
54 RETURN_VALUE
```

## Hardware Level Memory Accesses

The storage and movement of data plays a large role in the performance of executing software.

<!-- Brief summary of hardware -->
Modern computers typically have a single processor (CPU); within this processor there are multiple processing cores, each capable of executing different code in parallel.

Data held in memory by running software exists in RAM; this memory is faster to access than hard drives (and solid-state drives).
But the CPU has much smaller caches on-board, to make accessing the most recently used variables even faster.

{alt="An annotated photo of inside a desktop computer's case. The CPU, RAM, power supply, graphics cards (GPUs) and hard drive are labelled."}

<!-- Read/operate on variable ram->cpu cache->registers->cpu -->
When reading a variable to perform an operation with it, the CPU will first look in its registers. These exist per core; they are the location where computation is actually performed. Accessing them is incredibly fast, but there is only enough storage for around 32 variables (each register typically holds 4 or 8 bytes).
As the register file is so small, most variables won't be found there and the CPU's caches will be searched.
It will first check the current processing core's L1 (Level 1) cache; this small cache (typically 64 KB per physical core) is the smallest and fastest-to-access cache on a CPU.
If the variable is not found in the L1 cache, the L2 cache that is shared between multiple cores will be checked. This shared cache is slower to access, but larger than L1 (typically 1-3 MB per core).
This process then repeats for the L3 cache, which may be shared among all cores of the CPU. This cache again has higher access latency, but increased size (typically slightly larger than the total L2 cache size).
If the variable has not been found in any of the CPU's caches, the CPU will look to the computer's RAM. This is an order of magnitude slower to access, with several orders of magnitude greater capacity (tens to hundreds of GB are now standard).
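
The search order described above can be summarised with a toy model; the latencies here are rough illustrative figures, not measurements of any real CPU:

```python
# Approximate cost, in nanoseconds, of checking each level of the hierarchy
LATENCY_NS = {"register": 0.5, "L1": 1.0, "L2": 4.0, "L3": 20.0, "RAM": 100.0}

def lookup_cost(name, contents):
    """Sum the cost of checking each level in turn until `name` is found."""
    total = 0.0
    for level in ("register", "L1", "L2", "L3", "RAM"):
        total += LATENCY_NS[level]
        if name in contents[level]:
            return total
    raise KeyError(name)

contents = {"register": set(), "L1": {"x"}, "L2": set(), "L3": set(), "RAM": {"x", "y"}}
print(lookup_cost("x", contents))  # found early, in L1: cheap
print(lookup_cost("y", contents))  # falls all the way through to RAM: expensive
```
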

Correspondingly, the earlier the CPU finds the variable, the faster it will be to access.
However, to fully understand the caches it's necessary to explain what happens once a variable has been found.

If a variable is not found in the caches, it must be fetched from RAM.
The full 64-byte cache line containing the variable will be copied first into the CPU's L3, then L2 and then L1.
Most variables are only 4 or 8 bytes, so many neighbouring variables are also pulled into the caches.
Similarly, adding new data to a cache evicts old data.
This means that reading 16 integers stored contiguously in memory should be faster than reading 16 scattered integers.

Therefore, to **optimally** access variables they should be stored contiguously in memory with related data, and worked on whilst they remain in caches.
If you add to a variable, perform a large amount of unrelated processing, then add to the variable again, it will likely have been evicted from the caches and need to be reloaded from slower RAM again.

<!-- Latency/Throughput typically inversely proportional to capacity -->
It's not necessary to remember the full detail of how memory accesses work within a computer, but the context perhaps helps to understand why memory locality is important.

{alt='An abstract representation of a CPU, RAM and Disk, showing their internal caches and the pathways data can pass.'}

::::::::::::::::::::::::::::::::::::: callout

Python as a programming language does not give you enough control to carefully pack your variables in this manner (every variable is an object, so it's stored as a pointer that redirects to the actual data stored elsewhere).

However, all is not lost: packages such as `numpy` and `pandas`, implemented in C/C++, enable Python users to take advantage of efficient memory accesses (when they are used correctly).

:::::::::::::::::::::::::::::::::::::::::::::
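
As a small demonstration of the cache-line effect described earlier (the array size is illustrative and exact timings vary by machine), gathering the same NumPy elements in sequential order makes full use of each 64-byte cache line, whereas a shuffled order forces far more cache misses:

```python
import timeit
import numpy as np

rng = np.random.default_rng()
data = np.arange(16_000_000, dtype=np.int32)

sequential = np.arange(len(data))       # indices in contiguous order
scattered = rng.permutation(len(data))  # the same indices, shuffled

# Both lines gather identical elements; only the access pattern differs
t_seq = timeit.timeit(lambda: data[sequential].sum(), number=10)
t_scat = timeit.timeit(lambda: data[scattered].sum(), number=10)
print(f"sequential: {t_seq:.2f} s, scattered: {t_scat:.2f} s")
```
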