Commit 34c06c6

Merge pull request #329 from AaltoSciComp/profiling-updates-2025

profiling: Updates for 2025 run

2 parents 6f3a315 + 3aab539

File tree: 2 files changed (+286 −26 lines)
content/profiling.md

Lines changed: 286 additions & 26 deletions

Classic quote to keep in mind: "Premature optimization is the root of all evil." [Donald Knuth]

:::{figure} https://imgs.xkcd.com/comics/is_it_worth_the_time.png
:alt: Humorous comic from xkcd that shows how long you can work on making a routine task more efficient before you spend more time than you save
Figure 1: Is it worth the time? ([xkcd#1205](https://xkcd.com/1205/))
:::

:::{discussion}
It is important to ask ourselves whether it is worth it.
- Is it worth spending e.g. 2 days to make a program run 20% faster?
- ...

Depends. What does it depend on?
:::

## Key steps of optimization

When you encounter a situation that you think would benefit from optimization,
follow these three steps:

1. **Measure:** Before making blind modifications, you should figure out which
   part of the code is actually the problem.

   This is analogous to medical doctors doing lab tests and taking X-rays
   to determine the disease. They won't start treatment without figuring out
   what is wrong with the patient.

2. **Diagnose:** When you have found the part of the code that is slow,
   you should determine why it is slow. Making changes without knowing why the
   code is slow can be counter-productive. Remember that not all slow parts
   can be endlessly optimized: some parts of the code might take time simply
   because they do a lot of work.

   This step is analogous to a doctor creating a specialized treatment
   program for a disease.

3. **Treat:** When you have found the slow part and figured out what causes
   it to be slow, you can then try to fix it.

   This is analogous to a doctor treating the disease with surgery or a
   prescription.

## Using profiling to measure your program

While diagnosing and treating depend heavily on the case at hand, the
measurement part can be done with tools and tactics that show where the
bottlenecks are. This is called **profiling**.

Profiling is recommended for everyone. Even experienced programmers
can be surprised by the results of profiling.

:::{discussion} Scale matters for profiling

Sometimes we can configure the system size (for instance the time step in a simulation,
the number of time steps, or the matrix dimensions) to make the program finish sooner.

For profiling, we should choose a system size that is **representative of the real-world**
use case. If we profile a program with a small input size, we might not see the same
bottlenecks as when running the program with a larger input size.

Often, when we scale up the system size, or scale the number of processors, new bottlenecks
might appear which we didn't see before. This brings us back to: "measure instead of guessing".

At the same time, adding more time steps or more iterations can mean that the
program does the same things over and over again. Thus you can sometimes
profile a program for a shorter time and then extrapolate the results to the
case where you run for longer. When doing this, be mindful to
profile enough that you can make proper extrapolations.
:::

## Simplest profiling: timers

The simplest way of determining where the **time-consuming parts** are is to
insert timers into your code:

```{code-block} python
import time

# ...
start = time.time()
some_function()
print(f"some_function took {time.time() - start} seconds")
# ...
```
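If you find yourself timing many sections, a small reusable helper keeps the
clutter down. A minimal sketch, not part of the lesson itself (`time.perf_counter`
is generally preferable to `time.time` for measuring durations):

```{code-block} python
import time
from contextlib import contextmanager

@contextmanager
def timer(label):
    # perf_counter is a monotonic clock, well suited to measuring durations
    start = time.perf_counter()
    try:
        yield
    finally:
        print(f"{label} took {time.perf_counter() - start:.3f} s")

with timer("sleeping"):
    time.sleep(0.5)
```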

An alternative solution that also improves your code's output is to
use Python's
[logging module](https://docs.python.org/3/library/logging.html) to log
important breakpoints in your code. You can then check the timestamps of
different log entries to see how long it took to execute a section
of your code.
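For instance, a minimal sketch of this idea (the function and log messages
here are made up for illustration):

```{code-block} python
import logging
import time

# Include a timestamp in every log entry so that the time between
# entries can be read directly from the log.
logging.basicConfig(format="%(asctime)s %(message)s", level=logging.INFO)

def some_function():
    time.sleep(1)  # stand-in for real work

logging.info("some_function starting")
some_function()
logging.info("some_function finished")
```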

## Better profiling: dedicated profiling tools

There are plenty of dedicated profiling tools that can be used to profile
your code. These can measure the CPU time and memory utilization, often at a
line-by-line level.

The list below is probably not complete, but it gives an overview of the
different tools available for profiling Python code.

**CPU profilers:**
- [cProfile and profile](https://docs.python.org/3/library/profile.html)
- [line_profiler](https://kernprof.readthedocs.io/)
- [py-spy](https://github.com/benfred/py-spy)
- [Yappi](https://github.com/sumerc/yappi)
- [pyinstrument](https://pyinstrument.readthedocs.io/)
- [Perfetto](https://perfetto.dev/docs/analysis/trace-processor-python)

**Memory profilers:**
- [memory_profiler](https://pypi.org/project/memory-profiler/) (not actively maintained)
- [Pympler](https://pympler.readthedocs.io/)
- [tracemalloc](https://docs.python.org/3/library/tracemalloc.html)
- [guppy/heapy](https://github.com/zhuyifei1999/guppy3/)

**CPU, memory and GPU:**
- [Scalene](https://github.com/plasma-umass/scalene)
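As a quick taste, many of these CPU profilers can be pointed at an unmodified
script from the command line. A sketch, where `darts.py` stands in for your
own script:

```{code-block} console
$ python -m cProfile -s tottime darts.py
$ pyinstrument darts.py
```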

## Tracing profilers vs. sampling profilers

...asked few passengers to help us by tracking their journey.
:::
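As a concrete illustration of the difference: a tracing profiler such as
cProfile hooks into every function call, while a sampling profiler such as
py-spy periodically inspects the call stack of the running process. A minimal
sketch of the sampling approach (the PID and script name are hypothetical):

```{code-block} console
$ # attach to an already running Python process and show a live, top-like view
$ py-spy top --pid 12345
$ # or record a flame graph while launching the program
$ py-spy record -o profile.svg -- python darts.py
```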

## Example profiling case: throwing darts to calculate pi

### Problem description

In this example we'll profile the following Python code:

```{code-block} python
import random
import math

def calculate_pi(n_darts):
    hits = 0
    for n in range(n_darts):
        i = random.random()
        j = random.random()
        r = math.sqrt(i*i + j*j)
        if r < 1:
            hits += 1
    pi = 4 * hits / n_darts
    return pi
```

This code implements a well-known example of the Monte Carlo method: by
throwing darts at a dartboard we can calculate an approximation of pi.

:::{figure} img/profiling/dartboard.png
:alt: Diagram that shows a unit circle inside a square with side length 1
:width: 50%
Figure 2: The algorithm throws darts at a dartboard and estimates pi by calculating the ratio of hits to throws
:::
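A quick sanity check of the function (a sketch; the output varies between
runs because the throws are random):

```{code-block} python
estimate = calculate_pi(1_000_000)
print(estimate)  # should print a value close to 3.14
```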
% Commented out for now, maybe not interesting for learners.
%
% The algorithm goes as follows:
%
% 1. Place a square with side length 1 around the origin.
% 2. Place a unit circle inside the square.
% 3. Pick a random point inside the square (this is like throwing a dart,
%    if the thrower was truly random).
% 4. If the throw landed inside the circle, count it as a hit.
% 5. Continue throwing darts until **N** darts have been thrown.
% 6. The probability of a throw landing inside the circle is
%
%    ```{math}
%    P(\mathrm{throw}) = \frac{A_{circle}}{A_{square}} = \frac{\pi}{4}
%    ```
%
%    and we can estimate this probability by calculating the ratio of hits
%    to throws:
%
%    ```{math}
%    \pi \approx 4\frac{N_{hits}}{N_{throws}}
%    ```
### Profiling the example

Let's run this with the `%%timeit` magic and ten million throws:

```{code-block} python
%%timeit
calculate_pi(10_000_000)
```
```{code-block} console
1.07 s ± 30.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
```

We can then profile the code with cProfile by using the `%%prun` magic.
Here we tell the magic to sort the results by the total time used and to give us
the top 5 time users:
```{code-block} python
%%prun -s tottime -l 5

def calculate_pi(n_darts):
    hits = 0
    for n in range(n_darts):
        i = random.random()
        j = random.random()
        r = math.sqrt(i*i + j*j)
        if r < 1:
            hits += 1
    pi = 4 * hits / n_darts
    return pi

calculate_pi(10_000_000)
```

```{code-block} console
30000742 function calls (30000736 primitive calls) in 4.608 seconds

   Ordered by: internal time
   List reduced from 150 to 5 due to restriction <5>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        4    2.545    0.636    4.005    1.001 {built-in method time.sleep}
 20000000    1.091    0.000    1.091    0.000 {method 'random' of '_random.Random' objects}
 10000000    0.571    0.000    0.571    0.000 {built-in method math.sqrt}
        3    0.172    0.057    0.270    0.090 {method 'poll' of 'select.epoll' objects}
        1    0.148    0.148    0.233    0.233 <string>:1(calculate_pi)
```

The output shows that most of the time is used by the `random.random` and
`math.sqrt` function calls. Those functions are called in every iteration of
the loop, so the profile makes sense.
### Naive optimization: switching to NumPy functions

A naive approach to optimization might be to simply switch to using NumPy functions
instead of basic Python functions. The code would then look like this:

```{code-block} python
---
emphasize-lines: 1,6-8
---
import numpy

def calculate_pi_numpy(n_darts):
    hits = 0
    for n in range(n_darts):
        i = numpy.random.random()
        j = numpy.random.random()
        r = numpy.sqrt(i*i + j*j)
        if r < 1:
            hits += 1
    pi = 4 * hits / n_darts
    return pi
```

However, if we do the same profiling, we'll find that the program is
**ten times slower**. Something must have gone wrong.
### Actual optimization: using vectorization

The reason for the bad performance is simple: we didn't actually reduce
the number of function calls, we just switched them to ones from NumPy. These
function calls have extra overhead because they have more complex logic compared
to the standard-library ones.

The actual speedup is right around the corner: switch from the costly for-loop
to vectorized calls. NumPy's functions can perform the operations for
all of our numbers in a single function call, without the for-loop:

```{code-block} python
---
emphasize-lines: 2-5
---
def calculate_numpy_fast(n_darts):
    i = numpy.random.random(n_darts)
    j = numpy.random.random(n_darts)
    r = numpy.sqrt(i*i + j*j)
    hits = (r < 1).sum()
    pi = 4 * hits / n_darts
    return pi
```

```{code-block} python
%%prun -s tottime -l 5

def calculate_numpy_fast(n_darts):
    i = numpy.random.random(n_darts)
    j = numpy.random.random(n_darts)
    r = numpy.sqrt(i*i + j*j)
    hits = (r < 1).sum()
    pi = 4 * hits / n_darts
    return pi

calculate_numpy_fast(10_000_000)
```

```{code-block} console
664 function calls (658 primitive calls) in 0.225 seconds

   Ordered by: internal time
   List reduced from 129 to 5 due to restriction <5>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.202    0.202    0.205    0.205 <string>:1(calculate_numpy_fast)
        2    0.010    0.005    0.010    0.005 {method '__exit__' of 'sqlite3.Connection' objects}
      2/1    0.009    0.005    0.214    0.214 <string>:1(<module>)
        1    0.002    0.002    0.002    0.002 {method 'reduce' of 'numpy.ufunc' objects}
        1    0.001    0.001    0.010    0.010 history.py:92(only_when_enabled)
```

So vectorizing the code achieved around a five-fold speedup.
### Profiling with scalene

The previous example can also be profiled with
[scalene](https://github.com/plasma-umass/scalene). Scalene is a sampling
profiler and it can be run in Jupyter as well with `%scrun` (line mode) and
`%%scalene` (cell mode).

Scalene will produce a nice-looking output containing line-by-line profiling
information.
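For instance, a minimal sketch of running it outside Jupyter, from the command
line (assuming the dart-throwing code is saved in a hypothetical script
`darts.py`):

```{code-block} console
$ pip install scalene
$ scalene darts.py
```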
## Exercises
