Commit 34c06c6

Merge pull request #329 from AaltoSciComp/profiling-updates-2025

profiling: Updates for 2025 run

2 parents 6f3a315 + 3aab539

File tree: 2 files changed (+286 −26 lines)
content/profiling.md

Lines changed: 286 additions & 26 deletions

Classic quote to keep in mind: "Premature optimization is the root of all evil." [Donald Knuth]

:::{figure} https://imgs.xkcd.com/comics/is_it_worth_the_time.png
:alt: Humorous comic from xkcd that shows how long you can work on making a routine task more efficient before you spend more time than you save
Figure 1: Is it worth the time? ([xkcd#1205](https://xkcd.com/1205/))
:::

:::{discussion}
It is important to ask ourselves whether it is worth it.
- Is it worth spending e.g. 2 days to make a program run 20% faster?
- ...

Depends. What does it depend on?
:::

## Key steps of optimization

When you encounter a situation that you think would benefit from optimization,
follow these three steps:

1. **Measure:** Before making blind modifications, you should figure out which
   part of the code is actually the problem.

   This is analogous to medical doctors doing lab tests and taking X-rays
   to determine the disease. They won't start treatment without figuring out
   what is wrong with the patient.

2. **Diagnose:** When you have found the part of the code that is slow,
   you should determine why it is slow. Making changes without knowing why the
   code is slow can be counter-productive. Remember that not all slow parts
   can be endlessly optimized: some parts of the code might take time simply
   because they do a lot of work.

   This step is analogous to a doctor creating a specialized treatment
   program for a disease.

3. **Treat:** When you have found the slow part and figured out what causes
   it to be slow, you can then try to fix it.

   This is analogous to a doctor treating the disease with surgery or a
   prescription.

## Using profiling to measure your program

While diagnosing and treating depend heavily on the case at hand, the
measurement part can be done with tools and tactics that show where the
bottlenecks are. This is called **profiling**.

Profiling is recommended for everyone. Even experienced programmers
can be surprised by the results of profiling.

:::{discussion} Scale matters for profiling

Sometimes we can configure the system size (for instance the time step in a simulation,
the number of time steps, or the matrix dimensions) to make the program finish sooner.

For profiling, we should choose a system size that is **representative of the real-world**
use case. If we profile a program with a small input size, we might not see the same
bottlenecks as when running the program with a larger input size.

Often, when we scale up the system size, or scale the number of processors, new bottlenecks
might appear which we didn't see before. This brings us back to: "measure instead of guessing".

At the same time, adding more time steps or more iterations can mean that the
program does the same things over and over again. Thus you can sometimes
profile a program for a shorter time and then extrapolate the results to the
case where you run for longer. When doing this, be mindful to
profile enough that you can make proper extrapolations.
:::

## Simplest profiling: timers

The simplest way of determining where the **time-consuming parts** are is to
insert timers into your code:

```{code-block} python
import time

# ...
start = time.time()
some_function()
print(f"some_function took {time.time() - start} seconds")
# ...
```
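If you find yourself timing many sections, a small reusable helper keeps the
clutter down. A minimal sketch, not part of the lesson itself (`time.perf_counter`
is generally preferable to `time.time` for measuring durations):

```{code-block} python
import time
from contextlib import contextmanager

@contextmanager
def timer(label):
    # perf_counter is a monotonic clock, well suited to measuring durations
    start = time.perf_counter()
    try:
        yield
    finally:
        print(f"{label} took {time.perf_counter() - start:.3f} s")

with timer("sleeping"):
    time.sleep(0.5)
```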

An alternative solution that also improves your code's output is to
use Python's
[logging module](https://docs.python.org/3/library/logging.html) to log
important breakpoints in your code. You can then check the timestamps of
different log entries to see how long it took to execute a section
of your code.
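For instance, a minimal sketch of this idea (the function and log messages
here are made up for illustration):

```{code-block} python
import logging
import time

# Include a timestamp in every log entry so that the time between
# entries can be read directly from the log.
logging.basicConfig(format="%(asctime)s %(message)s", level=logging.INFO)

def some_function():
    time.sleep(1)  # stand-in for real work

logging.info("some_function starting")
some_function()
logging.info("some_function finished")
```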

## Better profiling: dedicated profiling tools

There are plenty of dedicated profiling tools that can be used to profile
your code. These can measure the CPU time and memory utilization, often at a
line-by-line level.

The list below is probably not complete, but it gives an overview of the
different tools available for profiling Python code.

**CPU profilers:**
- [cProfile and profile](https://docs.python.org/3/library/profile.html)
- [line_profiler](https://kernprof.readthedocs.io/)
- [py-spy](https://github.com/benfred/py-spy)
- [Yappi](https://github.com/sumerc/yappi)
- [pyinstrument](https://pyinstrument.readthedocs.io/)
- [Perfetto](https://perfetto.dev/docs/analysis/trace-processor-python)

**Memory profilers:**
- [memory_profiler](https://pypi.org/project/memory-profiler/) (not actively maintained)
- [Pympler](https://pympler.readthedocs.io/)
- [tracemalloc](https://docs.python.org/3/library/tracemalloc.html)
- [guppy/heapy](https://github.com/zhuyifei1999/guppy3/)

**CPU, memory and GPU:**
- [Scalene](https://github.com/plasma-umass/scalene)
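As a quick taste, many of these CPU profilers can be pointed at an unmodified
script from the command line. A sketch, where `darts.py` stands in for your
own script:

```{code-block} console
$ python -m cProfile -s tottime darts.py
$ pyinstrument darts.py
```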

## Tracing profilers vs. sampling profilers

...asked few passengers to help us by tracking their journey.
:::
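As a concrete illustration of the difference: a tracing profiler such as
cProfile hooks into every function call, while a sampling profiler such as
py-spy periodically inspects the call stack of the running process. A minimal
sketch of the sampling approach (the PID and script name are hypothetical):

```{code-block} console
$ # attach to an already running Python process and show a live, top-like view
$ py-spy top --pid 12345
$ # or record a flame graph while launching the program
$ py-spy record -o profile.svg -- python darts.py
```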

## Example profiling case: throwing darts to calculate pi

### Problem description

In this example we'll profile the following Python code:

```{code-block} python
import random
import math

def calculate_pi(n_darts):
    hits = 0
    for n in range(n_darts):
        i = random.random()
        j = random.random()
        r = math.sqrt(i*i + j*j)
        if r < 1:
            hits += 1
    pi = 4 * hits / n_darts
    return pi
```

This code implements a well-known example of the Monte Carlo method: by
throwing darts at a dartboard we can calculate an approximation of pi.

:::{figure} img/profiling/dartboard.png
:alt: Diagram that shows a unit circle inside a square with side length 1
:width: 50%
Figure 2: The algorithm throws darts at a dartboard and estimates pi by calculating the ratio of hits to throws
:::
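A quick sanity check of the function (a sketch; the output varies between
runs because the throws are random):

```{code-block} python
estimate = calculate_pi(1_000_000)
print(estimate)  # should print a value close to 3.14
```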
% Commented out for now, maybe not interesting for learners.
%
% The algorithm goes as follows:
%
% 1. Place a square with side length 1 around the origin.
% 2. Place a unit circle inside the square.
% 3. Pick a random point inside the square (this is like throwing a dart,
%    if the thrower was truly random).
% 4. If the throw landed inside the circle, count it as a hit.
% 5. Continue throwing darts until **N** darts have been thrown.
% 6. The probability of a throw landing inside the circle is
%
%    ```{math}
%    P(\mathrm{throw}) = \frac{A_{circle}}{A_{square}} = \frac{\pi}{4}
%    ```
%
%    and we can estimate this probability by calculating the ratio of hits
%    to throws:
%
%    ```{math}
%    \pi \approx 4\frac{N_{hits}}{N_{throws}}
%    ```
### Profiling the example

Let's run this with the `%%timeit` magic and ten million throws:

```{code-block} python
%%timeit
calculate_pi(10_000_000)
```
```{code-block} console
1.07 s ± 30.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
```

We can then profile the code with cProfile by using the `%%prun` magic.
Here we tell the magic to sort the results by the total time used and to give us
the top 5 time users:
```{code-block} python
%%prun -s tottime -l 5

def calculate_pi(n_darts):
    hits = 0
    for n in range(n_darts):
        i = random.random()
        j = random.random()
        r = math.sqrt(i*i + j*j)
        if r < 1:
            hits += 1
    pi = 4 * hits / n_darts
    return pi

calculate_pi(10_000_000)
```

```{code-block} console
30000742 function calls (30000736 primitive calls) in 4.608 seconds

   Ordered by: internal time
   List reduced from 150 to 5 due to restriction <5>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        4    2.545    0.636    4.005    1.001 {built-in method time.sleep}
 20000000    1.091    0.000    1.091    0.000 {method 'random' of '_random.Random' objects}
 10000000    0.571    0.000    0.571    0.000 {built-in method math.sqrt}
        3    0.172    0.057    0.270    0.090 {method 'poll' of 'select.epoll' objects}
        1    0.148    0.148    0.233    0.233 <string>:1(calculate_pi)
```

The output shows that most of the time is used by the `random.random` and
`math.sqrt` function calls. Those functions are called in every iteration of
the loop, so the profile makes sense.
### Naive optimization: switching to NumPy functions

A naive approach to optimization might be to simply switch to using NumPy functions
instead of basic Python functions. The code would then look like this:

```{code-block} python
---
emphasize-lines: 1,6-8
---
import numpy

def calculate_pi_numpy(n_darts):
    hits = 0
    for n in range(n_darts):
        i = numpy.random.random()
        j = numpy.random.random()
        r = numpy.sqrt(i*i + j*j)
        if r < 1:
            hits += 1
    pi = 4 * hits / n_darts
    return pi
```

However, if we do the same profiling, we'll find that the program is
**ten times slower**. Something must have gone wrong.
### Actual optimization: using vectorization

The reason for the bad performance is simple: we didn't actually reduce
the number of function calls, we just switched them to ones from NumPy. These
function calls have extra overhead because they have more complex logic compared
to the standard-library ones.

The actual speedup is right around the corner: switch from the costly for-loop
to vectorized calls. NumPy's functions can perform the operations for
all of our numbers in a single function call, without the for-loop:

```{code-block} python
---
emphasize-lines: 2-5
---
def calculate_numpy_fast(n_darts):
    i = numpy.random.random(n_darts)
    j = numpy.random.random(n_darts)
    r = numpy.sqrt(i*i + j*j)
    hits = (r < 1).sum()
    pi = 4 * hits / n_darts
    return pi
```

```{code-block} python
%%prun -s tottime -l 5

def calculate_numpy_fast(n_darts):
    i = numpy.random.random(n_darts)
    j = numpy.random.random(n_darts)
    r = numpy.sqrt(i*i + j*j)
    hits = (r < 1).sum()
    pi = 4 * hits / n_darts
    return pi

calculate_numpy_fast(10_000_000)
```

```{code-block} console
664 function calls (658 primitive calls) in 0.225 seconds

   Ordered by: internal time
   List reduced from 129 to 5 due to restriction <5>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.202    0.202    0.205    0.205 <string>:1(calculate_numpy_fast)
        2    0.010    0.005    0.010    0.005 {method '__exit__' of 'sqlite3.Connection' objects}
      2/1    0.009    0.005    0.214    0.214 <string>:1(<module>)
        1    0.002    0.002    0.002    0.002 {method 'reduce' of 'numpy.ufunc' objects}
        1    0.001    0.001    0.010    0.010 history.py:92(only_when_enabled)
```

So vectorizing the code achieved around a five-fold speedup.
### Profiling with scalene

The previous example can also be profiled with
[scalene](https://github.com/plasma-umass/scalene). Scalene is a sampling
profiler and it can be run in Jupyter as well with `%scrun` (line mode) and
`%%scalene` (cell mode).

Scalene will produce a nice-looking output containing line-by-line profiling
information.
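For instance, a minimal sketch of running it outside Jupyter, from the command
line (assuming the dart-throwing code is saved in a hypothetical script
`darts.py`):

```{code-block} console
$ pip install scalene
$ scalene darts.py
```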
## Exercises
