
Commit 4fe98fc

committed
add profiling exercise
1 parent 3754551

File tree

2 files changed: +111 -6 lines changed

content/profiling.md

Lines changed: 111 additions & 6 deletions
@@ -43,11 +43,116 @@
> - [ ] Introduce line profilers
> - [ ] Visualize one code example using `scalene`

## Exercise

> [!IMPORTANT]
> Prepare two exercises for the last 20 minutes of this lecture.
> Left to do:
> - [ ] Provide an exercise in pure Python, Radovan has some ideas
> - [ ] Provide an exercise showing the performance improvement when introducing NumPy and/or pandas, Gregor will work on this

## Exercises

:::{exercise} Exercise Profiling-1
Work in progress: we will provide an exercise showing the performance
improvement gained by introducing NumPy and/or pandas.
:::
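
The exercise above is not yet written. As a rough illustration of the kind of comparison it could make (the function names and data here are our own placeholders, not part of the course material), one might contrast a pure-Python loop with its vectorized NumPy equivalent:

```python
import numpy as np


def sum_of_squares_python(values):
    # Pure-Python loop: every element is a boxed float and every
    # multiplication goes through the interpreter.
    total = 0.0
    for v in values:
        total += v * v
    return total


def sum_of_squares_numpy(values):
    # NumPy: one vectorized call over a contiguous array of machine floats.
    arr = np.asarray(values, dtype=float)
    return float(np.dot(arr, arr))


data = [0.5] * 100_000
assert abs(sum_of_squares_python(data) - sum_of_squares_numpy(data)) < 1e-6
```

Profiling both functions on a large input would then show the interpreter overhead of the pure-Python version directly.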
::::{exercise} Exercise Profiling-2
In this exercise we will use the `scalene` profiler to find out where most of
the time is spent and most of the memory is used in a given code example.

Please try to go through the exercise in the following steps:
1. Make sure `scalene` is installed in your environment (if you have followed
   this course from the start and installed the recommended software
   environment, then it is).
1. Download Leo Tolstoy's "War and Peace" from the following link (the text is
   provided by [Project Gutenberg](https://www.gutenberg.org/)):
   <https://www.gutenberg.org/cache/epub/2600/pg2600.txt>
   (right-click and "save as" to download the file and **save it as "book.txt"**).
1. **Before** you run the profiler, try to predict in which function the code
   will spend most of the time and in which function it will use most of the
   memory.
1. Run the `scalene` profiler on the following code example and browse the
   generated HTML report to find out where most of the time is spent and where
   most of the memory is used:
   ```console
   $ scalene example.py
   ```
   You can find an example of the generated HTML report in the solution below.
1. Does the result match your prediction? Can you explain the results?
```python
"""
The code below reads a text file and counts the number of unique words in it
(case-insensitive).
"""
import re


def count_unique_words1(file_path: str) -> int:
    with open(file_path, "r", encoding="utf-8") as file:
        text = file.read()
    words = re.findall(r"\b\w+\b", text.lower())
    return len(set(words))


def count_unique_words2(file_path: str) -> int:
    unique_words = []
    with open(file_path, "r", encoding="utf-8") as file:
        for line in file:
            words = re.findall(r"\b\w+\b", line.lower())
            for word in words:
                if word not in unique_words:
                    unique_words.append(word)
    return len(unique_words)


def count_unique_words3(file_path: str) -> int:
    unique_words = set()
    with open(file_path, "r", encoding="utf-8") as file:
        for line in file:
            words = re.findall(r"\b\w+\b", line.lower())
            for word in words:
                unique_words.add(word)
    return len(unique_words)


def main():
    _result = count_unique_words1("book.txt")
    _result = count_unique_words2("book.txt")
    _result = count_unique_words3("book.txt")


if __name__ == "__main__":
    main()
```
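
Before reaching for a profiler, a quick wall-clock comparison can already confirm that the approaches agree and hint at the timing gap. A minimal sketch (with compact re-implementations of two of the variants above; the file name `book_sample.txt` and its contents are our own small stand-in for `book.txt`):

```python
import re
import time


def count_with_list(path):
    # Mirrors count_unique_words2: list membership test is O(n).
    unique = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            for word in re.findall(r"\b\w+\b", line.lower()):
                if word not in unique:
                    unique.append(word)
    return len(unique)


def count_with_set(path):
    # Mirrors count_unique_words3: set membership test is O(1) on average.
    unique = set()
    with open(path, encoding="utf-8") as f:
        for line in f:
            unique.update(re.findall(r"\b\w+\b", line.lower()))
    return len(unique)


# Generate a small stand-in file so the sketch runs on its own.
with open("book_sample.txt", "w", encoding="utf-8") as f:
    f.write("War and peace war AND Peace the end\n" * 200)

for func in (count_with_list, count_with_set):
    start = time.perf_counter()
    n = func("book_sample.txt")
    elapsed = time.perf_counter() - start
    print(f"{func.__name__}: {n} unique words in {elapsed:.4f} s")
```

A profiler goes further than such timing: it attributes the time (and memory) to individual lines rather than whole function calls.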
:::{solution}
```{figure} profiling/exercise2.png
:alt: Result of the profiling run for the above code example.
:width: 100%

Result of the profiling run for the above code example. You can click on the image to make it larger.
```

Results:
- Most time is spent in the `count_unique_words2` function.
- Most memory is used in the `count_unique_words1` function.

Explanation:
- The `count_unique_words2` function is the slowest because it **uses a list**
  to store unique words and checks whether a word is already in the list before
  adding it.
  Checking whether a list contains an element may require traversing the
  whole list, which is an O(n) operation.
- The `count_unique_words1` and `count_unique_words3` functions are faster
  because they **use a set** to store unique words.
  Checking whether a set contains an element is an O(1) operation on average.
- The `count_unique_words1` function uses the most memory because it **creates
  a list of all words** in the text file and then **creates a set** from that
  list.
- The `count_unique_words3` function uses less memory because it traverses
  the text file line by line instead of reading the whole file into memory.

What we can learn from this exercise:
- When processing large files, it can be good to read them line by line
  instead of reading the whole file into memory.
- It is good to have an overview of the standard data structures and their
  advantages and disadvantages (e.g. appending an element to a list is fast,
  but checking whether the list already contains the element is slow).
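
The list-versus-set point can be seen directly with a micro-benchmark using only the standard library (the sizes and repeat counts below are chosen arbitrarily for illustration):

```python
import timeit

n = 10_000
as_list = list(range(n))
as_set = set(as_list)
probe = n - 1  # worst case for the list: the element sits at the end

# Membership test: O(n) scan for the list, O(1) hash lookup for the set.
t_list = timeit.timeit(lambda: probe in as_list, number=1000)
t_set = timeit.timeit(lambda: probe in as_set, number=1000)

print(f"list: {t_list:.4f} s, set: {t_set:.4f} s")
```

On typical hardware the set lookup is orders of magnitude faster, which matches the gap scalene reports between `count_unique_words2` and `count_unique_words3`.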
:::
::::

content/profiling/exercise2.png

243 KB

0 commit comments
