
Commit fe285f8

Merge pull request #287 from bast/radovan/profiling-exercise
add profiling exercise
2 parents 3754551 + d94db4d commit fe285f8

File tree

2 files changed: +117 −6 lines changed

content/profiling.md

Lines changed: 117 additions & 6 deletions
@@ -43,11 +43,122 @@
> - [ ] Introduce line profilers
> - [ ] Visualize one code example using `scalene`

-## Exercise
-> [!IMPORTANT]
-> Prepare two exercises for the last 20 minutes of this lecture.
-> Left to do:
-> - [ ] Provide exercise in pure python, Radovan has some ideas
-> - [ ] Provide exercise showing the improvement in performance when introducing numpy and/or pandas, Gregor will work on this

## Exercises

:::{exercise} Exercise Profiling-1
Work in progress: we will provide an exercise showing the improvement in
performance when introducing numpy and/or pandas.
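
In the meantime, here is a minimal sketch of the kind of comparison such an
exercise could make; the array size, the repetition count, and the use of
`np.dot` are illustrative choices of ours, not the final exercise:

```python
import timeit

import numpy as np

# The same data as a Python list and as a numpy array.
values = list(range(1_000_000))
array = np.arange(1_000_000)

# Pure Python: sum of squares with an explicit loop over the list.
python_time = timeit.timeit(lambda: sum(v * v for v in values), number=10)

# numpy: the same reduction runs as one vectorized call in compiled code.
numpy_time = timeit.timeit(lambda: np.dot(array, array), number=10)

print(f"pure Python: {python_time:.3f} s, numpy: {numpy_time:.3f} s")
```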
:::

::::{exercise} Exercise Profiling-2
In this exercise we will use the `scalene` profiler to find out where most of
the time is spent and most of the memory is used in a given code example.

Please try to go through the exercise in the following steps:
1. Make sure `scalene` is installed in your environment (if you have followed
   this course from the start and installed the recommended software
   environment, then it is).
1. Download Leo Tolstoy's "War and Peace" from the following link (the text is
   provided by [Project Gutenberg](https://www.gutenberg.org/)):
   <https://www.gutenberg.org/cache/epub/2600/pg2600.txt>
   (right-click and "save as" to download the file and **save it as "book.txt"**).
1. **Before** you run the profiler, try to predict in which function the code
   will spend most of the time and in which function it will use most of the
   memory.
1. Run the `scalene` profiler on the following code example and browse the
   generated HTML report to find out where most of the time is spent and where
   most of the memory is used:
   ```console
   $ scalene example.py
   ```
   Alternatively, write the report to a file and then open it in a browser:
   ```console
   $ scalene example.py --html > profile.html
   ```
   You can find an example of the generated HTML report in the solution below.
1. Does the result match your prediction? Can you explain the results?

```python
"""
The code below reads a text file and counts the number of unique words in it
(case-insensitive).
"""
import re


def count_unique_words1(file_path: str) -> int:
    """Read the whole file at once and count unique words using a set."""
    with open(file_path, "r", encoding="utf-8") as file:
        text = file.read()
    words = re.findall(r"\b\w+\b", text.lower())
    return len(set(words))


def count_unique_words2(file_path: str) -> int:
    """Read the file line by line and collect unique words in a list."""
    unique_words = []
    with open(file_path, "r", encoding="utf-8") as file:
        for line in file:
            words = re.findall(r"\b\w+\b", line.lower())
            for word in words:
                if word not in unique_words:
                    unique_words.append(word)
    return len(unique_words)


def count_unique_words3(file_path: str) -> int:
    """Read the file line by line and collect unique words in a set."""
    unique_words = set()
    with open(file_path, "r", encoding="utf-8") as file:
        for line in file:
            words = re.findall(r"\b\w+\b", line.lower())
            for word in words:
                unique_words.add(word)
    return len(unique_words)


def main():
    _result = count_unique_words1("book.txt")
    _result = count_unique_words2("book.txt")
    _result = count_unique_words3("book.txt")


if __name__ == "__main__":
    main()
```
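
If you would like a quick time-only cross-check using nothing but the standard
library, `cProfile` can be run on the same script (this alternative is an
addition of ours, not part of the exercise, and it does not measure memory):

```console
$ python -m cProfile -s cumulative example.py
```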

:::{solution}
```{figure} profiling/exercise2.png
:alt: Result of the profiling run for the above code example.
:width: 100%

Result of the profiling run for the above code example. You can click on the image to make it larger.
```

Results:
- Most time is spent in the `count_unique_words2` function.
- Most memory is used in the `count_unique_words1` function.

Explanation:
- The `count_unique_words2` function is the slowest because it **uses a list**
  to store unique words and checks if a word is already in the list before
  adding it.
  Checking whether a list contains an element may require traversing the
  whole list, which is an O(n) operation: each lookup gets slower as the
  list grows.
- The `count_unique_words1` and `count_unique_words3` functions are faster
  because they **use a set** to store unique words.
  Checking whether a set contains an element is an O(1) operation.
- The `count_unique_words1` function uses the most memory because it **creates
  a list of all words** in the text file and then **creates a set** from that
  list.
- The `count_unique_words3` function uses less memory because it traverses
  the text file line by line instead of reading the whole file into memory.
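The list-versus-set difference is easy to check directly. Below is a minimal
sketch using `timeit`; the number of words and the lookup target are
illustrative choices of ours:

```python
import timeit

# Build a list and a set containing the same 100,000 distinct "words".
words = [f"word{i}" for i in range(100_000)]
word_list = list(words)
word_set = set(words)

# Looking up a word near the end of the list forces a near-full traversal: O(n).
print(timeit.timeit(lambda: "word99999" in word_list, number=100))

# The same lookup in a set is a hash lookup: O(1) on average.
print(timeit.timeit(lambda: "word99999" in word_set, number=100))
```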

What we can learn from this exercise:
- When processing large files, it can be good to read them line by line or in
  batches instead of reading the whole file into memory (see the sketch below).
- It is good to get an overview of the standard data structures and their
  advantages and disadvantages (e.g. appending an element to a list is fast,
  but checking whether the list already contains that element is slow).
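
As a sketch of the batch idea, here is one possible variant of the word
counter; the function name and the default batch size are our own, not part of
the exercise:

```python
import re


def count_unique_words_batched(file_path: str, batch_size: int = 10_000) -> int:
    """Count unique words while keeping at most batch_size lines in memory."""
    unique_words = set()
    batch = []
    with open(file_path, "r", encoding="utf-8") as file:
        for line in file:
            batch.append(line)
            if len(batch) == batch_size:
                # Process the accumulated lines in one go, then drop them.
                unique_words.update(re.findall(r"\b\w+\b", "".join(batch).lower()))
                batch = []
    if batch:
        # Process whatever is left over after the last full batch.
        unique_words.update(re.findall(r"\b\w+\b", "".join(batch).lower()))
    return len(unique_words)
```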
:::
::::

content/profiling/exercise2.png (243 KB)
