|
 > - [ ] Introduce line profilers
 > - [ ] Visualize one code example using `scalene`
 
-## Exercise
-> [!IMPORTANT]
-> Prepare two exercises for the last 20 minutes of this lecture.
-> Left to do:
-> - [ ] Provide exercise in pure python, Radovan has some ideas
-> - [ ] Provide exercise showing the improvement in performance when introducing numpy and/or pandas, Gregor will work on this
 
+## Exercises
+
+:::{exercise} Exercise Profiling-1
+Work in progress: we will provide an exercise showing the improvement in
+performance when introducing NumPy and/or pandas.
+:::
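+
+Until that exercise is ready, here is a minimal sketch of the kind of
+comparison it could show (our illustration, not the final exercise; the array
+size and the operation, a sum of squares, are arbitrary choices):
+
+```python
+import timeit
+
+import numpy as np
+
+n = 1_000_000
+data = list(range(n))
+arr = np.arange(n, dtype=np.int64)
+
+
+# Pure Python: loop over a list and accumulate the sum of squares.
+def sum_of_squares_python(values):
+    total = 0
+    for x in values:
+        total += x * x
+    return total
+
+
+# NumPy: the same computation as a single vectorized expression.
+def sum_of_squares_numpy(values):
+    return int((values * values).sum())
+
+
+print(timeit.timeit(lambda: sum_of_squares_python(data), number=10))
+print(timeit.timeit(lambda: sum_of_squares_numpy(arr), number=10))
+```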
+
+::::{exercise} Exercise Profiling-2
+In this exercise we will use the `scalene` profiler to find out where a
+given code example spends most of its time and uses most of its memory.
+
+Please try to go through the exercise in the following steps:
+1. Make sure `scalene` is installed in your environment (if you have followed
+   this course from the start and installed the recommended software
+   environment, then it is).
+1. Download Leo Tolstoy's "War and Peace" from the following link (the text is
+   provided by [Project Gutenberg](https://www.gutenberg.org/)):
+   <https://www.gutenberg.org/cache/epub/2600/pg2600.txt>
+   (right-click and "save as" to download the file and **save it as "book.txt"**).
+1. **Before** you run the profiler, try to predict in which function the code
+   will spend most of the time and in which function it will use most of the
+   memory.
+1. Run the `scalene` profiler on the following code example and browse the
+   generated HTML report to find out where most of the time is spent and where
+   most of the memory is used:
+   ```console
+   $ scalene example.py
+   ```
+   Alternatively, you can write the report to a file and then open the
+   generated file in a browser:
+   ```console
+   $ scalene example.py --html > profile.html
+   ```
+   You can find an example of the generated HTML report in the solution below.
+1. Does the result match your prediction? Can you explain the results?
+
+```python
+"""
+The code below reads a text file and counts the number of unique words in it
+(case-insensitive).
+"""
+import re
+
+
+# Variant 1: read the whole file into memory at once, collect all words
+# in a list, then build a set from that list.
+def count_unique_words1(file_path: str) -> int:
+    with open(file_path, "r", encoding="utf-8") as file:
+        text = file.read()
+        words = re.findall(r"\b\w+\b", text.lower())
+        return len(set(words))
+
+
+# Variant 2: read line by line; store unique words in a list and check
+# membership before appending.
+def count_unique_words2(file_path: str) -> int:
+    unique_words = []
+    with open(file_path, "r", encoding="utf-8") as file:
+        for line in file:
+            words = re.findall(r"\b\w+\b", line.lower())
+            for word in words:
+                if word not in unique_words:
+                    unique_words.append(word)
+    return len(unique_words)
+
+
+# Variant 3: read line by line; store unique words in a set.
+def count_unique_words3(file_path: str) -> int:
+    unique_words = set()
+    with open(file_path, "r", encoding="utf-8") as file:
+        for line in file:
+            words = re.findall(r"\b\w+\b", line.lower())
+            for word in words:
+                unique_words.add(word)
+    return len(unique_words)
+
+
+def main():
+    _result = count_unique_words1("book.txt")
+    _result = count_unique_words2("book.txt")
+    _result = count_unique_words3("book.txt")
+
+
+if __name__ == "__main__":
+    main()
+```
+
+:::{solution}
+  ```{figure} profiling/exercise2.png
+  :alt: Result of the profiling run for the above code example.
+  :width: 100%
+
+  Result of the profiling run for the above code example. You can click on the image to make it larger.
+  ```
+
+  Results:
+  - Most time is spent in the `count_unique_words2` function.
+  - Most memory is used in the `count_unique_words1` function.
+
+  Explanation:
+  - The `count_unique_words2` function is the slowest because it **uses a
+    list** to store unique words and checks whether a word is already in the
+    list before adding it.
+    Checking whether a list contains an element may require traversing the
+    whole list, an O(n) operation, so each lookup gets slower as the list
+    grows (see the timing sketch below).
+  - The `count_unique_words1` and `count_unique_words3` functions are faster
+    because they **use a set** to store unique words.
+    Checking whether a set contains an element is an O(1) operation on
+    average.
+  - The `count_unique_words1` function uses the most memory because it
+    **creates a list of all words** in the text file and then **creates a
+    set** from that list.
+  - The `count_unique_words3` function uses less memory because it traverses
+    the text file line by line instead of reading the whole file into memory.
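+
+  The list-versus-set difference is easy to see with a small timing sketch
+  (our illustration, not part of the exercise; the word count and the looked-up
+  word are arbitrary choices):
+  ```python
+  import timeit
+
+  # Build the same 100 000 distinct "words" as both a list and a set.
+  words = [str(i) for i in range(100_000)]
+  word_list = list(words)
+  word_set = set(words)
+
+  # Worst case for the list: the word we look up sits at the end, so the
+  # whole list is scanned. The set lookup hashes directly to the element.
+  print(timeit.timeit(lambda: "99999" in word_list, number=100))
+  print(timeit.timeit(lambda: "99999" in word_set, number=100))
+  ```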
 
+  What we can learn from this exercise:
+  - When processing large files, it can be better to read them line by line
+    or in batches instead of reading the whole file into memory (a batched
+    variant is sketched below).
+  - It is good to get an overview of standard data structures and their
+    advantages and disadvantages: for example, appending an element to a list
+    is fast, but checking whether the list already contains that element is
+    slow.
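+
+  A minimal sketch of the batch idea (our illustration; the chunk size is an
+  arbitrary choice, and a word split across two chunks has to be stitched
+  back together):
+  ```python
+  import re
+
+
+  def count_unique_words_chunked(file_path: str, chunk_size: int = 65536) -> int:
+      unique_words = set()
+      leftover = ""
+      with open(file_path, "r", encoding="utf-8") as file:
+          while True:
+              chunk = file.read(chunk_size)
+              if not chunk:
+                  break
+              text = leftover + chunk
+              # A word may continue past the end of this chunk: keep any
+              # trailing word characters and prepend them to the next chunk.
+              match = re.search(r"\w+\Z", text)
+              if match:
+                  leftover = match.group()
+                  text = text[: match.start()]
+              else:
+                  leftover = ""
+              unique_words.update(re.findall(r"\b\w+\b", text.lower()))
+      if leftover:
+          unique_words.add(leftover.lower())
+      return len(unique_words)
+  ```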
+:::
+::::