diff --git a/config.yaml b/config.yaml
index 117da0c9..9886ea23 100644
--- a/config.yaml
+++ b/config.yaml
@@ -3,15 +3,15 @@
 #------------------------------------------------------------
 
 # Which carpentry is this (swc, dc, lc, or cp)?
-# swc: Software Carpentry
+# swc: Software Carpentry -
 # dc: Data Carpentry
 # lc: Library Carpentry
 # cp: Carpentries (to use for instructor training for instance)
 # incubator: The Carpentries Incubator
-carpentry: 'incubator'
+carpentry: 'swc'
 
 # Overall title for pages.
-title: 'Performance Profiling & Optimisation (Python)'
+title: 'Python Optimisation and Performance Profiling'
 
 # Date the lesson was created (YYYY-MM-DD, this is empty by default)
 created: 2024-02-01~ # FIXME
@@ -27,13 +27,13 @@ life_cycle: 'alpha'
 license: 'CC-BY 4.0'
 
 # Link to the source repository for this lesson
-source: 'https://github.com/RSE-Sheffield/pando-python'
+source: 'https://github.com/ICR-RSE-Group/carpentry-pando-python'
 
 # Default branch of your lesson
 branch: 'main'
 
 # Who to contact if there are any issues
-contact: 'robert.chisholm@sheffield.ac.uk'
+contact: 'mira.sarkis@icr.ac.uk'
 
 # Navigation ------------------------------------------------
 #
@@ -59,18 +59,17 @@ contact: 'robert.chisholm@sheffield.ac.uk'
 # Order of episodes in your lesson
 episodes:
-- profiling-introduction.md
-- profiling-functions.md
-- short-break1.md
-- profiling-lines.md
-- profiling-conclusion.md
 - optimisation-introduction.md
 - optimisation-data-structures-algorithms.md
-- long-break1.md
 - optimisation-minimise-python.md
 - optimisation-use-latest.md
 - optimisation-memory.md
 - optimisation-conclusion.md
+- long-break1.md
+- profiling-introduction.md
+- profiling-functions.md
+- profiling-lines.md
+- profiling-conclusion.md
 
 # Information for Learners
 learners:
 
@@ -91,5 +90,5 @@ profiles:
 
 # This space below is where custom yaml items (e.g. pinning
 # sandpaper and varnish versions) should live
-varnish: RSE-Sheffield/uos-varnish@main
-url: 'https://rse.shef.ac.uk/pando-python'
+#varnish: RSE-Sheffield/uos-varnish@main
+#url: 'https://icr-rse-group.github.io/carpentry-pando-python'
diff --git a/episodes/long-break1.md b/episodes/long-break1.md
index d28a91bd..2bda1dad 100644
--- a/episodes/long-break1.md
+++ b/episodes/long-break1.md
@@ -1,5 +1,5 @@
 ---
-title: Break
+title: Lunch Break
 teaching: 0
 exercises: 0
 break: 60
diff --git a/episodes/optimisation-data-structures-algorithms.md b/episodes/optimisation-data-structures-algorithms.md
index 0c1c8d35..30b1a0c4 100644
--- a/episodes/optimisation-data-structures-algorithms.md
+++ b/episodes/optimisation-data-structures-algorithms.md
@@ -65,7 +65,7 @@ CPython for example uses [`newsize + (newsize >> 3) + 6`](https://github.com/pyt
 
 This has two implications:
 
-* If you are creating large static lists, they will use upto 12.5% excess memory.
+* If you are creating large static lists, they will use up to 12.5% excess memory.
 * If you are growing a list with `append()`, there will be large amounts of redundant allocations and copies as the list grows.
 
 ### List Comprehension
@@ -165,7 +165,7 @@ To retrieve or check for the existence of a key within a hashing data structure,
 
 ### Keys
 
-Keys will typically be a core Python type such as a number or string. However multiple of these can be combined as a Tuple to form a compound key, or a custom class can be used if the methods `__hash__()` and `__eq__()` have been implemented.
+Keys will typically be a core Python type such as a number or string. However, multiple of these can be combined as a Tuple to form a compound key, or a custom class can be used if the methods `__hash__()` and `__eq__()` have been implemented.
 
 You can implement `__hash__()` by utilising the ability for Python to hash tuples, avoiding the need to implement a bespoke hash function.
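+
+As a minimal illustrative sketch (this `Point` class is not one of the lesson's examples), both methods can simply delegate to a tuple of the object's fields:
+
+```python
+class Point:
+    def __init__(self, x, y):
+        self.x = x
+        self.y = y
+
+    def __hash__(self):
+        # Delegate to Python's built-in hashing of tuples.
+        return hash((self.x, self.y))
+
+    def __eq__(self, other):
+        return isinstance(other, Point) and (self.x, self.y) == (other.x, other.y)
+
+# Instances can now be used as set members or dictionary keys.
+visited = {Point(0, 0), Point(1, 2)}
+print(Point(1, 2) in visited)  # True
+```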
@@ -265,7 +265,7 @@ Constructing a set with a loop and `add()` (equivalent to a list's `append()`) c
 
 The naive list approach is 2200x times slower than the fastest approach, because of how many times the list is searched. This gap will only grow as the number of items increases.
 
-Sorting the input list reduces the cost of searching the output list significantly, however it is still 8x slower than the fastest approach. In part because around half of it's runtime is now spent sorting the list.
+Sorting the input list reduces the cost of searching the output list significantly, however it is still 8x slower than the fastest approach, in part because around half of its runtime is now spent sorting the list.
 
 ```output
 uniqueSet: 0.30ms
 uniqueSetAdd: 0.81ms
 uniqueList: 660.71ms
 uniqueListSort: 2.67ms
 ```
 
@@ -280,9 +280,9 @@
 
 Independent of the performance to construct a unique set (as covered in the previous section), it's worth identifying the performance to search the data-structure to retrieve an item or check whether it exists.
 
-The performance of a hashing data structure is subject to the load factor and number of collisions. An item that hashes with no collision can be checked almost directly, whereas one with collisions will probe until it finds the correct item or an empty slot. In the worst possible case, whereby all insert items have collided this would mean checking every single item. In practice, hashing data-structures are designed to minimise the chances of this happening and most items should be found or identified as missing with a single access.
+The performance of a hashing data structure is subject to the load factor and number of collisions. An item that hashes with no collision can be checked almost directly, whereas one with collisions will probe until it finds the correct item or an empty slot. In the worst possible case, whereby all inserted items have collided, this would mean checking every single item. In practice, hashing data-structures are designed to minimise the chances of this happening and most items should be found or identified as missing with a single access.
 
-In contrast if searching a list or array, the default approach is to start at the first item and check all subsequent items until the correct item has been found. If the correct item is not present, this will require the entire list to be checked. Therefore the worst-case is similar to that of the hashing data-structure, however it is guaranteed in cases where the item is missing. Similarly, on-average we would expect an item to be found half way through the list, meaning that an average search will require checking half of the items.
+In contrast, if searching a list or array, the default approach is to start at the first item and check all subsequent items until the correct item has been found. If the correct item is not present, this will require the entire list to be checked. Therefore, the worst-case is similar to that of the hashing data-structure, however it is guaranteed in cases where the item is missing. Similarly, on-average we would expect an item to be found halfway through the list, meaning that an average search will require checking half of the items.
 
 If however the list or array is sorted, a binary search can be used. A binary search divides the list in half and checks which half the target item would be found in, this continues recursively until the search is exhausted whereby the item should be found or dismissed. This is significantly faster than performing a linear search of the list, checking a total of `log N` items every time.
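+
+As an illustrative sketch of the idea (separate from the timed comparison below), the standard library's `bisect` module performs the halving for you on a sorted list:
+
+```python
+from bisect import bisect_left
+
+sorted_items = [2, 5, 8, 12, 16, 23, 38, 56, 72, 91]
+target = 23
+
+# bisect_left returns the index at which target would be inserted;
+# the item is present only if that position holds an equal value.
+i = bisect_left(sorted_items, target)
+print(i < len(sorted_items) and sorted_items[i] == target)  # True
+```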
@@ -333,9 +333,7 @@ print(f"linear_search_list: {timeit(linear_search_list, number=repeats)-gen_time
 print(f"binary_search_list: {timeit(binary_search_list, number=repeats)-gen_time:.2f}ms")
 ```
 
-Searching the set is fastest performing 25,000 searches in 0.04ms.
-This is followed by the binary search of the (sorted) list which is 145x slower, although the list has been filtered for duplicates. A list still containing duplicates would be longer, leading to a more expensive search.
-The linear search of the list is more than 56,600x slower than the fastest, it really shouldn't be used!
+Searching the set is the fastest, performing 25,000 searches in 0.04ms. This is followed by the binary search of the (sorted) list, which is 145x slower, although the list has been filtered for duplicates. A list still containing duplicates would be longer, leading to a more expensive search. The linear search of the list is more than 56,600x slower than searching the set; it really shouldn't be used!
 
 ```output
 search_set: 0.04ms
diff --git a/episodes/optimisation-introduction.md b/episodes/optimisation-introduction.md
index 1650962a..5f9bafec 100644
--- a/episodes/optimisation-introduction.md
+++ b/episodes/optimisation-introduction.md
@@ -18,57 +18,41 @@ exercises: 0
 
 ## Introduction
 
-
-Now that you're able to find the most expensive components of your code with profiling, it becomes time to learn how to identify whether that expense is reasonable.
-
+
+Think about optimisation as the first step on your journey to writing high-performance code.
+It's like a race: the faster you can go without taking unnecessary detours, the better.
+Code optimisation is all about understanding the principles of efficiency in Python and being conscious of how small changes can yield massive improvements.
+
 
-In order to optimise code for performance, it is necessary to have an understanding of what a computer is doing to execute it.
+These are the first steps in code optimisation: making better choices as you write your code and understanding what a computer is doing to execute it.
 
-Even a high-level understanding of how you code executes, such as how Python and the most common data-structures and algorithms are implemented, can help you to identify suboptimal approaches when programming. If you have learned to write code informally out of necessity, to get something to work, it's not uncommon to have collected some bad habits along the way.
+A high-level understanding of how your code executes, such as how Python and the most common data-structures and algorithms are implemented, can help you identify suboptimal approaches when programming. If you have learned to write code informally out of necessity, to get something to work, it's not uncommon to have collected some bad habits along the way.
 
 The remaining content is often abstract knowledge, that is transferable to the vast majority of programming languages. This is because the hardware architecture, data-structures and algorithms used are common to many languages and they hold some of the greatest influence over performance bottlenecks.
 
-## Premature Optimisation
-
-> Programmers waste enormous amounts of time thinking about, or worrying about, the speed of noncritical parts of their programs, and these attempts at efficiency actually have a strong negative impact when debugging and maintenance are considered. We should forget about small efficiencies, say about 97% of the time: **premature optimization is the root of all evil**. Yet we should not pass up our opportunities in that critical 3%. - Donald Knuth
-
-This classic quote among computer scientists states; when considering optimisation it is important to focus on the potential impact, both to the performance and maintainability of the code.
-
-Profiling is a valuable tool in this cause. Should effort be expended to optimise a component which occupies 1% of the runtime? Or would that time be better spent focusing on the most expensive components?
-
-Advanced optimisations, mostly outside the scope of this course, can increase the cost of maintenance by obfuscating what code is doing. Even if you are a solo-developer working on private code, your future self should be able to easily comprehend your implementation.
+## Optimising Code from Scratch: Trade-off Between Performance and Maintainability
 
-Therefore, the balance between the impact to both performance and maintainability should be considered when optimising code.
+> Programmers waste enormous amounts of time thinking about, or worrying about, the speed of noncritical parts of their programs, and these attempts at efficiency actually have a strong negative impact when debugging and maintenance are considered. We should forget about small efficiencies, say about 97% of the time: **premature optimisation is the root of all evil**. Yet we should not pass up our opportunities in that critical 3%. - Donald Knuth
 
-This is not to say, don't consider performance when first writing code. The selection of appropriate algorithms and data-structures covered in this course form good practice, simply don't fret over a need to micro-optimise every small component of the code that you write.
+This classic quote among computer scientists states that, when considering optimisation, it is important to focus on the potential impact, both to the performance and maintainability of the code. Advanced optimisations, mostly outside the scope of this course, can increase the cost of maintenance by obfuscating what code is doing. Even if you are a solo-developer working on private code, your future self should be able to easily comprehend your implementation. Therefore, the balance between the impact to both performance and maintainability should be considered when optimising code.
+This is not to say you shouldn't consider performance when first writing code. The selection of appropriate algorithms and data-structures covered in this course forms good practice; simply don't fret over a need to micro-optimise every small component of the code that you write.
 
-## Ensuring Reproducible Results
+## Ensuring Reproducible Results when Optimising Existing Code
 
-When optimising your code, you are making speculative changes. It's easy to make mistakes, many of which can be subtle. Therefore, it's important to have a strategy in place to check that the outputs remain correct.
+When optimising existing code, you are making speculative changes. It's easy to make mistakes, many of which can be subtle. Therefore, it's important to have a strategy in place to check that the outputs remain correct.
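+
+A minimal sketch of one such strategy (the functions here are hypothetical, purely for illustration): keep the original implementation around while optimising, and assert that both versions agree on representative inputs.
+
+```python
+def mean_original(values):
+    # The trusted, unoptimised implementation.
+    total = 0
+    for v in values:
+        total += v
+    return total / len(values)
+
+def mean_optimised(values):
+    # The speculative, optimised rewrite.
+    return sum(values) / len(values)
+
+def test_optimised_matches_original():
+    data = [1, 2, 3, 4, 5]
+    assert mean_optimised(data) == mean_original(data)
+```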
 
-Testing is hopefully already a seamless part of your research software development process.
-Test can be used to clarify how your software should perform, ensuring that new features work as intended and protecting against unintended changes to old functionality.
-
-There are a plethora of methods for testing code.
+Testing is hopefully already a seamless part of your research software development process. Tests can be used to clarify how your software should perform, ensuring that new features work as intended and protecting against unintended changes to old functionality.
 
 ## pytest Overview
 
-Most Python developers use the testing package [pytest](https://docs.pytest.org/en/latest/), it's a great place to get started if you're new to testing code.
+There are a plethora of methods for testing code. Most Python developers use the testing package [pytest](https://docs.pytest.org/en/latest/); it's a great place to get started if you're new to testing code. Tests should be created within a project's testing directory, by creating files named with the form `test_*.py` or `*_test.py`. pytest looks for these files when running the test suite. Within the created test file, any functions named in the form `test*` are considered tests that will be executed by pytest. The `assert` keyword is used to test whether a condition evaluates to `True`. Here's a quick example of how a test can be used to check your function's output against an expected value.
 
-Tests should be created within a project's testing directory, by creating files named with the form `test_*.py` or `*_test.py`.
-
-pytest looks for these files, when running the test suite.
-
-Within the created test file, any functions named in the form `test*` are considered tests that will be executed by pytest.
-
-The `assert` keyword is used, to test whether a condition evaluates to `True`.
-
 ```python
 # file: test_demonstration.py
diff --git a/episodes/optimisation-minimise-python.md b/episodes/optimisation-minimise-python.md
index 0e2666f6..5ab4e983 100644
--- a/episodes/optimisation-minimise-python.md
+++ b/episodes/optimisation-minimise-python.md
@@ -20,7 +20,7 @@ exercises: 0
 
 ::::::::::::::::::::::::::::::::::::::::::::::::
 
-Python is an interpreted programming language. When you execute your `.py` file, the (default) CPython back-end compiles your Python source code to an intermediate bytecode. This bytecode is then interpreted in software at runtime generating instructions for the processor as necessary. This interpretation stage, and other features of the language, harm the performance of Python (whilst improving it's usability).
+Python is an interpreted programming language. When you execute your `.py` file, the (default) CPython back-end compiles your Python source code to an intermediate bytecode. This bytecode is then interpreted in software at runtime, generating instructions for the processor as necessary. This interpretation stage, and other features of the language, harm the performance of Python (whilst improving its usability).
 
 In comparison, many languages such as C/C++ compile directly to machine code. This allows the compiler to perform low-level optimisations that better exploit hardware nuance to achieve fast performance. This however comes at the cost of compiled software not being cross-platform.
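+
+You can see this intermediate bytecode for yourself with the standard library's `dis` module (a quick illustrative sketch, not one of the lesson's timed examples):
+
+```python
+import dis
+
+def add(a, b):
+    return a + b
+
+# Print the bytecode instructions CPython will interpret for add().
+dis.dis(add)
+```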
@@ -28,7 +28,7 @@ Whilst Python will rarely be as fast as compiled languages like C/C++, it is pos
 
 A simple example of this would be to perform a linear search of a list (in the previous episode we did say this is not recommended). The below example creates a list of 2500 integers in the inclusive-exclusive range `[0, 5000)`.
 
-It then searches for all of the even numbers in that range.
+It then searches for all the even numbers in that range.
 
 `searchlistPython()` is implemented manually, iterating `ls` checking each individual item in Python code. `searchListC()` in contrast uses the `in` operator to perform each search, which allows CPython to implement the inner loop in it's C back-end.
 
@@ -281,7 +281,7 @@ In particular, those which are passed an `iterable` (e.g. lists) are likely to p
 
 ::::::::::::::::::::::::::::::::::::: callout
 
-The built-in functions [`filter()`](https://docs.python.org/3/library/functions.html#filter) and [`map()`](https://docs.python.org/3/library/functions.html#map) can be used for processing iterables However list-comprehension is likely to be more performant.
+The built-in functions [`filter()`](https://docs.python.org/3/library/functions.html#filter) and [`map()`](https://docs.python.org/3/library/functions.html#map) can be used for processing iterables. However, list-comprehension is likely to be more performant.
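+
+As a rough illustrative sketch (timings vary by machine and Python version; none of these names come from the lesson's examples), the two approaches can be compared with `timeit`:
+
+```python
+from timeit import timeit
+
+ls = list(range(10000))
+
+def with_map():
+    return list(map(lambda x: x * 2, ls))
+
+def with_comprehension():
+    return [x * 2 for x in ls]
+
+repeats = 1000
+# timeit returns seconds; multiply by 1000 to report milliseconds.
+print(f"map: {timeit(with_map, number=repeats) * 1000:.2f}ms")
+print(f"comprehension: {timeit(with_comprehension, number=repeats) * 1000:.2f}ms")
+```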
@@ -292,11 +292,11 @@ The built-in functions [`filter()`](https://docs.python.org/3/library/functions.
 
 [NumPy](https://numpy.org/) is a commonly used package for scientific computing, which provides a wide variety of methods.
 
-It adds restriction via it's own [basic numeric types](https://numpy.org/doc/stable/user/basics.types.html), and static arrays to enable even greater performance than that of core Python. However if these restrictions are ignored, the performance can become significantly worse.
+It adds restriction via its own [basic numeric types](https://numpy.org/doc/stable/user/basics.types.html), and static arrays to enable even greater performance than that of core Python. However, if these restrictions are ignored, the performance can become significantly worse.
 
 ### Arrays
 
-NumPy's arrays (not to be confused with the core Python `array` package) are static arrays. Unlike core Python's lists, they do not dynamically resize. Therefore if you wish to append to a NumPy array, you must call `resize()` first. If you treat this like `append()` for a Python list, resizing for each individual append you will be performing significantly more copies and memory allocations than a Python list.
+NumPy's arrays (not to be confused with the core Python `array` package) are static arrays. Unlike core Python's lists, they do not dynamically resize. Therefore, if you wish to append to a NumPy array, you must call `resize()` first. If you treat this like `append()` for a Python list, resizing for each individual append, you will be performing significantly more copies and memory allocations than a Python list.
 
 The below example sees lists and arrays constructed from `range(100000)`.
 
@@ -390,7 +390,7 @@ There is however a trade-off, using `numpy.random.choice()` can be clearer to so
 
 ### Vectorisation
 
-The manner by which NumPy stores data in arrays enables it's functions to utilise vectorisation, whereby the processor executes one instruction across multiple variables simultaneously, for every mathematical operation between arrays.
+The manner by which NumPy stores data in arrays enables its functions to utilise vectorisation, whereby the processor executes one instruction across multiple variables simultaneously, for every mathematical operation between arrays.
 
 Earlier in this episode it was demonstrated that using core Python methods over a list, will outperform a loop performing the same calculation faster. The below example takes this a step further by demonstrating the calculation of dot product.
 
@@ -416,11 +416,6 @@ print(f"numpy_sum_array: {timeit(np_sum_ar, setup=gen_array, number=repeats):.2f
 print(f"numpy_dot_array: {timeit(np_dot_ar, setup=gen_array, number=repeats):.2f}ms")
 ```
 
-* `python_sum_list` uses list comprehension to perform the multiplication, followed by the Python core `sum()`. This comes out at 46.93ms
-* `python_sum_array` instead directly multiplies the two arrays, taking advantage of NumPy's vectorisation. But uses the core Python `sum()`, this comes in slightly faster at 33.26ms.
-* `numpy_sum_array` again takes advantage of NumPy's vectorisation for the multiplication, and additionally uses NumPy's `sum()` implementation. These two rounds of vectorisation provide a much faster 1.44ms completion.
-* `numpy_dot_array` instead uses NumPy's `dot()` to calculate the dot product in a single operation. This comes out the fastest at 0.29ms, 162x faster than `python_sum_list`.
-
 ```output
 python_sum_list: 46.93ms
 python_sum_array: 33.26ms
@@ -428,6 +423,11 @@ numpy_sum_array: 1.44ms
 numpy_dot_array: 0.29ms
 ```
 
+* `python_sum_list` uses list comprehension to perform the multiplication, followed by the Python core `sum()`. This comes out at 46.93ms.
+* `python_sum_array` instead directly multiplies the two arrays, taking advantage of NumPy's vectorisation, but uses the core Python `sum()`; this comes in slightly faster at 33.26ms.
+* `numpy_sum_array` again takes advantage of NumPy's vectorisation for the multiplication, and additionally uses NumPy's `sum()` implementation. These two rounds of vectorisation provide a much faster 1.44ms completion.
+* `numpy_dot_array` instead uses NumPy's `dot()` to calculate the dot product in a single operation. This comes out the fastest at 0.29ms, 162x faster than `python_sum_list`.
+
 ::::::::::::::::::::::::::::::::::::: callout
 
 ## Parallel NumPy
 
 A small number of functions are backed by BLAS and LAPACK, enabling even greater parallelisation.
 The [supported functions](https://numpy.org/doc/stable/reference/routines.linalg.html) mostly correspond to linear algebra operations.
 
-The auto-parallelisation of these functions is hardware dependant, so you won't always automatically get the additional benefit of parallelisation.
+The auto-parallelisation of these functions is hardware-dependent, so you won't always automatically get the additional benefit of parallelisation.
 However, HPC systems should be primed to take advantage, so try increasing the number of cores you request when submitting your jobs and see if it improves the performance.
 
 *This might be why `numpy_dot_array` is that much faster than `numpy_sum_array` in the previous example!*
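+
+To check what your installation can do, you can inspect NumPy's build configuration (an illustrative sketch; the output depends entirely on how NumPy was built):
+
+```python
+import numpy as np
+
+# Reports which BLAS/LAPACK libraries NumPy was built against;
+# backends such as OpenBLAS or MKL can execute multi-threaded.
+np.show_config()
+```
+
+For these backends the thread count is typically controlled by environment variables such as `OMP_NUM_THREADS`.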
@@ -449,7 +449,7 @@
 ### `vectorize()`
 
 Python's `map()` was introduced earlier, for applying a function to all elements within a list.
-NumPy provides `vectorize()` an equivalent for operating over it's arrays.
+NumPy provides `vectorize()`, an equivalent for operating over its arrays.
 
 This doesn't actually make use of processor-level vectorisation, from the [documentation](https://numpy.org/doc/stable/reference/generated/numpy.vectorize.html):
 
@@ -497,7 +497,7 @@ Pandas' methods by default operate on columns. Each column or series can be thou
 
 Following the theme of this episode, iterating over the rows of a data frame using a `for` loop is not advised. The pythonic iteration will be slower than other approaches.
 
-Pandas allows it's own methods to be applied to rows in many cases by passing `axis=1`, where available these functions should be preferred over manual loops. Where you can't find a suitable method, `apply()` can be used, which is similar to `map()`/`vectorize()`, to apply your own function to rows.
+Pandas allows its own methods to be applied to rows in many cases by passing `axis=1`; where available, these functions should be preferred over manual loops. Where you can't find a suitable method, `apply()` can be used, which is similar to `map()`/`vectorize()`, to apply your own function to rows.
 
 ```python
 from timeit import timeit
@@ -571,7 +571,7 @@ vectorize: 1.48ms
 
 It won't always be possible to take full advantage of vectorisation, for example you may have conditional logic.
 
-An alternate approach is converting your dataframe to a Python dictionary using `to_dict(orient='index')`. This creates a nested dictionary, where each row of the outer dictionary is an internal dictionary. This can then be processed via list-comprehension:
+An alternate approach is converting your DataFrame to a Python dictionary using `to_dict(orient='index')`. This creates a nested dictionary, where each row of the outer dictionary is an internal dictionary. This can then be processed via list-comprehension:
 
 ```python
 def to_dict():
@@ -588,7 +588,7 @@ Whilst still nearly 100x slower than pure vectorisation, it's twice as fast as `
 
 to_dict: 131.15ms
 ```
 
-This is because indexing into Pandas' `Series` (rows) is significantly slower than a Python dictionary. There is a slight overhead to creating the dictionary (40ms in this example), however the stark difference in access speed is more than enough to overcome that cost for any large dataframe.
+This is because indexing into Pandas' `Series` (rows) is significantly slower than a Python dictionary. There is a slight overhead to creating the dictionary (40ms in this example), however the stark difference in access speed is more than enough to overcome that cost for any large DataFrame.
 
 ```python
 from timeit import timeit
diff --git a/episodes/profiling-functions.md b/episodes/profiling-functions.md
index 54ba38b6..9e99c584 100644
--- a/episodes/profiling-functions.md
+++ b/episodes/profiling-functions.md
@@ -46,7 +46,7 @@ As a stack it is a last-in first-out (LIFO) data structure.
 
 ![A diagram of a call stack](fig/stack.png){alt="A greyscale diagram showing a (call)stack, containing 5 stack frame. Two additional stack frames are shown outside the stack, one is marked as entering the call stack with an arrow labelled push and the other is marked as exiting the call stack labelled pop."}
 
-When a function is called, a frame to track it's variables and metadata is pushed to the call stack.
+When a function is called, a frame to track its variables and metadata is pushed to the call stack.
 When that same function finishes and returns, it is popped from the stack and variables local to the function are dropped.
 
 If you've ever seen a stack overflow error, this refers to the call stack becoming too large.
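+
+As a quick illustrative sketch (not one of the lesson's examples), unbounded recursion keeps pushing frames until Python gives up with a `RecursionError`:
+
+```python
+def recurse(depth=0):
+    # Every call pushes another frame onto the call stack.
+    return recurse(depth + 1)
+
+try:
+    recurse()
+except RecursionError as e:
+    print(e)  # maximum recursion depth exceeded
+```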
@@ -88,7 +88,7 @@ Hence, this prints the following call stack:
 traceback.print_stack()
 ```
 
-The first line states the file and line number where `a()` was called from (the last line of code in the file shown). The second line states that it was the function `a()` that was called, this could include it's arguments. The third line then repeats this pattern, stating the line number where `b2()` was called inside `a()`. This continues until the call to `traceback.print_stack()` is reached.
+The first line states the file and line number where `a()` was called from (the last line of code in the file shown). The second line states that it was the function `a()` that was called; this could include its arguments. The third line then repeats this pattern, stating the line number where `b2()` was called inside `a()`. This continues until the call to `traceback.print_stack()` is reached.
 
 You may see stack traces like this when an unhandled exception is thrown by your code.
 
@@ -102,7 +102,7 @@ You may see stack traces like this when an unhandled exception is thrown by your
 
 [`cProfile`](https://docs.python.org/3/library/profile.html#instant-user-s-manual) is a function-level profiler provided as part of the Python standard library.
 
-It can be called directly within your Python code as an imported package, however it's easier to use it's script interface:
+It can be called directly within your Python code as an imported package, however it's easier to use its script interface:
 
 ```sh
 python -m cProfile -o