diff --git a/config.yaml b/config.yaml index 117da0c9..76397e97 100644 --- a/config.yaml +++ b/config.yaml @@ -3,15 +3,15 @@ #------------------------------------------------------------ # Which carpentry is this (swc, dc, lc, or cp)? -# swc: Software Carpentry +# swc: Software Carpentry - # dc: Data Carpentry # lc: Library Carpentry # cp: Carpentries (to use for instructor training for instance) # incubator: The Carpentries Incubator -carpentry: 'incubator' +carpentry: 'swc' # Overall title for pages. -title: 'Performance Profiling & Optimisation (Python)' +title: 'Python Optimisation and Performance Profiling' # Date the lesson was created (YYYY-MM-DD, this is empty by default) created: 2024-02-01~ # FIXME @@ -27,13 +27,13 @@ life_cycle: 'alpha' license: 'CC-BY 4.0' # Link to the source repository for this lesson -source: 'https://github.com/RSE-Sheffield/pando-python' +source: 'https://github.com/ICR-RSE-Group/carpentry-pando-python' # Default branch of your lesson branch: 'main' # Who to contact if there are any issues -contact: 'robert.chisholm@sheffield.ac.uk' +contact: 'mira.sarkis@icr.ac.uk' # Navigation ------------------------------------------------ # @@ -59,23 +59,21 @@ contact: 'robert.chisholm@sheffield.ac.uk' # Order of episodes in your lesson episodes: -- profiling-introduction.md -- profiling-functions.md -- short-break1.md -- profiling-lines.md -- profiling-conclusion.md - optimisation-introduction.md - optimisation-data-structures-algorithms.md -- long-break1.md - optimisation-minimise-python.md - optimisation-use-latest.md - optimisation-memory.md - optimisation-conclusion.md +- long-break1.md +- profiling-introduction.md +- profiling-functions.md +- profiling-lines.md +- profiling-conclusion.md # Information for Learners learners: - setup.md -- registration.md - acknowledgements.md - ppp.md - reference.md @@ -91,5 +89,5 @@ profiles: # This space below is where custom yaml items (e.g. pinning # sandpaper and varnish versions) should live -varnish: RSE-Sheffield/uos-varnish@main -url: 'https://rse.shef.ac.uk/pando-python' +#varnish: RSE-Sheffield/uos-varnish@main +#url: 'https://icr-rse-group.github.io/carpentry-pando-python' diff --git a/episodes/fig/python_lists.png b/episodes/fig/python_lists.png new file mode 100644 index 00000000..89462a40 Binary files /dev/null and b/episodes/fig/python_lists.png differ diff --git a/episodes/long-break1.md b/episodes/long-break1.md index d28a91bd..2bda1dad 100644 --- a/episodes/long-break1.md +++ b/episodes/long-break1.md @@ -1,5 +1,5 @@ --- -title: Break +title: Lunch Break teaching: 0 exercises: 0 break: 60 diff --git a/episodes/optimisation-data-structures-algorithms.md b/episodes/optimisation-data-structures-algorithms.md index 0c1c8d35..820434c4 100644 --- a/episodes/optimisation-data-structures-algorithms.md +++ b/episodes/optimisation-data-structures-algorithms.md @@ -63,9 +63,12 @@ CPython for example uses [`newsize + (newsize >> 3) + 6`](https://github.com/pyt ![The relationship between the number of appends to an empty list, and the number of internal resizes in CPython.](episodes/fig/cpython_list_allocations.png){alt='A line graph displaying the relationship between the number of calls to append() and the number of internal resizes of a CPython list. 
It has a logarithmic relationship, at 1 million appends there have been 84 internal resizes.'} +![Visual note on resizing behaviour of Python lists.](episodes/fig/python_lists.png){alt='Small cheat note for better visualization of Python lists.'} + + This has two implications: -* If you are creating large static lists, they will use upto 12.5% excess memory. +* If you are creating large static lists, they will use up to 12.5% excess memory. * If you are growing a list with `append()`, there will be large amounts of redundant allocations and copies as the list grows. ### List Comprehension @@ -151,21 +154,23 @@ Since Python 3.6, the items within a dictionary will iterate in the order that t ### Hashing Data Structures -Python's dictionaries are implemented as hashing data structures. -Within a hashing data structure each inserted key is hashed to produce a (hopefully unique) integer key. -The dictionary is pre-allocated to a default size, and the key is assigned the index within the dictionary equivalent to the hash modulo the length of the dictionary. -If that index doesn't already contain another key, the key (and any associated values) can be inserted. -When the index isn't free, a collision strategy is applied. CPython's [dictionary](https://github.com/python/cpython/blob/main/Objects/dictobject.c) and [set](https://github.com/python/cpython/blob/main/Objects/setobject.c) both use a form of open addressing whereby a hash is mutated and corresponding indices probed until a free one is located. -When the hashing data structure exceeds a given load factor (e.g. 2/3 of indices have been assigned keys), the internal storage must grow. This process requires every item to be re-inserted which can be expensive, but reduces the average probes for a key to be found. +Python's dictionaries are implemented using hashing as their underlying data structure. In this structure, each key is hashed to generate a (preferably unique) integer, which serves as the basis for indexing. Dictionaries are initialized with a default size, and the hash value of a key, modulo the dictionary's length, determines its initial index. If this index is available, the key and its associated value are stored there. If the index is already occupied, a collision occurs, and a resolution strategy is applied to find an alternate index. -![An visual explanation of linear probing, CPython uses an advanced form of this.](episodes/fig/hash_linear_probing.png){alt="A diagram demonstrating how the keys (hashes) 37, 64, 14, 94, 67 are inserted into a hash table with 11 indices. This is followed by the insertion of 59, 80 and 39 which require linear probing to be inserted due to collisions."} +In CPython's [dictionary](https://github.com/python/cpython/blob/main/Objects/dictobject.c) and [set](https://github.com/python/cpython/blob/main/Objects/setobject.c)implementations, a technique called open addressing is employed. This approach modifies the hash and probes subsequent indices until an empty one is found. -To retrieve or check for the existence of a key within a hashing data structure, the key is hashed again and a process equivalent to insertion is repeated. However, now the key at each index is checked for equality with the one provided. If any empty index is found before an equivalent key, then the key must not be present in the ata structure. +When a dictionary or hash table in Python grows, the underlying storage is resized, which necessitates re-inserting every existing item into the new structure. 
This process can be computationally expensive but is essential for maintaining efficient average probe times when searching for keys. +![A visual explanation of linear probing, CPython uses an advanced form of this.](episodes/fig/hash_linear_probing.png){alt="A diagram showing how keys (hashes) 37, 64, 14, 94, 67 are inserted into a hash table with 11 indices. The insertion of 59, 80, and 39 demonstrates linear probing to resolve collisions."} +To look up or verify the existence of a key in a hashing data structure, the key is re-hashed, and the process mirrors that of insertion. The corresponding index is probed to see if it contains the provided key. If the key at the index matches, the operation succeeds. If an empty index is reached before finding the key, it indicates that the key does not exist in the structure. +The above diagrams shows a hash table of 5 elements within a block of 11 slots: +1. We try to add element k=59. Based on its hash, the intended position is p=4. However, slot 4 is already occupied by the element k=37. This results in a collision. +2. To resolve the collision, the linear probing mechanism is employed. The algorithm checks the next available slot, starting from position p=4. The first available slot is found at position 5. +3. The number of jumps (or steps) it took to find the available slot are represented by i=1 (since we moved from position 4 to 5). +In this case, the number of jumps i=1 indicates that the algorithm had to probe one slot to find an empty position at index 5. ### Keys -Keys will typically be a core Python type such as a number or string. However multiple of these can be combined as a Tuple to form a compound key, or a custom class can be used if the methods `__hash__()` and `__eq__()` have been implemented. +Keys will typically be a core Python type such as a number or string. However, multiple of these can be combined as a Tuple to form a compound key, or a custom class can be used if the methods `__hash__()` and `__eq__()` have been implemented. You can implement `__hash__()` by utilising the ability for Python to hash tuples, avoiding the need to implement a bespoke hash function. @@ -265,7 +270,7 @@ Constructing a set with a loop and `add()` (equivalent to a list's `append()`) c The naive list approach is 2200x times slower than the fastest approach, because of how many times the list is searched. This gap will only grow as the number of items increases. -Sorting the input list reduces the cost of searching the output list significantly, however it is still 8x slower than the fastest approach. In part because around half of it's runtime is now spent sorting the list. +Sorting the input list reduces the cost of searching the output list significantly, however it is still 8x slower than the fastest approach. In part because around half of its runtime is now spent sorting the list. ```output uniqueSet: 0.30ms @@ -280,9 +285,9 @@ uniqueListSort: 2.67ms Independent of the performance to construct a unique set (as covered in the previous section), it's worth identifying the performance to search the data-structure to retrieve an item or check whether it exists. -The performance of a hashing data structure is subject to the load factor and number of collisions. An item that hashes with no collision can be checked almost directly, whereas one with collisions will probe until it finds the correct item or an empty slot. In the worst possible case, whereby all insert items have collided this would mean checking every single item. 
In practice, hashing data-structures are designed to minimise the chances of this happening and most items should be found or identified as missing with a single access. +The performance of a hashing data structure is subject to the load factor and number of collisions. An item that hashes with no collision can be checked almost directly, whereas one with collisions will probe until it finds the correct item or an empty slot. In the worst possible case, whereby all insert items have collided this would mean checking every single item. In practice, hashing data-structures are designed to minimise the chances of this happening and most items should be found or identified as missing with single access, result in an average time complexity of a constant (which is very good!). -In contrast if searching a list or array, the default approach is to start at the first item and check all subsequent items until the correct item has been found. If the correct item is not present, this will require the entire list to be checked. Therefore the worst-case is similar to that of the hashing data-structure, however it is guaranteed in cases where the item is missing. Similarly, on-average we would expect an item to be found half way through the list, meaning that an average search will require checking half of the items. +In contrast, if searching a list or array, the default approach is to start at the first item and check all subsequent items until the correct item has been found. If the correct item is not present, this will require the entire list to be checked. Therefore, the worst-case is similar to that of the hashing data-structure, however it is guaranteed in cases where the item is missing. Similarly, on-average we would expect an item to be found halfway through the list, meaning that an average search will require checking half of the items. If however the list or array is sorted, a binary search can be used. A binary search divides the list in half and checks which half the target item would be found in, this continues recursively until the search is exhausted whereby the item should be found or dismissed. This is significantly faster than performing a linear search of the list, checking a total of `log N` items every time. @@ -333,9 +338,7 @@ print(f"linear_search_list: {timeit(linear_search_list, number=repeats)-gen_time print(f"binary_search_list: {timeit(binary_search_list, number=repeats)-gen_time:.2f}ms") ``` -Searching the set is fastest performing 25,000 searches in 0.04ms. -This is followed by the binary search of the (sorted) list which is 145x slower, although the list has been filtered for duplicates. A list still containing duplicates would be longer, leading to a more expensive search. -The linear search of the list is more than 56,600x slower than the fastest, it really shouldn't be used! +Searching the set is the fastest, performing 25,000 searches in 0.04ms. This is followed by the binary search of the (sorted) list which is 145x slower, although the list has been filtered for duplicates. A list still containing duplicates would be longer, leading to a more expensive search. The linear search of the list is more than 56,600x slower than searching the set, it really shouldn't be used! ```output search_set: 0.04ms @@ -345,6 +348,20 @@ binary_search_list: 5.79ms These results are subject to change based on the number of items and the proportion of searched items that exist within the list. However, the pattern is likely to remain the same. Linear searches should be avoided! 
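+
+If you want to experiment with the binary-search approach yourself, the standard-library [`bisect`](https://docs.python.org/3/library/bisect.html) module provides the building block. The short sketch below is illustrative only (the values are made up and it is not part of the benchmark code above): it performs the membership test on a sorted, deduplicated list.
+
+```python
+from bisect import bisect_left
+
+sorted_items = sorted({4, 8, 15, 16, 23, 42})  # a deduplicated, sorted list
+
+def contains(sorted_list, key):
+    # Binary search via bisect: O(log N) checks rather than O(N) for a linear scan.
+    pos = bisect_left(sorted_list, key)
+    return pos < len(sorted_list) and sorted_list[pos] == key
+
+print(contains(sorted_items, 15))  # True
+print(contains(sorted_items, 19))  # False
+```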
+::::::::::::::::::::::::::::::::::::: callout + +Dictionaries are designed to handle insertions efficiently, with average-case O(1) time complexity per insertion for a small size dict, but it is clearly problematic for large size dict. In this case, it is better to find an alternative Data Structure for example List, NumPy Array or Pandas DataFrame. The table below summarizes the best uses and performance characteristics of each data structure: + +| Data Structure | Small Size Insertion (O(1)) | Large Size Insertion | Search Performance (O(1)) | Best For | +|------------------|-----------------------------------|------------------------------------------|---------------------------|--------------------------------------------------------------------------| +| Dictionary | ✅ | ⚠️ Occasional O(n) (due to resizing) | ✅ O(1) (Hashing) | Fast insertions and lookups, key-value storage, small to medium data | +| List | ✅ Amortized (O(1) Append) | ✅ Efficient (Amortized O(1)) | ❌ O(n) (Linear Search) | Dynamic appends, ordered data storage, general-purpose use | +| Set | ✅ Average O(1) | ⚠️ Occasional O(n) (due to resizing) | ✅ O(1) (Hashing) | Membership testing, unique elements, small to medium datasets | +| NumPy Array | ❌ (Fixed Size) | ⚠️ Costly (O(n) when resizing) | ❌ O(n) (Linear Search) | Numerical computations, fixed-size data, vectorized operations | +| Pandas DataFrame | ❌ (if adding rows) | ⚠️ Efficient (Column-wise) | ❌ O(n) (Linear Search) | Column-wise analytics, tabular data, large datasets | +NumPy and Pandas, which we have not yet covered, are powerful libraries designed for handling large matrices and arrays. They are implemented in C to optimize performance, making them ideal for numerical computations and data analysis tasks. + +::::::::::::::::::::::::::::::::::::::::::::: ::::::::::::::::::::::::::::::::::::: keypoints diff --git a/episodes/optimisation-introduction.md b/episodes/optimisation-introduction.md index 1650962a..a5c55aba 100644 --- a/episodes/optimisation-introduction.md +++ b/episodes/optimisation-introduction.md @@ -18,57 +18,41 @@ exercises: 0 ## Introduction - -Now that you're able to find the most expensive components of your code with profiling, it becomes time to learn how to identify whether that expense is reasonable. - + +Think about optimisation as the first step on your journey to writing high-performance code. +It’s like a race: the faster you can go without taking unnecessary detours, the better. +Code optmisation is all about understanding the principles of efficiency in Python and being conscious of how small changes can yield massive improvements. + -In order to optimise code for performance, it is necessary to have an understanding of what a computer is doing to execute it. +These are the first steps in code optimisation: making better choices as you write your code and have an understanding of what a computer is doing to execute it. -Even a high-level understanding of how you code executes, such as how Python and the most common data-structures and algorithms are implemented, can help you to identify suboptimal approaches when programming. If you have learned to write code informally out of necessity, to get something to work, it's not uncommon to have collected some bad habits along the way. +A high-level understanding of how your code executes, such as how Python and the most common data-structures and algorithms are implemented, can help you identify suboptimal approaches when programming. 
If you have learned to write code informally out of necessity, to get something to work, it's not uncommon to have collected some bad habits along the way. The remaining content is often abstract knowledge, that is transferable to the vast majority of programming languages. This is because the hardware architecture, data-structures and algorithms used are common to many languages and they hold some of the greatest influence over performance bottlenecks. -## Premature Optimisation - -> Programmers waste enormous amounts of time thinking about, or worrying about, the speed of noncritical parts of their programs, and these attempts at efficiency actually have a strong negative impact when debugging and maintenance are considered. We should forget about small efficiencies, say about 97% of the time: **premature optimization is the root of all evil**. Yet we should not pass up our opportunities in that critical 3%. - Donald Knuth - -This classic quote among computer scientists states; when considering optimisation it is important to focus on the potential impact, both to the performance and maintainability of the code. - -Profiling is a valuable tool in this cause. Should effort be expended to optimise a component which occupies 1% of the runtime? Or would that time be better spent focusing on the most expensive components? - -Advanced optimisations, mostly outside the scope of this course, can increase the cost of maintenance by obfuscating what code is doing. Even if you are a solo-developer working on private code, your future self should be able to easily comprehend your implementation. +## Optimising code from scratch: trade-off between performance and maintainability -Therefore, the balance between the impact to both performance and maintainability should be considered when optimising code. +> Programmers waste enormous amounts of time thinking about, or worrying about, the speed of noncritical parts of their programs, and these attempts at efficiency actually have a strong negative impact when debugging and maintenance are considered. We should forget about small efficiencies, say about 97% of the time: **premature optimisation is the root of all evil**. Yet we should not pass up our opportunities in that critical 3%. - Donald Knuth -This is not to say, don't consider performance when first writing code. The selection of appropriate algorithms and data-structures covered in this course form good practice, simply don't fret over a need to micro-optimise every small component of the code that you write. +This classic quote among computer scientists emphasizes the importance of considering both performance and maintainability when optimizing code. While advanced optimizations may boost performance, they often come at the cost of making the code harder to understand and maintain. Even if you're working alone on private code, your future self should be able to easily understand the implementation. Hence, when optimizing, always weigh the potential impact on both performance and maintainability. +This doesn't mean you should ignore performance when initially writing code. Choosing the right algorithms and data structures, as we’ve discussed in this course, is essential. However, there's no need to obsess over micro-optimizing every tiny component of your code—focus on the bigger picture. -## Ensuring Reproducible Results +## Ensuring Reproducible Results when optimising an existing code -When optimising your code, you are making speculative changes. 
It's easy to make mistakes, many of which can be subtle. Therefore, it's important to have a strategy in place to check that the outputs remain correct. +When optimizing existing code, you're often making speculative changes, which can lead to subtle mistakes. To ensure that your optimizations are actually improving the code without introducing errors, it's crucial to have a solid strategy for checking that the results remain correct. -Testing is hopefully already a seamless part of your research software development process. -Test can be used to clarify how your software should perform, ensuring that new features work as intended and protecting against unintended changes to old functionality. - -There are a plethora of methods for testing code. +Testing should already be an integral part of your development process. It helps clarify expected behavior, ensures new features are working as intended, and protects against unintended regressions in previously working functionality. Always verify your changes through testing to ensure that the optimizations don’t compromise the correctness of your code. ## pytest Overview -Most Python developers use the testing package [pytest](https://docs.pytest.org/en/latest/), it's a great place to get started if you're new to testing code. +There are a plethora of methods for testing code. Most Python developers use the testing package [pytest](https://docs.pytest.org/en/latest/), it's a great place to get started if you're new to testing code. Tests should be created within a project's testing directory, by creating files named with the form `test_*.py` or `*_test.py`. pytest looks for these files, when running the test suite. Within the created test file, any functions named in the form `test*` are considered tests that will be executed by pytest. The `assert` keyword is used, to test whether a condition evaluates to `True`. Here's a quick example of how a test can be used to check your function's output against an expected value. -Tests should be created within a project's testing directory, by creating files named with the form `test_*.py` or `*_test.py`. - -pytest looks for these files, when running the test suite. - -Within the created test file, any functions named in the form `test*` are considered tests that will be executed by pytest. - -The `assert` keyword is used, to test whether a condition evaluates to `True`. - ```python # file: test_demonstration.py diff --git a/episodes/optimisation-memory.md b/episodes/optimisation-memory.md index 27500316..51a66652 100644 --- a/episodes/optimisation-memory.md +++ b/episodes/optimisation-memory.md @@ -24,36 +24,32 @@ exercises: 0 The storage and movement of data plays a large role in the performance of executing software. -Modern computer's typically have a single processor (CPU), within this processor there are multiple processing cores each capable of executing different code in parallel. - -Data held in memory by running software is exists in RAM, this memory is faster to access than hard drives (and solid-state drives). -But the CPU has much smaller caches on-board, to make accessing the most recent variables even faster. +Modern computers have a single CPU with multiple cores, each capable of working on tasks at the same time. Data used by programs is stored in RAM, which is faster than hard drives or solid-state drives. However, the CPU has even faster memory called caches to access frequently used data quickly. 
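+
+One practical consequence of this hierarchy, explored in more detail below, is that the layout of your data matters: values stored next to each other in memory can be served from the caches, while scattered values force repeated trips to RAM. As a rough illustration (NumPy stores arrays row-major; the array size here is arbitrary and exact timings depend entirely on your hardware), compare traversing the same 2D array by rows versus by columns:
+
+```python
+import numpy as np
+from timeit import timeit
+
+grid = np.zeros((4096, 4096))  # row-major storage: each row is contiguous in memory
+
+def row_wise():
+    # Walks memory in order, so each cache line fetched from RAM is fully used.
+    return sum(grid[i, :].sum() for i in range(grid.shape[0]))
+
+def col_wise():
+    # Jumps thousands of bytes between consecutive elements, wasting most of each cache line.
+    return sum(grid[:, j].sum() for j in range(grid.shape[1]))
+
+print(f"row-wise:    {timeit(row_wise, number=10)*1000:.1f}ms (total over 10 runs)")
+print(f"column-wise: {timeit(col_wise, number=10)*1000:.1f}ms (total over 10 runs)")
+```
+
+On most machines the column-wise traversal is noticeably slower, even though both functions compute the same result.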
![An annotated photo of a computer's hardware.](episodes/fig/annotated-motherboard.jpg){alt="An annotated photo of inside a desktop computer's case. The CPU, RAM, power supply, graphics cards (GPUs) and harddrive are labelled."} -When reading a variable, to perform an operation with it, the CPU will first look in it's registers. These exist per core, they are the location that computation is actually performed. Accessing them is incredibly fast, but there only exists enough storage for around 32 variables (typical number, e.g. 4 bytes). -As the register file is so small, most variables won't be found and the CPU's caches will be searched. -It will first check the current processing core's L1 (Level 1) cache, this small cache (typically 64 KB per physical core) is the smallest and fastest to access cache on a CPU. -If the variable is not found in the L1 cache, the L2 cache that is shared between multiple cores will be checked. This shared cache, is slower to access but larger than L1 (typically 1-3MB per core). -This process then repeats for the L3 cache which may be shared among all cores of the CPU. This cache again has higher latency to access, but increased size (typically slightly larger than the total L2 cache size). -If the variable has not been found in any of the CPU's cache, the CPU will look to the computer's RAM. This is an order of magnitude slower to access, with several orders of magnitude greater capacity (tens to hundreds of GB are now standard). +How the CPU Accesses Data? +When the CPU needs to use a variable, it follows these steps: -Correspondingly, the earlier the CPU finds the variable the faster it will be to access. -However, to fully understand the cache's it's necessary to explain what happens once a variable has been found. +1) Registers: First, the CPU checks its own small, super-fast storage (registers). But it only has room for about 32 variables, so it usually doesn’t find the data here. +2) L1 Cache: Next, the CPU looks in the L1 cache. It’s small (64 KB per core) and fast, but it only stores data for a single core. +3) L2 Cache: If the variable isn’t in L1, it checks the larger L2 cache, which is shared by several cores. It’s slower than L1 but still faster than RAM. +4) L3 Cache: If the variable isn’t in L2, the CPU checks the L3 cache, which is shared by all cores. It’s slower than L2 but bigger. +5) RAM: If the variable is still not found, the CPU fetches it from the much slower RAM. +The faster the CPU finds the data in the cache, the quicker it can do the job. +This is why understanding how the cache works can help make things run faster. -If a variable is not found in the caches, so must be fetched from RAM. -The full 64 byte cache line containing the variable, will be copied first into the CPU's L3, then L2 and then L1. -Most variables are only 4 or 8 bytes, so many neighbouring variables are also pulled into the caches. -Similarly, adding new data to a cache evicts old data. -This means that reading 16 integers contiguously stored in memory, should be faster than 16 scattered integers +Cache Details: +When the CPU pulls data from RAM, it loads not just the variable, but also a full 64-byte chunk of memory called a "cache line." +This chunk often contains nearby variables that might be needed soon. When new data is added to the cache, old data is pushed out. -Therefore, to **optimally** access variables they should be stored contiguously in memory with related data and worked on whilst they remain in caches. 
-If you add to a variable, perform large amount of unrelated processing, then add to the variable again it will likely have been evicted from caches and need to be reloaded from slower RAM again. +Because of this, reading a list of data that’s next to each other in memory (like 16 numbers in a row) is much faster than reading scattered data, since the CPU can keep more of it in the cache. +To make programs run faster, related data should be stored next to each other in memory. +By working with this data while it's still in the cache, the CPU doesn’t have to go all the way to RAM, which is much slower. -It's not necessary to remember this full detail of how memory access work within a computer, but the context perhaps helps understand why memory locality is important. - +While you don’t need to know all the details of how memory works, it’s helpful to know that memory locality—keeping related data together and accessing it in chunks—is key to making programs run faster. ![An abstract diagram showing the path data takes from disk or RAM to be used for computation.](episodes/fig/hardware.png){alt='An abstract representation of a CPU, RAM and Disk, showing their internal caches and the pathways data can pass.'} ::::::::::::::::::::::::::::::::::::: callout @@ -177,7 +173,6 @@ Within Python memory is not explicitly allocated and deallocated, instead it is The below implementation of the [heat-equation](https://en.wikipedia.org/wiki/Heat_equation), reallocates `out_grid`, a large 2 dimensional (500x500) list each time `update()` is called which progresses the model. ```python -import time grid_shape = (512, 512) def update(grid, a_dt): @@ -226,7 +221,6 @@ Line # Hits Time Per Hit % Time Line Contents If instead `out_grid` is double buffered, such that two buffers are allocated outside the function, which are swapped after each call to update(). ```python -import time grid_shape = (512, 512) def update(grid, a_dt, out_grid): diff --git a/episodes/optimisation-minimise-python.md b/episodes/optimisation-minimise-python.md index 0e2666f6..a88f5f9d 100644 --- a/episodes/optimisation-minimise-python.md +++ b/episodes/optimisation-minimise-python.md @@ -20,17 +20,16 @@ exercises: 0 :::::::::::::::::::::::::::::::::::::::::::::::: -Python is an interpreted programming language. When you execute your `.py` file, the (default) CPython back-end compiles your Python source code to an intermediate bytecode. This bytecode is then interpreted in software at runtime generating instructions for the processor as necessary. This interpretation stage, and other features of the language, harm the performance of Python (whilst improving it's usability). +Python is an interpreted language. When you run a .py file, the CPython interpreter first converts the Python code into bytecode. CPython is the default implementation of Python, written in C. This bytecode is then processed at runtime to generate instructions for the CPU. While this makes Python easier to use, it can slow down performance. -In comparison, many languages such as C/C++ compile directly to machine code. This allows the compiler to perform low-level optimisations that better exploit hardware nuance to achieve fast performance. This however comes at the cost of compiled software not being cross-platform. +In contrast, languages like C/C++ are compiled directly into machine code, allowing the compiler to optimize for better performance. However, this means C/C++ programs aren't as easily portable across different platforms. 
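+
+If you are curious what this intermediate bytecode looks like, the standard-library [`dis`](https://docs.python.org/3/library/dis.html) module can disassemble any Python function. This is a small illustrative aside (the function below is made up for the example) rather than part of the lesson's exercises:
+
+```python
+import dis
+
+def add_one(x):
+    return x + 1
+
+# Prints one bytecode instruction per line (e.g. LOAD_FAST, RETURN_VALUE);
+# the exact opcode names vary between CPython versions.
+dis.dis(add_one)
+```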
-Whilst Python will rarely be as fast as compiled languages like C/C++, it is possible to take advantage of the CPython back-end and packages such as NumPy and Pandas that have been written in compiled languages to expose this performance. +Although Python isn’t as fast as languages like C/C++, it can still be efficient by using tools like NumPy and Pandas, which are written in faster compiled languages. -A simple example of this would be to perform a linear search of a list (in the previous episode we did say this is not recommended). -The below example creates a list of 2500 integers in the inclusive-exclusive range `[0, 5000)`. -It then searches for all of the even numbers in that range. -`searchlistPython()` is implemented manually, iterating `ls` checking each individual item in Python code. -`searchListC()` in contrast uses the `in` operator to perform each search, which allows CPython to implement the inner loop in it's C back-end. + +A simple example of this is performing a linear search on a list (though we mentioned in the previous episode that this isn’t the most efficient approach). In the following example, we create a list of 2500 integers in the range `[0, 5000)`. The goal is to search for all even numbers within that range. + +The function `searchlistPython()` manually iterates through the list (`ls`) and checks each item using Python code. On the other hand, `searchListC()` uses the `in` operator, which lets CPython handle the search more efficiently by running the inner loop in its C back-end. ```python import random @@ -281,7 +280,7 @@ In particular, those which are passed an `iterable` (e.g. lists) are likely to p ::::::::::::::::::::::::::::::::::::: callout -The built-in functions [`filter()`](https://docs.python.org/3/library/functions.html#filter) and [`map()`](https://docs.python.org/3/library/functions.html#map) can be used for processing iterables However list-comprehension is likely to be more performant. +The built-in functions [`filter()`](https://docs.python.org/3/library/functions.html#filter) and [`map()`](https://docs.python.org/3/library/functions.html#map) can be used for processing iterables. However, list-comprehension is likely to be more performant. @@ -292,11 +291,11 @@ The built-in functions [`filter()`](https://docs.python.org/3/library/functions. [NumPy](https://numpy.org/) is a commonly used package for scientific computing, which provides a wide variety of methods. -It adds restriction via it's own [basic numeric types](https://numpy.org/doc/stable/user/basics.types.html), and static arrays to enable even greater performance than that of core Python. However if these restrictions are ignored, the performance can become significantly worse. +It adds restriction via its own [basic numeric types](https://numpy.org/doc/stable/user/basics.types.html), and static arrays to enable even greater performance than that of core Python. However, if these restrictions are ignored, the performance can become significantly worse. ### Arrays -NumPy's arrays (not to be confused with the core Python `array` package) are static arrays. Unlike core Python's lists, they do not dynamically resize. Therefore if you wish to append to a NumPy array, you must call `resize()` first. If you treat this like `append()` for a Python list, resizing for each individual append you will be performing significantly more copies and memory allocations than a Python list. +NumPy's arrays (not to be confused with the core Python `array` package) are static arrays. 
Unlike core Python's lists, they do not dynamically resize. Therefore, if you wish to append to a NumPy array, you must call `resize()` first. If you treat this like `append()` for a Python list, resizing for each individual append you will be performing significantly more copies and memory allocations than a Python list. The below example sees lists and arrays constructed from `range(100000)`. @@ -390,7 +389,7 @@ There is however a trade-off, using `numpy.random.choice()` can be clearer to so ### Vectorisation -The manner by which NumPy stores data in arrays enables it's functions to utilise vectorisation, whereby the processor executes one instruction across multiple variables simultaneously, for every mathematical operation between arrays. +The manner by which NumPy stores data in arrays enables its functions to utilise vectorisation, whereby the processor executes one instruction across multiple variables simultaneously, for every mathematical operation between arrays. Earlier in this episode it was demonstrated that using core Python methods over a list, will outperform a loop performing the same calculation faster. The below example takes this a step further by demonstrating the calculation of dot product. @@ -416,11 +415,6 @@ print(f"numpy_sum_array: {timeit(np_sum_ar, setup=gen_array, number=repeats):.2f print(f"numpy_dot_array: {timeit(np_dot_ar, setup=gen_array, number=repeats):.2f}ms") ``` -* `python_sum_list` uses list comprehension to perform the multiplication, followed by the Python core `sum()`. This comes out at 46.93ms -* `python_sum_array` instead directly multiplies the two arrays, taking advantage of NumPy's vectorisation. But uses the core Python `sum()`, this comes in slightly faster at 33.26ms. -* `numpy_sum_array` again takes advantage of NumPy's vectorisation for the multiplication, and additionally uses NumPy's `sum()` implementation. These two rounds of vectorisation provide a much faster 1.44ms completion. -* `numpy_dot_array` instead uses NumPy's `dot()` to calculate the dot product in a single operation. This comes out the fastest at 0.29ms, 162x faster than `python_sum_list`. - ```output python_sum_list: 46.93ms python_sum_array: 33.26ms @@ -428,6 +422,11 @@ numpy_sum_array: 1.44ms numpy_dot_array: 0.29ms ``` +* `python_sum_list` uses list comprehension to perform the multiplication, followed by the Python core `sum()`. This comes out at 46.93ms +* `python_sum_array` instead directly multiplies the two arrays, taking advantage of NumPy's vectorisation. But uses the core Python `sum()`, this comes in slightly faster at 33.26ms. +* `numpy_sum_array` again takes advantage of NumPy's vectorisation for the multiplication, and additionally uses NumPy's `sum()` implementation. These two rounds of vectorisation provide a much faster 1.44ms completion. +* `numpy_dot_array` instead uses NumPy's `dot()` to calculate the dot product in a single operation. This comes out the fastest at 0.29ms, 162x faster than `python_sum_list`. + ::::::::::::::::::::::::::::::::::::: callout ## Parallel NumPy @@ -439,7 +438,7 @@ A small number of functions are backed by BLAS and LAPACK, enabling even greater The [supported functions](https://numpy.org/doc/stable/reference/routines.linalg.html) mostly correspond to linear algebra operations. -The auto-parallelisation of these functions is hardware dependant, so you won't always automatically get the additional benefit of parallelisation. 
+The auto-parallelisation of these functions is hardware-dependent, so you won't always automatically get the additional benefit of parallelisation. However, HPC systems should be primed to take advantage, so try increasing the number of cores you request when submitting your jobs and see if it improves the performance. *This might be why `numpy_dot_array` is that much faster than `numpy_sum_array` in the previous example!* @@ -449,7 +448,7 @@ However, HPC systems should be primed to take advantage, so try increasing the n ### `vectorize()` Python's `map()` was introduced earlier, for applying a function to all elements within a list. -NumPy provides `vectorize()` an equivalent for operating over it's arrays. +NumPy provides `vectorize()` an equivalent for operating over its arrays. This doesn't actually make use of processor-level vectorisation, from the [documentation](https://numpy.org/doc/stable/reference/generated/numpy.vectorize.html): @@ -497,7 +496,7 @@ Pandas' methods by default operate on columns. Each column or series can be thou Following the theme of this episode, iterating over the rows of a data frame using a `for` loop is not advised. The pythonic iteration will be slower than other approaches. -Pandas allows it's own methods to be applied to rows in many cases by passing `axis=1`, where available these functions should be preferred over manual loops. Where you can't find a suitable method, `apply()` can be used, which is similar to `map()`/`vectorize()`, to apply your own function to rows. +Pandas allows its own methods to be applied to rows in many cases by passing `axis=1`, where available these functions should be preferred over manual loops. Where you can't find a suitable method, `apply()` can be used, which is similar to `map()`/`vectorize()`, to apply your own function to rows. ```python from timeit import timeit @@ -571,7 +570,7 @@ vectorize: 1.48ms It won't always be possible to take full advantage of vectorisation, for example you may have conditional logic. -An alternate approach is converting your dataframe to a Python dictionary using `to_dict(orient='index')`. This creates a nested dictionary, where each row of the outer dictionary is an internal dictionary. This can then be processed via list-comprehension: +An alternate approach is converting your DataFrame to a Python dictionary using `to_dict(orient='index')`. This creates a nested dictionary, where each row of the outer dictionary is an internal dictionary. This can then be processed via list-comprehension: ```python def to_dict(): @@ -588,7 +587,7 @@ Whilst still nearly 100x slower than pure vectorisation, it's twice as fast as ` to_dict: 131.15ms ``` -This is because indexing into Pandas' `Series` (rows) is significantly slower than a Python dictionary. There is a slight overhead to creating the dictionary (40ms in this example), however the stark difference in access speed is more than enough to overcome that cost for any large dataframe. +This is because indexing into Pandas' `Series` (rows) is significantly slower than a Python dictionary. There is a slight overhead to creating the dictionary (40ms in this example), however the stark difference in access speed is more than enough to overcome that cost for any large DataFrame. 
```python from timeit import timeit diff --git a/episodes/profiling-functions.md b/episodes/profiling-functions.md index 54ba38b6..9e99c584 100644 --- a/episodes/profiling-functions.md +++ b/episodes/profiling-functions.md @@ -46,7 +46,7 @@ As a stack it is a last-in first-out (LIFO) data structure. ![A diagram of a call stack](fig/stack.png){alt="A greyscale diagram showing a (call)stack, containing 5 stack frame. Two additional stack frames are shown outside the stack, one is marked as entering the call stack with an arrow labelled push and the other is marked as exiting the call stack labelled pop."} -When a function is called, a frame to track it's variables and metadata is pushed to the call stack. +When a function is called, a frame to track its variables and metadata is pushed to the call stack. When that same function finishes and returns, it is popped from the stack and variables local to the function are dropped. If you've ever seen a stack overflow error, this refers to the call stack becoming too large. @@ -88,7 +88,7 @@ Hence, this prints the following call stack: traceback.print_stack() ``` -The first line states the file and line number where `a()` was called from (the last line of code in the file shown). The second line states that it was the function `a()` that was called, this could include it's arguments. The third line then repeats this pattern, stating the line number where `b2()` was called inside `a()`. This continues until the call to `traceback.print_stack()` is reached. +The first line states the file and line number where `a()` was called from (the last line of code in the file shown). The second line states that it was the function `a()` that was called, this could include its arguments. The third line then repeats this pattern, stating the line number where `b2()` was called inside `a()`. This continues until the call to `traceback.print_stack()` is reached. You may see stack traces like this when an unhandled exception is thrown by your code. @@ -102,7 +102,7 @@ You may see stack traces like this when an unhandled exception is thrown by your [`cProfile`](https://docs.python.org/3/library/profile.html#instant-user-s-manual) is a function-level profiler provided as part of the Python standard library. -It can be called directly within your Python code as an imported package, however it's easier to use it's script interface: +It can be called directly within your Python code as an imported package, however it's easier to use its script interface: ```sh python -m cProfile -o