
Conversation

@msarkis-icr

We propose flipping the order of the course so that it starts with the complex part (optimised Python) and finishes with the profiling.

We went through most sections and rewrote those that seemed challenging to understand.
We also added a table summarising the different data types, with recommendations for when to use each one.

Please feel free to take what you need from our modifications.
Any feedback/corrections are also welcome :)

msarkis-icr and others added 23 commits January 14, 2025 15:22
Fix issue 21 - reviewing the optimisation course
Simplifying text in different sections
@github-actions

github-actions bot commented Feb 3, 2025

🆗 Pre-flight checks passed 😃

This pull request has been checked and contains no modified workflow files or spoofing.

It should be safe to Approve and Run the workflows that need maintainer approval.

@Robadob
Member

Robadob commented Feb 3, 2025

Thanks

I haven't got a huge amount of time dedicated to maintaining this, but I'll try to give your changes a thorough look through in the next month or so.

@Robadob Robadob mentioned this pull request Mar 16, 2025
# cp: Carpentries (to use for instructor training for instance)
# incubator: The Carpentries Incubator
carpentry: 'incubator'
carpentry: 'swc'
Member

It remains incubator until it's graduated to Carpentries Lab (aka super stable, well tested); I think it then has to graduate from the Lab (somehow?) to be officially adopted into Software Carpentry.


# Overall title for pages.
title: 'Performance Profiling & Optimisation (Python)'
title: 'Python Optimisation and Performance Profiling'
Member

I've discussed the order switching with colleagues and we remain firmly in the camp that profiling comes before optimisation (find the bottleneck, then identify whether it's something that can be addressed).

There's now a tentative plan to extend the profiling Predator-Prey exercise into the optimisation half of the course (#53); as you may have noticed, optimisation is very exercise-light, which won't really work without profiling being introduced first.

Author

OK, I understand your point of view.
Switching the order mostly fits the timing we chose for giving the course.
We find that students concentrate less after lunch, so we wanted a lighter section post-lunch.

Member

@Robadob Robadob Mar 20, 2025

> We find that students concentrate less after lunch,

I'm aware that some Carpentries courses instead run as two half-day sessions for this reason (our local Git course is normally structured that way).

# Information for Learners
learners:
- setup.md
- registration.md
Member

There's an unbranded fork of the course in the Carpentries Incubator.

https://github.com/carpentries-incubator/pando-python

Member

This is a great figure in principle; however, I'm concerned the handwritten text isn't widely accessible, as some readers may struggle with it (all the learning-support guidance at my institution is very adamant that we stick to clear sans-serif fonts).

With your blessing, I'm happy to reproduce it and include it in the course, giving you credit. Potentially with a few changes to the text too (e.g. more context for "continuous block of memory").

Author

Yeah, feel free to reproduce and modify it, and thank you for giving us credit for it.

![A visual explanation of linear probing, CPython uses an advanced form of this.](episodes/fig/hash_linear_probing.png){alt="A diagram showing how keys (hashes) 37, 64, 14, 94, 67 are inserted into a hash table with 11 indices. The insertion of 59, 80, and 39 demonstrates linear probing to resolve collisions."}
To look up or verify the existence of a key in a hashing data structure, the key is re-hashed, and the process mirrors that of insertion. The corresponding index is probed to see if it contains the provided key. If the key at the index matches, the operation succeeds. If an empty index is reached before finding the key, it indicates that the key does not exist in the structure.
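For illustration, a minimal Python sketch of insertion and lookup with linear probing (a toy example using the keys from the diagram; CPython's dictionaries and sets use a more advanced probing scheme and resize the table as it fills):

```python
# Toy open-addressing hash table with linear probing (illustrative only).
TABLE_SIZE = 11
table = [None] * TABLE_SIZE

def insert(key):
    index = hash(key) % TABLE_SIZE
    # Probe forward, wrapping around, until an empty slot is found.
    while table[index] is not None:
        index = (index + 1) % TABLE_SIZE
    table[index] = key

def contains(key):
    index = hash(key) % TABLE_SIZE
    # Probe forward until the key is found or an empty slot is reached.
    while table[index] is not None:
        if table[index] == key:
            return True
        index = (index + 1) % TABLE_SIZE
    return False

for key in [37, 64, 14, 94, 67, 59, 80, 39]:
    insert(key)

print(contains(94))  # True: found at its hashed index
print(contains(12))  # False: probing reaches an empty slot
```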

The above diagram shows a hash table of 5 elements within a block of 11 slots:
Member

This explanation of the figure is great, something that was missed in the original.

However, we're now planning to migrate some of these more technical explanations to a technical appendix, as they distract a bit from the course materials.

I think this section will get moved, though that's yet to be confirmed.

Regardless, there's a note added to #61 to review this when that's processed, which I'm hoping to do within the next month.

Author

Moving technical sections to an appendix is a great idea, making the course more accessible for most people.
We found the course very dense and advanced: we learned a lot reading it.

Independent of the performance to construct a unique set (as covered in the previous section), it's worth identifying the performance to search the data-structure to retrieve an item or check whether it exists.

The performance of a hashing data structure is subject to the load factor and number of collisions. An item that hashes with no collision can be checked almost directly, whereas one with collisions will probe until it finds the correct item or an empty slot. In the worst possible case, whereby all insert items have collided this would mean checking every single item. In practice, hashing data-structures are designed to minimise the chances of this happening and most items should be found or identified as missing with a single access.
The performance of a hashing data structure is subject to the load factor and number of collisions. An item that hashes with no collision can be checked almost directly, whereas one with collisions will probe until it finds the correct item or an empty slot. In the worst possible case, whereby all insert items have collided this would mean checking every single item. In practice, hashing data-structures are designed to minimise the chances of this happening and most items should be found or identified as missing with single access, result in an average time complexity of a constant (which is very good!).
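As a rough illustration of this behaviour (a hypothetical micro-benchmark, not part of the lesson text), membership tests against a `set` remain fast even when the equivalent `list` scan has to examine every element:

```python
import timeit

data_list = list(range(100_000))
data_set = set(data_list)

# A missing item forces the list to compare against all 100,000 elements,
# whereas the set usually resolves the lookup with a single hashed access.
print(timeit.timeit(lambda: -1 in data_list, number=1_000))
print(timeit.timeit(lambda: -1 in data_set, number=1_000))
```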
Member

The course does not explain O-notation or time complexity; I have reworded this though.

In practice, hashing data-structures are designed to minimise the chances of this happening and most items should be found or identified as missing on the first attempt (without probing beyond the original hash).

Likewise, not sure why you removed "a", and "result" feels like it should be "resulting".

Author

No problem dropping this change about O-notation.


::::::::::::::::::::::::::::::::::::: callout

Dictionaries are designed to handle insertions efficiently, with average-case O(1) time complexity per insertion for a small size dict, but it is clearly problematic for large size dict. In this case, it is better to find an alternative Data Structure for example List, NumPy Array or Pandas DataFrame. The table below summarizes the best uses and performance characteristics of each data structure:
Member

@Robadob Robadob Mar 16, 2025

As above, attendees are not expected to be aware of O-notation (the vast majority are not formally trained programmers); this would need to provide a lot more context (e.g. what O(1) means, what O(N) means, why O-notation is not the whole story [e.g. linked lists are rarely good in practice, due to scattered memory accesses]).

It may be suitable for the technical appendix in future.


<!-- Necessary to understand how code executes (to a degree) -->
In order to optimise code for performance, it is necessary to have an understanding of what a computer is doing to execute it.
These are the first steps in code optimisation: making better choices as you write your code and having an understanding of what a computer is doing to execute it.
Member

I've added a sentence equivalent to this to the original introduction, rest doesn't fit as we're not changing the order.

This is not to say, don't consider performance when first writing code. The selection of appropriate algorithms and data-structures covered in this course form good practice, simply don't fret over a need to micro-optimise every small component of the code that you write.
This classic quote among computer scientists emphasizes the importance of considering both performance and maintainability when optimizing code. While advanced optimizations may boost performance, they often come at the cost of making the code harder to understand and maintain. Even if you're working alone on private code, your future self should be able to easily understand the implementation. Hence, when optimizing, always weigh the potential impact on both performance and maintainability.

This doesn't mean you should ignore performance when initially writing code. Choosing the right algorithms and data structures, as we’ve discussed in this course, is essential. However, there's no need to obsess over micro-optimizing every tiny component of your code—focus on the bigger picture.
Member

@Robadob Robadob Mar 16, 2025

I prefer your wording; however, you've removed the reference to profiling, which I actually want to flesh out further with Amdahl's law (#71). I might just move that ahead of premature optimisation though.

I've created a WIP branch for my manual merge of the now heavily conflicted PR (https://github.com/RSE-Sheffield/pando-python/tree/icr-rse-manual-merge), though it's taking me longer to get through than I'd like so I probably won't have the manual merge drafted in a single day.

Author

Thank you for taking the time to check our changes and for adopting some of them.
I will be tracking the WIP icr-rse branch.

## Ensuring Reproducible Results when optimising existing code

<!-- This is also good practice when optimising your code, to ensure mistakes aren't made -->
When optimising your code, you are making speculative changes. It's easy to make mistakes, many of which can be subtle. Therefore, it's important to have a strategy in place to check that the outputs remain correct.
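One common strategy, sketched here with hypothetical function names, is to seed any randomness, capture a reference output from the original implementation, and assert that the optimised version reproduces it:

```python
import numpy as np

def original_model(grid):
    # The existing, unoptimised implementation (hypothetical placeholder).
    return grid * 0.25

def optimised_model(grid):
    # The speculatively optimised implementation (hypothetical placeholder).
    return grid / 4.0

rng = np.random.default_rng(seed=12345)  # fixed seed so inputs are reproducible
grid = rng.random((500, 500))

# The optimised output should match the reference within floating-point tolerance.
assert np.allclose(original_model(grid), optimised_model(grid))
```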
Member

I've used most of the suggested reword from this section.

The storage and movement of data plays a large role in the performance of executing software.

<!-- Brief summary of hardware -->
Modern computers typically have a single processor (CPU); within this processor there are multiple processing cores, each capable of executing different code in parallel.
Member

Leaving this section untouched, have added a note to #61 to review it when putting that together.

The below implementation of the [heat-equation](https://en.wikipedia.org/wiki/Heat_equation) reallocates `out_grid`, a large two-dimensional (500x500) list, each time `update()` is called to progress the model.

```python
import time
```
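The pattern being described can be sketched as follows (hypothetical code, not the lesson's actual heat-equation implementation): the first version allocates a fresh 500x500 list of lists on every call, while the second reuses a buffer allocated once by the caller, avoiding the repeated allocation cost.

```python
N = 500

def update_reallocating(grid):
    # Allocates a brand new N x N list of lists on every call.
    out_grid = [[0.0] * N for _ in range(N)]
    for i in range(1, N - 1):
        for j in range(1, N - 1):
            out_grid[i][j] = 0.25 * (grid[i - 1][j] + grid[i + 1][j]
                                     + grid[i][j - 1] + grid[i][j + 1])
    return out_grid

def update_reusing(grid, out_grid):
    # Writes into a buffer that the caller allocated once, up front.
    for i in range(1, N - 1):
        for j in range(1, N - 1):
            out_grid[i][j] = 0.25 * (grid[i - 1][j] + grid[i + 1][j]
                                     + grid[i][j - 1] + grid[i][j + 1])
    return out_grid
```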
Member

thanks

- Able to identify when Python code can be rewritten to perform execution in the back-end.
- Able to utilise NumPy's vectorisation when operating on arrays of data.
- Able to efficiently process rows when working with data tables.

Member

@Robadob Robadob Mar 21, 2025

Lots of typo corrections taken from this file.

Some rewording used; however, much of the initial CPython vs C content has already been reworded and split out to other files, so it's more difficult to integrate all these changes. I have done so where possible.


As a stack it is a last-in first-out (LIFO) data structure.

![A diagram of a call stack](fig/stack.png){alt="A greyscale diagram showing a (call)stack, containing 5 stack frame. Two additional stack frames are shown outside the stack, one is marked as entering the call stack with an arrow labelled push and the other is marked as exiting the call stack labelled pop."}
Member

All of these corrections used (clearly my usage of "it's" as a possessive is wrong).


## Introduction

<!-- what if my code is not performant -->
Member

Not required, due to order not changing.

Profiling is most relevant to working code, when you have reached a stage that the code works and are considering deploying it.

Any code that will run for more than a few minutes over it's lifetime, that isn't a quick one-shot script can benefit from profiling.
Any code that will run for more than a few minutes over its lifetime, that isn't a quick one-shot script can benefit from profiling.
Member

Rest of the corrections used.

Opted for time-steps rather than time steps

Whilst profiling, you may find that function-level profiling highlights expensive methods where you can't easily determine the cause of the cost due to their complexity.

<!-- What -->
Line level profiling allows you to target specific methods to collect more granular metrics, which can help narrow the source of expensive computation further. Typically line-level profiling will calculate the number of times each line is called and the total time spent executing each line. However, with the increased granularity come increased collection costs, which is why it's targeted to specific methods.
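For example, with the `line_profiler` package (one common tool for this; the lesson may use a different workflow), a specific method is targeted by decorating it, and per-line hit counts and timings are then reported:

```python
# Requires the line_profiler package (pip install line_profiler).
# Run with:  kernprof -l -v my_script.py

@profile  # injected as a built-in by kernprof; no import required
def expensive_method(n):
    total = 0
    for i in range(n):    # the hit count and time spent on each of
        total += i * i    # these lines is recorded and reported
    return total

if __name__ == "__main__":
    expensive_method(1_000_000)
```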
Member

Full changes applied (timestep -> time-step).


If you are now comfortable using Python, this course may be of interest to supplement and advance your programming knowledge. This course is particularly relevant if you are writing research code and desire greater confidence that your code is both performant and suitable for publication.

If you are now comfortable using Python, this course may be of interest to supplement and advance your programming knowledge. This course is particularly relevant if you are writing from scratch or re-using and existing research code and would desire a greater confidence that your code is both performant and suitable for publication.
Member

Haven't used this; I don't think it adds much, just more verbosity about new vs existing code.
I did try changing it to "working with research code", but felt that was too broad.


<!-- conda create -n pando python
conda activate pando -->
To create a new Anaconda environment named `py311_env` with Python 3.11, use the following command for conda:
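```sh
# Create and activate the environment (assuming a conda-based install).
conda create -n py311_env python=3.11
conda activate py311_env
```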
Member

Added, slightly reworded as people might be using Miniconda or similar.

Member

@Robadob Robadob left a comment

I think I've now merged across all the appropriate changes manually (e.g. your team's names are gone from the git history) over in PR #73 (the branch I mentioned the other day).

Please see my comment there, regarding how you want your team (members) to be acknowledged.

Once @JostMigenda has also helped me iterate the figure, it's probably good to be merged (and this closed, although the technical-appendix issue refers to it, so please don't delete the branch!).


Robadob added a commit that referenced this pull request Mar 21, 2025
Including an additional figure that was reproduced to improve the accessibility of the font used.

This is a manual merge of appropriate changes from PR (#51) due to large conflicts with earlier merged optimisation PRs.

With thanks to the original authors of these changes @msarkis-icr and @stacyrse
Robadob added a commit that referenced this pull request Mar 24, 2025
Including an additional figure that was reproduced to improve the accessibility of the font used.

This is a manual merge of appropriate changes from PR (#51) due to large conflicts with earlier merged optimisation PRs.

With thanks to the original authors of these changes @msarkis-icr and @stacyrse
JostMigenda pushed a commit that referenced this pull request Mar 24, 2025

Including an additional figure that was reproduced to improve the accessibility of the font used.

This is a manual merge of appropriate changes from PR (#51) due to large conflicts with earlier merged optimisation PRs.

With thanks to the original authors of these changes @msarkis-icr and @stacyrse
github-actions bot pushed a commit that referenced this pull request Mar 24, 2025
Auto-generated via `{sandpaper}`
Source  : f8e9b79
Branch  : main
Author  : Robert Chisholm <[email protected]>
Time    : 2025-03-24 12:43:59 +0000
Message : Text simplification and typographic improvements via @ICR-RSE-Group (#73)

Including an additional figure that was reproduced to improve the accessibility of the font used.

This is a manual merge of appropriate changes from PR (#51) due to large conflicts with earlier merged optimisation PRs.

With thanks to the original authors of these changes @msarkis-icr and @stacyrse
github-actions bot pushed a commit that referenced this pull request Mar 24, 2025
Auto-generated via `{sandpaper}`
Source  : cb617f4
Branch  : md-outputs
Author  : GitHub Actions <[email protected]>
Time    : 2025-03-24 12:44:46 +0000
Message : markdown source builds

Auto-generated via `{sandpaper}`
Source  : f8e9b79
Branch  : main
Author  : Robert Chisholm <[email protected]>
Time    : 2025-03-24 12:43:59 +0000
Message : Text simplification and typographic improvements via @ICR-RSE-Group (#73)

Including an additional figure that was reproduced to improve the accessibility of the font used.

This is a manual merge of appropriate changes from PR (#51) due to large conflicts with earlier merged optimisation PRs.

With thanks to the original authors of these changes @msarkis-icr and @stacyrse