This repository was archived by the owner on Aug 12, 2025. It is now read-only.

Commit 6fe28ec

Merge pull request #1 from UniExeterRSE/stephenpcook/material-review
Material review
2 parents 8570e2f + a3a4ced commit 6fe28ec

File tree

9 files changed (+215 −131 lines)

README.md

Lines changed: 86 additions & 67 deletions (49.2 KB)
Large diffs are not rendered by default.

lessons/conways_game_of_life.ipynb

Lines changed: 36 additions & 10 deletions
@@ -16,17 +16,19 @@
 "In this project, we are going to see implementations of **Conway's Game of Life**, a classic cellular automaton, in three ways: a pure Python approach (to run on the CPU), a vectorised approach using NumPy (to run on the CPU) and then using CuPy (to run on the GPU). We'll also visualise the evolution of the Game of Life grid to see the computation in action. \n",
 "\n",
 "## What is Conway's Game of Life?\n",
+"\n",
 "It's a zero-player game devised by John Conway, where you have a grid of cells that live or die based on a few simple rules:\n",
 "- Each cell can be \"alive\" (1) or \"dead\" (0).\n",
 "- At each time step (generation), the following rules apply to every cell simultaneously:\n",
-"- Any live cell with fewer than 2 live neighbours dies (underpopulation).\n",
-"- Any live cell with 2 or 3 live neighbours lives on to the next generation (survival).\n",
-"- Any live cell with more than 3 live neighbours dies (overpopulation).\n",
-"- Any dead cell with exactly 3 live neighbours becomes a live cell (reproduction).\n",
+"  - Any live cell with fewer than 2 live neighbours dies (underpopulation).\n",
+"  - Any live cell with 2 or 3 live neighbours lives on to the next generation (survival).\n",
+"  - Any live cell with more than 3 live neighbours dies (overpopulation).\n",
+"  - Any dead cell with exactly 3 live neighbours becomes a live cell (reproduction).\n",
 "- Neighbours are the 8 cells touching a given cell horizontally, vertically, or diagonally.\n",
 "- From these simple rules emerges a lot of interesting behaviour – stable patterns, oscillators, spaceships (patterns that move), etc. It's a good example of a grid-based simulation that can benefit from parallel computation because the state of each cell for the next generation can be computed independently (based on the current generation).\n",
 "\n",
 "## Visualisation of Game of Life\n",
+"\n",
 "To make this project more visually engaging, below is an **animated GIF** showing an example of a Game of Life simulation starting from a random initial configuration. White pixels represent live cells, and black pixels represent dead cells. You can see patterns forming, moving, and changing over time:\n",
 "An example evolution of Conway's Game of Life over a few generations (white = alive, black = dead).\n",
 "The animation demonstrates how random initial clusters of cells can evolve into interesting patterns. Notice some cells blink on and off or form moving patterns.\n",
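Editorial note: the four update rules listed in this hunk can be captured in a few lines. The helper below (`next_state` is a made-up name, not part of the course's `game_of_life.py`) is a minimal sketch of the per-cell logic:

```python
def next_state(alive: int, live_neighbours: int) -> int:
    """Apply Conway's rules to a single cell (1 = alive, 0 = dead)."""
    if alive and live_neighbours in (2, 3):
        return 1  # survival
    if not alive and live_neighbours == 3:
        return 1  # reproduction
    return 0  # underpopulation or overpopulation

# Each cell's fate depends only on its own state and its neighbour count.
print(next_state(1, 1), next_state(1, 2), next_state(0, 3), next_state(1, 4))
# → 0 1 1 0
```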
@@ -37,25 +39,39 @@
 "\n",
 "\n",
 "## Implementations\n",
+"\n",
 "All three implementations (Pure Python, NumPy and CuPy) are contained within the Python file located at `content/game_of_life.py`. \n",
 "\n",
 "To run the different versions of the code, you can use:\n",
 "\n",
 "**Naïve Python Version**\n",
 "\n",
-"`python game_of_life.py run_life_naive --size 100 --timesteps 50`\n",
+"```bash\n",
+"python game_of_life.py run_life_naive --size 100 --timesteps 50\n",
+"```\n",
+"\n",
 "which will produce a file called `game_of_life_naive.gif`.\n",
 "\n",
 "**CPU-Vectorized Version**\n",
-"`python game_of_life.py run_life_numpy --size 100 --timesteps 50`\n",
+"\n",
+"```bash\n",
+"python game_of_life.py run_life_numpy --size 100 --timesteps 50\n",
+"```\n",
+"\n",
 "which will produce a file called `game_of_life_cpu.gif`.\n",
 "\n",
 "**GPU-Accelerated Version**\n",
-"`python game_of_life.py run_life_cupy --size 100 --timesteps 50`\n",
+"\n",
+"```bash\n",
+"python game_of_life.py run_life_cupy --size 100 --timesteps 50\n",
+"```\n",
+"\n",
 "which will produce a file called `game_of_life_gpu.gif`.\n",
 "\n",
 "## Naive Implementation\n",
+"\n",
 "The core computation that is being performed for the naive implementation is: \n",
+"\n",
 "```python\n",
 "def life_step_naive(grid: np.ndarray) -> np.ndarray:\n",
 "    N, M = grid.shape\n",
@@ -86,6 +102,7 @@
 "### Explanation \n",
 "\n",
 "There are a number of different reasons that the naive implementation runs slowly, including: \n",
+"\n",
 "- **Nested Python Loops**: Instead of eight `np.roll` calls and one `np.where`, we make two loops over `i, j` (10^4 iterations) and two more loops over `di, dj` (9 checks each), for roughly 9x10^4 Python-level operations per step. \n",
 "- **Manual edge-wrapping logic**: Branching (`if ni < 0 … elif …`) for each neighbour check, instead of the single fast shift that `np.roll` does in C. \n",
 "- **Per-cell rule application**: The Game of Life rule is applied with Python `if/else` instead of a single vectorised Boolean mask. \n",
@@ -122,31 +139,34 @@
 "### Explanation\n",
 "\n",
 "#### From Per-Cell Loops to Whole-Array Operations \n",
+"\n",
 "In the **naive** version, every one of the NxN cells was traversed within two nested Python loops; then, for each cell, two more loops over the offsets `di` and `dj` counted its eight neighbours by computing `(i + di) % N` and `(j + dj) % M` in pure Python. \n",
 "**Cost**: ~9·N² Python-level iterations per generation, including branching and modulo arithmetic.\n",
 "**Drawback**: Thousands of interpreter calls and non-contiguous memory access. \n",
 "In the **NumPy** version, no Python loops over individual cells occur. Instead, eight calls to `np.roll` shift the entire grid array (up, down, left, right and on diagonals), automatically handling wrap-around in one C-level operation. Summing those eight arrays gives a full neighbour count in a single, optimised pass. \n",
 "\n",
 "#### Manual `if/else` vs Vectorised Mask \n",
+"\n",
 "In the **naive** implementation, after counting neighbours, each cell's fate is determined with a Python `if grid[i,j] == 1: ... else: ...` and assigned via `new[i,j] = ...`. \n",
 "In the **NumPy** implementation a single expression, `(neighbours == 3) | ((grid == 1) & (neighbours == 2))`, produces an NxN Boolean mask of *cells alive next*. Converting that mask to integers with `np.where(mask, 1, 0)` builds the entire next-generation grid in one C-level operation, with no per-element Python overhead. \n",
 "\n",
 "#### Automatic Wrap-Around vs Manual Modulo Logic\n",
+"\n",
 "In the **naive** version, every neighbour check does: \n",
 "\n",
 "```python \n",
 "ni = (i + di) % N\n",
 "nj = (j + dj) % M\n",
 "```\n",
 "\n",
-"with Python-level branching and modulo arithmetic on each of the 9 checks per cell. The associated **cost** is thousands of `%` operations and branch instructions per generation. \n",
+"with Python-level branching and modulo arithmetic on each of the 9 checks per cell. The associated **cost** is thousands of modulo (`%`) operations and branch instructions per generation. \n",
 "\n",
 "In the **NumPy** version, a single call to \n",
 "\n",
 "```python\n",
 "np.roll(grid, shift, axis=axis)\n",
 "```\n",
+"\n",
 "automatically wraps the entire array in one C-level operation. The **benefit** is that all per-cell `%` operations and branching are eliminated, being replaced by a single optimised memory shift over the whole grid. \n",
 "\n",
 "## GPU-Accelerated Implementation \n",
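Editorial note: the vectorised step described in this hunk (eight `np.roll` shifts plus one `np.where` mask) can be sketched in full. This is an illustrative reconstruction, not necessarily line-for-line identical to the course's `game_of_life.py`:

```python
import numpy as np

def life_step_numpy(grid: np.ndarray) -> np.ndarray:
    """One Game of Life generation: eight np.roll shifts, one Boolean mask."""
    # Sum the grid shifted in each of the 8 neighbour directions;
    # np.roll handles the wrap-around at the edges in C.
    neighbours = sum(
        np.roll(np.roll(grid, di, axis=0), dj, axis=1)
        for di in (-1, 0, 1)
        for dj in (-1, 0, 1)
        if (di, dj) != (0, 0)
    )
    # Alive next: exactly 3 neighbours, or already alive with 2 neighbours.
    return np.where((neighbours == 3) | ((grid == 1) & (neighbours == 2)), 1, 0)

# A "blinker" oscillates between a horizontal and a vertical bar of 3 cells.
blinker = np.zeros((5, 5), dtype=int)
blinker[2, 1:4] = 1
print(life_step_numpy(blinker))
```

Applying the step twice returns the blinker to its original orientation, which is a quick sanity check that the rules are implemented correctly.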
@@ -194,6 +214,7 @@
 "```\n",
 "\n",
 "#### Random initialisation \n",
+"\n",
 "**NumPy**: \n",
 "```Python \n",
 "grid = np.random.choice([0,1], size=(N,N), p=[1-p, p])\n",
@@ -205,18 +226,22 @@
 "```\n",
 "\n",
 "#### Data Transfer\n",
+"\n",
 "**CuPy**: \n",
+"\n",
 "```Python \n",
 "cp.asnumpy(grid_gpu) # bring a CuPy array back to NumPy\n",
 "```\n",
 "\n",
 "### Which to use?\n",
+"\n",
 "- **Large grids (e.g. N ≥ 500) or many timesteps**: the GPU's parallel throughput outweighs kernel-launch and transfer overhead.\n",
 "- **Small grids (e.g. 10×10)**: GPU overhead may dominate, so you may want to stick with NumPy.\n",
 "\n",
 "### Why is this quicker?\n",
 "\n",
 "When a computation can be expressed as the same operation applied independently across many data elements, like counting neighbours on every cell of a large Game of Life grid, GPUs often deliver dramatic speedups compared to CPUs. This advantage stems from several architectural and compiler-related factors that we discussed earlier in the section on theory, including: \n",
+"\n",
 "- **Massive Data Parallelism**\n",
 "  - **CPU**: A few (4–16) powerful cores optimised for sequential tasks and complex control flow.\n",
 "  - **GPU**: Hundreds to thousands of simpler cores running in lock-step.\n",
@@ -239,8 +264,9 @@
 "### How much quicker?\n",
 "\n",
 "Each implementation exhibits a different overall runtime, as you have probably noticed when running them from the command line. We can use the built-in UNIX command-line tool `time`, a simple profiler that measures how long a given program takes to run. It provides three primary metrics:\n",
+"\n",
 "- **real**: The \"wall-clock\" time elapsed from start to finish (i.e. actual elapsed time).\n",
-"- **user**: CPU time spent in user-mode *your programs own computations)\n",
+"- **user**: CPU time spent in user mode (your program's own computations).\n",
 "- **sys**: CPU time spent in kernel mode (system calls on behalf of your program)."
 ]
 },
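Editorial note: the `time` invocation described in this hunk can be tried against any command. Below is a minimal example with a stand-in computation (substitute `python game_of_life.py …` for the real measurements); `real`, `user` and `sys` are printed after the program's own output:

```shell
# `time` wraps any command and reports real/user/sys afterwards.
time python3 -c "print(sum(i*i for i in range(10**6)))"
```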

lessons/profiling.ipynb

Lines changed: 33 additions & 13 deletions
@@ -18,7 +18,7 @@
 "Python has a built-in profiler called **cProfile**. It can help you find which functions are taking up the most time in your program. This is key before you go into GPU acceleration; sometimes, you might find bottlenecks in places you didn't expect or identify parts of the code that would benefit the most from being moved to the GPU.\n",
 "\n",
 "### How to use cProfile \n",
-"You can make use of cProfile via the command line: `python -m cProfile -o profile_results.pstats myscript.py`, which will run `myscript.py` under the profiler and output stats to a file.\n",
+"You can make use of cProfile via the command line: `python -m cProfile -o profile_results.pstats myscript.py`, which will run `myscript.py` under the profiler and output stats to a file. In the following examples we will instead call cProfile directly within our scripts, and use the `pstats` library to create immediate summaries.\n",
 "\n",
 "```Python \n",
 "import cProfile\n",
@@ -70,17 +70,18 @@
 "simulate_life_naive(N=N, timesteps=STEPS, p_alive=P_ALIVE)\n",
 "\n",
 "profiler.disable() # ── stop profiling ─────────────────\n",
+"profiler.dump_stats(\"naive.pstat\") # ── save output ────────────────────\n",
 "\n",
 "stats = pstats.Stats(profiler).sort_stats('cumtime')\n",
 "stats.print_stats(10) # print top 10 functions by cumulative time\n",
 "\n",
 "\n",
 "```\n",
 "\n",
-"- **Interpreting cProfile output**: When you print stats, you'll see a table with columns including: \n",
-"- **ncalls**: number of calls to the function. \n",
-"- **tottime**: total time spent in the function (excluding sub-function calls). \n",
-"- **cumtime**: cumulative time spent in the function includes sub-functions.\n",
+"**Interpreting cProfile output**: When you print stats, you'll see a table with columns including: \n",
+"- **ncalls**: number of calls to the function \n",
+"- **tottime**: total time spent in the function (excluding sub-function calls) \n",
+"- **cumtime**: cumulative time spent in the function, including sub-functions\n",
 "- The function name\n",
 "\n",
 "```bash \n",
@@ -89,7 +90,25 @@
 "    100    4.147    0.041    4.150    0.041 4263274180.py:9(life_step_naive)\n",
 "... (other functions)\n",
 "```\n",
-"Therefore in the above table `ncalls` (100) tells you `life_step_naive` was invoked 100 times. `tottime` (4.147 s) is the time spent inside `life_step_naive` itself, excluding any functions it calls. `cumtime` (4.150 s) is the total time in `life_step_naive` plus any sub-calls it makes. So in this example, `life_step_naive` spent about 4.147 s in its own Python loops, and an extra ~0.003 s in whatever minor sub-calls it did (array indexing, % operations, etc.), for a total of 4.150 s. The per-call columns are simply` tottime/ncalls` and `cumtime/ncalls`, and the single call to `simulate_life_naive` shows its cumulative 4.312 s includes all the 100 naive steps plus the list-append overhead.\n",
+"Therefore in the above table `ncalls` (100) tells you `life_step_naive` was invoked 100 times; `tottime` (4.147 s) is the time spent inside `life_step_naive` itself, excluding any functions it calls; `cumtime` (4.150 s) is the total cumulative time in `life_step_naive` plus any sub-calls it makes. In this example, `life_step_naive` spent about 4.147 s in its own Python loops, and an extra ~0.003 s in whatever minor sub-calls it did (array indexing, `%` operations, etc.), for a total of 4.150 s. The per-call columns are simply `tottime/ncalls` and `cumtime/ncalls`, and the single call to `simulate_life_naive` shows its cumulative 4.312 s includes all the 100 naive steps plus the list-append overhead.\n",
+"\n",
+"### Visualising the Output with Snakeviz \n",
+"\n",
+"Snakeviz is a stand-alone tool, available through PyPI, that we can use to visualise the output of cProfile. We can install it with\n",
+"\n",
+"```bash\n",
+"poetry add snakeviz\n",
+"```\n",
+"\n",
+"We can use it to visualise a cProfile output such as the one generated from the above snippet:\n",
+"\n",
+"```bash\n",
+"poetry run snakeviz naive.pstat\n",
+"```\n",
+"\n",
+"which launches an interactive web app that we can use to explore the profiling timings.\n",
+"\n",
+"![Screenshot of SnakeViz](../_static/profiling/snakeviz_output.png)\n",
 "\n",
 "### Finding Bottlenecks \n",
 "\n",
"\n",
@@ -102,7 +121,7 @@
102121
"- `simulate_life_naive` appears once with `cumtime ≈ 4.312 s`, which covers the single Python loop plus all 100 calls to `life_step_naive`.\n",
103122
"\n",
104123
"Once you’ve identified the culprit:\n",
105-
"- If you have high `tottime` in a Python function, you may want to consider consider vectorising inner loops (e.g. switch to NumPy’s np.roll + np.where) or using a compiled extension.\n",
124+
"- If you have high `tottime` in a Python function, you may want to consider consider vectorising inner loops (e.g. switch to NumPy’s `np.roll` + `np.where`) or using a compiled extension.\n",
106125
"- If you have heavy external calls under your `cumtime`, then you may want to explore hardware acceleration (e.g. GPU via `CuPy`) or more efficient algorithms.\n",
107126
"\n",
108127
"## Profiling the CPU-Vectorised Implementation using NumPy. \n",
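Editorial note: the profile/dump/print pattern discussed in this hunk can be tried end-to-end with a toy workload. `busy` and `toy.pstat` below are made-up stand-ins for `simulate_life_naive` and `naive.pstat`; the cProfile/pstats calls themselves are the real standard-library API:

```python
import cProfile
import io
import pstats

def busy() -> int:
    # A deliberately loop-heavy function so it dominates the profile.
    total = 0
    for i in range(200_000):
        total += i % 7
    return total

profiler = cProfile.Profile()
profiler.enable()   # ── start profiling ──
busy()
profiler.disable()  # ── stop profiling ──

profiler.dump_stats("toy.pstat")  # same on-disk format that snakeviz reads

stream = io.StringIO()
stats = pstats.Stats(profiler, stream=stream).sort_stats("cumtime")
stats.print_stats(5)  # top 5 functions by cumulative time
print(stream.getvalue())
```

In the printed table, `busy` should appear with `ncalls` of 1 and nearly identical `tottime` and `cumtime`, since it makes no significant sub-calls.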
@@ -152,6 +171,7 @@
 "simulate_life_numpy(N=N, timesteps=STEPS, p_alive=P_ALIVE)\n",
 "\n",
 "profiler.disable() # ── stop profiling ─────────────────────────\n",
+"profiler.dump_stats('numpy.pstat') # ── save output ─────────────────────────\n",
 "\n",
 "stats = (\n",
 "    pstats.Stats(profiler)\n",
@@ -206,7 +226,7 @@
 "\n",
 "**NVIDIA Nsight Systems** is a profiler for GPU applications that provides a timeline of CPU and GPU activity. It can show: \n",
 "- When your code launched GPU kernels and how long they ran \n",
-"- GPU memory transfers between host and device. \n",
+"- GPU memory transfers between host and device \n",
 "- CPU-side functions as well (to correlate CPU and GPU)\n",
 "\n",
 "### Using Nsight Systems \n",
@@ -224,10 +244,10 @@
 "nsys stats profile_report.nsys-rep\n",
 "```\n",
 "\n",
-"An example `.nsys-rep` file has been included within the GitHub Repo for you to try the command with, at the filepath `_static/profiling/example_data_file.nsys-rep`. We will discuss the contents of the file in the section \"Example Output\" after discussing the needed code changes to generate the file. \n",
+"An example `.nsys-rep` file has been included within the GitHub repo for you to try the command with, at the filepath `_static/profiling/example_data_file.nsys-rep`. We will discuss the contents of the file in the section \"Example Output\" after discussing the necessary code changes to generate the file. \n",
 "\n",
 "### Code Changes \n",
-"To get the fine-tuned profiling, we also need to make some changes to the code. A new version of Conways Game of Life has been created and is located in `game_of_life_profiled.py`, where additional imports are needed: \n",
+"To get the fine-grained profiling, we also need to make some changes to the code. A new version of Conway's Game of Life has been created and is located in `game_of_life_profiled.py`, where additional imports are needed: \n",
 "\n",
 "```python \n",
 "from cupyx.profiler import time_range \n",
@@ -272,7 +292,7 @@
 "\n",
 "Unfortunately, you can't call the Python script itself as we did before, as the Python interpreter obfuscates the profiler, so we instead need to define a new entry point and call that to run the complete experiment. \n",
 "\n",
-"Together these are all the changes that are needed to create the data file and be able to understand better how the code is performing and where there are potential for further improvements to optimisation. "
+"Together these are all the changes needed to create the data file and better understand how the code is performing and where there is potential for further improvement through optimisation. "
 ]
 },
 {
@@ -371,7 +391,7 @@
 "\n",
 "The takeaways from this include the following:\n",
 "- **Python loops severely degrade performance**: Over 72% of run time is in the naive implementation, so vectorisation (NumPy/CuPy) is critical. \n",
-"- **Implicit syncs dominate**: `cudaFree` stalls the pipe, and so avoiding per-iteration free(0 calls by reusing buffers is key. \n",
+"- **Implicit syncs dominate**: `cudaFree` stalls the pipeline, so avoiding per-iteration free calls by reusing buffers is key. \n",
 "- **Kernel work is tiny**: Each kernel takes ~1-2µs; orchestration (kernel launches + memops) is the real bottleneck.\n",
 "- **Memcopy patterns matter**: 7200 small transfers add up, so we need to use larger batches of copies to reduce the overhead.\n",
@@ -528,7 +548,7 @@
 "Bringing everything together, some strategies include:\n",
 "\n",
 "**On the CPU side (Python)**: \n",
-"- **Vectorise Operations**: We saw this with NumPy; doing things in batch is faster than Python loops. \n",
+"- **Vectorise Operations**: We saw this with NumPy; doing things in batches is faster than Python loops. \n",
 "- **Use efficient libraries**: If a certain computation is slow in Python, see if there is a library (NumPy, SciPy, etc.) that does it in C or another language. \n",
 "- **Optimise algorithms**: Sometimes, a better algorithm can speed things up more than any level of optimisation. For example, if you find a certain computation is O(N^2) in complexity and it's slow, see if you can make it O(N log N) or similar.\n",
 "- **Consider multiprocessing or parallelisation**: Use multiple CPU cores (with `multiprocessing`, `joblib` or others) if appropriate.\n",
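Editorial note: the multiprocessing suggestion in the list above can be sketched with the standard-library `multiprocessing.Pool`; `square` is a hypothetical stand-in for a per-element workload:

```python
from multiprocessing import Pool

def square(x: int) -> int:
    return x * x

if __name__ == "__main__":
    # Spread a simple per-element computation over 4 worker processes.
    with Pool(processes=4) as pool:
        print(pool.map(square, range(8)))  # → [0, 1, 4, 9, 16, 25, 36, 49]
```

For tiny workloads like this the process start-up cost outweighs any gain, mirroring the GPU overhead trade-off discussed earlier; multiprocessing pays off when each element involves substantial work.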
