Merge branch 'refine_the_tutorial'

wvictor14 · wvictor14 · commit 67ec40c992fe · 2025-09-16T23:06:53.000Z
diff --git a/02_develop_visualization.qmd b/02_develop_visualization.qmd
@@ -9,10 +9,26 @@ editor:
   render-on-save: true
 ---
 
-# Setup
+Current Canadian sentiment is at a low, with high cost-of-living, global political instability, and sweeping layoffs across multiple sectors. For the [2025 `plotnine` contest](https://posit.co/blog/announcing-the-2025-table-and-plotnine-contests/), I wanted to explore current official Canadian labour statistics using `plotnine`, a data visualization library in `python`.
+
+# Introduction 
+
+I am so happy that `plotnine`, a relatively new Python data visualization package, exists. `plotnine` is based on `ggplot2`, an R package that I have been using for almost a decade.
+
+In this tutorial, I'll walk through the process of creating my `plotnine` 2025 contest submission. The plot shows employment across Canadian industries, ranked by their percent change in monthly employment. To help visualize data across different industries, industry-specific plots are laid out in a "pseudo" interactive manner.
+
+# Setup 
+
+## Data
+
+The data can be downloaded using this bash [script](https://github.com/wvictor14/labourcan/blob/main/data/downloadLabourData.sh), or directly from [StatCan's website](https://www150.statcan.gc.ca/t1/tbl1/en/tv.action?pid=1410035502).
 
 ## Parameters
 
+In this initial code chunk we initialize some parameters so that, if needed, we can later rerun this entire notebook with different values (e.g. different years).
+
+`pyprojroot` is similar to R's package `here`, which lets us construct filepaths relative to the project root. This is especially convenient for Quarto projects with complex file organization.
+
 ```{python}
 from pyprojroot import here
 ```
@@ -26,21 +42,28 @@ FILTER_YEAR = (2018, 2025)
 ## Libraries
 
 ```{python}
+# Data manipulation
 import polars as pl
 import polars.selectors as cs
+
+# Visualization
+from plotnine import *
+
+# Mizani helps customize the text and breaks on axes
 from mizani.bounds import squish
 import mizani.labels as ml
 import mizani.breaks as mb
-import textwrap
-from pyprojroot import here
-from great_tables import GT, md, html
-from plotnine import *
-from labourcan.data_processing import read_labourcan,calculate_centered_rank
+import textwrap  # for wrapping long lines of text
+
+# Custom extract and transform functions for plot data
+from labourcan.data_processing import read_labourcan, calculate_centered_rank
 ```
 
-## Read data
+## Read and process data for graphing
 
-[`read_labourcan`](../py/labourcan/data_processing.py) returns a polars with:
+The visualization required a fair amount of data processing which is detailed in this [page](01_develop_data_processing.html). The steps are summarized here:
+
+[`read_labourcan`](../py/labourcan/data_processing.py) returns a `polars.DataFrame` with:
 
 - Unused columns removed
 - Filtered to seasonally adjusted estimates only
@@ -55,93 +78,76 @@ labour = read_labourcan(LABOUR_DATA_FILE)
 labour_processed = calculate_centered_rank(labour)
 ```
 
-# Heatmap of Employment Numbers
+# A first attempt 
+
+The type of visual that's being developed here is something like a heatmap of employment numbers.
 
-Let's take a stab at a first visual. 
+We want a clean separation of industries that are growing or shrinking. For that we use a rank ordering by monthly % change. This is not an ordinary rank, though: we center it around 0 so that sectors that are growing (% change > 0) have a positive rank and those that are shrinking have a negative rank.
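
A minimal sketch of the centered-rank idea, in plain Python rather than the `polars` code behind `calculate_centered_rank`. The function name `centered_rank_sketch`, the industry names and values, and the choice to give zero change a rank of `0` are all assumptions made for illustration; the real function may handle ties and zeros differently.

```python
def centered_rank_sketch(pct_change):
    """Rank one month's industries around zero (illustrative, ignores ties).

    Growing industries get 1, 2, ... (smallest gain first);
    shrinking industries get -1, -2, ... (smallest loss first).
    """
    growing = sorted(
        (k for k, v in pct_change.items() if v > 0), key=pct_change.get
    )
    shrinking = sorted(
        (k for k, v in pct_change.items() if v < 0),
        key=pct_change.get,
        reverse=True,
    )
    # Assumption: an industry with exactly zero change sits at rank 0
    ranks = {name: 0 for name, v in pct_change.items() if v == 0}
    ranks.update({name: i for i, name in enumerate(growing, start=1)})
    ranks.update({name: -i for i, name in enumerate(shrinking, start=1)})
    return ranks


# Hypothetical month of data (fractions, so 0.004 = 0.4%)
example_month = {
    "Construction": 0.004,
    "Utilities": -0.002,
    "Retail": 0.001,
    "Mining": -0.010,
}
example_ranks = centered_rank_sketch(example_month)
```

Plotted on the y-axis, ranks like these stack growing industries above zero and shrinking ones below it.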
+
+`scale_color_gradient2` is a great option because it lets us specify `midpoint=0`.
 
 ```{python}
 # | page-layout: column-page
 (
     ggplot(
         (
             labour_processed.filter(
-                pl.col("YEAR") >= FILTER_YEAR[0],
-                pl.col("YEAR") <= FILTER_YEAR[1]
+                pl.col("YEAR") >= FILTER_YEAR[0], pl.col("YEAR") <= FILTER_YEAR[1]
             )
-        ), aes(x="DATE_YMD", y="centered_rank_across_industry", color="PDIFF"))
+        ),
+        aes(x="DATE_YMD", y="centered_rank_across_industry", color="PDIFF"),
+    )
     + geom_point(shape="s")
     + theme_tufte()
     + theme(figure_size=FIGURE_THEME_SIZE, axis_text_x=element_text(angle=90))
-
     + scale_color_gradient2(
-        limits=(-0.01, 0.01), low="#ff0000ff", high="#0000dbff", midpoint=0, oob=squish)
+        limits=(-0.01, 0.01), low="#ff0000ff", high="#0000dbff", midpoint=0, oob=squish
+    )
 )
 ```
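
What `oob=squish` does in the scale above can be sketched in plain Python: values outside `limits` are clamped to the nearest limit, so months with extreme changes still get the strongest color. `squish_sketch` is a hypothetical stand-in for illustration, not `mizani`'s implementation.

```python
def squish_sketch(values, limits=(-0.01, 0.01)):
    """Clamp each value into the closed interval given by limits."""
    low, high = limits
    return [min(max(v, low), high) for v in values]


# -0.05 and 0.02 fall outside the limits and are pulled back to them
squished = squish_sketch([-0.05, -0.004, 0.0, 0.02])
```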
 
 ## `geom_point` or `geom_tile`
 
-It looks good. but the whitespace between each point is distracting. I could make 
+The whitespace between each point is distracting. I could make 
 the point size larger, but the ratio of point size to range of the x and y axis, as well as
-the figure size all will determine ultimately how much whitespace remains between each point.
+the figure size all will ultimately determine how much whitespace remains between each point.
 
-We can use `geom_tile` instead, which will plot rectangles specified by a center point. 
+If we use `geom_tile` instead, which will plot rectangles specified by a center point, we can explicitly control the whitespace between tiles.
 
 ```{python}
-labour_processed_cat = labour_processed.drop_nulls(
-    ['centered_rank_across_industry'])
-order = (
-    labour_processed_cat.select('centered_rank_across_industry').unique().sort(
-        'centered_rank_across_industry')
-    .to_series()
-    .cast(pl.Utf8)
-    .to_list()
-)
-
-labour_processed_cat = (
-    labour_processed_cat.with_columns(
-        pl.col('centered_rank_across_industry').cast(
-            pl.Utf8).cast(pl.Enum(categories=order)).alias('centered_rank_cat')
-    )
-)
-
 (
-    ggplot((
-        labour_processed_cat.filter(
-            pl.col("YEAR") >= FILTER_YEAR[0],
-            pl.col("YEAR") <= FILTER_YEAR[1]
-        )
-    ), aes(x="DATE_YMD", y="centered_rank_across_industry", fill="PDIFF"))
-    + geom_tile(height=0.95)  # whitespace between tiles, vertically
-    + theme_tufte()
-    + theme(
-        figure_size=FIGURE_THEME_SIZE,
-        axis_text_x=element_text(angle=90)
+    ggplot(
+        (
+            labour_processed.filter(
+                pl.col("YEAR") >= FILTER_YEAR[0], pl.col("YEAR") <= FILTER_YEAR[1]
+            )
+        ),
+        aes(x="DATE_YMD", y="centered_rank_across_industry", fill="PDIFF"),
     )
+    + geom_tile(height=0.95, width=30 * 0.95)  # <1>
+    + theme_tufte()
+    + theme(figure_size=FIGURE_THEME_SIZE, axis_text_x=element_text(angle=90))
     + scale_fill_gradient2(
-        limits=(-0.01, 0.01), low="#ff0000ff", high="#0000dbff", midpoint=0, oob=squish)
-
+        limits=(-0.01, 0.01), low="#ff0000ff", high="#0000dbff", midpoint=0, oob=squish
+    )
 )
 ```
 
-This is looking pretty good. I added `height = 0.95` to add some whitespace between tiles vertically.
-I actually wanted to remove whitespace completely, but I discovered `width` for `geom_tile` doesn't 
-work the same as it does for `ggplot2`. If I set `width=1` it seems to make the tiles smaller, instead of wider.
-
+1. I added `height = 0.95` to add some whitespace between tiles vertically. To remove horizontal whitespace, we need to specify a `width`. Because we are using a `datetime` axis, the width must be given in units of days. Each tile here is a month, so we express it as roughly 30 days, hence `width = 30*0.95`.
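
Thirty days is an approximation, since months run from 28 to 31 days. As a rough check, and not part of the contest code, the standard library can show how an exact tile width would vary by month. `tile_width_days` is a hypothetical helper.

```python
import calendar


def tile_width_days(year, month, scale=0.95):
    """Width in days of a tile covering `scale` of the given month."""
    days_in_month = calendar.monthrange(year, month)[1]
    return days_in_month * scale


# January (31 days), February of a leap year (29 days), April (30 days)
widths = {month: tile_width_days(2024, month) for month in (1, 2, 4)}
```

A fixed `width=30 * 0.95` therefore leaves slightly uneven gaps between months.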
 
 ## Explicit color mapping with `scale_color_manual`
 
 I am fairly happy with the `scale_fill_gradient2` used with `squish`. We get a really nice palette 
-that's centered around 0. However `scale_fill_gradient2` is limited to 3 colors (high, midpoint, low),
+that's centered around 0. However `scale_fill_gradient2` is limited to 3 colors (`high`, `midpoint`, `low`),
 which does not quite enable the more dynamic color palette that I'm seeking.
 
-To be more explicit with the colors, I will bin the `PDIFF` and map colors manually
-using `scale_fill_manual`
+To be more explicit with the colors, I will bin the % change variable and then map each bin to a color manually using `scale_fill_manual`.
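
The bin-then-map idea can be sketched without `polars`, using `bisect` from the standard library. The break points, labels, and hex colors here are placeholders chosen for illustration, not the ones used in the final plot.

```python
from bisect import bisect_left

# Hypothetical break points (fractions, so 0.005 = 0.5%)
BREAKS = [-0.005, 0.0, 0.005]
LABELS = ["below -0.5%", "-0.5% to 0%", "0% to 0.5%", "above 0.5%"]
# Hypothetical colors: warm for contraction, cool for growth
COLORS = dict(zip(LABELS, ["#d7301f", "#fdcc8a", "#b2e2e2", "#238b45"]))


def bin_label(value):
    """Return the label of the right-closed bin that contains value."""
    return LABELS[bisect_left(BREAKS, value)]


binned = [bin_label(v) for v in (-0.02, -0.001, 0.0, 0.003, 0.03)]
fills = [COLORS[label] for label in binned]
```

Because the bins are discrete, each one can be given any color, which is the flexibility that a three-color gradient lacks.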
 
 ### Bin with `polars.Series.cut`
 
 ```{python}
 labour_processed_cutted = (
-    labour_processed_cat.with_columns(
+    labour_processed.with_columns(
         pl.col("PDIFF")
         .cut(
             [
@@ -180,7 +186,7 @@ labour_processed_cutted.group_by("PDIFF_BINNED").len()
                 pl.col("YEAR") >= FILTER_YEAR[0], pl.col("YEAR") <= FILTER_YEAR[1]
             )
         ),
-        aes(x="DATE_YMD", y="centered_rank_cat", fill="PDIFF_BINNED"),
+        aes(x="DATE_YMD", y="centered_rank_across_industry", fill="PDIFF_BINNED"),
     )
     + geom_tile(height=0.95)  # whitespace between tiles, vertically
     + theme_tufte()
@@ -190,9 +196,9 @@ labour_processed_cutted.group_by("PDIFF_BINNED").len()
 
 ### `scale_fill_manual` for explicit color mapping
 
-Now we need to order the levels, and map explicit colors
+Now we need to order the levels, and map to a specific color palette.
 
-We will make PDIFF=0% to be gray, positive values to have a green and blue colors (job growth = good), and negative values to have warmer (alarming, bad) colors.
+We will make `PDIFF=0%` (no change) gray, positive values `green` and `blue` (*growth* = *good*), and negative values `red` and `orange` (*contraction* = *bad*).
 
 ```{python}
 order = (
@@ -231,23 +237,24 @@ color_mapping = {
                 pl.col("YEAR") >= FILTER_YEAR[0], pl.col("YEAR") <= FILTER_YEAR[1]
             )
         ),
-        aes(x="DATE_YMD", y="centered_rank_across_industry", fill="PDIFF_BINNED"),
+        aes(x="DATE_YMD", y="centered_rank_across_industry", fill="PDIFF_BINNED"), # <1>
     )
     + geom_tile(color="white")
-    # + geom_point(shape="s")
     + theme_tufte()
     + theme(figure_size=FIGURE_THEME_SIZE, axis_text_x=element_text(angle=90))
-    + scale_fill_manual(values=color_mapping, breaks=order)
+    + scale_fill_manual(values=color_mapping, breaks=order) # <2>
 )
 ```
 
-That looks great. The power of `scale_fill_manual` enables much more control over 
-the color palette. However, the cost was that it takes a lot more effort and lines of code
-to create a custom mapping. 
+1. map `fill` to `PDIFF_BINNED`
+2. provide explicit color mapping to `scale_fill_manual`
+
+The power of `scale_fill_manual` is that it enables much more explicit control over 
+how color is mapped to data. The cost is that it takes a lot more effort and lines of code than `scale_fill_gradient2`, which works well out of the box.
 
 ## The legend
 
-...is extremely accurate, however we are going to simplify it and nicer to look at.
+...is mathematically accurate, but we are going to make it nicer to look at.
 
 First let's make the text more concise: we don't need every bin to be labelled, and instead of listing the range, we can just describe the midpoint.
 
@@ -289,17 +296,19 @@ legend_labels = [
         legend_key_height=10,
         legend_text=element_text(size=8),
     )
-    + scale_fill_manual(values=color_mapping, breaks=order, labels=legend_labels)
+    + scale_fill_manual(values=color_mapping, breaks=order, labels=legend_labels) # <1>
 )
 ```
 
-Looks much better than my first attempt with a [horizontal legend](#horizontal-legend-with-horizontal-legend-text)
+1. provide the list `legend_labels` to `scale_fill_manual`
+
+I originally wanted to make a [horizontal legend](#horizontal-legend-with-horizontal-legend-text), but this works much better.
 
 ## Text and fonts
 
-Next up is the text and fonts. I played with a few fonts on [google fonts](https://fonts.google.com/) before settling on two. 
+Next up is the text and fonts. I played with a few fonts on [Google Fonts](https://fonts.google.com/) before settling on two. Note that this website uses these fonts with the help of [brand.yml](_brand.yml).
 
-First, install the fonts:
+Install the fonts:
 
 ```{python}
 FONT_PRIMARY = "Playfair Display"
@@ -309,7 +318,9 @@ fk.install(FONT_PRIMARY)
 fk.install(FONT_SECONDARY)
 ```
 
-plotnine breaks and labels for the scales can be easily adjusted using `mizani`, which is like the `scales` equivalent to `ggplot2`
+### `mizani` for axis breaks and labels
+
+`plotnine` breaks and labels for the scales can be easily adjusted using [`mizani`](https://mizani.readthedocs.io/en/stable/), which is to `plotnine` what [`scales`](https://scales.r-lib.org/) is to `ggplot2`.
 
 We're going to use `mizani.breaks.breaks_date_width` to put breaks for each year, and `mizani.labels.label_date` to drop the "month" part of the date. 
 
@@ -327,10 +338,10 @@ plot = (
     + geom_tile(color="white", height=0.95)
     + theme_tufte()
     + theme(
-        text=element_text(family=FONT_PRIMARY),
+        text=element_text(family=FONT_PRIMARY),               # <1>
         figure_size=FIGURE_THEME_SIZE,
-        axis_text_y=element_text(family=FONT_SECONDARY),
-        axis_text_x=element_text(family=FONT_SECONDARY),
+        axis_text_y=element_text(family=FONT_SECONDARY),      # <1>
+        axis_text_x=element_text(family=FONT_SECONDARY),      # <1>
         axis_title_y=element_text(weight=300),
         legend_justification_right=1,
         legend_position="right",
@@ -339,7 +350,7 @@ plot = (
         legend_key_spacing=0,
         legend_key_width=15,
         legend_key_height=15,
-        legend_text=element_text(size=8, family=FONT_SECONDARY),
+        legend_text=element_text(size=8, family=FONT_SECONDARY), # <1>
         legend_title=element_blank(),
         plot_title=element_text(ha="left"),
         plot_subtitle=element_text(
@@ -349,11 +360,11 @@ plot = (
                         breaks=order, labels=legend_labels)
     + guides(fill=guide_legend(ncol=1, reverse=True))
     + scale_x_datetime(
-        labels=ml.label_date("%Y"),  # Format labels to show only the year
+        labels=ml.label_date("%Y"),                   #  <2> 
         expand=(0, 0),
-        breaks=mb.breaks_date_width("1 years"),
+        breaks=mb.breaks_date_width("1 years"),       #  <2> 
     )
-    + labs(
+    + labs(                                               # <3>
         title="Sector Shifts: Where Canada's Jobs Are Moving",
         subtitle=textwrap.fill(
             "Track the number of industries gaining or losing jobs each month. Boxes are shaded based on percentage change from previous month in each industry's employment levels.",
@@ -366,25 +377,33 @@ plot = (
 plot
 ```
 
+1. Set the primary and secondary font families in `theme(...)`
+2. Use `mizani` to format labels to show only the year in `scale_x_datetime`
+3. Add `title`, `subtitle` and wrap long lines with the help of `textwrap`
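
As a small standalone illustration of the wrapping step, `textwrap.fill` inserts newlines so that no line exceeds a given width. The `width=60` used here is an assumed value for the sketch, not necessarily the one used in the plot.

```python
import textwrap

subtitle = (
    "Track the number of industries gaining or losing jobs each month. "
    "Boxes are shaded based on percentage change from previous month "
    "in each industry's employment levels."
)

# width=60 is an assumed value; tune it to the figure width
wrapped = textwrap.fill(subtitle, width=60)
wrapped_lines = wrapped.split("\n")
```

The wrapped string can be passed directly to `labs(subtitle=...)`, since the embedded newlines are rendered as line breaks.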
+
 ## Highlighting an Industry
 
-For more deeper insights, I would like to see where each individual ranks in the graphic.
+For more industry-specific insights, I would like to see where each individual industry ranks in the graphic.
 
 ```{python}
-labour_processed_cutted.select('Industry').unique().to_series().to_list()
-INDUSTRY = 'Wholesale and retail trade [41, 44-45]'
+INDUSTRY = 'Wholesale and retail trade [41, 44-45]'                       #  <1>
 
-plot_data_subsetted = labour_processed_cutted.filter(
-    pl.col("YEAR") >= FILTER_YEAR[0],
-    pl.col("YEAR") <= FILTER_YEAR[1],
-    pl.col('Industry') == INDUSTRY
+plot_data_subsetted = labour_processed_cutted.filter(                     #  <2>
+    pl.col("YEAR") >= FILTER_YEAR[0],                                     
+    pl.col("YEAR") <= FILTER_YEAR[1],                                     
+    pl.col('Industry') == INDUSTRY                                       
 )
+
 (
     plot
-    + geom_point(data=plot_data_subsetted, color='black', fill='black')
+    + geom_point(data=plot_data_subsetted, color='black', fill='black')   #  <3>
 )
 ```
 
+1. Specify industry
+2. Subset data
+3. Add the subsetted data to another `geom_point` layer
+
 # Line plot of unemployment
 
 # Appendix
@@ -393,6 +412,9 @@ plot_data_subsetted = labour_processed_cutted.filter(
 
 This section is a non-exhaustive list of design elements I wasn't able to solve with `plotnine`
 
+https://ggplot2.tidyverse.org/reference/geom_tile.html#aesthetics
+
+
 ### Horizontal legend with horizontal legend text
 
 Initially I wanted a horizontal legend for the colors. But in order to remove the whitespace between keys, I discovered that the text needs to be smaller than the legend keys, otherwise it "pushes" the legend keys apart in an uneven manner. I attempted (*unsuccessfully*) to address this by making the legend text small, eliminating as much text as possible (e.g. removing the "%" characters for `-0.50` and `0.50`), and lastly increasing the legend key size. 
diff --git a/_brand.yml b/_brand.yml
@@ -3,6 +3,7 @@ color:
     black: "#1A1A1A"
     white: "#F9F9F9"
     red: "#da161f"
+    grey: "#ececec"
   background: white
   foreground: black
   primary: red
@@ -18,6 +19,10 @@ typography:
   base: Playfair Display
   headings: Playfair Display
   monospace: Fira Code
+  monospace-inline:
+    family: Fira Code
+    color: red
+    background-color: grey
 
 defaults:
   bootstrap:
diff --git a/_quarto.yml b/_quarto.yml
@@ -10,10 +10,10 @@ website:
         text: "Canada Employment Tracker"
       - sidebar:how
     tools:
+      - icon: linkedin
+        href:  linkedin.com/in/victor2wy/
       - icon: github
-        menu:
-          - text: Source Code
-            href:  github.com/wvictor14/labourcan
+        href:  github.com/wvictor14/labourcan/
   sidebar:
     - id: how
       title: "How"
diff --git a/py/labourcan/data_processing.py b/py/labourcan/data_processing.py