Commit 67ec40c

Merge branch 'refine_the_tutorial'
2 parents edcf81f + 4d0cb72 commit 67ec40c

File tree

4 files changed

+117
-90
lines changed

02_develop_visualization.qmd

Lines changed: 108 additions & 86 deletions
@@ -9,10 +9,26 @@ editor:
99
render-on-save: true
1010
---
1111

12-
# Setup
12+
Current Canadian sentiment is at a low, with high cost-of-living, global political instability, and sweeping layoffs across multiple sectors. For the [2025 `plotnine` contest](https://posit.co/blog/announcing-the-2025-table-and-plotnine-contests/), I wanted to explore current official Canadian labour statistics using `plotnine`, a data visualization library in `python`.
13+
14+
# Introduction
15+
16+
I am so happy that `plotnine`, a relatively new Python data visualization package, exists. It is based on `ggplot2`, an R package that I have been using for almost a decade.
17+
18+
In this tutorial, I'll walk through the process of creating my `plotnine` 2025 contest submission. The plot shows employment across Canadian industries, ranked by their percent change in monthly employment. To help visualize data across different industries, industry-specific plots are laid out in a "pseudo" interactive manner.
19+
20+
# Setup
21+
22+
## Data
23+
24+
The data can be downloaded using this bash [script](https://github.com/wvictor14/labourcan/blob/main/data/downloadLabourData.sh), or directly from [StatCan's website](https://www150.statcan.gc.ca/t1/tbl1/en/tv.action?pid=1410035502).
1325

1426
## Parameters
1527

28+
In this initial code chunk we initialize some parameters so that, if needed later, we can rerun the entire notebook with different parameters (e.g. different years).
29+
30+
`pyprojroot` is similar to R's `here` package: it lets us construct file paths relative to the project root. This is especially convenient for Quarto projects with complex file organization.
31+
1632
```{python}
1733
from pyprojroot import here
1834
```
@@ -26,21 +42,28 @@ FILTER_YEAR = (2018, 2025)
2642
## Libraries
2743

2844
```{python}
45+
# Data manipulation
2946
import polars as pl
3047
import polars.selectors as cs
48+
49+
# Visualization
50+
from plotnine import *
51+
52+
# Mizani helps customize the text and breaks on axes
3153
from mizani.bounds import squish
3254
import mizani.labels as ml
3355
import mizani.breaks as mb
34-
import textwrap
35-
from pyprojroot import here
36-
from great_tables import GT, md, html
37-
from plotnine import *
38-
from labourcan.data_processing import read_labourcan,calculate_centered_rank
56+
import textwrap # for wrapping long lines of text
57+
58+
# Custom extract and transform functions for plot data
59+
from labourcan.data_processing import read_labourcan, calculate_centered_rank
3960
```
4061

41-
## Read data
62+
## Read and process data for graphing
4263

43-
[`read_labourcan`](../py/labourcan/data_processing.py) returns a polars with:
64+
The visualization required a fair amount of data processing which is detailed in this [page](01_develop_data_processing.html). The steps are summarized here:
65+
66+
[`read_labourcan`](../py/labourcan/data_processing.py) returns a `polars.DataFrame` with:
4467

4568
- Unused columns removed
4669
- Filtered to seasonally adjusted estimates only
@@ -55,93 +78,76 @@ labour = read_labourcan(LABOUR_DATA_FILE)
5578
labour_processed = calculate_centered_rank(labour)
5679
```
5780

58-
# Heatmap of Employment Numbers
81+
# A first attempt
82+
83+
The type of visual that's being developed here is something like a heatmap of employment numbers.
5984

60-
Let's take a stab at a first visual.
85+
We want a clean separation between industries that are growing and those that are shrinking. To get it, we rank industries by their monthly % change, but we center the rank around 0 so that sectors that are growing (% change > 0) have a positive rank and those that are shrinking have a negative rank.
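
To make the idea of a centered rank concrete, here is a minimal plain-Python sketch. It is only an illustration: the function name `centered_rank` and its handling of zeros are assumptions, and the real `calculate_centered_rank` in `labourcan.data_processing` is not shown in this diff and may work differently.

```python
def centered_rank(pct_changes):
    """Rank values so growing entries get 1, 2, ... and shrinking ones -1, -2, ..."""
    ranks = [0] * len(pct_changes)  # entries with exactly 0 change keep rank 0 (assumption)

    # Positive changes: the smallest gain gets rank 1, the largest gain the highest rank
    growing = sorted(
        (i for i, v in enumerate(pct_changes) if v > 0),
        key=lambda i: pct_changes[i],
    )
    for rank, i in enumerate(growing, start=1):
        ranks[i] = rank

    # Negative changes: the smallest loss gets -1, the largest loss the most negative rank
    shrinking = sorted(
        (i for i, v in enumerate(pct_changes) if v < 0),
        key=lambda i: pct_changes[i],
        reverse=True,
    )
    for rank, i in enumerate(shrinking, start=1):
        ranks[i] = -rank

    return ranks


# Hypothetical monthly % changes for five industries
example = [0.004, -0.002, 0.001, -0.010, 0.0]
print(centered_rank(example))  # [2, -1, 1, -2, 0]
```

With this scheme every growing industry sits above 0 on the y-axis and every shrinking one below it, which is what gives the plot its clean split.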
86+
87+
`scale_color_gradient2` is a great option because it lets us specify `midpoint=0`.
6188

6289
```{python}
6390
# | page-layout: column-page
6491
(
6592
ggplot(
6693
(
6794
labour_processed.filter(
68-
pl.col("YEAR") >= FILTER_YEAR[0],
69-
pl.col("YEAR") <= FILTER_YEAR[1]
95+
pl.col("YEAR") >= FILTER_YEAR[0], pl.col("YEAR") <= FILTER_YEAR[1]
7096
)
71-
), aes(x="DATE_YMD", y="centered_rank_across_industry", color="PDIFF"))
97+
),
98+
aes(x="DATE_YMD", y="centered_rank_across_industry", color="PDIFF"),
99+
)
72100
+ geom_point(shape="s")
73101
+ theme_tufte()
74102
+ theme(figure_size=FIGURE_THEME_SIZE, axis_text_x=element_text(angle=90))
75-
76103
+ scale_color_gradient2(
77-
limits=(-0.01, 0.01), low="#ff0000ff", high="#0000dbff", midpoint=0, oob=squish)
104+
limits=(-0.01, 0.01), low="#ff0000ff", high="#0000dbff", midpoint=0, oob=squish
105+
)
78106
)
79107
```
80108

81109
## `geom_point` or `geom_tile`
82110

83-
It looks good. but the whitespace between each point is distracting. I could make
111+
The whitespace between each point is distracting. I could make
84112
the point size larger, but the ratio of point size to range of the x and y axis, as well as
85-
the figure size all will determine ultimately how much whitespace remains between each point.
113+
the figure size all will ultimately determine how much whitespace remains between each point.
86114

87-
We can use `geom_tile` instead, which will plot rectangles specified by a center point.
115+
If we use `geom_tile` instead, which will plot rectangles specified by a center point, we can explicitly control the whitespace between tiles.
88116

89117
```{python}
90-
labour_processed_cat = labour_processed.drop_nulls(
91-
['centered_rank_across_industry'])
92-
order = (
93-
labour_processed_cat.select('centered_rank_across_industry').unique().sort(
94-
'centered_rank_across_industry')
95-
.to_series()
96-
.cast(pl.Utf8)
97-
.to_list()
98-
)
99-
100-
labour_processed_cat = (
101-
labour_processed_cat.with_columns(
102-
pl.col('centered_rank_across_industry').cast(
103-
pl.Utf8).cast(pl.Enum(categories=order)).alias('centered_rank_cat')
104-
)
105-
)
106-
107118
(
108-
ggplot((
109-
labour_processed_cat.filter(
110-
pl.col("YEAR") >= FILTER_YEAR[0],
111-
pl.col("YEAR") <= FILTER_YEAR[1]
112-
)
113-
), aes(x="DATE_YMD", y="centered_rank_across_industry", fill="PDIFF"))
114-
+ geom_tile(height=0.95) # whitespace between tiles, vertically
115-
+ theme_tufte()
116-
+ theme(
117-
figure_size=FIGURE_THEME_SIZE,
118-
axis_text_x=element_text(angle=90)
119+
ggplot(
120+
(
121+
labour_processed.filter(
122+
pl.col("YEAR") >= FILTER_YEAR[0], pl.col("YEAR") <= FILTER_YEAR[1]
123+
)
124+
),
125+
aes(x="DATE_YMD", y="centered_rank_across_industry", fill="PDIFF"),
119126
)
127+
+ geom_tile(height=0.95, width=30 * 0.95) # <1>
128+
+ theme_tufte()
129+
+ theme(figure_size=FIGURE_THEME_SIZE, axis_text_x=element_text(angle=90))
120130
+ scale_fill_gradient2(
121-
limits=(-0.01, 0.01), low="#ff0000ff", high="#0000dbff", midpoint=0, oob=squish)
122-
131+
limits=(-0.01, 0.01), low="#ff0000ff", high="#0000dbff", midpoint=0, oob=squish
132+
)
123133
)
124134
```
125135

126-
This is looking pretty good. I added `height = 0.95` to add some whitespace between tiles vertically.
127-
I actually wanted to remove whitespace completely, but I discovered `width` for `geom_tile` doesn't
128-
work the same as it does for `ggplot2`. If I set `width=1` it seems to make the tiles smaller, instead of wider.
129-
136+
1. I added `height = 0.95` to add some whitespace between tiles vertically. To close up the horizontal whitespace, we need to specify a `width`. Because the x-axis is a `datetime` axis, the width must be given in units of days. Each tile here covers one month, which we approximate as 30 days, hence `width = 30 * 0.95`.
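
The width arithmetic can be checked with a small standard-library sketch. The helper `month_width_days` is hypothetical and not part of the tutorial; it only shows how far the 30-day approximation is from the exact length of a given month.

```python
import calendar

GAP_FRACTION = 0.95  # same shrink factor used for `height` in the plot

# Approximation used in the tutorial: treat every month as 30 days
approx_width = 30 * GAP_FRACTION


def month_width_days(year, month, fraction=GAP_FRACTION):
    """Width in days for a tile covering one specific calendar month."""
    days_in_month = calendar.monthrange(year, month)[1]
    return days_in_month * fraction


print(approx_width)               # width passed to geom_tile in the tutorial
print(month_width_days(2024, 2))  # a 29-day February
print(month_width_days(2024, 1))  # a 31-day January
```

A single constant width means tiles in 31-day months leave a slightly larger gap and tiles in February a slightly smaller one, which is hard to notice at this figure size.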
130137

131138
## Explicit color mapping with `scale_fill_manual`
132139

133140
I am fairly happy with the `scale_fill_gradient2` used with `squish`. We get a really nice palette
134-
that's centered around 0. However `scale_fill_gradient2` is limited to 3 colors (high, midpoint, low),
141+
that's centered around 0. However `scale_fill_gradient2` is limited to 3 colors (`high`, `midpoint`, `low`),
135142
which does not quite enable the more dynamic color palette that I'm seeking.
136143

137-
To be more explicit with the colors, I will bin the `PDIFF` and map colors manually
138-
using `scale_fill_manual`
144+
To be more explicit with the colors, I will bin the % change variable and then map each bin to a color manually using `scale_fill_manual`.
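
As a rough illustration of what binning does, here is a plain-Python sketch using `bisect`. The break points in `BREAKS` are hypothetical, since the tutorial's actual values are cut off in this diff. The sketch uses right-closed intervals, which to my knowledge is also the default behaviour of `polars.Series.cut`.

```python
from bisect import bisect_left

# Hypothetical break points (as fractions, so 0.005 = 0.5%); not the tutorial's values
BREAKS = [-0.01, -0.005, 0.0, 0.005, 0.01]


def bin_index(value, breaks=BREAKS):
    """Return the index of the right-closed bin (lower, upper] containing value."""
    return bisect_left(breaks, value)


def bin_label(value, breaks=BREAKS):
    """Build an interval label in the style "(lower, upper]"."""
    i = bin_index(value, breaks)
    lower = breaks[i - 1] if i > 0 else float("-inf")
    upper = breaks[i] if i < len(breaks) else float("inf")
    return f"({lower}, {upper}]"


for pdiff in [-0.02, -0.005, 0.0, 0.003, 0.5]:
    print(pdiff, bin_index(pdiff), bin_label(pdiff))
```

Five break points produce six bins, and each bin then gets exactly one color in the manual scale.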
139145

140146
### Bin with `polars.Series.cut`
141147

142148
```{python}
143149
labour_processed_cutted = (
144-
labour_processed_cat.with_columns(
150+
labour_processed.with_columns(
145151
pl.col("PDIFF")
146152
.cut(
147153
[
@@ -180,7 +186,7 @@ labour_processed_cutted.group_by("PDIFF_BINNED").len()
180186
pl.col("YEAR") >= FILTER_YEAR[0], pl.col("YEAR") <= FILTER_YEAR[1]
181187
)
182188
),
183-
aes(x="DATE_YMD", y="centered_rank_cat", fill="PDIFF_BINNED"),
189+
aes(x="DATE_YMD", y="centered_rank_across_industry", fill="PDIFF_BINNED"),
184190
)
185191
+ geom_tile(height=0.95) # whitespace between tiles, vertically
186192
+ theme_tufte()
@@ -190,9 +196,9 @@ labour_processed_cutted.group_by("PDIFF_BINNED").len()
190196

191197
### `scale_fill_manual` for explicit color mapping
192198

193-
Now we need to order the levels, and map explicit colors
199+
Now we need to order the levels, and map to a specific color palette.
194200

195-
We will make PDIFF=0% to be gray, positive values to have a green and blue colors (job growth = good), and negative values to have warmer (alarming, bad) colors.
201+
We will make `PDIFF=0%` (no change) gray, give positive values `green` and `blue` colors (*growth* = *good*), and give negative values `red` and `orange` colors (*contraction* = *bad*).
196202

197203
```{python}
198204
order = (
@@ -231,23 +237,24 @@ color_mapping = {
231237
pl.col("YEAR") >= FILTER_YEAR[0], pl.col("YEAR") <= FILTER_YEAR[1]
232238
)
233239
),
234-
aes(x="DATE_YMD", y="centered_rank_across_industry", fill="PDIFF_BINNED"),
240+
aes(x="DATE_YMD", y="centered_rank_across_industry", fill="PDIFF_BINNED"), # <1>
235241
)
236242
+ geom_tile(color="white")
237-
# + geom_point(shape="s")
238243
+ theme_tufte()
239244
+ theme(figure_size=FIGURE_THEME_SIZE, axis_text_x=element_text(angle=90))
240-
+ scale_fill_manual(values=color_mapping, breaks=order)
245+
+ scale_fill_manual(values=color_mapping, breaks=order) # <2>
241246
)
242247
```
243248

244-
That looks great. The power of `scale_fill_manual` enables much more control over
245-
the color palette. However, the cost was that it takes a lot more effort and lines of code
246-
to create a custom mapping.
249+
1. map `fill` to `PDIFF_BINNED`
250+
2. provide explicit color mapping to `scale_fill_manual`
251+
252+
The power of `scale_fill_manual` is that it enables much more explicit control over
253+
how color is mapped to data. However, the cost is that it takes a lot more effort and lines of code compared to `scale_fill_gradient2`, which works well out of the box.
247254

248255
## The legend
249256

250-
...is extremely accurate, however we are going to simplify it and nicer to look at.
257+
...is mathematically accurate, however we are going to make it nicer to look at.
251258

252259
First let's make the text more concise: we don't need every bin to be labelled, and instead of listing the range, we can just describe the midpoint.
253260

@@ -289,17 +296,19 @@ legend_labels = [
289296
legend_key_height=10,
290297
legend_text=element_text(size=8),
291298
)
292-
+ scale_fill_manual(values=color_mapping, breaks=order, labels=legend_labels)
299+
+ scale_fill_manual(values=color_mapping, breaks=order, labels=legend_labels) # <1>
293300
)
294301
```
295302

296-
Looks much better than my first attempt with a [horizontal legend](#horizontal-legend-with-horizontal-legend-text)
303+
1. provide the list `legend_labels` to `scale_fill_manual`
304+
305+
I originally wanted to make a [horizontal legend](#horizontal-legend-with-horizontal-legend-text), but this works much better.
297306

298307
## Text and fonts
299308

300-
Next up is the text and fonts. I played with a few fonts on [google fonts](https://fonts.google.com/) before settling on two.
309+
Next up is the text and fonts. I played with a few fonts on [Google Fonts](https://fonts.google.com/) before settling on two. Note that this website uses these fonts with the help of [brand.yml](_brand.yml).
301310

302-
First, install the fonts:
311+
Install the fonts:
303312

304313
```{python}
305314
FONT_PRIMARY = "Playfair Display"
@@ -309,7 +318,9 @@ fk.install(FONT_PRIMARY)
309318
fk.install(FONT_SECONDARY)
310319
```
311320

312-
plotnine breaks and labels for the scales can be easily adjusted using `mizani`, which is like the `scales` equivalent to `ggplot2`
321+
### `mizani` for axis breaks and labels
322+
323+
Breaks and labels for `plotnine` scales can be easily adjusted using [`mizani`](https://mizani.readthedocs.io/en/stable/), which plays the role for `plotnine` that [`scales`](https://scales.r-lib.org/) plays for `ggplot2`.
313324

314325
We're going to use `mizani.breaks.breaks_date_width` to put breaks for each year, and `mizani.labels.label_date` to drop the "month" part of the date.
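
To show the intended effect without `mizani`, here is a small standard-library sketch of yearly breaks with year-only labels. The helpers `yearly_breaks` and `year_labels` are hypothetical stand-ins; the exact positions `mizani` picks for its breaks may differ.

```python
from datetime import date


def yearly_breaks(start_year, end_year):
    """One break per year, placed on January 1st."""
    return [date(year, 1, 1) for year in range(start_year, end_year + 1)]


def year_labels(breaks):
    """Format each break with "%Y" so only the year is shown."""
    return [d.strftime("%Y") for d in breaks]


breaks = yearly_breaks(2018, 2025)
print(year_labels(breaks))
```

The `"%Y"` format string is the same one passed to `ml.label_date` in the plot code, so the labels drop the month and day in the same way.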
315326

@@ -327,10 +338,10 @@ plot = (
327338
+ geom_tile(color="white", height=0.95)
328339
+ theme_tufte()
329340
+ theme(
330-
text=element_text(family=FONT_PRIMARY),
341+
text=element_text(family=FONT_PRIMARY), # <1>
331342
figure_size=FIGURE_THEME_SIZE,
332-
axis_text_y=element_text(family=FONT_SECONDARY),
333-
axis_text_x=element_text(family=FONT_SECONDARY),
343+
axis_text_y=element_text(family=FONT_SECONDARY), # <1>
344+
axis_text_x=element_text(family=FONT_SECONDARY), # <1>
334345
axis_title_y=element_text(weight=300),
335346
legend_justification_right=1,
336347
legend_position="right",
@@ -339,7 +350,7 @@ plot = (
339350
legend_key_spacing=0,
340351
legend_key_width=15,
341352
legend_key_height=15,
342-
legend_text=element_text(size=8, family=FONT_SECONDARY),
353+
legend_text=element_text(size=8, family=FONT_SECONDARY), # <1>
343354
legend_title=element_blank(),
344355
plot_title=element_text(ha="left"),
345356
plot_subtitle=element_text(
@@ -349,11 +360,11 @@ plot = (
349360
breaks=order, labels=legend_labels)
350361
+ guides(fill=guide_legend(ncol=1, reverse=True))
351362
+ scale_x_datetime(
352-
labels=ml.label_date("%Y"), # Format labels to show only the year
363+
labels=ml.label_date("%Y"), # <2>
353364
expand=(0, 0),
354-
breaks=mb.breaks_date_width("1 years"),
365+
breaks=mb.breaks_date_width("1 years"), # <2>
355366
)
356-
+ labs(
367+
+ labs( # <3>
357368
title="Sector Shifts: Where Canada's Jobs Are Moving",
358369
subtitle=textwrap.fill(
359370
"Track the number of industries gaining or losing jobs each month. Boxes are shaded based on percentage change from previous month in each industry's employment levels.",
@@ -366,25 +377,33 @@ plot = (
366377
plot
367378
```
368379

380+
1. Apply font family changes to the primary font in `theme(...)`
381+
2. Use `mizani` to format labels to show only the year in `scale_x_datetime`
382+
3. Add `title`, `subtitle` and wrap long lines with the help of `textwrap`
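
As a quick illustration of the wrapping step, `textwrap.fill` can be tried on the subtitle by itself. The `width=80` value is an assumption, since the diff does not show the width used in the tutorial.

```python
import textwrap

subtitle = (
    "Track the number of industries gaining or losing jobs each month. "
    "Boxes are shaded based on percentage change from previous month "
    "in each industry's employment levels."
)

# width=80 is an assumed value; the diff does not show the width the author used
wrapped = textwrap.fill(subtitle, width=80)
print(wrapped)
```

`textwrap.fill` inserts newline characters at word boundaries, so the subtitle renders as several short lines instead of running past the edge of the figure.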
383+
369384
## Highlighting an Industry
370385

371-
For more deeper insights, I would like to see where each individual ranks in the graphic.
386+
For more industry-specific insights, I would like to see where each individual industry ranks in the graphic.
372387

373388
```{python}
374-
labour_processed_cutted.select('Industry').unique().to_series().to_list()
375-
INDUSTRY = 'Wholesale and retail trade [41, 44-45]'
389+
INDUSTRY = 'Wholesale and retail trade [41, 44-45]' # <1>
376390
377-
plot_data_subsetted = labour_processed_cutted.filter(
378-
pl.col("YEAR") >= FILTER_YEAR[0],
379-
pl.col("YEAR") <= FILTER_YEAR[1],
380-
pl.col('Industry') == INDUSTRY
391+
plot_data_subsetted = labour_processed_cutted.filter( # <2>
392+
pl.col("YEAR") >= FILTER_YEAR[0],
393+
pl.col("YEAR") <= FILTER_YEAR[1],
394+
pl.col('Industry') == INDUSTRY
381395
)
396+
382397
(
383398
plot
384-
+ geom_point(data=plot_data_subsetted, color='black', fill='black')
399+
+ geom_point(data=plot_data_subsetted, color='black', fill='black') # <3>
385400
)
386401
```
387402

403+
1. Specify industry
404+
2. Subset data
405+
3. Add the subsetted data to another `geom_point` layer
406+
388407
# Line plot of unemployment
389408

390409
# Appendix
@@ -393,6 +412,9 @@ plot_data_subsetted = labour_processed_cutted.filter(
393412

394413
This section is a non-exhaustive list of design elements I wasn't able to solve with `plotnine`
395414

415+
https://ggplot2.tidyverse.org/reference/geom_tile.html#aesthetics
416+
417+
396418
### Horizontal legend with horizontal legend text
397419

398420
Initially I wanted a horizontal legend for the colors. But in order to remove the whitespace between keys, I discovered that the text needs to be smaller than the legend keys, otherwise it "pushes" the legend keys apart in an uneven manner. I attempted (*unsuccessfully*) to address this by making the legend text small, eliminating as much text as possible (e.g. removing the "%" characters for `-0.50` and `0.50`), and lastly increasing the legend key size.

_brand.yml

Lines changed: 5 additions & 0 deletions
@@ -3,6 +3,7 @@ color:
33
black: "#1A1A1A"
44
white: "#F9F9F9"
55
red: "#da161f"
6+
grey: "#ececec"
67
background: white
78
foreground: black
89
primary: red
@@ -18,6 +19,10 @@ typography:
1819
base: Playfair Display
1920
headings: Playfair Display
2021
monospace: Fira Code
22+
monospace-inline:
23+
family: Fira Code
24+
color: red
25+
background-color: grey
2126

2227
defaults:
2328
bootstrap:

_quarto.yml

Lines changed: 3 additions & 3 deletions
@@ -10,10 +10,10 @@ website:
1010
text: "Canada Employment Tracker"
1111
- sidebar:how
1212
tools:
13+
- icon: linkedin
14+
href: linkedin.com/in/victor2wy/
1315
- icon: github
14-
menu:
15-
- text: Source Code
16-
href: github.com/wvictor14/labourcan
16+
href: github.com/wvictor14/labourcan/
1717
sidebar:
1818
- id: how
1919
title: "How"
