Commit 3c042cc

Add the Checking Missingness in a Table article
1 parent 47224d8 commit 3c042cc

2 files changed: 86 additions and 0 deletions

docs/_quarto.yml

Lines changed: 1 addition & 0 deletions
@@ -69,6 +69,7 @@ website:
 contents:
 - user-guide/preview.qmd
 - user-guide/col-summary-tbl.qmd
+- user-guide/missing-vals-tbl.qmd

 html-table-processing: none

Lines changed: 85 additions & 0 deletions
@@ -0,0 +1,85 @@
---
title: Checking Missingness in a Table
jupyter: python3
html-table-processing: none
---

```{python}
#| echo: false
#| output: false
import pointblank as pb
```

Sometimes values just aren't there: they're missing. This can either be expected or another thing to
worry about. Either way, we can dig a little deeper if need be and use the
[`missing_vals_tbl()`](https://posit-dev.github.io/pointblank/reference/missing_vals_tbl.html)
function to generate a summary table that can elucidate how many values are missing, and roughly
where.

## Using and Understanding `missing_vals_tbl()`

The missing values table is arranged a lot like the column summary table (generated via the
[`col_summary_tbl()`](https://posit-dev.github.io/pointblank/reference/col_summary_tbl.html)
function) in that columns of the input table are arranged as rows in the reporting table. Let's use
[`missing_vals_tbl()`](https://posit-dev.github.io/pointblank/reference/missing_vals_tbl.html) on
the `nycflights` dataset, which has a lot of missing values:

```{python}
import pointblank as pb

nycflights = pb.load_dataset(dataset="nycflights", tbl_type="polars")
pb.missing_vals_tbl(nycflights)
```

There are 18 columns in `nycflights` and they're arranged down the missing values table as rows. To
the right are 10 columns, each of which represents a row sector. Row sectors are groups of rows and
each sector contains a tenth of the total rows in the table. The leftmost sectors are the rows at
the top of the table whereas the sectors on the right are closer to the bottom. If you'd like to
know which rows make up each row sector, there are details on this in the table footer area (click
the `ROW SECTORS` text or the disclosure triangle).
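To make the idea of row sectors a bit more concrete, here is a minimal sketch in plain Python (it doesn't use pointblank at all). The helper names `sector_bounds()` and `missing_fraction_by_sector()` are hypothetical, the column data is made up, and the splitting rule (each sector gets `n_rows // 10` rows, with the last sector absorbing any remainder) is an assumption for illustration rather than a description of how `missing_vals_tbl()` works internally.

```python
def sector_bounds(n_rows, n_sectors=10):
    """Return (start, end) row positions (0-based, end exclusive) for each sector.

    Assumption: every sector gets n_rows // n_sectors rows and the last
    sector absorbs the remainder. The real report may split rows differently.
    """
    step = n_rows // n_sectors
    cuts = [step * i for i in range(n_sectors)] + [n_rows]
    return [(cuts[i], cuts[i + 1]) for i in range(n_sectors)]


def missing_fraction_by_sector(values, n_sectors=10):
    """Fraction of None values in each row sector of a single column."""
    fractions = []
    for start, end in sector_bounds(len(values), n_sectors):
        chunk = values[start:end]
        n_missing = sum(v is None for v in chunk)
        # An empty sector (fewer rows than sectors) is reported as 0.0
        fractions.append(n_missing / len(chunk) if chunk else 0.0)
    return fractions


# A made-up column of 20 values with gaps near the top and the bottom
column = [None, 5, 3, 8, 1, 9, 2, 7, 4, 6, 3, 5, 8, 2, 1, 9, 7, None, None, None]

print(missing_fraction_by_sector(column))
```

Each of the ten numbers printed corresponds to one cell in that column's row of the report: `0.0` means no values are missing in the sector and `1.0` means every value is missing.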

Now that we know about row sectors, we need to understand the visuals here. A light blue cell
indicates there are no (`0`) missing values within a given row sector of a column. For `nycflights`
we can see that several columns have no missing values at all (i.e., the light blue color makes up
the entire row in the missing values table).

When there are missing values in a column's row sector, you'll be met with a grayscale color. The
proportion of missing values corresponds to the color ramp from light gray to solid black.
Interestingly, most of the columns that have missing values appear to be related to each other in
terms of the extent of missing values (i.e., the appearance in the reporting table looks roughly the
same, indicating a sort of systematic missingness). These columns are `dep_time`, `dep_delay`,
`arr_time`, `arr_delay`, and `air_time`.
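One rough way to back up this kind of visual impression is to compare the per-sector missing fractions of each pair of columns. The sketch below is plain Python and only illustrates the idea: the helpers `similar_profiles()` and `related_pairs()`, the tolerance, the column names, and all of the numbers are made up and aren't taken from `nycflights`.

```python
from itertools import combinations


def similar_profiles(profile_a, profile_b, tolerance=0.02):
    """True when two per-sector missing fractions stay within `tolerance` of each other."""
    if len(profile_a) != len(profile_b):
        return False
    return all(abs(a - b) <= tolerance for a, b in zip(profile_a, profile_b))


def related_pairs(profiles, tolerance=0.02):
    """Return the pairs of column names whose missingness profiles look alike."""
    return [
        (name_a, name_b)
        for name_a, name_b in combinations(profiles, 2)
        if similar_profiles(profiles[name_a], profiles[name_b], tolerance)
    ]


# Hypothetical per-sector missing fractions (10 sectors per column)
profiles = {
    "col_a": [0.02, 0.03, 0.05, 0.01, 0.02, 0.02, 0.01, 0.01, 0.02, 0.04],
    "col_b": [0.02, 0.03, 0.06, 0.01, 0.02, 0.03, 0.01, 0.01, 0.02, 0.05],
    "col_c": [0.01, 0.00, 0.01, 0.02, 0.00, 0.01, 0.03, 0.00, 0.01, 0.00],
}

print(related_pairs(profiles))
```

Here only `col_a` and `col_b` are reported as a pair, since their fractions rise and fall together across the sectors while `col_c` follows its own pattern.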

The odd column out with regard to the distribution of missing values is `tailnum`. By scanning the
row and observing that the grayscale color values are all a little different, we see that the degree
of missingness is more variable and not related to the other columns containing missing values.

## Missing Value Tables from the Other Datasets

The `small_table` dataset has only 13 rows to it. Let's use that as a Pandas DataFrame with
[`missing_vals_tbl()`](https://posit-dev.github.io/pointblank/reference/missing_vals_tbl.html):

```{python}
import pointblank as pb

small_table = pb.load_dataset(dataset="small_table", tbl_type="pandas")
pb.missing_vals_tbl(small_table)
```

It appears that only column `c` has missing values. And since the table is very small in terms of
row count, most of the row sectors contain only a single row.
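The arithmetic behind that observation can be sketched in a few lines. This assumes, purely for illustration, that each of the first nine sectors gets `n_rows // 10` rows and the tenth sector takes whatever is left. The `sector_sizes()` helper is hypothetical and not part of pointblank; the row ranges the report actually used are listed in its footer.

```python
def sector_sizes(n_rows, n_sectors=10):
    """Number of rows in each sector under the assumed splitting rule.

    Assumption: the first n_sectors - 1 sectors each get n_rows // n_sectors
    rows and the final sector takes the remainder.
    """
    base = n_rows // n_sectors
    sizes = [base] * (n_sectors - 1)
    sizes.append(n_rows - base * (n_sectors - 1))
    return sizes


print(sector_sizes(13))
```

With 13 rows this gives nine sectors of one row each and a final sector holding the remaining four rows.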

The `game_revenue` dataset has *no* missing values. We can easily confirm this by using
[`missing_vals_tbl()`](https://posit-dev.github.io/pointblank/reference/missing_vals_tbl.html) with
it:

```{python}
import pointblank as pb

game_revenue = pb.load_dataset(dataset="game_revenue", tbl_type="duckdb")
pb.missing_vals_tbl(game_revenue)
```

We see nothing but light blue in this report! The header also indicates that there are no missing
values by displaying a large green check mark (the other report tables provided a count of total
missing values across all columns).
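The logic behind that header can be approximated as follows. This is a sketch in plain Python, not pointblank's own code: `total_missing()` and `header_summary()` are hypothetical helpers, the two small tables are invented, and `None` stands in for a missing value.

```python
def total_missing(columns):
    """Count None values across every column of a dict-of-lists table."""
    return sum(value is None for values in columns.values() for value in values)


def header_summary(columns):
    """Describe the table's overall missingness in one short string."""
    n_missing = total_missing(columns)
    if n_missing == 0:
        return "no missing values"
    return f"total missing values: {n_missing}"


# Two invented tables: one complete, one with gaps
complete = {"a": [1, 2, 3], "b": ["x", "y", "z"]}
with_gaps = {"a": [1, None, 3], "b": [None, "y", None]}

print(header_summary(complete))
print(header_summary(with_gaps))
```

The first call corresponds to the check-mark case and the second to a report that shows a total count instead.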
