| 1 | +--- |
| 2 | +title: Checking Missingness in a Table |
| 3 | +jupyter: python3 |
| 4 | +html-table-processing: none |
| 5 | +--- |
| 6 | + |
| 7 | +```{python} |
| 8 | +#| echo: false |
| 9 | +#| output: false |
| 10 | +import pointblank as pb |
| 11 | +``` |
| 12 | + |
| 13 | +Sometimes values just aren't there: they're missing. This can either be expected or another thing to |
| 14 | +worry about. Either way, we can dig a little deeper if need be and use the |
| 15 | +[`missing_vals_tbl()`](https://posit-dev.github.io/pointblank/reference/missing_vals_tbl.html) |
| 16 | +function to generate a summary table that can elucidate how many values are missing, and roughly |
| 17 | +where. |
| 18 | + |
| 19 | +## Using and Understanding `missing_vals_tbl()` |
| 20 | + |
| 21 | +The missing values table is arranged a lot like the column summary table (generated via the |
| 22 | +[`col_summary_tbl()`](https://posit-dev.github.io/pointblank/reference/col_summary_tbl.html) |
| 23 | +function) in that columns of the input table are arranged as rows in the reporting table. Let's use |
| 24 | +[`missing_vals_tbl()`](https://posit-dev.github.io/pointblank/reference/missing_vals_tbl.html) on |
| 25 | +the `nycflights` dataset, which has a lot of missing values: |
| 26 | + |
| 27 | +```{python} |
| 28 | +import pointblank as pb |
| 29 | + |
| 30 | +nycflights = pb.load_dataset(dataset="nycflights", tbl_type="polars") |
| 31 | +pb.missing_vals_tbl(nycflights) |
| 32 | +``` |
| 33 | + |
| 34 | +There are 18 columns in `nycflights` and they're arranged down the missing values table as rows. To |
| 35 | +the right are 10 columns, each of which represents a row sector. Row sectors are groups |
| 36 | +of rows and each sector contains a tenth of the total rows in the table. The leftmost sectors are |
| 37 | +the rows at the top of the table whereas the sectors on the right are closer to the bottom. If you'd |
| 38 | +like to know which rows make up each row sector, there are details on this in the table footer area |
| 39 | +(click the `ROW SECTORS` text or the disclosure triangle). |
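
To make the idea of row sectors concrete, here is a minimal sketch in plain Python that splits a column into 10 contiguous sectors and computes the proportion of missing values in each. The helper name `sector_missing_proportions` and the rounding rule for sector boundaries are assumptions made for illustration; the boundaries that `missing_vals_tbl()` actually uses are the ones listed in the table footer.

```python
def sector_missing_proportions(values, n_sectors=10):
    """Proportion of missing (None) values in each contiguous row sector.

    Hypothetical helper for illustration only; not part of pointblank.
    """
    n_rows = len(values)
    proportions = []
    for i in range(n_sectors):
        # Assumed boundary rule: integer division spreads rows as evenly as possible
        start = i * n_rows // n_sectors
        end = (i + 1) * n_rows // n_sectors
        sector = values[start:end]
        if not sector:
            # A sector can be empty when there are fewer rows than sectors
            proportions.append(None)
            continue
        n_missing = sum(1 for v in sector if v is None)
        proportions.append(n_missing / len(sector))
    return proportions


# Toy column with 20 rows: missing values only in the first and last sectors
column = [None, 1] + list(range(16)) + [None, None]
print(sector_missing_proportions(column))
```

Each proportion plays the role of one cell in a row of the report: `0.0` would correspond to a light blue cell, and larger values to progressively darker shades.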
| 40 | + |
| 41 | +Now that we know about row sectors, we need to understand the visuals here. A light blue cell |
| 42 | +indicates there are no (`0`) missing values within a given row sector of a column. For `nycflights` |
| 43 | +we can see that several columns have no missing values at all (i.e., the light blue color makes up |
| 44 | +the entire row in the missing values table). |
| 45 | + |
| 46 | +When there are missing values in a column's row sector, you'll be met with a grayscale color. The |
| 47 | +proportion of missing values corresponds to the color ramp from light gray to solid black. |
| 48 | +Interestingly, most of the columns that have missing values appear to be related to each other in |
| 49 | +terms of the extent of missing values (i.e., the appearance in the reporting table looks roughly the |
| 50 | +same, indicating a sort of systematic missingness). These columns are `dep_time`, `dep_delay`, |
| 51 | +`arr_time`, `arr_delay`, and `air_time`. |
| 52 | + |
| 53 | +The odd column out with regard to the distribution of missing values is `tailnum`. By scanning the |
| 54 | +row and observing that the grayscale color values are all a little different we see that the degree |
| 55 | +of missingness is more variable and not related to the other columns containing missing values. |
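
One way to check this impression outside of the report is to compare which rows are missing in each column. The sketch below does that in plain Python on a small made-up table. The data and the helper name `missing_rows` are hypothetical and not taken from `nycflights`. Columns whose sets of missing row positions coincide are probably missing for the same reason, while disjoint sets point to unrelated missingness.

```python
def missing_rows(values):
    """Return the set of row positions where a value is missing (None)."""
    return {i for i, v in enumerate(values) if v is None}


# Made-up data for illustration; not the real `nycflights` values
table = {
    "dep_time": [517, None, 542, None, 554],
    "dep_delay": [2, None, 2, None, -6],
    "tailnum": ["N14228", "N24211", None, "N804JB", "N668DN"],
}

missing = {name: missing_rows(col) for name, col in table.items()}

# Identical sets suggest the two columns go missing together
print(missing["dep_time"] == missing["dep_delay"])
# An empty intersection suggests unrelated missingness
print(missing["dep_time"] & missing["tailnum"])
```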
| 56 | + |
| 57 | +## Missing Value Tables from the Other Datasets |
| 58 | + |
| 59 | +The `small_table` dataset has only 13 rows. Let's use that as a Pandas DataFrame with |
| 60 | +[`missing_vals_tbl()`](https://posit-dev.github.io/pointblank/reference/missing_vals_tbl.html): |
| 61 | + |
| 62 | +```{python} |
| 63 | +import pointblank as pb |
| 64 | + |
| 65 | +small_table = pb.load_dataset(dataset="small_table", tbl_type="pandas") |
| 66 | +pb.missing_vals_tbl(small_table) |
| 67 | +``` |
| 68 | + |
| 69 | +It appears that only column `c` has missing values. And since the table is very small in terms of |
| 70 | +row count, most of the row sectors contain only a single row. |
| 71 | + |
| 72 | +The `game_revenue` dataset has *no* missing values. This is easy to confirm by using |
| 73 | +[`missing_vals_tbl()`](https://posit-dev.github.io/pointblank/reference/missing_vals_tbl.html) with |
| 74 | +it: |
| 75 | + |
| 76 | +```{python} |
| 77 | +import pointblank as pb |
| 78 | + |
| 79 | +game_revenue = pb.load_dataset(dataset="game_revenue", tbl_type="duckdb") |
| 80 | +pb.missing_vals_tbl(game_revenue) |
| 81 | +``` |
| 82 | + |
| 83 | +We see nothing but light blue in this report! The header also indicates that there are no missing |
| 84 | +values by displaying a large green check mark (the other report tables provided a count of total |
| 85 | +missing values across all columns). |