| 1 | +--- |
| 2 | +title: Previewing Data |
| 3 | +jupyter: python3 |
| 4 | +html-table-processing: none |
| 5 | +--- |
| 6 | + |
| 7 | +```{python} |
| 8 | +#| echo: false |
| 9 | +#| output: false |
| 10 | +import pointblank as pb |
| 11 | +pb.config(report_incl_header=False, report_incl_footer=False) |
| 12 | +``` |
| 13 | + |
| 14 | +In many cases, it's *good* to look at your data tables. Before diving into the creation of
| 15 | +data-quality rules, you'll likely want to inspect a portion of the table you plan to validate. This
| 16 | +is easy to do with Polars and Pandas DataFrames. It's not as easy with database tables, however, and
| 17 | +each table backend displays things differently.
| 18 | + |
| 19 | +To make this common task easier, you can use the
| 20 | +[`preview()`](https://posit-dev.github.io/pointblank/reference/preview.html) function in Pointblank. |
| 21 | +It has been designed to work with every table that the package supports (i.e., DataFrames and |
| 22 | +Ibis-backend tables, the latter of which are largely database tables). Plus, what's shown in the |
| 23 | +output is consistent, no matter what type of data you're looking at. |
| 24 | + |
| 25 | +## Viewing a Table with `preview()` |
| 26 | + |
| 27 | +Let's look at how `preview()` works. It requires only a table and, for this first example, let's use the
| 28 | +`nycflights` dataset: |
| 29 | + |
| 30 | +```{python} |
| 31 | +nycflights = pb.load_dataset(dataset="nycflights", tbl_type="polars") |
| 32 | +pb.preview(nycflights) |
| 33 | +``` |
| 34 | + |
| 35 | +This is an HTML table using the style of the other reporting tables in the library. The header is |
| 36 | +more minimal here, only showing the type of table we're looking at (`POLARS` in this case) along |
| 37 | +with the table dimensions. The column headers provide both the column names and the column data |
| 38 | +types. |
| 39 | + |
| 40 | +By default, we're getting the first five rows and the last five rows. Row numbers (from the original |
| 41 | +dataset) provide an indication of which rows are the head and tail rows. The blue lines provide |
| 42 | +additional demarcation of the column containing the row numbers and the head and tail row groups. |
| 43 | +Finally, any cells with missing values are prominently styled with red lettering and a lighter red |
| 44 | +background. |
| 45 | + |
| 46 | +If you'd rather not see the row numbers in the table, you can use the `show_row_numbers=False` |
| 47 | +option. Let's try that with the `game_revenue` dataset as a DuckDB table: |
| 48 | + |
| 49 | +```{python} |
| 50 | +game_revenue = pb.load_dataset(dataset="game_revenue", tbl_type="duckdb") |
| 51 | +pb.preview(game_revenue, show_row_numbers=False) |
| 52 | +``` |
| 53 | + |
| 54 | +With the above preview, the row numbers are gone. The horizontal blue line still serves to divide |
| 55 | +the top and bottom rows of the table, however. |
| 56 | + |
| 57 | +## Adjusting the Number of Rows Shown |
| 58 | + |
| 59 | +The default of five top and five bottom rows might not suit every table. This can be changed with
| 60 | +the `n_head=` and `n_tail=` arguments. Maybe you want three rows from the top along with the last
| 61 | +row? Let's try that out with the `small_table` dataset as a Pandas DataFrame:
| 62 | + |
| 63 | +```{python} |
| 64 | +small_table = pb.load_dataset(dataset="small_table", tbl_type="pandas") |
| 65 | +pb.preview(small_table, n_head=3, n_tail=1) |
| 66 | +``` |
| 67 | + |
| 68 | +If you're looking at a small table and want to see the entirety of it, you can enlarge the `n_head=` |
| 69 | +and `n_tail=` values: |
| 70 | + |
| 71 | +```{python} |
| 72 | +small_table = pb.load_dataset(dataset="small_table", tbl_type="pandas") |
| 73 | +pb.preview(small_table, n_head=10, n_tail=10) |
| 74 | +``` |
| 75 | + |
| 76 | +Given that the table has 13 rows, asking for 10 head rows and 10 tail rows (20 in total) covers
| 77 | +every row, so the entire table is shown.
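
To see why those settings cover the whole table, here is a minimal sketch in plain Python of which row numbers a head/tail selection would include. The helper `head_tail_row_numbers()` is hypothetical and written only for illustration; it assumes 1-based row numbers and is not Pointblank's actual implementation.

```python
def head_tail_row_numbers(n_rows: int, n_head: int = 5, n_tail: int = 5) -> list[int]:
    """Return the 1-based row numbers covered by a head/tail selection.

    Hypothetical helper for illustration only (not part of Pointblank).
    """
    # Rows taken from the top of the table, capped at the table length
    head = range(1, min(n_head, n_rows) + 1)
    # Rows taken from the bottom of the table, never starting before row 1
    tail = range(max(n_rows - n_tail, 0) + 1, n_rows + 1)
    # Overlapping rows are counted once
    return sorted(set(head) | set(tail))


# A 13-row table with 10 head rows and 10 tail rows: the two ranges overlap
full = head_tail_row_numbers(13, n_head=10, n_tail=10)

# Three rows from the top plus the last row
partial = head_tail_row_numbers(13, n_head=3, n_tail=1)
```

Under these assumptions, `full` holds every row number from 1 through 13, while `partial` holds rows 1, 2, 3, and 13.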
| 78 | + |
| 79 | +## Previewing a Subset of Columns |
| 80 | + |
| 81 | +The preview scales well to tables that have many columns by allowing for a horizontal scroll. |
| 82 | +However, previewing data from all columns can be impractical if you're only concerned with a key set |
| 83 | +of them. To preview only a subset of a table's columns, we can use the `columns_subset=` argument. |
| 84 | +Let's do this with the `nycflights` dataset and provide a list of six columns from that table. |
| 85 | + |
| 86 | +```{python} |
| 87 | +pb.preview( |
| 88 | + nycflights, |
| 89 | + columns_subset=["hour", "minute", "sched_dep_time", "year", "month", "day"] |
| 90 | +) |
| 91 | +``` |
| 92 | + |
| 93 | +What we see are the six columns we specified from the `nycflights` dataset. |
| 94 | + |
| 95 | +Note that the columns are displayed in the order provided in the `columns_subset=` list. This can be |
| 96 | +useful for making quick, side-by-side comparisons. In the example above, we placed `hour` and |
| 97 | +`minute` next to the `sched_dep_time` column. In the original dataset, `sched_dep_time` is far |
| 98 | +apart from the other two columns, but it's useful to have them next to each other in the preview
| 99 | +since `hour` and `minute` are derived from `sched_dep_time` (and this lets us spot check any |
| 100 | +issues). |
| 101 | + |
| 102 | +We can also use column selectors within `columns_subset=`. Suppose we want to only see those columns |
| 103 | +that have `"dep"` in the name. To do that, we use the |
| 104 | +[`matches()`](https://posit-dev.github.io/pointblank/reference/matches.html) column selector |
| 105 | +function: |
| 106 | + |
| 107 | +```{python} |
| 108 | +pb.preview(nycflights, columns_subset=pb.matches("dep")) |
| 109 | +``` |
| 110 | + |
| 111 | +Several selectors can be combined through use of the
| 112 | +[`col()`](https://posit-dev.github.io/pointblank/reference/col.html) function and operators such as |
| 113 | +`&` (*and*), `|` (*or*), `-` (*difference*), and `~` (*not*). Let's look at a column selection case |
| 114 | +where: |
| 115 | + |
| 116 | +- the first three columns are selected |
| 117 | +- all columns containing `"dep_"` or `"arr_"` are selected |
| 118 | +- any columns beginning with `"sched"` are omitted |
| 119 | + |
| 120 | +This is how we put that together within |
| 121 | +[`col()`](https://posit-dev.github.io/pointblank/reference/col.html): |
| 122 | + |
| 123 | +```{python} |
| 124 | +pb.preview( |
| 125 | + nycflights, |
| 126 | + columns_subset=pb.col((pb.first_n(3) | pb.matches("dep_|arr_")) & ~ pb.starts_with("sched")) |
| 127 | +) |
| 128 | +``` |
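
The selection logic in that expression can be reasoned about as ordinary set operations. The sketch below uses only the Python standard library and an abbreviated, illustrative column list (an assumption, not the full `nycflights` schema). It also assumes that the pattern is applied as a regular-expression search and that the result keeps the table's column order. It mimics the logic and does not call Pointblank's selectors.

```python
import re

# Abbreviated, illustrative column list (assumed for this sketch)
columns = [
    "year", "month", "day", "dep_time", "sched_dep_time",
    "dep_delay", "arr_time", "sched_arr_time", "arr_delay", "carrier",
]

# The first three columns
first_three = set(columns[:3])

# Columns whose names contain "dep_" or "arr_"
dep_or_arr = {c for c in columns if re.search(r"dep_|arr_", c)}

# Columns whose names begin with "sched"
sched = {c for c in columns if c.startswith("sched")}

# (first three OR dep_/arr_ matches) AND NOT starting with "sched"
keep = (first_three | dep_or_arr) - sched

# Report the kept columns in their original table order
selected = [c for c in columns if c in keep]
```

With this column list, `selected` contains `year`, `month`, `day`, `dep_time`, `dep_delay`, `arr_time`, and `arr_delay`. The two `sched_` columns match the pattern but are removed by the negated condition.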
| 129 | + |
| 130 | +This gives us a preview with only the columns that fit the specific selection rules. Incidentally, |
| 131 | +using selectors with a dataset through |
| 132 | +[`preview()`](https://posit-dev.github.io/pointblank/reference/preview.html) is a good way to test |
| 133 | +out the use of selectors more generally. Since they are primarily used to select columns for |
| 134 | +validation, trying them beforehand with |
| 135 | +[`preview()`](https://posit-dev.github.io/pointblank/reference/preview.html) can help verify that |
| 136 | +your selection logic is sound. |