Commit 7c4ee8b

Add the 'Previewing Data' article
1 parent 9f47b9c commit 7c4ee8b

2 files changed: 140 additions & 1 deletion

docs/_quarto.yml

Lines changed: 4 additions & 1 deletion
@@ -61,10 +61,13 @@ website:
  - user-guide/columns.qmd
  - user-guide/across.qmd
  - user-guide/preprocessing.qmd
- - section: "Post-Interrogation Ops"
+ - section: "Post Interrogation"
  contents:
  - user-guide/extracts.qmd
  - user-guide/sundering.qmd
+ - section: "Data Inspection"
+ contents:
+ - user-guide/preview.qmd

  html-table-processing: none

docs/user-guide/preview.qmd

Lines changed: 136 additions & 0 deletions
@@ -0,0 +1,136 @@
---
title: Previewing Data
jupyter: python3
html-table-processing: none
---

```{python}
#| echo: false
#| output: false
import pointblank as pb
pb.config(report_incl_header=False, report_incl_footer=False)
```

In many cases, it's *good* to look at your data tables. Before diving into the creation of
data-quality rules, you'll likely want to inspect a portion of the table you plan to validate. This
is easy to do with Polars and Pandas DataFrames. It's not as easy with database tables, though, and
each table backend displays things differently.

To make this common task a little easier, you can use the
[`preview()`](https://posit-dev.github.io/pointblank/reference/preview.html) function in Pointblank.
It has been designed to work with every table that the package supports (i.e., DataFrames and
Ibis-backend tables, the latter of which are largely database tables). Plus, what's shown in the
output is consistent, no matter what type of data you're looking at.

## Viewing a Table with `preview()`

Let's look at how `preview()` works. It requires only a table and, for this first example, let's
use the `nycflights` dataset:

```{python}
nycflights = pb.load_dataset(dataset="nycflights", tbl_type="polars")
pb.preview(nycflights)
```

This is an HTML table using the style of the other reporting tables in the library. The header is
more minimal here, only showing the type of table we're looking at (`POLARS` in this case) along
with the table dimensions. The column headers provide both the column names and the column data
types.

By default, we're getting the first five rows and the last five rows. Row numbers (from the original
dataset) indicate which rows belong to the head and which to the tail. The blue lines provide
additional demarcation of the column containing the row numbers and the head and tail row groups.
Finally, any cells with missing values are prominently styled with red lettering and a lighter red
background.

If you'd rather not see the row numbers in the table, you can use the `show_row_numbers=False`
option. Let's try that with the `game_revenue` dataset as a DuckDB table:

```{python}
game_revenue = pb.load_dataset(dataset="game_revenue", tbl_type="duckdb")
pb.preview(game_revenue, show_row_numbers=False)
```

With the above preview, the row numbers are gone. The horizontal blue line still serves to divide
the top and bottom rows of the table, however.

## Adjusting the Number of Rows Shown

Showing five rows from the top and five from the bottom might not be what you want. This can be
changed with the `n_head=` and `n_tail=` arguments. Maybe you want three rows from the top along
with the last row? Let's try that out with the `small_table` dataset as a Pandas DataFrame:

```{python}
small_table = pb.load_dataset(dataset="small_table", tbl_type="pandas")
pb.preview(small_table, n_head=3, n_tail=1)
```

If you're looking at a small table and want to see the entirety of it, you can enlarge the `n_head=`
and `n_tail=` values:

```{python}
small_table = pb.load_dataset(dataset="small_table", tbl_type="pandas")
pb.preview(small_table, n_head=10, n_tail=10)
```

Given that the table has 13 rows, asking for 20 rows to be displayed effectively shows the entire
table.
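
To make the head/tail arithmetic concrete, here is a minimal sketch in plain Python. It is not Pointblank's implementation: the helper name `preview_row_numbers` is hypothetical, and it simply assumes that overlapping head and tail windows collapse into the full table, as the example above suggests.

```python
def preview_row_numbers(n_rows: int, n_head: int = 5, n_tail: int = 5) -> list[int]:
    """Return the 1-based row numbers a head/tail preview would show.

    Hypothetical helper for illustration; not part of Pointblank.
    """
    if n_head + n_tail >= n_rows:
        # The head and tail windows meet or overlap, so each row is listed once.
        return list(range(1, n_rows + 1))
    head = list(range(1, n_head + 1))
    tail = list(range(n_rows - n_tail + 1, n_rows + 1))
    return head + tail


# A 13-row table, matching the row count stated for `small_table`.
print(preview_row_numbers(13, n_head=3, n_tail=1))    # [1, 2, 3, 13]
print(preview_row_numbers(13, n_head=10, n_tail=10))  # rows 1 through 13
```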

## Previewing a Subset of Columns

The preview scales well to tables that have many columns by allowing for a horizontal scroll.
However, previewing data from all columns can be impractical if you're only concerned with a key set
of them. To preview only a subset of a table's columns, we can use the `columns_subset=` argument.
Let's do this with the `nycflights` dataset and provide a list of six columns from that table.

```{python}
pb.preview(
    nycflights,
    columns_subset=["hour", "minute", "sched_dep_time", "year", "month", "day"]
)
```

What we see are the six columns we specified from the `nycflights` dataset.

Note that the columns are displayed in the order provided in the `columns_subset=` list. This can be
useful for making quick, side-by-side comparisons. In the example above, we placed `hour` and
`minute` next to the `sched_dep_time` column. In the original dataset, `sched_dep_time` is far
apart from the other two columns, but it's useful to have them next to each other in the preview
since `hour` and `minute` are derived from `sched_dep_time` (and this lets us spot check any
issues).
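
The same spot check can be sketched in plain Python. This assumes `sched_dep_time` is stored as an `HHMM`-style integer (as in the nycflights13 data this dataset appears to be based on); the helper `split_hhmm` and the sample rows are hypothetical and only illustrate the comparison.

```python
def split_hhmm(sched_dep_time: int) -> tuple[int, int]:
    """Split an HHMM-style integer (e.g. 1545) into (hour, minute)."""
    return divmod(sched_dep_time, 100)


# Illustrative (sched_dep_time, hour, minute) rows; made up for this sketch.
rows = [(515, 5, 15), (1545, 15, 45), (600, 6, 0)]

# Collect any rows where hour/minute disagree with sched_dep_time.
mismatches = [row for row in rows if split_hhmm(row[0]) != (row[1], row[2])]
print(mismatches)  # [] when every row is consistent
```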

We can also use column selectors within `columns_subset=`. Suppose we want to see only those columns
that have `"dep"` in the name. To do that, we use the
[`matches()`](https://posit-dev.github.io/pointblank/reference/matches.html) column selector
function:

```{python}
pb.preview(nycflights, columns_subset=pb.matches("dep"))
```

Several selectors can be combined through use of the
[`col()`](https://posit-dev.github.io/pointblank/reference/col.html) function and operators such as
`&` (*and*), `|` (*or*), `-` (*difference*), and `~` (*not*). Let's look at a column selection case
where:

- the first three columns are selected
- all columns containing `"dep_"` or `"arr_"` are selected
- any columns beginning with `"sched"` are omitted

This is how we put that together within
[`col()`](https://posit-dev.github.io/pointblank/reference/col.html):

```{python}
pb.preview(
    nycflights,
    columns_subset=pb.col((pb.first_n(3) | pb.matches("dep_|arr_")) & ~pb.starts_with("sched"))
)
```

This gives us a preview with only the columns that fit the specific selection rules. Incidentally,
using selectors with a dataset through
[`preview()`](https://posit-dev.github.io/pointblank/reference/preview.html) is a good way to test
out the use of selectors more generally. Since they are primarily used to select columns for
validation, trying them beforehand with
[`preview()`](https://posit-dev.github.io/pointblank/reference/preview.html) can help verify that
your selection logic is sound.
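
To reason about a compound selection before running it, the set logic can be sketched in plain Python. This is only a model of the `&`, `|`, and `~` operators, not how Pointblank resolves selectors. The column list is an assumption taken from the standard nycflights13 `flights` table and may differ from the bundled dataset, as may the order in which `preview()` displays the result.

```python
import re

# Assumed column names (from nycflights13's `flights` table); may not match exactly.
columns = [
    "year", "month", "day", "dep_time", "sched_dep_time", "dep_delay",
    "arr_time", "sched_arr_time", "arr_delay", "carrier", "flight",
    "tailnum", "origin", "dest", "air_time", "distance", "hour", "minute",
]

first_three = set(columns[:3])                                   # like first_n(3)
dep_or_arr = {c for c in columns if re.search(r"dep_|arr_", c)}  # like matches("dep_|arr_")
sched = {c for c in columns if c.startswith("sched")}            # like starts_with("sched")

# (first_n(3) | matches(...)) & ~starts_with("sched") is a union minus the excluded set.
keep = (first_three | dep_or_arr) - sched

# List the surviving columns in the table's original order.
selected = [c for c in columns if c in keep]
print(selected)
# ['year', 'month', 'day', 'dep_time', 'dep_delay', 'arr_time', 'arr_delay']
```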
