Commit f5d5635

Update col-summary-tbl.qmd
1 parent 875f5f5 commit f5d5635

1 file changed

+34
-12
lines changed

docs/user-guide/col-summary-tbl.qmd

Lines changed: 34 additions & 12 deletions
@@ -35,45 +35,67 @@ small_table = pb.load_dataset(dataset="small_table", tbl_type="polars")
3535
pb.col_summary_tbl(small_table)
3636
```
3737

38-
The header provides the type of table we're looking at (`POLARS`, since this is a Pandas DataFrame) and the table dimensions. The rest of the table focuses on the column-level summaries. As such, each row represents a summary of a column in the `small_table` dataset. There's a lot of information
39-
in this summary table to digest. Some of it is intuitive since this sort of table summarization isn't all that uncommon, but other aspects of it could also give some pause. So we'll carefully wade through how to interpret this report.
38+
The header provides the type of table we're looking at (`POLARS`, since this is a Polars DataFrame)
39+
and the table dimensions. The rest of the table focuses on the column-level summaries. As such, each
40+
row represents a summary of a column in the `small_table` dataset. There's a lot of information in
41+
this summary table to digest. Some of it is intuitive since this sort of table summarization isn't
42+
all that uncommon, but other aspects of it could also give some pause. So we'll carefully wade
43+
through how to interpret this report.
4044

4145
## Data Categories in the Column Summary Table
4246

43-
On the left side of the table are icons of different colors. These represent categories that the columns fall into. There are only five categories and columns can only be of one type. The mapping from letter marks to categories are:
47+
On the left side of the table are icons of different colors. These represent categories that the
48+
columns fall into. There are only five categories, and each column belongs to exactly one. The
49+
categories (and their letter marks) are:
4450

4551
- `N`: numeric
4652
- `S`: string-based
4753
- `D`: date/datetime
4854
- `T/F`: boolean
4955
- `O`: object
5056

51-
The numeric category (`N`) takes data types such as floats and integers. The `S` category is for string-based columns. Date or datetime values are lumped into the `D` category. Boolean columns
57+
The numeric category (`N`) takes data types such as floats and integers. The `S` category is for
58+
string-based columns. Date or datetime values are lumped into the `D` category. Boolean columns
5259
(`T/F`) have their own category and are *not* considered numeric (e.g., `0`/`1`). The `O` category
53-
is a catchall for all other types of columns. Given the disparity of these categories and that we want them in the same table, some statistical measures will be sensible for certain column categories but not for others. Given that, we'll explain how each category is represented in the column summary table.
60+
is a catchall for all other types of columns. Given the disparity of these categories and that we
61+
want them in the same table, some statistical measures will be sensible for certain column
62+
categories but not for others. Given that, we'll explain how each category is represented in the
63+
column summary table.
5464

5565
## Numeric Data
5666

5767
Three columns in `small_table` are numeric: `a` (`Int64`), `c` (`Int64`), and `d` (`Float64`). The
5868
common measures of the missing count/proportion (`NA`) and the unique value count/proportion (`UQ`)
59-
are provided for the numeric data type. For these two measures, the top number is the absolute
60-
count of missing values and the count of unique values. The bottom number is a proportion of the absolute count divided by the row count; this makes each proportion a value between `0` and `1`
61-
(bounds included).
69+
are provided for the numeric data type. For these two measures, the top number is the absolute count
70+
of missing values and the count of unique values. The bottom number is a proportion of the absolute
71+
count divided by the row count; this makes each proportion a value between `0` and `1` (bounds
72+
included).
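
As a rough illustration of how each count relates to its proportion, the sketch below computes both for a small hypothetical list of values (not taken from `small_table`). The helper name `count_and_proportion` is made up for this example, and excluding missing values from the unique count is an assumption; the library may treat that case differently.

```python
def count_and_proportion(values):
    """Return (count, proportion) pairs for missing and unique values."""
    n_rows = len(values)
    n_missing = sum(v is None for v in values)
    # Assumption: a missing value is not counted as a unique value
    n_unique = len({v for v in values if v is not None})
    return {
        "NA": (n_missing, n_missing / n_rows),
        "UQ": (n_unique, n_unique / n_rows),
    }


example = [1, 2, 2, None, 4]  # hypothetical data
summary = count_and_proportion(example)
print(summary)
```

Because both counts can never exceed the row count, each proportion stays within the `0` to `1` range.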
6273

63-
The next two columns represent the mean (`Mean`) and the standard deviation (`SD`). The minumum (`Min`), maximum, (`Max`) and a set of quantiles occupy the next few columns (includes `P5`, `Q1`, `Med` for median, `Q3`, and `P95`). Finally, the interquartile range (`IQR`: `Q3` - `Q1`) is the last measure provided.
74+
The next two columns represent the mean (`Mean`) and the standard deviation (`SD`). The minimum
75+
(`Min`), maximum (`Max`), and a set of quantiles occupy the next few columns (includes `P5`, `Q1`,
76+
`Med` for median, `Q3`, and `P95`). Finally, the interquartile range (`IQR`: `Q3` - `Q1`) is the
77+
last measure provided.
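
To make these measures concrete, here is a minimal sketch that computes them with Python's standard `statistics` module on a hypothetical list of numbers. The choice of the sample standard deviation and of the `"inclusive"` quantile method are assumptions made for illustration; the summary table may use different conventions, so its values could differ slightly.

```python
import statistics

values = [1, 2, 3, 4, 5]  # hypothetical data

mean = statistics.mean(values)
sd = statistics.stdev(values)  # sample standard deviation (an assumption)
minimum, maximum = min(values), max(values)

# n=4 yields the three quartile cut points: Q1, median, Q3
q1, med, q3 = statistics.quantiles(values, n=4, method="inclusive")

# n=20 yields cut points at 5%, 10%, ..., 95%; keep the first and last
cut_points = statistics.quantiles(values, n=20, method="inclusive")
p5, p95 = cut_points[0], cut_points[-1]

# The interquartile range is the spread of the middle half of the data
iqr = q3 - q1

print(mean, sd, minimum, p5, q1, med, q3, p95, maximum, iqr)
```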
6478

6579
## String Data
6680

67-
String data is present in `small_table`, being in columns `b` and `f`. The missing value (`NA`) and uniqueness (`UQ`) measures are accounted for here. The statistical measures are all based on string lengths, so what happens is that all strings in a column are converted to those numeric values and a subset of stats values is presented. To avoid some understandable confusion when reading the table, the stats values in each of the cells with values are annotated with the text `"SL"`. It makes less sense to provide a full suite a quantile values so only the minimum (`Min`), median (`Med`), and maximum (`Max`) are provided.
81+
String data is present in `small_table`, being in columns `b` and `f`. The missing value (`NA`) and
82+
uniqueness (`UQ`) measures are accounted for here. The statistical measures are all based on string
83+
lengths: every string in a column is converted to its length (a numeric value), and a
84+
subset of stats values is presented. To avoid some understandable confusion when reading the table,
85+
the stats values in each of the cells with values are annotated with the text `"SL"`. It makes less
86+
sense to provide a full suite of quantile values, so only the minimum (`Min`), median (`Med`), and
87+
maximum (`Max`) are provided.
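
The string-length idea can be sketched in a few lines of plain Python. The strings below are hypothetical (they are not the contents of columns `b` or `f`), and skipping missing values before measuring lengths is an assumption.

```python
import statistics

strings = ["apple", "fig", None, "banana"]  # hypothetical data

# Convert each non-missing string to its length
lengths = [len(s) for s in strings if s is not None]

sl_min = min(lengths)
sl_med = statistics.median(lengths)
sl_max = max(lengths)

print(sl_min, sl_med, sl_max)
```

These three values correspond to what the `Min`, `Med`, and `Max` cells marked `"SL"` describe: lengths of strings, not the strings themselves.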
6888

6989
## Date/Datetime Data and Boolean Data
7090

71-
We see that in the first two rows of our summary table that we have summaries of the `date_time` and
91+
We see that in the first two rows of our summary table there are summaries of the `date_time` and
7292
`date` columns. The summaries we provide for a date/datetime category (notice the green `D` to the
7393
left of the column names) are:
7494

7595
1. the missing count/proportion (`NA`)
7696
2. the unique value count/proportion (`UQ`)
7797
3. the minimum and maximum dates/datetimes
7898

79-
One column, `e`, is of the `Boolean` type. Because columns of this type could only have `True`, `False`, or missing values, we provide summary data for missingness (under `NA`) and proportions of `True` and `False` values (under `UQ`).
99+
One column, `e`, is of the `Boolean` type. Because columns of this type can only have `True`,
100+
`False`, or missing values, we provide summary data for missingness (under `NA`) and proportions of
101+
`True` and `False` values (under `UQ`).
