Commit ee5afeb

Merge pull request #130 from posit-dev/fix-set-limit-extracts
fix: set limit on extracts no matter which scheme was used for collection
2 parents 871cd9a + 2ed5dde commit ee5afeb

3 files changed: +149 -31 lines changed

pointblank/data/api-docs.txt

Lines changed: 82 additions & 17 deletions
@@ -60,7 +60,7 @@ Validate(data: 'FrameT | Any', tbl_name: 'str | None' = None, label: 'str | None
 [`Thresholds`](`pointblank.Thresholds`) object.
 actions
 The actions to take when validation steps meet or exceed any set threshold levels. This
-should be provided in the form of an `Actions` object. If `None` then no default actions
+should be provided in the form of an `Actions` object. If `None` then no global actions
 will be set.
 brief
 A global setting for briefs, which are optional brief descriptions for validation steps
@@ -104,7 +104,7 @@ Validate(data: 'FrameT | Any', tbl_name: 'str | None' = None, label: 'str | None

 Examples
 --------
-## Creating a validation plan and interrogating
+### Creating a validation plan and interrogating

 Let's walk through a data quality analysis of an extremely small table. It's actually called
 `"small_table"` and it's accessible through the [`load_dataset()`](`pointblank.load_dataset`)
@@ -170,11 +170,72 @@ Validate(data: 'FrameT | Any', tbl_name: 'str | None' = None, label: 'str | None
 [`get_tabular_report()`](`pointblank.Validate.get_tabular_report`) method, which contains
 options for modifying the display of the table.

-Furthermore, post-interrogation methods such as
-[`get_step_report()`](`pointblank.Validate.get_step_report`),
-[`get_data_extracts()`](`pointblank.Validate.get_data_extracts`), and
-[`get_sundered_data()`](`pointblank.Validate.get_sundered_data`) allow you to generate
-additional reporting or extract useful data for downstream analysis from a `Validate` object.
+### Adding briefs
+
+Briefs are short descriptions of the validation steps. While they can be set for each step
+individually, they can also be set globally. The global setting is done by using the
+`brief=` argument in `Validate`. The global setting can be as simple as `True` to have
+automatically-generated briefs for each step. Alternatively, we can use templating elements
+like `"{step}"` (to insert the step number) or `"{auto}"` (to include an automatically generated
+brief). Here's an example of a global setting for briefs:
+
+```python
+validation = (
+    pb.Validate(
+        data=pb.load_dataset(),
+        tbl_name="small_table",
+        label="Validation example with briefs",
+        brief="Step {step}: {auto}",
+    )
+    .col_vals_gt(columns="d", value=100)
+    .col_vals_between(columns="c", left=3, right=10, na_pass=True)
+    .col_vals_regex(
+        columns="b",
+        pattern=r"[0-9]-[a-z]{3}-[0-9]{3}",
+        brief="Regex check for column {col}"
+    )
+    .interrogate()
+)
+
+validation
+```
+
+We see the text of the briefs appear in the `STEP` column of the reporting table. Furthermore,
+the global brief's template (`"Step {step}: {auto}"`) is applied to all steps except for the
+final step, where the step-level `brief=` argument provided an override.
+
+If you should want to cancel the globally-defined brief for one or more validation steps, you
+can set `brief=False` in those particular steps.
+
+### Post-interrogation methods
+
+The `Validate` class has a number of post-interrogation methods that can be used to extract
+useful information from the validation results. For example, the
+[`get_data_extracts()`](`pointblank.Validate.get_data_extracts`) method can be used to get
+the data extracts for each validation step.
+
+```python
+validation.get_data_extracts()
+```
+
+We can also view step reports for each validation step using the
+[`get_step_report()`](`pointblank.Validate.get_step_report`) method. This method adapts to the
+type of validation step and shows the relevant information for a step's validation.
+
+```python
+validation.get_step_report(i=2)
+```
+
+The `Validate` class also has a method for getting the sundered data, which is the data that
+passed or failed the validation steps. This can be done using the
+[`get_sundered_data()`](`pointblank.Validate.get_sundered_data`) method.
+
+```python
+pb.preview(validation.get_sundered_data())
+```
+
+The sundered data is a DataFrame that contains the rows that passed or failed the validation.
+The default behavior is to return the rows that failed the validation, as shown above.


 Thresholds(warning: 'int | float | bool | None' = None, error: 'int | float | bool | None' = None, critical: 'int | float | bool | None' = None) -> None
@@ -4169,7 +4230,7 @@ validation steps, (3) `interrogate()`. After interrogation of the data, we can v
 report table (by printing the object or using `get_tabular_report()`), extract key metrics, or we
 can split the data based on the validation results (with `get_sundered_data()`).

-interrogate(self, collect_extracts: 'bool' = True, collect_tbl_checked: 'bool' = True, get_first_n: 'int | None' = None, sample_n: 'int | None' = None, sample_frac: 'int | float | None' = None, sample_limit: 'int' = 5000) -> 'Validate'
+interrogate(self, collect_extracts: 'bool' = True, collect_tbl_checked: 'bool' = True, get_first_n: 'int | None' = None, sample_n: 'int | None' = None, sample_frac: 'int | float | None' = None, extract_limit: 'int' = 500) -> 'Validate'

 Execute each validation step against the table and store the results.

@@ -4179,8 +4240,8 @@ interrogate(self, collect_extracts: 'bool' = True, collect_tbl_checked: 'bool' =

 The interrogation process will collect extracts of failing rows if the `collect_extracts=`
 option is set to `True` (the default). We can control the number of rows collected using the
-`get_first_n=`, `sample_n=`, and `sample_frac=` options. The `sample_limit=` option will
-enforce a hard limit on the number of rows collected when using the `sample_frac=` option.
+`get_first_n=`, `sample_n=`, and `sample_frac=` options. The `extract_limit=` option will
+enforce a hard limit on the number of rows collected when `collect_extracts=True`.

 After interrogation is complete, the `Validate` object will have gathered information, and
 we can use methods like [`n_passed()`](`pointblank.Validate.n_passed`),
@@ -4199,9 +4260,9 @@ interrogate(self, collect_extracts: 'bool' = True, collect_tbl_checked: 'bool' =
 The processed data frames produced by executing the validation steps is collected and
 stored in the `Validate` object if `collect_tbl_checked=True`. This information is
 necessary for some methods (e.g.,
-[`get_sundered_data()`](`pointblank.Validate.get_sundered_data`)), but it potentially
-makes the object grow to a large size. To opt out of attaching this data, set this
-argument to `False`.
+[`get_sundered_data()`](`pointblank.Validate.get_sundered_data`)), but it can
+potentially make the object grow to a large size. To opt out of attaching this data, set
+this to `False`.
 get_first_n
 If the option to collect rows where test units is chosen, there is the option here to
 collect the first `n` rows. Supply an integer number of rows to extract from the top of
@@ -4215,11 +4276,15 @@ interrogate(self, collect_extracts: 'bool' = True, collect_tbl_checked: 'bool' =
 sample_frac
 If the option to collect non-passing rows is chosen, this option allows for the sampling
 of a fraction of those rows. Provide a number in the range of `0` and `1`. The number of
-rows to return could be very large, however, the `sample_limit=` option will apply a
+rows to return could be very large, however, the `extract_limit=` option will apply a
 hard limit to the returned rows.
-sample_limit
-A value that limits the possible number of rows returned when sampling non-passing rows
-using the `sample_frac=` option.
+extract_limit
+A value that limits the possible number of rows returned when extracting non-passing
+rows. The default is `500` rows. This limit is applied after any sampling or limiting
+options are applied. If the number of rows to be returned is greater than this limit,
+then the number of rows returned will be limited to this value. This is useful for
+preventing the collection of too many rows when the number of non-passing rows is very
+large.

 Returns
 -------
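Reviewer note on the brief templating documented in this file: the precedence rules (global template, step-level override, `brief=False` to cancel) can be pictured with a small standalone sketch. This is not pointblank code. `render_brief` and all of its parameters are hypothetical names, the auto-generated text is a placeholder, and the handling of `True`, `False`, and `None` is an assumption based only on the documentation text in this diff.

```python
def render_brief(step, auto_text, global_brief=None, step_brief=None, col=None):
    """Hypothetical helper: resolve the brief text for one validation step."""
    # A step-level setting, when given, takes precedence over the global one.
    brief = step_brief if step_brief is not None else global_brief

    # None means "no brief configured"; False explicitly cancels the brief.
    if brief is None or brief is False:
        return None

    # True requests the automatically generated brief as-is.
    if brief is True:
        return auto_text

    # Plain replacement keeps any unrelated braces in the template untouched.
    text = brief.replace("{step}", str(step)).replace("{auto}", auto_text)
    if col is not None:
        text = text.replace("{col}", col)
    return text


# Placeholder auto text; the real wording is produced by the library.
auto = "auto-generated brief"

print(render_brief(1, auto, global_brief="Step {step}: {auto}"))
print(render_brief(3, auto, global_brief="Step {step}: {auto}",
                   step_brief="Regex check for column {col}", col="b"))
print(render_brief(2, auto, global_brief="Step {step}: {auto}", step_brief=False))
```

The first call applies the global template, the second shows the step-level override winning, and the third returns `None` because the step opted out.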

pointblank/validate.py

Lines changed: 17 additions & 13 deletions
@@ -5147,7 +5147,7 @@ def interrogate(
     get_first_n: int | None = None,
     sample_n: int | None = None,
     sample_frac: int | float | None = None,
-    sample_limit: int = 5000,
+    extract_limit: int = 500,
 ) -> Validate:
     """
     Execute each validation step against the table and store the results.
@@ -5158,8 +5158,8 @@ def interrogate(

 The interrogation process will collect extracts of failing rows if the `collect_extracts=`
 option is set to `True` (the default). We can control the number of rows collected using the
-`get_first_n=`, `sample_n=`, and `sample_frac=` options. The `sample_limit=` option will
-enforce a hard limit on the number of rows collected when using the `sample_frac=` option.
+`get_first_n=`, `sample_n=`, and `sample_frac=` options. The `extract_limit=` option will
+enforce a hard limit on the number of rows collected when `collect_extracts=True`.

 After interrogation is complete, the `Validate` object will have gathered information, and
 we can use methods like [`n_passed()`](`pointblank.Validate.n_passed`),
@@ -5178,9 +5178,9 @@ def interrogate(
 The processed data frames produced by executing the validation steps is collected and
 stored in the `Validate` object if `collect_tbl_checked=True`. This information is
 necessary for some methods (e.g.,
-[`get_sundered_data()`](`pointblank.Validate.get_sundered_data`)), but it potentially
-makes the object grow to a large size. To opt out of attaching this data, set this
-argument to `False`.
+[`get_sundered_data()`](`pointblank.Validate.get_sundered_data`)), but it can
+potentially make the object grow to a large size. To opt out of attaching this data, set
+this to `False`.
 get_first_n
 If the option to collect rows where test units is chosen, there is the option here to
 collect the first `n` rows. Supply an integer number of rows to extract from the top of
@@ -5194,11 +5194,15 @@ def interrogate(
 sample_frac
 If the option to collect non-passing rows is chosen, this option allows for the sampling
 of a fraction of those rows. Provide a number in the range of `0` and `1`. The number of
-rows to return could be very large, however, the `sample_limit=` option will apply a
+rows to return could be very large, however, the `extract_limit=` option will apply a
 hard limit to the returned rows.
-sample_limit
-A value that limits the possible number of rows returned when sampling non-passing rows
-using the `sample_frac=` option.
+extract_limit
+A value that limits the possible number of rows returned when extracting non-passing
+rows. The default is `500` rows. This limit is applied after any sampling or limiting
+options are applied. If the number of rows to be returned is greater than this limit,
+then the number of rows returned will be limited to this value. This is useful for
+preventing the collection of too many rows when the number of non-passing rows is very
+large.

 Returns
 -------
@@ -5708,9 +5712,9 @@ def interrogate(
 elif sample_frac is not None:
     validation_extract_nw = validation_extract_nw.sample(fraction=sample_frac)

-# Ensure a limit is set on the number of rows to extract
-if len(validation_extract_nw) > sample_limit:
-    validation_extract_nw = validation_extract_nw.head(sample_limit)
+# Ensure a limit is set on the number of rows to extract
+if len(validation_extract_nw) > extract_limit:
+    validation_extract_nw = validation_extract_nw.head(extract_limit)

 validation.extract = nw.to_native(validation_extract_nw)
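Reviewer note on the behavior change in this file: per the updated docstring, `extract_limit=` caps the extract after any of the three collection schemes has run, and also when none was requested. A minimal standalone sketch of that ordering follows. It uses plain Python lists instead of Narwhals frames. `collect_extract`, the `seed` parameter, and the 2000-row input are hypothetical, and the sampling details are an assumption rather than the library's implementation.

```python
import random


def collect_extract(failing_rows, get_first_n=None, sample_n=None,
                    sample_frac=None, extract_limit=500, seed=None):
    """Hypothetical stand-in for the extract-collection step (not pointblank code)."""
    rows = list(failing_rows)
    rng = random.Random(seed)

    # At most one collection scheme is applied, mirroring the if/elif chain.
    if get_first_n is not None:
        rows = rows[:get_first_n]
    elif sample_n is not None:
        rows = rng.sample(rows, min(sample_n, len(rows)))
    elif sample_frac is not None:
        rows = rng.sample(rows, int(len(rows) * sample_frac))

    # The cap runs after whichever scheme was used, including none at all.
    if len(rows) > extract_limit:
        rows = rows[:extract_limit]
    return rows


failing = list(range(2000))  # arbitrary stand-in for the non-passing rows

print(len(collect_extract(failing)))                                       # 500
print(len(collect_extract(failing, get_first_n=10)))                       # 10
print(len(collect_extract(failing, get_first_n=750)))                      # 500
print(len(collect_extract(failing, get_first_n=750, extract_limit=1000)))  # 750
```

The four calls mirror the row counts asserted in `test_interrogate_sample_n_limit` in this commit: the default cap, a request below the cap, a request above the cap, and a raised cap.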

tests/test_validate.py

Lines changed: 50 additions & 1 deletion
@@ -5057,6 +5057,55 @@ def test_interrogate_sample_n(request, tbl_fixture):
     assert len(nw.from_native(validation.get_data_extracts(i=1, frame=True)).columns) == 4


+def test_interrogate_sample_n_limit():
+    game_revenue = load_dataset(dataset="game_revenue", tbl_type="polars")
+
+    validation_default_limit = (
+        Validate(game_revenue).col_vals_gt(columns="item_revenue", value=10000).interrogate()
+    )
+
+    assert (
+        len(nw.from_native(validation_default_limit.get_data_extracts(i=1, frame=True)).rows())
+        == 500
+    )
+
+    validation_set_n_limit = (
+        Validate(game_revenue)
+        .col_vals_gt(columns="item_revenue", value=10000)
+        .interrogate(get_first_n=10)
+    )
+
+    assert (
+        len(nw.from_native(validation_set_n_limit.get_data_extracts(i=1, frame=True)).rows()) == 10
+    )
+
+    validation_set_n_no_limit_break = (
+        Validate(game_revenue)
+        .col_vals_gt(columns="item_revenue", value=10000)
+        .interrogate(get_first_n=750)
+    )
+
+    assert (
+        len(
+            nw.from_native(
+                validation_set_n_no_limit_break.get_data_extracts(i=1, frame=True)
+            ).rows()
+        )
+        == 500
+    )
+
+    validation_set_n_adj_limit = (
+        Validate(game_revenue)
+        .col_vals_gt(columns="item_revenue", value=10000)
+        .interrogate(get_first_n=750, extract_limit=1000)
+    )
+
+    assert (
+        len(nw.from_native(validation_set_n_adj_limit.get_data_extracts(i=1, frame=True)).rows())
+        == 750
+    )
+
+
 @pytest.mark.parametrize(
     "tbl_fixture, sample_frac, expected",
     [
@@ -5096,7 +5145,7 @@ def test_interrogate_sample_frac_with_sample_limit(request, tbl_fixture):
     validation = (
         Validate(tbl)
         .col_vals_regex(columns="text", pattern=r"^[a-z]{3}")
-        .interrogate(sample_frac=0.8, sample_limit=1)
+        .interrogate(sample_frac=0.8, extract_limit=1)
     )

     # Expect that the extracts table has 2 entries out of 3 failures
