@@ -60,7 +60,7 @@ Validate(data: 'FrameT | Any', tbl_name: 'str | None' = None, label: 'str | None
6060 [`Thresholds`](`pointblank.Thresholds`) object.
6161 actions
6262 The actions to take when validation steps meet or exceed any set threshold levels. This
63- should be provided in the form of an `Actions` object. If `None` then no default actions
63+ should be provided in the form of an `Actions` object. If `None` then no global actions
6464 will be set.
6565 brief
6666 A global setting for briefs, which are optional brief descriptions for validation steps
@@ -104,7 +104,7 @@ Validate(data: 'FrameT | Any', tbl_name: 'str | None' = None, label: 'str | None
104104
105105 Examples
106106 --------
107- ## Creating a validation plan and interrogating
107+ ### Creating a validation plan and interrogating
108108
109109 Let's walk through a data quality analysis of an extremely small table. It's actually called
110110 `"small_table"` and it's accessible through the [`load_dataset()`](`pointblank.load_dataset`)
@@ -170,11 +170,72 @@ Validate(data: 'FrameT | Any', tbl_name: 'str | None' = None, label: 'str | None
170170 [`get_tabular_report()`](`pointblank.Validate.get_tabular_report`) method, which contains
171171 options for modifying the display of the table.
172172
173- Furthermore, post-interrogation methods such as
174- [`get_step_report()`](`pointblank.Validate.get_step_report`),
175- [`get_data_extracts()`](`pointblank.Validate.get_data_extracts`), and
176- [`get_sundered_data()`](`pointblank.Validate.get_sundered_data`) allow you to generate
177- additional reporting or extract useful data for downstream analysis from a `Validate` object.
173+ ### Adding briefs
174+
175+ Briefs are short descriptions of the validation steps. While they can be set for each step
176+ individually, they can also be set globally. The global setting is done by using the
177+ `brief=` argument in `Validate`. The global setting can be as simple as `True` to have
178+ automatically generated briefs for each step. Alternatively, we can use templating elements
179+ like `"{step}"` (to insert the step number) or `"{auto}"` (to include an automatically generated
180+ brief). Here's an example of a global setting for briefs:
181+
182+ ```python
183+ validation = (
184+     pb.Validate(
185+         data=pb.load_dataset(),
186+         tbl_name="small_table",
187+         label="Validation example with briefs",
188+         brief="Step {step}: {auto}",
189+     )
190+     .col_vals_gt(columns="d", value=100)
191+     .col_vals_between(columns="c", left=3, right=10, na_pass=True)
192+     .col_vals_regex(
193+         columns="b",
194+         pattern=r"[0-9]-[a-z]{3}-[0-9]{3}",
195+         brief="Regex check for column {col}"
196+     )
197+     .interrogate()
198+ )
199+
200+ validation
201+ ```
202+
203+ We see the text of the briefs appear in the `STEP` column of the reporting table. Furthermore,
204+ the global brief's template (`"Step {step}: {auto}"`) is applied to all steps except for the
205+ final step, where the step-level `brief=` argument provided an override.
206+
207+ If you want to cancel the globally defined brief for one or more validation steps, you
208+ can set `brief=False` in those particular steps.
209+
210+ ### Post-interrogation methods
211+
212+ The `Validate` class has a number of post-interrogation methods that can be used to extract
213+ useful information from the validation results. For example, the
214+ [`get_data_extracts()`](`pointblank.Validate.get_data_extracts`) method can be used to get
215+ the extracts of failing rows collected for each validation step.
216+
217+ ```python
218+ validation.get_data_extracts()
219+ ```
220+
221+ We can also view step reports for each validation step using the
222+ [`get_step_report()`](`pointblank.Validate.get_step_report`) method. This method adapts to the
223+ type of validation step and shows the relevant information for a step's validation.
224+
225+ ```python
226+ validation.get_step_report(i=2)
227+ ```
228+
229+ The `Validate` class also has a method for getting the sundered data, which is the data that
230+ passed or failed the validation steps. This can be done using the
231+ [`get_sundered_data()`](`pointblank.Validate.get_sundered_data`) method.
232+
233+ ```python
234+ pb.preview(validation.get_sundered_data())
235+ ```
236+
237+ The sundered data is a DataFrame that contains the rows that passed or failed the validation.
238+ The default behavior is to return the rows that passed the validation, as shown above.
178239
179240
180241Thresholds(warning: 'int | float | bool | None' = None, error: 'int | float | bool | None' = None, critical: 'int | float | bool | None' = None) -> None
@@ -4169,7 +4230,7 @@ validation steps, (3) `interrogate()`. After interrogation of the data, we can v
41694230report table (by printing the object or using `get_tabular_report()`), extract key metrics, or we
41704231can split the data based on the validation results (with `get_sundered_data()`).
41714232
4172- interrogate(self, collect_extracts: 'bool' = True, collect_tbl_checked: 'bool' = True, get_first_n: 'int | None' = None, sample_n: 'int | None' = None, sample_frac: 'int | float | None' = None, sample_limit: 'int' = 5000) -> 'Validate'
4233+ interrogate(self, collect_extracts: 'bool' = True, collect_tbl_checked: 'bool' = True, get_first_n: 'int | None' = None, sample_n: 'int | None' = None, sample_frac: 'int | float | None' = None, extract_limit: 'int' = 500) -> 'Validate'
41734234
41744235 Execute each validation step against the table and store the results.
41754236
@@ -4179,8 +4240,8 @@ interrogate(self, collect_extracts: 'bool' = True, collect_tbl_checked: 'bool' =
41794240
41804241 The interrogation process will collect extracts of failing rows if the `collect_extracts=`
41814242 option is set to `True` (the default). We can control the number of rows collected using the
4182- `get_first_n=`, `sample_n=`, and `sample_frac=` options. The `sample_limit=` option will
4183- enforce a hard limit on the number of rows collected when using the `sample_frac=` option.
4243+ `get_first_n=`, `sample_n=`, and `sample_frac=` options. The `extract_limit=` option will
4244+ enforce a hard limit on the number of rows collected when `collect_extracts=True`.
41844245
41854246 After interrogation is complete, the `Validate` object will have gathered information, and
41864247 we can use methods like [`n_passed()`](`pointblank.Validate.n_passed`),
@@ -4199,9 +4260,9 @@ interrogate(self, collect_extracts: 'bool' = True, collect_tbl_checked: 'bool' =
41994260 The processed data frames produced by executing the validation steps are collected and
42004261 stored in the `Validate` object if `collect_tbl_checked=True`. This information is
42014262 necessary for some methods (e.g.,
4202- [`get_sundered_data()`](`pointblank.Validate.get_sundered_data`)), but it potentially
4203- makes the object grow to a large size. To opt out of attaching this data, set this
4204- argument to `False`.
4263+ [`get_sundered_data()`](`pointblank.Validate.get_sundered_data`)), but it can
4264+ potentially make the object grow to a large size. To opt out of attaching this data, set
4265+ this argument to `False`.
42054266 get_first_n
42064267 If the option to collect non-passing rows is chosen, there is the option here to
42074268 collect the first `n` rows. Supply an integer number of rows to extract from the top of
@@ -4215,11 +4276,15 @@ interrogate(self, collect_extracts: 'bool' = True, collect_tbl_checked: 'bool' =
42154276 sample_frac
42164277 If the option to collect non-passing rows is chosen, this option allows for the sampling
42174278 of a fraction of those rows. Provide a number between `0` and `1`. The number of
4218- rows to return could be very large, however, the `sample_limit=` option will apply a
4279+ rows to return could be very large; however, the `extract_limit=` option will apply a
42194280 hard limit to the returned rows.
4220- sample_limit
4221- A value that limits the possible number of rows returned when sampling non-passing rows
4222- using the `sample_frac=` option.
4281+ extract_limit
4282+ A value that limits the possible number of rows returned when extracting non-passing
4283+ rows. The default is `500` rows. This limit is applied after any sampling or limiting
4284+ options are applied. If the number of rows to be returned is greater than this limit,
4285+ then the number of rows returned will be limited to this value. This is useful for
4286+ preventing the collection of too many rows when the number of non-passing rows is very
4287+ large.
42234288
42244289 Returns
42254290 -------
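The new `extract_limit` docstring says the cap is applied after any sampling or limiting options. The sketch below is a rough, pure-Python model of that ordering, not pointblank's implementation: the helper name `collect_extract`, its `seed` parameter, and the precedence among `get_first_n`, `sample_n`, and `sample_frac` are assumptions made for illustration only.

```python
import random


def collect_extract(failing_rows, get_first_n=None, sample_n=None,
                    sample_frac=None, extract_limit=500, seed=0):
    """Hypothetical model of extract collection (not pointblank's code)."""
    rows = list(failing_rows)
    rng = random.Random(seed)  # seeded so the sketch is reproducible

    # Step 1: apply at most one limiting/sampling option.
    # The precedence used here is an assumption of this sketch.
    if get_first_n is not None:
        rows = rows[:get_first_n]
    elif sample_n is not None:
        rows = rng.sample(rows, min(sample_n, len(rows)))
    elif sample_frac is not None:
        rows = rng.sample(rows, int(len(rows) * sample_frac))

    # Step 2: the hard cap is applied last, whatever happened in step 1.
    return rows[:extract_limit]


failing = list(range(10_000))  # stand-in for 10,000 non-passing rows
print(len(collect_extract(failing, sample_frac=0.5)))  # 500 (5,000 sampled, then capped)
print(len(collect_extract(failing, get_first_n=10)))   # 10 (already under the cap)
```

Under this model, the old `sample_limit` behavior corresponds to applying the cap only in the `sample_frac` branch, whereas `extract_limit` bounds every extract regardless of which option produced it.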