@@ -245,6 +245,42 @@ Thresholds(warning: 'int | float | bool | None' = None, error: 'int | float | bo
245245 (in the [`Validate`](`pointblank.Validate`) class).
246246
247247
248+ Actions(warning: 'str | Callable | list[str | Callable] | None' = None, error: 'str | Callable | list[str | Callable] | None' = None, critical: 'str | Callable | list[str | Callable] | None' = None) -> None
249+
250+ Definition of action values.
251+
252+ Actions complement threshold values by defining what action should be taken when a threshold
253+ level is reached. The action can be a string or a `Callable`. When a string is used, it is
254+ interpreted as a message to be displayed. When a `Callable` is used, it will be invoked at
255+ interrogation time if the threshold level is met or exceeded.
256+
257+ There are three threshold levels: 'warning', 'error', and 'critical'. These levels correspond
258+ to different levels of severity when a threshold is reached. Those thresholds can be defined
259+ using the [`Thresholds`](`pointblank.Thresholds`) class or various shorthand forms. Actions
260+ don't have to be defined for all threshold levels; if an action is not defined for a level in
261+ exceedance, no action will be taken.
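
The dispatch rules described above (a string is displayed as a message, a `Callable` is invoked, a list holds several of either, and `None` means no action) can be illustrated with a small, library-independent sketch. The helper `run_actions()` and its `emit=` parameter are hypothetical names introduced only for illustration; this is not Pointblank's implementation.

```python
from typing import Callable, List, Optional, Union

# An action is either a message string or a zero-argument callable (assumption
# made for this sketch; the real callables may receive context from the library).
Action = Union[str, Callable[[], None]]


def run_actions(
    actions: Optional[Union[Action, List[Action]]],
    emit: Callable[[str], None] = print,
) -> int:
    """Hypothetical helper: run the action(s) defined for one threshold level.

    Returns the number of actions that were run.
    """
    # No action defined for this level: nothing happens
    if actions is None:
        return 0

    # A single string or callable is treated like a one-element list
    if not isinstance(actions, list):
        actions = [actions]

    for action in actions:
        if callable(action):
            action()  # callables are invoked
        else:
            emit(action)  # strings are displayed as messages

    return len(actions)


# Collect results in lists so the behavior is easy to inspect
messages: List[str] = []
calls: List[str] = []

count = run_actions(
    ["The 'warning' threshold was exceeded.", lambda: calls.append("notified")],
    emit=messages.append,
)
```

After this runs, `messages` holds the one string action and `calls` records that the callable was invoked once; passing `None` instead would run nothing.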
262+
263+ Parameters
264+ ----------
265+ warning
266+ A string, `Callable`, or list of `Callable`/string values for the 'warning' level. Using
267+ `None` means no action should be performed at the 'warning' level.
268+ error
269+ A string, `Callable`, or list of `Callable`/string values for the 'error' level. Using
270+ `None` means no action should be performed at the 'error' level.
271+ critical
272+ A string, `Callable`, or list of `Callable`/string values for the 'critical' level. Using
273+ `None` means no action should be performed at the 'critical' level.
274+
275+ Returns
276+ -------
277+ Actions
278+ An `Actions` object. This can be used when using the [`Validate`](`pointblank.Validate`)
279+ class (to set actions for meeting different threshold levels globally) or when defining
280+ validation steps like [`col_vals_gt()`](`pointblank.Validate.col_vals_gt`) (so that actions
281+ are scoped to individual validation steps, overriding any globally set actions).
282+
283+
248284Schema(columns: 'str | list[str] | list[tuple[str, str]] | list[tuple[str]] | dict[str, str] | None' = None, tbl: 'any | None' = None, **kwargs)
249285Definition of a schema object.
250286
@@ -491,6 +527,171 @@ Definition of a schema object.
491527 `Schema` object is used in a validation workflow.
492528
493529
530+ DraftValidation(data: 'FrameT | Any', model: 'str', api_key: 'str | None' = None) -> None
531+
532+ Draft a validation plan for a given table using an LLM.
533+
534+ By using a large language model (LLM) to draft a validation plan, you can quickly generate a
535+ starting point for validating a table. This can be useful when you have a new table and you
536+ want to get a sense of how to validate it (and adjustments could always be made later). The
537+ `DraftValidation` class uses the `chatlas` package to draft a validation plan for a given table
538+ using an LLM from either the `"anthropic"`, `"openai"`, or `"bedrock"` provider. You can install
539+ all requirements for the class by using an optional install of Pointblank via `pip install
540+ pointblank[generate]`.
541+
542+ :::{.callout-warning}
543+ The `DraftValidation()` class is still experimental. Please report any issues you encounter in
544+ the [Pointblank issue tracker](https://github.com/posit-dev/pointblank/issues).
545+ :::
546+
547+ Parameters
548+ ----------
549+ data
550+ The data to be used for drafting a validation plan.
551+ model
552+ The model to be used. This should be in the form of `provider:model` (e.g.,
553+ `"anthropic:claude-3-5-sonnet-latest"`). Supported providers are `"anthropic"`, `"openai"`,
554+ and `"bedrock"` (Amazon Bedrock).
555+ api_key
556+ The API key to be used for the model.
557+
558+ Returns
559+ -------
560+ str
561+ The drafted validation plan.
562+
563+ Constructing the `model` Argument
564+ ---------------------------------
565+ The `model=` argument should be constructed using the provider and model name separated by a
566+ colon. The provider can be `"anthropic"`, `"openai"`, or `"bedrock"`. The model name should be
567+ the specific model to be used (e.g., `"anthropic:claude-3-5-sonnet-latest"`). Model names are
568+ subject to change, so consult the provider's documentation for the most up-to-date model names.
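
As a rough illustration of the `provider:model` format, the following sketch splits such a string and checks the provider against the three providers named in this documentation. The helper `split_model_string()` is a hypothetical name for this example and is not part of Pointblank; how the library parses the argument internally may differ.

```python
from typing import Tuple

# Providers listed in this documentation
SUPPORTED_PROVIDERS = {"anthropic", "openai", "bedrock"}


def split_model_string(model: str) -> Tuple[str, str]:
    """Hypothetical helper: split 'provider:model' into its two parts."""
    # Split on the first colon only, since a model name may itself contain colons
    provider, sep, model_name = model.partition(":")

    if not sep or not provider or not model_name:
        raise ValueError(f"Expected 'provider:model', got {model!r}")

    if provider not in SUPPORTED_PROVIDERS:
        raise ValueError(f"Unsupported provider: {provider!r}")

    return provider, model_name


provider, model_name = split_model_string("anthropic:claude-3-5-sonnet-latest")
```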
569+
570+ Notes on Authentication
571+ -----------------------
572+ Providing a valid API key as a string in the `api_key` argument is adequate for getting started
573+ but you should consider using a more secure method for handling API keys.
574+
575+ One way to do this is to load the API key from an environment variable and retrieve it using the
576+ `os` module (specifically the `os.getenv()` function). Places to store the API key might
577+ include `.bashrc`, `.bash_profile`, `.zshrc`, or `.zsh_profile`.
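
A minimal sketch of the environment-variable approach is shown below. The helper name `get_api_key()` is hypothetical, and the variable name `ANTHROPIC_API_KEY` is assumed to have been exported in your shell profile beforehand.

```python
import os


def get_api_key(var_name: str = "ANTHROPIC_API_KEY") -> str:
    """Hypothetical helper: read an API key from an environment variable."""
    api_key = os.getenv(var_name)

    # Fail early with a clear message instead of passing None along
    if not api_key:
        raise RuntimeError(f"Environment variable {var_name} is not set")

    return api_key


# The key could then be passed to the `api_key=` argument, e.g.:
# pb.DraftValidation(data=data, model="anthropic:claude-3-5-sonnet-latest", api_key=get_api_key())
```

Raising an error when the variable is missing keeps a misconfigured environment from surfacing later as a less obvious authentication failure.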
578+
579+ Another solution is to store one or more model provider API keys in an `.env` file (in the root
580+ of your project). If the API keys have correct names (e.g., `ANTHROPIC_API_KEY` or
581+ `OPENAI_API_KEY`) then `DraftValidation` will automatically load the API key from the `.env` file
582+ and there's no need to provide the `api_key` argument. An `.env` file might look like this:
583+
584+ ```plaintext
585+ ANTHROPIC_API_KEY="your_anthropic_api_key_here"
586+ OPENAI_API_KEY="your_openai_api_key_here"
587+ ```
588+
589+ There's no need to have the `python-dotenv` package installed when using `.env` files in this
590+ way.
591+
592+ Notes on Data Sent to the Model Provider
593+ ----------------------------------------
594+ The data sent to the model provider is a JSON summary of the table. This data summary is
595+ generated internally by `DraftValidation` using the `DataScan` class. The summary includes the
596+ following information:
597+
598+ - the number of rows and columns in the table
599+ - the type of dataset (e.g., Polars, DuckDB, Pandas, etc.)
600+ - the column names and their types
601+ - column-level statistics such as the number of missing values, min, max, mean, and median
602+ - a short list of data values in each column
603+
604+ The JSON summary is used to provide the model with the necessary information to draft a
605+ validation plan. As such, even very large tables can be used with the `DraftValidation` class
606+ since only this summary (not the full table contents) is sent to the model provider.
607+
608+ Examples
609+ --------
610+ Let's look at how the `DraftValidation` class can be used to draft a validation plan for a
611+ table. The table to be used is `"nycflights"`, which is available via the
612+ [`load_dataset()`](`pointblank.load_dataset`) function. The model to be used is
613+ `"anthropic:claude-3-5-sonnet-latest"`. The example assumes that the API key is stored in an
614+ `.env` file as `ANTHROPIC_API_KEY`.
615+
616+ ```python
617+ import pointblank as pb
618+
619+ # Load the "nycflights" dataset as a DuckDB table
620+ data = pb.load_dataset(dataset="nycflights", tbl_type="duckdb")
621+
622+ # Draft a validation plan for the "nycflights" table
623+ pb.DraftValidation(data=data, model="anthropic:claude-3-5-sonnet-latest")
624+ ```
625+
626+ The output will be a drafted validation plan for the `"nycflights"` table and this will appear
627+ in the console.
628+
629+ ````plaintext
630+ ```python
631+ import pointblank as pb
632+
633+ # Define schema based on column names and dtypes
634+ schema = pb.Schema(columns=[
635+ ("year", "int64"),
636+ ("month", "int64"),
637+ ("day", "int64"),
638+ ("dep_time", "int64"),
639+ ("sched_dep_time", "int64"),
640+ ("dep_delay", "int64"),
641+ ("arr_time", "int64"),
642+ ("sched_arr_time", "int64"),
643+ ("arr_delay", "int64"),
644+ ("carrier", "string"),
645+ ("flight", "int64"),
646+ ("tailnum", "string"),
647+ ("origin", "string"),
648+ ("dest", "string"),
649+ ("air_time", "int64"),
650+ ("distance", "int64"),
651+ ("hour", "int64"),
652+ ("minute", "int64")
653+ ])
654+
655+ # The validation plan
656+ validation = (
657+ pb.Validate(
658+ data=your_data,
659+ label="Draft Validation",
660+ thresholds=pb.Thresholds(warning=0.10, error=0.25, critical=0.35)
661+ )
662+ .col_schema_match(schema=schema)
663+ .col_vals_not_null(columns=[
664+ "year", "month", "day", "sched_dep_time", "carrier", "flight",
665+ "origin", "dest", "distance", "hour", "minute"
666+ ])
667+ .col_vals_between(columns="month", left=1, right=12)
668+ .col_vals_between(columns="day", left=1, right=31)
669+ .col_vals_between(columns="sched_dep_time", left=106, right=2359)
670+ .col_vals_between(columns="dep_delay", left=-43, right=1301, na_pass=True)
671+ .col_vals_between(columns="air_time", left=20, right=695, na_pass=True)
672+ .col_vals_between(columns="distance", left=17, right=4983)
673+ .col_vals_between(columns="hour", left=1, right=23)
674+ .col_vals_between(columns="minute", left=0, right=59)
675+ .col_vals_in_set(columns="origin", set=["EWR", "LGA", "JFK"])
676+ .col_count_match(count=18)
677+ .row_count_match(count=336776)
678+ .rows_distinct()
679+ .interrogate()
680+ )
681+
682+ validation
683+ ```
684+ ````
685+
686+ The drafted validation plan can be copied and pasted into a Python script or notebook for
687+ further use. The generated plan is a starting point and can be adjusted as needed to suit the
688+ specific requirements of the table being validated.
689+
690+ Note that the output does not know how the data was obtained, so it uses the placeholder
691+ `your_data` in the `data=` argument of the `Validate` class. This should be replaced with the
692+ actual data variable.
693+
694+
494695
495696## The Validation Steps family
496697