Add docs on working with Great Expectations for data validation #5181
Conversation
Signed-off-by: Laura Couto <laurarccouto@gmail.com>
> ## Understanding Great Expectations
>
> Great Expectations version 1.0+ introduced a major API change: **everything is now done in Python code** rather than through CLI commands and YAML configuration files.
What was the main reason for this? I do kind of feel that the expectation definition would be great to define as part of the dataset (in the catalog yaml file), rather than buried inside a hook.
I don't know exactly what was behind this design decision, but that seems to be how GX suggests handling it in their latest version.
It can be seen in the examples they provide in their own documentation: https://docs.greatexpectations.io/docs/core/define_expectations/create_an_expectation?procedure=instructions
There's an option to store expectations in an external file, but not as part of the Kedro data catalog. I'll add this option to the document.
You could make a custom dataset for this, e.g., a GreatExpectationsDataset that wraps any of the other datasets, and applies the expectation checking before hitting save on the dataset.
Adding example to the mlflow plugin, which does something similar.
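The wrapper idea above can be sketched without any Kedro or GX imports. This is a minimal, hypothetical illustration of the pattern (the names `ValidatingDataset`, `validate_fn`, and the in-memory stand-in are all invented for this sketch, not real Kedro API):

```python
class ValidationError(Exception):
    pass


class ValidatingDataset:
    """Sketch of the wrapper idea: delegate load/save to an inner dataset
    and run a validation callable before anything is persisted."""

    def __init__(self, inner, validate_fn):
        self._inner = inner            # any object with load()/save(data)
        self._validate = validate_fn   # callable returning True when data passes

    def load(self):
        return self._inner.load()

    def save(self, data):
        if not self._validate(data):
            raise ValidationError("data failed validation; nothing was saved")
        self._inner.save(data)


class InMemoryDataset:
    """Stand-in for a real wrapped dataset (e.g. a CSV dataset)."""

    def __init__(self):
        self._data = None

    def load(self):
        return self._data

    def save(self, data):
        self._data = data


ds = ValidatingDataset(InMemoryDataset(), validate_fn=lambda rows: all(r >= 0 for r in rows))
ds.save([1, 2, 3])  # passes validation, so the inner dataset persists it
```

A real implementation would subclass Kedro's `AbstractDataset` and call into GX instead of a plain callable, but the control flow (validate, then delegate) would be the same.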
That's a good idea! Would you mind opening a Feature Request issue for it, so we can look further into it?
Jumping in without the context:
- GreatExpectationsDataset as a wrapper is feasible; it would follow a pattern similar to PartitionedDataset.
The downside of this is that it becomes a hard dependency coupled with the pipeline: you can no longer run the pipeline without GE.
Bundling the validation logic also makes the catalog very verbose. This is more of a subjective opinion; we did loads of research in the past and people do both. As the core Kedro team we try to keep the catalog as lean as possible.
I think it is better to implement this as a hook/plugin. Optionally, if you wish to keep the validation logic alongside catalog.yml, the plugin/hook can read from the metadata tag (which is optional). This keeps things flexible enough to support both cases and leaves the option to skip validation (and the dependency).
Why is that a problem? You are specifically defining the dataset as GE, so I don't see the issue with that. I do like the idea that the catalog entry provides expectations about the dataset to be used by others.
Alternatively, in our current project we have a small data type agnostic wrapper for Pandera, and we use decorators on our nodes to check the input/output. This keeps the validation close to the processing logic.
```python
@check_output(
    schema=DataFrameSchema(
        columns={
            "id": Column(T.StringType(), nullable=False),
            "is_drug": Column(T.BooleanType(), nullable=False),
            "is_disease": Column(T.BooleanType(), nullable=False),
        },
        unique=["id"],
    )
)
def prefilter_nodes(
    nodes: ps.DataFrame,
    gt: ps.DataFrame,
    drug_types: list[str],
    disease_types: list[str],
) -> ps.DataFrame:
    ...  # logic here
```
deepyaman left a comment
The content largely makes sense, but it's organized in a way that I find confusing. After reading through, I think I see that there are two main scenarios:
- You're a Kedro user who wants to get started validating their data
- You're using Kedro and Great Expectations, and you want to be able to integrate them
Scenario 1 can transition to Scenario 2.
For Scenario 1, you define some rules (somewhere, maybe it shouldn't be in the hook body, but leaving that aside for now) and execute them (using hooks; I think it would be most straightforward to just choose that approach for now rather than presenting multiple alternatives). It might be worth clarifying that Great Expectations constructs an ephemeral Data Context, and that nothing is persisted between runs (right?).
In reality, I would expect most Great Expectations users to persist their expectations; I think this is how they see you getting value out of the framework. I'm not 100% sure, but the Python-based, interactive Expectation Suite creation seems to be more part of the development workflow; in production you'd just load your existing data context.
Finally, this doesn't necessarily have to be in the scope of the initial guide, but it might be useful to show how you can see things in the GX UI and stuff.
Coming back to what the flow could look like, I think:
- Developing expectations interactively:
  - Open up a notebook using `kedro jupyter notebook`
  - View the catalog, load some dataset, see there are currently no expectations defined (probably there isn't even a Context, but the user doesn't necessarily need to know that)
  - Add some expectation, try running it, update it if it was mistaken/failed (the Great Expectations demo has a nice example of this for NYC taxi, where they update the max passengers from 4->6 because they didn't account for vans)
  - When you're ready, save the expectations
- Running expectations as part of your pipeline:
  - Define the hook
  - Run the pipeline with the expectation defined in part 1 (and maybe some others? not sure how important it is); watch it pass
  - Simulate a change to the data that would cause the expectation to fail
  - See what the output looks like for failure
- For GX Cloud users (optional, probably should be done as a collaboration with GX, I thought there was some interest):
  - Show stuff in the UI, like validation result history; TBH I haven't checked their offering lately
@rashidakanchwala I think at least the first two parts could be very similar to the flow for Pandera. There would be some differences (like with Pandera I think you have to explicitly write out the expectations as YAML, since it's not a framework in the same way as GX), but it should be pretty similar overall. In the end, I would expect whatever flow gets documented to also end up being the pathway towards what an eventual plugin probably looks like, saving a lot of the more manual work around saving and loading validation rules, etc.
(This is just my view on what this flow could look like; since a lot of people have used data validation libraries with Kedro in practice, would be great to see their thoughts on what resonates and what actually should be different, cc @lvijnck since you were already commenting on this thread :))
> ```python
> batch_request = asset.build_batch_request(options={"dataframe": data})
> batch = asset.get_batch(batch_request)
>
> suite = gx.ExpectationSuite(name=f"{name}_validation")
> ```
Should you be creating Expectation Suites dynamically? Why wouldn't a Great Expectations user already have their expectations organized into suites?
> - Allows parallel validation of different datasets
> - Makes it easier to skip validation for specific datasets
>
> ### Alternative: Using a file data context
If anything, this sounds like the more "standard" approach for a Great Expectations user? (Would be good to verify.)
Yeah, this part confused me a little bit. I'd assume an external file would be the standard, but looking at the GX docs, it looks like they suggest defining them as Python code (https://docs.greatexpectations.io/docs/core/define_expectations/create_an_expectation). Maybe they expect you to write an expectation module and import from it?
Co-authored-by: Deepyaman Datta <deepyaman.datta@utexas.edu> Signed-off-by: L. R. Couto <57910428+lrcouto@users.noreply.github.com>
Thanks for the feedback @deepyaman ! @rashidakanchwala and I were discussing what'd be the best way to structure both this one and the Pandera docs, and this insight is very valuable.
Hi @lrcouto , I could follow the doc pretty well and left a few [Nit] comments, but overall it's pretty good. Thank you
Thank you for the review! I've added the requested changes.
> ### The core concept: Expectations
>
> In Great Expectations, rules for data validation are called an **expectation**.

Suggested change:

```suggestion
In Great Expectations, a rule for data validation is called an **expectation**.
```
> # How to validate data in your Kedro workflow using Great Expectations
Since this page is quite long, I'd suggest adding an index at the top that outlines the topics you can find on this page.
> ### Defining expectations
>
> To keep the project organized, we can define data expectations in a dedicated Python module.

Suggested change:

```suggestion
To keep the project organised, we can define data expectations in a dedicated Python module.
```
> In this example we create a file in `src/spaceflights_great_expectations/expectations.py` and declare a dictionary that maps dataset names to the list of expectations that apply to them.
>
> We also include a convenience function get_suite(), which builds a Great Expectations ExpectationSuite object based on the rules defined in the dictionary.

Suggested change:

```suggestion
We also include a convenience function `get_suite()`, which builds a Great Expectations `ExpectationSuite` object based on the rules defined in the dictionary.
```
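The dictionary-plus-`get_suite()` module the quoted text describes could look roughly like this. The dictionary name, rule format, and return shape are all assumptions for illustration; a plain-dict stand-in is used here so the snippet runs without Great Expectations installed (with GX available, `get_suite()` would build an `ExpectationSuite` from the same entries):

```python
# Hypothetical mapping of dataset names to validation rules.
DATASET_EXPECTATIONS = {
    "companies": [
        {"type": "expect_column_values_to_not_be_null", "kwargs": {"column": "id"}},
    ],
    "reviews": [
        {"type": "expect_column_values_to_be_between",
         "kwargs": {"column": "rating", "min_value": 1, "max_value": 5}},
    ],
}


def get_suite(dataset_name: str) -> dict:
    """Collect the rules for one dataset; datasets without an entry get an
    empty rule list, which a hook can interpret as 'skip validation'."""
    return {
        "name": f"{dataset_name}_validation",
        "expectations": DATASET_EXPECTATIONS.get(dataset_name, []),
    }
```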
> By placing validation logic in hooks, you create a "safety net" that catches bad data without cluttering your pipeline definitions.
>
> To implement a hook, create or edit a file named hooks.py inside your project’s `src/spaceflights_great_expectations/` directory.

Suggested change:

```suggestion
To implement a hook, create or edit a file named `hooks.py` inside your project’s `src/spaceflights_great_expectations/` directory.
```
> ```python
> func=validate_datasets,
> inputs=["companies", "reviews", "shuttles"],
> outputs=None,
> name="validade_datasets_node",
> ```

Suggested change:

```suggestion
name="validate_datasets_node",
```
> ```bash
> mkdir great_expectations
> ```
>
> Then, initialize a context in Python:

Suggested change:

```suggestion
Then, initialise a context in Python:
```
rashidakanchwala left a comment
Looks good to me. Thanks @lrcouto
The bullets not rendering just seems like an issue with my machine, so I will look into it later.
Signed-off-by: L. R. Couto <57910428+lrcouto@users.noreply.github.com>
The requested changes were applied. Dismissing just so we can merge, since this PR is blocking a release.
Description
#5170
Development notes
Developer Certificate of Origin
We need all contributions to comply with the Developer Certificate of Origin (DCO). All commits must be signed off by including a `Signed-off-by` line in the commit message. See our wiki for guidance.
If your PR is blocked due to unsigned commits, then you must follow the instructions under "Rebase the branch" on the GitHub Checks page for your PR. This will retroactively add the sign-off to all unsigned commits and allow the DCO check to pass.
Checklist
`RELEASE.md` file