Add docs on working with Great Expectations for data validation #5181
Conversation
Signed-off-by: Laura Couto <laurarccouto@gmail.com>
> ## Understanding Great Expectations
>
> Great Expectations version 1.0+ introduced a major API change: **everything is now done in Python code** rather than through CLI commands and YAML configuration files.
What was the main reason for this? I do kind of feel that the expectation definition would be great to define as part of the dataset (in the catalog yaml file), rather than buried inside a hook.
I don't know exactly what was behind this design decision, but that seems to be how GX suggests handling it in their latest version.
It can be seen in the examples they provide in their own documentation: https://docs.greatexpectations.io/docs/core/define_expectations/create_an_expectation?procedure=instructions
There's an option to store expectations in an external file, but not as part of the Kedro data catalog. I'll add this option to the document.
You could make a custom dataset for this, e.g., a GreatExpectationsDataset that wraps any of the other datasets, and applies the expectation checking before hitting save on the dataset.
Adding example to the mlflow plugin, which does something similar.
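The wrapper idea above can be sketched without any Kedro or GX imports. This is a minimal, hypothetical illustration of the pattern (the names `ValidatingDataset`, `validate_fn`, and the in-memory stand-in are all invented for this sketch, not real Kedro API):

```python
class ValidationError(Exception):
    pass


class ValidatingDataset:
    """Sketch of the wrapper idea: delegate load/save to an inner dataset
    and run a validation callable before anything is persisted."""

    def __init__(self, inner, validate_fn):
        self._inner = inner            # any object with load()/save(data)
        self._validate = validate_fn   # callable returning True when data passes

    def load(self):
        return self._inner.load()

    def save(self, data):
        if not self._validate(data):
            raise ValidationError("data failed validation; nothing was saved")
        self._inner.save(data)


class InMemoryDataset:
    """Stand-in for a real wrapped dataset (e.g. a CSV dataset)."""

    def __init__(self):
        self._data = None

    def load(self):
        return self._data

    def save(self, data):
        self._data = data


ds = ValidatingDataset(InMemoryDataset(), validate_fn=lambda rows: all(r >= 0 for r in rows))
ds.save([1, 2, 3])  # passes validation, so the inner dataset persists it
```

A real implementation would subclass Kedro's `AbstractDataset` and call into GX instead of a plain callable, but the control flow (validate, then delegate) would be the same.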
That's a good idea! Would you mind opening a Feature Request issue for it, so we can look further into it?
Jumping in without the context:
- GreatExpectationsDataset as a wrapper is feasible; it would follow a pattern similar to PartitionedDataset.
The downside of this is that it becomes a hard dependency coupled with the pipeline: you can no longer run the pipeline without GE.
Bundling the validation logic also makes the catalog very verbose. This is more of a subjective opinion; we did loads of research in the past and people do both. As the core Kedro team we try to keep the catalog as lean as possible.
I think it is better to implement this as a hook/plugin. Optionally, if you wish to keep the validation logic alongside catalog.yml, the plugin/hook can read from the metadata tag (which is optional). This keeps things flexible enough to support both cases and leaves the option to skip validation (and the dependency).
Why is that a problem? You are specifically defining the dataset as GE, so I don't see the issue with that. I do like the idea that the catalog entry provides expectations about the dataset to be used by others.
Alternatively, in our current project we have a small data type agnostic wrapper for Pandera, and we use decorators on our nodes to check the input/output. This keeps the validation close to the processing logic.
```python
@check_output(
    schema=DataFrameSchema(
        columns={
            "id": Column(T.StringType(), nullable=False),
            "is_drug": Column(T.BooleanType(), nullable=False),
            "is_disease": Column(T.BooleanType(), nullable=False),
        },
        unique=["id"],
    )
)
def prefilter_nodes(
    nodes: ps.DataFrame,
    gt: ps.DataFrame,
    drug_types: list[str],
    disease_types: list[str],
) -> ps.DataFrame:
    ...  # logic here
```
deepyaman left a comment
The content largely makes sense, but it's organized in a way that I find confusing. After reading through, I think I see that there are two main scenarios:
- You're a Kedro user who wants to get started validating their data
- You're using Kedro and Great Expectations, and you want to be able to integrate them
Scenario 1 can transition to Scenario 2.
For Scenario 1, you define some rules (somewhere, maybe it shouldn't be in the hook body, but leaving that aside for now) and execute them (using hooks; I think it would be most straightforward to just choose that approach for now rather than presenting multiple alternatives). It might be worth clarifying that Great Expectations constructs an ephemeral Data Context, and that nothing is persisted between runs (right?).
In reality, I would expect most Great Expectations users to persist their expectations; I think this is how they see you getting value out of the framework. I'm not 100% sure, but the Python-based, interactive Expectation Suite creation seems to be more part of the development workflow; in production you'd just load your existing data context.
Finally, this doesn't necessarily have to be in the scope of the initial guide, but it might be useful to show how you can see things in the GX UI and stuff.
Coming back to what the flow could look like, I think:
- Developing expectations interactively:
  - Open up a notebook using `kedro jupyter notebook`
  - View the catalog, load some dataset, see there are currently no expectations defined (probably there isn't even a Context, but the user doesn't necessarily need to know that)
  - Add some expectation, try running it, update it if it was mistaken/failed (the Great Expectations demo has a nice example of this for NYC taxi, where they update the max passengers from 4->6 because they didn't account for vans)
  - When you're ready, save the expectations
- Running expectations as part of your pipeline:
  - Define the hook
  - Run the pipeline with the expectation defined in part 1 (and maybe some others? not sure how important it is); watch it pass
  - Simulate a change to the data that would cause the expectation to fail
  - See what the output looks like for failure
- For GX Cloud users (optional, probably should be done as a collaboration with GX, I thought there was some interest):
  - Show stuff in the UI, like validation result history; TBH I haven't checked their offering lately
@rashidakanchwala I think at least the first two parts could be very similar to the flow for Pandera. There would be some differences (like with Pandera I think you have to explicitly write out the expectations as YAML, since it's not a framework in the same way as GX), but it should be pretty similar overall. In the end, I would expect whatever flow gets documented to also end up being the pathway towards what an eventual plugin probably looks like, saving a lot of the more manual work around saving and loading validation rules, etc.
(This is just my view on what this flow could look like; since a lot of people have used data validation libraries with Kedro in practice, would be great to see their thoughts on what resonates and what actually should be different, cc @lvijnck since you were already commenting on this thread :))
> ```python
> batch_request = asset.build_batch_request(options={"dataframe": data})
> batch = asset.get_batch(batch_request)
>
> suite = gx.ExpectationSuite(name=f"{name}_validation")
> ```
Should you be creating Expectation Suites dynamically? Why wouldn't a Great Expectations user already have their expectations organized into suites?
> - Allows parallel validation of different datasets
> - Makes it easier to skip validation for specific datasets
>
> ### Alternative: Using a file data context
If anything, this sounds like the more "standard" approach for a Great Expectations user? (Would be good to verify.)
Yeah, this part confused me a little bit. I'd assume an external file would be the standard, but looking at the GX docs, it looks like they suggest defining them as Python code (https://docs.greatexpectations.io/docs/core/define_expectations/create_an_expectation). Maybe they expect you to write an expectation module and import from it?
Co-authored-by: Deepyaman Datta <deepyaman.datta@utexas.edu> Signed-off-by: L. R. Couto <57910428+lrcouto@users.noreply.github.com>
Thanks for the feedback @deepyaman ! @rashidakanchwala and I were discussing what'd be the best way to structure both this one and the Pandera docs, and this insight is very valuable.
Hi @lrcouto , I could follow the doc pretty well and left a few [Nit] comments, but overall it's pretty good. Thank you
Thank you for the review! I've added the requested changes.
> ### The core concept: Expectations
>
> In Great Expectations, rules for data validation are called an **expectation**.

Suggested change:

```suggestion
In Great Expectations, a rule for data validation is called an **expectation**.
```
> # How to validate data in your Kedro workflow using Great Expectations
Since this page is quite long, I'd suggest adding an index at the top that outlines the topics you can find on this page.
> ### Defining expectations
>
> To keep the project organized, we can define data expectations in a dedicated Python module.

Suggested change:

```suggestion
To keep the project organised, we can define data expectations in a dedicated Python module.
```
> In this example we create a file in `src/spaceflights_great_expectations/expectations.py` and declare a dictionary that maps dataset names to the list of expectations that apply to them.
>
> We also include a convenience function get_suite(), which builds a Great Expectations ExpectationSuite object based on the rules defined in the dictionary.

Suggested change:

```suggestion
We also include a convenience function `get_suite()`, which builds a Great Expectations `ExpectationSuite` object based on the rules defined in the dictionary.
```
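The dictionary-plus-`get_suite()` module the quoted text describes could look roughly like this. The dictionary name, rule format, and return shape are all assumptions for illustration; a plain-dict stand-in is used here so the snippet runs without Great Expectations installed (with GX available, `get_suite()` would build an `ExpectationSuite` from the same entries):

```python
# Hypothetical mapping of dataset names to validation rules.
DATASET_EXPECTATIONS = {
    "companies": [
        {"type": "expect_column_values_to_not_be_null", "kwargs": {"column": "id"}},
    ],
    "reviews": [
        {"type": "expect_column_values_to_be_between",
         "kwargs": {"column": "rating", "min_value": 1, "max_value": 5}},
    ],
}


def get_suite(dataset_name: str) -> dict:
    """Collect the rules for one dataset; datasets without an entry get an
    empty rule list, which a hook can interpret as 'skip validation'."""
    return {
        "name": f"{dataset_name}_validation",
        "expectations": DATASET_EXPECTATIONS.get(dataset_name, []),
    }
```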
> By placing validation logic in hooks, you create a "safety net" that catches bad data without cluttering your pipeline definitions.
>
> To implement a hook, create or edit a file named hooks.py inside your project’s `src/spaceflights_great_expectations/` directory.

Suggested change:

```suggestion
To implement a hook, create or edit a file named `hooks.py` inside your project’s `src/spaceflights_great_expectations/` directory.
```
> ```python
> func=validate_datasets,
> inputs=["companies", "reviews", "shuttles"],
> outputs=None,
> name="validade_datasets_node",
> ```

Suggested change:

```suggestion
name="validate_datasets_node",
```
> ```bash
> mkdir great_expectations
> ```
>
> Then, initialize a context in Python:

Suggested change:

```suggestion
Then, initialise a context in Python:
```
rashidakanchwala left a comment
Looks good to me. Thanks @lrcouto
The bullets not rendering just seems like an issue with my machine, so I will look into it later.
Signed-off-by: L. R. Couto <57910428+lrcouto@users.noreply.github.com>
The requested changes were applied. Dismissing just so we can merge, since this PR is blocking a release.
Description
#5170
Development notes
Developer Certificate of Origin
We need all contributions to comply with the Developer Certificate of Origin (DCO). All commits must be signed off by including a `Signed-off-by` line in the commit message. See our wiki for guidance.
If your PR is blocked due to unsigned commits, then you must follow the instructions under "Rebase the branch" on the GitHub Checks page for your PR. This will retroactively add the sign-off to all unsigned commits and allow the DCO check to pass.
Checklist
`RELEASE.md` file