Description
When using the Spark Execution Engine in Great Expectations 1.x, the validation result does not include unexpected_index_list, even when unexpected_index_column_names is properly configured in result_format.
This is a regression from GX 0.17.x, where these fields were returned correctly and contained full row data with the specified index columns.
import json

import great_expectations as gx
from great_expectations.expectations import ExpectColumnValuesToNotBeNull
from pyspark.sql import SparkSession

# Create Spark session and sample data
spark = SparkSession.builder.getOrCreate()
data = [
    ("ABC", "Broker A"),
    ("XYZ", None),  # This should fail - null broker_name
    ("MNP", None),  # This should fail - null broker_name
]
df = spark.createDataFrame(data, ["broker_code", "broker_name"])

# Setup GX 1.x context
context = gx.get_context(mode="ephemeral")
datasource = context.data_sources.add_spark(name="my_spark_datasource")
data_asset = datasource.add_dataframe_asset(name="my_data_asset")
batch_definition = data_asset.add_batch_definition_whole_dataframe(name="my_batch")

# Create expectation with result_format including unexpected_index_column_names
suite = gx.ExpectationSuite(name="my_suite")
suite.add_expectation(
    ExpectColumnValuesToNotBeNull(
        column="broker_name",
        result_format={
            "result_format": "COMPLETE",
            "unexpected_index_column_names": ["broker_code"],  # <-- Should return broker_code in results
        },
    )
)
suite = context.suites.add(suite)

# Run validation
validation_definition = gx.ValidationDefinition(
    name="my_validation",
    data=batch_definition,
    suite=suite,
)
validation_definition = context.validation_definitions.add(validation_definition)
checkpoint = gx.Checkpoint(
    name="my_checkpoint",
    validation_definitions=[validation_definition],
)
checkpoint = context.checkpoints.add(checkpoint)
result = checkpoint.run(batch_parameters={"dataframe": df})
validation_result = list(result.run_results.values())[0]

# Print the result
print(json.dumps(validation_result.to_json_dict(), indent=2, default=str))
==========
Expected behavior
GX 0.17.x returned (correct):
"result": {
  "element_count": 3,
  "unexpected_count": 2,
  "unexpected_percent": 66.67,
  "partial_unexpected_list": [null, null],
  "partial_unexpected_index_list": [
    {"broker_code": "XYZ", "broker_name": null},
    {"broker_code": "MNP", "broker_name": null}
  ],
  "unexpected_index_list": [
    {"broker_code": "XYZ", "broker_name": null},
    {"broker_code": "MNP", "broker_name": null}
  ],
  "unexpected_index_query": "df.filter(F.expr(NOT (broker_name IS NOT NULL)))"
}
The unexpected_index_list contains the column specified in unexpected_index_column_names (broker_code in this example), allowing us to identify which specific rows failed validation.
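To make the expected shape concrete, here is a minimal plain-Python sketch of how those entries relate to the input rows. It is only an illustration of the 0.17.x output shown above, not GX's implementation; the helper name build_unexpected_index_list is hypothetical.

```python
def build_unexpected_index_list(rows, column, index_columns):
    """Illustrative only: one dict per row whose `column` is None,
    carrying the requested index columns plus the failing value."""
    return [
        {**{name: row[name] for name in index_columns}, column: row[column]}
        for row in rows
        if row[column] is None
    ]


# Same three rows as the reproduction script, written as dicts.
rows = [
    {"broker_code": "ABC", "broker_name": "Broker A"},
    {"broker_code": "XYZ", "broker_name": None},
    {"broker_code": "MNP", "broker_name": None},
]

index_list = build_unexpected_index_list(rows, "broker_name", ["broker_code"])
print(index_list)
```

For the sample data this yields the two entries (XYZ and MNP) that 0.17.x reported in unexpected_index_list.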
Actual behavior
GX 1.11.0 returns (missing index lists):
"result": {
  "element_count": 100,
  "unexpected_count": 15,
  "unexpected_percent": 15.0,
  "partial_unexpected_list": [null, null, null, ...],
  "partial_unexpected_counts": [{"value": null, "count": 15}]
}
The following fields are completely missing:
- unexpected_index_list
- partial_unexpected_index_list
- unexpected_list
- unexpected_index_query
This makes it impossible to identify which specific rows failed validation when using Spark DataFrames.
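A quick way to confirm the gap on a given result is to check the "result" dict for the four keys listed above. The sketch below is plain Python and does not call GX; the helper name missing_index_fields is hypothetical, and the sample dict is an abridged copy of the 1.11.0 output above (the elided partial_unexpected_list is left out).

```python
# Keys reported as missing from the Spark result in this issue.
INDEX_FIELDS = (
    "unexpected_index_list",
    "partial_unexpected_index_list",
    "unexpected_list",
    "unexpected_index_query",
)


def missing_index_fields(result):
    """Return the names in INDEX_FIELDS that are absent from a result dict."""
    return [field for field in INDEX_FIELDS if field not in result]


# Abridged copy of the "result" block returned by GX 1.11.0 above.
result_1_11 = {
    "element_count": 100,
    "unexpected_count": 15,
    "unexpected_percent": 15.0,
    "partial_unexpected_counts": [{"value": None, "count": 15}],
}

print(missing_index_fields(result_1_11))
```

The same helper could be pointed at the "result" entry of each expectation result in validation_result.to_json_dict() from the reproduction script.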
Impact
This is a breaking change for users migrating from GX 0.17.x to 1.x who rely on unexpected_index_list to:
- Track which specific records failed data quality checks
- Build failure reports with row-level identifiers
- Join failure data back to source tables for remediation
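Until the index lists are returned again, the failing rows for this particular not-null check can be recovered directly from the DataFrame. This is a workaround sketch, not GX behavior: the helper name failing_index_rows is hypothetical, and it only covers null checks like the one in the reproduction script. It relies on the standard PySpark calls DataFrame.filter, DataFrame.select, and Column.isNull.

```python
def failing_index_rows(df, column, index_columns):
    """Workaround sketch: select the index columns (plus the checked column)
    for every row where `column` is null.

    `df` is expected to be a pyspark.sql.DataFrame; nothing here calls GX.
    """
    return df.filter(df[column].isNull()).select(*index_columns, column)


# Usage with the DataFrame from the reproduction script (requires Spark):
#   failures = failing_index_rows(df, "broker_name", ["broker_code"])
#   failures.show()
```

The returned DataFrame can then be joined back to the source table on broker_code for remediation.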
Environment
Great Expectations Version: 1.11.0
Execution Engine: Spark (PySpark)
Previously working in: 0.17.19
Python Version: 3.10
Operating System: Linux (AWS Glue)
Additional context
The unexpected_index_column_names configuration is being passed correctly to the expectation (visible in expectation_config.kwargs.result_format), but the Spark execution engine is not honoring it and does not return the index data in the results.
This may be related to changes in metric computation for Spark in GX 1.x.