[BUG] Spark Execution Engine: unexpected_index_list not returned in GX 1.x even with unexpected_index_column_names configured #11647

@raj-verma24

Description

When using the Spark execution engine in Great Expectations 1.x, the validation result does not include `unexpected_index_list`, even when `unexpected_index_column_names` is properly configured in the `result_format`.

This is a regression from GX 0.17.x, where these fields were returned correctly and contained full row data with the specified index columns.

```python
import great_expectations as gx
from great_expectations.expectations import ExpectColumnValuesToNotBeNull
from pyspark.sql import SparkSession

# Create Spark session and sample data
spark = SparkSession.builder.getOrCreate()
data = [
    ("ABC", "Broker A"),
    ("XYZ", None),      # This should fail - null broker_name
    ("MNP", None),      # This should fail - null broker_name
]
df = spark.createDataFrame(data, ["broker_code", "broker_name"])

# Setup GX 1.x context
context = gx.get_context(mode="ephemeral")
datasource = context.data_sources.add_spark(name="my_spark_datasource")
data_asset = datasource.add_dataframe_asset(name="my_data_asset")
batch_definition = data_asset.add_batch_definition_whole_dataframe(name="my_batch")

# Create expectation with result_format including unexpected_index_column_names
suite = gx.ExpectationSuite(name="my_suite")
suite.add_expectation(
    ExpectColumnValuesToNotBeNull(
        column="broker_name",
        result_format={
            "result_format": "COMPLETE",
            "unexpected_index_column_names": ["broker_code"]  # <-- Should return broker_code in results
        }
    )
)
suite = context.suites.add(suite)

# Run validation
validation_definition = gx.ValidationDefinition(
    name="my_validation",
    data=batch_definition,
    suite=suite,
)
validation_definition = context.validation_definitions.add(validation_definition)

checkpoint = gx.Checkpoint(
    name="my_checkpoint",
    validation_definitions=[validation_definition],
)
checkpoint = context.checkpoints.add(checkpoint)

result = checkpoint.run(batch_parameters={"dataframe": df})
validation_result = list(result.run_results.values())[0]

# Print the result
import json
print(json.dumps(validation_result.to_json_dict(), indent=2, default=str))
```

Expected behavior
GX 0.17.x returned (correct):

```json
"result": {
  "element_count": 3,
  "unexpected_count": 2,
  "unexpected_percent": 66.67,
  "partial_unexpected_list": [null, null],
  "partial_unexpected_index_list": [
    {"broker_code": "XYZ", "broker_name": null},
    {"broker_code": "MNP", "broker_name": null}
  ],
  "unexpected_index_list": [
    {"broker_code": "XYZ", "broker_name": null},
    {"broker_code": "MNP", "broker_name": null}
  ],
  "unexpected_index_query": "df.filter(F.expr(NOT (broker_name IS NOT NULL)))"
}
```

The `unexpected_index_list` included the column specified in `unexpected_index_column_names`, allowing us to identify which specific rows failed validation.

Actual behavior
GX 1.11.0 returns (missing index lists):

```json
"result": {
  "element_count": 100,
  "unexpected_count": 15,
  "unexpected_percent": 15.0,
  "partial_unexpected_list": [null, null, null, ...],
  "partial_unexpected_counts": [{"value": null, "count": 15}]
}
```

The following fields are completely missing:

  • `unexpected_index_list`
  • `partial_unexpected_index_list`
  • `unexpected_list`
  • `unexpected_index_query`

This makes it impossible to identify which specific rows failed validation when using Spark DataFrames.
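As a minimal sketch of the symptom, the missing fields can be detected programmatically by diffing the expected index-related keys against the `result` dict actually returned. The dict below mirrors the GX 1.11.0 output shown above; the values are illustrative.

```python
# Result dict as returned by GX 1.11.0 with the Spark execution engine
# (illustrative values, shape taken from the "Actual behavior" output above).
result = {
    "element_count": 3,
    "unexpected_count": 2,
    "unexpected_percent": 66.67,
    "partial_unexpected_list": [None, None],
    "partial_unexpected_counts": [{"value": None, "count": 2}],
}

# Index-related fields that GX 0.17.x included with result_format COMPLETE.
index_fields = {
    "unexpected_index_list",
    "partial_unexpected_index_list",
    "unexpected_list",
    "unexpected_index_query",
}

# All four are absent from the 1.x result.
missing = sorted(index_fields - result.keys())
print(missing)
```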

Impact
This is a breaking change for users migrating from GX 0.17.x to 1.x who rely on `unexpected_index_list` to:

  • Track which specific records failed data quality checks
  • Build failure reports with row-level identifiers
  • Join failure data back to source tables for remediation
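The join-back use case above can be approximated in plain Python once an `unexpected_index_list` is available again; here the 0.17.x-style output is simulated, and all row values are the illustrative ones from the repro:

```python
# Source rows, as in the repro DataFrame (illustrative data).
source_rows = [
    {"broker_code": "ABC", "broker_name": "Broker A"},
    {"broker_code": "XYZ", "broker_name": None},
    {"broker_code": "MNP", "broker_name": None},
]

# Simulated 0.17.x-style unexpected_index_list keyed on broker_code.
unexpected_index_list = [
    {"broker_code": "XYZ", "broker_name": None},
    {"broker_code": "MNP", "broker_name": None},
]

# Join the failing index values back to the source rows to build a
# row-level failure report for remediation.
failed_keys = {row["broker_code"] for row in unexpected_index_list}
failure_report = [row for row in source_rows if row["broker_code"] in failed_keys]
print([row["broker_code"] for row in failure_report])
```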

Environment
Great Expectations Version: 1.11.0
Execution Engine: Spark (PySpark)
Previously working in: 0.17.19
Python Version: 3.10
Operating System: Linux (AWS Glue)

Additional context
The `unexpected_index_column_names` configuration is passed correctly to the expectation (visible in `expectation_config.kwargs.result_format`), but the Spark execution engine is not honoring it and does not return the index data in the results.

This may be related to changes in metric computation for Spark in GX 1.x.
