expect_compound_columns_to_be_unique fails on Spark DataFrame with Pandas 2.x #11633

@sonlac

Description

Describe the bug
expect_compound_columns_to_be_unique fails when validating Spark DataFrames with timestamp columns under Pandas 2.x. The error occurs because GX internally calls .toPandas(), which produces a unit-less datetime64 dtype; Pandas 2.x rejects this and requires an explicit precision such as datetime64[ns].

To Reproduce

import great_expectations as gx
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Create Spark DataFrame with timestamp
spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([
    ("2024-01-01", "A", 100),
    ("2024-01-02", "B", 200),
], ["event_date", "category", "value"])
df = df.withColumn("event_date", F.col("event_date").cast("timestamp"))

# Setup GX validation
context = gx.get_context(mode="ephemeral")
datasource = context.data_sources.add_spark("test_ds")
asset = datasource.add_dataframe_asset("test_asset")
batch_def = asset.add_batch_definition_whole_dataframe("batch")
batch = batch_def.get_batch(batch_parameters={"dataframe": df})

# Create suite with compound uniqueness check
suite = context.suites.add(gx.ExpectationSuite(name="test"))
suite.add_expectation(
    gx.expectations.ExpectCompoundColumnsToBeUnique(
        column_list=["event_date", "category"]
    )
)

# This fails with datetime64 precision error
result = batch.validate(suite)

Stack trace (abridged):

  File "great_expectations/expectations/metrics/map_metric_provider/multicolumn_map_condition_auxilliary_methods.py", line 261
    domain_values = filtered.select(column_selector).limit(limit).toPandas().to_dict("records")
  File "pandas/core/generic.py", line 6643, in astype
    new_data = self._mgr.astype(dtype=dtype, copy=copy, errors=errors)
  ...
ValueError: Passing in 'datetime64' dtype with no precision is not allowed. Please pass in 'datetime64[ns]' instead.

Root cause:
In great_expectations/expectations/metrics/map_metric_provider/multicolumn_map_condition_auxilliary_methods.py, function _spark_multicolumn_map_condition_values() at lines 257 and 261:

domain_values = filtered.select(column_selector).limit(limit).toPandas().to_dict("records")

PySpark's .toPandas() casts timestamp columns using a unit-less datetime64 dtype, which Pandas 2.0+ rejects; an explicit precision such as datetime64[ns] is required.
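For reference, the Pandas-side behavior can be shown without Spark or GX. The snippet below is a minimal sketch (the values and the explicit object dtype are illustrative); it casts an object column of Python datetimes to a unit-less datetime64 dtype, which is the kind of cast the traceback above points at, and Pandas 2.x raises the same ValueError:

# Minimal sketch of the underlying Pandas 2.x behavior, independent of Spark and GX.
# Values and the explicit object dtype are illustrative.
import datetime

import numpy as np
import pandas as pd

# Object-dtype column of Python datetimes, similar to one built from collected Spark rows
s = pd.Series(
    [datetime.datetime(2024, 1, 1), datetime.datetime(2024, 1, 2)],
    dtype=object,
)

try:
    s.astype(np.datetime64)  # unit-less datetime64 dtype: rejected by Pandas 2.x
except ValueError as exc:
    print(exc)  # "Passing in 'datetime64' dtype with no precision is not allowed. ..."

s.astype("datetime64[ns]")  # explicit precision is accepted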

Expected behavior
Validation should succeed, returning a pass/fail result with metrics. Compound uniqueness checks should work with Spark DataFrames containing timestamp columns under Pandas 2.x.

Environment:

  • Operating System: Linux (verified on RHEL 8.9)
  • Great Expectations Version: 1.11.3 (latest stable)
  • Data Source: PySpark 3.1.3
  • Additional: Pandas 2.2.3, Python 3.10
  • Environment: Yarn/HDFS

Impact:

  • Affects any Spark DataFrame validation with compound uniqueness on timestamp columns when using Pandas 2.x
  • Workaround: Convert Spark → Pandas and validate the pandas copy (loses native Spark execution benefits); a sketch follows this list
  • Related to Pandas 2.0 breaking changes requiring explicit datetime precision
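
A hedged sketch of that workaround, reusing the spark, df, context, and suite objects from the reproduction above. The datasource/asset/batch names are illustrative, and it assumes the Spark → Pandas conversion itself succeeds in this environment (here via Arrow-based conversion, which is assumed to avoid the unit-less cast):

# Hedged sketch of the workaround above: validate a pandas copy of the DataFrame.
# Names ("pandas_ds", "pandas_asset", "pandas_batch") are illustrative; native Spark
# execution is lost.
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")  # assumption: Arrow path avoids the unit-less datetime64 cast
pandas_df = df.toPandas()

pd_source = context.data_sources.add_pandas("pandas_ds")
pd_asset = pd_source.add_dataframe_asset("pandas_asset")
pd_batch_def = pd_asset.add_batch_definition_whole_dataframe("pandas_batch")
pd_batch = pd_batch_def.get_batch(batch_parameters={"dataframe": pandas_df})

result = pd_batch.validate(suite)  # same suite as in the reproduction above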
