Description
Describe the bug
expect_compound_columns_to_be_unique fails when validating Spark DataFrames with timestamp columns under Pandas 2.x. The error occurs because GX internally calls .toPandas(), which yields a unit-less datetime64 dtype that Pandas 2.x rejects (an explicit precision such as datetime64[ns] is required).
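For context, the Pandas-side restriction can be reproduced in isolation (a minimal sketch, independent of GX and Spark; the exact exception type and message vary slightly between Pandas versions):
import pandas as pd

s = pd.Series(["2024-01-01", "2024-01-02"])
try:
    s.astype("datetime64")  # unit-less dtype; rejected under Pandas 2.x
except (TypeError, ValueError) as exc:
    print(exc)  # asks for an explicit unit, e.g. datetime64[ns]
print(s.astype("datetime64[ns]").dtype)  # explicit precision is accepted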
To Reproduce
import great_expectations as gx
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
# Create Spark DataFrame with timestamp
spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([
("2024-01-01", "A", 100),
("2024-01-02", "B", 200),
], ["event_date", "category", "value"])
df = df.withColumn("event_date", F.col("event_date").cast("timestamp"))
# Setup GX validation
context = gx.get_context(mode="ephemeral")
datasource = context.data_sources.add_spark("test_ds")
asset = datasource.add_dataframe_asset("test_asset")
batch_def = asset.add_batch_definition_whole_dataframe("batch")
batch = batch_def.get_batch(batch_parameters={"dataframe": df})
# Create suite with compound uniqueness check
suite = context.suites.add(gx.ExpectationSuite(name="test"))
suite.add_expectation(
gx.expectations.ExpectCompoundColumnsToBeUnique(
column_list=["event_date", "category"]
)
)
# This fails with datetime64 precision error
result = batch.validate(suite)

Stack trace:
ValueError: Passing in 'datetime64' dtype with no precision is not allowed. Please pass in 'datetime64[ns]' instead.
Traceback:
File "great_expectations/expectations/metrics/map_metric_provider/multicolumn_map_condition_auxilliary_methods.py", line 261
domain_values = filtered.select(column_selector).limit(limit).toPandas().to_dict("records")
File "pandas/core/generic.py", line 6643, in astype
new_data = self._mgr.astype(dtype=dtype, copy=copy, errors=errors)
...
ValueError: Passing in 'datetime64' dtype with no precision is not allowed.
Root cause:
In great_expectations/expectations/metrics/map_metric_provider/multicolumn_map_condition_auxilliary_methods.py, function _spark_multicolumn_map_condition_values() at lines 257 and 261:
domain_values = filtered.select(column_selector).limit(limit).toPandas().to_dict("records")
PySpark's .toPandas() produces a unit-less datetime64 dtype for timestamp columns, but Pandas 2.0+ requires an explicit precision such as datetime64[ns].
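A possible mitigation on the GX side (a sketch only, not a tested patch; filtered, column_selector, and limit are the names used in that function) would be to skip the Pandas round-trip and build the record dicts directly from Spark rows:
rows = filtered.select(column_selector).limit(limit).collect()
domain_values = [row.asDict() for row in rows]  # avoids the unit-less datetime64 cast inside toPandas()
Another avenue (not verified here) may be enabling Arrow-based conversion in Spark, which takes a different toPandas() code path.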
Expected behavior
Validation should succeed, returning success/failure status with metrics. Compound uniqueness checks should work seamlessly with Spark DataFrames containing timestamp columns when using Pandas 2.x.
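For reference, a successful run would look roughly like the following (a sketch; attribute names as exposed on GX's suite validation result object):
print(result.success)     # True when (event_date, category) pairs are unique
print(result.statistics)  # counts of evaluated/successful expectations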
Environment:
- Operating System: Linux (verified on RHEL 8.9)
- Great Expectations Version: 1.11.3 (latest stable)
- Data Source: PySpark 3.1.3
- Additional: Pandas 2.2.3, Python 3.10
- Environment: YARN/HDFS
Impact:
- Affects any Spark DataFrame validation with compound uniqueness on timestamp columns when using Pandas 2.x
- Workaround: convert the Spark DataFrame to Pandas before validation (loses native Spark execution benefits); see the sketch after this list
- Related to Pandas 2.0 breaking changes requiring explicit datetime precision
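The workaround currently in use looks roughly like this (a sketch only; it rebuilds the data in Pandas via collect() to sidestep the failing toPandas() conversion, then validates against a Pandas data source — the pandas_ds / pandas_asset names are illustrative):
import pandas as pd

rows = df.collect()
pdf = pd.DataFrame([row.asDict() for row in rows])
pdf["event_date"] = pd.to_datetime(pdf["event_date"])  # force explicit datetime64[ns]

pandas_ds = context.data_sources.add_pandas("pandas_ds")
pandas_asset = pandas_ds.add_dataframe_asset("pandas_asset")
pandas_batch = pandas_asset.add_batch_definition_whole_dataframe("pandas_batch").get_batch(
    batch_parameters={"dataframe": pdf}
)
result = pandas_batch.validate(suite)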