There is an issue in the implementation of DataFrame.distinct in snowpark 1.28.0.
It places previously defined ORDER BY calls in the DISTINCT query, which causes an error when the column is not the set of distinct columns.
For example, this snippet:
customers = session.table("snowflake_sample_data.tpch_sf1.customer").order_by("c_name")
marketing_segments = customers.select("c_mktsegment").distinct().collect()
Causes this error:
SQL compilation error: [CUSTOMER.C_NAME] is not a valid order by expression
The generated query in snowpark 1.28.0:
SELECT DISTINCT "C_MKTSEGMENT"
FROM snowflake_sample_data.tpch_sf1.customer
ORDER BY "C_NAME" ASC NULLS FIRST
While nicer, the query is incorrect since C_NAME is not in the distinct result.
The generated query in snowpark 1.27.0 was:
SELECT "C_MKTSEGMENT"
FROM (
SELECT "C_MKTSEGMENT"
FROM snowflake_sample_data.tpch_sf1.customer
ORDER BY "C_NAME" ASC NULLS FIRST
)
GROUP BY "C_MKTSEGMENT"
Details
- Python version:
3.11.11
- System:
macOS-15.3.1-x86_64-i386-64bit
There is an issue in the implementation of
DataFrame.distinctin snowpark 1.28.0.It places previously defined
ORDER BYcalls in theDISTINCTquery, which causes an error when the column is not the set of distinct columns.For example, this snippet:
Causes this error:
The generated query in snowpark 1.28.0:
While nicer, the query is incorrect since
C_NAMEis not in the distinct result.The generated query in snowpark 1.27.0 was:
Details
3.11.11macOS-15.3.1-x86_64-i386-64bit