Skip to content

SNOW-1959601: Bug - order_by before distinct #3106

@adrien-turiot-maxa

Description

@adrien-turiot-maxa

There is an issue in the implementation of DataFrame.distinct in snowpark 1.28.0.
It places previously defined ORDER BY calls in the DISTINCT query, which causes an error when the column is not the set of distinct columns.

For example, this snippet:

customers = session.table("snowflake_sample_data.tpch_sf1.customer").order_by("c_name")
marketing_segments = customers.select("c_mktsegment").distinct().collect()

Causes this error:

SQL compilation error: [CUSTOMER.C_NAME] is not a valid order by expression

The generated query in snowpark 1.28.0:

SELECT DISTINCT "C_MKTSEGMENT" 
FROM snowflake_sample_data.tpch_sf1.customer
ORDER BY "C_NAME" ASC NULLS FIRST

While nicer, the query is incorrect since C_NAME is not in the distinct result.

The generated query in snowpark 1.27.0 was:

SELECT "C_MKTSEGMENT" 
FROM ( 
    SELECT "C_MKTSEGMENT" 
    FROM snowflake_sample_data.tpch_sf1.customer 
    ORDER BY "C_NAME" ASC NULLS FIRST
) 
GROUP BY "C_MKTSEGMENT"

Details

  • Python version: 3.11.11
  • System: macOS-15.3.1-x86_64-i386-64bit

Metadata

Metadata

Assignees

Labels

bugSomething isn't workingtriaged

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions