
SNOW-3266495: Defer HAVING, ORDER BY, LIMIT clauses in SCOS compatibility mode #4132

Open

sfc-gh-joshi wants to merge 3 commits into main from joshi-SNOW-3266495-having-order

Conversation

sfc-gh-joshi (Contributor) commented Mar 23, 2026

  1. Which Jira issue is this PR addressing? Make sure that there is an accompanying issue to your PR.

    Fixes SNOW-3266495

  2. Fill out the following pre-review checklist:

    • I am adding a new automated test(s) to verify correctness of my new code
      • If this test skips Local Testing mode, I'm requesting review from @snowflakedb/local-testing
    • I am adding new logging messages
    • I am adding a new telemetry message
    • I am adding new credentials
    • I am adding a new dependency
    • If this is a new feature/behavior, I'm adding the Local Testing parity changes.
    • I acknowledge that I have ensured my changes to be thread-safe. Follow the link for more information: Thread-safe Developer Guidelines
    • If adding any arguments to public Snowpark APIs or creating new public Snowpark APIs, I acknowledge that I have ensured my changes include AST support. Follow the link for more information: AST Support Guidelines
  3. Please describe how your code solves the related issue.

Snowflake SQL requires HAVING, ORDER BY, and LIMIT clauses of a GROUP BY statement to appear in that particular order. For example:

SELECT
    "DEPT",
    count(1) AS "HEADCOUNT",
    avg("SALARY") AS "AVG_SALARY"
FROM (
    SELECT
        "ID",
        "DEPT",
        "SALARY"
    FROM (
        SELECT $1 AS "ID", $2 AS "DEPT", $3 AS "SALARY"
        FROM VALUES
            (1 :: INT, 'engineering' :: STRING, 80000 :: INT),
            (2 :: INT, 'engineering' :: STRING, 90000 :: INT),
            (3 :: INT, 'sales' :: STRING, 50000 :: INT),
            (4 :: INT, 'sales' :: STRING, 60000 :: INT),
            (5 :: INT, 'hr' :: STRING, 45000 :: INT),
            (6 :: INT, 'hr' :: STRING, 55000 :: INT),
            (7 :: INT, 'engineering' :: STRING, 85000 :: INT)
    )
)
GROUP BY
    "DEPT"
HAVING ("HEADCOUNT" > 1)
ORDER BY "AVG_SALARY" DESC NULLS LAST
LIMIT 2 OFFSET 0

Rearranging any of these clauses produces a syntax error. Currently, when _is_snowpark_connect_compatible_mode is set, the order in which these clauses are emitted depends on the order in which the user specifies them. That is, df.groupBy(...).agg(...).sort(...).filter(...) would emit ORDER BY before HAVING, resulting in a compilation error.

This PR defers the generation of these clauses for GROUP BY aggregations. Instead of appending each clause to the analyzer tree when a method is called, the new DataFrame._build_post_agg_df method ensures the clauses are added in the correct order.
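
The deferral can be illustrated with a minimal, self-contained sketch. All names here are hypothetical and the logic is simplified; the actual change lives in Snowpark's DataFrame._build_post_agg_df and the analyzer, not in this toy class:

```python
# Toy sketch of deferred post-aggregation clause generation.
# Instead of appending SQL eagerly per method call, each call returns a new
# state object, and to_sql() emits clauses in the order Snowflake requires.

class PostAggState:
    """Accumulates HAVING / ORDER BY / LIMIT instead of emitting SQL eagerly."""

    def __init__(self, base_sql, having=None, order_by=None, row_limit=None):
        self.base_sql = base_sql
        self.having = having
        self.order_by = order_by or []
        self.row_limit = row_limit

    def filter(self, condition):
        # Conjoin consecutive filters into a single HAVING predicate.
        having = condition if self.having is None else f"({self.having}) AND ({condition})"
        return PostAggState(self.base_sql, having, self.order_by, self.row_limit)

    def sort(self, *cols):
        # The most recent ordering takes precedence, so it goes first.
        return PostAggState(self.base_sql, self.having,
                            list(cols) + self.order_by, self.row_limit)

    def limit(self, n):
        return PostAggState(self.base_sql, self.having, self.order_by, n)

    def to_sql(self):
        # Emit clauses in the fixed HAVING -> ORDER BY -> LIMIT order,
        # regardless of the order the user called the methods.
        parts = [self.base_sql]
        if self.having is not None:
            parts.append(f"HAVING {self.having}")
        if self.order_by:
            parts.append("ORDER BY " + ", ".join(self.order_by))
        if self.row_limit is not None:
            parts.append(f"LIMIT {self.row_limit}")
        return " ".join(parts)
```

With this shape, calling sort() before filter() still yields HAVING before ORDER BY in the generated SQL, which is the essence of the fix.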

@sfc-gh-joshi sfc-gh-joshi requested review from a team as code owners March 23, 2026 20:20
@sfc-gh-joshi sfc-gh-joshi force-pushed the joshi-SNOW-3266495-having-order branch from d7aca89 to 0d60875 Compare March 25, 2026 23:03
@sfc-gh-joshi sfc-gh-joshi added the NO-CHANGELOG-UPDATES This pull request does not need to update CHANGELOG.md label Mar 25, 2026
Comment on lines +698 to +700
self._pending_having = None
self._pending_order_by = None
self._pending_limit = None
sfc-gh-aling (Contributor) commented:

I'm curious how multiple filter calls would affect the plan generation:

df1 = df.groupBy(...).agg(...)
df2 = df1.filter(...).limit(...).filter(...)
df3 = df1.filter(...).filter(...).limit(...)
df4 = df1.limit(...).filter(...).filter(...)
  1. Do df2, df3, and df4 output the same query?
  2. What's the behavior in Spark, and do we align with Spark's behavior after your code change?

sfc-gh-joshi (Contributor, author) replied:

That's a good point. It looks like in Spark, filter is not commutative across a sequence like df.filter(...).orderBy(...).limit(...).filter(...): the final filter sees a deterministic subset of rows based on the prior order/limit. I'll need to do some more testing to see what this means for SQL generation, and whether the cases you mentioned have similar problems.

sfc-gh-joshi (Contributor, author) replied:

@sfc-gh-aling I added some test cases covering this behavior, and checked the output against Spark. The changes are:

  1. Operations that occur after a LIMIT now always produce a new sub-query, since FILTER -> LIMIT and LIMIT -> FILTER are not equivalent.
  2. Consecutive filter operations are now conjoined into a single HAVING clause. I don't think SQL has the short-circuiting evaluation behavior that imperative languages do, so I believe this should always be equivalent. The Spark explain plans I looked at also combined consecutive filters into a single operator.
  3. Consecutive ordering operations are now combined into a single ORDER BY, with the last ordering clause appearing first in the SQL.

Most of these cases were previously broken in SCOS: the only sequence of operations that produced valid SQL was df.groupby(...).agg(...).filter(...).orderBy(...).limit(...), where the operations appeared in the same order that SQL requires.
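
The three rules above can be sketched as a small folding function. This is a hypothetical helper for illustration only, not the actual Snowpark code; it folds a sequence of post-aggregation operations into clause groups, starting a new group (i.e., a new sub-query) after any LIMIT:

```python
# Toy sketch of the three combining rules described above (hypothetical names).
# Each group corresponds to one query level: HAVING predicates to AND together,
# an ORDER BY list (latest sort first), and an optional LIMIT.

def combine_post_agg_ops(ops):
    """Fold ('filter'|'sort'|'limit', arg) pairs into clause groups."""
    groups = [{"having": [], "order_by": [], "limit": None}]
    for kind, arg in ops:
        group = groups[-1]
        if group["limit"] is not None:
            # Rule 1: anything after a LIMIT sees a truncated row set,
            # so it must apply in a new sub-query.
            group = {"having": [], "order_by": [], "limit": None}
            groups.append(group)
        if kind == "filter":
            # Rule 2: consecutive filters are conjoined into one HAVING.
            group["having"].append(arg)
        elif kind == "sort":
            # Rule 3: the last ordering wins, so it goes first in ORDER BY.
            group["order_by"].insert(0, arg)
        elif kind == "limit":
            group["limit"] = arg
    return groups
```

For example, df1.filter(a).filter(b).sort(x).sort(y).limit(5).filter(c) folds into one group with HAVING a AND b, ORDER BY y, x, LIMIT 5, plus a second group holding the post-limit filter c.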

sfc-gh-aling (Contributor) replied:

Nice, thanks for checking the behavior!
Re 1: does orderBy also produce a subquery like limit does?

@sfc-gh-joshi sfc-gh-joshi requested a review from a team March 27, 2026 20:46
sfc-gh-aling left a comment:

LGTM. I suggest kicking off the SCOS regression pipeline before merging the PR, to see how many new cases we support with this change and whether there are any gaps in our implementation.

