Skip to content

Conversation

chenkovsky
Copy link
Contributor

Which issue does this PR close?

Rationale for this change

first, it seems that grouping udaf document is not correct.

and for aggregation with grouping,

e.g.

CREATE TABLE test (c1 VARCHAR,c2 VARCHAR,c3 INT) as values
('a','A',1), ('b','B',2);

EXPLAIN FORMAT INDENT select
  c1,
  c2,
  CASE WHEN grouping(c1) = 1 THEN sum(c3) ELSE NULL END as gx,
  grouping(c1) as g0,
  grouping(c2) as g1,
  grouping(c1, c2) as g2,
  grouping(c2, c1) as g3
from
  test
group by
  grouping sets (
    (c1, c2),
    (c1),
    (c2),
    ()
  );

current logical plan is:

| logical_plan  | Projection: test.c1, test.c2, CASE WHEN grouping(test.c1) = Int32(1) THEN sum(test.c3) ELSE Int64(NULL) END AS gx, grouping(test.c1) AS g0, grouping(test.c2) AS g1, grouping(test.c1,test.c2) AS g2, grouping(test.c2,test.c1) AS g3                                                                                                              |
|               |   Projection: test.c1, test.c2, CAST(__common_expr_1 AS Int32) AS grouping(test.c1), sum(test.c3), CAST(__common_expr_2 AS Int32) AS grouping(test.c2), CAST(__grouping_id AS Int32) AS grouping(test.c1,test.c2), CAST(__common_expr_1 | __common_expr_2 << UInt8(1) AS Int32) AS grouping(test.c2,test.c1)                                       |
|               |     Projection: __grouping_id & UInt8(2) >> UInt8(1) AS __common_expr_1, __grouping_id & UInt8(1) AS __common_expr_2, test.c1, test.c2, __grouping_id, sum(test.c3)                                                                                                                                                                                |
|               |       Aggregate: groupBy=[[GROUPING SETS ((test.c1, test.c2), (test.c1), (test.c2), ())]], aggr=[[sum(CAST(test.c3 AS Int64))]]                                                                                                                                                                                                                    |
|               |         TableScan: test projection=[c1, c2, c3]  

the problems are:

  1. there are three projections. for bitwise operation, there's no benifit for extra projection.

  2. it makes grouping level optimization very hard. for example,

     select  case when grouping(c1) = 1 then sum(c2) else null end from test group by grouping sets (...)

    we only need to calculate sum(c2) in when grouping(c1) = 1, this is a very useful optimization trick. I'm also using this trick to optimizing sql with multiple count distinct.

  3. unparsing is not supported. Internal("Tried to unproject column referring to internal grouping id") will be thrown.

What changes are included in this PR?

  1. create a grouping udf.
  2. modify analyzer, replace grouping_udaf with grouping udf
  3. change it back, when unparsing.

now, the logical plan is:

| logical_plan  | Projection: test.c1, test.c2, CASE WHEN grouping(test.c1) = Int32(1) THEN sum(test.c3) ELSE Int64(NULL) END AS gx, grouping(test.c1) AS g0, grouping(test.c2) AS g1, grouping(test.c1,test.c2) AS g2, grouping(test.c2,test.c1) AS g3                                                                             |
|               |   Projection: test.c1, test.c2, grouping(__grouping_id, List([1])) AS grouping(test.c1), sum(test.c3), grouping(__grouping_id, List([0])) AS grouping(test.c2), grouping(__grouping_id) AS grouping(test.c1,test.c2), grouping(__grouping_id, List([0, 1])) AS grouping(test.c2,test.c1)                          |
|               |     Aggregate: groupBy=[[GROUPING SETS ((test.c1, test.c2), (test.c1), (test.c2), ())]], aggr=[[sum(CAST(test.c3 AS Int64))]]                                                                                                                                                                                     |
|               |       TableScan: test projection=[c1, c2, c3]                                                                                                                                                                             

there are only two projections. and unparsing is supported.

Are these changes tested?

UT

Are there any user-facing changes?

No

@github-actions github-actions bot added sql SQL Planner optimizer Optimizer rules functions Changes to functions implementation labels May 23, 2025
@github-actions github-actions bot added the documentation Improvements or additions to documentation label May 23, 2025
@alamb
Copy link
Contributor

alamb commented May 24, 2025

Thanks @chenkovsky -- can. you find the original PR that added this GROUPING function and perhaps @ mention the author to see if they have any feedback / could help with review?

@chenkovsky
Copy link
Contributor Author

chenkovsky commented May 24, 2025

@eejbyfeldt @comphead could you please help me review this PR?

@chenkovsky
Copy link
Contributor Author

it's related to #12704

@eejbyfeldt
Copy link
Contributor

I did not look closely at yet since I have not really contributed here in months.

there are three projections. for bitwise operation, there's no benifit for extra projection.

This extra projection is introduced by CommonSubexprEliminate. I understand that the changes make it so that they do not interact in the presented query, but is this really a general fix of that behavior?

But the goal of leaving more structure in logical plan after resolving the grouping expr so that optimization and unparsing sounds reasonable to me.

bigdecimal = { workspace = true }
datafusion-common = { workspace = true, default-features = true }
datafusion-expr = { workspace = true }
datafusion-functions-aggregate = { workspace = true }
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have tried hard to avoid having the sql parser depend on function libraries -- is there any way to avoid changing this dependency?

@chenkovsky chenkovsky force-pushed the feature/optimize_grouping branch from 13151cc to 3373cd0 Compare June 25, 2025 01:25
@chenkovsky chenkovsky force-pushed the feature/optimize_grouping branch from ed53ae1 to bf3a290 Compare July 15, 2025 07:08
Copy link

Thank you for your contribution. Unfortunately, this pull request is stale because it has been open 60 days with no activity. Please remove the stale label or comment or this will be closed in 7 days.

@github-actions github-actions bot added the Stale PR has not had any activity for some time label Sep 14, 2025
@chenkovsky
Copy link
Contributor Author

keep it open

@github-actions github-actions bot removed the Stale PR has not had any activity for some time label Sep 15, 2025
@alamb
Copy link
Contributor

alamb commented Sep 16, 2025

Thanks @chenkovsky

@eejbyfeldt it looks like you contributed the initial implementation of grouping in #12704

Could you help review this PR?

@Slimsammylim
Copy link

Hi @alamb and @chenkovsky! I am running into the same issue @lorenarosati brought up a couple months ago (#16590) and have not found a workaround solution yet. Just wanted to check in, what's the status of this PR?

@alamb
Copy link
Contributor

alamb commented Oct 2, 2025

Hi @alamb and @chenkovsky! I am running into the same issue @lorenarosati brought up a couple months ago (#16590) and have not found a workaround solution yet. Just wanted to check in, what's the status of this PR?

I am waiting for someone to help review the PR. Can you help here @Slimsammylim ? Perhaps verify that this solution fixes your issue?

@Slimsammylim
Copy link

Hi, I ran my code using this branch and unfortunately it did not solve my issue (#16590).

@chenkovsky
Copy link
Contributor Author

Hi, I ran my code using this branch and unfortunately it did not solve my issue (#16590).

hi,could you please push your code, then i can see what's your issue. currently i havent touch substrait, some extra codes should be modified to solve issue 16590.

@github-actions github-actions bot added logical-expr Logical plan and expressions substrait Changes to the substrait crate labels Oct 4, 2025
@chenkovsky
Copy link
Contributor Author

chenkovsky commented Oct 4, 2025

I added substrait support. @Slimsammylim please try again

@chenkovsky chenkovsky changed the title feat: optimize and unparse grouping feat: optimize grouping and introduced unparsing and substrait support Oct 4, 2025

// To avoid adding datafusion-functions-aggregate dependency, implement a DummyGroupingUDAF here
#[derive(Debug, Clone, PartialEq, Eq, Hash)]
pub struct DummyGroupingUDAF {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

currently, I created a dummy udaf. maybe we can move grouping udaf to datafusion-expr module directly. because grouping udaf is a pseudo udaf itself.

@Slimsammylim
Copy link

Hi! I tried again and but I keep getting the same error message.

@Slimsammylim
Copy link

Hi! I am not running SQL. The type of plan that is failing for me is the same as the one Victor mentions in #17909. Lorena mentions a similar plan in #16590.

@chenkovsky
Copy link
Contributor Author

chenkovsky commented Oct 7, 2025

Hi! I am not running SQL. The type of plan that is failing for me is the same as the one Victor mentions in #17909. Lorena mentions a similar plan in #16590.

@Slimsammylim I see, I only tested roundtrip, you only consume substrait plan. I think the simplest workaround is moving __grouping_id to the end of schema, then index will be correct. but It's too tricky. also there's another issue caused by __grouping_id #16983. It's better to remove __grouping_id. I will work on it

@chenkovsky
Copy link
Contributor Author

@Slimsammylim I created another PR. #17961 , please take a snapshot. it will not touch grouping udaf in logical plan level. so It will support substrait by design.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

documentation Improvements or additions to documentation functions Changes to functions implementation logical-expr Logical plan and expressions optimizer Optimizer rules sql SQL Planner substrait Changes to the substrait crate

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants