Skip to content

Logical plan creation for Substrait plans with aggregate relations #16590

@lorenarosati

Description

@lorenarosati

Describe the bug

When we convert a Substrait plan with an aggregate relation to a DataFusion LogicalPlan using the from_substrait_plan() function, the function from_aggregate_rel() in aggregate_rel.rs is called on the AggregateRel. Part of this function creates a vector of grouping expressions (group_exprs) and then calls the function aggregate() in builder.rs with these grouping expressions, which creates a new Aggregate.

If there are any grouping expressions (which there will be for all aggregate relations), a new field called __grouping_id (INTERNAL_GROUPING_ID) is added and the DFSchema is built with this extra field (see here). See below:
Image
This is what a sample schema looks like after the above code snippet runs:
DFSchema { inner: Schema { fields: [Field { name: "__grouping_id", data_type: UInt8, nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} }, Field { name: "min(Float64(1))", data_type: Float64, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }], metadata: {} }, field_qualifiers: [None, None], functional_dependencies: FunctionalDependencies { deps: [] } }

In from_substrait_plan_with_consumer(), we're able to get to the point where make_renamed_schema() in utils.rs is called. The problem is that our DFSchema has one more field than the vector of fields "names" in the root of the Substrait plan does (__grouping_id). This will result in the error "Named schema must contain names for all fields" here, and from_substrait_plan() will fail.

__grouping_id is a field used internally, but this should not result in a failure to convert Substrait plans to logical plans.

To Reproduce

  1. Create a Substrait plan with a simple aggregate relation - here's an example substrait_aggregate_example.json
  2. Convert the Substrait plan to a DataFusion LogicalPlan using the function from_substrait_plan() and observe that the conversion fails.

You can see that in this example plan, there is one name listed under "names": just "result". However, the schema will have "__grouping_id" and "result", so attempting to create a renamed schema with the expected names will fail.

Expected behavior

A Substrait plan with an aggregate relation should be able to be converted to a LogicalPlan successfully. The consumer code should be able to handle the extra field __grouping_id being added and that field should not cause an error when renaming the result fields in accordance with the names in the Substrait plan.

Additional context

No response

Metadata

Metadata

Assignees

Labels

bugSomething isn't working

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions