-
Notifications
You must be signed in to change notification settings - Fork 1.7k
Description
Describe the bug
When we convert a Substrait plan with an aggregate relation to a DataFusion LogicalPlan using the from_substrait_plan()
function, the function from_aggregate_rel()
in aggregate_rel.rs is called on the AggregateRel. Part of this function creates a vector of grouping expressions (group_exprs) and then calls the function aggregate()
in builder.rs with these grouping expressions, which creates a new Aggregate.
If there are any grouping expressions (which there will be for all aggregate relations), a new field called __grouping_id (INTERNAL_GROUPING_ID) is added and the DFSchema is built with this extra field (see here). See below:
This is what a sample schema looks like after the above code snippet runs:
DFSchema { inner: Schema { fields: [Field { name: "__grouping_id", data_type: UInt8, nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} }, Field { name: "min(Float64(1))", data_type: Float64, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }], metadata: {} }, field_qualifiers: [None, None], functional_dependencies: FunctionalDependencies { deps: [] } }
In from_substrait_plan_with_consumer()
, we're able to get to the point where make_renamed_schema()
in utils.rs is called. The problem is that our DFSchema has one more field than the vector of fields "names" in the root of the Substrait plan does (__grouping_id). This will result in the error "Named schema must contain names for all fields" here, and from_substrait_plan()
will fail.
__grouping_id is a field used internally, but this should not result in a failure to convert Substrait plans to logical plans.
To Reproduce
- Create a Substrait plan with a simple aggregate relation - here's an example substrait_aggregate_example.json
- Convert the Substrait plan to a DataFusion LogicalPlan using the function
from_substrait_plan()
and observe that the conversion fails.
You can see that in this example plan, there is one name listed under "names": just "result". However, the schema will have "__grouping_id" and "result", so attempting to create a renamed schema with the expected names will fail.
Expected behavior
A Substrait plan with an aggregate relation should be able to be converted to a LogicalPlan successfully. The consumer code should be able to handle the extra field __grouping_id being added and that field should not cause an error when renaming the result fields in accordance with the names in the Substrait plan.
Additional context
No response