Collaborator

@LiaCastaneda LiaCastaneda commented Jul 7, 2025

This PR adds support for Substrait plans as input. For example, the Substrait plan select_one, which corresponds to select 1 as test_col, translates to the following distributed plan:

[ output_partitions: 1]MaxRowsExec[max_rows=8192]
[ output_partitions: 1]  CoalesceBatchesExec: target_batch_size=8192
[ output_partitions: 1]    ProjectionExec: expr=[1 as test_col]
[ output_partitions: 1]      DataSourceExec: partitions=1, partition_sizes=[1]

Even though the distributed plan is built, most Substrait queries will fail when the plan is assigned to the workers, because datafusion-distributed doesn't support serializing DataSourceExec physical nodes (see here).

For comparison, when the same query is submitted as a standard SQL string it executes successfully, and its distributed plan is:

[ output_partitions: 1]MaxRowsExec[max_rows=8192]
[ output_partitions: 1]  CoalesceBatchesExec: target_batch_size=8192
[ output_partitions: 1]    ProjectionExec: expr=[1 as test_col]
[ output_partitions: 1]      PlaceholderRowExec

I had trouble finding a Substrait plan that triggers a physical plan with PlaceholderRowExec instead of DataSourceExec. Because of the way the consumer handles a virtual table, it's quite difficult to force the physical plan to include PlaceholderRowExec rather than DataSourceExec. I think we can address this issue in a separate PR; for now, I'm not sure whether the fix belongs in DataFusion or in datafusion-distributed.
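To illustrate the difference between the two plans above, here is a minimal toy model of the problem. It is a sketch only: the `Node` enum, `wrap`, and `first_unserializable` are hypothetical names made up for this example and are not DataFusion's `ExecutionPlan` API or the codec used by datafusion-distributed. It assumes, as described above, that the only node the codec cannot serialize is DataSourceExec.

```rust
// Toy model: hypothetical types, NOT the real DataFusion or datafusion-distributed API.
#[derive(Debug)]
enum Node {
    MaxRows { input: Box<Node> },
    CoalesceBatches { input: Box<Node> },
    Projection { input: Box<Node> },
    PlaceholderRow,
    DataSource,
}

impl Node {
    // Display name, mirroring the node names in the plans above.
    fn name(&self) -> &'static str {
        match self {
            Node::MaxRows { .. } => "MaxRowsExec",
            Node::CoalesceBatches { .. } => "CoalesceBatchesExec",
            Node::Projection { .. } => "ProjectionExec",
            Node::PlaceholderRow => "PlaceholderRowExec",
            Node::DataSource => "DataSourceExec",
        }
    }

    // Single child of this node, if it has one.
    fn input(&self) -> Option<&Node> {
        match self {
            Node::MaxRows { input }
            | Node::CoalesceBatches { input }
            | Node::Projection { input } => Some(&**input),
            Node::PlaceholderRow | Node::DataSource => None,
        }
    }
}

// Walks the plan top-down and returns the name of the first node that the
// (assumed) codec cannot serialize, or None if the whole plan is serializable.
fn first_unserializable(plan: &Node) -> Option<&'static str> {
    let mut current = Some(plan);
    while let Some(node) = current {
        if matches!(node, Node::DataSource) {
            return Some(node.name());
        }
        current = node.input();
    }
    None
}

// Builds MaxRows -> CoalesceBatches -> Projection -> leaf, the shape of both plans above.
fn wrap(leaf: Node) -> Node {
    Node::MaxRows {
        input: Box::new(Node::CoalesceBatches {
            input: Box::new(Node::Projection {
                input: Box::new(leaf),
            }),
        }),
    }
}
```

In this model, `first_unserializable(&wrap(Node::DataSource))` reports the DataSourceExec leaf, while `first_unserializable(&wrap(Node::PlaceholderRow))` finds nothing to reject. The two plans differ only in the leaf, which is why the Substrait path fails at worker assignment even though planning succeeds.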

@LiaCastaneda LiaCastaneda marked this pull request as ready for review July 7, 2025 12:50
@LiaCastaneda LiaCastaneda force-pushed the lia/add-support-for-substrait-plans branch from 3d380ae to 5808d24 Compare July 7, 2025 13:25
src/flight.rs Outdated
request: Request<FlightDescriptor>,
) -> Result<Response<FlightInfo>, Status>;

async fn get_flight_info_substrait(
Collaborator

Should we name this get_flight_info_substrait_plan so that it matches the name in FlightSqlService?

async fn do_get_statement(
&self,
- ticket: arrow_flight::sql::TicketStatementQuery,
+ ticket: TicketStatementQuery,
Collaborator

👍

schema,
explain_data: None,
})
}
Collaborator

Nice refactor

let error_msg = format!("{:?}", result.unwrap_err());
assert!(error_msg.contains("worker") || error_msg.contains("address"));
}
}
Collaborator

Nice test

Collaborator

@NGA-TRAN NGA-TRAN left a comment

It is very nice, Lia.

Have you considered supporting EXPLAIN for Substrait input as well? It could return the logical, physical, and distributed plans (including execution stages) without actually running them. Might be a great follow-up PR if you're aiming for transparency and introspection tooling around Substrait ingestion.
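As a rough sketch of what such an EXPLAIN response could look like: the snippet below collects the three plans as (plan_type, plan) rows without executing anything. `PlannedQuery` and `explain_rows` are hypothetical names invented for this illustration, and the "distributed_plan" label is an assumption; only the "logical_plan" and "physical_plan" labels follow DataFusion's existing EXPLAIN output.

```rust
// Hypothetical types for illustration; not the datafusion-distributed API.
struct PlannedQuery {
    logical: String,
    physical: String,
    distributed: String,
}

// Builds EXPLAIN-style rows of (plan_type, plan). Nothing is executed here:
// the function only reads plan strings that were produced during planning.
fn explain_rows(planned: &PlannedQuery) -> Vec<(&'static str, String)> {
    vec![
        ("logical_plan", planned.logical.clone()),
        ("physical_plan", planned.physical.clone()),
        // Assumed label for the distributed plan (including execution stages).
        ("distributed_plan", planned.distributed.clone()),
    ]
}
```

The same rows could be returned for both SQL and Substrait input, since by this point both have already been turned into plans.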

@LiaCastaneda
Collaborator Author

> Have you considered supporting EXPLAIN for Substrait input as well? It could return the logical, physical, and distributed plans (including execution stages) without actually running them. Might be a great follow-up PR if you're aiming for transparency and introspection tooling around Substrait ingestion.

I'll give it a try 👍

@LiaCastaneda
Collaborator Author

LiaCastaneda commented Jul 8, 2025

Okay, I was going to do it in this PR, but it ended up being a bigger refactor than expected, so I'll leave it for a follow-up PR. I'll leave some of the work on this branch.

@LiaCastaneda LiaCastaneda merged commit eed7176 into main Jul 8, 2025
3 checks passed
@gabotechs gabotechs deleted the lia/add-support-for-substrait-plans branch August 4, 2025 14:48