Skip to content

Conversation

@robtandy
Copy link
Collaborator

Often, the datafusion physical planner will determine that a table is small enough that it can be fully materialized during a join resulting in a HashJoinExec with mode=CollectLeft. We previously did not support these nodes, and required that all hash joins be partitioned, resulting in an awkward setting of config values:

        // FIXME: these three options are critical for the correct function of the library
        // but we are not enforcing that the user sets them.  They are here at the moment
        // but we should figure out a way to do this better.
        config
            .options_mut()
            .optimizer
            .hash_join_single_partition_threshold = 0;
        config
            .options_mut()
            .optimizer
            .hash_join_single_partition_threshold_rows = 0;

        config.options_mut().optimizer.prefer_hash_join = true;
        // end critical options section

This PR builds upon #104 and allows us to specify that if datafusion produces a CollectLeft join, we can honor that and not split that stage, benefiting from the optimization.

Importantly, it also lets us remove the awkward required settings above which were confusion to users of the library.

@gabotechs gabotechs merged commit 5c92110 into main Aug 21, 2025
3 checks passed
@gabotechs gabotechs deleted the robtandy/collect_left_hash_joins branch August 21, 2025 15:17
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants