Balancing memory usage and duplicate execution #9

@itamarst

The current collect_fold/scan() functions are named collect_* because they don't return an Expr or LazyFrame; they actually run the calculation. As a result, using them inside a lazy calculation can cause the same work to be executed more than once. On the plus side, they support streaming.

On the other hand, plumba.fold()/scan() operate on Expr, so they can be used with things like group_by() and allow for common subquery elimination. However, they don't support streaming due to limitations in Expr.map_batches().

It may be possible to get both limited memory usage and laziness (with the corresponding reduction in duplicate calculations) in the latter, given some tweaks to Polars' APIs.
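
To make the trade-off concrete, here is a toy model in plain Python. It is only a sketch: it does not use Polars or this library's API, and `CountingSource`, `streaming_fold`, and `materialized_folds` are hypothetical names invented for the illustration. The first helper mimics an eager, streaming fold (bounded memory, but the upstream work reruns on every call); the second mimics running the upstream work once and sharing it (no duplicate work, but every row is held in memory).

```python
from typing import Callable, Iterator, List, Tuple


class CountingSource:
    """Hypothetical stand-in for an expensive upstream query that yields batches."""

    def __init__(self, batches: List[List[int]]) -> None:
        self._batches = batches
        self.scans = 0  # number of times the upstream work was executed

    def __iter__(self) -> Iterator[List[int]]:
        self.scans += 1
        for batch in self._batches:
            yield batch


def streaming_fold(
    source: CountingSource, func: Callable[[int, int], int], initial: int
) -> int:
    """Eager fold: processes one batch at a time, so only the accumulator
    and the current batch are held in memory."""
    acc = initial
    for batch in source:
        for value in batch:
            acc = func(acc, value)
    return acc


def materialized_folds(
    source: CountingSource, folds: List[Tuple[Callable[[int, int], int], int]]
) -> List[int]:
    """Runs the source once and shares the rows across all folds,
    at the cost of holding every row in memory."""
    rows = [value for batch in source for value in batch]
    results = []
    for func, initial in folds:
        acc = initial
        for value in rows:
            acc = func(acc, value)
        results.append(acc)
    return results


batches = [[1, 2, 3], [4, 5], [6]]

# Two eager folds: memory stays bounded, but the source is scanned twice.
eager = CountingSource(batches)
total = streaming_fold(eager, lambda a, v: a + v, 0)
largest = streaming_fold(eager, max, 0)

# Two folds over shared, materialized rows: the source is scanned once.
shared = CountingSource(batches)
total2, largest2 = materialized_folds(shared, [(lambda a, v: a + v, 0), (max, 0)])
```

In this model `eager.scans` ends up at 2 while `shared.scans` stays at 1. What this issue is after would be a third variant that scans once and still processes batch by batch.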
