Balancing memory usage and duplicate execution

The current `collect_fold()` and `collect_scan()` are named `collect_*` because they don't return an `Expr` or `LazyFrame`; they actually run the calculation. Using them inside a larger lazy calculation can therefore duplicate work. On the plus side, they support streaming.
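To make the trade-off concrete, here is a minimal pure-Python sketch of the problem, not the library's actual implementation. `CountingSource` and `eager_fold` are hypothetical names invented for illustration; the source stands in for an expensive lazy query that yields batches. Each eager fold keeps only one batch plus the accumulator in memory, but every call re-runs the upstream work.

```python
from typing import Callable, Iterable, Iterator, List, TypeVar

A = TypeVar("A")


class CountingSource:
    """Hypothetical stand-in for an expensive lazy query that yields batches."""

    def __init__(self, batches: List[List[int]]) -> None:
        self._batches = batches
        self.scans = 0  # how many times the upstream work was executed

    def __iter__(self) -> Iterator[List[int]]:
        self.scans += 1
        yield from self._batches


def eager_fold(
    source: Iterable[List[int]],
    step: Callable[[A, List[int]], A],
    initial: A,
) -> A:
    """Fold over batches one at a time, running the source immediately."""
    acc = initial
    for batch in source:
        acc = step(acc, batch)
    return acc


source = CountingSource([[1, 2], [3, 4], [5]])

# Two independent eager folds over the same upstream query.
total = eager_fold(source, lambda acc, b: acc + sum(b), 0)
largest = eager_fold(source, lambda acc, b: max(acc, max(b)), float("-inf"))

# Memory stayed bounded per batch, but the source was scanned twice.
print(total, largest, source.scans)
```

A lazy plan that contained both folds could, in principle, share a single scan of the source, which is the duplicate work the eager approach cannot avoid.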

On the other hand, `plumba.fold()` and `plumba.scan()` operate on `Expr`, so they can be used in contexts like `group_by()` and allow for common subquery elimination. However, they don't support streaming due to limitations in `Expr.map_batches()`.

It may be possible to get both limited memory usage and laziness (with the corresponding reduction in duplicate calculations) in the latter, given some tweaks to Polars' APIs.

Balancing memory usage and duplicate execution #9
