Description
Problem:
One of the only operations for which I ever find myself going out of my way to create a SQL temp table, upload my frame, and do the operation in SQL is a filtered cross join. For large datasets these joins can both take a long time and use a large amount of memory. Here is an example of such an operation.
Current Solution:
import numpy as np
import pandas as pd

dr = pd.date_range('1900-01-01', periods=10000)
df = pd.DataFrame({
    'start_date': pd.Series(dr),
    'end_date': pd.Series(dr) + pd.Timedelta(days=30),
    'value': range(len(dr)),
    'id': np.random.choice(list('abcdefg'), len(dr)),
})
dates = pd.Series(pd.date_range(df.start_date.min(), df.end_date.max())).to_frame('ts')

# takes ~8 sec (not particularly long in this example) and peak memory at ~10 GB
df.merge(dates, how='cross').loc[lambda dfx: (dfx.ts >= dfx.start_date) & (dfx.ts < dfx.end_date)]
This is how it would be done in SQL; I don't have memory stats for it, but it is much faster:
select *
from table1
cross join table2
where ts >= start_date and ts < end_date
This is a common problem people try to solve with pandas; there are several instances on Stack Overflow of people asking how to do this (here are just a few I found with a quick search, there are many more):
https://stackoverflow.com/questions/43593554/merging-two-dataframes-based-on-a-date-between-two-other-dates-without-a-common/43594038
https://stackoverflow.com/questions/30627968/merge-pandas-dataframes-where-one-value-is-between-two-others
https://stackoverflow.com/questions/31328014/merging-dataframes-based-on-date-range
There are many different solutions, ranging in complexity and performance. Often the answer people provide is to just do it in SQL.
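For reference, a typical workaround in those threads avoids materializing the full cross-joined DataFrame by building the boolean match mask with NumPy broadcasting and then assembling only the matching pairs. This is just an illustrative sketch using the df and dates objects from the example above (the variable names and structure here are my own, not taken from any particular answer):

import numpy as np
import pandas as pd

# Compare every timestamp against every [start_date, end_date) interval via broadcasting.
ts = dates['ts'].to_numpy()          # shape (m,)
start = df['start_date'].to_numpy()  # shape (n,)
end = df['end_date'].to_numpy()      # shape (n,)

# (n, m) boolean mask: entry (i, j) is True when ts[j] falls inside interval i.
mask = (ts[None, :] >= start[:, None]) & (ts[None, :] < end[:, None])
left_idx, right_idx = np.nonzero(mask)

# Build only the matching rows instead of the full cross product.
result = pd.concat(
    [df.iloc[left_idx].reset_index(drop=True),
     dates.iloc[right_idx].reset_index(drop=True)],
    axis=1,
)

Even this still allocates an n-by-m mask (roughly 100 MB here) and scales quadratically, which is why people end up hand-rolling chunked loops or dropping to SQL anyway.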
Proposed Solution: (not a concrete/final proposal, just a starting point for discussion)
I think one way this could be solved performantly, with an easy-to-use interface, is to provide a lazily evaluated cross join that is only resolved to a pandas DataFrame after a filtering operation (e.g. .loc, .query, etc.). Below is an example implementation; I don't think it is the complete or optimal solution, just a quick example to get the point across.
import itertools as it
import pandas as pd

def chunks(lst, n):
    # Yield successive row chunks of length n.
    for i in range(0, len(lst), n):
        yield lst[i:i + n]

class LazyMergeLoc:
    # Descriptor that applies the filter to each chunk pair and concatenates the results.
    def __get__(self, instance, owner):
        self.chunk_pairs = instance.chunk_pairs
        return self

    def __getitem__(self, item):
        results = []
        for df1, df2 in self.chunk_pairs:
            # Only a (chunk x chunk) cross product is ever materialized at once.
            result = df1.merge(df2, how='cross').loc[item]
            results.append(result)
        return pd.concat(results)

class LazyMerge:
    loc = LazyMergeLoc()

    def __init__(self, chunk_pairs):
        self.chunk_pairs = chunk_pairs

def lazy_cross_join(df1, df2, left_chunk_size, right_chunk_size):
    left_chunks = list(chunks(df1, left_chunk_size))
    right_chunks = list(chunks(df2, right_chunk_size))
    chunk_pairs = it.product(left_chunks, right_chunks)
    return LazyMerge(chunk_pairs)

lazy_cross_join(df, dates, left_chunk_size=1000, right_chunk_size=1000).loc[
    lambda dfx: (dfx.ts >= dfx.start_date) & (dfx.ts < dfx.end_date)
]
This solution produces the same result in about the same time (and is much faster in many cases), with peak memory of only a couple hundred MB, roughly 50x less.
Again, I don't think this is necessarily the right implementation or interface. I'm sure there are even more performant ways to resolve the lazily defined join, not through chunks but perhaps via optimized Cython code. But I do think this type of lazy merge would provide a large benefit to the library.
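As a quick sanity check (not part of the proposal), one way to confirm the chunked version matches the naive cross join on the example data is to sort both results on a key that is unique per row and compare. This assumes the df, dates, and lazy_cross_join objects defined above:

cond = lambda dfx: (dfx.ts >= dfx.start_date) & (dfx.ts < dfx.end_date)

naive = df.merge(dates, how='cross').loc[cond]
lazy = lazy_cross_join(df, dates, left_chunk_size=1000, right_chunk_size=1000).loc[cond]

# Row order differs because of the chunked concatenation, so sort before comparing.
sort_cols = ['start_date', 'ts']
pd.testing.assert_frame_equal(
    naive.sort_values(sort_cols).reset_index(drop=True),
    lazy.sort_values(sort_cols).reset_index(drop=True),
)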
Breaking Implications:
None
Alternatives:
Many solutions are provided in the linked (and other) SO posts, all of which are either (a) not performant, (b) complicated, (c) not generalizable, or (d) require leaving the pandas ecosystem.