Skip to content

ENH: Lazy Cross-Join With Deferred Filtering #41150

@ms7463

Description

@ms7463

Problem:
One of the only operations I ever find myself going out of my way to create a sql temp-table upload my frame and do the operation in sql is for filtered cross joins. For large datasets they can both take a long time and use large amounts of memory. Here is an example of such an operation

Current Solution:

dr = pd.date_range('1900-01-01', periods=10000)
df = pd.DataFrame({
    'start_date': pd.Series(dr), 
    'end_date': pd.Series(dr) + pd.Timedelta(days=30), 
    'value': range(len(dr)), 
    'id': np.random.choice(list('abcdefg'), len(dr))
})
dates = pd.Series(pd.date_range(df.start_date.min(), df.end_date.max())).to_frame('ts')

# takes ~8 sec (not particularly long in this example) and peak memory at ~10 GB
df.merge(dates, how='cross').loc[lambda dfx: (dfx.ts >= dfx.start_date) & (dfx.ts < dfx.end_date)]

This is how it would be done in SQL, which I don't have memory stats for, but is much faster

select * from table1
cross join table2
where ts >= start_date and ts <= end_date

This is a common problem people try to solve using pandas, as there are several instances on stack overflow of people asking how to do this (here are just a few I found with a quick search, there are many more):
https://stackoverflow.com/questions/43593554/merging-two-dataframes-based-on-a-date-between-two-other-dates-without-a-common/43594038
https://stackoverflow.com/questions/30627968/merge-pandas-dataframes-where-one-value-is-between-two-others
https://stackoverflow.com/questions/31328014/merging-dataframes-based-on-date-range

There are many different solutions ranging in complexity and performance. Often the answer people provide is to just do it in sql.

Proposed Solution: (not concrete/final proposal, just a starting point for discussion)
I think that one way this could be solved in a performant way with an easy to use interface is to provide a lazily evaluated cross join. Which is resolved to a pandas dataframe after a filtering operation (e.g. .loc, .query, etc.). Below I provide an example implementation, that I don't think is the complete or optimal solution, but rather just a quick example to get the point across.

import itertools as it

 def chunks(lst, n):
     for i in range(0, len(lst), n):
         yield lst[i:i + n]

class LazyMergeLoc:
    def __get__(self, instance, owner):
        self.chunk_pairs = instance.chunk_pairs
        return self

    def __getitem__(self, item):
        results = []
        for df1, df2 in self.chunk_pairs:
            result = df1.merge(df2, how='cross').loc[item]
            results.append(result)
        return pd.concat(results)

class LazyMerge:
    loc = LazyMergeLoc()

    def __init__(self, chunk_pairs):
        self.chunk_pairs = chunk_pairs

def lazy_cross_join(df1, df2, left_chunk_size, right_chunk_size):
    left_chunks = list(chunks(df, left_chunk_size))
    right_chunks = list(chunks(dates, right_chunk_size))
    chunk_pairs = it.product(left_chuknks, right_chunks)
    return LazyMerge(chunk_pairs)

pd.lazy_cross_join(df, dates, left_chunk_size=1000, right_chunk_size=1000).loc[lambda dfx: (dfx.ts >= dfx.start_date) & (dfx.ts < dfx.end_date)]

This solution provides the same result in the same time (and much faster too in many cases), and peak memory of only a couple hundred MB. ~50x less memory.

Again, I don't think this is necessarily the right implementation or interface. I'm sure there are even more performant ways to resolve the lazily defined join, not through chunks, perhaps some optimized cython code. But I do think this type of lazy merge would provide a large benefit to the library.

Breaking Implications:
None

Alternatives:
Many solutions provided in the linked (and other) SO posts. All of which are either (a) not performant, (b) complicated, (c) not generalizable, or (d) require leaving the pandas ecosystem.

Metadata

Metadata

Assignees

No one assigned

    Labels

    EnhancementNeeds TriageIssue that has not been reviewed by a pandas team member

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions