Description
Problem:
One of the only operations for which I ever find myself going out of my way to create a SQL temp table, upload my frame, and do the operation in SQL is a filtered cross join. For large datasets these joins can both take a long time and use a large amount of memory. Here is an example of such an operation.
Current Solution:
import numpy as np
import pandas as pd

dr = pd.date_range('1900-01-01', periods=10000)
df = pd.DataFrame({
    'start_date': pd.Series(dr),
    'end_date': pd.Series(dr) + pd.Timedelta(days=30),
    'value': range(len(dr)),
    'id': np.random.choice(list('abcdefg'), len(dr)),
})
dates = pd.Series(pd.date_range(df.start_date.min(), df.end_date.max())).to_frame('ts')

# takes ~8 sec (not particularly long in this example) and peak memory at ~10 GB
df.merge(dates, how='cross').loc[lambda dfx: (dfx.ts >= dfx.start_date) & (dfx.ts < dfx.end_date)]
This is how it would be done in SQL; I don't have memory stats for it, but it is much faster:
select *
from table1
cross join table2
where ts >= start_date and ts < end_date
This is a common problem people try to solve with pandas; there are several instances on Stack Overflow of people asking how to do this (here are just a few I found with a quick search, there are many more):
https://stackoverflow.com/questions/43593554/merging-two-dataframes-based-on-a-date-between-two-other-dates-without-a-common/43594038
https://stackoverflow.com/questions/30627968/merge-pandas-dataframes-where-one-value-is-between-two-others
https://stackoverflow.com/questions/31328014/merging-dataframes-based-on-date-range
There are many different solutions, ranging in complexity and performance. Often the answer people provide is to just do it in SQL.
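For reference, a typical workaround in those threads avoids materializing the full cross-joined DataFrame by building the boolean match mask with NumPy broadcasting and then assembling only the matching pairs. This is just an illustrative sketch using the df and dates objects from the example above (the variable names and structure here are my own, not taken from any particular answer):

import numpy as np
import pandas as pd

# Compare every timestamp against every [start_date, end_date) interval via broadcasting.
ts = dates['ts'].to_numpy()          # shape (m,)
start = df['start_date'].to_numpy()  # shape (n,)
end = df['end_date'].to_numpy()      # shape (n,)

# (n, m) boolean mask: entry (i, j) is True when ts[j] falls inside interval i.
mask = (ts[None, :] >= start[:, None]) & (ts[None, :] < end[:, None])
left_idx, right_idx = np.nonzero(mask)

# Build only the matching rows instead of the full cross product.
result = pd.concat(
    [df.iloc[left_idx].reset_index(drop=True),
     dates.iloc[right_idx].reset_index(drop=True)],
    axis=1,
)

Even this still allocates an n-by-m mask (roughly 100 MB here) and scales quadratically, which is why people end up hand-rolling chunked loops or dropping to SQL anyway.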
Proposed Solution: (not a concrete/final proposal, just a starting point for discussion)
I think one way this could be solved performantly, with an easy-to-use interface, is to provide a lazily evaluated cross join that is only resolved to a pandas DataFrame after a filtering operation (e.g. .loc, .query, etc.). Below is an example implementation; I don't think it is the complete or optimal solution, just a quick example to get the point across.
import itertools as it
import pandas as pd

def chunks(lst, n):
    # Yield successive row chunks of length n.
    for i in range(0, len(lst), n):
        yield lst[i:i + n]

class LazyMergeLoc:
    # Descriptor that applies the filter to each chunk pair and concatenates the results.
    def __get__(self, instance, owner):
        self.chunk_pairs = instance.chunk_pairs
        return self

    def __getitem__(self, item):
        results = []
        for df1, df2 in self.chunk_pairs:
            # Only a (chunk x chunk) cross product is ever materialized at once.
            result = df1.merge(df2, how='cross').loc[item]
            results.append(result)
        return pd.concat(results)

class LazyMerge:
    loc = LazyMergeLoc()

    def __init__(self, chunk_pairs):
        self.chunk_pairs = chunk_pairs

def lazy_cross_join(df1, df2, left_chunk_size, right_chunk_size):
    left_chunks = list(chunks(df1, left_chunk_size))
    right_chunks = list(chunks(df2, right_chunk_size))
    chunk_pairs = it.product(left_chunks, right_chunks)
    return LazyMerge(chunk_pairs)

lazy_cross_join(df, dates, left_chunk_size=1000, right_chunk_size=1000).loc[
    lambda dfx: (dfx.ts >= dfx.start_date) & (dfx.ts < dfx.end_date)
]
This solution produces the same result in about the same time (and is much faster in many cases), with peak memory of only a couple hundred MB, roughly 50x less.
Again, I don't think this is necessarily the right implementation or interface. I'm sure there are even more performant ways to resolve the lazily defined join, not through chunks but perhaps via optimized Cython code. But I do think this type of lazy merge would provide a large benefit to the library.
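As a quick sanity check (not part of the proposal), one way to confirm the chunked version matches the naive cross join on the example data is to sort both results on a key that is unique per row and compare. This assumes the df, dates, and lazy_cross_join objects defined above:

cond = lambda dfx: (dfx.ts >= dfx.start_date) & (dfx.ts < dfx.end_date)

naive = df.merge(dates, how='cross').loc[cond]
lazy = lazy_cross_join(df, dates, left_chunk_size=1000, right_chunk_size=1000).loc[cond]

# Row order differs because of the chunked concatenation, so sort before comparing.
sort_cols = ['start_date', 'ts']
pd.testing.assert_frame_equal(
    naive.sort_values(sort_cols).reset_index(drop=True),
    lazy.sort_values(sort_cols).reset_index(drop=True),
)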
Breaking Implications:
None
Alternatives:
Many solutions are provided in the linked (and other) SO posts, all of which are either (a) not performant, (b) complicated, (c) not generalizable, or (d) require leaving the pandas ecosystem.