-
-
Notifications
You must be signed in to change notification settings - Fork 19.1k
Closed
Closed
Copy link
Labels
Milestone
Description
I've noticed that the first time I resample a DataFrame with a datetime index, it's REALLY slow compared to doing exactly the same thing in pure Python.
My use case is analyzing logs from a web application server. The index datetimes indicate when an event happened, and there are many per second (so non-unique), and millions of rows per day (I'm totally fine with having separate days in separate dataframes). Once I've done a resample to a particular frequency (.resample("T")
), I can do another resample with a different how=
and it's nice and snappy, so I suspect there's an inefficiency somewhere in the code that figures out the resampled row groupings, and that those groupings are cached.
From iPython (for %time):
from pandas import DataFrame
from datetime import datetime
from itertools import groupby
datelist = sum([[datetime(2012,7,26,0,x)]*1000 for x in range(10)], [])
a = DataFrame([1]*len(datelist), index=datelist)
# This is slower than it should be
%time a.resample("T", how=len)
# CPU times: user 1.49 s, sys: 0.01 s, total: 1.50 s
# Wall time: 1.50 s
# Out[6]:
# 0
#2012-07-26 00:00:00 1000
#2012-07-26 00:01:00 1000
#2012-07-26 00:02:00 1000
#2012-07-26 00:03:00 1000
#2012-07-26 00:04:00 1000
#2012-07-26 00:05:00 1000
#2012-07-26 00:06:00 1000
#2012-07-26 00:07:00 1000
#2012-07-26 00:08:00 1000
#2012-07-26 00:09:00 1000
# This is unexpectedly fast (guess: caching)
%time a.resample("T", how=len)
# CPU times: user 0.00 s, sys: 0.00 s, total: 0.00 s
# Wall time: 0.00 s
# Out[7]:
# 0
#2012-07-26 00:00:00 1000
#2012-07-26 00:01:00 1000
#2012-07-26 00:02:00 1000
#2012-07-26 00:03:00 1000
#2012-07-26 00:04:00 1000
#2012-07-26 00:05:00 1000
#2012-07-26 00:06:00 1000
#2012-07-26 00:07:00 1000
#2012-07-26 00:08:00 1000
#2012-07-26 00:09:00 1000
def by_minute(dt):
return (dt.year, dt.month, dt.day, dt.hour, dt.minute)
def how((key, values)):
return datetime(*key), len(list(values))
# This is how fast I think .resample() SHOULD be. :)
%time map(how, groupby(datelist, key=by_minute))
# CPU times: user 0.01 s, sys: 0.00 s, total: 0.01 s
# Wall time: 0.01 s
# Out[9]:
# [(datetime.datetime(2012, 7, 26, 0, 0), 1000),
# (datetime.datetime(2012, 7, 26, 0, 1), 1000),
# (datetime.datetime(2012, 7, 26, 0, 2), 1000),
# (datetime.datetime(2012, 7, 26, 0, 3), 1000),
# (datetime.datetime(2012, 7, 26, 0, 4), 1000),
# (datetime.datetime(2012, 7, 26, 0, 5), 1000),
# (datetime.datetime(2012, 7, 26, 0, 6), 1000),
# (datetime.datetime(2012, 7, 26, 0, 7), 1000),
# (datetime.datetime(2012, 7, 26, 0, 8), 1000),
# (datetime.datetime(2012, 7, 26, 0, 9), 1000)]
Thanks!