Performance of DataFrame.resample with a lot of non-unique datetime index values #1686

@eloraburns

Description

I've noticed that the first time I resample a DataFrame with a datetime index, it's REALLY slow compared to doing exactly the same thing in pure Python.

My use case is analyzing logs from a web application server. The index datetimes indicate when each event happened; there are many events per second (so the index values are non-unique) and millions of rows per day (I'm totally fine with keeping separate days in separate DataFrames). Once I've done a resample to a particular frequency (.resample("T")), a second resample with a different how= is nice and snappy, so I suspect there's an inefficiency somewhere in the code that figures out the resampled row groupings, and that those groupings are cached.

From IPython (for %time):

from pandas import DataFrame
from datetime import datetime
from itertools import groupby

datelist = sum([[datetime(2012,7,26,0,x)]*1000 for x in range(10)], [])
a = DataFrame([1]*len(datelist), index=datelist)

# This is slower than it should be
%time a.resample("T", how=len)
# CPU times: user 1.49 s, sys: 0.01 s, total: 1.50 s
# Wall time: 1.50 s
# Out[6]: 
#                         0
#2012-07-26 00:00:00  1000
#2012-07-26 00:01:00  1000
#2012-07-26 00:02:00  1000
#2012-07-26 00:03:00  1000
#2012-07-26 00:04:00  1000
#2012-07-26 00:05:00  1000
#2012-07-26 00:06:00  1000
#2012-07-26 00:07:00  1000
#2012-07-26 00:08:00  1000
#2012-07-26 00:09:00  1000

# This is unexpectedly fast (guess: caching)
%time a.resample("T", how=len)
# CPU times: user 0.00 s, sys: 0.00 s, total: 0.00 s
# Wall time: 0.00 s
# Out[7]: 
#                         0
#2012-07-26 00:00:00  1000
#2012-07-26 00:01:00  1000
#2012-07-26 00:02:00  1000
#2012-07-26 00:03:00  1000
#2012-07-26 00:04:00  1000
#2012-07-26 00:05:00  1000
#2012-07-26 00:06:00  1000
#2012-07-26 00:07:00  1000
#2012-07-26 00:08:00  1000
#2012-07-26 00:09:00  1000
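The caching guess above can be made checkable by timing a second resample that uses a *different* aggregation: if the row groupings are cached, only the first call should pay the grouping cost. A minimal sketch (note this uses the method-style aggregation API and the "min" alias from later pandas versions, since the how= keyword was eventually removed; absolute timings will vary by machine and pandas version):

```python
import timeit
from datetime import datetime

import pandas as pd

# Same setup as the report: 10 minutes x 1000 duplicate timestamps each.
datelist = sum([[datetime(2012, 7, 26, 0, x)] * 1000 for x in range(10)], [])
a = pd.DataFrame([1] * len(datelist), index=datelist)

# Time the first resample, then a second one with a different
# aggregation on the same frame.
t_first = timeit.timeit(lambda: a.resample("min").count(), number=1)
t_other_how = timeit.timeit(lambda: a.resample("min").sum(), number=1)
print(t_first, t_other_how)
```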

def by_minute(dt):
    return (dt.year, dt.month, dt.day, dt.hour, dt.minute)
def how(item):
    # (tuple parameters in a def signature are Python 2 only)
    key, values = item
    return datetime(*key), len(list(values))

# This is how fast I think .resample() SHOULD be. :)
%time list(map(how, groupby(datelist, key=by_minute)))
# CPU times: user 0.01 s, sys: 0.00 s, total: 0.01 s
# Wall time: 0.01 s
# Out[9]: 
# [(datetime.datetime(2012, 7, 26, 0, 0), 1000),
#  (datetime.datetime(2012, 7, 26, 0, 1), 1000),
#  (datetime.datetime(2012, 7, 26, 0, 2), 1000),
#  (datetime.datetime(2012, 7, 26, 0, 3), 1000),
#  (datetime.datetime(2012, 7, 26, 0, 4), 1000),
#  (datetime.datetime(2012, 7, 26, 0, 5), 1000),
#  (datetime.datetime(2012, 7, 26, 0, 6), 1000),
#  (datetime.datetime(2012, 7, 26, 0, 7), 1000),
#  (datetime.datetime(2012, 7, 26, 0, 8), 1000),
#  (datetime.datetime(2012, 7, 26, 0, 9), 1000)]
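Presumably, passing the Python builtin len as the aggregator makes pandas call back into Python once per group, which is what the pure-Python comparison above is racing against. For plain row counts, a built-in aggregation avoids the per-group callable entirely. A sketch using the later method-style API (how= no longer exists there):

```python
from datetime import datetime

import pandas as pd

# Same setup as the report.
datelist = sum([[datetime(2012, 7, 26, 0, x)] * 1000 for x in range(10)], [])
a = pd.DataFrame([1] * len(datelist), index=datelist)

# Row counts per minute without a Python-level callable:
counts = a.resample("min").size()

# Equivalently, via a time-based groupby:
counts_gb = a.groupby(pd.Grouper(freq="min")).size()

assert counts.equals(counts_gb)
```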

Thanks!
