
dataframe memory usage increased after .loc[] or df[a:b] #31197

@miaoz2001

Code Sample

import numpy
import pandas


def test1(freq):
    """
    After assigning through df.loc[], the memory usage reported by df.info()
    grows, almost doubling, no matter how big or small the df is.
    """
    index = pandas.date_range('2019-01-01', '2019-01-02', freq=freq,
                              tz='UTC')
    df = pandas.DataFrame(numpy.random.randn(len(index), 10), index=index,
                          columns=['ax', 'ay', 'az', 'qw', 'qx', 'qy', 'qz',
                                   'wx', 'wy', 'wz'], dtype=numpy.float32)
    idx = pandas.date_range(start='2019-01-01 12:00:00',
                            end='2019-01-01 13:00:00',
                            freq=freq, tz='UTC')
    df.info()
    df.loc[idx] = 0
    df.info()


def test2(freq):
    """
    After slicing with df[a:b], a rough observation:
         - df smaller than around 50 MB: the reported memory usage increases
         - df larger than around 50 MB: the reported memory usage does NOT increase
    """
    index = pandas.date_range('2019-01-01', '2019-01-02', freq=freq,
                              tz='UTC')
    df = pandas.DataFrame(numpy.random.randn(len(index), 10), index=index,
                          columns=['ax', 'ay', 'az', 'qw', 'qx', 'qy', 'qz',
                                   'wx', 'wy', 'wz'], dtype=numpy.float32)
    df.info()
    df[df.index.min(): df.index.max()]
    df.info()


if __name__ == '__main__':
    freqs = ['40ms', '80ms', '90ms', '150ms']
    for freq in freqs:
        print(f'\n=====freq is {freq} in loc test====\n')
        test1(freq)

    for freq in freqs:
        print(f'\n=====freq is {freq} in slice test====\n')
        test2(freq)

Problem description

The memory usage reported by df.info() increases after a .loc[] assignment or a df[a:b] slice, even though the shape and dtypes of the DataFrame are unchanged.

  • after df.loc[idx] = 0, the reported memory usage grows, almost doubling,
    no matter how big or small the df is
  • after df[a:b], a rough observation from the runs below:
    - a df of about 44 MB or less: the memory usage increases
    - a df of about 49 MB or more: the memory usage does NOT increase

The log from the code above:

=====freq is 40ms in loc test====

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 2160001 entries, 2019-01-01 00:00:00+00:00 to 2019-01-02 00:00:00+00:00
Freq: 40L
Data columns (total 10 columns):
ax    float32
ay    float32
az    float32
qw    float32
qx    float32
qy    float32
qz    float32
wx    float32
wy    float32
wz    float32
dtypes: float32(10)
memory usage: 98.9 MB
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 2160001 entries, 2019-01-01 00:00:00+00:00 to 2019-01-02 00:00:00+00:00
Freq: 40L
Data columns (total 10 columns):
ax    float32
ay    float32
az    float32
qw    float32
qx    float32
qy    float32
qz    float32
wx    float32
wy    float32
wz    float32
dtypes: float32(10)
memory usage: 178.9 MB            <------------ increased!

=====freq is 80ms in loc test====

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 1080001 entries, 2019-01-01 00:00:00+00:00 to 2019-01-02 00:00:00+00:00
Freq: 80L
Data columns (total 10 columns):
ax    1080001 non-null float32
ay    1080001 non-null float32
az    1080001 non-null float32
qw    1080001 non-null float32
qx    1080001 non-null float32
qy    1080001 non-null float32
qz    1080001 non-null float32
wx    1080001 non-null float32
wy    1080001 non-null float32
wz    1080001 non-null float32
dtypes: float32(10)
memory usage: 49.4 MB
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 1080001 entries, 2019-01-01 00:00:00+00:00 to 2019-01-02 00:00:00+00:00
Freq: 80L
Data columns (total 10 columns):
ax    1080001 non-null float32
ay    1080001 non-null float32
az    1080001 non-null float32
qw    1080001 non-null float32
qx    1080001 non-null float32
qy    1080001 non-null float32
qz    1080001 non-null float32
wx    1080001 non-null float32
wy    1080001 non-null float32
wz    1080001 non-null float32
dtypes: float32(10)
memory usage: 89.4 MB              <------------ increased!

=====freq is 90ms in loc test====

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 960001 entries, 2019-01-01 00:00:00+00:00 to 2019-01-02 00:00:00+00:00
Freq: 90L
Data columns (total 10 columns):
ax    960001 non-null float32
ay    960001 non-null float32
az    960001 non-null float32
qw    960001 non-null float32
qx    960001 non-null float32
qy    960001 non-null float32
qz    960001 non-null float32
wx    960001 non-null float32
wy    960001 non-null float32
wz    960001 non-null float32
dtypes: float32(10)
memory usage: 43.9 MB
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 960001 entries, 2019-01-01 00:00:00+00:00 to 2019-01-02 00:00:00+00:00
Freq: 90L
Data columns (total 10 columns):
ax    960001 non-null float32
ay    960001 non-null float32
az    960001 non-null float32
qw    960001 non-null float32
qx    960001 non-null float32
qy    960001 non-null float32
qz    960001 non-null float32
wx    960001 non-null float32
wy    960001 non-null float32
wz    960001 non-null float32
dtypes: float32(10)
memory usage: 83.9 MB                  <------------ increased!

=====freq is 150ms in loc test====

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 576001 entries, 2019-01-01 00:00:00+00:00 to 2019-01-02 00:00:00+00:00
Freq: 150L
Data columns (total 10 columns):
ax    576001 non-null float32
ay    576001 non-null float32
az    576001 non-null float32
qw    576001 non-null float32
qx    576001 non-null float32
qy    576001 non-null float32
qz    576001 non-null float32
wx    576001 non-null float32
wy    576001 non-null float32
wz    576001 non-null float32
dtypes: float32(10)
memory usage: 26.4 MB
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 576001 entries, 2019-01-01 00:00:00+00:00 to 2019-01-02 00:00:00+00:00
Freq: 150L
Data columns (total 10 columns):
ax    576001 non-null float32
ay    576001 non-null float32
az    576001 non-null float32
qw    576001 non-null float32
qx    576001 non-null float32
qy    576001 non-null float32
qz    576001 non-null float32
wx    576001 non-null float32
wy    576001 non-null float32
wz    576001 non-null float32
dtypes: float32(10)
memory usage: 46.4 MB                         <------------ increased!







=====freq is 40ms in slice test====

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 2160001 entries, 2019-01-01 00:00:00+00:00 to 2019-01-02 00:00:00+00:00
Freq: 40L
Data columns (total 10 columns):
ax    float32
ay    float32
az    float32
qw    float32
qx    float32
qy    float32
qz    float32
wx    float32
wy    float32
wz    float32
dtypes: float32(10)
memory usage: 98.9 MB                                    
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 2160001 entries, 2019-01-01 00:00:00+00:00 to 2019-01-02 00:00:00+00:00
Freq: 40L
Data columns (total 10 columns):
ax    float32
ay    float32
az    float32
qw    float32
qx    float32
qy    float32
qz    float32
wx    float32
wy    float32
wz    float32
dtypes: float32(10)
memory usage: 98.9 MB                             <------------ unchanged!

=====freq is 80ms in slice test====

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 1080001 entries, 2019-01-01 00:00:00+00:00 to 2019-01-02 00:00:00+00:00
Freq: 80L
Data columns (total 10 columns):
ax    1080001 non-null float32
ay    1080001 non-null float32
az    1080001 non-null float32
qw    1080001 non-null float32
qx    1080001 non-null float32
qy    1080001 non-null float32
qz    1080001 non-null float32
wx    1080001 non-null float32
wy    1080001 non-null float32
wz    1080001 non-null float32
dtypes: float32(10)
memory usage: 49.4 MB
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 1080001 entries, 2019-01-01 00:00:00+00:00 to 2019-01-02 00:00:00+00:00
Freq: 80L
Data columns (total 10 columns):
ax    1080001 non-null float32
ay    1080001 non-null float32
az    1080001 non-null float32
qw    1080001 non-null float32
qx    1080001 non-null float32
qy    1080001 non-null float32
qz    1080001 non-null float32
wx    1080001 non-null float32
wy    1080001 non-null float32
wz    1080001 non-null float32
dtypes: float32(10)
memory usage: 49.4 MB                                     <------------ unchanged!

=====freq is 90ms in slice test====

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 960001 entries, 2019-01-01 00:00:00+00:00 to 2019-01-02 00:00:00+00:00
Freq: 90L
Data columns (total 10 columns):
ax    960001 non-null float32
ay    960001 non-null float32
az    960001 non-null float32
qw    960001 non-null float32
qx    960001 non-null float32
qy    960001 non-null float32
qz    960001 non-null float32
wx    960001 non-null float32
wy    960001 non-null float32
wz    960001 non-null float32
dtypes: float32(10)
memory usage: 43.9 MB
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 960001 entries, 2019-01-01 00:00:00+00:00 to 2019-01-02 00:00:00+00:00
Freq: 90L
Data columns (total 10 columns):
ax    960001 non-null float32
ay    960001 non-null float32
az    960001 non-null float32
qw    960001 non-null float32
qx    960001 non-null float32
qy    960001 non-null float32
qz    960001 non-null float32
wx    960001 non-null float32
wy    960001 non-null float32
wz    960001 non-null float32
dtypes: float32(10)
memory usage: 83.9 MB                             <------------ increased!

=====freq is 150ms in slice test====

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 576001 entries, 2019-01-01 00:00:00+00:00 to 2019-01-02 00:00:00+00:00
Freq: 150L
Data columns (total 10 columns):
ax    576001 non-null float32
ay    576001 non-null float32
az    576001 non-null float32
qw    576001 non-null float32
qx    576001 non-null float32
qy    576001 non-null float32
qz    576001 non-null float32
wx    576001 non-null float32
wy    576001 non-null float32
wz    576001 non-null float32
dtypes: float32(10)
memory usage: 26.4 MB
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 576001 entries, 2019-01-01 00:00:00+00:00 to 2019-01-02 00:00:00+00:00
Freq: 150L
Data columns (total 10 columns):
ax    576001 non-null float32
ay    576001 non-null float32
az    576001 non-null float32
qw    576001 non-null float32
qx    576001 non-null float32
qy    576001 non-null float32
qz    576001 non-null float32
wx    576001 non-null float32
wy    576001 non-null float32
wz    576001 non-null float32
dtypes: float32(10)
memory usage: 46.4 MB                                  <------------ increased!
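
As a sanity check on the "before" figures in the log, the following sketch recomputes them from the row counts. It assumes each row costs 40 bytes of column data (ten float32 values) plus 8 bytes for one index timestamp, and that the "MB" printed by df.info() means 2**20 bytes. The names ROWS, BYTES_PER_ROW and expected_mb are made up for this illustration.

```python
# Row counts copied from the DatetimeIndex lines in the log above.
ROWS = {"40ms": 2160001, "80ms": 1080001, "90ms": 960001, "150ms": 576001}

# Assumption: ten float32 columns (4 bytes each) plus one 8-byte index value.
BYTES_PER_ROW = 10 * 4 + 8


def expected_mb(rows):
    """Expected size in units of 2**20 bytes, rounded to one decimal."""
    return round(rows * BYTES_PER_ROW / 2**20, 1)


baseline = {freq: expected_mb(n) for freq, n in ROWS.items()}
print(baseline)
# → {'40ms': 98.9, '80ms': 49.4, '90ms': 43.9, '150ms': 26.4}
```

These values match the first df.info() call of every run, so under these assumptions the baseline is just the raw data plus the raw index, and anything reported on top of it after .loc or slicing is additional to both.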

Expected Output

The memory usage of the DataFrame should not change after .loc or slicing.
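
To narrow down where the extra memory is being counted, it may help to look at the index and the column data separately, since the total printed by df.info() includes both. The sketch below is only a diagnostic aid and does not establish the cause. memory_breakdown and make_frame are hypothetical helpers written for this example; DataFrame.memory_usage and Index.memory_usage are the existing pandas methods.

```python
import numpy as np
import pandas as pd


def memory_breakdown(df):
    """Return the reported bytes for the index and for the column data."""
    return {
        "index": int(df.index.memory_usage()),
        "columns": int(df.memory_usage(index=False).sum()),
    }


def make_frame(freq):
    """Build a frame shaped like the one in the report (10 float32 columns)."""
    index = pd.date_range("2019-01-01", "2019-01-02", freq=freq, tz="UTC")
    data = np.random.randn(len(index), 10).astype(np.float32)
    columns = ["ax", "ay", "az", "qw", "qx", "qy", "qz", "wx", "wy", "wz"]
    return pd.DataFrame(data, index=index, columns=columns)


df = make_frame("150ms")
before = memory_breakdown(df)
df[df.index.min(): df.index.max()]  # same slice as test2; result is discarded
after = memory_breakdown(df)

print("before:", before)
print("after: ", after)
```

If only the "index" entry differs between the two printouts, the growth is attributed to the index rather than to the float32 data. The exact numbers may differ between pandas versions.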

Output of pd.show_versions()

INSTALLED VERSIONS

commit : None
python : 3.7.3.final.0
python-bits : 64
OS : Darwin
OS-release : 19.2.0
machine : x86_64
processor : i386
byteorder : little
LC_ALL : None
LANG : None
LOCALE : en_AU.UTF-8

pandas : 0.25.3
numpy : 1.18.1
pytz : 2019.3
dateutil : 2.8.1
pip : 19.0.3
setuptools : 40.8.0
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : None
IPython : None
pandas_datareader: None
bs4 : None
bottleneck : None
fastparquet : None
gcsfs : None
lxml.etree : None
matplotlib : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pytables : None
s3fs : None
scipy : None
sqlalchemy : None
tables : None
xarray : None
xlrd : None
xlwt : None
xlsxwriter : None
