Description
Code Sample
import numpy
import pandas


def test1(freq):
    """
    after using df.loc[], no matter how big/small the df is, the memory usage is
    increased, almost doubled
    """
    index = pandas.date_range('2019-01-01', '2019-01-02', freq=freq,
                              tz='UTC')
    df = pandas.DataFrame(numpy.random.randn(len(index), 10), index=index,
                          columns=['ax', 'ay', 'az', 'qw', 'qx', 'qy', 'qz',
                                   'wx', 'wy', 'wz'], dtype=numpy.float32)
    idx = pandas.date_range(start='2019-01-01 12:00:00',
                            end='2019-01-01 13:00:00',
                            freq=freq, tz='UTC')
    df.info()
    df.loc[idx] = 0
    df.info()


def test2(freq):
    """
    after using df[], rough observation:
    - df is less than around 50mb, the memory usage is increased
    - df is greater than 50mb, the memory usage is NOT increased
    """
    index = pandas.date_range('2019-01-01', '2019-01-02', freq=freq,
                              tz='UTC')
    df = pandas.DataFrame(numpy.random.randn(len(index), 10), index=index,
                          columns=['ax', 'ay', 'az', 'qw', 'qx', 'qy', 'qz',
                                   'wx', 'wy', 'wz'], dtype=numpy.float32)
    df.info()
    df[df.index.min(): df.index.max()]
    df.info()


if __name__ == '__main__':
    freqs = ['40ms', '80ms', '90ms', '150ms']
    for freq in freqs:
        print(f'\n=====freq is {freq} in loc test====\n')
        test1(freq)
    for freq in freqs:
        print(f'\n=====freq is {freq} in slice test====\n')
        test2(freq)
Problem description
The memory usage that df.info() reports for a DataFrame increases after a .loc assignment or a df[a:b] slice, even though the shape and dtypes are unchanged:
- After df.loc[idx] = 0, the memory usage increases no matter how big or small the df is; it almost doubles.
- After df[a:b], rough observation:
  - when the df is below roughly 50 MB (43.9 MB and 26.4 MB in the log), the memory usage increases
  - when the df is at or above roughly 50 MB (49.4 MB and 98.9 MB in the log), the memory usage does NOT increase

The log from the code above:
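To narrow down where the extra bytes are reported, one option is to split the total into the index and the data columns before and after the slice. The following is only a sketch: `memory_breakdown` is a hypothetical helper name, not part of pandas, and whether the index figure grows may depend on the pandas version, so no particular result is claimed here.

```python
import numpy as np
import pandas as pd


def memory_breakdown(df):
    """Return (index_bytes, column_bytes) as reported by DataFrame.memory_usage."""
    usage = df.memory_usage(index=True)
    index_bytes = int(usage["Index"])
    column_bytes = int(usage.drop("Index").sum())
    return index_bytes, column_bytes


# Same shape as the 150ms case in the report: 576001 rows, 10 float32 columns.
index = pd.date_range("2019-01-01", "2019-01-02", freq="150ms", tz="UTC")
df = pd.DataFrame(
    np.random.randn(len(index), 10),
    index=index,
    columns=["ax", "ay", "az", "qw", "qx", "qy", "qz", "wx", "wy", "wz"],
    dtype=np.float32,
)

before = memory_breakdown(df)
df[df.index.min(): df.index.max()]  # the slice from test2; result is discarded
after = memory_breakdown(df)

print("index bytes  before/after:", before[0], after[0])
print("column bytes before/after:", before[1], after[1])
```

If the column figure is identical in both measurements, any growth is being attributed to the index rather than to the float32 data.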
=====freq is 40ms in loc test====
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 2160001 entries, 2019-01-01 00:00:00+00:00 to 2019-01-02 00:00:00+00:00
Freq: 40L
Data columns (total 10 columns):
ax float32
ay float32
az float32
qw float32
qx float32
qy float32
qz float32
wx float32
wy float32
wz float32
dtypes: float32(10)
memory usage: 98.9 MB
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 2160001 entries, 2019-01-01 00:00:00+00:00 to 2019-01-02 00:00:00+00:00
Freq: 40L
Data columns (total 10 columns):
ax float32
ay float32
az float32
qw float32
qx float32
qy float32
qz float32
wx float32
wy float32
wz float32
dtypes: float32(10)
memory usage: 178.9 MB <------------ increased!
=====freq is 80ms in loc test====
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 1080001 entries, 2019-01-01 00:00:00+00:00 to 2019-01-02 00:00:00+00:00
Freq: 80L
Data columns (total 10 columns):
ax 1080001 non-null float32
ay 1080001 non-null float32
az 1080001 non-null float32
qw 1080001 non-null float32
qx 1080001 non-null float32
qy 1080001 non-null float32
qz 1080001 non-null float32
wx 1080001 non-null float32
wy 1080001 non-null float32
wz 1080001 non-null float32
dtypes: float32(10)
memory usage: 49.4 MB
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 1080001 entries, 2019-01-01 00:00:00+00:00 to 2019-01-02 00:00:00+00:00
Freq: 80L
Data columns (total 10 columns):
ax 1080001 non-null float32
ay 1080001 non-null float32
az 1080001 non-null float32
qw 1080001 non-null float32
qx 1080001 non-null float32
qy 1080001 non-null float32
qz 1080001 non-null float32
wx 1080001 non-null float32
wy 1080001 non-null float32
wz 1080001 non-null float32
dtypes: float32(10)
memory usage: 89.4 MB <------------ increased!
=====freq is 90ms in loc test====
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 960001 entries, 2019-01-01 00:00:00+00:00 to 2019-01-02 00:00:00+00:00
Freq: 90L
Data columns (total 10 columns):
ax 960001 non-null float32
ay 960001 non-null float32
az 960001 non-null float32
qw 960001 non-null float32
qx 960001 non-null float32
qy 960001 non-null float32
qz 960001 non-null float32
wx 960001 non-null float32
wy 960001 non-null float32
wz 960001 non-null float32
dtypes: float32(10)
memory usage: 43.9 MB
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 960001 entries, 2019-01-01 00:00:00+00:00 to 2019-01-02 00:00:00+00:00
Freq: 90L
Data columns (total 10 columns):
ax 960001 non-null float32
ay 960001 non-null float32
az 960001 non-null float32
qw 960001 non-null float32
qx 960001 non-null float32
qy 960001 non-null float32
qz 960001 non-null float32
wx 960001 non-null float32
wy 960001 non-null float32
wz 960001 non-null float32
dtypes: float32(10)
memory usage: 83.9 MB <------------ increased!
=====freq is 150ms in loc test====
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 576001 entries, 2019-01-01 00:00:00+00:00 to 2019-01-02 00:00:00+00:00
Freq: 150L
Data columns (total 10 columns):
ax 576001 non-null float32
ay 576001 non-null float32
az 576001 non-null float32
qw 576001 non-null float32
qx 576001 non-null float32
qy 576001 non-null float32
qz 576001 non-null float32
wx 576001 non-null float32
wy 576001 non-null float32
wz 576001 non-null float32
dtypes: float32(10)
memory usage: 26.4 MB
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 576001 entries, 2019-01-01 00:00:00+00:00 to 2019-01-02 00:00:00+00:00
Freq: 150L
Data columns (total 10 columns):
ax 576001 non-null float32
ay 576001 non-null float32
az 576001 non-null float32
qw 576001 non-null float32
qx 576001 non-null float32
qy 576001 non-null float32
qz 576001 non-null float32
wx 576001 non-null float32
wy 576001 non-null float32
wz 576001 non-null float32
dtypes: float32(10)
memory usage: 46.4 MB <------------ increased!
=====freq is 40ms in slice test====
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 2160001 entries, 2019-01-01 00:00:00+00:00 to 2019-01-02 00:00:00+00:00
Freq: 40L
Data columns (total 10 columns):
ax float32
ay float32
az float32
qw float32
qx float32
qy float32
qz float32
wx float32
wy float32
wz float32
dtypes: float32(10)
memory usage: 98.9 MB
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 2160001 entries, 2019-01-01 00:00:00+00:00 to 2019-01-02 00:00:00+00:00
Freq: 40L
Data columns (total 10 columns):
ax float32
ay float32
az float32
qw float32
qx float32
qy float32
qz float32
wx float32
wy float32
wz float32
dtypes: float32(10)
memory usage: 98.9 MB <------------ unchanged!
=====freq is 80ms in slice test====
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 1080001 entries, 2019-01-01 00:00:00+00:00 to 2019-01-02 00:00:00+00:00
Freq: 80L
Data columns (total 10 columns):
ax 1080001 non-null float32
ay 1080001 non-null float32
az 1080001 non-null float32
qw 1080001 non-null float32
qx 1080001 non-null float32
qy 1080001 non-null float32
qz 1080001 non-null float32
wx 1080001 non-null float32
wy 1080001 non-null float32
wz 1080001 non-null float32
dtypes: float32(10)
memory usage: 49.4 MB
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 1080001 entries, 2019-01-01 00:00:00+00:00 to 2019-01-02 00:00:00+00:00
Freq: 80L
Data columns (total 10 columns):
ax 1080001 non-null float32
ay 1080001 non-null float32
az 1080001 non-null float32
qw 1080001 non-null float32
qx 1080001 non-null float32
qy 1080001 non-null float32
qz 1080001 non-null float32
wx 1080001 non-null float32
wy 1080001 non-null float32
wz 1080001 non-null float32
dtypes: float32(10)
memory usage: 49.4 MB <------------ unchanged!
=====freq is 90ms in slice test====
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 960001 entries, 2019-01-01 00:00:00+00:00 to 2019-01-02 00:00:00+00:00
Freq: 90L
Data columns (total 10 columns):
ax 960001 non-null float32
ay 960001 non-null float32
az 960001 non-null float32
qw 960001 non-null float32
qx 960001 non-null float32
qy 960001 non-null float32
qz 960001 non-null float32
wx 960001 non-null float32
wy 960001 non-null float32
wz 960001 non-null float32
dtypes: float32(10)
memory usage: 43.9 MB
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 960001 entries, 2019-01-01 00:00:00+00:00 to 2019-01-02 00:00:00+00:00
Freq: 90L
Data columns (total 10 columns):
ax 960001 non-null float32
ay 960001 non-null float32
az 960001 non-null float32
qw 960001 non-null float32
qx 960001 non-null float32
qy 960001 non-null float32
qz 960001 non-null float32
wx 960001 non-null float32
wy 960001 non-null float32
wz 960001 non-null float32
dtypes: float32(10)
memory usage: 83.9 MB <------------ increased!
=====freq is 150ms in slice test====
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 576001 entries, 2019-01-01 00:00:00+00:00 to 2019-01-02 00:00:00+00:00
Freq: 150L
Data columns (total 10 columns):
ax 576001 non-null float32
ay 576001 non-null float32
az 576001 non-null float32
qw 576001 non-null float32
qx 576001 non-null float32
qy 576001 non-null float32
qz 576001 non-null float32
wx 576001 non-null float32
wy 576001 non-null float32
wz 576001 non-null float32
dtypes: float32(10)
memory usage: 26.4 MB
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 576001 entries, 2019-01-01 00:00:00+00:00 to 2019-01-02 00:00:00+00:00
Freq: 150L
Data columns (total 10 columns):
ax 576001 non-null float32
ay 576001 non-null float32
az 576001 non-null float32
qw 576001 non-null float32
qx 576001 non-null float32
qy 576001 non-null float32
qz 576001 non-null float32
wx 576001 non-null float32
wy 576001 non-null float32
wz 576001 non-null float32
dtypes: float32(10)
memory usage: 46.4 MB <------------ increased!
Expected Output
The memory usage of the DataFrame should not change after a .loc assignment or slicing.
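As a rough sanity check on what "no change" means, the "before" figures in the log match a simple estimate: each row holds 10 float32 values (4 bytes each) plus one 8-byte timestamp in the index. The sketch below reproduces those numbers, assuming df.info() reports binary megabytes (bytes divided by 2**20); `expected_bytes` and `as_mib` are illustrative names only.

```python
def expected_bytes(rows, n_float32_cols=10):
    """Estimate raw storage: float32 columns plus an 8-byte datetime index entry."""
    return rows * (n_float32_cols * 4 + 8)


def as_mib(n_bytes):
    """Convert bytes to binary megabytes, rounded to one decimal place."""
    return round(n_bytes / 2**20, 1)


# Row counts taken from the log above (40ms, 80ms, 90ms, 150ms).
for rows in (2160001, 1080001, 960001, 576001):
    print(rows, as_mib(expected_bytes(rows)))
```

These estimates come out to 98.9, 49.4, 43.9 and 26.4, the same as the first df.info() call in each test, so the larger "after" figures are memory reported on top of the raw data and index values.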
Output of pd.show_versions()
INSTALLED VERSIONS
commit : None
python : 3.7.3.final.0
python-bits : 64
OS : Darwin
OS-release : 19.2.0
machine : x86_64
processor : i386
byteorder : little
LC_ALL : None
LANG : None
LOCALE : en_AU.UTF-8
pandas : 0.25.3
numpy : 1.18.1
pytz : 2019.3
dateutil : 2.8.1
pip : 19.0.3
setuptools : 40.8.0
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : None
IPython : None
pandas_datareader: None
bs4 : None
bottleneck : None
fastparquet : None
gcsfs : None
lxml.etree : None
matplotlib : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pytables : None
s3fs : None
scipy : None
sqlalchemy : None
tables : None
xarray : None
xlrd : None
xlwt : None
xlsxwriter : None