### Feature Type

- Adding new functionality to pandas
- Changing existing functionality in pandas
- Removing existing functionality in pandas
### Problem Description

The `memory_usage()` method can be called to get information about the memory used by some pandas objects. However, in some cases the cached data are not included. For example, `MultiIndex.memory_usage()` includes memory used by:

- `levels`
- `codes`
- `names`
- `_engine` (if initialised)

but it does not consider:

- `_engine.values` (it could be included in `_engine.sizeof()`)
- `values` (cached in `_values`)
- `dtypes`
- a few other negligible cached properties
Example (using the current main branch):

```python
In [52]: idx = pd.MultiIndex.from_product([np.arange(100), np.arange(100), np.arange(100)], names=["x0", "x1", "x2"])

In [53]: list(idx._cache)
Out[53]: ['levels']

In [54]: idx.memory_usage(deep=True)
Out[54]: 3002553

In [55]: idx._engine.values.nbytes
Out[55]: 4000000

In [56]: idx._engine.sizeof(deep=True)
Out[56]: 0

In [57]: list(idx._cache)
Out[57]: ['levels', '_engine']

In [58]: idx.memory_usage(deep=True)
Out[58]: 3002553

In [111]: idx.values
Out[111]:
array([(0, 0, 0), (0, 0, 1), (0, 0, 2), ..., (99, 99, 97), (99, 99, 98),
       (99, 99, 99)], dtype=object)

In [112]: idx.memory_usage(deep=True)
Out[112]: 3002553

In [113]: getsizeof(idx.values[0]) * len(idx.values)
Out[113]: 64000000

In [114]: list(idx._cache)
Out[114]: ['levels', '_engine', '_values', 'nbytes']

In [115]: idx.memory_usage(deep=True)
Out[115]: 3002553

In [117]: idx.get_loc((99, 99, 99))
Out[117]: 999999

In [118]: idx.memory_usage(deep=True)
Out[118]: 3015057

In [133]: idx._engine.get_indexer(idx._engine.values[0:2])
Out[133]: array([0, 1])

In [134]: idx._engine.is_mapping_populated
Out[134]: True

In [135]: idx._engine.sizeof(deep=True)
Out[135]: 25428008

In [136]: idx.memory_usage(deep=True)
Out[136]: 28443065
```
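The session above can be condensed into a runnable sketch that quantifies the gap (exact byte counts vary by platform and pandas version; `per_element` is only a rough estimate of the cached object array's footprint):

```python
import sys
import numpy as np
import pandas as pd

idx = pd.MultiIndex.from_product(
    [np.arange(100), np.arange(100), np.arange(100)],
    names=["x0", "x1", "x2"],
)

# Counts levels/codes/names (plus the engine, if it is already cached).
reported = idx.memory_usage(deep=True)

# Accessing .values caches a 1,000,000-element object ndarray in _cache["_values"],
# but the figure returned by memory_usage(deep=True) does not grow to match.
values = idx.values
per_element = sys.getsizeof(values[0]) * len(values)  # rough size of the cached tuples

print(sorted(idx._cache))        # populated caches, including '_values'
print(reported, per_element)     # the cached tuples dwarf the reported figure
```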
### Feature Description

`memory_usage()` could accept an optional bool parameter `cache` with a default value of `False`:

```python
def memory_usage(self, deep: bool = False, cache: bool = False) -> int: ...
```

If `True`, it should also include the cached data. If `False`, it should keep the existing behaviour (although including the `_engine` data might not be the most intuitive choice once a `cache` parameter exists).
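A minimal sketch of what the `cache=True` accounting could look like. The helper name `cached_sizeof` and the traversal of `_cache` are assumptions for illustration, not existing pandas API:

```python
import sys
import numpy as np
import pandas as pd

def cached_sizeof(index: pd.Index, deep: bool = False) -> int:
    """Approximate the bytes held by an Index's populated caches.

    Hypothetical helper: memory_usage(..., cache=True) could add this
    on top of the figure it already reports.
    """
    total = 0
    for value in getattr(index, "_cache", {}).values():
        if isinstance(value, np.ndarray):
            total += value.nbytes
            if deep and value.dtype == object:
                # Count the Python objects the array references, as deep=True does.
                total += sum(sys.getsizeof(item) for item in value)
        elif hasattr(value, "sizeof"):  # e.g. the cached _engine
            total += value.sizeof(deep=deep)
    return total

idx = pd.MultiIndex.from_product([range(50), range(50)])
idx.get_loc((49, 49))  # populates _engine and its hash table
idx.values             # populates the _values object ndarray
extra = cached_sizeof(idx, deep=True)
```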
### Alternative Solutions

Alternatively, the signature of `memory_usage()` can remain the same, but the result should include the cached data. However, it may surprise the user if the result changes depending on which properties have been accessed (although this already happens for the `_engine`, and it can be documented).
### Additional Context

If `memory_usage` is used to inspect the memory usage of pandas objects, it would be better to return a value as close as possible to the memory actually used.