-
Notifications
You must be signed in to change notification settings - Fork 109
Open
Labels
Description
I ran into an out of memory error using the Matrix.get method in a case where I have a lot of zero-weighted observations that I would like to carry through history-matching as metadata. It looks like the error occurs when np.diag is called:
---------------------------------------------------------------------------
MemoryError Traceback (most recent call last)
Cell In[59], line 6
3 print(len(obs_cov.x))
5 # reduce cov down to only include non-zero obsverations
----> 6 obs_cov = obs_cov.get(row_names=pst.nnz_obs_names, col_names=pst.nnz_obs_names, )
File ~\Documents\Development\repos\GMDSI_notebooks\dependencies\pyemu\pyemu\mat\mat_handler.py:1686, in Matrix.get(self, row_names, col_names, drop)
1684 return Cov(x=extract, names=names, isdiagonal=self.isdiagonal)
1685 if self.isdiagonal:
-> 1686 extract = np.diag(self.__x[:, 0])
1687 else:
1688 extract = self.__x.copy()
File ~\AppData\Local\anaconda3\envs\gmdsitut\lib\site-packages\numpy\lib\twodim_base.py:293, in diag(v, k)
291 if len(s) == 1:
292 n = s[0]+abs(k)
--> 293 res = zeros((n, n), v.dtype)
294 if k >= 0:
295 i = k
MemoryError: Unable to allocate 110. GiB for an array with shape (121252, 121252) and data type float64
It seems modifying the method to use scipy.sparse.diags can offer a workaround:
def spget(self, row_names=None, col_names=None, drop=False):
if row_names is None and col_names is None:
raise Exception(
"Matrix.get(): must pass at least" + " row_names or col_names"
)
if row_names is not None and not isinstance(row_names, list):
row_names = [row_names]
if col_names is not None and not isinstance(col_names, list):
col_names = [col_names]
if isinstance(self, Cov) and (row_names is None or col_names is None):
if row_names is not None:
idxs = self.indices(row_names, axis=0)
names = row_names
else:
idxs = self.indices(col_names, axis=1)
names = col_names
if self.isdiagonal:
extract = self.__x[idxs].copy()
else:
extract = self.__x[idxs, :].copy()
extract = extract[:, idxs]
if drop:
self.drop(names, 0)
return Cov(x=extract, names=names, isdiagonal=self.isdiagonal)
if self.isdiagonal:
extract = sp.diags(self.__x[:, 0], offsets=0, format='csr') # switch from np to sp
else:
extract = self.__x.copy()
if row_names is not None:
row_idxs = self.indices(row_names, axis=0)
extract = extract[row_idxs, :] # this row is modified
if drop:
self.drop(row_names, axis=0)
else:
row_names = self.row_names
if col_names is not None:
col_idxs = self.indices(col_names, axis=1)
extract = extract[:, col_idxs] # this row is also modified
if drop:
self.drop(col_names, axis=1)
else:
col_names = copy.deepcopy(self.col_names)
return type(self)(x=extract.toarray(), row_names=row_names, col_names=col_names) # modified to use .toarray()
Comparing the methods using a smaller amount of observations (allowing the use of both get and spget):
obs_cov_reduced = obs_cov.get(row_names=pst.nnz_obs_names, col_names=pst.nnz_obs_names, )
obs_cov_reduced_sp = obs_cov.spget(row_names=pst.nnz_obs_names, col_names=pst.nnz_obs_names, )
print(np.array_equal(obs_cov_reduced.x, obs_cov_reduced_sp.x))
True
I wanted to bring this to attention because quite a lot of use-cases seem to rely on this method. Although I'm not sure whether this modification has any unforseen impact further down the line.
Reactions are currently unavailable