Fix memory use in stats() for dtype=object by markotoplak · Pull Request #5702 · biolab/orange3

markotoplak · 2021-11-23T17:20:13Z

Issue

Before, .astype("str") was applied to the input array. This created an output array of fixed-length strings of the size of the longest string in the table. Converting numpy object arrays to string arrays should be, therefore, avoided.

This bug stems from #3722, which fixed #3671, but introduced this issue. :)

ORANGE3-311 on Sentry.

Description of changes

A combination of pandas.isnull and direct comparisons is used instead.

Includes

Code changes
Tests
Documentation

codecov · 2021-11-23T17:37:43Z

Codecov Report

Merging #5702 (5dc49d1) into master (3aae878) will increase coverage by 0.00%.
The diff coverage is 100.00%.

@@           Coverage Diff           @@
##           master    #5702   +/-   ##
=======================================
  Coverage   86.12%   86.12%           
=======================================
  Files         315      315           
  Lines       66074    66074           
=======================================
+ Hits        56905    56906    +1     
+ Misses       9169     9168    -1

markotoplak · 2021-11-23T18:09:22Z

The failed tests are unrelated.

For a dramatic effect, from Sentry ORANGE3-311:

  File "itemdelegates.py", line 573, in paint
    ratio = self.barFillRatioData(index)
  File "itemdelegates.py", line 539, in barFillRatioData
    return cast_(float, self.cachedData(index, self.barFillRatioRole))
  File "itemdelegates.py", line 291, in cachedData
    return self.__cache.data(index, role)
  File "itemdelegates.py", line 139, in data
    data[role] = model.data(index, role)
  File "owtable.py", line 78, in data
    dist = super().data(index, TableModel.VariableStatsRole)
  File "itemmodels.py", line 1000, in data
    return self._stats_for_column(col)
  File "itemmodels.py", line 1078, in _stats_for_column
    self.__stats = datacaching.getCached(
  File "datacaching.py", line 15, in getCached
    info[funct] = res = funct(*params, **kwparams)
  File "basic_stats.py", line 40, in __init__
    data._compute_basic_stats(include_metas=include_metas)]
  File "table.py", line 1634, in _compute_basic_stats
    rr.append(fast_stats(self.metas, W))
  File "util.py", line 375, in stats
    X_str = X.astype(str)
  numpy.core._exceptions._ArrayMemoryError:  Unable to allocate 109. GiB for an array with shape (47323, 19) and data type <U32502

Before, .astype("str") was applied to the input array. This created output array of fixed-length strings of the size of the longest string in the table. Converting numpy object arrays to string arrays should be, therefore, avoided.

markotoplak · 2021-11-25T15:00:19Z

Why were empty strings used to denote unknown values in object arrays in Orange? Does anyone remember why did we decide that? Who relies on that?

Would it be hard to change? Why not just use None as an unknown value? See https://pandas.pydata.org/docs/reference/api/pandas.isnull.html

PrimozGodec · 2021-12-03T08:48:38Z

Why were empty strings used to denote unknown values in object arrays in Orange? Does anyone remember why did we decide that? Who relies on that?

Would it be hard to change? Why not just use None as an unknown value? See https://pandas.pydata.org/docs/reference/api/pandas.isnull.html

I was also surprised about that when I discovered that some time ago. As I understand they are used just for string variables. Actually, I think that we should use np.nan as it is used for other data types. The empty string can mean that we have information that is not unknown is just an empty string (maybe it is not useful in practice but it can happen), currently unknown is the same as an empty string.

Fix memory use in stats() for dtype=object

5dc49d1

Before, .astype("str") was applied to the input array. This created output array of fixed-length strings of the size of the longest string in the table. Converting numpy object arrays to string arrays should be, therefore, avoided.

markotoplak force-pushed the fix-stats-str branch from d0ae501 to 5dc49d1 Compare November 25, 2021 14:52

janezd assigned PrimozGodec Nov 26, 2021

PrimozGodec merged commit 2234de6 into biolab:master Dec 3, 2021

markotoplak deleted the fix-stats-str branch February 28, 2022 11:56

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Fix memory use in stats() for dtype=object#5702

Fix memory use in stats() for dtype=object#5702
PrimozGodec merged 1 commit intobiolab:masterfrom
markotoplak:fix-stats-str

markotoplak commented Nov 23, 2021

Uh oh!

codecov bot commented Nov 23, 2021 •

edited

Loading

Uh oh!

markotoplak commented Nov 23, 2021

Uh oh!

markotoplak commented Nov 25, 2021

Uh oh!

PrimozGodec commented Dec 3, 2021

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Uh oh!

Conversation

markotoplak commented Nov 23, 2021

Issue

Description of changes

Includes

Uh oh!

codecov bot commented Nov 23, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

markotoplak commented Nov 23, 2021

Uh oh!

markotoplak commented Nov 25, 2021

Uh oh!

PrimozGodec commented Dec 3, 2021

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

codecov bot commented Nov 23, 2021 •

edited

Loading