Fix memory use in stats() for dtype=object #5702
Codecov Report
@@           Coverage Diff           @@
##           master    #5702   +/-   ##
=======================================
  Coverage   86.12%   86.12%
=======================================
  Files         315      315
  Lines       66074    66074
=======================================
+ Hits        56905    56906    +1
+ Misses       9169     9168    -1
The failed tests are unrelated. For dramatic effect, from Sentry ORANGE3-311:
Before, .astype("str") was applied to the input array. This created an
output array of fixed-length strings, each the size of the longest string
in the table. Converting numpy object arrays to string arrays should
therefore be avoided.
d0ae501 to 5dc49d1
Why were empty strings used to denote unknown values in object arrays in Orange? Does anyone remember why we decided that? Who relies on it? Would it be hard to change? Why not just use
I was also surprised by that when I discovered it some time ago. As I understand it, empty strings are used this way only for string variables. Actually, I think we should use np.nan, as is done for other data types. An empty string can be a legitimate value: the value is known and simply happens to be empty (maybe that is not useful in practice, but it can happen). Currently, an unknown value is indistinguishable from an empty string.
Issue
Before, .astype("str") was applied to the input array. This created an output array of fixed-length strings of the size of the longest string in the table. Converting numpy object arrays to string arrays should be, therefore, avoided.
This bug stems from #3722, which fixed #3671, but introduced this issue. :)
ORANGE3-311 on Sentry.
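To illustrate the memory blow-up described above, here is a minimal sketch. The data is made up for illustration and is not taken from the Sentry report: 999 one-character strings plus one string of 10,000 characters.

```python
import numpy as np

# Hypothetical data: many short strings and a single very long one.
values = np.array(["a"] * 999 + ["x" * 10_000], dtype=object)

# Converting to a string array picks a fixed-width Unicode dtype wide
# enough for the longest element, so every row pays for 10,000 characters.
as_str = values.astype(str)

print(as_str.dtype.kind)      # "U" (fixed-width Unicode)
print(as_str.dtype.itemsize)  # 40000 bytes per element (4 bytes per character)
print(as_str.nbytes)          # 40000000 bytes for the whole array
print(values.nbytes)          # only the object pointers, one per element
```

The object array stores just one pointer per element, while the converted array reserves 40,000 bytes for each of the 1,000 rows, which is why a single long string in a large table can exhaust memory.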
Description of changes
A combination of pandas.isnull and direct comparisons is used instead.
Includes
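The combination of pandas.isnull and direct comparisons mentioned under "Description of changes" might look roughly like the sketch below. This is not the code from this PR: the helper name count_missing is hypothetical, and treating empty strings as unknown is an assumption based on the discussion above.

```python
import numpy as np
import pandas as pd


def count_missing(values):
    """Count unknown entries in an object array without a string conversion.

    Hypothetical helper for illustration; not part of Orange.
    """
    values = np.asarray(values, dtype=object)
    # pd.isnull works element-wise on object arrays and flags None and NaN.
    is_null = pd.isnull(values)
    # A direct comparison catches empty strings, which (by assumption here)
    # also denote unknown values for string variables.
    is_empty = values == ""
    return int(np.sum(is_null | is_empty))


column = np.array(["a", "", None, np.nan, "b"], dtype=object)
print(count_missing(column))  # 3
```

Both masks are plain boolean arrays with one byte per element, so the cost no longer depends on the length of the longest string.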