[ENH] Scatter Plot Graph and Scaling can handle metas#2699
[ENH] Scatter Plot Graph and Scaling can handle metas#2699lanzagar merged 5 commits intobiolab:masterfrom jerneju:spg-scaling-metas
Conversation
janezd
left a comment
There was a problem hiding this comment.
This is so much simpler. All widgets should use models for combos/lists of variables.
Orange/widgets/utils/scaling.py
Outdated
| no_jit[index] /= 2 * len(attr.values) | ||
| c *= 2 | ||
| c += 1 | ||
| c /= 2 * len(attr.values) |
There was a problem hiding this comment.
Isn't (2x+1)/(2a) same as (x + 0.5)/a? Aren't the above three lines the same as
c += 0.5
c /= len(attr.values)
After rewriting it this way, I also understand what the code actually does. :) Moves the data into the middle of intervals and normalizes it to [0, 1].
I know this was here before, but only God knows why it was written like this.
Orange/widgets/utils/scaling.py
Outdated
| random = np.random.RandomState(seed=self.jitter_seed) | ||
| for index, col in enumerate(self.jittered_data): | ||
| domain = self.domain | ||
| for attr in domain.attributes + domain.class_vars + domain.metas: |
There was a problem hiding this comment.
You can use chain(domain, domain.metas) in such situations.
Orange/widgets/utils/scaling.py
Outdated
| self.data = data | ||
| domain = data.domain | ||
| self.primitives = data.domain.attributes + data.domain.class_vars + \ | ||
| tuple([v for v in data.domain.metas if v.is_primitive()]) |
There was a problem hiding this comment.
You could use chain(data.domain, (v for v in data.domain.metas if v.is_primitive())).
If you keep it as it is: don't use tuple([...]): just use generators, tuple(...), without constructing a list first.
Orange/widgets/utils/scaling.py
Outdated
| tuple([v for v in data.domain.metas if v.is_primitive()]) | ||
| new_domain = Domain(attributes=domain.attributes, | ||
| class_vars=domain.class_vars, | ||
| metas=tuple([v for v in domain.metas if v.is_primitive()])) |
There was a problem hiding this comment.
Same - no need to construct a list first.
Orange/widgets/utils/scaling.py
Outdated
| col = self.jittered_data.get_column_view(attr)[0] | ||
| col = 1 - col | ||
| col = self.scaled_data.get_column_view(attr)[0] | ||
| col = 1 - col |
There was a problem hiding this comment.
These four lines have no effect, do they?
Did you mean something like this?
col *= -1
col += 1
Or (shorter, but maybe slower) col[:] = 1 - col.
There was a problem hiding this comment.
Yes, the latter one can consume even 4 times more time.
|
|
||
| attributes = self.data.domain.attributes + (self.variable_x, self.variable_y) + \ | ||
| primitive_metas | ||
| attributes = data.domain.attributes + (self.variable_x, self.variable_y) |
There was a problem hiding this comment.
@BlazZupan, do we want the columns that contain x and y coordinates from the MDS to appear as ordinary attributes or as metas?
They used to be in metas, but now that visualization widgets can show metas as well, I would tend to store x and y as metas to not pollute that feature set.
There was a problem hiding this comment.
Actually this line here is not related to that (it's for constructing data for plotting).
Output data has new coordinates in metas as we want, see commit function below (L671)
Orange/widgets/utils/scaling.py
Outdated
| attr = self.domain[index] | ||
| getCached(self.data, DomainBasicStats, (self.data, True)) | ||
| domain = self.domain | ||
| for attr in domain.attributes + domain.class_vars + domain.metas: |
There was a problem hiding this comment.
Maybe chain(domain, domain.metas)?
There was a problem hiding this comment.
To be sincere, I would prefer domain.all_variables or something similar. Like we have domain.variables = domain.attributes + domain.class_vars we could have something that includes all three: attributes, class_vars and metas.
There was a problem hiding this comment.
I hate this, too.
When one gets older, (s)he start adding historic notes to everything. So let me say that in the beginning, like 20 years ago, Orange used to have attributes and classes, and both together were called variables. Then I introduced meta attributes, and they didn't mean the same thing as today: meta attributes were (essentially) sparse, while attributes and classes were dense ("essentially", because there was no numpy, these were C++ structures). Since all code - before the introduction of metas - expected that variables were attributes + classes, I would break everything if I added metas there. It also made more sense than today since variables (=attributes + classes) were the dense part of the data.
The idea that any part can be dense or sparse, and that metas are not so different from attributes and so forth evolved gradually even in Orange 3. Hence we have not initially decided that variables will be attributes + class + meta. To make things worse, we (most probably: I) extended this bad design to iteration over Domain. @kernc (I think) correctly pointed out that since var in domain returns True also when var is in domain.metas, so for var in domain should iterate over attributes, classes and also metas. So in all the places where you could use chain(domain, domain.metas) in your code, just domain should suffice instead. Alas, fixing the iteration over Domain would break so much code that we gave up. This horrible design decision stays forever, I fear.
I don't like all_variables. variables should already be all, so all_variables sounds almost like really_all_variables. Plus, the next thing you'll want will be primitive_variables. And maybe also continuous_variables. I wouldn't like to multiply the number of domains properties.
There was a problem hiding this comment.
For the sake of brainstorming - and if this is indeed necessary - what about having a method Domain.all(condition). If condition is a type derived from Variable, all returns all variables of that type (and derived types). Hence , Domain.all(DiscreteVariable) returns all discrete variables. We then add PrimitiveVariable as a new common base class for DiscreteVariable and ContinuousVariable. With that, Domain.all(PrimitiveVariable) returns all primitive variables. Finally, the default value for condition is Variable, so Domain.all() returns attributes + class_vars + metas.
Method all could also accept other types of conditions if we can think of any.
Just for brainstorming.
There was a problem hiding this comment.
Great idea, I implemented it.
Orange/widgets/utils/scaling.py
Outdated
| return np.array([], np.bool) | ||
| domain = self.domain | ||
| indices = [] | ||
| for index, attr in enumerate(domain.attributes + domain.class_vars + domain.metas): |
There was a problem hiding this comment.
enumerate(chain(domain, domain.metas))?
Orange/data/domain.py
Outdated
| return self._metas | ||
|
|
||
| def all(self, condition=Variable): | ||
| return [v for v in chain(self._variables, self._metas) if isinstance(v, condition)] |
There was a problem hiding this comment.
.all() is not good. The idiom means something completely different in Python/NumPy context. .filter() would be better, .where() would be best.
This change should require re-touching everywhere the former ([]+[] or chain) idiom was used.
There was a problem hiding this comment.
I don't like where. :)
all reads well: I want all discrete variables, hence domain.all(DiscreteVariable). But I agree the name is wrong wrt its usual meaning in Python.
domain.where(DiscreteVariable) doesn't read so well and it also means two different things in numpy, one of which is similar to this Domain.where (but with different arguments) and one which is completely different.
domain.filter(DiscreteVariable) is sort of fine, but it gives an impression that it will return a new instance of Domain with just discrete variables.
What about domain.collect(DiscreteVariable)? Or domain.pick(DiscreteVariable) -- though I prefer collect?
Codecov Report
@@ Coverage Diff @@
## master #2699 +/- ##
==========================================
- Coverage 76.06% 76.05% -0.02%
==========================================
Files 338 338
Lines 59675 59670 -5
==========================================
- Hits 45391 45381 -10
- Misses 14284 14289 +5 |
Orange/data/domain.py
Outdated
| def metas(self): | ||
| return self._metas | ||
|
|
||
| def where(self, condition=Variable): |
There was a problem hiding this comment.
Could you do this in a separate PR?
While I agree that this change is neatly hidden here, I would prefer to have modifications to Variables / Domain in a separate, clearly marked PR (where we can discuss different alternatives).
Orange/widgets/utils/scaling.py
Outdated
| attr = self.domain[index] | ||
| getCached(self.data, DomainBasicStats, (self.data, True)) | ||
| domain = self.domain | ||
| for attr in domain.where(): |
| cols = [] | ||
| for var in attrs: | ||
| cols.append(graph.jittered_data.get_column_view(var)[0][valid]) | ||
| X = np.vstack(cols).T |
| all_data = (data.X, Y, data.metas) | ||
| else: | ||
| all_data = (data.X, Y) | ||
| all_data = np.hstack(all_data).T |
There was a problem hiding this comment.
This can have dtype==object when stacking with metas which will cause np.isfinite to fail 2 lines below
There was a problem hiding this comment.
Fixed here, but I get similar errors elsewhere (see sentry report).
Try fixing all metas dtype problems (converting all primitive metas to float might be enough, maybe something else is also needed). Then please test thoroughly! Use at least the workflow from the sentry report.
There was a problem hiding this comment.
I found another error caused by Corpus (issue: biolab/orange3-text#324 )
[ENH] Scatter Plot Graph and Scaling can handle metas
Issue
Scatter Plot moves primitive meas to attributes.
Description of changes
Includes