[ENH] Silhouette Plot: Add cosine distance#3176

Merged
ajdapretnar merged 8 commits into biolab:master from lanzagar:silhouette-distances on Aug 6, 2018

@lanzagar
Contributor

Added cosine distance to Silhouette Plot.
Handle nan values in the computed dist matrix (e.g. in case of all-zero vectors for cosine) by omitting instances and showing a warning.

Includes
  • Code changes
  • Tests
  • Documentation
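
The nan handling described above can be sketched with plain NumPy. This is only an illustration under stated assumptions, not the widget's actual code: `cosine_distance_matrix` and `drop_undefined_rows` are hypothetical helpers, and an instance is treated as undefined when all of its off-diagonal distances are nan (which assumes at least two instances).

```python
import numpy as np

def cosine_distance_matrix(X):
    """Pairwise cosine distance; all-zero rows produce nan entries."""
    X = np.asarray(X, dtype=float)
    norms = np.linalg.norm(X, axis=1)
    with np.errstate(divide="ignore", invalid="ignore"):
        unit = X / norms[:, None]      # 0/0 -> nan for all-zero rows
        return 1.0 - unit @ unit.T

def drop_undefined_rows(dist):
    """Return (filtered matrix, boolean mask of the instances that were kept)."""
    n = len(dist)
    off_diagonal = ~np.eye(n, dtype=bool)
    nan_count = (np.isnan(dist) & off_diagonal).sum(axis=1)
    # An instance is dropped when its distance to every other instance is nan.
    valid = nan_count < (n - 1)
    return dist[np.ix_(valid, valid)], valid

# The second row is an all-zero vector, so its cosine distance is undefined.
X = np.array([[1.0, 0.0], [0.0, 0.0], [1.0, 1.0]])
dist = cosine_distance_matrix(X)
clean, kept = drop_undefined_rows(dist)

n_omitted = int((~kept).sum())
if n_omitted:
    print(f"Warning: {n_omitted} instance(s) omitted (undefined distances)")
```

Filtering on off-diagonal entries rather than on "any nan in the row" matters here: a single all-zero vector puts a nan into every other row, so the stricter test would discard all instances.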

@lanzagar lanzagar changed the title Silhouette Plot: Add cosine distance [ENH] Silhouette Plot: Add cosine distance Jul 31, 2018
@codecov-io

codecov-io commented Jul 31, 2018

Codecov Report

Merging #3176 into master will increase coverage by 0.16%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master    #3176      +/-   ##
==========================================
+ Coverage   82.48%   82.64%   +0.16%     
==========================================
  Files         336      342       +6     
  Lines       58338    59016     +678     
==========================================
+ Hits        48118    48774     +656     
- Misses      10220    10242      +22

@ales-erjavec
Contributor

When Cosine distance is selected, can you check the input domain to ensure it has no discrete columns?

Either show an error and stop, or show a warning and drop them from the domain before computing the distance.

The way Cosine treats discrete columns means that it implicitly depends on variable reuse™, so it can produce different results depending on the history and order of loaded data.

For instance, using the two attached files
discrete-confound-a.tab.txt
discrete-confound-b.tab.txt

$ cat discrete-confound-a.tab
A	B	C
d	d	d
		class
a1	b1	+
a1	b2	+
a3	b3	-
a1	b2	-
a2	b3	+
a3	b4	-
$ cat discrete-confound-b.tab
A	B	C
d	d	d
		class
a0	a1	+
a1	a2	+
a3	a3	-
a1	b2	-
a2	b3	+
a3	b4	-

compare

import Orange
A = Orange.data.Table("discrete-confound-a.tab")
print(Orange.distance.Cosine(A).round(3))

which prints:

[[0.      nan   nan   nan   nan   nan]
 [  nan 0.    0.293 0.    0.293 0.293]
 [  nan 0.293 0.    0.293 0.    0.   ]
 [  nan 0.    0.293 0.    0.293 0.293]
 [  nan 0.293 0.    0.293 0.    0.   ]
 [  nan 0.293 0.    0.293 0.    0.   ]]

and

import Orange
B = Orange.data.Table("discrete-confound-b.tab")
A = Orange.data.Table("discrete-confound-a.tab")
print(Orange.distance.Cosine(A).round(3))

that produces

[[0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0.]]

@lanzagar
Contributor Author

lanzagar commented Aug 1, 2018

> The way that Cosine treats discrete columns means that it implicitly depends on the variable reuse™ meaning it can produce different results depending on the history and order of loaded data

Not just that: it seems to me that the way Cosine currently treats discrete columns is just plain wrong. It basically distinguishes only between the first value and all other values (which it treats as equal).
I am not sure whether this is wanted in some circumstances, or was just a first approach to make it work that was never revisited. Maybe @janezd remembers something about this?
What I find strange is that it explicitly implements discrete_to_indicators to do this instead of, e.g., simply calling Continuize on the data. Maybe due to performance concerns? Anyway, we should probably either change supports_discrete to False or switch to one-hot encoding (or something even better?).
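
To make the difference concrete, here is a minimal NumPy sketch comparing the "first value vs. all other values" treatment with one-hot encoding for a single categorical feature. It does not call Orange; the integer codes and the `cosine_distance` helper are assumptions made for illustration only.

```python
import numpy as np

def cosine_distance(u, v):
    """Cosine distance between two non-zero vectors."""
    return 1.0 - np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

# Two instances of one categorical feature with three possible values,
# stored as integer codes 0, 1, 2. The instances hold different values.
codes = np.array([1, 2])

# Clipping: the first value maps to 0, every other value maps to 1.
clipped = np.clip(codes, 0, 1).astype(float).reshape(-1, 1)

# One-hot encoding: one indicator column per value.
one_hot = np.eye(3)[codes]

d_clipped = cosine_distance(clipped[0], clipped[1])
d_one_hot = cosine_distance(one_hot[0], one_hot[1])

print("clipped:", d_clipped)   # the two different values look identical
print("one-hot:", d_one_hot)   # the two different values are orthogonal
```

With clipping, both instances collapse to the same indicator, so they appear identical; with one-hot encoding they end up as far apart as cosine distance allows for non-negative vectors.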

Currently discrete features are only clipped (i.e. first value, all
other values) which can give misleading results. Until better handling
of discrete features, Cosine should say it does not support them so that
a warning is displayed when using it.
@lanzagar
Contributor Author

lanzagar commented Aug 3, 2018

I have changed Cosine to no longer advertise support for categorical features until this is resolved. This affects the Distances widget too, since everything above applies there as well. It now shows a warning and ignores categorical features for Cosine distance.
I added the same warning in Silhouette Plot for metrics that do not support categorical features.
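
The warn-and-ignore behaviour can be sketched roughly as follows. This is not the widget code: `continuous_only`, its parameters, and the warning text are hypothetical, and the data is a plain NumPy array with a per-column flag rather than an Orange table.

```python
import warnings
import numpy as np

def continuous_only(X, is_discrete, metric_supports_discrete):
    """Drop categorical columns (with a warning) if the metric cannot use them."""
    X = np.asarray(X)
    is_discrete = np.asarray(is_discrete, dtype=bool)
    if metric_supports_discrete or not is_discrete.any():
        return X
    warnings.warn("Categorical features are ignored by the selected metric.")
    return X[:, ~is_discrete]

# Three columns; the middle one is flagged as categorical (hypothetical data).
X = np.array([[1.0, 0.0, 2.0],
              [0.5, 1.0, 1.0]])
is_discrete = [False, True, False]

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    filtered = continuous_only(X, is_discrete, metric_supports_discrete=False)

print(filtered.shape, [str(w.message) for w in caught])
```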

@lanzagar lanzagar added this to the 3.15 milestone Aug 3, 2018
@ajdapretnar ajdapretnar merged commit e47dc90 into biolab:master Aug 6, 2018
@lanzagar lanzagar deleted the silhouette-distances branch March 14, 2022 14:42