doc/visual-programming/source/widgets/unsupervised/hierarchicalclustering.md

Groups items using a hierarchical clustering algorithm.

- Selected Data: instances selected from the plot
- Data: data with an additional column showing whether an instance is selected

The widget computes [hierarchical clustering](https://en.wikipedia.org/wiki/Hierarchical_clustering) of arbitrary types of objects from a matrix of distances and shows a corresponding [dendrogram](https://en.wikipedia.org/wiki/Dendrogram). Distances can be computed with the [Distances](../unsupervised/distances.md) widget.

1. The widget supports the following ways of measuring distances between clusters:
   - **Single linkage** computes the distance between the closest elements of the two clusters
   - **Average linkage** computes the average distance between elements of the two clusters
   - **Weighted linkage** uses the [WPGMA](http://research.amnh.org/~siddall/methods/day1.html) method
   - **Complete linkage** computes the distance between the clusters' most distant elements
   - **Ward linkage** computes the increase of the error sum of squares. In other words, [Ward's minimum variance criterion](https://en.wikipedia.org/wiki/Ward%27s_method) minimizes the total within-cluster variance.
2. Labels of nodes in the dendrogram can be chosen in the **Annotation** box.
3. Huge dendrograms can be pruned in the *Pruning* box by selecting the maximum depth of the dendrogram. This only affects the display, not the actual clustering.
4. The widget offers three different selection methods:
   - **Manual**: clicking inside the dendrogram selects a cluster. Multiple clusters can be selected by holding Ctrl/Cmd. Each selected cluster is shown in a different color and is treated as a separate cluster in the output.
   - **Height ratio**: clicking on the bottom or top ruler of the dendrogram places a cutoff line in the graph. Items to the right of the line are selected.
   - **Top N**: selects the given number of top nodes.
5. Use *Zoom* and scroll to zoom in or out.
6. The data can be automatically output on any change (*Send Automatically*) or, if the box isn't ticked, by pushing *Send Selection*.

To output the clusters, click on the ruler at the top or the bottom of the visualization. This places a cut-off line that defines the clusters.
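The linkage criteria listed above have direct counterparts in SciPy's `scipy.cluster.hierarchy` module (the widget need not use SciPy internally; this is a minimal sketch with made-up points, assuming SciPy is installed):

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist

# Hypothetical toy data: six points forming two well-separated groups.
points = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                   [5.0, 5.0], [5.1, 4.9], [4.9, 5.2]])

# Condensed distance matrix, playing the role of the Distances widget.
dists = pdist(points, metric="euclidean")

# The widget's linkage options map onto SciPy's method names:
# single, average, weighted (WPGMA), complete, and ward.
for method in ("single", "average", "weighted", "complete", "ward"):
    Z = linkage(dists, method=method)
    # Cutting the dendrogram into two clusters recovers both groups.
    labels = fcluster(Z, t=2, criterion="maxclust")
    assert len(set(labels)) == 2
```

With groups this clearly separated every linkage method agrees; on real data the methods can produce quite different trees.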
Examples
--------
#### Cluster selection and projections
We start with the *Grades for English and Math* data set from the [Datasets](../data/datasets.md) widget. The data contains two numeric variables, grades for English and for Algebra.
**Hierarchical Clustering** requires a distance matrix on its input. We compute it with [Distances](../unsupervised/distances.md), where we use the *Euclidean* distance metric.
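As a rough stand-in for the Distances widget, such a Euclidean distance matrix can be computed with SciPy; the grade values below are invented for illustration:

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# Hypothetical students: columns are English and Algebra grades.
grades = np.array([[91.0, 89.0],
                   [51.0, 100.0],
                   [68.0, 76.0],
                   [35.0, 55.0]])

# pdist returns pairwise Euclidean distances in condensed
# (upper-triangle) form ...
condensed = pdist(grades, metric="euclidean")
# ... and squareform expands it to a full symmetric matrix.
square = squareform(condensed)
```

Hierarchical clustering consumes exactly this kind of pairwise matrix; the metric chosen in Distances (here Euclidean) determines its entries.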
Once the data is passed to **Hierarchical Clustering**, the widget displays a dendrogram, a tree-like clustering structure. Each leaf node represents an instance in the data set, in our case a student. Leaves are labelled with student names.
To create the clusters, we click on the ruler at the desired threshold. In this case, we chose three clusters. We pass those clusters to [MDS](../unsupervised/mds.md), which shows a 2D projection of data instances, colored by cluster label.
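Clicking the ruler corresponds, in code, to cutting the linkage tree; a sketch with invented grades and SciPy's `fcluster`, asking directly for three clusters:

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist

# Hypothetical grade pairs (English, Algebra) with three loose groups.
grades = np.array([[90.0, 92.0], [88.0, 95.0], [91.0, 90.0],   # strong overall
                   [85.0, 40.0], [88.0, 35.0], [90.0, 45.0],   # weak in Algebra
                   [40.0, 42.0], [35.0, 38.0], [45.0, 40.0]])  # weak overall

Z = linkage(pdist(grades, metric="euclidean"), method="ward")

# criterion="maxclust" cuts the dendrogram at the lowest height that
# yields at most three clusters, mirroring a click on the ruler.
labels = fcluster(Z, t=3, criterion="maxclust")
```

The resulting cluster labels are what a downstream projection such as MDS would use to color the points.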
#### Cluster explanation
In the second example, we continue with the *Grades for English and Math* data set. Say we wish to explain what characterizes the cluster with Maya, George, Lea, and Phill.
We select the cluster in the dendrogram and pass the entire data set to [Box Plot](../visualize/boxplot.md). Note that the connection here is *Data*, not *Selected Data*. To rewire the connection, double-click on it.
In **Box Plot**, we set the *Selected* variable as the Subgroup. This splits the plot into the selected data instances (our cluster) and the remaining data. Next, we use the *Order by relevance to subgroup* option, which sorts the variables by how well they distinguish between the subgroups. It turns out that our cluster contains students who are bad at math (they have low values of the Algebra variable).
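Orange's relevance ordering has its own definition; as a loose analogue (an assumption, not the widget's actual formula), each variable can be scored by the gap between the subgroup means relative to the variable's overall spread:

```python
import numpy as np

# Invented grades; "selected" marks the cluster of interest.
english = np.array([70.0, 72.0, 68.0, 75.0, 71.0, 69.0])
algebra = np.array([30.0, 35.0, 28.0, 80.0, 85.0, 78.0])
selected = np.array([True, True, True, False, False, False])

def separation(values, mask):
    """Gap between subgroup means, scaled by the overall spread.
    A rough relevance score, not Orange's exact criterion."""
    gap = abs(values[mask].mean() - values[~mask].mean())
    return gap / values.std()

scores = {"English": separation(english, selected),
          "Algebra": separation(algebra, selected)}
```

Sorting variables by such a score puts Algebra first, matching the Box Plot's conclusion that the cluster is characterized by low Algebra grades.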