@@ -15,11 +15,13 @@ Selecting ``min_cluster_size``

The primary parameter to affect the resulting clustering is
``min_cluster_size``. Ideally this is a relatively intuitive parameter
-to select -- set it to the smallest size grouping that you sih to
+to select -- set it to the smallest size grouping that you wish to
consider a cluster. It can have slightly non-obvious effects however.
Let's consider the digits dataset from sklearn. We can project the data
into two dimensions to visualize it via t-SNE.

+.. code:: python
+
    digits = datasets.load_digits()
    data = digits.data
    projection = TSNE().fit_transform(data)
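
For reference, a self-contained version of that projection step might look
like the sketch below; the matplotlib plotting call is an assumption on our
part, added only to show one plausible way to produce a figure like the one
that follows.

.. code:: python

    import matplotlib.pyplot as plt
    from sklearn import datasets
    from sklearn.manifold import TSNE

    # Load the 64-dimensional digits data and embed it in 2D for visualization
    digits = datasets.load_digits()
    data = digits.data
    projection = TSNE().fit_transform(data)

    # Scatter the 2D embedding, coloured by the true digit label
    plt.scatter(*projection.T, s=10, c=digits.target, cmap='Spectral')
    plt.show()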
@@ -29,7 +31,7 @@ into two dimensions to visualize it via t-SNE.
.. image:: images/parameter_selection_3_1.png


-If we cluster this data in the full 64 dimensional space with hdbscan we
+If we cluster this data in the full 64 dimensional space with HDBSCAN\* we
can see some effects from varying the ``min_cluster_size``.

We start with a ``min_cluster_size`` of 15.
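
A minimal sketch of fitting at this setting, assuming ``data`` and
``projection`` from the snippet above and that the ``hdbscan`` library is
importable; the plotting step is again just one plausible way to view the
result.

.. code:: python

    import matplotlib.pyplot as plt
    import hdbscan

    # Cluster in the original 64-dimensional space; noise points get label -1
    labels = hdbscan.HDBSCAN(min_cluster_size=15).fit_predict(data)

    # Colour the t-SNE projection by cluster label to inspect the result
    plt.scatter(*projection.T, s=10, c=labels, cmap='Spectral')
    plt.show()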
@@ -52,7 +54,7 @@ We start with a ``min_cluster_size`` of 15.
Increasing the ``min_cluster_size`` to 30 reduces the number of
clusters, merging some together. This is a result of HDBSCAN\*
reoptimizing which flat clustering provides greater stability under a
-slightly different notion of what constitutes cluster.
+slightly different notion of what constitutes a cluster.

.. code:: python

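    # Sketch of this step (assumed, not the exact upstream code): re-fit with
    # the larger ``min_cluster_size`` and view the labels on the t-SNE
    # projection again. Assumes ``data``, ``projection``, ``hdbscan`` and
    # ``plt`` from the earlier snippets.
    labels = hdbscan.HDBSCAN(min_cluster_size=30).fit_predict(data)
    plt.scatter(*projection.T, s=10, c=labels, cmap='Spectral')
    plt.show()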
@@ -113,7 +115,7 @@ pruned out. Thus ``min_cluster_size`` does behave more closely to our
intuitions, but only if we fix ``min_samples``. If you wish to explore
different ``min_cluster_size`` settings with a fixed ``min_samples``
value, especially for larger dataset sizes, you can cache the hard
-computation, and recompute onlythe relatively cheap flat cluster
+computation, and recompute only the relatively cheap flat cluster
extraction using the ``memory`` parameter, which makes use of ``joblib``
[link].

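A minimal sketch of that caching pattern, assuming ``data`` from the earlier
snippets; the cache directory name and the parameter values here are
illustrative choices, not the ones used in the surrounding examples.

.. code:: python

    import hdbscan

    # Passing a directory path as ``memory`` makes joblib cache the expensive
    # tree computation there; re-fitting with a different ``min_cluster_size``
    # then only redoes the cheap flat-cluster extraction.
    clusterer = hdbscan.HDBSCAN(min_cluster_size=30, min_samples=10,
                                memory='./hdbscan_cache')
    labels = clusterer.fit_predict(data)
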
@@ -156,7 +158,7 @@ leaving the ``min_cluster_size`` at 60, but reducing ``min_samples`` to

Now most points are clustered, and there are far fewer noise points.
Steadily increasing ``min_samples`` will, as we saw in the examples
-above, make the clustering progressivly more conservative, culiminating
+above, make the clustering progressively more conservative, culminating
in the example above where ``min_samples`` was set to 60 and we had only
two clusters with most points declared as noise.

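A sketch of that final configuration, with an illustrative ``min_samples``
value chosen here only for demonstration, and again assuming ``data`` and
``projection`` from the earlier snippets.

.. code:: python

    import matplotlib.pyplot as plt
    import hdbscan

    # Keep the cluster-size threshold, but loosen the density requirement;
    # a smaller ``min_samples`` leaves far fewer points labelled as noise (-1).
    # The value 2 here is illustrative only.
    labels = hdbscan.HDBSCAN(min_cluster_size=60,
                             min_samples=2).fit_predict(data)
    plt.scatter(*projection.T, s=10, c=labels, cmap='Spectral')
    plt.show()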