Getting More Information About a Clustering
===========================================

Once you have the basics of clustering sorted you may want to dig a
little deeper than just the cluster labels returned to you. Fortunately,
the hdbscan library provides you with the facilities to do this. During
processing HDBSCAN\* builds a hierarchy of potential clusters, from
which it extracts the flat clustering returned. It can be informative to
look at that hierarchy, and potentially make use of the extra
information contained therein.

Suppose we have a dataset for clustering. It is a binary file in NumPy
format and it can be found at
https://github.com/lmcinnes/hdbscan/blob/master/notebooks/clusterable_data.npy.
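Loading such a file is a single ``np.load`` call. A minimal sketch, with an in-memory round-trip through the ``.npy`` format standing in for downloading the real ``clusterable_data.npy``:

```python
import io
import numpy as np

# Stand-in for the real clusterable_data.npy: fabricate a small 2D point
# set and round-trip it through NumPy's binary .npy format in memory.
buffer = io.BytesIO()
rng = np.random.RandomState(42)
np.save(buffer, rng.rand(100, 2))
buffer.seek(0)

# np.load reads the same binary format the real file on disk uses.
data = np.load(buffer)
print(data.shape)  # (100, 2)
```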

Fitting a clusterer to this data and plotting the resulting condensed
tree gives a view of the cluster hierarchy, with the width of
each branch representing the number of points in the cluster at that
level. If we wish to know which branches were selected by the HDBSCAN\*
algorithm we can pass ``select_clusters=True``. You can even pass a
selection palette to color the selections according to the cluster
labeling.

.. image:: images/advanced_hdbscan_11_1.png

From this, we can see, for example, that the yellow cluster at the
center of the plot forms early (breaking off from the pale blue and
purple clusters) and persists for a long time. By comparison the green
cluster, which also forms early, quickly breaks apart and then
vanishes altogether (shattering into clusters all smaller than the
chosen ``min_cluster_size``).

Even just this simple visual analysis of the condensed tree can tell you
a lot more about the structure of your data. This is not all we can do
with condensed trees, however. For larger and more complex datasets the
tree itself may be very complex, and it may be desirable to run more
interesting analytics over the tree itself. This can be achieved via
several converter methods: :py:meth:`~hdbscan.plots.CondensedTree.to_networkx`,
:py:meth:`~hdbscan.plots.CondensedTree.to_pandas`, and
:py:meth:`~hdbscan.plots.CondensedTree.to_numpy`.

First we'll consider :py:meth:`~hdbscan.plots.CondensedTree.to_networkx`.

As you can see we get a NetworkX directed graph, to which we can then
apply all the regular NetworkX tools and analytics. The graph is richer
than the visual plot above may lead you to believe, however.

The graph actually contains nodes for all the points falling out of
clusters as well as the clusters themselves. Each node has an associated
``size`` attribute and each edge has a ``weight`` of the lambda value
at which that edge forms. This allows for much more interesting
analyses.
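As an illustration of the kind of analytics this enables, here is a hedged sketch on a hand-built toy graph in the same shape — the node ids, sizes, and lambda values are invented, not hdbscan output — with ``size`` on nodes and the lambda value as the edge ``weight``:

```python
import networkx as nx

# Toy stand-in for a condensed tree: a directed graph whose nodes carry
# a 'size' attribute and whose edges are weighted by the lambda value at
# which they form. All ids and values here are made up for illustration.
g = nx.DiGraph()
g.add_node("root", size=6)
g.add_node("cluster_a", size=3)
g.add_node("cluster_b", size=2)
for point in range(3):
    g.add_node(point, size=1)
g.add_edge("root", "cluster_a", weight=0.08)
g.add_edge("root", "cluster_b", weight=0.05)
g.add_edge("root", 0, weight=0.02)
g.add_edge("cluster_a", 1, weight=0.30)
g.add_edge("cluster_a", 2, weight=0.25)

# Ordinary NetworkX analytics now apply, e.g. everything reachable from
# one cluster, or attribute lookups on nodes and edges.
reach = nx.descendants(g, "cluster_a")
print(sorted(reach, key=str))        # [1, 2]
print(g.nodes["cluster_a"]["size"])  # 3
```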

Next, we have the :py:meth:`~hdbscan.plots.CondensedTree.to_pandas` method,
which returns a pandas DataFrame
where each row corresponds to an edge of the NetworkX graph.

The ``parent`` column provides the id of the parent cluster, the ``child``
column the id of the child cluster (or, if the child is a single data point
rather than a cluster, the index in the dataset of that point), the
``lambda_val`` provides the lambda value at which the edge forms, and
the ``child_size`` provides the number of points in the child cluster.
As you can see, the start of the DataFrame has singleton points falling
out of the root cluster, with each ``child_size`` equal to 1.

If you want just the clusters, rather than all the individual points
as well, simply select the rows of the DataFrame with ``child_size``
greater than 1.
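For instance, with a hypothetical hand-built DataFrame in the same ``parent`` / ``child`` / ``lambda_val`` / ``child_size`` layout (the values below are invented for illustration), the selection is one boolean-indexing expression:

```python
import pandas as pd

# Hypothetical condensed-tree edges in the to_pandas layout; the values
# are invented for illustration, not real hdbscan output.
tree_df = pd.DataFrame({
    "parent":     [8, 8, 8, 8, 9],
    "child":      [0, 1, 9, 10, 2],
    "lambda_val": [0.02, 0.03, 0.05, 0.05, 0.40],
    "child_size": [1, 1, 4, 3, 1],
})

# Keep only edges whose child is a genuine cluster (more than one point).
clusters = tree_df[tree_df["child_size"] > 1]
print(clusters["child"].tolist())  # [9, 10]
```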

Finally, the :py:meth:`~hdbscan.plots.CondensedTree.to_numpy` method returns
the same information as a NumPy record array.
This is equivalent to the pandas DataFrame but is in pure NumPy and
hence has no pandas dependencies if you do not wish to use pandas.
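The same selection works on a structured NumPy array with those fields, with no pandas involved. A sketch using the same invented values as above (not real hdbscan output):

```python
import numpy as np

# Hypothetical condensed-tree edges as a NumPy record array; the field
# names mirror the pandas columns, the values are invented.
dtype = [("parent", "<i8"), ("child", "<i8"),
         ("lambda_val", "<f8"), ("child_size", "<i8")]
tree = np.array(
    [(8, 0, 0.02, 1), (8, 9, 0.05, 4), (8, 10, 0.05, 3), (9, 2, 0.40, 1)],
    dtype=dtype,
)

# A boolean mask over one field gives the same selection in pure NumPy.
clusters = tree[tree["child_size"] > 1]
print(clusters["child"])  # [ 9 10]
```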

Single Linkage Trees
--------------------

We have still more data at our disposal, however. As noted in the How
HDBSCAN Works section, prior to providing a condensed tree the algorithm
builds a complete dendrogram. We have access to this too via the
:py:attr:`~hdbscan.HDBSCAN.single_linkage_tree_` attribute of the clusterer.

As you can see we gain a lot from condensing the tree in terms of better
presenting and summarising the data. There is a lot less to be gained
from visual inspection of a plot like this (and it only gets worse for
larger datasets). The plot function supports most of the same
functionality as the dendrogram plotting from
``scipy.cluster.hierarchy``, so you can view various truncations of the
tree if necessary. In practice, however, you are more likely to be
interested in accessing the raw data for further analysis. Again we have
:py:meth:`~hdbscan.plots.SingleLinkageTree.to_networkx`,
:py:meth:`~hdbscan.plots.SingleLinkageTree.to_pandas`, and
:py:meth:`~hdbscan.plots.SingleLinkageTree.to_numpy`. This time the
:py:meth:`~hdbscan.plots.SingleLinkageTree.to_networkx` provides a direct
NetworkX version of what you see above. The NumPy and pandas results
conform to the single linkage hierarchy format of
``scipy.cluster.hierarchy``, and can be passed to
routines there if necessary.
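Concretely, that SciPy linkage format is an ``(n - 1, 4)`` array of merges — two child ids, the merge distance, and the size of the merged cluster per row. A hedged sketch on toy data (computed directly with SciPy here, not taken from hdbscan) showing the format such a hierarchy conforms to:

```python
import numpy as np
from scipy.cluster import hierarchy

# Toy 1D data with two obvious groups; build a single linkage hierarchy
# directly with SciPy to illustrate the linkage-matrix format.
points = np.array([[0.0], [0.1], [0.2], [5.0], [5.1]])
Z = hierarchy.linkage(points, method="single")

# Each row of Z is (child1, child2, merge distance, size of new cluster).
print(Z.shape)        # (4, 4): n - 1 merges for n = 5 points
print(int(Z[-1, 3]))  # 5: the final merge contains all points
```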

The single linkage tree can also be cut at a fixed distance to give a flat
clustering, with small clusters treated as
noise points (any cluster smaller than the ``minimum_cluster_size``).

::

    array([ 0, -1, 0, ..., -1, -1, 0])

In this way, it is possible to extract the DBSCAN clustering that would result
for any given epsilon value, all from one run of hdbscan.
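The idea of an epsilon cut can be mimicked with plain SciPy: cutting a single linkage hierarchy at a distance threshold yields the flat clustering a DBSCAN-style epsilon would give (without hdbscan's noise labeling). A sketch on toy data, not the hdbscan API:

```python
import numpy as np
from scipy.cluster import hierarchy

# Two well-separated 1D groups, and a single linkage tree over them.
points = np.array([[0.0], [0.1], [0.2], [5.0], [5.1]])
Z = hierarchy.linkage(points, method="single")

# Cutting the dendrogram at distance 1.0 keeps together everything that
# merged below that height, analogous to choosing a DBSCAN epsilon.
labels = hierarchy.fcluster(Z, t=1.0, criterion="distance")
print(labels)  # two flat clusters, e.g. [1 1 1 2 2]
```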