2 changes: 1 addition & 1 deletion docs/dbscan_from_hdbscan.rst
@@ -104,7 +104,7 @@ be some minor discrepancies between core point results largely due to implementa
details and optimizations with the code base.

Why might one just extract the DBSCAN* clustering results from a single HDBSCAN* run
instead of making use of sklearns DBSSCAN code? The short answer is efficiency.
instead of making use of sklearn's DBSCAN code? The short answer is efficiency.
If you aren't sure what epsilon parameter to select for DBSCAN then you may have to
run the algorithm many times on your data set. While those runs can be inexpensive for
very small epsilon values they can get quite expensive for large parameter values.
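The efficiency point in this hunk is easiest to see in code. Below is a minimal sketch, assuming the ``dbscan_clustering`` method that this documentation page describes is available on a fitted ``HDBSCAN`` object; the data set and parameter values are made up for illustration.

.. code:: python

    # Sketch: fit HDBSCAN* once, then pull out DBSCAN*-style clusterings for
    # several epsilon values without re-running anything from scratch.
    import numpy as np
    import hdbscan
    from sklearn.datasets import make_blobs

    X, _ = make_blobs(n_samples=2000, centers=6, random_state=42)

    clusterer = hdbscan.HDBSCAN(min_cluster_size=15).fit(X)

    # Each extraction below is a cheap cut of the already-built hierarchy,
    # rather than a full sklearn DBSCAN run per epsilon value.
    for eps in (0.25, 0.5, 1.0):
        labels = clusterer.dbscan_clustering(cut_distance=eps, min_cluster_size=15)
        print(eps, len(np.unique(labels[labels >= 0])), "clusters")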
6 changes: 3 additions & 3 deletions docs/how_hdbscan_works.rst
@@ -100,7 +100,7 @@ algorithm to be robust against noise so we need to find a way to help
How can we characterize 'sea' and 'land' without doing a clustering? As
long as we can get an estimate of density we can consider lower density
points as the 'sea'. The goal here is not to perfectly distinguish 'sea'
from 'land' -- this is an initial step in clustering, not the ouput --
from 'land' -- this is an initial step in clustering, not the output --
just to make our clustering core a little more robust to noise. So given
an identification of 'sea' we want to lower the sea level. For practical
purposes that means making 'sea' points more distant from each other and
@@ -172,7 +172,7 @@ blue and green as larger -- equal to the radius of the green circle

.. image:: images/distance4a.svg

On the other hand the mutual reachablity distance from red to green is
On the other hand the mutual reachability distance from red to green is
simply distance from red to green since that distance is greater than
either core distance (i.e. the distance arrow passes through both
circles).
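The mutual reachability distance discussed in this hunk has a compact definition: for points :math:`a` and :math:`b` with core distances :math:`\mathrm{core}_k(a)` and :math:`\mathrm{core}_k(b)`, it is :math:`d_{\mathrm{mreach}-k}(a,b) = \max\{\mathrm{core}_k(a), \mathrm{core}_k(b), d(a,b)\}`. A small numpy sketch of that formula follows; the helper name is made up for this example, not taken from the hdbscan code base.

.. code:: python

    # Illustrative sketch of the mutual reachability distance matrix.
    import numpy as np
    from sklearn.metrics import pairwise_distances

    def mutual_reachability(X, k=5):
        dist = pairwise_distances(X)
        # core distance = distance to the k-th nearest neighbour
        # (column 0 of each sorted row is the zero self-distance)
        core = np.sort(dist, axis=1)[:, k]
        # d_mreach(a, b) = max(core(a), core(b), d(a, b))
        return np.maximum(np.maximum(core[:, None], core[None, :]), dist)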
@@ -257,7 +257,7 @@ data structure. We can view the result as a dendrogram as we see below:

This brings us to the point where robust single linkage stops. We want
more though; a cluster hierarchy is good, but we really want a set of
flat clusters. We could do that by drawing a a horizontal line through
flat clusters. We could do that by drawing a horizontal line through
the above diagram and selecting the clusters that it cuts through. This
is in practice what
`DBSCAN <http://scikit-learn.org/stable/modules/clustering.html#dbscan>`__
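As a concrete illustration of the horizontal-cut idea in the hunk above, a single linkage hierarchy can be cut at a fixed height with scipy. This sketch only shows the flat-cut step that DBSCAN effectively performs, not the cluster selection that HDBSCAN* goes on to do; the data and cut height are arbitrary.

.. code:: python

    # Sketch: build a single linkage dendrogram and cut it at a fixed height
    # to obtain flat clusters -- the horizontal line through the diagram.
    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster

    X = np.random.RandomState(0).normal(size=(100, 2))
    Z = linkage(X, method='single')                     # single linkage hierarchy
    labels = fcluster(Z, t=0.5, criterion='distance')   # cut at height 0.5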
2 changes: 1 addition & 1 deletion docs/how_to_use_epsilon.rst
@@ -58,7 +58,7 @@ same time avoid the abundance of micro-clusters in the original HDBSCAN\* cluste
Note that for the given parameter setting, running HDBSCAN\* based on ``cluster_selection_method = 'eom'`` or ``cluster_selection_method = 'leaf'`` does not make
any difference: the ``cluster_selection_epsilon`` threshold neutralizes the effect of HDBSCAN(eom)'s stability calculations.
When using a lower threshold, some minor differences can be noticed. For example, an epsilon value of 3 meters with ``'eom'`` produces the same results as
a the 5 meter value on the given data set, but 3 meters in combination with ``'leaf'`` achieves a slightly different result:
a 5 meter value on the given data set, but 3 meters in combination with ``'leaf'`` achieves a slightly different result:

.. image:: images/epsilon_parameter_hdbscan_e3_leaf.png
:align: center
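For reference, the comparison described in this hunk corresponds to calls along these lines; the data here is a synthetic stand-in (the documentation page uses GPS traces with distances in meters) and ``min_cluster_size=5`` is an assumed value.

.. code:: python

    # Sketch of the epsilon / selection-method comparison discussed above.
    import hdbscan
    from sklearn.datasets import make_blobs

    X, _ = make_blobs(n_samples=500, centers=8, cluster_std=1.5, random_state=0)

    eom = hdbscan.HDBSCAN(min_cluster_size=5, cluster_selection_epsilon=3.0,
                          cluster_selection_method='eom').fit_predict(X)
    leaf = hdbscan.HDBSCAN(min_cluster_size=5, cluster_selection_epsilon=3.0,
                           cluster_selection_method='leaf').fit_predict(X)
    # With a high enough epsilon the two methods coincide; at lower values
    # 'leaf' can differ slightly from 'eom', as the figures show.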
6 changes: 3 additions & 3 deletions docs/parameter_selection.rst
@@ -13,7 +13,7 @@ choosing them effectively.
Selecting ``min_cluster_size``
------------------------------

The primary parameter to effect the resulting clustering is
The primary parameter to affect the resulting clustering is
``min_cluster_size``. Ideally this is a relatively intuitive parameter
to select -- set it to the smallest size grouping that you wish to
consider a cluster. It can have slightly non-obvious effects however.
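A quick way to get a feel for the effect described in this hunk is to sweep ``min_cluster_size`` and watch how many clusters survive; the data and parameter values below are illustrative only.

.. code:: python

    # Sketch: larger min_cluster_size values merge or discard small groupings.
    import hdbscan
    from sklearn.datasets import make_blobs

    X, _ = make_blobs(n_samples=1000, centers=10, cluster_std=2.0, random_state=3)

    for mcs in (5, 25, 100):
        labels = hdbscan.HDBSCAN(min_cluster_size=mcs).fit_predict(X)
        n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
        print(f"min_cluster_size={mcs}: {n_clusters} clusters")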
@@ -188,7 +188,7 @@ has on the resulting clustering.
Selecting ``alpha``
-----------------

A further parameter that effects the resulting clustering is ``alpha``.
A further parameter that affects the resulting clustering is ``alpha``.
In practice it is best not to mess with this parameter -- ultimately it
is part of the ``RobustSingleLinkage`` code, but flows naturally into
HDBSCAN\*. If, for some reason, ``min_samples`` or ``cluster_selection_epsilon`` is not providing you
@@ -225,7 +225,7 @@ Leaf clustering
HDBSCAN supports an extra parameter ``cluster_selection_method`` to determine
how it selects flat clusters from the cluster tree hierarchy. The default
method is ``'eom'`` for Excess of Mass, the algorithm described in
:doc:`how_hdbscan_works`. This is not always the most desireable approach to
:doc:`how_hdbscan_works`. This is not always the most desirable approach to
cluster selection. If you are more interested in having small homogeneous
clusters then you may find Excess of Mass has a tendency to pick one or two
large clusters and then a number of small extra clusters. In this situation
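The trade-off sketched in this hunk between ``'eom'`` and ``'leaf'`` selection can be seen by comparing cluster counts and sizes on the same data; again the data set and parameters are illustrative.

.. code:: python

    # Sketch: 'leaf' selection tends to return more, smaller clusters than 'eom'.
    import collections
    import hdbscan
    from sklearn.datasets import make_blobs

    X, _ = make_blobs(n_samples=1500, centers=12, cluster_std=1.8, random_state=7)

    for method in ('eom', 'leaf'):
        labels = hdbscan.HDBSCAN(min_cluster_size=10,
                                 cluster_selection_method=method).fit_predict(X)
        sizes = collections.Counter(l for l in labels if l != -1)
        print(method, len(sizes), "clusters; largest:",
              sorted(sizes.values(), reverse=True)[:3])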
20 changes: 10 additions & 10 deletions docs/performance_and_scalability.rst
@@ -8,7 +8,7 @@ the implementation as the underlying algorithm. Obviously a well written
implementation in C or C++ will beat a naive implementation on pure
Python, but there is more to it than just that. The internals and data
structures used can have a large impact on performance, and can even
significanty change asymptotic performance. All of this means that,
significantly change asymptotic performance. All of this means that,
given some amount of data that you want to cluster your options as to
algorithm and implementation maybe significantly constrained. I'm both
lazy, and prefer empirical results for this sort of thing, so rather
@@ -139,7 +139,7 @@ datapoints.
dataset_sizes = np.hstack([np.arange(1, 6) * 500, np.arange(3,7) * 1000, np.arange(4,17) * 2000])

Now it is just a matter of running all the clustering algorithms via our
benchmark function to collect up all the requsite data. This could be
benchmark function to collect up all the requisite data. This could be
prettier, rolled up into functions appropriately, but sometimes brute
force is good enough. More importantly (for me) since this can take a
significant amount of compute time, I wanted to be able to comment out
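The benchmark function referred to in this hunk is not shown here; a minimal sketch of that kind of timing harness might look like the following. The name ``benchmark_algorithm`` and its signature are assumptions for illustration, not necessarily the notebook's actual code.

.. code:: python

    # Hypothetical timing harness: run a clustering callable over increasing
    # dataset sizes and record wall-clock times for later curve fitting.
    import time
    import numpy as np
    import pandas as pd

    def benchmark_algorithm(dataset_sizes, cluster_function, n_features=10, n_repeats=2):
        results = []
        for size in dataset_sizes:
            for _ in range(n_repeats):
                data = np.random.random_sample((size, n_features))
                start = time.time()
                cluster_function(data)
                results.append((size, time.time() - start))
        return pd.DataFrame(results, columns=('x', 'y'))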
@@ -342,7 +342,7 @@ before.


Clearly something has gone woefully wrong with the curve fitting for the
scipy single linkage implementation, but what exactly? If we look at the
Scipy single linkage implementation, but what exactly? If we look at the
raw data we can see.

.. code:: python
@@ -448,7 +448,7 @@ array in RAM then clearly we are going to spend time paging out the
distance array to disk and back and hence we will see the runtimes
increase dramatically as we become disk IO bound. If we just leave off
the last element we can get a better idea of the curve, but keep in mind
that the scipy single linkage implementation does not scale past a limit
that the Scipy single linkage implementation does not scale past a limit
set by your available RAM.

.. code:: python
@@ -491,12 +491,12 @@ set by your available RAM.
.. image:: images/performance_and_scalability_20_2.png


If we're looking for scaling we can write off the scipy single linkage
If we're looking for scaling we can write off the Scipy single linkage
implementation -- if even we didn't hit the RAM limit the :math:`O(n^2)`
scaling is going to quickly catch up with us. Fastcluster has the same
asymptotic scaling, but is heavily optimized to being the constant down
much lower -- at this point it is still keeping close to the faster
algorithms. It's asymtotics will still catch up with it eventually
algorithms. Its asymptotics will still catch up with it eventually
however.

In practice this is going to mean that for larger datasets you are going
Expand All @@ -505,7 +505,7 @@ enough datapoints only K-Means, DBSCAN, and HDBSCAN will be left. This
is somewhat disappointing, paritcularly as `K-Means is not a
particularly good clustering
algorithm <http://nbviewer.jupyter.org/github/scikit-learn-contrib/hdbscan/blob/master/notebooks/Comparing%20Clustering%20Algorithms.ipynb>`__,
paricularly for exploratory data analysis.
particularly for exploratory data analysis.

With this in mind it is worth looking at how these last several
implementations perform at much larger sizes, to see, for example, when
@@ -585,7 +585,7 @@ DBSCAN, while having sub-\ :math:`O(n^2)` complexity, can't achieve
:math:`O(n \log(n))` at this dataset dimension, and start to curve
upward precipitously. Finally it demonstrates again how much of a
difference implementation can make: the sklearn implementation of
K-Means is far better than the scipy implementation. Since HDBSCAN
K-Means is far better than the Scipy implementation. Since HDBSCAN
clustering is a lot better than K-Means (unless you have good reasons to
assume that the clusters partition your data and are all drawn from
Gaussian distributions) and the scaling is still pretty good I would
@@ -600,7 +600,7 @@ thing to know in practice is, given a dataset, what can I run
interactively? What can I run while I go and grab some coffee? How about
a run over lunch? What if I'm willing to wait until I get in tomorrow
morning? Each of these represent significant breaks in productivity --
once you aren't working interactively anymore your productivity drops
once you aren't working interactively any more your productivity drops
measurably, and so on.

We can build a table for this. To start we'll need to be able to
@@ -641,7 +641,7 @@ Now we run that for each of our pre-existing datasets to extrapolate out
predicted performance on the relevant dataset sizes. A little pandas
wrangling later and we've produced a table of roughly how large a
dataset you can tackle in each time frame with each implementation. I
had to leave out the scipy KMeans timings because the noise in timing
had to leave out the Scipy KMeans timings because the noise in timing
results caused the model to be unrealistic at larger data sizes. Note
how the :math:`O(n\log n)` algorithms utterly dominate here. In the
meantime, for medium sizes data sets you can still get quite a lot done
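The extrapolation described in this hunk amounts to fitting a simple growth model to the observed timings and inverting it for a given time budget. A rough, purely illustrative sketch (not the notebook's code, and with made-up timings) is:

.. code:: python

    # Fit log(time) as a linear function of log(n), then invert for a budget.
    import numpy as np

    def max_size_for_budget(sizes, times, budget_seconds):
        slope, intercept = np.polyfit(np.log(sizes), np.log(times), 1)
        # log(t) = slope * log(n) + intercept  =>  n = exp((log(t) - intercept) / slope)
        return np.exp((np.log(budget_seconds) - intercept) / slope)

    sizes = np.array([1000, 2000, 4000, 8000, 16000])
    times = np.array([0.1, 0.25, 0.7, 1.9, 5.2])          # made-up example timings
    print(max_size_for_budget(sizes, times, 60.0))        # roughly an interactive run
    print(max_size_for_budget(sizes, times, 8 * 3600.0))  # roughly an overnight run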
6 changes: 3 additions & 3 deletions docs/soft_clustering_explanation.rst
@@ -131,7 +131,7 @@ point for each cluster to measure distance to. This is tricky since our
clusters may have off shapes. In practice there isn't really any single
clear exemplar for a cluster. The right solution, then, is to have a set
of exemplar points for each cluster? How do we determine which points
those should be? They should be the points that persist in the the
those should be? They should be the points that persist in the
cluster (and it's children in the HDBSCAN condensed tree) for the
longest range of lambda values -- such points represent the "heart" of
the cluster around which the ultimate cluster forms.
@@ -188,7 +188,7 @@ clusters having several subclusters stretched along their length.
Now to compute a cluster membership score for a point we need to simply
compute the distance to each of the cluster exemplar sets and scale
membership scores accordingly. In practice we work with the inverse
distance (just as HDBCSAN handles things with lambda values in the
distance (just as HDBSCAN handles things with lambda values in the
tree). Whether we do a softmax or simply normalize by dividing by the
sum is "to be determined" as there isn't necessarily a clear answer.
We'll leave it as an option in the code.
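A toy version of the distance-based membership computation described in this hunk is sketched below; extracting the exemplars from the condensed tree is omitted, and the function name is made up for illustration.

.. code:: python

    # Illustrative sketch: membership scores from inverse distance to each
    # cluster's exemplar set, normalized to sum to one.  `exemplars` is a list
    # of arrays of exemplar points, one array per cluster.
    import numpy as np
    from sklearn.metrics import pairwise_distances

    def distance_membership_vector(point, exemplars, softmax=False):
        dists = np.array([pairwise_distances([point], ex).min() for ex in exemplars])
        scores = 1.0 / np.maximum(dists, 1e-8)          # inverse distance
        if softmax:
            scores = np.exp(scores - scores.max())
        return scores / scores.sum()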
@@ -242,7 +242,7 @@ the red and green clusters, and the purple and blue clusters in a way
that is not really ideal. This is because we are using pure distance
(rather than any sort of cluster/manifold/density aware distance) and
latching on to whatever is closest. What we need is an approach the
understands the cluster structure better -- something based off the the
understands the cluster structure better -- something based off the
actual structure (and lambda values therein) of the condensed tree.
This is exactly the sort of approach something based on outlier scores
can provide.