diff --git a/docs/dbscan_from_hdbscan.rst b/docs/dbscan_from_hdbscan.rst
index 24656926..c02cf913 100644
--- a/docs/dbscan_from_hdbscan.rst
+++ b/docs/dbscan_from_hdbscan.rst
@@ -104,7 +104,7 @@ be some minor discrepancies between core point results largely due to implementa
 details and optimizations with the code base.

 Why might one just extract the DBSCAN* clustering results from a single HDBSCAN* run
-instead of making use of sklearns DBSSCAN code? The short answer is efficiency.
+instead of making use of sklearn's DBSCAN code? The short answer is efficiency.
 If you aren't sure what epsilon parameter to select for DBSCAN then you may have to
 run the algorithm many times on your data set. While those runs can be inexpensive for
 very small epsilon values they can get quite expensive for large parameter values.
diff --git a/docs/how_hdbscan_works.rst b/docs/how_hdbscan_works.rst
index a6049a7f..4e3a5eab 100644
--- a/docs/how_hdbscan_works.rst
+++ b/docs/how_hdbscan_works.rst
@@ -100,7 +100,7 @@ algorithm to be robust against noise so we need to find a way to help
 How can we characterize 'sea' and 'land' without doing a clustering?
 As long as we can get an estimate of density we can consider lower density
 points as the 'sea'. The goal here is not to perfectly distinguish 'sea'
-from 'land' -- this is an initial step in clustering, not the ouput --
+from 'land' -- this is an initial step in clustering, not the output --
 just to make our clustering core a little more robust to noise. So
 given an identification of 'sea' we want to lower the sea level. For
 practical purposes that means making 'sea' points more distant from each other and
@@ -172,7 +172,7 @@ blue and green as larger -- equal to the radius of the green circle

 .. image:: images/distance4a.svg

-On the other hand the mutual reachablity distance from red to green is
+On the other hand the mutual reachability distance from red to green is
 simply distance from red to green since that distance is greater than
 either core distance (i.e. the distance arrow passes through both
 circles).
@@ -257,7 +257,7 @@ data structure. We can view the result as a dendrogram as we see below:

 This brings us to the point where robust single linkage stops. We want
 more though; a cluster hierarchy is good, but we really want a set of
-flat clusters. We could do that by drawing a a horizontal line through
+flat clusters. We could do that by drawing a horizontal line through
 the above diagram and selecting the clusters that it cuts through. This
 is in practice what
 `DBSCAN `__
diff --git a/docs/how_to_use_epsilon.rst b/docs/how_to_use_epsilon.rst
index 0f7e2aec..856faa0f 100644
--- a/docs/how_to_use_epsilon.rst
+++ b/docs/how_to_use_epsilon.rst
@@ -58,7 +58,7 @@ same time avoid the abundance of micro-clusters in the original HDBSCAN\* cluste
 Note that for the given parameter setting, running HDBSCAN\* based on ``cluster_selection_method = 'eom'`` or ``cluster_selection_method = 'leaf'`` does not make any difference:
 the ``cluster_selection_epsilon`` threshold neutralizes the effect of HDBSCAN(eom)'s stability calculations. When using a lower threshold, some minor differences can be noticed.
 For example, an epsilon value of 3 meters with ``'eom'`` produces the same results as
-a the 5 meter value on the given data set, but 3 meters in combination with ``'leaf'`` achieves a slightly different result:
+a 5 meter value on the given data set, but 3 meters in combination with ``'leaf'`` achieves a slightly different result:

 .. image:: images/epsilon_parameter_hdbscan_e3_leaf.png
    :align: center
diff --git a/docs/parameter_selection.rst b/docs/parameter_selection.rst
index 0f459df3..f18d6200 100644
--- a/docs/parameter_selection.rst
+++ b/docs/parameter_selection.rst
@@ -13,7 +13,7 @@ choosing them effectively.
 Selecting ``min_cluster_size``
 ------------------------------

-The primary parameter to effect the resulting clustering is
+The primary parameter to affect the resulting clustering is
 ``min_cluster_size``. Ideally this is a relatively intuitive parameter
 to select -- set it to the smallest size grouping that you wish to
 consider a cluster. It can have slightly non-obvious effects however.
@@ -188,7 +188,7 @@ has on the resulting clustering.
 Selecting ``alpha``
 -----------------

-A further parameter that effects the resulting clustering is ``alpha``.
+A further parameter that affects the resulting clustering is ``alpha``.
 In practice it is best not to mess with this parameter -- ultimately it
 is part of the ``RobustSingleLinkage`` code, but flows naturally into
 HDBSCAN\*. If, for some reason, ``min_samples`` or ``cluster_selection_epsilon`` is not providing you
@@ -225,7 +225,7 @@ Leaf clustering
 HDBSCAN supports an extra parameter ``cluster_selection_method`` to determine
 how it selects flat clusters from the cluster tree hierarchy. The default
 method is ``'eom'`` for Excess of Mass, the algorithm described in
-:doc:`how_hdbscan_works`. This is not always the most desireable approach to
+:doc:`how_hdbscan_works`. This is not always the most desirable approach to
 cluster selection. If you are more interested in having small homogeneous
 clusters then you may find Excess of Mass has a tendency to pick one or two
 large clusters and then a number of small extra clusters. In this situation
diff --git a/docs/performance_and_scalability.rst b/docs/performance_and_scalability.rst
index 243eb1d3..bfaf4568 100644
--- a/docs/performance_and_scalability.rst
+++ b/docs/performance_and_scalability.rst
@@ -8,7 +8,7 @@ the implementation as the underlying algorithm. Obviously a well written
 implementation in C or C++ will beat a naive implementation on pure
 Python, but there is more to it than just that. The internals and data
 structures used can have a large impact on performance, and can even
-significanty change asymptotic performance. All of this means that,
+significantly change asymptotic performance. All of this means that,
 given some amount of data that you want to cluster your options as to
 algorithm and implementation maybe significantly constrained. I'm both
 lazy, and prefer empirical results for this sort of thing, so rather
@@ -139,7 +139,7 @@ datapoints.
     dataset_sizes = np.hstack([np.arange(1, 6) * 500, np.arange(3,7) * 1000, np.arange(4,17) * 2000])

 Now it is just a matter of running all the clustering algorithms via our
-benchmark function to collect up all the requsite data. This could be
+benchmark function to collect up all the requisite data. This could be
 prettier, rolled up into functions appropriately, but sometimes brute
 force is good enough. More importantly (for me) since this can take a
 significant amount of compute time, I wanted to be able to comment out
@@ -342,7 +342,7 @@ before.


 Clearly something has gone woefully wrong with the curve fitting for the
-scipy single linkage implementation, but what exactly? If we look at the
+Scipy single linkage implementation, but what exactly? If we look at the
 raw data we can see.

 .. code:: python
@@ -448,7 +448,7 @@ array in RAM then clearly we are going to spend time paging out the
 distance array to disk and back and hence we will see the runtimes
 increase dramatically as we become disk IO bound. If we just leave off
 the last element we can get a better idea of the curve, but keep in mind
-that the scipy single linkage implementation does not scale past a limit
+that the Scipy single linkage implementation does not scale past a limit
 set by your available RAM.

 .. code:: python
@@ -491,12 +491,12 @@ set by your available RAM.

 .. image:: images/performance_and_scalability_20_2.png

-If we're looking for scaling we can write off the scipy single linkage
+If we're looking for scaling we can write off the Scipy single linkage
 implementation -- if even we didn't hit the RAM limit the :math:`O(n^2)`
 scaling is going to quickly catch up with us. Fastcluster has the same
 asymptotic scaling, but is heavily optimized to being the constant down
 much lower -- at this point it is still keeping close to the faster
-algorithms. It's asymtotics will still catch up with it eventually
+algorithms. Its asymptotics will still catch up with it eventually
 however.

 In practice this is going to mean that for larger datasets you are going
@@ -505,7 +505,7 @@ enough datapoints only K-Means, DBSCAN, and HDBSCAN will be left. This
 is somewhat disappointing, paritcularly as
 `K-Means is not a particularly good clustering
 algorithm `__,
-paricularly for exploratory data analysis.
+particularly for exploratory data analysis.

 With this in mind it is worth looking at how these last several
 implementations perform at much larger sizes, to see, for example, when
@@ -585,7 +585,7 @@ DBSCAN, while having sub-\ :math:`O(n^2)` complexity, can't achieve
 :math:`O(n \log(n))` at this dataset dimension, and start to curve
 upward precipitously. Finally it demonstrates again how much of a
 difference implementation can make: the sklearn implementation of
-K-Means is far better than the scipy implementation. Since HDBSCAN
+K-Means is far better than the Scipy implementation. Since HDBSCAN
 clustering is a lot better than K-Means (unless you have good reasons
 to assume that the clusters partition your data and are all drawn from
 Gaussian distributions) and the scaling is still pretty good I would
@@ -600,7 +600,7 @@ thing to know in practice is, given a dataset, what can I run
 interactively? What can I run while I go and grab some coffee? How about
 a run over lunch? What if I'm willing to wait until I get in tomorrow
 morning? Each of these represent significant breaks in productivity --
-once you aren't working interactively anymore your productivity drops
+once you aren't working interactively any more your productivity drops
 measurably, and so on.

 We can build a table for this. To start we'll need to be able to
@@ -641,7 +641,7 @@ Now we run that for each of our pre-existing datasets to extrapolate
 out predicted performance on the relevant dataset sizes. A little
 pandas wrangling later and we've produced a table of roughly how large
 a dataset you can tackle in each time frame with each implementation. I
-had to leave out the scipy KMeans timings because the noise in timing
+had to leave out the Scipy KMeans timings because the noise in timing
 results caused the model to be unrealistic at larger data sizes. Note
 how the :math:`O(n\log n)` algorithms utterly dominate here.
 In the meantime, for medium sizes data sets you can still get quite a lot done
diff --git a/docs/soft_clustering_explanation.rst b/docs/soft_clustering_explanation.rst
index 7ae20903..b7ca0f1a 100644
--- a/docs/soft_clustering_explanation.rst
+++ b/docs/soft_clustering_explanation.rst
@@ -131,7 +131,7 @@ point for each cluster to measure distance to. This is tricky since our
 clusters may have off shapes. In practice there isn't really any single
 clear exemplar for a cluster. The right solution, then, is to have a set
 of exemplar points for each cluster? How do we determine which points
-those should be? They should be the points that persist in the the
+those should be? They should be the points that persist in the
 cluster (and it's children in the HDBSCAN condensed tree) for the
 longest range of lambda values -- such points represent the "heart" of
 the cluster around which the ultimate cluster forms.
@@ -188,7 +188,7 @@ clusters having several subclusters stretched along their length.
 Now to compute a cluster membership score for a point we need to simply
 compute the distance to each of the cluster exemplar sets and scale
 membership scores accordingly. In practice we work with the inverse
-distance (just as HDBCSAN handles things with lambda values in the
+distance (just as HDBSCAN handles things with lambda values in the
 tree). Whether we do a softmax or simply normalize by dividing by the
 sum is "to be determined" as there isn't necessarily a clear answer.
 We'll leave it as an option in the code.
@@ -242,7 +242,7 @@ the red and green clusters, and the purple and blue clusters in a way
 that is not really ideal. This is because we are using pure distance
 (rather than any sort of cluster/manifold/density aware distance) and
 latching on to whatever is closest. What we need is an approach the
-understands the cluster structure better -- something based off the the
+understands the cluster structure better -- something based off the
 actual structure (and lambda values therein) of the condensed tree. This
 is exactly the sort of approach something based on outlier scores can
 provide.
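A quick aside on the soft-clustering hunks above: the membership rule they describe (distance to each cluster's exemplar set, inverted, then softmaxed or simply normalized) is easy to prototype outside the library. The sketch below is illustrative only and is not part of the patch or of the hdbscan API -- ``membership_from_exemplars`` and the toy exemplar arrays are made-up names for this example, and the real implementation derives exemplar points from the condensed tree rather than taking them as an argument.

.. code:: python

    # Minimal sketch of the "inverse distance, then normalize" membership idea.
    import numpy as np

    def membership_from_exemplars(points, exemplar_sets, softmax=False):
        """points: (n, d) array; exemplar_sets: one (m_i, d) array per cluster."""
        # Distance from every point to the nearest exemplar of each cluster.
        dists = np.stack([
            np.sqrt(((points[:, None, :] - ex[None, :, :]) ** 2).sum(-1)).min(axis=1)
            for ex in exemplar_sets
        ], axis=1)
        inv = 1.0 / np.maximum(dists, 1e-12)  # work with inverse distance
        # Either a softmax or a plain normalization, as the docs leave open.
        weights = np.exp(inv - inv.max(axis=1, keepdims=True)) if softmax else inv
        return weights / weights.sum(axis=1, keepdims=True)

    # Toy usage: two well-separated sets of exemplar points.
    rng = np.random.default_rng(0)
    exemplars = [rng.normal(0.0, 0.1, (5, 2)), rng.normal(3.0, 0.1, (5, 2))]
    query = np.array([[0.1, 0.0], [1.5, 1.5], [2.9, 3.0]])
    print(membership_from_exemplars(query, exemplars).round(3))

Whether to use the softmax or the plain normalization is exactly the "to be determined" choice the patched text mentions; the sketch simply leaves it as a flag.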