diff --git a/docs/dbscan_from_hdbscan.rst b/docs/dbscan_from_hdbscan.rst
index 24656926..c02cf913 100644
--- a/docs/dbscan_from_hdbscan.rst
+++ b/docs/dbscan_from_hdbscan.rst
@@ -104,7 +104,7 @@ be some minor discrepancies between core point results largely due to implementa
 details and optimizations with the code base.

 Why might one just extract the DBSCAN* clustering results from a single HDBSCAN* run
-instead of making use of sklearns DBSSCAN code? The short answer is efficiency.
+instead of making use of sklearn's DBSCAN code? The short answer is efficiency.
 If you aren't sure what epsilon parameter to select for DBSCAN then you may have to
 run the algorithm many times on your data set. While those runs can be inexpensive for
 very small epsilon values they can get quite expensive for large parameter values.
diff --git a/docs/how_hdbscan_works.rst b/docs/how_hdbscan_works.rst
index a6049a7f..4e3a5eab 100644
--- a/docs/how_hdbscan_works.rst
+++ b/docs/how_hdbscan_works.rst
@@ -100,7 +100,7 @@ algorithm to be robust against noise so we need to find a way to help
 How can we characterize 'sea' and 'land' without doing a clustering?
 As long as we can get an estimate of density we can consider lower density
 points as the 'sea'. The goal here is not to perfectly distinguish 'sea'
-from 'land' -- this is an initial step in clustering, not the ouput --
+from 'land' -- this is an initial step in clustering, not the output --
 just to make our clustering core a little more robust to noise. So
 given an identification of 'sea' we want to lower the sea level. For
 practical purposes that means making 'sea' points more distant from each other and
@@ -172,7 +172,7 @@ blue and green as larger -- equal to the radius of the green circle

 .. image:: images/distance4a.svg

-On the other hand the mutual reachablity distance from red to green is
+On the other hand the mutual reachability distance from red to green is
 simply distance from red to green since that distance is greater than
 either core distance (i.e. the distance arrow passes through both
 circles).
@@ -257,7 +257,7 @@ data structure. We can view the result as a dendrogram as we see below:

 This brings us to the point where robust single linkage stops. We want
 more though; a cluster hierarchy is good, but we really want a set of
-flat clusters. We could do that by drawing a a horizontal line through
+flat clusters. We could do that by drawing a horizontal line through
 the above diagram and selecting the clusters that it cuts through. This
 is in practice what
 `DBSCAN `__
diff --git a/docs/how_to_use_epsilon.rst b/docs/how_to_use_epsilon.rst
index 0f7e2aec..856faa0f 100644
--- a/docs/how_to_use_epsilon.rst
+++ b/docs/how_to_use_epsilon.rst
@@ -58,7 +58,7 @@ same time avoid the abundance of micro-clusters in the original HDBSCAN\* cluste
 Note that for the given parameter setting, running HDBSCAN\* based on ``cluster_selection_method = 'eom'`` or ``cluster_selection_method = 'leaf'`` does not make any difference:
 the ``cluster_selection_epsilon`` threshold neutralizes the effect of HDBSCAN(eom)'s stability calculations. When using a lower threshold, some minor differences can be noticed.
 For example, an epsilon value of 3 meters with ``'eom'`` produces the same results as
-a the 5 meter value on the given data set, but 3 meters in combination with ``'leaf'`` achieves a slightly different result:
+a 5 meter value on the given data set, but 3 meters in combination with ``'leaf'`` achieves a slightly different result:

 .. image:: images/epsilon_parameter_hdbscan_e3_leaf.png
    :align: center
diff --git a/docs/parameter_selection.rst b/docs/parameter_selection.rst
index 0f459df3..f18d6200 100644
--- a/docs/parameter_selection.rst
+++ b/docs/parameter_selection.rst
@@ -13,7 +13,7 @@ choosing them effectively.
 Selecting ``min_cluster_size``
 ------------------------------

-The primary parameter to effect the resulting clustering is
+The primary parameter to affect the resulting clustering is
 ``min_cluster_size``. Ideally this is a relatively intuitive parameter
 to select -- set it to the smallest size grouping that you wish to
 consider a cluster. It can have slightly non-obvious effects however.
@@ -188,7 +188,7 @@ has on the resulting clustering.
 Selecting ``alpha``
 -----------------

-A further parameter that effects the resulting clustering is ``alpha``.
+A further parameter that affects the resulting clustering is ``alpha``.
 In practice it is best not to mess with this parameter -- ultimately it
 is part of the ``RobustSingleLinkage`` code, but flows naturally into
 HDBSCAN\*. If, for some reason, ``min_samples`` or ``cluster_selection_epsilon`` is not providing you
@@ -225,7 +225,7 @@ Leaf clustering
 HDBSCAN supports an extra parameter ``cluster_selection_method`` to determine
 how it selects flat clusters from the cluster tree hierarchy. The default
 method is ``'eom'`` for Excess of Mass, the algorithm described in
-:doc:`how_hdbscan_works`. This is not always the most desireable approach to
+:doc:`how_hdbscan_works`. This is not always the most desirable approach to
 cluster selection. If you are more interested in having small homogeneous
 clusters then you may find Excess of Mass has a tendency to pick one or two
 large clusters and then a number of small extra clusters. In this situation
diff --git a/docs/performance_and_scalability.rst b/docs/performance_and_scalability.rst
index 243eb1d3..bfaf4568 100644
--- a/docs/performance_and_scalability.rst
+++ b/docs/performance_and_scalability.rst
@@ -8,7 +8,7 @@ the implementation as the underlying algorithm. Obviously a well written
 implementation in C or C++ will beat a naive implementation on pure
 Python, but there is more to it than just that. The internals and data
 structures used can have a large impact on performance, and can even
-significanty change asymptotic performance. All of this means that,
+significantly change asymptotic performance. All of this means that,
 given some amount of data that you want to cluster your options as to
 algorithm and implementation maybe significantly constrained. I'm both
 lazy, and prefer empirical results for this sort of thing, so rather
@@ -139,7 +139,7 @@ datapoints.
     dataset_sizes = np.hstack([np.arange(1, 6) * 500, np.arange(3,7) * 1000, np.arange(4,17) * 2000])

 Now it is just a matter of running all the clustering algorithms via our
-benchmark function to collect up all the requsite data. This could be
+benchmark function to collect up all the requisite data. This could be
 prettier, rolled up into functions appropriately, but sometimes brute
 force is good enough. More importantly (for me) since this can take a
 significant amount of compute time, I wanted to be able to comment out
@@ -342,7 +342,7 @@ before.


 Clearly something has gone woefully wrong with the curve fitting for the
-scipy single linkage implementation, but what exactly? If we look at the
+Scipy single linkage implementation, but what exactly? If we look at the
 raw data we can see.

 .. code:: python
@@ -448,7 +448,7 @@ array in RAM then clearly we are going to spend time paging out the
 distance array to disk and back and hence we will see the runtimes
 increase dramatically as we become disk IO bound. If we just leave off
 the last element we can get a better idea of the curve, but keep in mind
-that the scipy single linkage implementation does not scale past a limit
+that the Scipy single linkage implementation does not scale past a limit
 set by your available RAM.

 .. code:: python
@@ -491,12 +491,12 @@ set by your available RAM.

 .. image:: images/performance_and_scalability_20_2.png

-If we're looking for scaling we can write off the scipy single linkage
+If we're looking for scaling we can write off the Scipy single linkage
 implementation -- if even we didn't hit the RAM limit the :math:`O(n^2)`
 scaling is going to quickly catch up with us. Fastcluster has the same
 asymptotic scaling, but is heavily optimized to being the constant down
 much lower -- at this point it is still keeping close to the faster
-algorithms. It's asymtotics will still catch up with it eventually
+algorithms. Its asymptotics will still catch up with it eventually
 however.

 In practice this is going to mean that for larger datasets you are going
@@ -505,7 +505,7 @@ enough datapoints only K-Means, DBSCAN, and HDBSCAN will be left. This
 is somewhat disappointing, paritcularly as
 `K-Means is not a particularly good clustering
 algorithm `__,
-paricularly for exploratory data analysis.
+particularly for exploratory data analysis.

 With this in mind it is worth looking at how these last several
 implementations perform at much larger sizes, to see, for example, when
@@ -585,7 +585,7 @@ DBSCAN, while having sub-\ :math:`O(n^2)` complexity, can't achieve
 :math:`O(n \log(n))` at this dataset dimension, and start to curve
 upward precipitously. Finally it demonstrates again how much of a
 difference implementation can make: the sklearn implementation of
-K-Means is far better than the scipy implementation. Since HDBSCAN
+K-Means is far better than the Scipy implementation. Since HDBSCAN
 clustering is a lot better than K-Means (unless you have good reasons
 to assume that the clusters partition your data and are all drawn from
 Gaussian distributions) and the scaling is still pretty good I would
@@ -600,7 +600,7 @@ thing to know in practice is, given a dataset, what can I run
 interactively? What can I run while I go and grab some coffee? How about
 a run over lunch? What if I'm willing to wait until I get in tomorrow
 morning? Each of these represent significant breaks in productivity --
-once you aren't working interactively anymore your productivity drops
+once you aren't working interactively any more your productivity drops
 measurably, and so on.

 We can build a table for this. To start we'll need to be able to
@@ -641,7 +641,7 @@ Now we run that for each of our pre-existing datasets to extrapolate
 out predicted performance on the relevant dataset sizes. A little
 pandas wrangling later and we've produced a table of roughly how large
 a dataset you can tackle in each time frame with each implementation. I
-had to leave out the scipy KMeans timings because the noise in timing
+had to leave out the Scipy KMeans timings because the noise in timing
 results caused the model to be unrealistic at larger data sizes. Note
 how the :math:`O(n\log n)` algorithms utterly dominate here.
 In the meantime, for medium sizes data sets you can still get quite a lot done
diff --git a/docs/soft_clustering_explanation.rst b/docs/soft_clustering_explanation.rst
index 7ae20903..b7ca0f1a 100644
--- a/docs/soft_clustering_explanation.rst
+++ b/docs/soft_clustering_explanation.rst
@@ -131,7 +131,7 @@ point for each cluster to measure distance to. This is tricky since our
 clusters may have off shapes. In practice there isn't really any single
 clear exemplar for a cluster. The right solution, then, is to have a set
 of exemplar points for each cluster? How do we determine which points
-those should be? They should be the points that persist in the the
+those should be? They should be the points that persist in the
 cluster (and it's children in the HDBSCAN condensed tree) for the
 longest range of lambda values -- such points represent the "heart" of
 the cluster around which the ultimate cluster forms.
@@ -188,7 +188,7 @@ clusters having several subclusters stretched along their length.
 Now to compute a cluster membership score for a point we need to simply
 compute the distance to each of the cluster exemplar sets and scale
 membership scores accordingly. In practice we work with the inverse
-distance (just as HDBCSAN handles things with lambda values in the
+distance (just as HDBSCAN handles things with lambda values in the
 tree). Whether we do a softmax or simply normalize by dividing by the
 sum is "to be determined" as there isn't necessarily a clear answer.
 We'll leave it as an option in the code.
@@ -242,7 +242,7 @@ the red and green clusters, and the purple and blue clusters in a way
 that is not really ideal. This is because we are using pure distance
 (rather than any sort of cluster/manifold/density aware distance) and
 latching on to whatever is closest. What we need is an approach the
-understands the cluster structure better -- something based off the the
+understands the cluster structure better -- something based off the
 actual structure (and lambda values therein) of the condensed tree. This
 is exactly the sort of approach something based on outlier scores can
 provide.
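A quick aside on the soft-clustering hunks above: the membership rule they describe (distance to each cluster's exemplar set, inverted, then softmaxed or simply normalized) is easy to prototype outside the library. The sketch below is illustrative only and is not part of the patch or of the hdbscan API -- ``membership_from_exemplars`` and the toy exemplar arrays are made-up names for this example, and the real implementation derives exemplar points from the condensed tree rather than taking them as an argument.

.. code:: python

    # Minimal sketch of the "inverse distance, then normalize" membership idea.
    import numpy as np

    def membership_from_exemplars(points, exemplar_sets, softmax=False):
        """points: (n, d) array; exemplar_sets: one (m_i, d) array per cluster."""
        # Distance from every point to the nearest exemplar of each cluster.
        dists = np.stack([
            np.sqrt(((points[:, None, :] - ex[None, :, :]) ** 2).sum(-1)).min(axis=1)
            for ex in exemplar_sets
        ], axis=1)
        inv = 1.0 / np.maximum(dists, 1e-12)  # work with inverse distance
        # Either a softmax or a plain normalization, as the docs leave open.
        weights = np.exp(inv - inv.max(axis=1, keepdims=True)) if softmax else inv
        return weights / weights.sum(axis=1, keepdims=True)

    # Toy usage: two well-separated sets of exemplar points.
    rng = np.random.default_rng(0)
    exemplars = [rng.normal(0.0, 0.1, (5, 2)), rng.normal(3.0, 0.1, (5, 2))]
    query = np.array([[0.1, 0.0], [1.5, 1.5], [2.9, 3.0]])
    print(membership_from_exemplars(query, exemplars).round(3))

Whether to use the softmax or the plain normalization is exactly the "to be determined" choice the patched text mentions; the sketch simply leaves it as a flag.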