Chapter 2: Geometry and Nearest Neighbors

This chapter discusses two algorithms: K-Nearest Neighbors and K-Means. Again, if you haven't already cloned the repo or done the Project Setup, please do so now. To look at the finished code, see knn_and_kmeans.clj.

If you don't already have a REPL session open, start it from the project directory:

cd ciml
lein repl

K-Nearest Neighbors

Require priority-map, pprint, and Incanter's get-dataset and euclidean-distance functions:

(require '[clojure.data.priority-map :refer :all]
         '[clojure.pprint :refer [pprint]]
         '[incanter.datasets :refer [get-dataset]]
         '[incanter.stats :refer [euclidean-distance]])

Incanter provides some sample datasets available via the get-dataset function. To learn more about them, view the doc:

(doc get-dataset)

We will be using the :iris dataset. According to the doc:

    :iris -- the Fisher's or Anderson's Iris data set gives the
             measurements in centimeters of the variables sepal
             length and width and petal length and width,
             respectively, for 50 flowers from each of 3 species
             of iris.

Take a look at a sample row in the dataset:

(-> (get-dataset :iris) :rows (nth 10))
; => {:Species "setosa", :Petal.Width 0.2, :Petal.Length 1.5, :Sepal.Width 3.7, :Sepal.Length 5.4}

Further investigation will show that there are three different species: setosa, versicolor, and virginica. The dataset is primarily sorted by species, with fifty samples of each, making 150 samples in total.
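One quick way to confirm that is to count the species with frequencies (the key order of the printed map may differ on your machine):

(frequencies (map :Species (:rows (get-dataset :iris))))
; => {"setosa" 50, "versicolor" 50, "virginica" 50}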

The chapter talks about using a subset of the data for training, and then using the rest as test data. Let's say we want to use 100 samples as training data, and the remaining 50 samples as test data. Note that it will not work to simply take the first 100 samples as training data: because the dataset is sorted by species, the training data would consist only of setosa and versicolor, while the test data would consist only of virginica, so the classifier would never have seen the species it is being tested on. The simple solution is to shuffle the dataset first, and then divide it into training data and test data.

(def iris (shuffle (:rows (get-dataset :iris))))
(def training-data (take 100 iris))
(def test-data (drop 100 iris))

Take a look at some samples from the training data:

(pprint (take 3 training-data))
; => ({:Species "setosa",
;      :Petal.Width 0.2,
;      :Petal.Length 1.4,
;      :Sepal.Width 3.0,
;      :Sepal.Length 4.9}
;     {:Species "versicolor",
;      :Petal.Width 1.4,
;      :Petal.Length 4.6,
;      :Sepal.Width 3.0,
;      :Sepal.Length 6.1}
;     {:Species "virginica",
;      :Petal.Width 2.0,
;      :Petal.Length 6.7,
;      :Sepal.Width 2.8,
;      :Sepal.Length 7.7})

Your results will be different, of course, because shuffle is non-deterministic.
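If you ever want a repeatable split across REPL sessions, one option (not used in the rest of this chapter, purely a convenience helper of our own) is to shuffle with a fixed seed via Java interop; a minimal sketch:

(defn seeded-shuffle
  "Like clojure.core/shuffle, but deterministic for a given seed."
  [coll seed]
  (let [lst (java.util.ArrayList. coll)]
    (java.util.Collections/shuffle lst (java.util.Random. seed))
    (vec lst)))

; e.g. (def iris (seeded-shuffle (:rows (get-dataset :iris)) 42))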

To implement K-Nearest Neighbors, we'll need a way to calculate distances from other samples. First we'll define a way to select the features that we want to use as dimensions, and then use euclidean-distance to calculate the distance.

(def features #{:Petal.Width :Petal.Length :Sepal.Width :Sepal.Length})

(for [training-sample (take 3 training-data)]
  (euclidean-distance (map (first test-data) features)
                      (map training-sample features)))

; => (0.9273618495495699 5.40092584655631 0.3464101615137755)

This example calculated the distance from the first test-data sample to the first three training-data samples. Those distances aren't very useful on their own, though, without knowing which training samples they correspond to. Let's fix that:

(->>
 (for [training-sample (take 3 training-data)]
   (euclidean-distance (map (first test-data) features)
                       (map training-sample features)))
 (zipmap training-data)
 (pprint))

;; => {{:Species "setosa",
;;     :Petal.Width 0.1,
;;     :Petal.Length 1.1,
;;     :Sepal.Width 3.0,
;;     :Sepal.Length 4.3}
;;    0.3464101615137755,
;;    {:Species "virginica",
;;     :Petal.Width 1.6,
;;     :Petal.Length 5.8,
;;     :Sepal.Width 3.0,
;;     :Sepal.Length 7.2}
;;    5.40092584655631,
;;    {:Species "setosa",
;;     :Petal.Width 0.3,
;;     :Petal.Length 1.4,
;;     :Sepal.Width 3.5,
;;     :Sepal.Length 5.1}
;;    0.9273618495495699}

That's better. The result is a map in which the keys are training samples, and the values are the distances from the given test sample. From there, the next step is to calculate the distance to every training sample, sort the results by distance, take the K nearest, and determine the species of each. For now, assume that K = 3.
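The pipeline below pours the distance map into a priority-map, which keeps its entries sorted by value, smallest first; a quick illustration with a toy map:

(into (priority-map) {:a 3, :b 1, :c 2})
; => {:b 1, :c 2, :a 3}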

(->>
 (for [training-sample training-data]
   (euclidean-distance (map (first test-data) features)
                       (map training-sample features)))
 (zipmap training-data)
 (into (priority-map))
 (take 3)
 (keys)
 (map :Species))

;; => ("setosa" "setosa" "setosa")

In this case, all 3 nearest neighbors have the same species, but that won't necessarily be the case. Since K-Nearest Neighbors is a majority-rule algorithm, the function needs to take a vote from all samples.
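The vote itself is just frequencies applied to the neighbors' labels; for example, on a hand-written mixed set of neighbors:

(frequencies ["setosa" "setosa" "versicolor"])
; => {"setosa" 2, "versicolor" 1}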

(->>
 (for [training-sample training-data]
   (euclidean-distance (map (first test-data) features)
                       (map training-sample features)))
 (zipmap training-data)
 (into (priority-map))
 (take 3)
 (keys)
 (map :Species)
 (frequencies))

;; => {"setosa" 3}

Sure enough, 3 votes for setosa and no votes for anything else. With a different test sample (or an unluckier shuffle) it could have been something like {"setosa" 2, "virginica" 1}.

Take the key with the highest vote:

(->>
 (for [training-sample training-data]
   (euclidean-distance (map (first test-data) features)
                       (map training-sample features)))
 (zipmap training-data)
 (into (priority-map))
 (take 3)
 (keys)
 (map :Species)
 (frequencies)
 (into (priority-map-by >))
 (ffirst))

; => "setosa"
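As an aside, clojure.core's max-key can pick the winning entry without building a second priority-map; a small sketch on a hand-written vote map (the code below sticks with the priority-map-by approach):

(key (apply max-key val {"setosa" 2, "virginica" 1}))
; => "setosa"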

...and that's really all there is to K-Nearest Neighbors. Implement it as a function, and test it:

(defn knn-predict [k training-data test-sample features label]
  (->>
   (for [training-sample training-data]
     (euclidean-distance (map test-sample features)
                         (map training-sample features)))
   (zipmap training-data)
   (into (priority-map))
   (take k)
   (keys)
   (map label)
   (frequencies)
   (into (priority-map-by >))
   (ffirst)))

(knn-predict 3 training-data (first test-data) features :Species)
; => "setosa"

For every sample in the test data, invoke knn-predict and compare the result to the test sample's actual species:

(for [test-sample test-data]
  (= (knn-predict 3 training-data test-sample features :Species)
     (test-sample :Species)))
; => (true true true true true true true true true true true true true true true true true true true true true true true true true true true true true true true true true true true true true true true true true true true true true true true true true false)

Use the frequencies function to make the results a little more clear:

(frequencies
  (for [test-sample test-data]
    (= (knn-predict 3 training-data test-sample features :Species)
      (test-sample :Species))))
; => {true 49, false 1}

Nice. Out of 50 test samples, the knn-predict function got 49 right.
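If you plan to re-run that check with different values of k, it may be convenient to wrap it up as a function; knn-accuracy below is not part of the chapter's code, just one possible helper:

(defn knn-accuracy
  "Fraction of test samples whose predicted label matches the actual label."
  [k training-data test-data features label]
  (let [results (for [test-sample test-data]
                  (= (knn-predict k training-data test-sample features label)
                     (test-sample label)))]
    (/ (count (filter true? results)) (count results))))

(knn-accuracy 3 training-data test-data features :Species)
; => 49/50 (your result will depend on the shuffle)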

K-Means

Looking at the pseudocode for K-Means, the first item of interest is randomly initializing the center of each cluster. Using the same REPL session (to keep some of the helpers defined for knn):

(defn random-value-in-range [low high]
  (-> (- high low) (* (rand)) (+ low)))

(random-value-in-range 0.5 1.0)
; => 0.7214331298987502
(random-value-in-range 1.0 2.5)
; => 2.2810654311723395
; of course, this being random, you'll get different results

(defn random-location [samples]
  (let [lows (apply map min samples)
        highs (apply map max samples)]
    (map random-value-in-range lows highs)))

; To see the (apply map ...) in action:
(apply map min [[1 5 9] [5 6 7] [2 4 8]]) 
; => (1 4 7)
(apply map max [[1 5 9] [5 6 7] [2 4 8]])
; => (5 6 9)
; Those two lines just pull the minimum and maximum values for each dimension

(random-location [[0 0 0] [1 1 1]])
; => (0.2218274085187797 0.9926206341499997 0.014844880956397244)
(random-location [[0 0 0] [0.5 1 1]])
; => (0.2214190698279815 0.5994476487677112 0.7131121419505133)

Now that we have a function that will give us a random location within the range of a sequence of samples, let's try it on our training data:

(random-location
 (for [sample training-data] (map sample features)))
; => (7.671192681240427 3.776487101292618 6.50637579588853 0.34285239341161367)

Given that our dataset is four-dimensional, this expression generates a four-dimensional random location, with each coordinate falling within the range that occurs in the dataset.

Moving on, the pseudocode shows that for each sample in the training data, we need a way to find the closest cluster center.

(defn closest [centers point]
  (->> centers
       (map (partial euclidean-distance point))
       (zipmap centers)
       (into (priority-map))
       (ffirst)))

(closest [[10 10 10] [20 10 20] [5 20 5]] [12 15 10])
; => [10 10 10]
(closest [[10 10 10] [20 10 20] [5 20 5]] [17 15 14])
; => [20 10 20]

This function takes a sequence of centers, and a point. It calculates the distance from the point to each center. It then uses zipmap to create a map in which the keys are the centers and the values represent the distance from the point to the corresponding center. It uses priority-map to sort the map by value, then ffirst gets the key of the first key-value pair.

After assigning each sample to one of the centers, the next step is to re-estimate the center of each cluster. This is done by taking the means of the samples for each center. Here is a function that will calculate the mean, given a sequence of samples:

(defn mean [samples]
  (->> samples
       (apply map +)
       (map #(/ % (count samples)))))

(mean [[10 10 10] [18 12 20] [5 20 6]])
; => (11 14 12)

With that, we are ready to implement the k-means function.

(defn k-means

  ([k training-data features]
     (let [samples (for [sample training-data] (map sample features))
           centers (take k (repeatedly #(random-location samples)))]
       (map (partial zipmap features)
            (k-means samples centers))))

  ([samples centers]
     (let [new-centers
           (->> samples
                (group-by (partial closest centers))
                (vals)
                (map mean))]
       (if (= new-centers centers) centers
           (recur samples new-centers)))))

This function has two arities. The first can be considered the "public" interface. It takes k, the training data, and a sequence representing the features in the training data. Given the training data, it randomly generates k centers. It invokes the second arity, passing in the samples from the training data and the generated centers. Finally, it uses zipmap on each of the resulting centers in order to convert each center (which is just a sequence of numbers) into a hash-map keyed by feature name.

The second arity contains the bulk of the algorithm. It groups the samples by their closest center, calculates the mean of each group, and binds the result to new-centers. It then compares the new centers to the previous centers; if they are the same, the algorithm has converged and we are done. Otherwise, it recurs with the new centers.
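To see the grouping step in isolation, here is what group-by combined with closest does on a few toy two-dimensional points and two toy centers (the printed key order may differ):

(group-by (partial closest [[0 0] [10 10]])
          [[1 1] [2 2] [9 9] [8 8]])
; => {[0 0] [[1 1] [2 2]], [10 10] [[9 9] [8 8]]}

(map mean
     (vals (group-by (partial closest [[0 0] [10 10]])
                     [[1 1] [2 2] [9 9] [8 8]])))
; => ((3/2 3/2) (17/2 17/2)), in the same order as the keys above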

(pprint (k-means 3 training-data features))
; => ({:Petal.Width 1.4657894736842103,
;      :Petal.Length 4.455263157894737,
;      :Sepal.Width 2.736842105263158,
;      :Sepal.Length 5.88157894736842}
;     {:Petal.Width 0.24871794871794883,
;      :Petal.Length 1.4538461538461538,
;      :Sepal.Width 3.4128205128205127,
;      :Sepal.Length 4.992307692307692}
;     {:Petal.Width 2.1086956521739126,
;      :Petal.Length 5.773913043478261,
;      :Sepal.Width 3.104347826086957,
;      :Sepal.Length 6.908695652173916})

For comparison, calculate the actual mean for each species for the test data set:

(pprint
  (for [[species rows] (group-by :Species test-data)]
    (zipmap features (mean (for [row rows] (map row features))))))
; => ({:Petal.Width 2.023529411764706,
;      :Petal.Length 5.605882352941177,
;      :Sepal.Width 2.9411764705882355,
;      :Sepal.Length 6.611764705882352}
;     {:Petal.Width 0.23636363636363636,
;      :Petal.Length 1.4909090909090907,
;      :Sepal.Width 3.481818181818181,
;      :Sepal.Length 5.054545454545455}
;     {:Petal.Width 1.3181818181818183,
;      :Petal.Length 4.236363636363636,
;      :Sepal.Width 2.809090909090909,
;      :Sepal.Length 5.972727272727272})

They're not too different from the centers found by the k-means function, albeit in a different order (the order is arbitrary, since the initial centers were chosen at random)!

That wraps up chapter two; chapter three to come later.
