an alternative distance metric, categorical variable handling, and optimization ideas

First of all, thank you for the excellent package and companion articles.

While looking over the `aoa` code it occurred to me that some of the complexity associated with handling categorical variables could be removed by switching to a different distance metric. Gower's generalized distance metric is a good fit because it can integrate mixtures of ratio, nominal, and ordinal data types. The metric also scales each numeric variable by its range, so no separate scaling / centering step is needed. There are a few implementations:

   * https://github.com/markvanderloo/gower/
   * https://www.rdocumentation.org/packages/cluster/versions/2.1.2/topics/daisy
   * https://drostlab.github.io/philentropy/articles/Distances.html
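
For reference, here is a minimal sketch of what the metric does with mixed data, using `gower::gower_dist()`. The column names (`elev`, `landuse`) and values are made up for illustration. The hand calculation assumes equal weights, with numeric differences divided by the range and factors scored 0 (same level) or 1 (different level).

```r
library(gower)

# hypothetical mixed-type data: one numeric and one nominal variable
df <- data.frame(
  elev = c(100, 250, 400, 900),
  landuse = factor(c("forest", "crop", "forest", "urban"))
)

# Gower distance from the first row to every row (the single row is recycled)
d_pkg <- gower_dist(df[1, ], df)

# the same calculation by hand
d_num <- abs(df$elev[1] - df$elev) / diff(range(df$elev))  # range-scaled difference
d_cat <- as.numeric(df$landuse[1] != df$landuse)           # 0 = same level, 1 = different
d_manual <- (d_num + d_cat) / 2                            # unweighted mean over variables

all.equal(d_pkg, d_manual)
```

Each variable contributes a value between 0 and 1, so the factor can be used as-is, without dummy coding.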

It would appear that the [`knnx.dist`](https://github.com/HannaMeyer/CAST/blob/8a217f97889d5779c649e8eb9b960ad4c79d2e7a/R/aoa.R#L229) function does all of the heavy lifting in `aoa`.

A quick benchmark of a few candidate methods:
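```r
library(gower)
library(cluster)
library(FNN)
library(microbenchmark)

# simulate three continuous predictors
set.seed(10101)
n <- 1000
a <- rnorm(n = n, mean = 0, sd = 2)
x <- rnorm(n = n, mean = 0, sd = 2)
y <- rnorm(n = n, mean = 0, sd = 2)

z <- data.frame(x, y, a)

# note: the three calls do not perform the same amount of work
microbenchmark(
  # row-wise Gower distances; the 10 rows are recycled against the rows of z
  gower = gower_dist(z[1:10, ], z),
  # Euclidean distance from each of the 10 query rows to its nearest neighbor in z
  knn = knnx.dist(data = z, query = z[1:10, ], k = 1),
  # full pairwise Gower dissimilarities among all rows of z
  daisy = daisy(z, metric = 'gower')
)
```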

The interface and resulting objects aren't directly compatible, but it does seem like `gower::gower_dist()` is a reasonable candidate in terms of speed. The main reason to consider `cluster::daisy` is that it can accommodate all variable types, while `gower::gower_dist()` does not [yet](https://github.com/markvanderloo/gower/issues/2) differentiate between nominal / ordinal factors.
```
Unit: microseconds
  expr     min       lq       mean   median      uq      max neval cld
 gower   395.7   444.70    523.737   497.35   559.0    874.3   100  a 
   knn   772.6   794.05    892.615   842.70   925.2   1382.7   100  a 
 daisy 56398.0 73496.70 100253.478 78571.80 88727.8 276262.1   100   b
```
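
Since the result objects differ, here is a rough sketch of how a nearest-neighbor Gower distance, comparable in shape to `knnx.dist(..., k = 1)`, could be obtained with `gower::gower_topn()`. The `train` and `new_pts` data frames are hypothetical stand-ins for the training data and new locations. This is not a drop-in replacement for the current `aoa` code.

```r
library(gower)

# hypothetical training data: one numeric and one nominal predictor
lv <- c("a", "b", "c")
train <- data.frame(
  elev = seq(0, 1000, length.out = 50),
  cover = factor(rep(lv, length.out = 50), levels = lv)
)

# hypothetical new locations to evaluate against the training data
new_pts <- data.frame(
  elev = c(120, 555, 980),
  cover = factor(c("a", "c", "b"), levels = lv)
)

# for each row of new_pts, find the single closest row of train
nn <- gower_topn(x = new_pts, y = train, n = 1)

# both list elements are matrices with n rows and nrow(new_pts) columns
nn_dist <- as.numeric(nn$distance)  # nearest-neighbor Gower distance per new point
nn_idx  <- as.integer(nn$index)     # row of train that is closest
```

`knnx.dist()` returns a matrix with `nrow(query)` rows and `k` columns, so with `k = 1` the vector `nn_dist` lines up with its single column.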

Profiling data for `aoa` run in a single thread:
![image](https://user-images.githubusercontent.com/624277/131720628-f544b433-3a79-4288-9fbf-c5c327396043.png)

The profiling was performed with a model based on 1,030 observations, applied to a raster stack with the following dimensions:
`dimensions : 3628, 2351, 8529428, 18  (nrow, ncol, ncell, nlayers)`



I'll follow up with a small example dataset that contains nominal and ordinal variables.



