Add a precision threshold to Euclidean and SqEuclidean (#63)
If a matrix contains duplicated columns, the computed distance between identical points (which should be 0) is often of order 1e-8, because the square root of the roundoff error (about 1e-16 for Float64) is about 1e-8. This changes the behavior of Euclidean to recalculate the distance by direct subtraction when the points are close compared to their magnitudes.
README.md: 26 additions & 2 deletions
@@ -154,6 +154,32 @@ Each distance corresponds to a distance type. The type name and the correspondin
**Note:** The formulas above are using *Julia*'s functions. These formulas are mainly for conveying the math concepts in a concise way. The actual implementation may use a faster way.
### Precision for Euclidean and SqEuclidean

For efficiency (see the benchmarks below), `Euclidean` and `SqEuclidean` make use of BLAS3 matrix-matrix multiplication to calculate distances. This corresponds to the following expansion:

```julia
(x - y)^2 == x^2 - 2x*y + y^2
```
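
To make the expansion concrete for points stored as matrix columns, here is a minimal sketch that computes all pairwise squared distances with a single matrix product and compares them with direct subtraction. The function names `sqeuclidean_expanded` and `sqeuclidean_direct` are hypothetical helpers written for illustration, not part of the package, and the code assumes Julia 1.x syntax.

```julia
using LinearAlgebra

# Hypothetical helper (not the package API): squared Euclidean distances
# between the columns of X and the columns of Y via the expansion
# |x - y|^2 = |x|^2 - 2 x'y + |y|^2, with the cross term as one matrix product.
function sqeuclidean_expanded(X::AbstractMatrix, Y::AbstractMatrix)
    sx = vec(sum(abs2, X; dims=1))   # squared norm of each column of X
    sy = vec(sum(abs2, Y; dims=1))   # squared norm of each column of Y
    G = X' * Y                       # all inner products at once (matrix-matrix product)
    return sx .+ sy' .- 2 .* G
end

# Reference: direct subtraction, one pair of columns at a time.
function sqeuclidean_direct(X::AbstractMatrix, Y::AbstractMatrix)
    return [sum(abs2, view(X, :, i) .- view(Y, :, j)) for i in axes(X, 2), j in axes(Y, 2)]
end

X = [1.0 0.0; 2.0 1.0; 3.0 -1.0]   # two points in R^3, stored as columns
D_expanded = sqeuclidean_expanded(X, X)
D_direct = sqeuclidean_direct(X, X)
```

With these small integer-valued inputs both routes agree; the two can differ in the last digits when the inputs are not exactly representable, which is the situation the tolerance addresses.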

However, this identity does not hold exactly in the presence of roundoff error, and the result can be inaccurate when `x` and `y` are nearby points. Consequently, `Euclidean` and `SqEuclidean` allow you to supply a relative tolerance; when two points are close compared to their magnitudes, the distance is recalculated by direct subtraction:

```julia
julia> x = reshape([0.1, 0.3, -0.1], 3, 1);

julia> pairwise(Euclidean(), x, x)
1×1 Array{Float64,2}:
 7.45058e-9

julia> pairwise(Euclidean(1e-12), x, x)
1×1 Array{Float64,2}:
 0.0
```
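
The recalculation idea can be sketched as follows. This is not the package's implementation: the function `pairwise_euclidean_sketch`, its `thresh` keyword, and in particular the closeness test (squared distance below `thresh` times the sum of the squared norms) are assumptions chosen for illustration, written in Julia 1.x syntax.

```julia
# Hypothetical sketch (not the package's implementation): pairwise Euclidean
# distances between the columns of X and Y with a relative tolerance `thresh`.
function pairwise_euclidean_sketch(X::AbstractMatrix, Y::AbstractMatrix; thresh::Real=0.0)
    sx = vec(sum(abs2, X; dims=1))        # squared norms of the columns of X
    sy = vec(sum(abs2, Y; dims=1))        # squared norms of the columns of Y
    D = sx .+ sy' .- 2 .* (X' * Y)        # fast path: expansion with a matrix product
    for j in axes(Y, 2), i in axes(X, 2)
        # Assumed closeness test: the squared distance is small relative to
        # the squared magnitudes of the two points (or negative from roundoff).
        if D[i, j] < thresh * (sx[i] + sy[j])
            D[i, j] = sum(abs2, view(X, :, i) .- view(Y, :, j))   # direct subtraction
        end
        # Clamp any remaining tiny negative value before taking the root.
        D[i, j] = sqrt(max(D[i, j], zero(eltype(D))))
    end
    return D
end

x = reshape([0.1, 0.3, -0.1], 3, 1)
d_fast = pairwise_euclidean_sketch(x, x)                 # may be a small nonzero value
d_safe = pairwise_euclidean_sketch(x, x; thresh=1e-12)   # recomputed by subtraction
```

Under this assumed test, identical columns always take the direct-subtraction path when `thresh` is positive, so `d_safe` is exactly zero, while well-separated points keep the fast matrix-product result.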
## Benchmarks
@@ -215,5 +241,3 @@ The table below compares the performance (measured in terms of average elapsed t
For distances of which a major part of the computation is a quadratic form (e.g. *Euclidean*, *CosineDist*, *Mahalanobis*), the performance can be drastically improved by restructuring the computation and delegating the core part to ``GEMM`` in *BLAS*. The use of this strategy can easily lead to 100x performance gain over simple loops (see the highlighted part of the table above).