Commit 3dec24c

Merge pull request #21 from rawls238/master
Add Jaccard + Rogers-Tanimoto Distances. Fixes #10
2 parents 81cc6e4 + 6a310a4 commit 3dec24c

4 files changed, with 71 additions and 12 deletions

README.md

Lines changed: 13 additions & 9 deletions
Original file line number | Diff line number | Diff line change
@@ -14,7 +14,9 @@ This package also provides optimized functions to compute column-wise and pairwi
1414

1515
* Euclidean distance
1616
* Squared Euclidean distance
17-
* Cityblock distance
17+
* Cityblock distance
18+
* Jaccard distance
19+
* Rogers-Tanimoto distance
1820
* Chebyshev distance
1921
* Minkowski distance
2022
* Hamming distance
@@ -47,7 +49,7 @@ Here, dist is an instance of a distance type. For example, the type for Euclidea
4749

4850
```julia
4951
r = evaluate(Euclidean(), x, y)
50-
```
52+
```
5153

5254
Common distances also come with convenient functions for distance evaluation. For example, you may also compute Euclidean distance between two vectors as below
5355

@@ -103,36 +105,38 @@ Please pay attention to the difference, the functions for inplace computation ar
103105

104106
## Distance type hierarchy
105107

106-
The distances are organized into a type hierarchy.
108+
The distances are organized into a type hierarchy.
107109

108110
At the top of this hierarchy is the abstract type **PreMetric**. A *pre-metric* is a function ``d`` that satisfies
109111

110112
d(x, x) == 0 for all x
111113
d(x, y) >= 0 for all x, y
112-
114+
113115
**SemiMetric** is an abstract type that refines **PreMetric**. Formally, a *semi-metric* is a *pre-metric* that is also symmetric:
114116

115117
d(x, y) == d(y, x) for all x, y
116-
118+
117119
**Metric** is an abstract type that further refines **SemiMetric**. Formally, a *metric* is a *semi-metric* that also satisfies the triangle inequality:
118120

119121
d(x, z) <= d(x, y) + d(y, z) for all x, y, z
120-
122+
121123
This type system has practical significance. For example, when computing pairwise distances within a set of vectors, you need to evaluate only half of the pairs and can fill in the remaining half immediately by leveraging the symmetry of *semi-metrics*.
122124

123125
Each distance corresponds to a distance type. The type name and the corresponding mathematical definitions of the distances are listed in the following table.
124126

125-
| type name | convenient syntax | math definition |
127+
| type name | convenient syntax | math definition |
126128
| -------------------- | -------------------- | --------------------|
127129
| Euclidean | euclidean(x, y) | sqrt(sum((x - y) .^ 2)) |
128130
| SqEuclidean | sqeuclidean(x, y) | sum((x - y).^2) |
129131
| Cityblock | cityblock(x, y) | sum(abs(x - y)) |
130132
| Chebyshev | chebyshev(x, y) | max(abs(x - y)) |
131133
| Minkowski | minkowski(x, y, p) | sum(abs(x - y).^p) ^ (1/p) |
132134
| Hamming | hamming(x, y) | sum(x .!= y) |
135+
| Rogers-Tanimoto | rogerstanimoto(x, y)| 2(sum(x&!y) + sum(!x&y)) / (2(sum(x&!y) + sum(!x&y)) + sum(x&y) + sum(!x&!y)) |
136+
| Jaccard | jaccard(x, y) | 1 - sum(min(x, y)) / sum(max(x, y)) |
133137
| CosineDist | cosine_dist(x, y) | 1 - dot(x, y) / (norm(x) * norm(y)) |
134138
| CorrDist | corr_dist(x, y) | cosine_dist(x - mean(x), y - mean(y)) |
135-
| ChiSqDist | chisq_dist(x, y) | sum((x - y).^2 / (x + y)) |
139+
| ChiSqDist | chisq_dist(x, y) | sum((x - y).^2 / (x + y)) |
136140
| KLDivergence | kl_divergence(x, y) | sum(x .* log(x ./ y)) |
137141
| JSDivergence | js_divergence(x, y) | KL(x, m) / 2 + KL(y, m) / 2 with m = (x + y) / 2 |
138142
| SpanNormDist | spannorm_dist(x, y) | max(x - y) - min(x - y ) |
@@ -145,7 +149,7 @@ Each distance corresponds to a distance type. The type name and the correspondin
145149
| WeightedCityblock | cityblock(x, y, w) | sum(abs(x - y) .* w) |
146150
| WeightedMinkowski | minkowski(x, y, w, p) | sum(abs(x - y).^p .* w) ^ (1/p) |
147151
| WeightedHamming | hamming(x, y, w) | sum((x .!= y) .* w) |
148-
152+
149153
**Note:** The formulas above are written with *Julia* functions. They are meant to convey the math concisely; the actual implementation may compute the same values in a faster way.
150154

151155
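The README's remark about symmetry can be sketched with multiple dispatch. The following is only an illustration in current Julia syntax, not the package's implementation: the three abstract types are local stand-ins for the hierarchy described above, and `CityblockSketch`, `evaluate_sketch`, and `pairwise_sketch` are hypothetical names introduced for this example.

```julia
# Local stand-ins for the hierarchy described in the README (assumed, not the package's own definitions).
abstract type PreMetric end
abstract type SemiMetric <: PreMetric end
abstract type Metric <: SemiMetric end

# Hypothetical distance type and evaluator used only in this sketch.
struct CityblockSketch <: Metric end
evaluate_sketch(::CityblockSketch, x, y) = sum(abs.(x .- y))

# Fallback for any pre-metric: no symmetry is assumed, so every ordered pair is evaluated.
function pairwise_sketch(d::PreMetric, a::AbstractMatrix)
    n = size(a, 2)
    return [evaluate_sketch(d, view(a, :, i), view(a, :, j)) for i in 1:n, j in 1:n]
end

# Specialization for semi-metrics: evaluate only pairs with i < j and mirror them.
function pairwise_sketch(d::SemiMetric, a::AbstractMatrix)
    n = size(a, 2)
    r = zeros(n, n)                      # diagonal stays 0 because d(x, x) == 0
    for j in 2:n, i in 1:j-1
        r[i, j] = r[j, i] = evaluate_sketch(d, view(a, :, i), view(a, :, j))
    end
    return r
end

a = [0.0 1.0 3.0;
     0.0 1.0 5.0]                        # three column vectors
r = pairwise_sketch(CityblockSketch(), a)
```

Because `CityblockSketch` is a subtype of `SemiMetric`, the second method is selected and only three of the nine entries are computed directly; the fallback would produce the same matrix with more evaluations.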

src/Distances.jl

Lines changed: 4 additions & 0 deletions
Original file line number | Diff line number | Diff line change
@@ -24,6 +24,8 @@ export
2424
Cityblock,
2525
Chebyshev,
2626
Minkowski,
27+
Jaccard,
28+
RogersTanimoto,
2729

2830
Hamming,
2931
CosineDist,
@@ -47,6 +49,8 @@ export
4749
euclidean,
4850
sqeuclidean,
4951
cityblock,
52+
jaccard,
53+
rogerstanimoto,
5054
chebyshev,
5155
minkowski,
5256
mahalanobis,

src/metrics.jl

Lines changed: 46 additions & 2 deletions
Original file line number | Diff line number | Diff line change
@@ -10,6 +10,8 @@ type Euclidean <: Metric end
1010
type SqEuclidean <: SemiMetric end
1111
type Chebyshev <: Metric end
1212
type Cityblock <: Metric end
13+
type Jaccard <: Metric end
14+
type RogersTanimoto <: Metric end
1315

1416
immutable Minkowski{T <: Real} <: Metric
1517
p::T
@@ -26,7 +28,7 @@ type JSDivergence <: SemiMetric end
2628

2729
type SpanNormDist <: SemiMetric end
2830

29-
typealias UnionMetrics @compat(Union{Euclidean, SqEuclidean, Chebyshev, Cityblock, Minkowski, Hamming, CosineDist, CorrDist, ChiSqDist, KLDivergence, JSDivergence, SpanNormDist})
31+
typealias UnionMetrics @compat(Union{Euclidean, SqEuclidean, Chebyshev, Cityblock, Minkowski, Hamming, Jaccard, RogersTanimoto, CosineDist, CorrDist, ChiSqDist, KLDivergence, JSDivergence, SpanNormDist})
3032

3133
###########################################################
3234
#
@@ -155,7 +157,7 @@ js_divergence(a::AbstractArray, b::AbstractArray) = evaluate(JSDivergence(), a,
155157

156158
# SpanNormDist
157159
function eval_start(::SpanNormDist, a::AbstractArray, b::AbstractArray)
158-
a[1] - b[1], a[1]- b[1]
160+
a[1] - b[1], a[1] - b[1]
159161
end
160162
@compat @inline eval_op(::SpanNormDist, ai, bi) = ai - bi
161163
@compat @inline function eval_reduce(::SpanNormDist, s1, s2)
@@ -175,6 +177,46 @@ function result_type{T1, T2}(dist::SpanNormDist, ::AbstractArray{T1}, ::Abstract
175177
end
176178

177179

180+
# Jaccard
181+
182+
@compat @inline eval_start(::Jaccard, a::AbstractArray, b::AbstractArray) = 0, 0
183+
@compat @inline function eval_op(::Jaccard, s1, s2)
184+
denominator = max(s1, s2)
185+
numerator = min(s1, s2)
186+
numerator, denominator
187+
end
188+
@compat @inline function eval_reduce(::Jaccard, s1, s2)
189+
a = s1[1] + s2[1]
190+
b = s1[2] + s2[2]
191+
a, b
192+
end
193+
@compat @inline eval_end(::Jaccard, a) = 1 - (a[1]/a[2])
194+
jaccard(a::AbstractArray, b::AbstractArray) = evaluate(Jaccard(), a, b)
195+
196+
# RogersTanimoto
197+
198+
@compat @inline eval_start(::RogersTanimoto, a::AbstractArray, b::AbstractArray) = 0, 0, 0, 0
199+
@compat @inline function eval_op(::RogersTanimoto, s1, s2)
200+
tt = s1 && s2
201+
tf = s1 && !s2
202+
ft = !s1 && s2
203+
ff = !s1 && !s2
204+
tt, tf, ft, ff
205+
end
206+
@compat @inline function eval_reduce(::RogersTanimoto, s1, s2)
207+
a = s1[1] + s2[1]
208+
b = s1[2] + s2[2]
209+
c = s1[3] + s2[3]
210+
d = s1[4] + s2[4]
211+
a, b, c, d
212+
end
213+
@compat @inline function eval_end(::RogersTanimoto, a)
214+
numerator = 2(a[2] + a[3])
215+
denominator = a[1] + a[4] + 2(a[2] + a[3])
216+
numerator / denominator
217+
end
218+
rogerstanimoto{T <: Bool}(a::AbstractArray{T}, b::AbstractArray{T}) = evaluate(RogersTanimoto(), a, b)
219+
178220
###########################################################
179221
#
180222
# Special method
@@ -227,6 +269,7 @@ function pairwise!(r::AbstractMatrix, dist::Euclidean, a::AbstractMatrix, b::Abs
227269
end
228270
r
229271
end
272+
230273
function pairwise!(r::AbstractMatrix, dist::Euclidean, a::AbstractMatrix)
231274
m, n = get_pairwise_dims(r, a)
232275
At_mul_B!(r, a, a)
@@ -245,6 +288,7 @@ function pairwise!(r::AbstractMatrix, dist::Euclidean, a::AbstractMatrix)
245288
end
246289

247290
# CosineDist
291+
248292
function pairwise!(r::AbstractMatrix, dist::CosineDist, a::AbstractMatrix, b::AbstractMatrix)
249293
m, na, nb = get_pairwise_dims(r, a, b)
250294
At_mul_B!(r, a, b)
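The diff does not show the generic driver that calls `eval_start`, `eval_op`, `eval_reduce`, and `eval_end`. Assuming it folds over the two arrays element by element, passing the running accumulator first and the per-element result second, the Rogers-Tanimoto distance can be sketched in current Julia syntax as below. `fold_distance`, `rt_op`, `rt_combine`, `rt_finish`, and `rogerstanimoto_fold` are hypothetical names for this illustration, and the counts follow the README formula (tt, tf, ft, ff).

```julia
# Hypothetical driver; not a function from the package.
function fold_distance(start, op, combine, finish, a, b)
    length(a) == length(b) || throw(DimensionMismatch("inputs differ in length"))
    s = start
    for (ai, bi) in zip(a, b)
        s = combine(s, op(ai, bi))       # accumulator first, new element second
    end
    return finish(s)
end

# Per-element indicator tuple: (both true, only x, only y, both false).
rt_op(x::Bool, y::Bool) = (x && y, x && !y, !x && y, !x && !y)

# Component-wise sum of the accumulator and the new indicator tuple.
rt_combine(s, t) = (s[1] + t[1], s[2] + t[2], s[3] + t[3], s[4] + t[4])

# 2(tf + ft) / (tt + ff + 2(tf + ft)), as in the README table.
rt_finish(s) = 2 * (s[2] + s[3]) / (s[1] + s[4] + 2 * (s[2] + s[3]))

rogerstanimoto_fold(a, b) =
    fold_distance((0, 0, 0, 0), rt_op, rt_combine, rt_finish, a, b)

# One position of each kind, including a position where both entries are false.
d_mixed = rogerstanimoto_fold([true, false, true, false], [false, true, true, false])
```

An input that contains a position where both entries are `false` exercises the fourth count, which the three-element vectors in the test file never do.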

test/test_dists.jl

Lines changed: 8 additions & 1 deletion
Original file line number | Diff line number | Diff line change
@@ -32,6 +32,10 @@ b = 2
3232
@test hamming(a, a) == 0
3333
@test hamming(a, b) == 1
3434

35+
bt = [true, false, true]
36+
bf = [false, true, true]
37+
@test rogerstanimoto(bt, bt) == 0
38+
@test rogerstanimoto(bt, bf) == 4./5
3539

3640

3741
p = rand(12)
@@ -47,6 +51,9 @@ for (x, y) in (([4., 5., 6., 7.], [3., 9., 8., 1.]),
4751
@test euclidean(x, x) == 0.
4852
@test euclidean(x, y) == sqrt(57.)
4953

54+
@test jaccard(x, x) == 0
55+
@test jaccard(x, y) == 13./28
56+
5057
@test cityblock(x, x) == 0.
5158
@test cityblock(x, y) == 13.
5259

@@ -56,11 +63,11 @@ for (x, y) in (([4., 5., 6., 7.], [3., 9., 8., 1.]),
5663
@test minkowski(x, x, 2) == 0.
5764
@test minkowski(x, y, 2) == sqrt(57.)
5865

59-
6066
@test_approx_eq_eps cosine_dist(x, x) 0.0 1.0e-12
6167
@test_throws DimensionMismatch cosine_dist(1.:2, 1.:3)
6268
@test_approx_eq_eps cosine_dist(x, y) (1.0 - 112. / sqrt(19530.)) 1.0e-12
6369

70+
6471
@test_approx_eq_eps corr_dist(x, x) 0. 1.0e-12
6572
@test_approx_eq corr_dist(x, y) cosine_dist(x .- mean(x), vec(y) .- mean(y))
6673
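The expected values in the new tests can be cross-checked against the README formulas. The sketch below uses current Julia syntax; `jaccard_ref` and `rogerstanimoto_ref` are hypothetical reference functions written directly from the table, not the package's optimized code.

```julia
# Reference formulas taken from the README table (hypothetical helper names).
jaccard_ref(x, y) = 1 - sum(min.(x, y)) / sum(max.(x, y))

function rogerstanimoto_ref(x::AbstractVector{Bool}, y::AbstractVector{Bool})
    tt = count(x .& y)        # both true
    tf = count(x .& .!y)      # true in x only
    ft = count(.!x .& y)      # true in y only
    ff = count(.!x .& .!y)    # both false
    return 2 * (tf + ft) / (tt + ff + 2 * (tf + ft))
end

x = [4.0, 5.0, 6.0, 7.0]
y = [3.0, 9.0, 8.0, 1.0]
# sum of minima = 3 + 5 + 6 + 1 = 15, sum of maxima = 4 + 9 + 8 + 7 = 28
dj = jaccard_ref(x, y)        # 1 - 15/28, i.e. 13/28 up to rounding

bt = [true, false, true]
bf = [false, true, true]
# tt = 1, tf = 1, ft = 1, ff = 0, so the distance is 4 / (1 + 0 + 4)
drt = rogerstanimoto_ref(bt, bf)
```

The Jaccard value is compared with `≈` in this sketch rather than `==`, since `1 - 15/28` and `13/28` are computed along different floating-point paths.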
