Commit 32d4120

author
Vanessa K Lee
committed
rebase
1 parent 26cf547 commit 32d4120

14 files changed, +12129 -7623 lines changed


README.md

Lines changed: 24 additions & 24 deletions
@@ -257,17 +257,14 @@ The bag distance is a cheap distance measure which always returns a distance sma
 </details>

 <details>
-<summary><u>Substring Set</u></summary>
-
-Splits the strings on spaces, sorts the pieces, re-joins them, and then computes the Jaro-Winkler distance. Best when the strings contain irrelevant substrings.
-</details>
-
-<details>
-<summary><u>Sørensen–Dice</u></summary>
+<summary><u>Double Metaphone</u></summary>

-The Sørensen–Dice coefficient is calculated using bigrams. The equation is `2nt / (nx + ny)` where nx is the number of bigrams in string x, ny is the number of bigrams in string y, and nt is the number of bigrams in both strings. For example, the bigrams of `night` and `nacht` are `{ni,ig,gh,ht}` and `{na,ac,ch,ht}`. Each has four bigrams, and the only bigram they share is `ht`.
+Calculates the [Double Metaphone Phonetic Algorithm](https://xlinux.nist.gov/dads/HTML/doubleMetaphone.html) metric of two strings. The return value is based on the match level: strict, strong, normal (default), or weak.

-``` (2 · 1) / (4 + 4) = 0.25 ```
+* "strict": both encodings for each string must match
+* "strong": the primary encoding for each string must match
+* "normal": the primary encoding of one string must match either encoding of the other string (default)
+* "weak": either the primary or secondary encoding of one string must match one encoding of the other string
 </details>

 <details>
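
The four Double Metaphone match levels above can be read as rules over a pair of encodings. The following is a minimal sketch under stated assumptions, not Akin's implementation: `MatchLevelSketch` and `matches?/3` are hypothetical names, and each encoding is passed in as a `{primary, secondary}` tuple of placeholder codes rather than computed.

```elixir
defmodule MatchLevelSketch do
  # Hypothetical helper: decides a match from two {primary, secondary}
  # encodings and a level. The encodings are supplied by the caller.
  def matches?({lp, ls}, {rp, rs}, level \\ :normal) do
    case level do
      # both encodings of each string must match
      :strict -> lp == rp and ls == rs
      # the primary encodings must match
      :strong -> lp == rp
      # the primary of one string matches either encoding of the other
      :normal -> (lp in [rp, rs]) or (rp in [lp, ls])
      # any encoding of one string matches any encoding of the other
      :weak -> Enum.any?([lp, ls], &(&1 in [rp, rs]))
    end
  end
end

# Placeholder codes, not real Double Metaphone output.
IO.inspect(MatchLevelSketch.matches?({"A", "B"}, {"C", "B"}, :weak))
```

Under these rules each level accepts everything the stricter levels accept, which is consistent with the test in this commit asserting that the normal score is below the weak score.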
@@ -303,14 +300,23 @@ Compares two strings by converting each to an approximate phonetic representatio
 </details>

 <details>
-<summary><u>Double Metaphone</u></summary>
+<summary><u>N-Gram Similarity</u></summary>

-Calculates the [Double Metaphone Phonetic Algorithm](https://xlinux.nist.gov/dads/HTML/doubleMetaphone.html) metric of two strings. The return value is based on the match level: strict, strong, normal (default), or weak.
+Calculates the ngram distance between two strings. Default ngram: 2.
+</details>

-* "strict": both encodings for each string must match
-* "strong": the primary encoding for each string must match
-* "normal": the primary encoding of one string must match either encoding of the other string (default)
-* "weak": either the primary or secondary encoding of one string must match one encoding of the other string
+<details>
+<summary><u>Overlap Metric</u></summary>
+
+Uses the Overlap Similarity metric to compare two strings by tokenizing the strings and measuring their overlap. Default ngram: 1.
+</details>
+
+<details>
+<summary><u>Sørensen–Dice</u></summary>
+
+The Sørensen–Dice coefficient is calculated using bigrams. The equation is `2nt / (nx + ny)` where nx is the number of bigrams in string x, ny is the number of bigrams in string y, and nt is the number of bigrams in both strings. For example, the bigrams of `night` and `nacht` are `{ni,ig,gh,ht}` and `{na,ac,ch,ht}`. Each has four bigrams, and the only bigram they share is `ht`.
+
+``` (2 · 1) / (4 + 4) = 0.25 ```
 </details>

 <details>
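
The `night` / `nacht` arithmetic in the Sørensen–Dice entry can be reproduced with a few lines of Elixir. This is a sketch, assuming bigrams are treated as sets (duplicates counted once); `DiceSketch` is a hypothetical module and may not match how Akin tokenizes or rounds.

```elixir
defmodule DiceSketch do
  # Overlapping two-character substrings of a string, as a set.
  def bigrams(string) do
    string
    |> String.graphemes()
    |> Enum.chunk_every(2, 1, :discard)
    |> Enum.map(&Enum.join/1)
    |> MapSet.new()
  end

  # 2 * nt / (nx + ny), where nt is the number of shared bigrams.
  def coefficient(x, y) do
    bx = bigrams(x)
    by = bigrams(y)
    total = MapSet.size(bx) + MapSet.size(by)
    shared = MapSet.size(MapSet.intersection(bx, by))

    # Strings shorter than two characters have no bigrams at all.
    if total == 0, do: 0.0, else: 2 * shared / total
  end
end

IO.inspect(DiceSketch.coefficient("night", "nacht"))
# prints 0.25
```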
@@ -324,15 +330,9 @@ accuracy for search terms containing more than one word.
 </details>

 <details>
-<summary><u>N-Gram Similarity</u></summary>
-
-Calculates the ngram distance between two strings. Default ngram: 2.
-</details>
-
-<details>
-<summary><u>Overlap Metric</u></summary>
+<summary><u>Substring Set</u></summary>

-Uses the Overlap Similarity metric to compare two strings by tokenizing the strings and measuring their overlap. Default ngram: 1.
+Splits the strings on spaces, sorts the pieces, re-joins them, and then computes the Jaro-Winkler distance. Best when the strings contain irrelevant substrings.
 </details>

 <details>
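
The split, sort, re-join step described for Substring Set makes the comparison insensitive to word order. A rough sketch follows, with one loudly stated substitution: it uses the standard library's `String.jaro_distance/2` (plain Jaro) in place of Jaro-Winkler, and `SortedTokensSketch` is a hypothetical name, not part of Akin.

```elixir
defmodule SortedTokensSketch do
  # Split on spaces, sort the tokens, and join them back together.
  def normalize(string) do
    string
    |> String.split(" ", trim: true)
    |> Enum.sort()
    |> Enum.join(" ")
  end

  # Assumption: plain Jaro stands in for Jaro-Winkler here.
  def similarity(left, right) do
    String.jaro_distance(normalize(left), normalize(right))
  end
end

IO.inspect(SortedTokensSketch.normalize("wonderland in alice"))
# prints "alice in wonderland"
```

Two strings that contain the same words in a different order normalize to the same text, so they compare as identical.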
@@ -361,7 +361,7 @@ A generalization of Sørensen–Dice and Jaccard.

 ## In Development

-* Author Name Disambiguation (see lib/akin/and.ex for developments)
+* Further enhancements to name matching
 * Add Damerau-Levenshtein algorithm
 * [Damerau-Levenshtein](https://en.wikipedia.org/wiki/Damerau-Levenshtein_distance)
 * [Examples](https://datascience.stackexchange.com/questions/60019/damerau-levenshtein-edit-distance-in-python)
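
The README's Overlap Metric entry does not spell out a formula. One common definition of the overlap coefficient is the size of the intersection divided by the size of the smaller set. The sketch below assumes that definition and character n-grams with a default size of 1; `OverlapSketch` is hypothetical and Akin's own tokenization may differ.

```elixir
defmodule OverlapSketch do
  # Character n-grams of a string, as a set.
  def ngrams(string, n) do
    string
    |> String.graphemes()
    |> Enum.chunk_every(n, 1, :discard)
    |> Enum.map(&Enum.join/1)
    |> MapSet.new()
  end

  # |X ∩ Y| / min(|X|, |Y|), with 0.0 when either set is empty.
  def coefficient(x, y, n \\ 1) do
    a = ngrams(x, n)
    b = ngrams(y, n)
    smaller = min(MapSet.size(a), MapSet.size(b))

    if smaller == 0 do
      0.0
    else
      MapSet.size(MapSet.intersection(a, b)) / smaller
    end
  end
end

IO.inspect(OverlapSketch.coefficient("abc", "abcd"))
# prints 1.0
```

Because the denominator is the smaller set, a short string fully contained in a longer one scores 1.0.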

lib/akin.ex

Lines changed: 2 additions & 0 deletions
@@ -72,6 +72,8 @@ defmodule Akin do
   """
   def match_names(left, rights, opts \\ default_opts())

+  def match_names(_, [], _), do: []
+
   def match_names(left, rights, opts) when is_binary(left) and is_list(rights) do
     rights = Enum.map(rights, fn right -> compose(right) end)
     match_names(compose(left), rights, opts)
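
The added clause short-circuits when there is nothing to compare against. The pattern of a bodiless head carrying the default, followed by an empty-list clause, can be sketched in isolation. `EmptyListSketch.match_all/3` is a hypothetical stand-in whose case-insensitive equality check is only a placeholder for real matching.

```elixir
defmodule EmptyListSketch do
  # A bodiless head declares the default once for every clause.
  def match_all(left, rights, opts \\ [])

  # No candidates: return early instead of doing any work.
  def match_all(_, [], _), do: []

  # Placeholder comparison: keep candidates equal to `left`, ignoring case.
  def match_all(left, rights, _opts) when is_binary(left) and is_list(rights) do
    Enum.filter(rights, fn right ->
      String.downcase(right) == String.downcase(left)
    end)
  end
end

IO.inspect(EmptyListSketch.match_all("alice", []))
# prints []
```

Clause order matters here: the empty-list clause has to come before the general one, otherwise it would never be reached.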

lib/akin/algorithms/helpers/initials_comparison.ex

Lines changed: 35 additions & 15 deletions
@@ -5,17 +5,9 @@ defmodule Akin.Helpers.InitialsComparison do
   import Akin.Util, only: [ngram_tokenize: 2]
   alias Akin.Corpus

-  # the mean bag distance from training is 0.71
-  @min_bag_distance 0.5
-
   def similarity(%Corpus{} = left, %Corpus{} = right) do
-    similarity(left, right, String.bag_distance(left.string, right.string) >= @min_bag_distance)
-  end
-
-  # do the initial letters of each string match?
-  def similarity(left, right, true) do
-    left_initials = initials(left)
-    right_initials = initials(right)
+    left_initials = initials(left) |> Enum.sort()
+    right_initials = initials(right) |> Enum.sort()

     left_i_count = Enum.count(left_initials)
     right_i_count = Enum.count(right_initials)
@@ -30,10 +22,34 @@ defmodule Akin.Helpers.InitialsComparison do
       |> List.flatten()
       |> Enum.uniq()

-    case {left_i_count, right_i_count} do
-      {li, ri} when li == ri -> left_initials == right_initials
-      {li, ri} when li > ri -> left_initials -- right_initials == []
-      {li, ri} when li < ri -> right_initials -- left_initials == []
+    if String.contains?(left.original, ["-", "'"]) or String.contains?(right.original, ["-", "'"]) do
+      case {left_i_count, right_i_count} do
+        {li, ri} when li == ri -> left_initials == right_initials
+        {li, ri} when li > ri ->
+          case left_initials -- right_initials do
+            [] -> true
+            [_i] ->
+              combined_hyphenation = right.list -- left.list
+              full_permutations = get_permuations(left.list)
+              combined_hyphenation -- full_permutations == []
+            _ -> false
+          end
+        {li, ri} when li < ri ->
+          case right_initials -- left_initials do
+            [] -> true
+            [_i] ->
+              combined_hyphenation = left.list -- right.list
+              full_permutations = get_permuations(right.list)
+              combined_hyphenation -- full_permutations == []
+            _ -> false
+          end
+      end
+    else
+      case {left_i_count, right_i_count} do
+        {li, ri} when li == ri -> left_initials == right_initials
+        {li, ri} when li > ri -> left_initials -- right_initials == []
+        {li, ri} when li < ri -> right_initials -- left_initials == []
+      end
     end
     |> cartesian_match(left_c_intials, right_c_intials)
     |> permutation_match(left.list, right.list)
@@ -45,6 +61,10 @@ defmodule Akin.Helpers.InitialsComparison do
     Enum.map(lists, fn list -> String.at(list, 0) end)
   end

+  defp initials(list) when is_list(list) do
+    Enum.map(list, fn l -> String.at(l, 0) end)
+  end
+
   defp initials(_), do: []

   defp actual_initials(list) do
@@ -68,7 +88,7 @@ defmodule Akin.Helpers.InitialsComparison do
   defp cartesian_match(false, left, right) do
     Enum.filter(left, fn l -> l in right end)
     |> Enum.count()
-    |> Kernel.>(0)
+    |> Kernel.>(1)
   end

   defp permutation_match(true, _, _), do: true
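
The initials comparison in this file leans on two standard-library behaviours: sorting makes the equality check independent of name order, and `--` removes one occurrence per element of its right-hand list. The sketch below isolates both. `InitialsSketch` and its sample names are hypothetical and take plain lists of name parts rather than `%Corpus{}` structs.

```elixir
defmodule InitialsSketch do
  # First grapheme of each name part, sorted so order does not matter.
  def initials(parts) do
    parts
    |> Enum.map(&String.at(&1, 0))
    |> Enum.sort()
  end

  # Initials of `a` that have no counterpart among the initials of `b`.
  def leftover(a, b), do: initials(a) -- initials(b)
end

long = ["anna", "maria", "smith"]
short = ["smith", "a"]

IO.inspect(InitialsSketch.leftover(long, short))
# prints ["m"]
IO.inspect(InitialsSketch.leftover(short, long))
# prints []
```

Since `--` removes at most one element per entry on its right, subtracting a shorter list from a longer one always leaves something behind. That is why the one-leftover `[_i]` branch is the interesting case when one name has more parts than the other.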

lib/akin/algorithms/names.ex

Lines changed: 2 additions & 1 deletion
@@ -10,7 +10,6 @@ defmodule Akin.Names do
   @weight 0.05
   @shortness_boost 0.0175

-  @spec compare(binary() | %Corpus{}, binary() | %Corpus{}, keyword()) :: float()
   @doc """
   Manage the steps of comparing two names. Collect metrics from the algorithms requested
   in the options or the default algorithms. Give weight to the consideration of initials
@@ -32,10 +31,12 @@ defmodule Akin.Names do
     metrics = Akin.compare(left, right)

     short_length = opts(opts, :short_length)
+    initials_match? = if weight > 0, do: 1.0, else: 0.0

     score =
       calc(metrics, weight, short_length, len(right.string))
       |> Enum.map(fn {k, v} -> {k, r(v)} end)
+      |> Keyword.put(:initials, initials_match?)

     %{scores: score}
   end
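
The change above appends an `:initials` entry to the keyword list of scores. A small sketch of that step in isolation, assuming the scores are a keyword list of floats; `InitialsFlagSketch.with_flag/2` is a hypothetical helper, not a function in Akin.

```elixir
defmodule InitialsFlagSketch do
  # Turn a positive weight into a 1.0 / 0.0 flag and store it under :initials.
  def with_flag(scores, weight) when is_list(scores) do
    flag = if weight > 0, do: 1.0, else: 0.0
    Keyword.put(scores, :initials, flag)
  end
end

IO.inspect(InitialsFlagSketch.with_flag([jaccard: 0.8, levenshtein: 0.75], 0.05))
# prints [initials: 1.0, jaccard: 0.8, levenshtein: 0.75]
```

`Keyword.put/3` drops any existing entry for the key and places the new pair at the front, so the flag appears first in the resulting list.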

lib/akin/task.ex

Lines changed: 2 additions & 2 deletions
@@ -2,7 +2,7 @@ defmodule Akin.Task do
   @moduledoc """
   API for all string comparison modules.
   """
-  @callback compare(%Akin.Corpus{}, %Akin.Corpus{}, Keyword.t(any())) :: number()
-  @callback compare(%Akin.Corpus{}, %Akin.Corpus{}) :: number()
+  @callback compare(%Akin.Corpus{}, %Akin.Corpus{}, Keyword.t(any())) :: number() | map()
+  @callback compare(%Akin.Corpus{}, %Akin.Corpus{}) :: number() | map()
   @optional_callbacks compare: 2
 end
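
Widening the callback return type to `number() | map()` lets one behaviour cover both plain scores and structured results. The following sketch shows the idea with strings instead of `%Akin.Corpus{}` structs. `ComparerSketch`, `ExactSketch`, and `DetailedSketch` are hypothetical modules written for illustration only.

```elixir
defmodule ComparerSketch do
  # Same shape as the behaviour above, with strings in place of corpora.
  @callback compare(String.t(), String.t(), keyword()) :: number() | map()
  @callback compare(String.t(), String.t()) :: number() | map()
  @optional_callbacks compare: 2
end

defmodule ExactSketch do
  @behaviour ComparerSketch

  # Returns a bare number.
  @impl true
  def compare(left, right, _opts) do
    if left == right, do: 1.0, else: 0.0
  end
end

defmodule DetailedSketch do
  @behaviour ComparerSketch

  # Returns a map, which the widened return type also permits.
  @impl true
  def compare(left, right, opts) do
    %{scores: [exact: ExactSketch.compare(left, right, opts)]}
  end
end

IO.inspect(DetailedSketch.compare("lee", "lee", []))
# prints %{scores: [exact: 1.0]}
```

Callers of such a behaviour then need to handle both shapes, for example by pattern matching on `%{scores: _}` before treating the result as a number.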

lib/akin/util.ex

Lines changed: 1 addition & 0 deletions
@@ -87,6 +87,7 @@ defmodule Akin.Util do
   end

   defp replace(string) do
+    string = String.replace(string, "'", "")
     Regex.replace(@nontext_codepoints, string, " ")
     |> String.replace(~r/[\p{P}\p{S}]/u, " ")
     |> :unicode.characters_to_nfd_binary()
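
Removing apostrophes before the punctuation pass changes how names such as `o'brien` are tokenized: without the extra step the apostrophe becomes a space and the name splits in two. A minimal sketch of the difference, using only the punctuation regex shown in the diff. `ApostropheSketch` is a hypothetical module and omits the other steps of `replace/1`.

```elixir
defmodule ApostropheSketch do
  # Replace punctuation and symbols with spaces (the existing behaviour).
  def spaced(string) do
    String.replace(string, ~r/[\p{P}\p{S}]/u, " ")
  end

  # Drop apostrophes first so they do not turn into spaces.
  def stripped_first(string) do
    string
    |> String.replace("'", "")
    |> spaced()
  end
end

IO.inspect(ApostropheSketch.spaced("o'brien-smith"))
# prints "o brien smith"
IO.inspect(ApostropheSketch.stripped_first("o'brien-smith"))
# prints "obrien smith"
```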

lib/scripts/ml.ex

Lines changed: 114 additions & 0 deletions
@@ -0,0 +1,114 @@
+defmodule Akin.ML do
+  def training_data() do
+    NimbleCSV.define(CSVParse, separator: ",", escape: "\\")
+    File.rm("test/support/metrics_for_training.csv")
+
+    File.stream!("test/support/dblp_for_training.csv")
+    |> Stream.map(&String.trim(&1))
+    |> Enum.to_list()
+    |> Enum.each(fn row ->
+      [left, right, match] = String.split(row, "\t")
+
+      case Akin.match_names_metrics(left, [right]) do
+        [%{left: _, right: _, metrics: scores, match: _}] ->
+          # names = l <> " <- (" <> to_string(m) <> ") -> " <> r
+          match = if match == "1", do: "match", else: "non-match"
+          scores = Enum.into(scores, %{})
+
+          data =
+            [
+              [
+                scores.bag_distance,
+                scores.substring_set,
+                scores.sorensen_dice,
+                scores.metaphone,
+                scores.double_metaphone,
+                scores.substring_double_metaphone,
+                scores.jaccard,
+                scores.jaro_winkler,
+                scores.levenshtein,
+                scores.ngram,
+                scores.overlap,
+                scores.substring_sort,
+                scores.tversky,
+                match
+              ]
+            ]
+            |> CSVParse.dump_to_iodata()
+
+          File.write!("test/support/metrics_for_training.csv", [data], [:append])
+
+        _ ->
+          nil
+      end
+    end)
+  end
+
+  def tangram_data() do
+    NimbleCSV.define(CSVParse, separator: "\t")
+    File.rm("test/support/metrics_for_predicting.csv")
+
+    # File.stream!("test/support/orcid_for_predicting.csv")
+    File.stream!("test/support/orcid/predict_b.csv")
+    |> Stream.map(&String.trim(&1))
+    |> Enum.to_list()
+    |> Enum.reduce(:ok, fn row, acc ->
+      # Phase 4 prediction data for tangram
+      # [a, b, c, d] = String.split(row, "\t")
+      [_, _, _, a, _, b, _, _, _, c, d] = String.split(row, "\t")
+      b = String.replace(b, "|", ", ")
+      c = String.replace(c, "_", " ")
+      d = String.replace(d, "_", " ")
+
+      Akin.match_names_metrics(b, [a, c, d])
+      |> Enum.each(fn %{left: l, right: r, metrics: s, match: m} ->
+        names = l <> " <- (" <> to_string(m) <> ") -> " <> r
+        scores = Enum.into(s, %{})
+        match = "match"
+
+        IO.inspect scores
+
+        data =
+          [
+            [
+              scores.bag_distance,
+              scores.substring_set,
+              scores.sorensen_dice,
+              scores.metaphone,
+              scores.double_metaphone,
+              scores.substring_double_metaphone,
+              scores.jaccard,
+              scores.jaro_winkler,
+              scores.levenshtein,
+              scores.ngram,
+              scores.overlap,
+              scores.substring_sort,
+              scores.tversky,
+              scores.
+              match
+              # scores.bag_distance,
+              # scores.substring_set,
+              # scores.sorensen_dice,
+              # scores.metaphone,
+              # scores.double_metaphone,
+              # scores.substring_double_metaphone,
+              # scores.jaccard,
+              # scores.jaro_winkler,
+              # scores.levenshtein,
+              # scores.ngram,
+              # scores.overlap,
+              # scores.substring_sort,
+              # scores.tversky,
+              # names,
+              # match
+            ]
+          ]
+          |> CSVParse.dump_to_iodata()
+
+        File.write!("test/support/orcid_for_training.csv", [data], [:append])
+      end)
+
+      acc
+    end)
+  end
+end
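
`training_data/0` destructures every row with `[left, right, match] = String.split(row, "\t")`, which raises a `MatchError` on any row that does not have exactly three tab-separated fields. A hedged sketch of a more forgiving row parser follows, using only the standard library. `LabelSketch.parse_row/1` is a hypothetical helper and does not reproduce the NimbleCSV output step.

```elixir
defmodule LabelSketch do
  # Parse one "left<TAB>right<TAB>flag" row into a labelled tuple.
  # Rows with any other number of fields return :error instead of raising.
  def parse_row(row) do
    case String.split(String.trim(row), "\t") do
      [left, right, flag] -> {:ok, {left, right, label(flag)}}
      _ -> :error
    end
  end

  # Same labelling rule as the script: "1" means match, anything else does not.
  defp label("1"), do: "match"
  defp label(_), do: "non-match"
end

IO.inspect(LabelSketch.parse_row("a. lee\tanna lee\t1\n"))
# prints {:ok, {"a. lee", "anna lee", "match"}}
```

A caller could then filter with `Enum.flat_map/2` or count the `:error` results to see how many input rows were skipped.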

test/algorithms/chunk_set_test.exs

Lines changed: 2 additions & 2 deletions
@@ -50,10 +50,10 @@ defmodule SubstringSetTest do

   test "returns expected float value for comparing string of extreme length difference" do
     left = "alice in wonderland"
-    right = "alice's adventures through the looking glass"
+    right = "alice's adventures in wonderland"

     normal = normal(left, right)
-    assert normal == 0.79
+    assert normal == 0.83
     assert normal < weak(left, right)
   end
