Commit 9d9754d

Maarten Grootendorst authored v0.4.0 (#35)

* Add spacy, use, sbert, gensim
* Add fit, transform, and fit_transform
* Add options to save and load model

1 parent 241d7d3 · commit 9d9754d

File tree

27 files changed: +1232 −79 lines changed


.github/workflows/testing.yml

Lines changed: 1 addition & 1 deletion
@@ -25,6 +25,6 @@ jobs:
   - name: Install dependencies
     run: |
       python -m pip install --upgrade pip
-      pip install -e ".[dev]"
+      pip install -e ".[dev, sbert]"
   - name: Run Checking Mechanisms
     run: make check

.gitignore

Lines changed: 3 additions & 0 deletions
@@ -75,3 +75,6 @@ venv.bak/
 
 .idea
 .idea/
+
+# For quick testing
+/Untitled.ipynb

README.md

Lines changed: 47 additions & 14 deletions
@@ -22,24 +22,21 @@ You can install **`PolyFuzz`** via pip:
 pip install polyfuzz
 ```
 
-This will install the base dependencies. If you want to speed
-up the cosine similarity comparison and decrease memory usage,
-you can use `sparse_dot_topn` which is installed via:
+You may want to install more, depending on the transformers and language backends that you will be using. The possible installations are:
 
-```bash
-pip install polyfuzz[fast]
-```
-
-If you want to be making use of 🤗 Transformers, install the additional additional `Flair` dependency:
-
-```bash
-pip install polyfuzz[flair]
+```bash
+pip install polyfuzz[sbert]
+pip install polyfuzz[flair]
+pip install polyfuzz[gensim]
+pip install polyfuzz[spacy]
+pip install polyfuzz[use]
 ```
 
-To install all the additional dependencies:
+If you want to speed up the cosine similarity comparison and decrease memory usage when using embedding models,
+you can use `sparse_dot_topn`, which is installed via:
 
 ```bash
-pip install polyfuzz[all]
+pip install polyfuzz[fast]
 ```
 
 <details>
@@ -103,6 +100,42 @@ The resulting matches can be accessed through `model.get_matches()`:
 **NOTE 2**: When instantiating `PolyFuzz` we also could have used "EditDistance" or "Embeddings" to quickly
 access Levenshtein and FastText (English) respectively.
 
+### Production
+The `.match` function allows you to quickly extract similar strings. However, after selecting the right models to be used, you may want to use PolyFuzz
+in production to match incoming strings. To do so, we can make use of the familiar `fit`, `transform`, and `fit_transform` functions.
+
+Let's say that we have a list of words that we know to be correct, called `train_words`. We want any incoming word to be mapped to one of the words in `train_words`.
+In other words, we `fit` on `train_words` and use `transform` on any incoming words:
+
+```python
+from polyfuzz import PolyFuzz
+
+train_words = ["apple", "apples", "appl", "recal", "house", "similarity"]
+unseen_words = ["apple", "apples", "mouse"]
+
+# Fit
+model = PolyFuzz("TF-IDF")
+model.fit(train_words)
+
+# Transform
+results = model.transform(unseen_words)
+```
+
+In the above example, we use `fit` on `train_words` to calculate the TF-IDF representations of those words, which are saved to be used again in `transform`.
+This speeds up `transform` quite a bit, since all TF-IDF representations are stored when applying `fit`.
+
+Then, we save and load the model as follows to be used in production:
+
+```python
+# Save the model
+model.save("my_model")
+
+# Load the model
+loaded_model = PolyFuzz.load("my_model")
+```
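The caching idea behind `fit` and `transform` can be sketched without PolyFuzz itself. Below is a minimal, dependency-free illustration in which character trigrams with Jaccard similarity stand in for the TF-IDF representations (the `CachedMatcher` class and `trigrams` helper are hypothetical, not part of PolyFuzz):

```python
# Minimal sketch of the fit/transform pattern: representations computed in
# `fit` are cached and reused in `transform`. Character-trigram Jaccard
# similarity is a stand-in for PolyFuzz's TF-IDF model, not its actual code.

def trigrams(word):
    """Return the set of character trigrams of a padded word."""
    padded = f"  {word} "
    return {padded[i:i + 3] for i in range(len(padded) - 2)}

class CachedMatcher:
    def __init__(self):
        self.train_words = None
        self.train_grams = None  # cached representations

    def fit(self, words):
        self.train_words = list(words)
        self.train_grams = [trigrams(w) for w in words]  # computed only once
        return self

    def transform(self, unseen_words):
        """Map each unseen word to its best match among the fitted words."""
        results = {}
        for word in unseen_words:
            grams = trigrams(word)
            scores = [len(grams & t) / len(grams | t) for t in self.train_grams]
            best = max(range(len(scores)), key=scores.__getitem__)
            results[word] = self.train_words[best]
        return results

matcher = CachedMatcher().fit(["apple", "apples", "appl", "recal", "house", "similarity"])
print(matcher.transform(["apple", "apples", "mouse"]))
```

The point is only that `fit` computes the representations of the training words once, so `transform` can re-use them for every batch of incoming words.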
+
 ### Group Matches
 We can group the matches in `To` as there might be significant overlap between strings in our `to_list`.
 To do this, we calculate the similarity within strings in `to_list` and use `single linkage` to then
@@ -214,7 +247,7 @@ from polyfuzz.models import BaseMatcher
 
 
 class MyModel(BaseMatcher):
-    def match(self, from_list, to_list):
+    def match(self, from_list, to_list, **kwargs):
         # Calculate distances
         matches = [[fuzz.ratio(from_string, to_string) / 100 for to_string in to_list]
                    for from_string in from_list]
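The grouping step relies on single linkage over pairwise similarities. A rough stdlib-only sketch of that idea, with `difflib.SequenceMatcher` as a hypothetical stand-in for PolyFuzz's similarity models and union-find implementing the linkage:

```python
from difflib import SequenceMatcher

# Sketch of single-linkage grouping: any pair of strings whose similarity
# exceeds a threshold is linked, and all transitively linked strings end up
# in one group. SequenceMatcher.ratio is a stand-in for PolyFuzz's models.

def group_strings(strings, min_similarity=0.75):
    parent = list(range(len(strings)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    # Single linkage: one sufficiently similar pair merges two groups
    for i in range(len(strings)):
        for j in range(i + 1, len(strings)):
            if SequenceMatcher(None, strings[i], strings[j]).ratio() >= min_similarity:
                parent[find(j)] = find(i)

    groups = {}
    for i, s in enumerate(strings):
        groups.setdefault(find(i), []).append(s)
    return list(groups.values())

print(group_strings(["apple", "apples", "appl", "mouse", "house"]))
```

Because linkage is "single", one strong pairwise link is enough to pull two otherwise dissimilar clusters together, which is why the minimum-similarity threshold matters.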

docs/api/models/gensim.md

Lines changed: 3 additions & 0 deletions
@@ -0,0 +1,3 @@
+# `polyfuzz.models.GensimEmbeddings`
+
+::: polyfuzz.models.GensimEmbeddings

docs/api/models/sbert.md

Lines changed: 3 additions & 0 deletions
@@ -0,0 +1,3 @@
+# `polyfuzz.models.SentenceEmbeddings`
+
+::: polyfuzz.models.SentenceEmbeddings

docs/api/models/spacy.md

Lines changed: 3 additions & 0 deletions
@@ -0,0 +1,3 @@
+# `polyfuzz.models.SpacyEmbeddings`
+
+::: polyfuzz.models.SpacyEmbeddings

docs/api/models/use.md

Lines changed: 3 additions & 0 deletions
@@ -0,0 +1,3 @@
+# `polyfuzz.models.USEEmbeddings`
+
+::: polyfuzz.models.USEEmbeddings

docs/index.md

Lines changed: 1 addition & 17 deletions
@@ -8,20 +8,4 @@ Currently, methods include Levenshtein distance with RapidFuzz, a character-based
 techniques such as FastText and GloVe, and 🤗 transformers embeddings.
 
 The philosophy of PolyFuzz is: `Easy to use yet highly customizable`. It is a string matcher tool that requires only
-a few lines of code but that allows you customize and create your own models.
-
-
-## Installation
-You can install **`PolyFuzz`** via pip:
-
-```
-pip install polyfuzz
-```
-
-This will install the base dependencies and excludes any deep learning/embedding models.
-
-If you want to be making use of 🤗 Transformers, install the additional additional `Flair` dependency:
-
-```
-pip install polyfuzz[flair]
-```
+a few lines of code but that allows you to customize and create your own models.

docs/releases.md

Lines changed: 84 additions & 8 deletions
@@ -1,8 +1,83 @@
-v0.3.4
+## **v0.4.0**
+
+
+* Added new models (SentenceTransformers, Gensim, USE, Spacy)
+* Added `.fit`, `.transform`, and `.fit_transform` methods
+* Added `.save` and `PolyFuzz.load()`
+
+
+**SentenceTransformers**
+```python
+from polyfuzz import PolyFuzz
+from polyfuzz.models import SentenceEmbeddings
+distance_model = SentenceEmbeddings("all-MiniLM-L6-v2")
+model = PolyFuzz(distance_model)
+```
+
+**Gensim**
+```python
+from polyfuzz import PolyFuzz
+from polyfuzz.models import GensimEmbeddings
+distance_model = GensimEmbeddings("glove-twitter-25")
+model = PolyFuzz(distance_model)
+```
+
+**USE**
+```python
+from polyfuzz import PolyFuzz
+from polyfuzz.models import USEEmbeddings
+distance_model = USEEmbeddings("https://tfhub.dev/google/universal-sentence-encoder/4")
+model = PolyFuzz(distance_model)
+```
+
+**Spacy**
+```python
+from polyfuzz import PolyFuzz
+from polyfuzz.models import SpacyEmbeddings
+distance_model = SpacyEmbeddings("en_core_web_md")
+model = PolyFuzz(distance_model)
+```
+
+
+**fit, transform, fit_transform**
+Add `fit`, `transform`, and `fit_transform` in order to use PolyFuzz in production [#34](https://github.com/MaartenGr/PolyFuzz/issues/34)
+
+```python
+from polyfuzz import PolyFuzz
+
+train_words = ["apple", "apples", "appl", "recal", "house", "similarity"]
+unseen_words = ["apple", "apples", "mouse"]
+
+# Fit
+model = PolyFuzz("TF-IDF")
+model.fit(train_words)
+
+# Transform
+results = model.transform(unseen_words)
+```
+
+In the code above, we fit our TF-IDF model on `train_words` and use `.transform()` to match the words in `unseen_words` to the words that we trained on in `train_words`.
+
+After fitting our model, we can save it as follows:
+
+```python
+model.save("my_model")
+```
+
+Then, we can load our model to be used elsewhere:
+
+```python
+from polyfuzz import PolyFuzz
+
+model = PolyFuzz.load("my_model")
+```
+
+
+## **v0.3.4**
+
 - Make sure that when you use two lists that are exactly the same, it will return 1 for identical terms:
 
 ```python
 from polyfuzz import PolyFuzz
+
 from_list = ["apple", "house"]
 model = PolyFuzz("TF-IDF")
 model.match(from_list, from_list)
@@ -14,6 +89,7 @@ mapping to itself:
 
 ```python
 from polyfuzz import PolyFuzz
+
 from_list = ["apple", "apples"]
 model = PolyFuzz("TF-IDF")
 model.match(from_list)
@@ -22,32 +98,32 @@ model.match(from_list)
 In the example above, `apple` will be mapped to `apples` and not to `apple`. Here, we assume that the user wants to
 find the most similar words within a list without mapping to itself.
 
-v0.3.3
+## **v0.3.3**
 - Update numpy to "numpy>=1.20.0" to prevent [this](https://github.com/MaartenGr/PolyFuzz/issues/23) and this [issue](https://github.com/MaartenGr/PolyFuzz/issues/21)
 - Update pytorch to "torch>=1.4.0,<1.7.1" to prevent save_state_warning error
 
-v0.3.2
+## **v0.3.2**
 - Fix exploding memory usage when using `top_n`
 
-v0.3.0
+## **v0.3.0**
 - Use `top_n` in `polyfuzz.models.TFIDF` and `polyfuzz.models.Embeddings`
 
-v0.2.2
+## **v0.2.2**
 - Update grouping to include all strings only if identical lists of strings are compared
 
-v0.2.0
+## **v0.2.0**
 - Update naming convention matcher --> model
 - Update documentation
 - Add basic models to grouper
 - Fix issues with vector order in cosine similarity
 - Update naming of cosine similarity function
 
-v0.1.0
+## **v0.1.0**
 - Additional tests
 - More thorough documentation
 - Prepare for public release
 
-v0.0.1
+## **v0.0.1**
 - First release of `PolyFuzz`
 - Matching through:
 - Edit Distance
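The save/load pattern itself can be illustrated with nothing but the standard library; the sketch below persists a fitted object with `pickle` (the `TinyModel` class is hypothetical, and this is not PolyFuzz's actual serialization code):

```python
import pickle
import tempfile
from pathlib import Path

# Sketch of the save/load pattern: persist a fitted object so production
# code can load it without refitting. This mirrors the idea behind
# `model.save(...)` / `PolyFuzz.load(...)`, not their actual implementation.

class TinyModel:
    def __init__(self):
        self.vocabulary = None

    def fit(self, words):
        self.vocabulary = sorted(set(words))  # "trained" state to persist
        return self

    def save(self, path):
        with open(path, "wb") as f:
            pickle.dump(self, f)

    @classmethod
    def load(cls, path):
        with open(path, "rb") as f:
            return pickle.load(f)

path = Path(tempfile.mkdtemp()) / "my_model"
TinyModel().fit(["apple", "apples", "house"]).save(path)
loaded_model = TinyModel.load(path)
print(loaded_model.vocabulary)
```

The loaded object carries its fitted state, so no refitting is needed in the loading process.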

docs/tutorial/basematcher/basematcher.md

Lines changed: 66 additions & 1 deletion
@@ -9,6 +9,7 @@ You simply create a class using `BaseMatcher`, make sure it has a function `match`
 two lists and outputs a pandas dataframe. That's it!
 
 We start by creating our own model that implements the ratio similarity measure from RapidFuzz:
+
 ```python
 import numpy as np
 import pandas as pd
@@ -19,7 +20,7 @@ from polyfuzz.models import BaseMatcher
 
 
 class MyModel(BaseMatcher):
-    def match(self, from_list, to_list):
+    def match(self, from_list, to_list, **kwargs):
         # Calculate distances
         matches = [[fuzz.ratio(from_string, to_string) / 100
                     for to_string in to_list] for from_string in from_list]
@@ -53,3 +54,67 @@ model.visualize_precision_recall(kde=True)
 ```
 
 ![](custom_model.png)
+
+
+## fit, transform, fit_transform
+
+Although the above model can be used in production through `fit`, it does not track its state between `fit` and `transform`.
+That is not necessary here, since edit distances should be recalculated each time, but if you have embeddings that you do not
+want to re-calculate, it is helpful to track the state between `fit` and `transform` so that the embeddings do not need
+to be re-calculated. To do so, we can use the `re_train` parameter to define what happens when we re-train a model (for example when using `fit`)
+and what happens when we do not (for example when using `transform`).
+
+In the example below, when we set `re_train=True` we calculate the embeddings from both the `from_list` and the `to_list` if they are defined,
+and save the resulting embeddings to the `self.embeddings_to` variable. Then, when we set `re_train=False`, we can skip the re-calculation by leveraging
+the pre-calculated `self.embeddings_to` variable.
+
+```python
+import numpy as np
+import pandas as pd
+from sentence_transformers import SentenceTransformer
+
+from polyfuzz.models._utils import cosine_similarity
+from polyfuzz.models import BaseMatcher
+
+
+class SentenceEmbeddings(BaseMatcher):
+    def __init__(self, model_id):
+        super().__init__(model_id)
+        self.type = "Embeddings"
+
+        self.embedding_model = SentenceTransformer("all-MiniLM-L6-v2")
+        self.embeddings_to = None
+
+    def match(self, from_list, to_list, re_train=True) -> pd.DataFrame:
+        # Extract embeddings from the `from_list`
+        embeddings_from = self.embedding_model.encode(from_list, show_progress_bar=False)
+
+        # Re-use the pre-calculated embeddings when we do not re-train
+        if not re_train and isinstance(self.embeddings_to, np.ndarray):
+            embeddings_to = self.embeddings_to
+        # Otherwise, extract embeddings from the `to_list` if it exists
+        elif to_list is None:
+            embeddings_to = self.embedding_model.encode(from_list, show_progress_bar=False)
+        else:
+            embeddings_to = self.embedding_model.encode(to_list, show_progress_bar=False)
+
+        # Extract matches
+        matches = cosine_similarity(embeddings_from, embeddings_to, from_list, to_list)
+
+        self.embeddings_to = embeddings_to
+
+        return matches
+```
+
+Then, we can use it as follows:
+
+```python
+from polyfuzz import PolyFuzz
+
+from_list = ["apple", "apples", "appl", "recal", "house", "similarity"]
+to_list = ["apple", "apples", "mouse"]
+
+custom_matcher = SentenceEmbeddings("custom_sbert")
+
+model = PolyFuzz(custom_matcher).fit(from_list)
+```
+
+By using the `.fit` function, embeddings are created from the `from_list` variable and saved. Then, when we
+run `model.transform(to_list)`, the embeddings created from the `from_list` variable do not need to be recalculated.
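The `re_train` state-tracking pattern can also be sketched without any embedding dependencies. In the illustration below, character-bigram sets play the role of expensive embeddings and are cached on the first call, so later calls with `re_train=False` skip the re-calculation (the `StatefulMatcher` class and `embed` helper are hypothetical, not PolyFuzz internals):

```python
# Sketch of the `re_train` state-tracking pattern: an expensive
# representation of `to_list` is computed once and cached, then reused
# when re_train=False. Character bigrams stand in for real embeddings.

def embed(words):
    """A stand-in for an expensive embedding step."""
    return [{w[i:i + 2] for i in range(len(w) - 1)} for w in words]

class StatefulMatcher:
    def __init__(self):
        self.to_list = None
        self.embeddings_to = None  # cached between fit and transform

    def match(self, from_list, to_list=None, re_train=True):
        embeddings_from = embed(from_list)

        if not re_train and self.embeddings_to is not None:
            embeddings_to = self.embeddings_to  # reuse cached state
        else:
            self.to_list = to_list if to_list is not None else from_list
            embeddings_to = embed(self.to_list)
        self.embeddings_to = embeddings_to

        # Map each word to the most similar cached word (Jaccard similarity)
        results = {}
        for word, grams in zip(from_list, embeddings_from):
            scores = [len(grams & t) / max(len(grams | t), 1) for t in embeddings_to]
            results[word] = self.to_list[max(range(len(scores)), key=scores.__getitem__)]
        return results

matcher = StatefulMatcher()
matcher.match(["apple", "apples", "appl"], re_train=True)  # fit: cache representations
print(matcher.match(["mouse"], re_train=False))            # transform: reuse cache
```

The first call plays the role of `fit` and populates `self.embeddings_to`; every later call with `re_train=False` matches against the cache instead of re-embedding.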
