README.md: 47 additions & 14 deletions
@@ -22,24 +22,21 @@ You can install **`PolyFuzz`** via pip:
22
22
pip install polyfuzz
23
23
```
24
24
25
-
This will install the base dependencies. If you want to speed
26
-
up the cosine similarity comparison and decrease memory usage,
27
-
you can use `sparse_dot_topn` which is installed via:
25
+
You may want to install more depending on the transformers and language backends that you will be using. The possible installations are:
28
26
29
-
```bash
30
-
pip install polyfuzz[fast]
31
-
```
32
-
33
-
If you want to make use of 🤗 Transformers, install the additional `Flair` dependency:
34
-
35
-
```bash
36
-
pip install polyfuzz[flair]
27
+
```bash
28
+
pip install polyfuzz[sbert]
29
+
pip install polyfuzz[flair]
30
+
pip install polyfuzz[gensim]
31
+
pip install polyfuzz[spacy]
32
+
pip install polyfuzz[use]
37
33
```
38
34
39
-
To install all the additional dependencies:
35
+
If you want to speed up the cosine similarity comparison and decrease memory usage when using embedding models,
36
+
you can use `sparse_dot_topn` which is installed via:
40
37
41
38
```bash
42
-
pip install polyfuzz[all]
39
+
pip install polyfuzz[fast]
43
40
```
44
41
45
42
<details>
@@ -103,6 +100,42 @@ The resulting matches can be accessed through `model.get_matches()`:
103
100
**NOTE 2**: When instantiating `PolyFuzz` we also could have used "EditDistance" or "Embeddings" to quickly
104
101
access Levenshtein and FastText (English) respectively.
105
102
103
+
### Production
104
+
The `.match` function allows you to quickly extract similar strings. However, after selecting the right models to be used, you may want to use PolyFuzz
105
+
in production to match incoming strings. To do so, we can make use of the familiar `fit`, `transform`, and `fit_transform` functions.
106
+
107
+
Let's say that we have a list of words that we know to be correct called `train_words`. We want any incoming word to be mapped to one of the words in `train_words`.
108
+
In other words, we `fit` on `train_words` and we use `transform` on any incoming words:
109
+
110
+
```python
111
+
from sklearn.datasets import fetch_20newsgroups
112
+
from sklearn.feature_extraction.text import CountVectorizer
In the above example, we use `fit` on `train_words` to calculate the TF-IDF representations of those words, which are saved so that they can be reused in `transform`.
127
+
This speeds up `transform` quite a bit since all TF-IDF representations are stored when applying `fit`.
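
The fit-once, transform-many idea described here can be illustrated with a small pure-Python sketch. It is not PolyFuzz's implementation: `TinyTfidfMatcher`, `char_ngrams`, and the example word lists are hypothetical names and values chosen for illustration, and the character 3-gram TF-IDF scheme is an assumption about one reasonable way to do this.

```python
import math
from collections import Counter


def char_ngrams(word, n=3):
    """Split a word into overlapping character n-grams (padded with spaces)."""
    padded = f" {word} "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


class TinyTfidfMatcher:
    """Hypothetical stand-in for a fit/transform matcher (not PolyFuzz itself)."""

    def fit(self, train_words):
        self.train_words = list(train_words)
        grams = [Counter(char_ngrams(w)) for w in self.train_words]
        # Document frequency of every n-gram across the training words
        df = Counter(g for counts in grams for g in counts)
        n_docs = len(self.train_words)
        self.idf = {g: math.log((1 + n_docs) / (1 + d)) + 1 for g, d in df.items()}
        # Store the normalised training vectors once so transform can reuse them
        self.train_vectors = [self._vectorize(c) for c in grams]
        return self

    def _vectorize(self, counts):
        # n-grams that were not seen during fit are ignored
        vec = {g: tf * self.idf[g] for g, tf in counts.items() if g in self.idf}
        norm = math.sqrt(sum(v * v for v in vec.values()))
        return {g: v / norm for g, v in vec.items()} if norm else {}

    def transform(self, unseen_words):
        results = []
        for word in unseen_words:
            vec = self._vectorize(Counter(char_ngrams(word)))
            # Cosine similarity against every stored training vector
            scores = [sum(v * tv.get(g, 0.0) for g, v in vec.items())
                      for tv in self.train_vectors]
            best = max(range(len(scores)), key=scores.__getitem__)
            results.append((word, self.train_words[best], round(scores[best], 3)))
        return results


# Made-up example data
train_words = ["apple", "apples", "appl", "recal", "house", "similarity"]
unseen_words = ["apple", "apples", "mouse"]

model = TinyTfidfMatcher().fit(train_words)
matches = model.transform(unseen_words)
for source, target, score in matches:
    print(source, "->", target, score)
```

Only `transform` runs per incoming batch; the IDF weights and training vectors are computed a single time in `fit`, which is where the speed-up comes from.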
128
+
129
+
Then, we can save and load the model as follows for use in production:
130
+
131
+
```python
132
+
# Save the model
133
+
model.save("my_model")
134
+
135
+
# Load the model
136
+
loaded_model = PolyFuzz.load("my_model")
137
+
```
138
+
106
139
### Group Matches
107
140
We can group the matches `To` as there might be significant overlap in strings in our to_list.
108
141
To do this, we calculate the similarity within strings in to_list and use `single linkage` to then
@@ -214,7 +247,7 @@ from polyfuzz.models import BaseMatcher
214
247
215
248
216
249
class MyModel(BaseMatcher):
217
-
    def match(self, from_list, to_list):
250
+
    def match(self, from_list, to_list, **kwargs):
218
251
        # Calculate distances
219
252
        matches = [[fuzz.ratio(from_string, to_string) / 100 for to_string in to_list]
In the code above, we fit our TF-IDF model on `train_words` and use `.transform()` to match the words in `unseen_words` to the words that we trained on in `train_words`.
58
+
59
+
After fitting our model, we can save it as follows:
60
+
61
+
```python
62
+
model.save("my_model")
63
+
```
64
+
65
+
Then, we can load our model to be used elsewhere:
66
+
67
+
```python
68
+
from polyfuzz import PolyFuzz
69
+
70
+
model = PolyFuzz.load("my_model")
71
+
```
72
+
73
+
74
+
## **v0.3.4**
75
+
2
76
- Make sure that when you use two lists that are exactly the same, it will return 1 for identical terms:
3
77
4
78
```python
5
79
from polyfuzz import PolyFuzz
80
+
6
81
from_list = ["apple", "house"]
7
82
model = PolyFuzz("TF-IDF")
8
83
model.match(from_list, from_list)
@@ -14,6 +89,7 @@ mapping to itself:
14
89
15
90
```python
16
91
from polyfuzz import PolyFuzz
92
+
17
93
from_list = ["apple", "apples"]
18
94
model = PolyFuzz("TF-IDF")
19
95
model.match(from_list)
@@ -22,32 +98,32 @@ model.match(from_list)
22
98
In the example above, `apple` will be mapped to `apples` and not to `apple`. Here, we assume that the user wants to
23
99
find the most similar words within a list without mapping to itself.
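
The "no self-mapping" behaviour can be sketched in plain Python as follows. This is an illustration only: `most_similar_excluding_self` is a hypothetical helper, and `difflib.SequenceMatcher` is used as a stand-in similarity measure rather than the TF-IDF model from the example.

```python
from difflib import SequenceMatcher


def most_similar_excluding_self(words):
    """For each word, return the most similar *other* word in the same list."""
    matches = {}
    for i, word in enumerate(words):
        # Compare against every position except the word's own
        candidates = [
            (SequenceMatcher(None, word, other).ratio(), other)
            for j, other in enumerate(words)
            if j != i
        ]
        score, best = max(candidates) if candidates else (0.0, None)
        matches[word] = (best, round(score, 3))
    return matches


from_list = ["apple", "apples"]
result = most_similar_excluding_self(from_list)
print(result)
```

Skipping by position rather than by string value means that a list containing the same word twice would still let the two copies match each other.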
24
100
25
-
v0.3.3
101
+
## **v0.3.3**
26
102
- Update numpy to "numpy>=1.20.0" to prevent [this](https://github.com/MaartenGr/PolyFuzz/issues/23) and this [issue](https://github.com/MaartenGr/PolyFuzz/issues/21)
27
103
- Update pytorch to "torch>=1.4.0,<1.7.1" to prevent save_state_warning error
28
104
29
-
v0.3.2
105
+
## **v0.3.2**
30
106
- Fix exploding memory usage when using `top_n`
31
107
32
-
v0.3.0
108
+
## **v0.3.0**
33
109
- Use `top_n` in `polyfuzz.models.TFIDF` and `polyfuzz.models.Embeddings`
34
110
35
-
v0.2.2
111
+
## **v0.2.2**
36
112
- Update grouping to include all strings only if identical lists of strings are compared
37
113
38
-
v0.2.0
114
+
## **v0.2.0**
39
115
- Update naming convention matcher --> model
40
116
- Update documentation
41
117
- Add basic models to grouper
42
118
- Fix issues with vector order in cosine similarity