Commit b01e426

Maarten Grootendorst authored
Add tutorial datasets (#4)
1 parent 1cd46f4 commit b01e426

6 files changed, +92 -2 lines changed

docs/tutorial/datasets/datasets.md

Lines changed: 43 additions & 0 deletions
@@ -0,0 +1,43 @@
+# Datasets
+There are two datasets prepared for you to play around with:
+* Company Names
+* Movie Titles
+
+## Movie Titles
+This data is retrieved from:
+* https://www.kaggle.com/stefanoleone992/imdb-extensive-dataset
+* https://www.kaggle.com/shivamb/netflix-shows
+
+It contains Netflix and IMDB movie titles that can be matched against each other.
+Where IMDB has 80852 movie titles and Netflix has 6172 movie titles.
+
+You can use them as follows:
+
+```python
+from polyfuzz import PolyFuzz
+from polyfuzz.datasets import load_movie_titles
+
+data = load_movie_titles()
+model = PolyFuzz("TF-IDF").match(data["Netflix"], data["IMDB"])
+```
+
+## Company Names
+This data is retrieved from https://www.kaggle.com/dattapiy/sec-edgar-companies-list?select=sec__edgar_company_info.csv
+and contains 100_000 company names to be matched against each other.
+
+This is a different use case than what you have typically seen so far. We often see two different lists compared
+with each other. Here, you can use this dataset to compare the company names with themselves in order to clean
+them up.
+
+You can use them as follows:
+
+```python
+from polyfuzz import PolyFuzz
+from polyfuzz.datasets import load_company_names
+
+data = load_company_names()
+model = PolyFuzz("TF-IDF").match(data, data)
+```
+
+PolyFuzz will recognize that the lists are similar and that you are looking to match the titles with themselves.
+It will ignore any comparison a string has with itself, otherwise everything will get mapped to itself.
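
The self-matching behaviour described at the end of the new `datasets.md` can be illustrated with a small standalone sketch. This is not PolyFuzz's implementation: it swaps in `difflib.SequenceMatcher` from the standard library for the TF-IDF matcher, and the helper name `best_self_matches` and the sample names are hypothetical. It only shows why the identity comparison has to be skipped when a list is matched against itself.

```python
from difflib import SequenceMatcher
from typing import Dict, List, Tuple


def best_self_matches(names: List[str]) -> Dict[str, Tuple[str, float]]:
    """For each name, find the most similar *other* entry in the same list."""
    results: Dict[str, Tuple[str, float]] = {}
    for i, name in enumerate(names):
        best_name, best_score = "", 0.0
        for j, candidate in enumerate(names):
            if i == j:
                # Skip the comparison of an entry with itself: it would
                # always score 1.0 and hide the real near-duplicates.
                continue
            score = SequenceMatcher(None, name.lower(), candidate.lower()).ratio()
            if score > best_score:
                best_name, best_score = candidate, score
        results[name] = (best_name, best_score)
    return results


# Hypothetical sample names, not taken from the SEC EDGAR dataset.
sample = ["Acme Corp", "Acme Corp.", "Globex Inc", "Globex Incorporated"]
matches = best_self_matches(sample)
for name, (match, score) in matches.items():
    print(f"{name!r} -> {match!r} ({score:.2f})")
```

Without the `i == j` guard every entry would map to itself with a perfect score, which is the failure mode the documentation warns about. The quadratic loop is only suitable for a handful of strings, not for the 100_000 company names in the dataset.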

mkdocs.yml

Lines changed: 1 addition & 0 deletions
@@ -12,6 +12,7 @@ nav:
 - Models: tutorial/models/models.md
 - Custom Models: tutorial/basematcher/basematcher.md
 - Custom Grouper: tutorial/grouper/grouper.md
+- Datasets: tutorial/datasets/datasets.md
 - API:
 - PolyFuzz: api/polyfuzz.md
 - Linkage: api/linkage.md

polyfuzz/__init__.py

Lines changed: 1 addition & 1 deletion
@@ -1,2 +1,2 @@
 from .polyfuzz import PolyFuzz
-__version__ = "0.2.0"
+__version__ = "0.2.1"

polyfuzz/datasets/__init__.py

Lines changed: 6 additions & 0 deletions
@@ -0,0 +1,6 @@
+from ._load_data import load_movie_titles, load_company_names
+
+__all__ = [
+    "load_movie_titles",
+    "load_company_names"
+]

polyfuzz/datasets/_load_data.py

Lines changed: 40 additions & 0 deletions
@@ -0,0 +1,40 @@
+import json
+import requests
+from typing import List, Mapping
+
+
+def load_movie_titles() -> Mapping[str, List[str]]:
+    """ Load Netflix and IMDB movie titles to be matched against each other
+
+    Retrieved from:
+    https://www.kaggle.com/stefanoleone992/imdb-extensive-dataset
+    https://www.kaggle.com/shivamb/netflix-shows
+
+    Preprocessed such that it only contains the title names where
+    IMDB has 80852 titles and Netflix has 6172
+
+    Returns:
+    data: a dictionary with two keys: "Netflix" and "IMDB" where
+    each value contains a list of movie titles
+    """
+    url = 'https://github.com/MaartenGr/PolyFuzz/raw/master/data/movie_titles.json'
+    resp = requests.get(url)
+    data = json.loads(resp.text)
+    return data
+
+
+def load_company_names() -> List[str]:
+    """ Load company names to be matched against each other.
+
+    Retrieved from:
+    https://www.kaggle.com/dattapiy/sec-edgar-companies-list?select=sec__edgar_company_info.csv
+
+    Preprocessed such that it only contains 100_000 company names.
+
+    Returns:
+    data: a list of company names
+    """
+    url = 'https://github.com/MaartenGr/PolyFuzz/raw/master/data/company_names.json'
+    resp = requests.get(url)
+    data = json.loads(resp.text)
+    return data

setup.py

Lines changed: 1 addition & 1 deletion
@@ -37,7 +37,7 @@
 setup(
     name="polyfuzz",
     packages=find_packages(exclude=["notebooks", "docs"]),
-    version="0.2.0",
+    version="0.2.1",
     author="Maarten Grootendorst",
     author_email="maartengrootendorst@gmail.com",
     description="PolyFuzz performs fuzzy string matching, grouping, and evaluation.",
