Notes on the recordlinkage Python library
Marco Fossati edited this page Dec 19, 2018 · 9 revisions
- uses `jellyfish` under the hood for edit distances and phonetic algorithms.
- uses `pandas.DataFrame` to represent datasets. It's basically a table with column headers;
- conversion from a `dict` is easy: key = column header, value = the cells of that column;
- a value is a list, so `defaultdict(list)` is helpful;
```python
import pandas

dataset = pandas.DataFrame(
    {
        'catalog_id': [666, 777, 888],
        'name': ['huey', 'dewey', 'louie'],
        # ...
    }
)
```
- mind the order of values, since rows are built by position, i.e., `666` -> `'huey'`;
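Since every column is a list, rows can be accumulated with `defaultdict(list)` and passed to the constructor in one go. A minimal sketch, assuming a hypothetical `records` list of `(catalog_id, name)` pairs as input:

```python
from collections import defaultdict

import pandas

# Hypothetical input records: (catalog_id, name) pairs.
records = [(666, 'huey'), (777, 'dewey'), (888, 'louie')]

columns = defaultdict(list)
for catalog_id, name in records:
    # Append to every column in the same iteration,
    # so that values stay aligned by position.
    columns['catalog_id'].append(catalog_id)
    columns['name'].append(name)

dataset = pandas.DataFrame(columns)
```

Appending to all columns inside the same loop iteration is what keeps `666` on the same row as `'huey'`.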
- AKA pre-processing AKA normalization AKA standardization;
- https://recordlinkage.readthedocs.io/en/latest/ref-preprocessing.html;
- uses `pandas.Series`, a list-like object;
- the `clean` function seems interesting at first glance;
- by default, it removes text inside brackets. Might be useful, trivial to re-implement;
- terrible default regex, removes everything that is not an ASCII letter! Non-ASCII strings are just deleted! Use a custom regex or `None` in the `replace_by_none=` kwarg to avoid this;
- nice ASCII folding via `strip_accents='ascii'`, not done by default;
- `strip_accents='unicode'` keeps intact some Unicode chars, e.g., `œ`;
- non-latin scripts are just not handled;
- the `phonetic` function has the same problems as in `jellyfish`, see #79.
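The difference between the two `strip_accents` modes can be illustrated with the standard library alone. The sketch below assumes the usual recipe (NFKD decomposition, then either dropping everything that is not ASCII or dropping combining marks only). `fold_ascii` and `fold_unicode` are hypothetical helpers, not part of `recordlinkage`, and may not match its internals exactly:

```python
import unicodedata


def fold_ascii(text):
    # Decompose accented characters, then drop whatever is not ASCII.
    decomposed = unicodedata.normalize('NFKD', text)
    return decomposed.encode('ascii', 'ignore').decode('ascii')


def fold_unicode(text):
    # Decompose accented characters, then drop combining marks only.
    decomposed = unicodedata.normalize('NFKD', text)
    return ''.join(c for c in decomposed if not unicodedata.combining(c))


print(fold_ascii('dvořák œuvre'))    # the 'œ' is lost
print(fold_unicode('dvořák œuvre'))  # the 'œ' survives
```

Characters without a canonical decomposition, such as `œ` or `ø`, have no ASCII base letter to fall back to, so the ASCII variant silently drops them. The same goes for whole non-Latin strings.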
```python
import pandas
from recordlinkage.preprocessing import clean

names = pandas.Series(
    [
        'хартшорн, чарльз',
        'charles hartshorne',
        'チャールズ・ハートショーン',
        'تشارلز هارتشورن',
        '찰스 하츠혼',
        'àáâäæãåāèéêëēėęîïíīįìôöòóœøōõûüùúū'
    ]
)
clean(names)
```
Output:
```
0
1 charles hartshorne
2
3
4
5
dtype: object
```
```python
clean(names, replace_by_none=None, strip_accents='ascii')
```
Output:
```
0 ,
1 charles hartshorne
2
3
4
5 aaaaaaaeeeeeeeiiiiiioooooouuuuu
dtype: object
```
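A cleaner that keeps non-Latin scripts is short to write with plain `pandas` string methods. A minimal sketch, assuming that lowercasing, dropping bracketed text, and removing punctuation is all that is needed. `unicode_clean` is a hypothetical helper, not part of `recordlinkage`:

```python
import pandas


def unicode_clean(series):
    # Lowercase everything.
    series = series.str.lower()
    # Drop text inside round or square brackets.
    series = series.str.replace(r'[\(\[].*?[\)\]]', '', regex=True)
    # In Python 3, \w matches letters and digits of any script,
    # so only punctuation and symbols are replaced here.
    series = series.str.replace(r'[^\w\s]+', ' ', regex=True)
    # Collapse runs of whitespace.
    series = series.str.replace(r'\s+', ' ', regex=True)
    return series.str.strip()


names = pandas.Series(
    [
        'Хартшорн, Чарльз',
        'Charles Hartshorne (philosopher)',
        '찰스 하츠혼',
    ]
)
print(unicode_clean(names).tolist())
```

Note that `\w` does not match combining marks, so scripts that rely on them would need a wider character class.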
TODO
TODO