Notes on the recordlinkage Python library

General

uses jellyfish under the hood for edit distances and phonetic algorithms.

Data format

uses pandas.DataFrame to represent datasets. It's basically a table with column headers;
conversion from a dict is easy: key = column header, value = the cells of that column;
each value is a list with one entry per row, so defaultdict(list) is helpful when building the dict;

import pandas

# each key is a column header, each list holds that column's cells
dataset = pandas.DataFrame(
  {
    'catalog_id': [666, 777, 888],
    'name': ['huey', 'dewey', 'louie'],
    # ...
  }
)

rows are matched by position across the lists, so keep the values in the same order, i.e., 666 -> 'huey';
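
A minimal sketch of the defaultdict(list) idea mentioned above, assuming the input arrives as one dict per record. The `records` list and its contents are made up for illustration; only pandas.DataFrame and collections.defaultdict are real APIs here.

```python
from collections import defaultdict

import pandas

# Hypothetical input: one dict per record (not from the original notes)
records = [
    {'catalog_id': 666, 'name': 'huey'},
    {'catalog_id': 777, 'name': 'dewey'},
    {'catalog_id': 888, 'name': 'louie'},
]

# Accumulate cells column by column; a missing key starts as an empty list
columns = defaultdict(list)
for record in records:
    for header, cell in record.items():
        columns[header].append(cell)

# defaultdict is a dict subclass, so DataFrame accepts it directly
dataset = pandas.DataFrame(columns)
```

Since rows are matched by position, every record should carry the same keys; otherwise the lists end up with different lengths and the DataFrame constructor raises a ValueError.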

Cleaning

AKA pre-processing AKA normalization AKA standardization;
https://recordlinkage.readthedocs.io/en/latest/ref-preprocessing.html;
uses pandas.Series, a list-like object;
the clean function seems interesting at first glance;
by default, it removes text inside brackets. Might be useful, trivial to re-implement;
terrible default regex: it removes every character that is not plain ASCII (letters, digits, spaces, hyphens, underscores), so non-ASCII strings are just deleted! Pass a custom regex or None to the replace_by_none= kwarg to avoid this;
nice ASCII folding via strip_accents='ascii', not done by default;
strip_accents='unicode' keeps intact some Unicode chars, e.g., œ;
non-Latin scripts are just not handled;
the phonetic function has the same problems as in jellyfish, see #79.

import pandas
from recordlinkage.preprocessing import clean

names = pandas.Series(
  [
    'хартшорн, чарльз',
    'charles hartshorne',
    'チャールズ・ハートショーン',
    'تشارلز هارتشورن',
    '찰스 하츠혼',
    'àáâäæãåāèéêëēėęîïíīįìôöòóœøōõûüùúū'
  ]
)

clean(names)

Output:

0
1    charles hartshorne
2
3
4
5
dtype: object
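
The deletion above can be reproduced with the standard library alone. This is only a sketch: the ASCII_ONLY pattern is an assumption about what the default replace_by_none looks like, and UNICODE_AWARE and strip_with are hypothetical names, not part of recordlinkage.

```python
import re

# Assumption: resembles the library's default replace_by_none pattern
ASCII_ONLY = re.compile(r'[^ \-\_A-Za-z0-9]+')

# Hypothetical alternative: for str patterns in Python 3, \w also matches
# non-ASCII letters and digits, so other scripts survive
UNICODE_AWARE = re.compile(r'[^\w\s\-]+')


def strip_with(pattern, text):
    """Delete every match of pattern, then trim surrounding whitespace."""
    return pattern.sub('', text).strip()


samples = ['хартшорн, чарльз', 'charles hartshorne', '찰스 하츠혼']
ascii_cleaned = [strip_with(ASCII_ONLY, s) for s in samples]
unicode_cleaned = [strip_with(UNICODE_AWARE, s) for s in samples]
```

With ASCII_ONLY only the Latin name survives; with UNICODE_AWARE all three names are kept and just the comma is dropped. A pattern along these lines could presumably be passed as replace_by_none=, but that is untested here.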

clean(names, replace_by_none=None, strip_accents='ascii')

Output:

0                                  ,
1                 charles hartshorne
2
3
4
5    aaaaaaaeeeeeeeiiiiiioooooouuuuu
dtype: object
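
The difference between strip_accents='ascii' and strip_accents='unicode' can be illustrated with unicodedata. This is a sketch of one common way to fold accents, not necessarily what recordlinkage does internally; fold_ascii and fold_unicode are hypothetical helper names.

```python
import unicodedata


def fold_ascii(text):
    """Decompose accented characters, then drop everything non-ASCII."""
    decomposed = unicodedata.normalize('NFKD', text)
    return decomposed.encode('ascii', 'ignore').decode('ascii')


def fold_unicode(text):
    """Decompose accented characters, then drop only the combining marks."""
    decomposed = unicodedata.normalize('NFKD', text)
    return ''.join(ch for ch in decomposed if not unicodedata.combining(ch))


sample = 'àáâäæãåāèéêëēėęîïíīįìôöòóœøōõûüùúū'

ascii_folded = fold_ascii(sample)      # -> 'aaaaaaaeeeeeeeiiiiiioooooouuuuu'
unicode_folded = fold_unicode(sample)  # keeps æ, œ and ø
```

Characters such as æ, œ and ø have no decomposition into a base letter plus an accent, so the ASCII variant drops them while the Unicode variant leaves them intact. Non-Latin scripts are wiped by the ASCII variant for the same reason.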

Indexing

TODO

Blocking

TODO
